(9) DATA ANALYSIS
-
data analysis has several steps (in the order
presented)
-
Reduce the complexity of information
-
e.g. in a counseling session, there is a large
volume of data, but it is usually too complex to use for research or evaluation
unless it is systematically organized
-
There is too much information in a program
evaluation to make sense unless the data are organized
-
describe the sample
-
e.g. if you collect information on all the
clients in an agency, you need some systematic way to describe their characteristics--race,
gender, age, level of functioning, etc.
-
if research question was descriptive, answer
the research question
-
the numbers used to describe the sample are
usually called descriptive statistics
-
estimate parameters and/or test
hypotheses
-
both are probabilistic and involve calculation
of error
-
parameter error: confidence interval
-
e.g. M=6.5 (±0.4)
-
"estimated mean is 6.5, between 6.1 and 6.9
at the 95% level of confidence"
-
In 95 out of 100 samples of this size, an interval
built this way will contain the population mean
-
hypothesis error: p-value
-
e.g. M_1=6.5,
M_2=7.4 (t=2.68, p<0.05)
-
If there were no real difference, a difference of 0.9 or more between groups
would occur by chance alone in fewer than 5 out of 100 samples with groups of this
size
-
The numbers associated with sampling error
and hypothesis testing are usually called inferential
statistics
-
Most statistics social workers use in practice
are descriptive
-
to reduce the complexity of the information
gathered:
-
put in CASE x VARIABLE matrix (also called
a dataset or spreadsheet)
-
e.g. hypothesize that intensive case management (ICM)
& job training will reduce the time mothers spend unemployed when compared
to traditional casework
Case No ___ ___ ___
Service (circle one) (1) ICM
(2) casework
Days unemployed ___ ___ ___ ___
Number of Children ___ ___
Number in Household ___ ___
Gender (1) Female
(2) Male
Race Identification (1) African-American
(2) Asian-PI
(3) Hispanic
(4) Native American
(5) Mixed ____ & ____
(6) Other __________________
| Case No (N) | Service (N) | Days Unemp (R) | # Children (R) | # Household (R) | Gender (N) | Race (N) |
| 001 | 1 | 125 | 1 | 2 | 1 | 1 |
| 002 | 2 | 273 | 4 | 7 | 2 | 6 |
| 003 | 2 | 200 | 2 | 4 | 2 | 4 |
| 004 | 1 | 12 | 2 | 3 | 1 | 3 |
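A minimal sketch of the same CASE x VARIABLE matrix in Python, assuming each case is stored as a dictionary keyed by variable name. The variable names (`case_no`, `service`, and so on) are illustrative choices and not part of the original form; the values are the four cases shown above.

```python
# Each row is one case; each key is one variable.
# Variable names are hypothetical; codes follow the data collection form.
dataset = [
    {"case_no": "001", "service": 1, "days_unemp": 125, "children": 1, "household": 2, "gender": 1, "race": 1},
    {"case_no": "002", "service": 2, "days_unemp": 273, "children": 4, "household": 7, "gender": 2, "race": 6},
    {"case_no": "003", "service": 2, "days_unemp": 200, "children": 2, "household": 4, "gender": 2, "race": 4},
    {"case_no": "004", "service": 1, "days_unemp": 12, "children": 2, "household": 3, "gender": 1, "race": 3},
]

# Reading down one variable (a column) across all cases
days_column = [row["days_unemp"] for row in dataset]

# Reading across one case (a row)
case_002 = next(row for row in dataset if row["case_no"] == "002")

print(days_column)
print(case_002["household"])
```

Storing the codes (1, 2, ...) rather than the labels keeps the matrix compact, but the codebook on the form is then needed to interpret them.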
-
Describing the sample is done w/
-
language (qualitative)
-
e.g. it was difficult for many of the subjects
to get transportation to and from the training program
-
statistics (quantitative)
-
e.g. ICM reduced the days unemployed compared
to the casework condition (M_ICM=121
days, M_C=148 days, t=3.75,
df=267, p<0.05)
-
images (qualitative-quantitative)
-
e.g. The ICM condition reduced days unemployed
by 18%
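The 18% figure is consistent with the two means quoted in the statistics example (121 and 148 days). A quick check, assuming the percentage is the relative reduction from the casework mean (that reading is an assumption, since the notes do not say how the figure was derived):

```python
mean_icm = 121        # mean days unemployed, ICM condition
mean_casework = 148   # mean days unemployed, casework condition

# Relative reduction, using casework as the baseline
reduction = (mean_casework - mean_icm) / mean_casework
print(f"{reduction:.0%}")
```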
-
qualitative description enables you to describe
the richness & depth of experience, and to identify themes
-
"thick description"
-
"Mary S. had been off work for nearly three
years. With multiple impairments such as mental illness, addiction, and
illiteracy, it was unlikely she would ever get - and keep - a job."
-
quantitative description enables you to identify
central tendencies and variation with precision; saves space
-
function of statistics: describing the sample,
making inferences about the population
-
DESCRIBING
THE SAMPLE according to its variables in numerical terms
-
can use actual numbers or percentages
-
how described depends on the level of measurement
of variables
-
can do univariate, bivariate,
and multivariate descriptions
-
univariate
=>1 variable
-
What is the average number of unemployed
days?
-
bivariate
=> relationship between 2 variables
-
What is the average number of unemployed
days for women in ICM and for women in Casework?
-
multivariate
=> effect of more than one variable on a dependent variable
-
Considering gender, race, number in household,
number of children, and intervention condition, which is the best predictor
of days unemployed?
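The difference between a univariate and a bivariate description can be sketched with the four cases from the data matrix above. This is an illustration only: four cases are far too few for a real analysis, and the helper name `mean_days_for` is hypothetical.

```python
from statistics import mean

# (service, days_unemployed) for the four cases in the data matrix;
# service codes: 1 = ICM, 2 = casework
cases = [(1, 125), (2, 273), (2, 200), (1, 12)]

# Univariate: one variable at a time
overall_mean = mean(days for _, days in cases)

# Bivariate: days unemployed broken down by a second variable (service)
def mean_days_for(service_code):
    return mean(days for service, days in cases if service == service_code)

print(overall_mean)
print(mean_days_for(1), mean_days_for(2))
```

A multivariate description would add further variables (gender, race, household size) as predictors, which needs a technique such as multiple regression and more cases than this.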
-
Univariate
description
-
measures of central tendency
-
mean
(ratio of the sum of an attribute to the number of cases being summed)
-
median
(the number which divides the sample in half)
-
mode
(the most frequently occurring number)
-
use mean when variable is interval or ratio
AND when use of the mean does not present a distorted or skewed
picture
-
eg: 5 people in a group, and their ages are
20, 21, 22, 23, and 64
-
mean age = 30
-
median age = 22, a better measure of central
tendency
-
better to use the median when data are skewed
-
skew => mean is substantially different from
the median
-
"positive skew" => mean>>median
-
"negative skew" => mean<<median
-
in above example of 5 group members, data
are positively skewed because M>MD
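The five-person example can be reproduced with Python's standard `statistics` module. The `skew_direction` label is just an illustrative shorthand for the mean-versus-median comparison described above, not a formal skewness statistic.

```python
from statistics import mean, median

ages = [20, 21, 22, 23, 64]

m = mean(ages)     # pulled upward by the one 64-year-old
md = median(ages)  # middle value, unaffected by the outlier

# Rule of thumb from the notes: mean above median suggests positive skew
skew_direction = "positive" if m > md else "negative" if m < md else "none"
print(m, md, skew_direction)
```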
-
measures
of dispersion
-
in above example, just using EITHER the mean
age or the median age to describe the age would be deceiving: need an indicator
of diversity, difference, spread, or variation in age
-
range
(the difference between the highest and lowest value of a variable)
-
range = H-L+1
-
64-20+1 = 45 years
-
"ranged from 20 to 64"
-
range is meaningful as a number only for
interval level variable, or for ordinal variable being treated as if it
were interval (e.g. IQ, test scores, attitude measures, clinical evaluation
tools, etc.)
-
standard
deviation
-
s = roughly the average difference between individual
scores and the mean (the square root of the variance)
-
s is not usually done by hand, except on
a calculator
-
s = 19.0 yrs
-
variance
-
s² = the average squared difference between individual
scores and the mean (the square of s)
-
s² is not usually done by hand
-
s² = 362.5
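The dispersion figures for the five ages can be checked with a short sketch. It assumes the sample formulas (dividing by n - 1), which is what the standard library's `stdev` and `variance` use and what reproduces the values quoted above.

```python
from statistics import stdev, variance

ages = [20, 21, 22, 23, 64]

inclusive_range = max(ages) - min(ages) + 1  # H - L + 1
s2 = variance(ages)  # sample variance: sum of squared deviations / (n - 1)
s = stdev(ages)      # sample standard deviation: square root of s2

print(inclusive_range, s2, round(s, 1))
```

Note how the single 64-year-old drives both figures: without that case the remaining ages span only four years.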
-
other measures of dispersion for ordinal,
categorical variables
-
Bivariate
Description: measures of relationship between two important variables
-
relationship between categorical variable(s)
and an interval dependent variable => ANOVA
(F-test)
-
e.g. describe relationship between days unemployed
and race
-
M_W=95,
M_{A-A}=120, M_{H/L}=124,
M_O=98
-
question: are these means significantly different?
Use ANOVA to find out
-
if the categorical variable has only two
values (e.g. gender), we use a special case of ANOVA called the
t-test, or means test
-
e.g.
describe relationship between days unemployed and gender
-
M_M=85,
M_W=120
-
question: is 85 significantly different from 120? Use t-test
to find out
-
linear relationship between two interval
variables =>
regression coefficient (B)
-
e.g.
describe relationship between days unemployed and number
of children
-
M_UNEMPLOYED=100,
M_CHILDREN=2.3
-
question: is unemployment directly related
to number of children? Use linear regression to find the best straight
line (y = bx + c) which describes the relationship between these two variables,
where b is the regression coefficient (slope of the line) and c is the
constant (the value of y when x is zero, or where the line crosses the
y-axis): unemployment = b*children + constant
-
e.g. unemployment = #kids*1.5 + 81 days,
so we predict someone with 4 kids would be unemployed 81+4*1.5 or 87 days
-
a standardized form of b is the Pearson
correlation coefficient [r = b(s_x/s_y)]
which, unlike b, does not depend on the units of measurement
-
e.g. r=0.25 between days unemployed and the
number of children
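A minimal sketch of the regression ideas above. The `predict` call reproduces the worked example (b = 1.5, c = 81). The least-squares fit then uses the four cases from the data matrix purely as an illustration; with so few cases it will not reproduce the coefficients quoted in the notes. The function names are hypothetical.

```python
from statistics import mean, stdev

def predict(x, b, c):
    """Predicted y on the line y = b*x + c."""
    return b * x + c

# Worked example from the notes: 1.5 days per child, constant of 81 days
print(predict(4, b=1.5, c=81))

def least_squares(x, y):
    """Slope b and constant c of the best-fitting straight line."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return b, my - b * mx

# Illustration only: number of children and days unemployed for cases 001-004
children = [1, 4, 2, 2]
days = [125, 273, 200, 12]

b, c = least_squares(children, days)
r = b * stdev(children) / stdev(days)  # r = b(s_x/s_y)
print(round(b, 2), round(c, 2), round(r, 2))
```

Rescaling `days` (for example to weeks) would change b but leave r unchanged, which is the sense in which r does not depend on the units of measurement.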
-
relationship between two categorical or nominal
variables => chi-square
(χ²)
-
e.g. relationship between improvement and
gender
-
We put the data in a
contingency table and use χ²
to suggest whether there is a relationship between the two variables
or whether they are independent of one another
-
in above example, χ²
tests the (null) hypothesis that there is no relationship between gender
and outcome, e.g.
        Improved   No Change   Worse
Men        25         30         10
Women      40         30          5
Note that 25/65 (38%) of men improve and
40/75 (53%) of women improve, but is that difference significant? Use χ²
to find out (it is NOT significant: the critical value for df=2, p<0.05
is 5.99, and the value for this outcome is about 4.44, which is not enough)
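The chi-square statistic for the gender-by-outcome table can be computed directly from the observed counts. This is a plain-Python sketch of the usual formula, the sum of (observed - expected)² / expected over all cells; the critical value of 5.99 is taken from the notes and not computed here.

```python
observed = {
    "Men":   [25, 30, 10],  # Improved, No Change, Worse
    "Women": [40, 30, 5],
}

rows = list(observed.values())
row_totals = [sum(row) for row in rows]
col_totals = [sum(col) for col in zip(*rows)]
n = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(rows):
    for j, obs in enumerate(row):
        # Count expected in this cell if gender and outcome were independent
        expected = row_totals[i] * col_totals[j] / n
        chi_square += (obs - expected) ** 2 / expected

df = (len(rows) - 1) * (len(col_totals) - 1)
critical_value = 5.99  # df = 2, p < 0.05, as given in the notes

print(round(chi_square, 2), df, chi_square > critical_value)
```

Because the statistic stays below the critical value, the null hypothesis of no relationship between gender and outcome is not rejected.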
-
Multivariate
Description
-
relationship between two or more categorical
variables and an interval dependent variable => ANOVA
(F-test)
-
if there are two or more dependent
variables, e.g. attitude about workfare before and after an education
program, we use a more general form of ANOVA called multivariate analysis
of variance, or MANOVA
-
relationship between two or more interval
variables and an interval dependent variable => multiple
regression (B)
-
If
the dependent variable is categorical, do not use linear regression;
use logistic regression instead
-
Beyond Description: Inference
about the connection between sample and population
-
level of confidence in your description
-
95% LOC as norm (p<0.05)
-
χ² = 6.02 is a description
of the relationship between two nominal variables
-
χ² = 6.02, p<0.01 is an inference
that you can only get a χ² as large as 6.02 in
this population by chance 1 out of every hundred times
-
the cutoff in p<0.01 (here 0.01) is called the alpha level, and
it is set by the researcher, who must decide how much chance he or she is
willing to live with, i.e. how much confidence
is needed
-
interval of confidence is the range of values
where the statistic may actually lie in
the population for a given level of confidence
-
to infer the mean: M_pop = M_samp ± (sd of mean)(LOC)
where sd of mean = s/√n and LOC = z, so
M_pop = M_samp ± z(s/√n)
-
z_99=2.58
z_95=1.96
-
e.g. if mean age = 30.0 years, and
if s=2.5 yrs, and
if the sample (n=100) is drawn from a normal
population
-
mean (@LOC=95%) = 30.0 ± (2.5/10)(1.96)
= 30.0 ± 0.49
=> between 29.51 and 30.49
-
the interval 29.51 and 30.49 is the 95% confidence
interval for that sample
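The interval calculation above can be sketched as a small function. It assumes a normal population, as the example does, and looks up z with `statistics.NormalDist` (Python 3.8+) in place of a printed table; z comes out at about 1.96 for 95% and about 2.58 for 99%. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def confidence_interval(sample_mean, s, n, level=0.95):
    """Return (lower, upper) for the mean, using M ± z * s / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided z for this level
    margin = z * s / sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Example from the notes: mean age 30.0, s = 2.5, n = 100
lower, upper = confidence_interval(30.0, 2.5, 100)
print(round(lower, 2), round(upper, 2))

# A higher level of confidence gives a wider interval
lower99, upper99 = confidence_interval(30.0, 2.5, 100, level=0.99)
print(round(lower99, 2), round(upper99, 2))
```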
-
testing hypotheses
-
hypotheses are usually of two forms:
-
Between-group H_1:
there is a difference between two groups
-
H_0: there is NO difference between two groups
-
Within-group H_1:
there is an association between two variables
-
H_0: there is NO association between
two variables
-
the same statistics are used to test inferences
to the population as were used to describe the sample, except now, a level
of confidence is attached
-
difference
between groups
-
nominal dv -- coefficient of dispersion
-
ordinal dv -- Mann-Whitney test or Wilcoxon
test
-
interval dv -- t test
-
ANOVA/MANOVA
-
relation
between two variables
-
nominal variables: Chi Square (χ²)
-
ordinal variables: Spearman correlation coefficient
(ρ)
-
interval variables: Pearson correlation coefficient
(r)
-
regression coefficient
-
error in hypothesis testing
| | Accept Null (say there is no relationship or no difference) | Reject Null (say there is a relationship or a difference) |
| Null is correct in population (there is no relationship or no difference) | No error | Type I error, "false positive", constitutionally protected |
| Null is incorrect in population (there is a relationship or a difference) | Type II error | No error |
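The four cells of the error table can be written as a small lookup. This is only a sketch of the logic: in real research the truth about the null hypothesis in the population is unknown, so `null_is_true` is a hypothetical input used here to show how each cell is defined.

```python
def classify_decision(null_is_true, reject_null):
    """Map a test decision onto the four cells of the error table."""
    if null_is_true and reject_null:
        return "Type I error (false positive)"
    if not null_is_true and not reject_null:
        return "Type II error"
    return "No error"

# Walk through all four combinations of truth and decision
for null_is_true in (True, False):
    for reject_null in (False, True):
        print(null_is_true, reject_null, classify_decision(null_is_true, reject_null))
```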
-
TYPE I ERROR or "false positive"
-
e.g. regarding H_0:
there is no difference in the likelihood a battered woman will terminate
an abusive relationship depending on whether or not she is employed full
time
-
result: find that 36% of employed women terminate
but only 25% of unemployed women terminate
-
believe this is important (CLINICALLY SIGNIFICANT)
-
accept this as evidence that the null (no difference)
should be rejected, and therefore accept the hypothesis that employment affects
the decision to leave.
-
do a statistical test and find chi-square
= 1.90 (n.s. @95%), but say "close enough!"
-
TYPE II ERROR
-
e.g. regarding H_0:
there is no difference in the likelihood a battered woman will terminate
an abusive relationship depending on whether or not she is employed full
time, based on a sample of ten women
-
find that 3 employed and 2 unemployed terminate; this
doesn't seem like much of a difference
-
do a statistical test and find chi-square
= 1.1 (n.s. @95%), so say "not even close!"