SAMPLING
-
We still have the problem of controlling
error. Some (but not all) error can be controlled by
-
random assignment
-
matching
-
statistics
-
always have error based on the
sample selected
-
What do I mean by sample and
why is this an important issue?
-
sample => measurement target,
person or persons (or other things) actually measured
-
sample population
-
population =>
-
group to which the characteristics
of the sample are inferred
-
larger group from which the
sample is drawn
-
statistically, the sample is
a subset of the population
-
much of research addresses the
question of how similar or dissimilar the sample is to the population it
supposedly represents
-
two key sampling issues: representativeness
& adequacy
-
representativeness (=> sample
generalizability)
-
To what extent is this sample
like the population to which I would like to be able to generalize?
-
e.g. how alike are students
in this classroom (sample) to Illinois social workers?
-
adequacy
-
Is this sample big enough to
pick up the effect I am interested in?
-
sample adequacy => sample size
-
power: the ability to
detect an effect
-
e.g.: seeing a flashlight on
a sunny day and on a dark night
-
adequacy is always related to
size or numbers
-
just as random assignment to
a control group solves many of our problems with internal validity, random
selection of
a sample solves many more problems, especially with external validity (generalizability)
-
random assignment and random
selection are both tools designed
to control error through distributing its probability of occurring throughout
the sample
-
special name for a sample where
random selection has been employed: probability sample
-
probability sample => I KNOW
the chance of an individual in the population being included in the sample
-
non-probability sample => I
DON'T KNOW the chance of an individual in the population being included
in the sample
-
Probability Sample
-
sample and the population (review)
-
population: the larger group
(unmeasured) to which you
seek to generalize
-
sample: a subgroup of that population
that is actually measured
-
new term: ERROR:
-
error = the difference in a
variable between its measured value in the sample and its theoretical (unmeasured
or actual) value in the population
-
e.g. what is the average age
of social workers in Illinois, estimated from a sample of students in SocW360?
-
I measured the mean age of the sample
(students in this class) and found it was 22.4 yr
-
Suppose the actual mean age of
social workers in Illinois (which, by definition, I cannot find out)
is 39.8 yr
-
the amount of error in my sample
is 39.8 - 22.4 = 17.4 years (or 17.4/39.8 = 44% sampling error)
-
typically, sampling error is
below 6%
-
Q: why do I have so large a
sampling error in this case???
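The arithmetic in the age example can be written out as a short Python sketch. The function name `sampling_error` is only illustrative, and the two means are the figures used in the notes (the 39.8 is, by definition, not something we could actually measure).

```python
def sampling_error(sample_value, population_value):
    """Return (absolute error, error as a fraction of the population value)."""
    absolute = abs(population_value - sample_value)
    relative = absolute / population_value
    return absolute, relative


# Figures from the notes: class mean age 22.4, assumed true mean 39.8.
absolute, relative = sampling_error(22.4, 39.8)
print(f"error = {absolute:.1f} years ({relative:.0%} of the population mean)")
# prints: error = 17.4 years (44% of the population mean)
```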
-
sampling error =
the difference between the population parameter and the sample statistic.
Its size depends on:
-
number
in sample
-
The larger the sample, the smaller
the sampling error
-
proportion
of sample to population
-
The larger the proportion sampled,
the smaller the sampling error
-
how much confidence
you want to have in your result (95% is the social science standard)
-
The more confidence you want
to have in your results, the larger the sampling error (for a given sample size)
-
e.g. for 95 out of 100 samples
of this size, the specified parameter (e.g. mean) will be within the limits
of the sampling error
-
e.g. if I say that, based on
a sample of 1,100 adults, President Bush's popularity rating is 85% ±
3% with a 95% level of confidence (p<0.05), that means that in 95 out
of 100 samples of 1,100 adults, the estimated popularity rating is going
to be between 82% and 88%
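A rough sketch of where a figure like ± 3% can come from. It assumes the usual normal-approximation formula for a proportion, z * sqrt(p(1-p)/n); that formula is my assumption here, not necessarily what any particular pollster used. The function name `margin_of_error` is illustrative.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Normal-approximation margin of error for a sample proportion.

    p: assumed proportion (0.5 is the most conservative choice)
    n: sample size
    z: critical value; 1.96 corresponds to 95% confidence
    """
    return z * math.sqrt(p * (1 - p) / n)


n = 1100
conservative = margin_of_error(0.5, n)          # worst case, p = 0.5
at_85 = margin_of_error(0.85, n)                # using the reported 85%
wider = margin_of_error(0.5, n, z=2.576)        # 99% confidence, same n

print(f"95%, p=0.50: +/- {conservative:.1%}")   # about 3.0%
print(f"95%, p=0.85: +/- {at_85:.1%}")          # about 2.1%
print(f"99%, p=0.50: +/- {wider:.1%}")          # about 3.9%
```

With the sample size held fixed, asking for 99% instead of 95% confidence widens the interval, which is the sense in which more confidence means a larger sampling error.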
-
homogeneity
of the population
-
The more diverse the population,
the larger sample you need
-
note table 6.2 p.138 which describes
how large a sample you need (at 95% confidence level) based on
-
sampling errors of 3%, 5%, and
10%
-
population heterogeneities of
80-20 (homogeneous) and 50-50 (heterogeneous)
-
different population sizes from
100 to 100,000,000
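The three inputs listed above (sampling error, heterogeneity, population size) can be combined in one calculation. The sketch below assumes the standard formula for a proportion plus a finite population correction. Whether Table 6.2 was built this way is an assumption on my part, so its entries may differ somewhat. `required_sample_size` is an illustrative name.

```python
import math


def required_sample_size(margin, p=0.5, population=None, z=1.96):
    """Sample size needed for a given margin of error at 95% confidence.

    margin: desired sampling error as a proportion (0.05 means 5%)
    p: expected split (0.5 = heterogeneous 50-50, 0.8 = homogeneous 80-20)
    population: population size, or None to treat it as effectively infinite
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    if population is not None:
        # Finite population correction: small populations need fewer cases.
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)


for population in (100, 10_000, 100_000_000):
    for margin in (0.03, 0.05, 0.10):
        hetero = required_sample_size(margin, 0.5, population)
        homo = required_sample_size(margin, 0.8, population)
        print(f"N={population:>11,}  +/-{margin:.0%}  50-50: {hetero:>5}  80-20: {homo:>5}")
```

Past a few thousand people the population size barely changes the answer; the margin of error and the heterogeneity do most of the work.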
-
typical poll is ± 3%
(sample of about 1,200)
-
e.g. In a survey, voters preferred
George Bush to Al Gore by 48% to 42%, so the margin was 6%. At the 95%
level of confidence, and with a 3% margin of error, we would say: In 95 out of 100 surveys
of 1,200 people, the true Bush-Gore margin will be somewhere between 3%
and 9%
-
note: in fact, in the 2000 election
it was a lot less than that, and in the opposite direction, with Gore getting
more of the popular vote but with a margin < 1%
-
typical social science survey
is ± 5% (giving a spread of 10%) using a sample of about 400.
-
Other typical ways of determining
appropriate sample size:
-
30 cases per independent-variable
cell (central limit theorem)
-
at n=30, the distribution of sampling error begins to
approximate a normal curve; below n=30, the curve departs too far from normal
-
"normal" means we can use familiar
(easier) forms of data analysis
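A small simulation can illustrate the idea behind the n=30 rule of thumb. It is a sketch under my own assumptions: the population is an exponential (strongly skewed) distribution chosen for illustration, and skewness is used as a rough yardstick of how far the sample means are from a normal curve (0 = symmetric).

```python
import random
import statistics


def skewness(values):
    """Skewness of a list of numbers: 0 for a symmetric (e.g. normal) shape."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def skew_of_sample_means(n, reps=5000, seed=0):
    """Draw `reps` samples of size n from a skewed population and
    return the skewness of the resulting sample means."""
    rng = random.Random(seed)
    means = [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]
    return skewness(means)


for n in (5, 30, 100):
    print(f"n={n:>3}: skewness of sample means = {skew_of_sample_means(n):.2f}")
```

The skewness shrinks toward 0 as n grows, so the sample means look more and more normal even though the underlying population does not.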
-
e.g. if you wanted to study
the effect of race (4 categories) and gender (2 categories) on Y, you would
have 4 X 2 = 8 cells, so you need a minimum of 8 X 30 = 240 cases to say
anything meaningful about the effects of race and gender on Y
-
note: you could reduce the number
of race cells to 3, which reduces the necessary sample to 3 X 2 X 30 = 180.
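The cell-count rule is simple enough to sketch in a few lines of Python. `minimum_cases` is an illustrative name, and the 30-per-cell default is the rule of thumb from these notes.

```python
from math import prod


def minimum_cases(levels_per_variable, cases_per_cell=30):
    """Return (number of cells, minimum cases) for a set of categorical
    independent variables, given the number of categories in each."""
    cells = prod(levels_per_variable)
    return cells, cells * cases_per_cell


print(minimum_cases([4, 2]))  # race (4) x gender (2) -> (8, 240)
print(minimum_cases([3, 2]))  # race collapsed to 3   -> (6, 180)
```

Each added variable multiplies the number of cells, so the required sample grows quickly.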
-
100 cases minimum for any survey
(rule of thumb)
-
probability sampling
-
define: sampling frame = list from which a sample is drawn
-
simple random sample (SRS)
-
draw a % sample from a sampling
frame using a randomization method
-
problem is the list (could be
huge)
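A minimal sketch of a simple random sample in Python. The sampling frame here is a made-up list of 1,000 placeholder names, and `simple_random_sample` is an illustrative helper, not a library function.

```python
import random


def simple_random_sample(frame, fraction, seed=None):
    """Draw a simple random sample (without replacement) of the given
    fraction of the sampling frame."""
    rng = random.Random(seed)
    n = round(len(frame) * fraction)
    return rng.sample(frame, n)


# Hypothetical sampling frame: 1,000 placeholder entries.
frame = [f"person_{i:04d}" for i in range(1, 1001)]
sample = simple_random_sample(frame, 0.10, seed=42)

print(len(sample), "of", len(frame), "selected")
print("chance of inclusion for any one person:", len(sample) / len(frame))
```

Because every entry on the list has the same known chance of selection (here 0.10), this is a probability sample. The hard part in practice is getting the complete list.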
-
systematic sample
-
select every Kth person on a
list
-
easier than SRS, loses nothing
in terms of error (provided the list has no periodic ordering that lines up with K)
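A sketch of a systematic sample, assuming a random starting point within the first K entries (a common way to do it, though the notes do not specify the start). The frame is a made-up list of ID numbers.

```python
import random


def systematic_sample(frame, k, seed=None):
    """Select every k-th entry of the frame, starting from a random
    position among the first k entries."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # 0 .. k-1
    return frame[start::k]


frame = list(range(1, 1001))  # hypothetical list of 1,000 ID numbers
sample = systematic_sample(frame, k=10, seed=1)

print(len(sample), "selected; first few:", sample[:5])
```

Only one random number is needed, which is why this is easier to carry out by hand than a full SRS.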
-
stratified
-
stratify 3-5 levels on key variable(s)
to make sure there is adequate representation across those key variable(s)
-
eg SWAB study
-
eg if culture/race was important,
would select x% Anglo, x% Af-Amer, x% Native, x% Latino, x% Asian
-
note: in SRS, these would not
all be x% but would differ according to each group's share of the population
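A sketch of stratified sampling with equal numbers drawn per stratum. The strata names and sizes below are invented for illustration, and `stratified_sample` is a hypothetical helper. The printout contrasts the fixed count per stratum with what an SRS of the same total size would be expected to yield.

```python
import random
from collections import defaultdict


def stratified_sample(frame, stratum_of, per_stratum, seed=None):
    """Group the frame by stratum, then draw the same number of cases
    at random from every stratum (equal allocation)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in frame:
        strata[stratum_of(unit)].append(unit)
    return {
        name: rng.sample(members, min(per_stratum, len(members)))
        for name, members in strata.items()
    }


# Hypothetical frame: five strata of very different sizes.
sizes = {"stratum_a": 600, "stratum_b": 200, "stratum_c": 100,
         "stratum_d": 60, "stratum_e": 40}
frame = [(f"{name}_{i}", name) for name, size in sizes.items() for i in range(size)]

sample = stratified_sample(frame, stratum_of=lambda unit: unit[1],
                           per_stratum=20, seed=3)
for name, members in sample.items():
    share = sizes[name] / len(frame)
    print(f"{name}: {len(members)} drawn; an SRS of 100 would expect about {100 * share:.0f}")
```

The smallest stratum gets 20 cases here, where an SRS of 100 would be expected to give it only about 4.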
-
cluster (area, multistage)
-
sampling frame is developed
in stages
-
eg: opinion about workfare held
by Illinois residents
-
level 1: Cook and a random sample
of 10 of remaining 100 Illinois counties
-
question: why is Cook automatically
in?
-
level 2: break 11 counties into
townships, number them, and take 10% SRS of all townships
-
level 3: township grids: enumerate
them
-
level 4: 10% sample of grids
-
level 5: enumerate streets in
each grid: 10% sample of streets
-
level 6: count houses on street:
SRS of 1 house
-
level 7: interview the person
over 17 with birthday closest to the interview date
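The staged procedure can be sketched as nested sampling, where units are only enumerated inside the units already selected at the level above. Everything concrete below is a stand-in: the county names other than Cook, the counts of townships, grids, streets, and houses, and the helper `enumerate_units`, which takes the place of real field enumeration.

```python
import random


def enumerate_units(parent, kind, count):
    """Hypothetical stand-in for listing the units found inside `parent`."""
    return [f"{parent}/{kind}_{i}" for i in range(1, count + 1)]


def srs(units, fraction, rng):
    """Simple random sample of a fraction of `units`, keeping at least one."""
    n = max(1, round(len(units) * fraction))
    return rng.sample(units, n)


def multistage_sample(certainty_county, other_counties, rng):
    # Level 1: one county taken with certainty plus 10 drawn at random.
    counties = [certainty_county] + rng.sample(other_counties, 10)
    houses = []
    for county in counties:
        # Level 2: list townships only in selected counties, take 10%.
        for township in srs(enumerate_units(county, "township", 20), 0.10, rng):
            # Levels 3-4: enumerate grids in the township, take 10%.
            for grid in srs(enumerate_units(township, "grid", 30), 0.10, rng):
                # Level 5: enumerate streets in the grid, take 10%.
                for street in srs(enumerate_units(grid, "street", 20), 0.10, rng):
                    # Level 6: count houses on the street, pick one at random.
                    houses.append(rng.choice(enumerate_units(street, "house", 15)))
    # Level 7 (choosing the respondent within the house) happens in the field.
    return houses


rng = random.Random(7)
other_counties = [f"county_{i:03d}" for i in range(1, 101)]  # stand-in names
houses = multistage_sample("Cook", other_counties, rng)
print(len(houses), "houses selected, e.g.", houses[0])
```

No complete list of Illinois households is ever needed; each list is built only for the areas that survived the previous stage.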
-
non-probability sampling
-
availability (convenience, accidental)
-
snowball
-
e.g. Illinois Protocol study
-
quota--get x number of
people in each cell
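For contrast with the probability designs, here is a sketch of quota sampling: people are taken in whatever order they turn up until each cell is full. The arrival stream, the two variables (gender and age group), and the quota of 5 per cell are all invented for illustration.

```python
def quota_sample(arrivals, cell_of, quotas):
    """Accept people in arrival order until every cell's quota is filled."""
    filled = {cell: [] for cell in quotas}
    for person in arrivals:
        cell = cell_of(person)
        if cell in filled and len(filled[cell]) < quotas[cell]:
            filled[cell].append(person)
        if all(len(filled[c]) >= quotas[c] for c in quotas):
            break  # every quota met; stop recruiting
    return filled


# Hypothetical stream of people in the order they happen to show up.
arrivals = [
    {"id": i,
     "gender": "M" if i % 3 == 0 else "F",
     "age": "40plus" if i % 2 == 0 else "under40"}
    for i in range(200)
]
quotas = {(g, a): 5 for g in ("F", "M") for a in ("under40", "40plus")}

sample = quota_sample(arrivals, lambda p: (p["gender"], p["age"]), quotas)
for cell, people in sample.items():
    print(cell, [p["id"] for p in people])
```

Nothing here is random: who gets in depends on who shows up first, so the chance of inclusion for any individual is unknown.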
-
purposive (judgement) sample--use
of key informants
-
e.g. Illinois Protocol study
-
dimensional: including at least 1 person
for each combination of key variables
-
eg my study of women (n=11)
Index Man An Addict
    Abused
        Recovered: etch | drug
        Not Recovered: etch | drug
    Not Abused
        Recovered: etch | drug
        Not Recovered: etch | drug
Index Man Not An Addict
    Abused
        Continued Abuse
        Discontinued Abuse
    Not Abused