3.2 Margin of Error, Confidence Intervals, and Sample Size
Chapter 3
opinion on current topics of interest. As we have said, if properly conducted,
these surveys are amazingly accurate.
When a survey is used to find a proportion based on a sample of only a few
thousand individuals, the obvious question is how close that proportion comes
to the truth for the entire population. The margin of error is a measure of the
accuracy of a sample proportion. It provides an upper limit on the difference
between the sample proportion and the population proportion that holds for at
least 95% of simple random samples of a specific size. In other words, the
difference between the sample proportion and the unknown value of the population
proportion is less than the margin of error at least 95% of the time, or in
at least 19 of every 20 situations. A conservative estimate of the margin of error
for a sample proportion is calculated as $1/\sqrt{n}$, where n is the sample size.
To express results in terms of percentages instead of proportions, simply multiply
everything by 100.
definition
The conservative margin of error for a sample proportion is calculated by
using the formula $1/\sqrt{n}$, where n represents the sample size, the number of
people in the sample. The amount by which the sample proportion differs from
the true population proportion is less than this quantity in at least 95% of all
random samples. Survey results reported in the media are usually expressed as
percentages. The conservative margin of error for a sample percentage is
$$\frac{1}{\sqrt{n}} \times 100\%$$
In Chapter 10, we will learn a more complicated and precise formula for the
margin of error that depends on the actual sample proportion. It will never give
an answer larger than the formula given here, so we call this formula a
conservative margin of error. For simplicity, many media reports use the conservative
formula. For example, with a sample of 1600 people, we will usually get an estimate
that is accurate to within $1/\sqrt{1600} = 1/40 = 0.025 = 2.5\%$ of the truth.
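The conservative formula is simple enough to check by machine. The following is a minimal Python sketch; the function name `conservative_margin_of_error` is an illustrative choice, not something defined in this book.

```python
import math


def conservative_margin_of_error(n: int) -> float:
    """Return the conservative margin of error 1/sqrt(n) as a proportion."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return 1 / math.sqrt(n)


# The sample of 1600 people discussed in the text.
moe = conservative_margin_of_error(1600)
print(f"n = 1600: margin of error = {moe:.3f}, or {moe:.1%}")
```

Multiplying the returned proportion by 100 gives the margin of error in percentage points.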
Conﬁdence Intervals
You might see results such as “Fifty-ﬁve percent of respondents support the
President’s economic plan. The margin of error for this survey is plus or minus
2.5 percentage points.” This means that it is almost certain that between 52.5%
and 57.5% of the entire population supports the plan. In other words, add and
subtract the margin of error to the sample value (55% in this example) to create
an interval. If you were to follow this method every time you read the results
of a properly conducted survey, the interval would miss covering the truth
only about 1 in 20 times (5%) and would cover the truth the remaining 95% of
the time.
A confidence interval is an interval of values that estimates an unknown
population value. In the preceding paragraph, the percentage of all U.S. adults
favoring the President’s economic policy is the unknown value we wish to
estimate. The confidence interval estimate of this value is 52.5% to 57.5%,
calculated as sample percentage ± margin of error.
Sampling: Surveys and How to Ask Questions
The word conﬁdence refers to the fraction or percentage of random samples
for which a conﬁdence interval procedure gives an interval that includes the
unknown value of a population parameter. For the procedure described here,
the conﬁdence is at least 95%, meaning that intervals calculated in this manner
will capture the population proportion for at least 95% of all properly selected
random samples.
definition
95% Confidence Interval for Population Proportion
For about 95% of properly conducted sample surveys, the interval
$$\text{sample proportion} - \frac{1}{\sqrt{n}} \quad \text{to} \quad \text{sample proportion} + \frac{1}{\sqrt{n}}$$
will contain the actual population proportion. This interval is called an
approximate “95% confidence interval” for the population proportion. In Chapter 10,
a more precise formula will be given.
Another way of writing the approximate 95% confidence interval is
$$\text{sample proportion} \pm \frac{1}{\sqrt{n}}$$
To report the results in percentages instead of proportions, multiply everything
by 100%.
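As a rough illustration, the interval can be computed in a few lines of Python. The function name `approx_95_ci` is hypothetical, and the sample size of 1600 is an assumption chosen because it gives the 2.5-point margin of error quoted for the economic-plan poll above.

```python
import math


def approx_95_ci(sample_proportion: float, n: int) -> tuple[float, float]:
    """Approximate 95% confidence interval: sample proportion +/- 1/sqrt(n)."""
    margin = 1 / math.sqrt(n)
    return sample_proportion - margin, sample_proportion + margin


# 55% support in a sample assumed to have n = 1600 respondents.
low, high = approx_95_ci(0.55, 1600)
print(f"{low:.1%} to {high:.1%}")
```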
Here are three examples of polls, with the margin of error and an approximate 95% conﬁdence interval for each one. In each case, the poll is based on a
nationwide random sample of Americans aged 18 and older.
Example 3.3
The Importance of Religion for Adult Americans For a CNN/USA Today/
Gallup Poll conducted on September 2 to 4, 2002, a random sample of n = 1003
adult Americans was asked, “How important would you say religion is in your
own life: very important, fairly important, or not very important?”
(http://www.pollingreport.com/religion.htm, September 17, 2002). The percentages
that selected each response were as follows:

Very important        65%
Fairly important      23%
Not very important    12%
No opinion             0%

Conservative margin of error: $1/\sqrt{1003} = .03$ or $.03 \times 100\% = 3\%$

Approximate 95% confidence intervals for the proportion and percentage of
all adult Americans who would say religion is very important are
Proportion: .65 ± .03 or [.65 - .03 to .65 + .03] or .62 to .68
Percentage: 65% ± 3% or [65% - 3% to 65% + 3%] or 62% to 68% ■
Example 3.4
Would You Eat Those Modified Tomatoes? An ABCNews.com poll conducted by TNS
Intersearch on June 13 to 17, 2001, posed the following question
to a random sample of n = 1024 adult Americans: “Scientists can change the
genes in some food crops and farm animals to make them grow faster or bigger
and be more resistant to bugs, weeds and disease. Do you think this genetically
modified food, also known as bio-engineered food, is or is not safe to eat?”
(http://www.pollingreport.com/science.htm, September 17, 2002). The percentages
that selected each response were as follows:

Is safe        35%
Is not safe    52%
No opinion     13%

Conservative margin of error: $1/\sqrt{1024} = .03$ or $.03 \times 100\% = 3\%$

Approximate 95% confidence intervals for the proportion and percentage
of all adult Americans who would say genetically modified food is not safe to
eat are
Proportion: .52 ± .03 or [.52 - .03 to .52 + .03] or .49 to .55
Percentage: 52% ± 3% or [52% - 3% to 52% + 3%] or 49% to 55% ■
Example 3.5
Cloning Human Beings A Pew Research Center for the People & the Press
and Pew Forum on Religion & Public Life survey conducted by Princeton Survey
Research Associates between February 25 and March 10, 2002, asked a random
sample of n = 2002 adult Americans, “Do you favor or oppose scientific
experimentation on the cloning of human beings?”
(http://www.pollingreport.com/science.htm, September 17, 2002). The percentages
that selected each response were as follows:

Favor                 17%
Oppose                77%
Don’t know/Refused     6%

Conservative margin of error: $1/\sqrt{2002} = .022$ or $.022 \times 100\% = 2.2\%$

Approximate 95% confidence intervals for the proportion and percentage of
all adult Americans who would say they favor this experimentation are
Proportion: .17 ± .022 or [.17 - .022 to .17 + .022] or .148 to .192
Percentage: 17% ± 2.2% or [17% - 2.2% to 17% + 2.2%] or 14.8% to 19.2% ■
Interpreting the Conﬁdence Intervals in Examples 3.3 to 3.5
In each of the three examples of surveys, an approximate 95% conﬁdence interval was found. There is no way to know whether all, some, or none of these
intervals actually covers the population value of interest. For instance, the interval from 62% to 68% may or may not capture the percentage of adult Americans who considered religion to be very important in their lives in September
2002. But in the long run, this procedure will produce intervals that capture the
unknown population values about 95% of the time, as long as it is used with
properly conducted surveys. This long-run performance is usually expressed
after an interval is computed by saying that we are 95% confident that the
population value is covered by the interval. We will learn more about how to
interpret a confidence interval and the accompanying confidence level (such as
95%) in Chapter 10.
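The long-run interpretation can be explored with a small simulation. The sketch below is only an illustration under stated assumptions: it pretends the true population proportion is 0.65, treats the population as essentially infinite, and uses the hypothetical function name `coverage_rate`. It draws many samples of n = 1003 and counts how often the interval sample proportion ± $1/\sqrt{n}$ captures the assumed truth.

```python
import math
import random


def coverage_rate(true_p: float, n: int, n_surveys: int, seed: int = 0) -> float:
    """Fraction of simulated surveys whose interval contains true_p."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    margin = 1 / math.sqrt(n)          # conservative margin of error
    hits = 0
    for _ in range(n_surveys):
        # Each respondent independently says "yes" with probability true_p.
        successes = sum(rng.random() < true_p for _ in range(n))
        sample_proportion = successes / n
        if abs(sample_proportion - true_p) < margin:
            hits += 1
    return hits / n_surveys


rate = coverage_rate(true_p=0.65, n=1003, n_surveys=1000)
print(f"Intervals that captured the assumed true proportion: {rate:.1%}")
```

In a real survey the true proportion is unknown, so this kind of check is possible only in a simulation where the truth is chosen in advance.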
Choosing a Sample Size for a Survey
Table 3.1 Relationship Between Sample Size and Margin of Error for 95% Confidence

Sample Size n    Margin of Error = $1/\sqrt{n}$
100              .10 (10%)
400              .05 (5%)
625              .04 (4%)
1,000            .032 (3.2%)
1,600            .025 (2.5%)
2,500            .02 (2%)
10,000           .01 (1%)
When surveys are planned, the choice of a sample size is an important issue.
One commonly used strategy is to use a sample size that provides a desired
margin of error for a 95% conﬁdence interval. Table 3.1 displays the margin of
error for several different sample sizes. The margin of error calculations were
done by using the “conservative” formula $1/\sqrt{n}$. This is commonly done by
polling organizations because, before the sample is observed, there is no way to
know what sample proportion to use in the more exact margin of error formula.
With a table like Table 3.1, researchers can pick a sample size that provides suitable accuracy for any sample proportion, within the constraints of the time and
money available for the survey.
Two important features of Table 3.1 are
1. When the sample size is increased, the margin of error decreases.
2. When a large sample size is made even larger, the improvement in accuracy is relatively small. For example, when the sample size is increased
from 2500 to 10,000, the margin of error decreases only from 2% to 1%. In
general, cutting the margin of error in half requires a fourfold increase in
sample size.
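Turning the conservative formula around gives the sample size needed for a desired margin of error m: solving $1/\sqrt{n} \le m$ for n gives $n \ge 1/m^2$. The Python sketch below applies this; the function name `required_sample_size` is illustrative, and margins are passed as strings so that the arithmetic is exact.

```python
import math
from fractions import Fraction


def required_sample_size(margin: str) -> int:
    """Smallest n with conservative margin of error 1/sqrt(n) <= margin."""
    m = Fraction(margin)               # exact value, avoids float round-off
    if m <= 0:
        raise ValueError("margin must be positive")
    return math.ceil(1 / m**2)


for margin in ["0.10", "0.05", "0.04", "0.025", "0.02", "0.01"]:
    print(f"margin {margin}: n = {required_sample_size(margin):,}")
```

Going from a margin of 0.02 to 0.01 raises the required n from 2,500 to 10,000, the fourfold increase described in the second feature above.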
Polling organizations determine a sample size that is accurate enough for
their purposes and is also economical. Many national surveys use a sample size
of about 1000, which, as you can see from Table 3.1, makes the margin of error
roughly 3%. This is a reasonable degree of accuracy for most questions asked in
these surveys.
Some federal government surveys utilize much larger sample sizes, sometimes as large as n = 120,000, to make accurate estimates of quantities such
as the unemployment rate. Also, when researchers want to make accurate
estimates for subgroups within the population, they have to use a very large
overall sample size. For instance, to get information from approximately 1000
African-American women in the 18- to 29-year-old age group, a random sample
of 120,000 Americans might be necessary.
The Effect of Population Size
3.2 Exercises are on page 108.
You might wonder how the number of people in the population affects the accuracy of a survey. The surprising answer is that for most sample surveys, the
number of people in the population has almost no influence on the accuracy of
sample estimates. The margin of error for a sample size of 1000 is about 3%
whether the number of people in the population is 30,000 or 200 million.
The formulas for margin of error in this chapter were derived by assuming
that the number of units in the population is essentially inﬁnite. In practice,
as long as the population is at least ten times as large as the sample, we consider only how sample size affects accuracy and we ignore the speciﬁc size of
the population. For small populations, a “ﬁnite population correction” is used.
We will not discuss it in this book, but you can consult any book on survey sampling for details.
Thought Question 3.2 Suppose that a survey of 400 students at your school is
conducted to assess student opinion about a new academic honesty policy. Based
on Table 3.1, about what will be the margin of error for the poll? How many
students attend your school? Given this figure, do you think the values in
Table 3.1 should be used to estimate the margin of error for a survey of students
at your school? Explain.*
3.3 Choosing a Simple Random Sample
The ability of a relatively small sample to accurately reﬂect a huge population
does not happen haphazardly. It happens only if proper sampling methods are
used. A probability sampling plan is one in which everyone in the population
has a specified probability of being selected for the sample. The basic idea is
that everyone in the population must have a specified chance of making it into
the sample.
The most basic probability sampling plan is to use a simple random
sample. Remember that with a simple random sample, every conceivable
group of units of the required size has the same chance of being the selected
sample. In this section, we discuss how to choose a simple random sample. A
variety of other probability sampling methods will be discussed in Section 3.4.
Choosing a simple random sample is somewhat like choosing the winning
numbers in many state lotteries. For instance, in the Pennsylvania Match 6 lottery game, six numbers are randomly selected from the choices 1, 2, . . . , 49.
Every possible set of six numbers is equally likely to be the winning set. There
are actually 13,983,816 different possible sets, which is why the odds of any
speciﬁc individual guessing the winning set are so small!
Similarly, the chance that any particular group of units is selected as
the random sample from a large population is quite small, but whatever
group is selected is likely to be representative. For instance, in a simple random
sample of 1000 people in your state or country, it is extremely unlikely that you
and your next-door neighbor would both be selected. But it is extremely likely
that someone in the sample will be representative of each of you, having
opinions similar to yours.
By the way, if there were 100 million people in a population, the number
of different possible samples of 1000 individuals would be incomprehensibly
large. It would take about one and a half pages of this book to show the value.
Approximately, the number of possible different samples of 1000 individuals
selected from a population of 100 million is 247 followed by 5430 zeros.
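Both counts quoted in this section are "n choose k" values and can be checked with Python's standard `math.comb`. This is a small sketch, not part of the original text. The very large count is summarized by its leading digits and power of ten rather than printed in full.

```python
import math

# Number of possible sets of six numbers chosen from 1, 2, ..., 49.
match6_sets = math.comb(49, 6)
print(f"{match6_sets:,}")

# Number of possible samples of 1000 people from a population of 100 million.
samples = math.comb(100_000_000, 1000)
exponent = int(math.log10(samples))            # power of ten of the leading digit
leading_three = samples // 10 ** (exponent - 2)
print(f"about {leading_three} followed by {exponent - 2} more digits")
```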
To actually produce a simple random sample, you need only two things.
First, you need a list of the units in the population. For instance, in drawing
the winning lottery numbers, the list of units is the numbers 1, 2, . . . , 49. In
selecting a simple random sample of students from your school, the list of units
is all students in the school (which the registrar can usually produce).
Second, you need a source of random numbers. Computer programs such
as Minitab and Excel or the right calculator can be used to generate random
numbers. Random numbers can also be found in tables called tables of random
digits. If the population isn’t very large, physical methods can be used,
such as in lotteries in which the numbers are written on small, hollow plastic
balls and six of them are physically selected.
*Hint (Thought Question 3.2): The sample size is n = 400. Is the number of
students at your school more than ten times the sample size?
Table 3.2 illustrates a portion of a table of random digits. It is organized into
numbered rows to make it easier to ﬁnd speciﬁc sections of the table. There are
only ten rows in Table 3.2, so they are labeled by using the single digits 0 to 9.
A larger table would have to use longer identifying numbers for the rows. The
digits are grouped into columns of ﬁve for easier reading. These tables are generated by the equivalent of writing the digits from 0 to 9 on slips of paper, mixing them well, choosing one, and then repeating this process over and over
again with replacement. No single digit, pair of consecutive digits, triplet of digits, and so on is any more or less likely to occur than any others.
Table 3.2 A Table of Random Digits

Row
0     00157  37071  79553  31062  42411  79371  25506  69135
1     38354  03533  95514  03091  75324  40182  17302  64224
2     59785  46030  63753  53067  79710  52555  72307  10223
3     27475  10484  24616  13466  41618  08551  18314  57700
4     28966  35427  09495  11567  56534  60365  02736  32700
5     98879  34072  04189  31672  33357  53191  09807  85796
6     50735  87442  16057  02883  22656  44133  90599  91793
7     16332  40139  64701  46355  62340  22011  47257  74877
8     83845  41159  67120  56273  67519  93389  83590  12944
9     12522  20743  28607  63013  60346  71005  90348  86615
Using Table 3.2 to Choose a Simple Random Sample
Here are the steps for selecting a simple random sample using Table 3.2.
1. Number the units in the population, using the same number of digits
for each one.
Example: Suppose there are 270 students in a class and the teacher wants
to choose a simple random sample of 10 of them to call on in class. Number the students from 001 to 270.
2. Choose a starting point in Table 3.2. You can close your eyes and point or
use any other method as long as you have not studied the table and then
chosen numbers that give a favorable sample for your purpose.
Example: Start in row 3, column 2 (10484 . . .).
3. From the starting point, read across the row to get numbers with the correct
number of digits to identify a unit. Continue with consecutive rows. For instance, if the units are numbered 001 to 270, read three-digit numbers.
Example: Reading three-digit numbers starting with row 3, column 2 results
in 104, 842, 461, 613, 466, 416, 180, 855, 118, 314, 577, 002, 896, and so on.
4. If a number is in the range of the unit numbers, select that unit number.
Otherwise, continue along the row, choosing more potential unit numbers until you have a sample of the desired size. (But see step 5 for a more
efﬁcient method.) Because each unit can be used only once, if a unit number occurs that has already been selected, simply ignore that number and
continue.
Example: Unit numbers can only be 001 to 270, so most of the numbers
that are chosen are simply ignored. Select units 104, 180, 118, 002, then
keep going until the required 10 students have been selected.
5. Step 4 is very inefﬁcient. To make it more efﬁcient, reassign some of the
higher numbers onto the range of unit numbers in a way that still ensures
that each unit number has an equal chance of selection. For instance,
suppose the units are numbered 001 to 270. If a number between 301
and 570 is chosen, subtract 300 and use it. If a number between 601 and
870 is chosen, subtract 600 and use it. As in step 4, if a unit number occurs more than once, simply ignore subsequent occurrences after the
ﬁrst one. For instance, in the scenario just given, if unit 017 has already
been selected, then any subsequent occurrence of 017, 317, or 617 would
be discarded.
Example: Now the string generated earlier is more useful. Let’s see what
decision follows each three-digit entry:
Three-Digit    Using Step 4,    Using Step 5,
Number         Choose Unit      Choose Unit
104            104              104
842            Discard          242 (subtracted 600)
461            Discard          161 (subtracted 300)
613            Discard          013 (subtracted 600)
466            Discard          166 (subtracted 300)
416            Discard          116 (subtracted 300)
180            180              180
855            Discard          255 (subtracted 600)
118            118              118
314            Discard          014 (subtracted 300)
577            Discard          Not needed, 10 already selected
002            002              Not needed
Using the method in step 4, the ﬁrst four students who are selected would
be those numbered 104, 180, 118, and 002, and the process would continue until six more were selected. Using the method in step 5, the sample of ten units
would include 104, 242, 161, 013, 166, 116, 180, 255, 118, and 014, and unit 002
would not be needed.
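The procedure in steps 1 to 5 can also be written as a short program. The following Python sketch is one possible rendering for the 270-student example, not a general-purpose tool. The function names are illustrative, and the step 5 offsets (300 and 600) are hard-coded for units numbered 001 to 270.

```python
# Table 3.2, one string per row (rows 0 to 9).
TABLE_3_2 = [
    "00157 37071 79553 31062 42411 79371 25506 69135",
    "38354 03533 95514 03091 75324 40182 17302 64224",
    "59785 46030 63753 53067 79710 52555 72307 10223",
    "27475 10484 24616 13466 41618 08551 18314 57700",
    "28966 35427 09495 11567 56534 60365 02736 32700",
    "98879 34072 04189 31672 33357 53191 09807 85796",
    "50735 87442 16057 02883 22656 44133 90599 91793",
    "16332 40139 64701 46355 62340 22011 47257 74877",
    "83845 41159 67120 56273 67519 93389 83590 12944",
    "12522 20743 28607 63013 60346 71005 90348 86615",
]


def three_digit_numbers(start_row, start_col):
    """Step 3: read three-digit numbers across the row, then down the rows.

    start_col counts the five-digit groups from 1, as in the text.
    """
    first = TABLE_3_2[start_row].split()[start_col - 1:]
    rest = [group for row in TABLE_3_2[start_row + 1:] for group in row.split()]
    digits = "".join(first + rest)
    return [int(digits[i:i + 3]) for i in range(0, len(digits) - 2, 3)]


def unit_by_step4(number, population=270):
    """Step 4: keep a number only if it is itself a unit number."""
    return number if 1 <= number <= population else None


def unit_by_step5(number, population=270):
    """Step 5: also map 301-570 and 601-870 back onto 001-270."""
    for offset in (0, 300, 600):
        if 1 <= number - offset <= population:
            return number - offset
    return None


def select_sample(numbers, to_unit, sample_size=10):
    """Collect distinct units in order until the sample is complete."""
    chosen = []
    for number in numbers:
        unit = to_unit(number)
        if unit is not None and unit not in chosen:
            chosen.append(unit)
        if len(chosen) == sample_size:
            break
    return chosen


numbers = three_digit_numbers(start_row=3, start_col=2)
print("Step 4:", select_sample(numbers, unit_by_step4))
print("Step 5:", select_sample(numbers, unit_by_step5))
```

Run with the same starting point as the example (row 3, column 2), the step 5 version should reproduce the ten units listed in the table above, while the step 4 version has to read much further into Table 3.2 to find ten usable numbers.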
MINITAB tip
Picking a Random Sample
● To create a column of ID numbers, use Calc > Make Patterned Data >
Simple Set of Numbers. In the dialog box, specify a column for storing the
ID numbers, and specify the first and last possible ID number for the
population.
● To sample values from a column, use Calc > Random Data > Sample from
Columns. In the dialog box, specify how many items (rows) will be selected
from a particular column, and specify a column where the sample
will be stored.
Note: Items can be randomly selected from a column of names or data values,
so it may not be necessary to assign ID numbers to the units in the
population in order to select a sample.
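Readers without Minitab can do the same thing in most programming languages. As one hedged illustration (Python is not used in this book), the standard library function `random.sample` draws a simple random sample without replacement. The class of 270 students is borrowed from the earlier example.

```python
import random

# ID numbers 1 to 270, one per student in the class.
id_numbers = list(range(1, 271))

# Draw 10 distinct ID numbers; every set of 10 is equally likely.
sample_ids = random.sample(id_numbers, k=10)
print(sorted(sample_ids))
```

Because no seed is set, each run produces a different sample.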
Example 3.6
Representing the Heights of British Women In Example 2.17 of Chapter 2,
we examined data on the heights of 199 British women. Suppose you had a list
of these 199 women and wanted to choose ten of them to test-drive a sporty, but
small, automobile model and give their opinions about its comfort. The heights
of the women in the sample should be representative of the range of heights in
the larger group. You would not want your sample of ten to include only short
women or only tall women. Here’s how you could choose a simple random
sample:
1. Assign an ID number from 001 to 199 to each woman.
2. Use a table of random digits, a computer, or a calculator to randomly select ten numbers between 001 and 199, and sample the heights of the
women with those ID numbers.
We used the statistical package Minitab to choose a simple random sample of
heights, then used Table 3.2 to choose another one. The samples are listed next
along with their sample means and the list of ten random numbers between 001
and 199 that generated them (with leading zeros dropped). In Chapter 2, the
heights were given in millimeters, but here they have been converted to inches.
Sample 1 (Minitab)
ID numbers of the women selected: 176, 10, 1, 40, 85, 162, 46, 69, 77, 154
Heights: 60.6, 63.4, 62.6, 65.7, 69.3, 68.7, 61.8, 64.6, 60.8, 59.9;
mean = 63.7 inches
Sample 2 (Table 3.2)
ID numbers of the women selected: 41, 93, 167, 33, 157, 131, 110, 180, 185, 196
Heights: 59.4, 66.5, 63.8, 62.6, 65.0, 60.2, 67.3, 59.8, 67.7, 61.8;
mean = 63.4 inches
3.3 Exercises are on page 109.
Sample 2 was selected by starting with the third set of ﬁve digits in the row labeled 5 (04189). Numbers from 001 to 199 can be used directly; for numbers between 201 and 399, subtract 200; for numbers between 401 and 599, subtract 400;
and so on. The only numbers that would need to be discarded in using this
method are 000, 200, 400, 600, and 800. You can test your understanding of the
use of Table 3.2 by starting at 04189 (third set in row 5), selecting ten consecutive sets of three digits, and determining whether you correctly identify the ten
women in sample 2.
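One way to check your answer is with a few lines of code. This Python sketch reads the digits of row 5 of Table 3.2 beginning with 04189 and applies the subtraction rule just described, written here as a remainder after division by 200. The variable names are illustrative.

```python
# Row 5 of Table 3.2, starting with the third set of five digits.
ROW_5_FROM_THIRD_SET = "04189 31672 33357 53191 09807 85796"

digits = ROW_5_FROM_THIRD_SET.replace(" ", "")
selected = []
for i in range(0, len(digits) - 2, 3):
    number = int(digits[i:i + 3])
    unit = number % 200                # same as subtracting 200, 400, 600, or 800
    if unit == 0 or unit in selected:  # discard 000, 200, ... and any repeats
        continue
    selected.append(unit)
    if len(selected) == 10:
        break

print(selected)
```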
As you can see, each sample is different, but each should be representative
of the whole collection of 199 women. Within each sample, the range of heights
is from about 60 inches to about 69 inches. The sample means are both close to
the mean height for the larger group, which was given in Chapter 2 as 1602 mm,
or about 63 inches. ■
3.4 Other Sampling Methods
For large populations, it may not be practical to take a simple random sample
because it may be difﬁcult to get a numbered list of the units. For instance, if a
polling organization wanted to take a simple random sample of all voters or all
adults in a country or region, the organization would need to get a numbered
list of them, which is simply an impossible task in most cases. Instead, it relies
on more complicated sampling methods, all of which are good substitutes for
simple random sampling in most situations. In fact, they often have advantages
over simple random sampling. For instance, one of these other methods, stratiﬁed random sampling, can be used to increase the chance that the sample represents important subgroups within the population.
To visually illustrate the various sampling plans discussed in this chapter,
let’s suppose that a college administration would like to survey a sample of students living in dormitories. The college has undergraduate and graduate dormitories. The undergraduate dormitories have three ﬂoors each with 12 rooms per
ﬂoor. The graduate dormitories have ﬁve ﬂoors each with 8 rooms per ﬂoor.
Figure 3.1 illustrates a simple random sample of 30 rooms in the dormitories. Any collection of 30 rooms has an equal chance of being the selected
sample. Notice that for the sample illustrated, there are 12 undergraduate
rooms and 18 graduate rooms in the sample.
[Diagram of the undergraduate dormitories (first to third floors, 12 rooms per
floor) and the graduate dormitories (first to fifth floors, 8 rooms per floor).]
Figure 3.1 ❚ A simple random sample of 30 dorm rooms
Stratiﬁed Random Sampling
A stratiﬁed random sample is collected by ﬁrst dividing the population of units
into strata (subgroups of the population) and then taking a simple random
sample from each one. The strata are subgroups that seem important to represent properly in the sample and might differ with regard to values of the response variable(s) measured. For example, public opinion pollsters often take
separate random samples from each region (stratum) of the country so that they
can spot regional differences as well as improve the likelihood that all regions
are properly represented in the overall national sample. Or political pollsters
may separately sample from each political party to compare opinions by party
and to be sure that each party is properly represented.
Figure 3.2 illustrates a stratified sample for the college survey. There are two
strata: the undergraduate and graduate dorms. A random sample of 15 rooms is
taken from each of the two strata. Each collection of 15 undergraduate rooms
has an equal chance of being the selected sample for the undergraduate dorms,
and each collection of 15 graduate rooms has an equal chance of being the
selected sample for the graduate dorms. But the total of 15 rooms to be sampled
within each stratum (undergraduate or graduate) is fixed before the sample is
selected.
[Diagram of the undergraduate dormitories (first to third floors, 12 rooms per
floor) and the graduate dormitories (first to fifth floors, 8 rooms per floor).]
Figure 3.2 ❚ A stratified sample of 15 undergraduate and 15 graduate dorm rooms
A principal beneﬁt of stratiﬁed sampling is that it can be used to improve the
chance that the selected sample properly represents important subgroups in
the population. It also is used to create more accurate estimates of population
values than we might get from using a simple random sampling method.
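To make the contrast with simple random sampling concrete, here is a minimal Python sketch of the dormitory example. It assumes three undergraduate and three graduate dormitories, the counts given in the discussion of cluster sampling later in this section. The helper `build_rooms` and the room labels are illustrative.

```python
import random


def build_rooms(kind, n_dorms, floors_per_dorm, rooms_per_floor):
    """List every room as a (kind, dorm, floor, room) tuple."""
    return [
        (kind, dorm, floor, room)
        for dorm in range(1, n_dorms + 1)
        for floor in range(1, floors_per_dorm + 1)
        for room in range(1, rooms_per_floor + 1)
    ]


undergrad_rooms = build_rooms("undergraduate", 3, 3, 12)   # 108 rooms
graduate_rooms = build_rooms("graduate", 3, 5, 8)          # 120 rooms

rng = random.Random(2024)   # arbitrary seed so the run is repeatable

# Simple random sample: 30 rooms from the combined list; the split varies.
srs = rng.sample(undergrad_rooms + graduate_rooms, 30)

# Stratified random sample: exactly 15 rooms from each stratum.
stratified = rng.sample(undergrad_rooms, 15) + rng.sample(graduate_rooms, 15)

for name, sample in [("simple random", srs), ("stratified", stratified)]:
    n_undergrad = sum(room[0] == "undergraduate" for room in sample)
    print(f"{name}: {n_undergrad} undergraduate, {len(sample) - n_undergrad} graduate")
```

The stratified sample always contains 15 rooms of each type, whereas the split in the simple random sample is left to chance.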
So far, we have been focusing on measuring categorical variables, such as
opinions or traits people might have. Surveys are also used to measure quantitative variables, such as age at ﬁrst intercourse or number of cigarettes smoked per
day. We are often interested in the population average for such measurements.
The accuracy with which we can estimate the average depends on the natural variability among the measurements. The less variable they are, the more
precisely we can assess the population average on the basis of the sample values. For instance, if everyone in a relatively large sample reports that the age at
ﬁrst intercourse was between 16 years 3 months and 16 years 4 months, then
we can be relatively sure that the average age in the population is close to that.
On the other hand, if the reported ages range from 13 years to 25 years, then we
cannot pinpoint the average age for the population nearly as accurately.
Stratified sampling can help to solve this problem. Suppose we could figure
out how to stratify in such a way that there is little natural variability in the
answers within each stratum. We could then get an accurate estimate for each
stratum and combine these estimates to get a much more precise answer for the
overall group.
Cluster Sampling
In cluster sampling, the population is divided into clusters (subgroups), but
rather than sampling within each cluster, we select a random sample of clusters
and include only members of these selected clusters in the sample. After
clusters are selected, it may be that all members of a cluster are included in the
sample or that some units are then randomly sampled from each of the selected
clusters. A cluster typically comprises units that are physically close to each
other in some way, such as the students living on one floor of a college
dormitory, all individuals listed on a single page of a telephone directory, or
all passengers on a particular airplane flight.
[Diagram of the undergraduate dormitories (first to third floors, 12 rooms per
floor) and the graduate dormitories (first to fifth floors, 8 rooms per floor).]
Figure 3.3 ❚ A cluster sample in which five floors (clusters) are randomly selected
Figure 3.3 illustrates a cluster sampling plan for the college survey. Each
ﬂoor of each dormitory is a cluster in this particular plan. A random sample of
ﬁve ﬂoors is selected from the 24 ﬂoors of the three undergraduate and three
graduate dorms. Any collection of ﬁve ﬂoors has an equal chance of being the
selected sample of ﬂoors. Once the ﬁve ﬂoors have been selected, all students
on the ﬁve selected ﬂoors are surveyed. This plan is efﬁcient logistically because
the data collection team will have to visit only ﬁve different dormitory ﬂoors to
collect data.
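The cluster plan just described can be sketched in Python as follows. The floor and room labels are illustrative, and every room on a selected floor is included, as in the plan above.

```python
import random

ROOMS_PER_FLOOR = {"undergraduate": 12, "graduate": 8}

# Each floor of each dormitory is one cluster: 3 x 3 + 3 x 5 = 24 floors.
floors = [("undergraduate", dorm, floor)
          for dorm in range(1, 4) for floor in range(1, 4)]
floors += [("graduate", dorm, floor)
           for dorm in range(1, 4) for floor in range(1, 6)]

rng = random.Random(7)                      # arbitrary seed for repeatability
selected_floors = rng.sample(floors, 5)     # random sample of five clusters

# Survey every room on each selected floor.
surveyed_rooms = [
    (kind, dorm, floor, room)
    for kind, dorm, floor in selected_floors
    for room in range(1, ROOMS_PER_FLOOR[kind] + 1)
]

print(f"{len(floors)} floors in all; selected: {selected_floors}")
print(f"{len(surveyed_rooms)} rooms surveyed")
```

Only a list of floors is needed to draw the sample, and the number of rooms surveyed depends on which floors happen to be chosen.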
Cluster sampling is often confused with stratiﬁed sampling, but it is a radically different concept and can be much easier to accomplish. In most applications of stratiﬁed sampling, the population is divided into a few large strata,
such as regions of the country, and a small subset is then randomly sampled
from each of the strata. In most applications of cluster sampling, the population
is divided into small clusters, such as city blocks, a large number of clusters are
randomly sampled, and either everyone or a sample is measured in those clusters selected. In stratiﬁed sampling, all subgroups (strata) are represented in the
sample. In cluster sampling, some clusters within the population are included
in the sample, and others are not.
One obvious advantage of cluster sampling is that you need only a list of
clusters instead of a list of all individual units. City blocks are commonly used as
clusters in surveys that require door-to-door interviews. To measure customer
satisfaction, airlines sometimes randomly sample a set of ﬂights, then distribute a survey to everyone on those ﬂights. Each ﬂight is a cluster. It is clearly
much easier for the airline to choose a random sample of ﬂights than it would
be to identify and locate a random sample of individual passengers to whom to
distribute surveys.
If cluster sampling is used, the analysis must proceed differently because
there may be similarities among the members of the clusters that must be taken