Appendix 5b. Expected Value and Variance of a Continuous Random Variable
RANDOM VARIABLES AND PROBABILITY DISTRIBUTIONS
First, to show the area under f(x) equals one, we note
$$\int_0^{10} \frac{1}{10}\,dx = \frac{1}{10}(10 - 0) = 1$$
To calculate the expected value, we write
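$$
\begin{aligned}
E(X) &= \int_0^{10} x\left(\frac{1}{10}\right)dx = \frac{1}{10}\int_0^{10} x\,dx \\
     &= \frac{1}{10}\left.\frac{x^2}{2}\right|_0^{10} \\
     &= \frac{1}{10}\left(\frac{10^2}{2} - \frac{0^2}{2}\right) = 5
\end{aligned}
$$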
The variance is
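$$
\begin{aligned}
V(X) &= \int_0^{10} x^2\left(\frac{1}{10}\right)dx - [E(X)]^2 \\
     &= \frac{1}{10}\left.\frac{x^3}{3}\right|_0^{10} - 5^2 \\
     &= \frac{1}{10}\left(\frac{1000}{3}\right) - 25 \approx 8.33
\end{aligned}
$$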
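The three integrals above can also be checked numerically. The following is a minimal sketch, not part of the original appendix: the midpoint rule, the helper name `integrate`, and the number of slices are illustrative assumptions.

```python
# Numerical check of the area, E(X), and V(X) for f(x) = 1/10 on [0, 10].
# The midpoint rule and the slice count are illustrative choices.

def integrate(g, a, b, slices=100_000):
    """Approximate the integral of g over [a, b] with the midpoint rule."""
    width = (b - a) / slices
    return sum(g(a + (i + 0.5) * width) for i in range(slices)) * width

def density(x):
    """Uniform probability density on [0, 10]."""
    return 1 / 10

area = integrate(density, 0, 10)
mean = integrate(lambda x: x * density(x), 0, 10)
variance = integrate(lambda x: x**2 * density(x), 0, 10) - mean**2

print(round(area, 4), round(mean, 4), round(variance, 4))
```

The printed values should agree, up to rounding, with the area of 1, the expected value of 5, and the variance of about 8.33 derived by hand.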
FURTHER READING
Introductory statistics textbooks for geographers seldom discuss the concept of random variables in detail, nor do they give an extended treatment of probability distribution models. Introductory statistics textbooks for business and economics students and some texts for applied
statistics courses usually give a more complete presentation of these topics. Three representative examples are Neter et al. (1982); Pfaffenberger and Patterson (1981); and Winkler and
Hays (1975). The development of the Poisson probability distribution using the analogy to a
dot map only hints at the widespread use of probability models in this situation. Some additional material on quadrat analysis and a related technique known as nearest neighbor analysis
is provided in Chapter 14. Extended treatment of these topics can be found in Getis and Boots
(1978), Taylor (1977), and Unwin (1981).
A. Getis and B. Boots, Models of Spatial Processes (Cambridge, UK: Cambridge University Press,
1978).
J. Neter, W. Wasserman, and G. Whitmore, Applied Statistics (Boston: Allyn and Bacon, 1982).
R. Pfaffenberger and J. Patterson, Statistical Methods for Business and Economics (Homewood, IL:
Richard Irwin, 1981).
INFERENTIAL STATISTICS
P. Taylor, Quantitative Methods in Geography (Boston: Houghton Mifflin, 1977).
D. Unwin, Introductory Spatial Analysis (London: Methuen, 1981).
R. Winkler and W. L. Hays, Statistics: Probability Inference and Decision Making, 2nd ed. (New
York: Holt, 1975).
PROBLEMS
1. Explain the meaning of the following terms:
• Statistical experiment or random trial
• Elementary outcome of an experiment
• Conditional probability
• Probability distribution
• Expected value of a random variable
• Standard normal distribution
• Bivariate probability function
• Binomial coefficients
• Sample space
• Event
• Random variable
• Standard score
• Variance of a random variable
• Covariance
• Bernoulli trial
• Pascal’s triangle
2. Differentiate between an objective and a subjective interpretation of the concept of probability. Give an example of each. Why are objective interpretations preferred in statistical
analysis?
3. Residents of a city are asked to rank the desirability of five different neighborhoods A, B,
C, D, and E. How many different ways can they be ranked, assuming there can be no ties?
4. A traveling salesperson beginning a trip at city A must visit (in order) cities X, Y, and Z before returning to A. Several roads connect each pair of cities. There are four different ways
of traveling between cities A and X, three routes between X and Y, five between Y and Z,
and two between Z and A. By how many different routes can the salesperson complete the
trip?
5. Construct an outcome tree outlining the elementary outcomes for three tosses of a coin.
What is the probability that three consecutive tails occur, assuming the coin is fair?
6. The license plates in a certain jurisdiction have six alphanumeric digits. The first digit is
A, B, or C. The second digit is N, S, E, or W. The last four digits are restricted to the integers 0, 1, 2, . . . , 9. How many different license plates are possible?
7. The probability that it rains on a given day in July is 0.1.
a. Assuming independent trials, what is the probability that it does not rain for three consecutive days?
b. Again assuming independent trials, what is the probability that one day of rain is followed by two days without rain?
c. Is the assumption of independent trials in this experiment reasonable?
8. A retail geographer surveys 200 shoppers after each has visited one of three shopping centers A, B, or C. She records whether each made a purchase, with a yes (Y) or no (N). The
survey results are as follows:
Center    Y      N      Total
A         25     25     50
B         10     50     60
C         65     25     90
Total     100    100    200
a. Find P(C).
b. Find P(A ∪ B).
c. Find P(B ∪ Y).
d. Find P(A ∩ B | N).
Explain in your own words the meanings of parts (a) to (d).
9. Differentiate between the following:
a. A discrete and a continuous random variable
b. A probability mass function and a probability density function
c. A normal probability distribution and the standard normal distribution
d. A marginal probability distribution and a conditional probability distribution
e. A probability mass function and a cumulative mass function
10. Let X be a random variable with the following probability distribution:
x      P(x)
0      0.40
1      0.30
2      0.15
3      0.15
a. Verify that this is a valid probability distribution model.
b. Determine E(X).
c. Determine V(X).
d. What is the mode of X?
11. Graph the probability mass functions for the following discrete probability distribution
models:
a. Uniform: k = 3, 5, and 10
b. Binomial: n = 5, π = 0.20; n = 5, π = 0.50; n = 5, π = 0.70.
c. Poisson: λ = 0.20, 0.40 and 0.60
12. The maximum temperature reached on any day can be classified as above freezing (a success) or below freezing (a failure). In a certain city of eastern North America, January
weather statistics indicate the probability a January day will be above freezing is 0.3. Use
the binomial distribution to determine the following probabilities:
a. Exactly 2 of the next 7 January days will be above freezing.
b. More than 5 of the next 7 days will be above freezing.
c. There will be at least 1 day above freezing in the next 7 days.
d. All 7 days in the next week will be above freezing.
e. Is this a reasonable application of the binomial distribution? Why or why not?
6. Sampling
In Chapter 1, statistical methodology was conveniently divided into descriptive statistics and inferential statistics. In inferential statistics, a descriptive characteristic of
a sample is linked with probability theory, so that a researcher can generalize the results of a study of a few individuals to some larger group. This idea was made more
explicit in Chapter 5, where the notions of a random variable and its probability distribution were defined. At the core of inferential statistics is the distinction between
a population and a sample. From a statistical universe or population, a small subset
of individuals is often selected for detailed study. This sample is used to estimate the
value of some population characteristic or to answer a question about a particular
characteristic of the population. However, to make such inferences, the sample must
be collected in a specific way. Not every sample supports statistically reliable inferences. Street corner interviews, for example, tend to make
interesting news items, but they may not reflect the views of the population they are supposed to represent.
Ideally, it would be best to have a sample that is a good representation of the
population from which it has been drawn. High-quality inferences are made by using
high-quality samples. Unfortunately, unless we know everything about the population,
say from a census, we have no way of knowing if we do have a representative sample. The very act of sampling thus introduces some uncertainty into our inferences,
simply because the sample may not be representative of the population. This is known
as sampling error. Suppose, for example, we wished to sample the students at a university and determine the number of hours the average student spends studying in any
given week. By mere chance, we might select a sample containing a disproportionate number of industrious students and thus overestimate the amount of time spent
studying. We might be led to believe that the average student spent more time studying than is, in fact, the case. Our only safeguard against such sampling error is to select a larger sample. The larger the sample, the more likely it includes a true cross
section of the population—that is, the more likely it is to be representative of the population. Notice that sampling error is not a “mistake,” such as choosing the “wrong”
sample or some other methodological failing. All samples deviate from the population in some way; thus sampling error is always present. The associated uncertainty is
the price one pays for using a subset of the population rather than the entire population.
The appeal of statistics is not that it removes uncertainty, but rather that it permits
inference in the presence of uncertainty.
DEFINITION: SAMPLING ERROR
Sampling error is uncertainty that arises from working with a sample rather than
the entire population.
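The claim that a larger sample is the safeguard against sampling error can be illustrated with a small simulation. This is a sketch under stated assumptions: the population of 10,000 students, the normal distribution of study hours (mean 15, standard deviation 5), and the sample sizes are all hypothetical and chosen only for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: weekly study hours for 10,000 students.
population = [max(0.0, rng.gauss(15, 5)) for _ in range(10_000)]
true_mean = statistics.fmean(population)

def typical_error(n, replications=200):
    """Average absolute gap between a sample mean and the population mean."""
    gaps = []
    for _ in range(replications):
        sample = rng.sample(population, n)  # simple random sample, no repeats
        gaps.append(abs(statistics.fmean(sample) - true_mean))
    return statistics.fmean(gaps)

errors = {n: typical_error(n) for n in (10, 100, 1000)}
for n, err in errors.items():
    print(f"n = {n:4d}: typical error of about {err:.2f} hours")
```

In this sketch the typical error shrinks as the sample size grows but never reaches zero, which matches the point that sampling error is always present.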
Besides sampling error, there is another reason our sample may not be representative
of a population: sampling bias. This occurs when the way in which the sample was
collected is itself biased. In the example of university student study habits, a sample
would surely be biased if it were selected on the basis of interviews of students leaving
the university library late in the evening!
DEFINITION: SAMPLING BIAS
Sampling bias occurs when the procedures used to select the sample tend to
favor the inclusion of individuals with certain characteristics.
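The library example can be sketched in the same way. The selection mechanism below is an assumption made only for illustration: the chance of meeting a student at the library door is taken to be proportional to the hours that student studies. The population values are again hypothetical.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical population of weekly study hours (at least 1 hour each).
population = [max(1.0, rng.gauss(15, 5)) for _ in range(10_000)]
true_mean = statistics.fmean(population)

# Unbiased plan: every student is equally likely to be chosen.
random_sample = rng.sample(population, 2_000)

# Biased plan (assumed mechanism): selection chance proportional to hours
# studied. Note that choices() draws with replacement.
library_sample = rng.choices(population, weights=population, k=2_000)

random_mean = statistics.fmean(random_sample)
library_mean = statistics.fmean(library_sample)
print(f"population {true_mean:.1f}, random {random_mean:.1f}, "
      f"library {library_mean:.1f}")
```

Unlike sampling error, the gap produced by the biased plan does not shrink when the sample is made larger, because it comes from the selection procedure itself.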
Sampling bias can usually be avoided, or at least minimized, by selecting an appropriate sampling plan. Errors in recording, editing, and processing sample data can also
be limited by various checks. When data are collected through mail questionnaires, a
form of bias due to nonresponse often occurs. The respondents to the questionnaire
may not be representative of the overall population. Many studies have found their
respondents to be more highly educated, wealthier, and more interested in the subject
of the questionnaire than members of the population at large. Since the quality of the
inferences from a sample depends so much on the sample itself, any researcher must
carefully select a sampling plan capable of minimizing, or at least controlling to acceptable limits, both sampling error and sampling bias.
Sampling techniques with this characteristic are the focus of this chapter. Section 6.1
enumerates the advantages of sampling: why do we favor a sample over a complete census of a population?
Section 6.2 outlines an extremely useful five-step procedure for sampling. The tasks defined in these steps are encountered in every sampling problem. Section 6.3 identifies various types of samples. Only
specific types of samples can be used to generalize to a population with a known
degree of risk. The most commonly used type, the simple random sample, is explained
in Section 6.4. This sampling strategy is then compared to a few other sample designs.
Section 6.5 introduces the concept of a sampling distribution. The sampling distribution of sample statistics such as the mean (X̄) and the proportion (P) is central to both the
estimation and the hypothesis-testing procedures of statistical inference. Finally, issues arising in geographic sampling are presented in Section 6.6.
6.1. Why Do We Sample?
Seldom must we collect information from all members of a population in order to make
reliable statements about the characteristics or attributes of that population. Often a
sample constituting only a small percentage of the total population is sufficient for
such inferences. There are several reasons for choosing a sample rather than a census
of an entire population.
1. Usually, it is not necessary to take a complete census. Valid, reliable generalizations about the characteristics of a population can be made with a sample of modest size—if the sample is properly taken. The uncertainty inherent in generalizing
from the few to the many not only is within acceptable limits, but sometimes is even
less than the uncertainties that arise when we try to precisely control the enormous
amount of data generated from a complete enumeration of an extremely large population. It is simply far easier to check the data of a small sample than those of a large
population.
2. The time, cost, and effort of collecting data from a sample are usually substantially less than are required to collect the same information from a larger population. The workforce or available financial resources usually constrain a researcher from
taking a full census.
3. The population of interest may be infinite, and therefore sampling is the only
alternative. We could, for example, consider the population to be the water temperature at a certain depth at a given reach of a particular stream. There are an infinite
number of times when we could record the water temperature—even in a small time
interval. Since space itself can be treated as a continuous variable, there are an infinite number of places in any area where a set of sample measurements could be taken.
This issue is explored more fully in Section 6.6.
4. The act of sampling may be destructive. To estimate the mean lifetime of
light bulbs, for example, any light bulb in a sample must be tested until it is no longer
of use. A census of the light bulbs produced by a manufacturer would destroy that producer’s entire stock!
5. The population may be only hypothetical. In the case of the light bulb manufacturer, the real population of interest is the set of bulbs that the manufacturer will produce in the future. At the time any sample is taken, this population is not observable.
6. The population may be empirically definable, but not practically available to
a researcher. Not all slopes in a study region may be accessible to a geomorphologist
interested in studying the dynamics of freeze-thaw weathering. Even an experienced
climber may find only a few suitable sites for study.
7. Information from a population census can be quickly outdated. Given the
volatility of political polls, it would be unwise to determine the party supported by
each member of a population. Repeated censuses of this type would be impossible,
and sufficient accuracy can be obtained by using only a small proportion of voters.
Repeated polls of this type are a usual occurrence now.
8. When the topic of the study requires an in-depth study of individuals in the
population, only a small sample may be possible. By restricting attention to only a
few individuals, extremely comprehensive information can be collected. A study of the
residential mobility of people living in a large city might require detailed questions
concerning the history of moves, characteristics of the current and past residences,
motivations for each move, the search process used to locate new residences, and
characteristics of the household itself. Probing for sufficient detail in all these areas
precludes the possibility of a complete census of the population. Such a task would
certainly be beyond the resources of most institutions.
For a variety of reasons, then,
many research questions must be answered through the use of a small sample from a
population. Provided the sample is collected properly, valid conclusions about key
characteristics of the population can be drawn, with only a surprisingly small degree
of uncertainty.
6.2. Steps in the Sampling Process
Having decided that no suitable data exist to answer a research question, and concluding that a sample is the only feasible method of collecting the necessary data, the
researcher must specify a sampling plan. Rushing out to collect the data as quickly as
possible is often the worst thing the researcher could do. It is far better to follow the
simple five-step sampling procedure illustrated in Figure 6-1. Many potential problems
not anticipated by the researcher can be addressed and successfully solved before a
considerable effort has been put into the actual task of data collection. In literally hundreds, if not thousands, of studies, insufficient time and care in devising a sampling
plan have led to the collection of large datasets with only limited possibilities for statistical inference. Let us consider each of these five steps in turn.
1. Definition of the population
2. Construction of a sampling frame
3. Selection of a sampling design
4. Specification of the information to be collected
5. Collection of data
FIGURE 6-1. Steps in the sampling process.
Definition of the Population
The first step is to define the population. What at first glance might appear to be a
trivial task often proves to be an extremely difficult chore. It is easy to conceive of a
statistical population as a collection of individual elements that may be individual
people, objects, or even locations. However, to actually identify which individuals
should be included in the population and which should be excluded is not so simple.
To see some of the potential issues and difficulties, consider the problem of defining
the population for a study of the elderly in a city. A number of practical questions
immediately surface:
1. How will we distinguish the elderly from the nonelderly? By age? If so, what
age? Age 60? Age 65?
2. Or will the elderly be defined by occupational categories? Should we restrict
ourselves to retired persons? Or should we restrict the population to individuals who are both over 65 years of age and retired?
3. Will we include all elderly, or those living independently, that is, not in a
long-term care home?
4. Is the study concerned with elderly individuals or households of the elderly?
What about mixed households with both elderly and nonelderly members?
As you can see, even if we can conceptualize the population of interest, arriving at a
strict, operationally useful definition may require considerable thought and difficult
choices. In the study of the elderly, it is still necessary to define a time frame and a
geographical limit to the study region.
Construction of a Sampling Frame
Once we have chosen the specific definition to be used in identifying the individuals
of a population, it is necessary to construct a sampling frame.
DEFINITION: SAMPLING FRAME
A sampling frame (also called a population frame) is an ordered list of the individuals in a population.
The sampling frame has two key properties. First, it must include all individuals in
the population; that is, it must be exhaustive. Second, each individual element of the
population must appear once and only once on the list. Obtaining a sampling frame
for a particular population may itself be a time-consuming task. It is usually easy to
compile a list of all current students at a university in a given academic year from
existing academic records or transcripts. But what if the population of interest is not
regularly monitored in any way? Where, for example, could a list of all the elderly
residents of a city be obtained? It may be possible to extract a fairly complete list of
the elderly by examining the list of recipients of Social Security or old-age assistance
from a government agency. But would the list include all the elderly? What about
noncitizens or residents otherwise ineligible for this type of aid?
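The two properties of a sampling frame (exhaustive, and each individual listed once and only once) lend themselves to a simple check when a reference list of the population happens to be available. The sketch below assumes such a list exists, which in practice is often not the case. The function name `check_frame` and the resident identifiers are hypothetical.

```python
from collections import Counter

def check_frame(frame, population):
    """Report entries that break the two properties of a sampling frame."""
    counts = Counter(frame)
    return {
        # Individuals missing from the frame: the frame is not exhaustive.
        "missing": sorted(set(population) - set(frame)),
        # Individuals listed more than once.
        "duplicated": sorted(k for k, c in counts.items() if c > 1),
        # Entries that do not belong to the population at all.
        "not_in_population": sorted(set(frame) - set(population)),
    }

# Hypothetical identifiers for the elderly residents of a small city.
population = ["r01", "r02", "r03", "r04", "r05"]
frame = ["r01", "r02", "r02", "r04", "r09"]

report = check_frame(frame, population)
print(report)
```

A frame that passes all three checks returns three empty lists.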
As a second example, consider the use of telephone surveys for evaluation of
voter preferences for political parties. Although the population of interest is all eligible voters, the population actually sampled is composed of those residents with telephones—or, more accurately, the set of individuals who answer these phones. Clearly,
these two groups overlap a great deal, but they are not exactly the same. Restricting
ourselves to those with listed telephone numbers may exclude some relatively
wealthy residents with unlisted numbers as well as some poorer households without
telephones. It is now becoming increasingly common for some households to have
only cell phones, and so they would never appear in a conventional telephone book
or listing. It is therefore useful to distinguish the target population from the sampled
population.
DEFINITION: TARGET AND SAMPLED POPULATIONS
The target population is the set of all individuals relevant to a particular study.
The sampled population consists of all the individuals listed in the sampling
frame.
Obviously, it is desirable to have the sampled and target populations as nearly identical as possible. When they do differ, it is extremely important to know the particular
way(s) that they vary, since this is a form of sampling bias. It is sometimes necessary
to qualify the inferences made by using a sampled population when it differs in significant ways from the target population. This is equivalent to recognizing the limitations imposed on the study by the sampling frame.
Selection of a Sampling Design
Next we must decide how we are going to select individuals from the sampling frame
to include in the sample.
DEFINITION: SAMPLE DESIGN
A sample design is a procedure used to select individuals from the sampling
frame for the sample.
There are several ways to select a sample. We could simply take the first n individuals
listed in the sampling frame. Or we could select the last n individuals or every kth individual on the list until we get n members for the sample. There are many types of
samples and sample designs. Because of the importance of this step, it is described in
depth in Sections 6.3 and 6.4. At this point it is sufficient to note that a random sample is an extremely useful design in statistical analysis. Individuals to be included in
the sample are chosen by using some procedure incorporating chance. The mechanical devices used in many lotteries are one example. An urn is filled with identical balls,
one for each member of the sampled population. The sample is chosen by selecting
balls from a well-mixed urn, one at a time.
The important characteristic of this type of sample is that we know the probability of each individual in the population being included in the sample. In this case,
each individual has an equal chance of being included. A number of variations of this
design are explained in Section 6.4.
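Three of the selection rules mentioned in this step (the first n individuals, every kth individual, and a random draw in the manner of the urn) can be sketched as follows. The frame of 100 identifiers and the function names are hypothetical, and the sketch assumes n is no larger than the length of the frame.

```python
import random

def first_n(frame, n):
    """Take the first n individuals listed in the frame."""
    return frame[:n]

def every_kth(frame, n):
    """Take every k-th individual, with k chosen so that n are selected."""
    k = len(frame) // n  # assumes 0 < n <= len(frame)
    return frame[k - 1::k][:n]

def simple_random(frame, n, rng):
    """Draw n distinct individuals, each equally likely (the urn model)."""
    return rng.sample(frame, n)

# Hypothetical sampling frame of 100 individuals.
frame = [f"id{i:03d}" for i in range(1, 101)]
rng = random.Random(2024)

print(first_n(frame, 5))
print(every_kth(frame, 5))
print(simple_random(frame, 5, rng))
```

Only the third rule incorporates chance, so it is the only one of the three for which each individual's probability of inclusion is known to be equal.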
Specification of the Information to Be Collected
This step can usually be accomplished at any point prior to beginning data collection.
The particular format used to collect data must be rigorously defined and pretested by
using a pilot sample.
DEFINITION: PILOT SAMPLE, OR PRETEST
A pilot sample, or pretest, is an extended test of data collection procedures to
be used in a study in advance of the main data collection effort.
In a field study, the pretest can be used to check instruments, data loggers, and all
other logistics. For surveys—mail, telephone, or personal interview—the pretest can
sometimes reveal deficiencies for any of the following reasons: difficulty in locating
population members; dealing with an abnormally high percentage of refusals or incomplete questionnaires; problems in questionnaire wording, question sequence, or
format; unanticipated responses; or inadequately trained interviewers.
Collection of Data
Once all the problems indicated in the pretest have been successfully solved, the task
of data collection can begin. At this stage, careful tabulation and editing are particularly important if we wish to minimize nonsampling error.
6.3. Types of Samples
In this section, we expand on the ideas discussed in the third step of the sampling
process, the selection of a sampling design. Sampling designs can be conveniently
divided into two classes: probability samples and nonprobability samples. Simple random sampling is one type of probability sample.
DEFINITION: PROBABILITY SAMPLE
A probability sample is one in which the probability of any individual member
of the population being picked for the sample can be determined.
Because we know only the probability that an individual is included in the sample,
an element of chance, or uncertainty, is introduced into any inferences made from the
sample. Expressed simply, it could happen that a particular sample is quite unrepresentative of the population it is supposed to reflect. The advantage of a probability
sample is not that there is no uncertainty in the results, but rather that we can assign
a quantitative uncertainty value to the results. The five principal types of probability
samples are shown in Figure 6-2. Because of their importance in statistical inference,
these samples are discussed in depth in the following section.
FIGURE 6-2. Types of samples. Probability samples: simple random, independent random, systematic, stratified, cluster. Nonprobability samples: judgmental or purposeful, convenience or accessibility, quota, volunteer.
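For a simple random sample of n individuals drawn without replacement from a frame of N, the probability that any particular individual is included is n/N. The sketch below estimates that probability by repeated sampling. The values N = 20 and n = 5, the number of trials, and the seed are illustrative assumptions.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable
N, n, trials = 20, 5, 20_000
frame = list(range(N))

# Count how often individual 0 appears in a simple random sample of size n.
hits = sum(0 in rng.sample(frame, n) for _ in range(trials))
estimated = hits / trials

print(f"estimated {estimated:.3f}, theoretical {n / N:.3f}")
```

The estimate should lie close to n/N = 0.25. No comparable calculation is possible for a nonprobability sample, because the chance of inclusion is unknown.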
Nonprobability samples may also be excellent or poor representations of the
population. The difficulty is that whether it is a good or bad sample can never be
determined. Four types of nonprobability samples are sometimes used to collect sample data: judgmental, convenience, quota, and volunteer.
DEFINITION: JUDGMENTAL, OR PURPOSEFUL, SAMPLE
This type of sample, also known as a purposeful sample, is one in which personal judgment is used to decide which individuals of a population are to be
included in the sample. These are individuals whom the investigator feels can
best serve the purpose of the sample.
Obviously, a very skillful investigator with considerable knowledge of a population
can sometimes generate extremely useful data from such a sample. If the deliberate
choices turn out to be poor, however, poor inferences are made. The risk is not known.
Purposeful samples are sometimes selected in pretest, or pilot, samples. A range of
respondents, including both “typical” and “unusual” individuals, is chosen. This sample is used to get an idea of what the full range of questions in the survey should be.
In addition, it is not unusual to use these interviews to preview the types of answers
respondents are likely to give in the actual survey. Many times, this information can be
used to significantly improve the survey instrument. It would be completely erroneous
to utilize such a sample to draw conclusions about the whole population.
Because they include only easily identifiable members of a population, convenience or accessibility samples are subject to sampling bias. These individuals are almost always special
or different from other population members in some way. Street corner interviews
rarely reflect overall opinion; only certain individuals will allow themselves to be interviewed by the media.