
Appendix 5b. Expected Value and Variance of a Continuous Random Variable

RANDOM VARIABLES AND PROBABILITY DISTRIBUTIONS


First, to show the area under f(x) equals one, we note

$$\int_0^{10} \frac{1}{10}\,dx = \frac{1}{10}(10 - 0) = 1$$

To calculate the expected value, we write

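$$
\begin{aligned}
E(X) &= \int_0^{10} x\left(\frac{1}{10}\right)dx = \frac{1}{10}\int_0^{10} x\,dx \\
&= \frac{1}{10}\,\frac{x^2}{2}\,\Bigg|_0^{10} \\
&= \frac{1}{10}\left(\frac{10^2}{2} - \frac{0^2}{2}\right) = 5
\end{aligned}
$$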

The variance is

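$$
\begin{aligned}
V(X) &= \int_0^{10} x^2\left(\frac{1}{10}\right)dx - [E(X)]^2 \\
&= \frac{1}{10}\,\frac{x^3}{3}\,\Bigg|_0^{10} - 5^2 \\
&= \frac{1}{10}\left(\frac{1000}{3}\right) - 25 = 8.33
\end{aligned}
$$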
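The area, expected value, and variance computed in this appendix can be checked numerically. The following is a minimal sketch, assuming the uniform density f(x) = 1/10 on the interval from 0 to 10; the helper name `integrate` and the choice of the midpoint rule are illustrative assumptions, not part of the text.

```python
# Midpoint-rule check of the uniform(0, 10) results derived in this appendix.
# The function names and the number of steps are illustrative assumptions.

def integrate(g, a, b, steps=100_000):
    """Approximate the integral of g over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return sum(g(a + (i + 0.5) * width) for i in range(steps)) * width


def f(x):
    """Uniform density on [0, 10]."""
    return 1 / 10


area = integrate(f, 0, 10)                                   # should be close to 1
mean = integrate(lambda x: x * f(x), 0, 10)                  # E(X)
variance = integrate(lambda x: x**2 * f(x), 0, 10) - mean**2  # V(X)

print(round(area, 4), round(mean, 4), round(variance, 2))
```

The same helper can be reused for any other density by replacing `f` and the limits of integration.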

FURTHER READING

Introductory statistics textbooks for geographers seldom discuss the concept of random variables in detail, nor do they give an extended treatment of probability distribution models. Introductory statistics textbooks for business and economics students and some texts for applied

statistics courses usually give a more complete presentation of these topics. Three representative examples are Neter et al. (1982); Pfaffenberger and Patterson (1981); and Winkler and

Hays (1975). The development of the Poisson probability distribution using the analogy to a

dot map only hints at the widespread use of probability models in this situation. Some additional material on quadrat analysis and a related technique known as nearest neighbor analysis

is provided in Chapter 14. Extended treatment of these topics can be found in Getis and Boots

(1978), Taylor (1977), and Unwin (1981).

A. Getis and B. Boots, Models of Spatial Processes (Cambridge, UK: Cambridge University Press,

1978).

J. Neter, W. Wasserman, and G. Whitmore, Applied Statistics (Boston: Allyn and Bacon, 1982).

R. Pfaffenberger and J. Patterson, Statistical Methods for Business and Economics (Homewood, IL:

Richard Irwin, 1981).


INFERENTIAL STATISTICS

P. Taylor, Quantitative Methods in Geography (Boston: Houghton Mifflin, 1977).

D. Unwin, Introductory Spatial Analysis (London: Methuen, 1981).

R. Winkler and W. L. Hays, Statistics: Probability, Inference, and Decision Making, 2nd ed. (New

York: Holt, 1975).

PROBLEMS

1. Explain the meaning of the following terms:

• Statistical experiment or random trial

• Elementary outcome of an experiment

• Conditional probability

• Probability distribution

• Expected value of a random variable

• Standard normal distribution

• Bivariate probability function

• Binomial coefficients

• Sample space

• Event

• Random variable

• Standard score

• Variance of a random variable

• Covariance

• Bernoulli trial

• Pascal’s triangle

2. Differentiate between an objective and a subjective interpretation of the concept of probability. Give an example of each. Why are objective interpretations preferred in statistical

analysis?

3. Residents of a city are asked to rank the desirability of five different neighborhoods A, B,

C, D, and E. How many different ways can they be ranked, assuming there can be no ties?

4. A traveling salesperson beginning a trip at city A must visit (in order) cities X, Y, and Z before returning to A. Several roads connect each pair of cities. There are four different ways

of traveling between cities A and X, three routes between X and Y, five between Y and Z,

and two between Z and A. By how many different routes can the salesperson complete the

trip?

5. Construct an outcome tree outlining the elementary outcomes for three tosses of a coin.

What is the probability that three consecutive tails occur, assuming the coin is fair?

6. The license plates in a certain jurisdiction have six alphanumeric digits. The first digit is

A, B, or C. The second digit is N, S, E, or W. The last four digits are restricted to the integers 0, 1, 2, . . . , 9. How many different license plates are possible?

7. The probability that it rains on a given day in July is 0.1.

a. Assuming independent trials, what is the probability that it does not rain for three consecutive days?

b. Again assuming independent trials, what is the probability that one day of rain is followed by two days without rain?

c. Is the assumption of independent trials in this experiment reasonable?

8. A retail geographer surveys 200 shoppers after each has visited one of three shopping centers A, B, or C. She records whether each made a purchase, with a yes (Y) or no (N). The

survey results are as follows:

Center      Y      N    Total
A          25     25       50
B          10     50       60
C          65     25       90
Total     100    100      200

a. Find P(C).

b. Find P(A ∪ B).

c. Find P(B ∪ Y).

d. Find P(A ∩ B | N).

Explain in your own words the meanings of parts (a) to (d).

9. Differentiate between the following:

a. A discrete and a continuous random variable

b. A probability mass function and a probability density function

c. A normal probability distribution and the standard normal distribution

d. A marginal probability distribution and a conditional probability distribution

e. A probability mass function and a cumulative mass function

10. Let X be a random variable with the following probability distribution:

x        0      1      2      3
P(x)  0.40   0.30   0.15   0.15

a. Verify that this is a valid probability distribution model.

b. Determine E(X).

c. Determine V(X).

d. What is the mode of X?

11. Graph the probability mass functions for the following discrete probability distribution

models:

a. Uniform: k = 3, 5, and 10

b. Binomial: n = 5, π = 0.20; n = 5, π = 0.50; n = 5, π = 0.70.

c. Poisson: λ = 0.20, 0.40 and 0.60

12. The maximum temperature reached on any day can be classified as above freezing (a success) or below freezing (a failure). In a certain city of eastern North America, January

weather statistics indicate the probability a January day will be above freezing is 0.3. Use

the binomial distribution to determine the following probabilities:

a. Exactly 2 of the next 7 January days will be above freezing.

b. More than 5 of the next 7 days will be above freezing.

c. There will be at least 1 day above freezing in the next 7 days.

d. All 7 days in the next week will be above freezing.

e. Is this a reasonable application of the binomial distribution? Why or why not?

6

Sampling

In Chapter 1, statistical methodology was conveniently divided into descriptive statistics and inferential statistics. In inferential statistics, a descriptive characteristic of

a sample is linked with probability theory, so that a researcher can generalize the results of a study of a few individuals to some larger group. This idea was made more

explicit in Chapter 5, where the notions of a random variable and its probability distribution were defined. At the core of inferential statistics is the distinction between

a population and a sample. From a statistical universe or population, a small subset

of individuals is often selected for detailed study. This sample is used to estimate the

value of some population characteristic or to answer a question about a particular

characteristic of the population. However, to make such inferences, the sample must

be collected in a specific way. Statistically reliable inferences cannot be made from just any sample. Whereas street corner interviews, for example, tend to make

interesting news items, they may not reflect the views of the population they are supposed to represent.

Ideally, it would be best to have a sample that is a good representation of the

population from which it has been drawn. High-quality inferences are made by using

high-quality samples. Unfortunately, unless we know everything about the population,

say from a census, we have no way of knowing if we do have a representative sample. The very act of sampling thus introduces some uncertainty into our inferences,

simply because the sample may not be representative of the population. This is known

as sampling error. Suppose, for example, we wished to sample the students at a university and determine the number of hours the average student spends studying in any

given week. By mere chance, we may just select a sample that includes more industrious students than average students and thus overestimates the amount of time spent

studying. We might be led to believe that the average student spent more time studying than is, in fact, the case. Our only safeguard against such sampling error is to select a larger sample. The larger the sample, the more likely it includes a true cross

section of the population—that is, the more likely it is to be representative of the population. Notice that sampling error is not a “mistake,” such as choosing the “wrong”


sample or some other methodological failing. All samples deviate from the population in some way; thus sampling error is always present. The associated uncertainty is

the price one pays for using a subset of the population rather than the entire population.

The appeal of statistics is not that it removes uncertainty, but rather that it permits

inference in the presence of uncertainty.

DEFINITION: SAMPLING ERROR

Sampling error is uncertainty that arises from working with a sample rather than

the entire population.
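The claim that larger samples reduce sampling error can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the population of weekly study hours is invented (5,000 students, roughly normal with mean 15 and standard deviation 6), and the function name `spread_of_sample_means` is hypothetical. It draws many random samples of two different sizes and compares how widely the sample means scatter.

```python
import random
import statistics


def spread_of_sample_means(population, n, trials, rng):
    """Standard deviation of the sample mean over repeated random samples of size n."""
    means = [statistics.mean(rng.sample(population, n)) for _ in range(trials)]
    return statistics.stdev(means)


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: weekly study hours for 5,000 students (assumed values).
population = [max(0.0, rng.gauss(15, 6)) for _ in range(5000)]

spread_small = spread_of_sample_means(population, n=10, trials=500, rng=rng)
spread_large = spread_of_sample_means(population, n=250, trials=500, rng=rng)

print(f"n=10: {spread_small:.2f} hours, n=250: {spread_large:.2f} hours")
```

The scatter of the sample means shrinks as n grows but never reaches zero, which is consistent with the statement that sampling error is always present.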

Besides sampling error, there is another reason our sample may not be representative

of a population: sampling bias. This occurs when the way in which the sample was

collected is itself biased. In the example of university student study habits, a sample

would surely be biased if it were selected on the basis of interviews of students leaving

the university library late in the evening!

DEFINITION: SAMPLING BIAS

Sampling bias occurs when the procedures used to select the sample tend to

favor the inclusion of individuals in the population with certain population

characteristics.
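The library example can be mimicked in a simulation. The following is a hedged sketch, not a model of any real survey: it assumes an invented population of weekly study hours and assumes that the chance of meeting a student at the library is proportional to the hours that student studies. The names `library_sample` and `random_sample` are hypothetical.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical population of weekly study hours for 5,000 students (assumed values).
population = [max(0.0, rng.gauss(15, 6)) for _ in range(5000)]
true_mean = statistics.mean(population)


def library_sample(population, n, rng):
    """Biased design: selection chance is proportional to hours studied.

    random.choices draws with replacement, which is acceptable for this illustration.
    """
    return rng.choices(population, weights=population, k=n)


def random_sample(population, n, rng):
    """Unbiased design: every student is equally likely to be selected."""
    return rng.sample(population, n)


biased_mean = statistics.mean(library_sample(population, 2000, rng))
unbiased_mean = statistics.mean(random_sample(population, 2000, rng))

print(f"population {true_mean:.1f}, random {unbiased_mean:.1f}, library {biased_mean:.1f}")
```

Under these assumptions the library-style sample overestimates the mean even with 2,000 respondents. Unlike sampling error, bias of this kind does not shrink as the sample grows.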

Sampling bias can usually be avoided, or at least minimized, by selecting an appropriate sampling plan. Errors in recording, editing, and processing sample data can also

be limited by various checks. When data are collected through mail questionnaires, a

form of bias due to nonresponse often occurs. The respondents to the questionnaire

may not be representative of the overall population. Many studies have found their

respondents to be more highly educated, wealthier, and more interested in the subject

of the questionnaire than members of the population at large. Since the quality of the

inferences from a sample depends so much on the sample itself, any researcher must

carefully select a sampling plan capable of minimizing, or at least controlling to acceptable limits, both sampling error and sampling bias.

Sampling techniques with this characteristic are the focus of this chapter. First

the advantages of sampling are enumerated in Section 6.1. Why do we favor a sample

over a complete census of a population?

In Section 6.2 an extremely useful five-step procedure for sampling is outlined. The tasks defined in these steps are encountered in every sampling problem. Section 6.3 identifies various types of samples. Only

specific types of samples can be used to generalize to a population—with a known

degree of risk. The most commonly used type, the simple random sample, is explained

in Section 6.4. This sampling strategy is then compared to a few other sample designs.

Section 6.5 introduces the concept of a sampling distribution. The sampling distribution of sample statistics such as the mean (X̄) and proportion (P) is central to both the

estimation and the hypothesis-testing procedures of statistical inference. Finally, issues arising in geographic sampling are presented in Section 6.6.


6.1. Why Do We Sample?

Seldom must we collect information from all members of a population in order to make

reliable statements about the characteristics or attributes of that population. Often a

sample constituting only a small percentage of the total population is sufficient for

such inferences. There are several reasons for choosing a sample rather than a census

of an entire population.

1. Usually, it is not necessary to take a complete census. Valid, reliable generalizations about the characteristics of a population can be made with a sample of modest size—if the sample is properly taken. The uncertainty inherent in generalizing

from the few to the many not only is within acceptable limits, but sometimes is even

less than the uncertainties that arise when we try to precisely control the enormous

amount of data generated from a complete enumeration of an extremely large population. It is simply far easier to check the data of a small sample than those of a large

population.

2. The time, cost, and effort of collecting data from a sample are usually substantially less than are required to collect the same information from a larger population. The workforce or available financial resources usually constrain a researcher from

taking a full census.

3. The population of interest may be infinite, and therefore sampling is the only

alternative. We could, for example, consider the population to be the water temperature at a certain depth at a given reach of a particular stream. There are an infinite

number of times when we could record the water temperature—even in a small time

interval. Since space itself can be treated as a continuous variable, there are an infinite number of places in any area where a set of sample measurements could be taken.

This issue is explored more fully in Section 6.6.

4. The act of sampling may be destructive. To estimate the mean lifetime of

light bulbs, for example, any light bulb in a sample must be tested until it is no longer

of use. A census of the light bulbs produced by a manufacturer would destroy that producer’s entire stock!

5. The population may be only hypothetical. In the case of the light bulb manufacturer, the real population of interest is the set of bulbs that the manufacturer will produce in the future. At the time any sample is taken, this population is not observable.

6. The population may be empirically definable, but not practically available to

a researcher. Not all slopes in a study region may be accessible to a geomorphologist

interested in studying the dynamics of freeze-thaw weathering. Even an experienced

climber may find only a few suitable sites for study.

7. Information from a population census can be quickly outdated. Given the

volatility of political polls, it would be unwise to determine the party supported by

each member of a population. Repeated censuses of this type would be impossible,

and sufficient accuracy can be obtained by using only a small proportion of voters.

Repeated polls of this type are a usual occurrence now.

8. When the topic of the study requires an in-depth study of individuals in the

population, only a small sample may be possible. By restricting attention to only a


few individuals, extremely comprehensive information can be collected. A study of the

residential mobility of people living in a large city might require detailed questions

concerning the history of moves, characteristics of the current and past residences,

motivations for each move, the search process used to locate new residences, and

characteristics of the household itself. Probing for sufficient detail in all these areas

precludes the possibility of a complete census of the population. Such a task would

certainly be beyond the resources of most institutions.

For a variety of reasons, then,

many research questions must be answered through the use of a small sample from a

population. Provided the sample is collected properly, valid conclusions about key

characteristics of the population can be drawn, with only a surprisingly small degree

of uncertainty.

6.2. Steps in the Sampling Process

Having decided that no suitable data exist to answer a research question, and concluding that a sample is the only feasible method of collecting the necessary data, the

researcher must specify a sampling plan. Rushing out to collect the data as quickly as

possible is often the worst thing the researcher could do. It is far better to follow the

simple five-step sampling procedure illustrated in Figure 6-1. Many potential problems

not anticipated by the researcher can be addressed and successfully solved before a

considerable effort has been put into the actual task of data collection. In literally hundreds, if not thousands, of studies, insufficient time and care in devising a sampling

plan have led to the collection of large datasets with only limited possibilities for statistical inference. Let us consider each of these five steps in turn.

1. Definition of the population

2. Construction of a sampling frame

3. Selection of a sampling design

4. Specification of the information to be collected

5. Collection of data

FIGURE 6-1. Steps in the sampling process.


Definition of the Population

The first step is to define the population. What at first glance might appear to be a

trivial task often proves to be an extremely difficult chore. It is easy to conceive of a

statistical population as a collection of individual elements that may be individual

people, objects, or even locations. However, to actually identify which individuals

should be included in the population and which should be excluded is not so simple.

To see some of the potential issues and difficulties, consider the problem of defining

the population for a study of the elderly in a city. A number of practical questions

immediately surface:

1. How will we distinguish the elderly from the nonelderly? By age? If so, what

age? Age 60? Age 65?

2. Or will the elderly be defined by occupational categories? Should we restrict

ourselves to retired persons? Or should we restrict the population to individuals over 65 years of age and retired?

3. Will we include all elderly, or only those living independently, that is, not in a

long-term care home?

4. Is the study concerned with elderly individuals or households of the elderly?

What about mixed households with both elderly and nonelderly members?

As you can see, even if we can conceptualize the population of interest, arriving at a

strict, operationally useful definition may require considerable thought and difficult

choices. In the study of the elderly, it is still necessary to define a time frame and a

geographical limit to the study region.

Construction of a Sampling Frame

Once we have chosen the specific definition to be used in identifying the individuals

of a population, it is necessary to construct a sampling frame.

DEFINITION: SAMPLING FRAME

A sampling frame (also called a population frame) is an ordered list of the individuals in a population.

The sampling frame has two key properties. First, it must include all individuals in

the population; that is, it must be exhaustive. Second, each individual element of the

population must appear once and only once on the list. Obtaining a sampling frame

for a particular population may itself be a time-consuming task. It is usually easy to

compile a list of all current students at a university in a given academic year from

existing academic records or transcripts. But what if the population of interest is not

regularly monitored in any way? Where, for example, could a list of all the elderly

residents of a city be obtained? It may be possible to extract a fairly complete list of

the elderly by examining the list of recipients of Social Security or old-age assistance

from a government agency. But would the list include all the elderly? What about

noncitizens or residents otherwise ineligible for this type of aid?
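The two properties of a sampling frame, that it is exhaustive and that each individual appears exactly once, can be expressed as simple checks. The sketch below assumes that a reference list of the population is available for comparison, which in practice is often exactly what is missing. The names and the function `check_frame` are hypothetical.

```python
from collections import Counter


def check_frame(frame, population):
    """Report members missing from the frame, listed more than once, or not in the population."""
    counts = Counter(frame)
    missing = sorted(set(population) - set(frame))        # violates exhaustiveness
    duplicated = sorted(name for name, c in counts.items() if c > 1)  # violates once-only
    extraneous = sorted(set(frame) - set(population))     # listed but not in the population
    return missing, duplicated, extraneous


# Hypothetical population and a hypothetical list compiled from agency records.
population = {"Ana", "Ben", "Chen", "Dara", "Eli"}
frame = ["Ana", "Ben", "Ben", "Chen", "Fay"]

missing, duplicated, extraneous = check_frame(frame, population)
print("missing:", missing)
print("duplicated:", duplicated)
print("extraneous:", extraneous)
```

A frame passes when all three lists are empty. Any nonempty list points to a specific way in which the listed individuals differ from the intended population.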


As a second example, consider the use of telephone surveys for evaluation of

voter preferences for political parties. Although the population of interest is all eligible voters, the population actually sampled is composed of those residents with telephones—or, more accurately, the set of individuals who answer these phones. Clearly,

these two groups overlap a great deal, but they are not exactly the same. Restricting

ourselves to those with listed telephone numbers may exclude some relatively

wealthy residents with unlisted numbers as well as some poorer households without

telephones. It is now becoming increasingly common for some households to have

only cell phones, and so they would never appear in a conventional telephone book

or listing. It is therefore useful to distinguish the target population from the sampled

population.

DEFINITION: TARGET AND SAMPLED POPULATIONS

The target population is the set of all individuals relevant to a particular study.

The sampled population consists of all the individuals listed in the sampling

frame.

Obviously, it is desirable to have the sampled and target populations as nearly identical as possible. When they do differ, it is extremely important to know the particular

way(s) that they vary, since this is a form of sampling bias. It is sometimes necessary

to qualify the inferences made by using a sampled population when it differs in significant ways from the target population. This is equivalent to recognizing the limitations imposed on the study by the sampling frame.

Selection of a Sampling Design

Next we must decide how we are going to select individuals from the sampling frame

to include in the sample.

DEFINITION: SAMPLE DESIGN

A sample design is a procedure used to select individuals from the sampling

frame for the sample.

There are several ways to select a sample. We could simply take the first n individuals

listed in the sampling frame. Or we could select the last n individuals or every kth individual on the list until we get n members for the sample. There are many types of

samples and sample designs. Because of the importance of this step, it is described in

depth in Sections 6.3 and 6.4. At this point it is sufficient to note that a random sample is an extremely useful design in statistical analysis. Individuals to be included in

the sample are chosen by using some procedure incorporating chance. The mechanical devices used in many lotteries are one example. An urn is filled with identical balls,

one for each member of the sampled population. The sample is chosen by selecting

balls from a well-mixed urn, one at a time.

The important characteristic of this type of sample is that we know the probability of each individual in the population being included in the sample. In this case,


each individual has an equal chance of being included. A number of variations of this

design are explained in Section 6.4.
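The selection procedures mentioned in this subsection can be written as short functions operating on a sampling frame. This is a minimal sketch with a hypothetical frame of 20 numbered individuals. The function names are illustrative, and the random draw is assumed to be without replacement, as when balls are removed from an urn one at a time.

```python
import random


def first_n(frame, n):
    """Take the first n individuals listed in the frame."""
    return frame[:n]


def last_n(frame, n):
    """Take the last n individuals listed in the frame."""
    return frame[-n:]


def every_kth(frame, n, k):
    """Take every kth individual from the top of the list until n are selected."""
    return frame[k - 1::k][:n]


def simple_random(frame, n, rng):
    """Draw n individuals by chance, without replacement (assumed urn-style draw)."""
    return rng.sample(frame, n)


frame = list(range(1, 21))  # hypothetical frame of 20 numbered individuals
rng = random.Random(1)      # fixed seed so the run is repeatable

print(first_n(frame, 4))
print(last_n(frame, 4))
print(every_kth(frame, 4, 5))
print(simple_random(frame, 4, rng))
```

Only the last function incorporates chance. The first three always return the same individuals for a given frame, so their results depend entirely on how the list happens to be ordered.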

Specification of the Information to Be Collected

This step can usually be accomplished at any point prior to beginning data collection.

The particular format used to collect data must be rigorously defined and pretested by

using a pilot sample.

DEFINITION: PILOT SAMPLE, OR PRETEST

A pilot sample, or pretest, is an extended test of data collection procedures to

be used in a study in advance of the main data collection effort.

In a field study, the pretest can be used to check instruments, data loggers, and all

other logistics. For surveys—mail, telephone, or personal interview—the pretest can

sometimes reveal deficiencies for any of the following reasons: difficulty in locating

population members; dealing with an abnormally high percentage of refusals or incomplete questionnaires; problems in questionnaire wording, question sequence, or

format; unanticipated responses; or inadequately trained interviewers.

Collection of Data

Once all the problems indicated in the pretest have been successfully solved, the task

of data collection can begin. At this stage, careful tabulation and editing are particularly important if we wish to minimize nonsampling error.

6.3. Types of Samples

In this section, we expand on the ideas discussed in the third step of the sampling

process, the selection of a sampling design. Sampling designs can be conveniently

divided into two classes: probability samples and nonprobability samples. Simple random sampling is one type of probability sample.

DEFINITION: PROBABILITY SAMPLE

A probability sample is one in which the probability of any individual member

of the population being picked for the sample can be determined.

Because we know only the probability that an individual is included in the sample,

an element of chance, or uncertainty, is introduced into any inferences made from the

sample. Expressed simply, it could happen that a particular sample is quite unrepresentative of the population it is supposed to reflect. The advantage of a probability

sample is not that there is no uncertainty in the results, but rather that we can assign

a quantitative uncertainty value to the results. The five principal types of probability

samples are shown in Figure 6-2. Because of their importance in statistical inference,

these samples are discussed in depth in the following section.

FIGURE 6-2. Types of samples. Probability samples: simple random, independent random, systematic, stratified, and cluster. Nonprobability samples: judgmental or purposeful, convenience or accessibility, quota, and volunteer.
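The defining property of a probability sample, that each individual's chance of inclusion can be determined, can be checked empirically for the simplest case. The sketch below assumes a simple random sample of n individuals drawn without replacement from a frame of N, for which the inclusion probability is n/N. The frame size, sample size, and number of repetitions are hypothetical values chosen for illustration.

```python
import random
from collections import Counter

N, n, trials = 20, 5, 20_000  # hypothetical frame size, sample size, and repetitions
rng = random.Random(3)        # fixed seed so the run is repeatable
frame = list(range(N))

# Count how often each individual is selected over many repeated samples.
counts = Counter()
for _ in range(trials):
    counts.update(rng.sample(frame, n))

# Observed inclusion rate for each individual, to compare with n / N.
rates = [counts[i] / trials for i in frame]
print(f"theory {n / N:.3f}, observed min {min(rates):.3f}, max {max(rates):.3f}")
```

For a nonprobability sample no comparable calculation is possible, because the selection procedure does not specify the chance with which any individual is chosen.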

Nonprobability samples may also be excellent or poor representations of the

population. The difficulty is that it can never be determined whether a given sample is good or

bad. Four types of nonprobability samples are sometimes used to collect sample data: judgmental, convenience, quota, and volunteer.

DEFINITION: JUDGMENTAL, OR PURPOSEFUL, SAMPLE

A judgmental sample, also known as a purposeful sample, is one in which personal judgment is used to decide which individuals of a population are to be

included in the sample. These are individuals whom the investigator feels can

best serve the purpose of the sample.

Obviously, a very skillful investigator with considerable knowledge of a population

can sometimes generate extremely useful data from such a sample. If the deliberate

choices turn out to be poor, however, poor inferences are made. The risk is not known.

Purposeful samples are sometimes selected in pretest, or pilot, samples. A range of

respondents, including both “typical” and “unusual” individuals, is chosen. This sample is used to get an idea of what the full range of questions in the survey should be.

In addition, it is not unusual to use these interviews to preview the types of answers

respondents are likely to give in the actual survey. Many times, this information can be

used to significantly improve the survey instrument. It would be completely erroneous

to utilize such a sample to draw conclusions about the whole population.

Because they include only easily identifiable members of a population, convenience or accessibility samples are subject to sampling bias. These individuals are almost always special

or different from other population members in some way. Street corner interviews

rarely reflect overall opinion; only certain individuals will allow themselves to be interviewed by the media.

## 2009 james e burt gerald m barber david l rigby elementary statistics for geographers the guilford press (2009)
