
Appendix 3b. An Iterative Algorithm for Determining the Weighted or Unweighted Euclidean Median

DESCRIBING DATA WITH STATISTICS

(Xe, Ye) found by the algorithm to be within 0.01, we set TOL = 0.01. Since it is an approximating algorithm, the method is more efficient if a "good" starting point is selected. A good point is one that is close to the answer. The bivariate mean (X¯, Y¯) and weighted mean (X¯w, Y¯w) are good and obvious choices for a starting point, so (X1, Y1) = (X¯, Y¯). The algorithm can be described as the following sequence of steps:

1. Calculate the distance from each point (Xi, Yi) to the current estimate of the median location:

   dit = √[(Xi – X^t)² + (Yi – Y^t)²]

   where dit is the distance from point i to the median during the tth iteration.

2. Determine the values Kit = wi /dit. Note that in the unweighted case all values of wi are set equal to 1.

3. Calculate a new estimate of the median from

   X^(t+1) = [Σ(i=1 to n) Kit Xi] / [Σ(i=1 to n) Kit]

   Y^(t+1) = [Σ(i=1 to n) Kit Yi] / [Σ(i=1 to n) Kit]

4. Check to see whether the location has changed between iterations. If |X^(t+1) – X^t| ≤ TOL and |Y^(t+1) – Y^t| ≤ TOL, stop. Otherwise, set X^t = X^(t+1) and Y^t = Y^(t+1) and go to step 1.

The algorithm usually converges in a small number of iterations. For the nine points

of Figure 3-11, for the unweighted case, the following steps summarize the results of

the algorithm initialized with TOL = 0.10.

Iteration      X^t       Y^t
    1        44.867    45.517
    2        43.865    44.060
    3        43.672    43.204
    4        43.710    42.742

The location of the unweighted median is thus estimated as (43.71, 42.74).
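The four steps above (the iteration of Kuhn and Kuenne, often called the Weiszfeld algorithm) are straightforward to code. The sketch below is not from the text: it is a minimal Python rendering under the assumptions stated in its comments, including a crude guard, added here, for the case where the current estimate coincides exactly with a data point.

```python
import math

def euclidean_median(points, weights=None, tol=0.01, max_iter=100):
    """Iteratively estimate the (weighted) Euclidean median of 2-D points.

    points  -- list of (x, y) tuples
    weights -- list of w_i; defaults to all 1s (the unweighted case)
    tol     -- stop when neither coordinate moves by more than tol
    """
    if weights is None:
        weights = [1.0] * len(points)

    # Start from the (weighted) bivariate mean, a good initial guess.
    w_sum = sum(weights)
    x = sum(w * px for w, (px, _) in zip(weights, points)) / w_sum
    y = sum(w * py for w, (_, py) in zip(weights, points)) / w_sum

    for _ in range(max_iter):
        # Steps 1-2: distances d_it, then K_it = w_i / d_it.
        k = []
        for w, (px, py) in zip(weights, points):
            d = math.hypot(px - x, py - y)
            # Crude guard (an assumption of this sketch): drop a point that
            # coincides with the current estimate instead of dividing by zero.
            k.append(w / d if d > 0 else 0.0)
        # Step 3: the new estimate is the K-weighted average of the points.
        k_sum = sum(k)
        x_new = sum(ki * px for ki, (px, _) in zip(k, points)) / k_sum
        y_new = sum(ki * py for ki, (_, py) in zip(k, points)) / k_sum
        # Step 4: stop once the location changes by no more than tol.
        if abs(x_new - x) <= tol and abs(y_new - y) <= tol:
            return x_new, y_new
        x, y = x_new, y_new
    return x, y
```

For four points at the corners of a square the routine returns the center, and increasing one point's weight pulls the estimate toward that point, mirroring the behavior of the weighted Euclidean median described above.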

REFERENCES

American Association of University Professors, 2002.

H. W. Kuhn and R. E. Kuenne, “An Efficient Algorithm for the Numerical Solution of the Generalized Weber Problem in Spatial Economics,” Journal of Regional Science 4 (1962), 21–33.

S. M. Stigler, "Do Robust Estimators Work with Real Data?," Annals of Statistics 5 (1977), 1055–1078.

DESCRIPTIVE STATISTICS

FURTHER READING

Virtually all elementary statistics textbooks for geographers include a discussion of at least part

of the material presented in this chapter. Students often find it useful to consult other textbooks

to reinforce their understanding of key concepts. Four textbooks intended for geography students are those by Clark and Hosking (1986), Ebdon (1985), Griffith and Amrhein (1991), and

O’Brien (1992). A useful first reference on geostatistics is Neft (1966), while Isaaks and Srivastava (1989) is more comprehensive and more complex. Extensions of statistical analysis to

directional data are provided by Mardia (1972). Geographical applications of graph theory and

network analysis are reviewed in Tinkler (1979) and Haggett and Chorley (1969), respectively.

W. A. V. Clark and P. L. Hosking, Statistical Methods for Geographers (New York: Wiley, 1986).

D. Ebdon, Statistics in Geography, 2nd ed. (Oxford: Basil Blackwell, 1985).

D. A. Griffith and C. G. Amrhein, Statistical Analysis for Geographers (Englewood Cliffs, NJ: Prentice-Hall, 1991).

P. Haggett and R. J. Chorley, Network Analysis in Geography (London: Edward Arnold, 1969).

E. H. Isaaks and R. Mohan Srivastava, Applied Geostatistics (Oxford: Oxford University Press, 1989).

D. Neft, Statistical Analysis for Areal Distributions (Philadelphia: Regional Science Research Institute, 1966).

L. O'Brien, Introducing Quantitative Geography (London: Routledge, 1992).

K. J. Tinkler, "Graph Theory," Progress in Human Geography 3 (1979), 85–116.

DATASETS USED IN THIS CHAPTER

The following data sets employed in this chapter can be found on the website for this book.

cavendish.html

dodata.html

traffic.html

PROBLEMS

1. Explain the following terms:

• Mean

• Median

• Percentile

• Interquartile range

• Standard deviation

• Coefficient of variation

• Skewness

• Kurtosis

• Location quotient

• Lorenz curve

• Gini coefficient

• Mean center

• Standard distance

• Moving average

• Smoothing


2. When is the mean not a good measure of central tendency? Give an example of a variable

that might best be summarized by some other measure of central tendency.

3. Consider the following five observations: –8, 14, –2, 3, 5.

a. Calculate the mean.

b. Find the median.

c. Show that Properties 1, 2, and 3 hold for this set of numbers.

4. Life expectancy at birth (in years) is shown for a random sample of 60 countries in 1979.

The data are drawn from the United Nations Human Development Report:

76.7 64 71.1 40.1 52.3 64.5 74.4 78.6 69.5 62.6 57.9 42.4 69.5 73 50.8 78 71 66.6 48.6

76.9 70 68.9 70 72.4 65.1 39.3 69.6 47.9 73.1 72.9 58 69.9 44.9 51.7 74 69.4 70.5 57.3 45

65.4 78.5 73.9 73 64.4 69.5 45.2 78.1 78.5 71.4 64 48.5 64 66.3 72.9 78.1 47.2 72.4 68.2

72 76 78.1

a. Calculate the following descriptive statistics for these data: mean, variance, skewness, and kurtosis.

b. Write a brief summary of your findings about the distribution of life expectancy in these

countries.

5. Using the data of Problem 4, determine

a. Range

b. First and third quartiles

c. The interquartile range

d. The 60th percentile

e. The 10th percentile

f. The 90th percentile

6. The following data describe the distribution of annual household income in three regions

of a country:

Region       X¯          s
A         $42,000     $16,000
B          33,000      13,000
C          27,000      11,000

a. In which region is income the most evenly spread?

b. In which region is income the least evenly spread?

7. The following data represent a time series developed over 45 consecutive time periods

(read the data by rows):

569 416 422 565 484 520 573 518 501 505 468 382 310 334 359 372 439 446 349 395

461 511 583 590 620 578 534 631 600 438 534 467 457 392 467 500 493 410 412 416

403 433 459 467


a. Smooth these data using a three-term moving average.

b. Smooth these data using a five-term moving average.

c. Graph the original data and the two smoothed sequences from (a) and (b).

d. Graph the fluctuating component from each of the two sequences.

8. Consider a geographic area divided into four regions, north, south, east, and west. The following table lists the population of these regions in three racial categories: Asian, Black,

and White:

Population

Region     Asian    Black    White    Total
North        600      200      700     1500
South        200      300      400      900
East         150      150      250      550
West         100      300      200      600
Total       1050      950     1550     3550

a. Calculate location quotients for each region and each racial group in the population.

b. Calculate the coefficient of localization for each group.

c. Draw the Lorenz curves for each of the racial groups on a single graph.

d. Write a brief paragraph describing the spatial distribution of the three groups using your answers to (a), (b), and (c).

9. What is the smallest possible value for a location quotient? When can it occur? What is

the largest possible value for a location quotient?

10. Consider the following 12 coordinate pairs and weights:

X coordinate    Y coordinate    Weight
     60              80            4
     45              45            5
     70              60            6
     55              60            7
     65              75            4
     70              45            3
     80              60            2
     45              75            2
     30              70            2
     55              50            1
     70              65            1
      0              40            1

a. Locate each point, using the given coordinates, on a piece of graph paper.

b. Calculate the mean center (assume all weights are equal to 1). Calculate the weighted

mean center. Calculate the Manhattan median. Locate each measure on the graph

paper.


c. Calculate the standard distance. Draw a circle with radius equal to the standard distance

centered on the mean center.

d. (Optional) Use the Kuhn and Kuenne algorithm (see Appendix 3b) to determine the

Euclidean median.

11. For the census of the United States (www.census.gov) or Canada (www.statisticscanada.ca),

or an equivalent source, generate the value of a variable for a region containing at least

30 subareas. For example, you could look at some characteristic of the population for the

counties within a state or province.

a. Calculate the mean and variance for the variable of interest across the subareas.

b. Group contiguous subareas in pairs so that there are 15 or more zones; calculate the

mean and variance for the variable of interest across the newly constructed zones.

c. Group contiguous zones in part (b) to get eight or more areas; find the mean and variance of the variable of interest across these areas.

d. Continue the process of aggregating regions into larger geographical units until there

are only two subregions. At each step calculate the mean and variance of the variable of

interest across the areas.

e. The mean of the variable at each step should be equal. Why?

f. Construct a graph of the variance of the variable (Y-axis) versus the number of subareas (X-axis). What does it reveal?

12. For the same data used in Problem 11b, generate five different groupings in which subareas are joined in pairs. Calculate the variance of the variable of interest for each grouping. Do your results confirm the existence of the problem of modifiable units?

4

Statistical Relationships

Chapters 2 and 3 examined how to display, interpret, and describe the distribution

of a single variable. In this chapter, we extend the discussion by exploring statistical

relationships between variables. A few moments of reflection should confirm the importance of bivariate analysis, or thinking about how the values assumed by one variable are related to those assumed by another variable. Examples abound:

1. Is human activity, measured perhaps by global economic output, related to

the atmospheric temperature of the earth?

2. Is the rate of taxation related to economic growth?

3. Is drug X an effective treatment for disease Y?

4. Is the value of the Dow Jones Industrial Average stock market index on one

day linked to its value on another day?

5. Is average household income in a state related to average household income

in neighboring states? Alternatively, do states with high (low) average incomes tend to cluster in space?

Just as we used graphical and numerical techniques to explore the distribution of values of a single variable, we can use related techniques to investigate and describe the

relationship between variables. While the nature of a relationship, or the statistical dependence, between two quantitative variables will be our focus, we will also explore

how to examine the relationship between a quantitative variable and a qualitative variable. Investigation of the interaction between more than two variables, or what we

call multivariate analysis, occupies later chapters of the book.

Bivariate data comprise a set of observations, each of which is associated

with a pair of values for two variables, X and Y. For example, college applicants often

are characterized by their SAT scores and their GPA. Atmospheric scientists are interested in tracking levels of carbon dioxide and the temperature of the earth’s atmosphere over a series of observational units, typically years. Development economists

might compare per capita income with life expectancy across countries. In this case,

the observations would be individual countries.

156


The analysis of bivariate data in this chapter is organized as follows. Section

4.1 reviews the question of statistical dependence and highlights some of the main

issues that will concern us throughout much of the rest of the chapter. In Section 4.2

we show how the graphical techniques introduced in Chapter 2 can be used to compare two distributions. These distributions may represent observations from different

variables, or they may represent different samples on a single variable. We end the

section looking at scatterplots that display the joint distributions of two variables. The

correlation coefficient is introduced in Section 4.3 as a measure of the strength of

the relationship between two quantitative variables. Section 4.4 discusses the closely

related concept of regression. In Section 4.5 we extend the concept of correlation to

investigate the relationship between successive values of a single time-series variable,

what we call temporal or serial autocorrelation. Exploration of spatial autocorrelation,

the relationship between values of a single variable distributed over space, is left to

Chapter 14.

It is important to look at your data before engaging in bivariate analysis; this can be done efficiently using the graphical methods discussed in Chapter 2. Examining the distributions of each of your variables independently often provides insights into how they might be employed in bivariate studies. Such examination can also alert you to potential issues that limit which bivariate techniques can be used, for example, if a variable has a non-normal distribution.

4.1. Relationships and Dependence

In statistics, when we say that two variables are related, we mean that the value(s) assumed by one of the variables provides at least some information about the value(s)

assumed by the other variable. It might be that one variable has a causal influence on

the other variable, or it might be that both variables are influenced by yet another variable or by a combination of other variables. Statistics cannot speak to how relationships arise; it merely provides methods for detecting and analyzing them. In what follows, therefore, we will use the word "influence" simply to mean that one variable is connected to another, either directly or indirectly through the action of other variables.

Consider Table 4-1, which shows the values of two discrete, integer-valued,

quantitative variables, X and Y. Variable X can assume the values 1, 2, and 3, while

variable Y can assume the values 1, 2, 3, and 4. For each value of X there are 100

TABLE 4-1
Dependent Variables: A Bivariate Sample of 300 Observations

          Y=1    Y=2    Y=3    Y=4    All Y
X=1         5     15     25     55      100
X=2        25     30     30     15      100
X=3        35     30     30      5      100
All X      65     75     85     75      300


TABLE 4-2
Independent Variables: A Bivariate Sample of 300 Observations

          Y=1    Y=2    Y=3    Y=4    All Y
X=1        25     30     30     15      100
X=2        25     30     30     15      100
X=3        25     30     30     15      100
All X      75     90     90     45      300

observations as shown in the row totals. Notice that for X = 1, the most common value

of Y is 4, occurring 55 out of 100 times. If we knew that X = 1, we could guess Y = 4

and be right more often than not. By contrast, for X = 3, that value of Y occurs only 5%

of the time. Guessing that Y = 4 when X = 3 would be a mistake; we would be better off

guessing that Y = 1. If we had no information about X, we would use the bottom row

of the table. In that case, the best guess we can make is Y = 3, but that value occurs only

slightly more often than Y = 2 and Y = 4. This is an example where the value of one

variable (X) provides information about another (Y). These variables are dependent, in

the sense that the distribution of one variable depends on the value of the other variable.

Suppose that the two variables X and Y are independent. In that case we might

see the distribution shown in Table 4-2. Notice that the distribution of Y is the same

for all values of X. Thus, the value of X provides no information about the value of Y:

the variables are statistically independent. Note that this does not mean that all values

of Y are equally likely. Indeed, in Table 4-2 a value of Y = 3 occurs twice as often as

Y = 4. Independence means that there is no change in the relative occurrence of the

different values of one variable as values of the other variable change.
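The check just performed, comparing the relative occurrence of the Y values across the rows of the table, is mechanical enough to express in a few lines of code. The sketch below is not from the text; the counts are those of Tables 4-1 and 4-2, and the function names are our own.

```python
def conditional_distributions(table):
    """Convert each row of counts into proportions, i.e., P(Y | X)."""
    return [[count / sum(row) for count in row] for row in table]

def is_independent(table, tol=1e-9):
    """X and Y are independent when every row yields the same P(Y | X)."""
    rows = conditional_distributions(table)
    first = rows[0]
    return all(abs(p - q) <= tol for row in rows[1:] for p, q in zip(row, first))

# Rows are X = 1, 2, 3; columns are Y = 1, 2, 3, 4.
table_4_1 = [[5, 15, 25, 55], [25, 30, 30, 15], [35, 30, 30, 5]]    # dependent
table_4_2 = [[25, 30, 30, 15], [25, 30, 30, 15], [25, 30, 30, 15]]  # independent

print(is_independent(table_4_1))  # False: P(Y | X) changes with X
print(is_independent(table_4_2))  # True: every row gives the same P(Y | X)
```

Note that the test says nothing about the marginal distribution of Y: as in Table 4-2, the values of Y need not be equally likely, only identically distributed within each row.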

DEFINITION: STATISTICAL DEPENDENCE

When the probability of a variable taking a particular value is influenced by

the value assumed by another variable, then the two variables are statistically

dependent.

Bivariate analysis typically is undertaken to determine whether or not a statistical relationship exists between two variables, and if it does, to explore the direction and the

strength of that relationship. The statistical dependence between two variables might

be positive or negative or too complex to describe with a single word, and it might be

strong or weak. In the next few sections we examine how to look for relationships

between variables and how to characterize them.

4.2. Looking for Relationships in Graphs and Tables

When we explore relationships, or statistical dependence between variables, often we

begin by using some of the principles of describing and summarizing data outlined in

Chapters 2 and 3:


1. Display the data in a graph or table that allows comparison between the

variables.

2. Summarize the general patterns in the data and look for obvious departures,

or outliers, from that pattern.

3. Describe the direction and strength of statistical dependence numerically.

We start here with the first two of these tasks, examining how to use some of the

graphical techniques presented in Chapter 2 to explore the relationship between two

variables. In the following examples, we examine how to use qualitative information

to get a better understanding of the distribution of a quantitative variable. We then move

on to investigate scatterplots, the most common way of exploring the relationship between two quantitative variables.

Qualitative versus Quantitative Variables

We commonly encounter sets of data that include both qualitative and quantitative

variables. Areal spatial data provide an example, where we might have quantitative

information on a variable X, distributed across a number of areas or regions. Geographers use the qualitative, areal data to map the quantitative variable in the search

for spatial patterns. In the biomedical field, health-care workers often have quantitative data on patient health, together with categorical or qualitative data on whether a particular drug has been administered. By examining outcomes across a series of patients, researchers can determine the effectiveness of a drug in combating an illness. Qualitative variables are frequently used to stratify or to group observations on a quantitative variable, in order to gain understanding of the nature of the distribution of

that variable. The graphical techniques of Chapter 2 can be used for this task, as we

demonstrate below.

EXAMPLE 4-1. Fisher’s Irises. Ronald Fisher was a geneticist and statistician who

developed many of the foundations of modern mathematical statistics. In 1936, he published a dataset on the characteristics of three different species of iris

to illustrate the principles of discriminant analysis. Table 4-3 contains a portion of

Fisher’s iris data set. (The complete dataset can be found on the book’s website.)

TABLE 4-3
A Portion of Fisher's Iris Data

Species name     Petal width    Petal length    Sepal width    Sepal length
I. setosa              2             14              33              50
I. virginica          24             56              31              67
I. virginica          23             51              31              69
I. versicolor         20             52              30              65
I. versicolor         19             51              27              58

Note: n = 150; only five observations are shown here; all measurements are in millimeters.
Source: Fisher (1936).


TABLE 4-4
Stem-and-Leaf Plot of Sepal Length

a. All species combined

Stems   Leaves
4*      3444
4⋅      56666778888899999
5*      000000000111111111222223444444
5⋅      5555555666666777777778888888999
6*      00000011111122223333333334444444
6⋅      5555566777777778889999
7*      0122234
7⋅      677779

b. Species separated

I. setosa
Stems   Leaves
4T      3
4F      4445
4S      666677
4⋅      888889999
5*      000000011111111
5T      22223
5F      4444455
5S      77
5⋅      8

I. virginica
Stems   Leaves
4⋅      9
5S      67
5⋅      8889
6*      0011
6T      22333333
6F      444445555
6S      77777
6⋅      88999
7*      1
7T      2233
7F      4
7S      67777
7⋅      9

I. versicolor
Stems   Leaves
4⋅      9
5*      001
5T      2
5F      455555
5S      6666677777
5⋅      88899
6*      00001111
6T      22333
6F      445
6S      66777
6⋅      89
7*      0

Note: Display digits = millimeters. In part (a) each stem is split into two lines (* = leaves 0–4, ⋅ = leaves 5–9); in part (b) each stem is split into five lines (* = 0–1, T = 2–3, F = 4–5, S = 6–7, ⋅ = 8–9), and empty stem lines are omitted here.

Sepal length, taken from Fisher's iris dataset, is displayed in stem-and-leaf plots in Table 4-4. The sepal is the small leaf-like structure at the base of flower petals. The stem-and-leaf plot of Table 4-4a, for all iris species combined, shows that sepal length ranges from a minimum value of 43 mm to a maximum value of 79 mm. Most of the observations in the table are found between 50 mm and 64 mm. In Table 4-4b the sepal length observations are separated by species. It should be clear from the table that iris characteristics vary from one species to the next. The species I. setosa has a considerably shorter and less dispersed sepal length than the other two species. In turn, the species I. virginica has a sepal length that is on average a little longer and more dispersed than that of the species I. versicolor. Using a qualitative variable to separate the observations in a dataset can reveal useful information, as this example illustrates.

Calculating the mean and standard deviation of sepal length for the three samples of iris species confirms the differences alluded to above:

Species          Mean (X¯)    Standard deviation (s)
I. setosa          50.10              3.54
I. virginica       65.88              6.36
I. versicolor      59.36              5.16
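Separating a quantitative variable by the categories of a qualitative one, as was done to produce these summary statistics, amounts to a simple group-by. The sketch below is not from the text; it uses plain Python and only the five observations shown in Table 4-3 rather than the full 150-observation dataset.

```python
import statistics
from collections import defaultdict

# Sepal lengths (mm) for the five observations listed in Table 4-3.
records = [("I. setosa", 50), ("I. virginica", 67), ("I. virginica", 69),
           ("I. versicolor", 65), ("I. versicolor", 58)]

# Group the quantitative values by the qualitative species label.
groups = defaultdict(list)
for species, sepal_length in records:
    groups[species].append(sepal_length)

# Summarize each group; the sample standard deviation needs n >= 2.
for species, values in groups.items():
    mean = statistics.fmean(values)
    sd = statistics.stdev(values) if len(values) > 1 else float("nan")
    print(f"{species}: mean = {mean:.2f}, s = {sd:.2f}")
```

With the full dataset, the same loop reproduces the species means and standard deviations tabulated above.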


In Chapter 9, we show how to test whether differences between samples, and between different groups of observations in a dataset, are statistically significant.

Box plots provide another way of separating observations on a quantitative

variable by the categories of a qualitative variable. We look for differences between

groups of observations in order to gain a better understanding of the distribution of a

quantitative variable, or because theory suggests a particular pattern of dependence

between two variables. For example, theoretical arguments within economic geography

support the claim that productivity (output per unit of input) should increase with city

size. In Figure 4-1, box plots illustrate the distribution of U.S. manufacturing productivity levels in 1963 across four, somewhat arbitrary, categories of city size.

In order to construct Figure 4-1, the population sizes and productivity values

for 230 U.S. cities were recorded. The cities were then divided into four size classes.

For each of these four size classes, a boxplot was produced. Figure 4-1 shows that the

distribution of labor productivity does indeed vary for cities of different size. The box

[Figure 4-1 here: side-by-side box plots of manufacturing labor productivity, 1963 (vertical axis), for city size classes 1–4 (horizontal axis).]

FIGURE 4-1. Labor productivity and city size. n = 230. City size classes are based on population: 1 = 0–333,333; 2 = 333,334–666,666; 3 = 666,667–999,999; 4 = 1,000,000 or larger. denotes outside values; * denotes far outside values or outliers. Source: U.S. Census Bureau.

Source: James E. Burt, Gerald M. Barber, and David L. Rigby, Elementary Statistics for Geographers (Guilford Press, 2009).
