Appendix 3b. An Iterative Algorithm for Determining the Weighted or Unweighted Euclidean Median
DESCRIBING DATA WITH STATISTICS
(Xe , Ye ) found by the algorithm to be within 0.01, we set TOL = 0.01. Since it is an
approximating algorithm, the method is more efficient if a “good” starting point is selected. A good point is one that is close to the answer. The bivariate mean (X¯, Y¯ ) and
weighted mean (X¯w, Y¯w) are good and obvious choices for a starting point. So (X1, Y1)
= (X¯, Y¯ ). The algorithm can be described as the following sequence of steps:
1. Calculate the distance from each point (Xi, Yi) to the current estimate of the
median location:

      dit = √[(Xi – Xt)² + (Yi – Yt)²]

where dit is the distance from point i to the median during the tth iteration.
2. Determine the values Kit = wi /dit. Note that in the unweighted case all values of wi are set equal to 1.
3. Calculate a new estimate of the median from

      Xt+1 = (Σⁿi=1 Kit Xi) / (Σⁿi=1 Kit)

      Yt+1 = (Σⁿi=1 Kit Yi) / (Σⁿi=1 Kit)
4. Check to see whether the location has changed between iterations. If |Xt+1 –
Xt| and |Yt+1 – Yt| ≤ TOL, stop. Otherwise, set Xt = Xt+1 and Yt = Yt+1 and go
to step 1.
The algorithm usually converges in a small number of iterations. For the nine points
of Figure 3-11, in the unweighted case, the following table summarizes the results of
the algorithm initialized with TOL = 0.10.
Iteration      Xt        Yt
    1        44.867    45.517
    2        43.865    44.060
    3        43.672    43.204
    4        43.710    42.742
The location of the unweighted median is thus estimated as (43.71, 42.74).
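The four-step procedure above (the Kuhn and Kuenne adaptation of the Weiszfeld iteration) can be sketched in code. The sketch below is illustrative rather than definitive: the function name and the zero-distance guard (the update is undefined if the estimate lands exactly on a data point) are additions of this sketch, not part of the text.

```python
import math

def euclidean_median(points, weights=None, tol=0.01, max_iter=1000):
    """Approximate the (weighted) Euclidean median by the iterative
    scheme described in steps 1-4 above."""
    if weights is None:
        weights = [1.0] * len(points)  # unweighted case: all wi = 1
    # Start from the (weighted) bivariate mean, a good initial guess.
    w_sum = sum(weights)
    x = sum(w * px for w, (px, _) in zip(weights, points)) / w_sum
    y = sum(w * py for w, (_, py) in zip(weights, points)) / w_sum
    for _ in range(max_iter):
        # Step 1: distances from every point to the current estimate.
        d = [math.hypot(px - x, py - y) for (px, py) in points]
        # Guard: if the estimate coincides with a data point, stop there.
        if any(di == 0 for di in d):
            break
        # Step 2: Kit = wi / dit.
        k = [wi / di for wi, di in zip(weights, d)]
        # Step 3: new estimate as a K-weighted mean.
        k_sum = sum(k)
        x_new = sum(ki * px for ki, (px, _) in zip(k, points)) / k_sum
        y_new = sum(ki * py for ki, (_, py) in zip(k, points)) / k_sum
        # Step 4: stop when both coordinates move by no more than TOL.
        if abs(x_new - x) <= tol and abs(y_new - y) <= tol:
            return x_new, y_new
        x, y = x_new, y_new
    return x, y
```

For a symmetric configuration such as the four corners of a square, the starting mean is already the median and the routine stops after one iteration at the center.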
REFERENCES
American Association of University Professors, 2002.
H. W. Kuhn and R. E. Kuenne, “An Efficient Algorithm for the Numerical Solution of the Generalized Weber Problem in Spatial Economics,” Journal of Regional Science 4 (1962), 21–33.
S. M. Stigler, “Do Robust Estimators Work with Real Data?,” Annals of Statistics 5 (1977), 1055–1078.
DESCRIPTIVE STATISTICS
FURTHER READING
Virtually all elementary statistics textbooks for geographers include a discussion of at least part
of the material presented in this chapter. Students often find it useful to consult other textbooks
to reinforce their understanding of key concepts. Four textbooks intended for geography students are those by Clark and Hosking (1986), Ebdon (1985), Griffith and Amrhein (1991), and
O’Brien (1992). A useful first reference on geostatistics is Neft (1966), while Isaaks and Srivastava (1989) is more comprehensive and more complex. Extensions of statistical analysis to
directional data are provided by Mardia (1972). Geographical applications of graph theory and
network analysis are reviewed in Tinkler (1979) and Haggett and Chorley (1969), respectively.
W. A. V. Clark and P. L. Hosking, Statistical Methods for Geographers (New York: Wiley, 1986).
D. Ebdon, Statistics in Geography, 2nd ed. (Oxford: Basil Blackwell, 1985).
D. A. Griffith and C. G. Amrhein, Statistical Analysis for Geographers (Englewood Cliffs, NJ:
Prentice-Hall, 1991).
P. Haggett and R. J. Chorley, Network Analysis in Geography (London: Edward Arnold, 1969).
E. H. Isaaks and R. Mohan Srivastava, Applied Geostatistics (Oxford: Oxford University Press,
1989).
D. Neft, Statistical Analysis for Areal Distributions (Philadelphia: Regional Science Research Unit,
1966).
L. O’Brien, Introducing Quantitative Geography (London: Routledge, 1992).
K. J. Tinkler, “Graph Theory,” Progress in Human Geography 3 (1979), 85–116.
DATASETS USED IN THIS CHAPTER
The following data sets employed in this chapter can be found on the website for this book.
cavendish.html
dodata.html
traffic.html
PROBLEMS
1. Explain the following terms:
• Mean
• Median
• Percentile
• Interquartile range
• Standard deviation
• Coefficient of variation
• Skewness
• Kurtosis
• Location quotient
• Lorenz curve
• Gini coefficient
• Mean center
• Standard distance
• Moving average
• Smoothing
2. When is the mean not a good measure of central tendency? Give an example of a variable
that might best be summarized by some other measure of central tendency.
3. Consider the following five observations: –8, 14, –2, 3, 5.
a. Calculate the mean.
b. Find the median.
c. Show that Properties 1, 2, and 3 hold for this set of numbers.
4. Life expectancy at birth (in years) is shown for a random sample of 60 countries in 1979.
The data are drawn from the United Nations Human Development Report:
76.7 64 71.1 40.1 52.3 64.5 74.4 78.6 69.5 62.6 57.9 42.4 69.5 73 50.8 78 71 66.6 48.6
76.9 70 68.9 70 72.4 65.1 39.3 69.6 47.9 73.1 72.9 58 69.9 44.9 51.7 74 69.4 70.5 57.3 45
65.4 78.5 73.9 73 64.4 69.5 45.2 78.1 78.5 71.4 64 48.5 64 66.3 72.9 78.1 47.2 72.4 68.2
72 76 78.1
a. Calculate the following descriptive statistics for these data: mean, variance, skewness,
and kurtosis.
b. Write a brief summary of your findings about the distribution of life expectancy in these
countries.
5. Using the data of Problem 4, determine
a. Range
b. First and third quartiles
c. The interquartile range
d. The 60th percentile
e. The 10th percentile
f. The 90th percentile
6. The following data describe the distribution of annual household income in three regions
of a country:
Region       X̄           s
  A       $42,000     $16,000
  B        33,000      13,000
  C        27,000      11,000
a. In which region is income the most evenly spread?
b. In which region is income the least evenly spread?
7. The following data represent a time series developed over 45 consecutive time periods
(read the data by rows):
569 416 422 565 484 520 573 518 501 505 468 382 310 334 359 372 439 446 349 395
461 511 583 590 620 578 534 631 600 438 534 467 457 392 467 500 493 410 412 416
403 433 459 467
a. Smooth these data using a three-term moving average.
b. Smooth these data using a five-term moving average.
c. Graph the original data and the two smoothed sequences from (a) and (b).
d. Graph the fluctuating component from each of the two sequences.
8. Consider a geographic area divided into four regions, north, south, east, and west. The following table lists the population of these regions in three racial categories: Asian, Black,
and White:
                     Population
Region     Asian    Black    White    Total
North       600      200      700     1500
South       200      300      400      900
East        150      150      250      550
West        100      300      200      600
Total      1050      950     1550     3550

a. Calculate location quotients for each region and each racial group in the population.
b. Calculate the coefficient of localization for each group.
c. Draw the Lorenz curves for each of the racial groups on a single graph.
d. Write a brief paragraph describing the spatial distribution of the three groups using your
answers to (a), (b), and (c).
9. What is the smallest possible value for a location quotient? When can it occur? What is
the largest possible value for a location quotient?
10. Consider the following 12 coordinate pairs and weights:
X coordinate    Y coordinate    Weight
     60              80            4
     45              45            5
     70              60            6
     55              60            7
     65              75            4
     70              45            3
     80              60            2
     45              75            2
     30              70            2
     55              50            1
     70              65            1
      0              40            1
a. Locate each point, using the given coordinates, on a piece of graph paper.
b. Calculate the mean center (assume all weights are equal to 1). Calculate the weighted
mean center. Calculate the Manhattan median. Locate each measure on the graph
paper.
c. Calculate the standard distance. Draw a circle with radius equal to the standard distance
centered on the mean center.
d. (Optional) Use the Kuhn and Kuenne algorithm (see Appendix 3b) to determine the
Euclidean median.
11. For the census of the United States (www.census.gov) or Canada (www.statisticscanada.ca),
or an equivalent source, generate the value of a variable for a region containing at least
30 subareas. For example, you could look at some characteristic of the population for the
counties within a state or province.
a. Calculate the mean and variance for the variable of interest across the subareas.
b. Group contiguous subareas in pairs so that there are 15 or more zones; calculate the
mean and variance for the variable of interest across the newly constructed zones.
c. Group contiguous zones in part (b) to get eight or more areas; find the mean and variance of the variable of interest across these areas.
d. Continue the process of aggregating regions into larger geographical units until there
are only two subregions. At each step calculate the mean and variance of the variable of
interest across the areas.
e. The mean of the variable at each step should be equal. Why?
f. Construct a graph of the variance of the variable (Y-axis) versus the number of subareas (X-axis). What does it reveal?
12. For the same data used in Problem 11b, generate five different groupings in which subareas are joined in pairs. Calculate the variance of the variable of interest for each grouping. Do your results confirm the existence of the problem of modifiable units?
4
Statistical Relationships
Chapters 2 and 3 examined how to display, interpret, and describe the distribution
of a single variable. In this chapter, we extend the discussion by exploring statistical
relationships between variables. A few moments of reflection should confirm the importance of bivariate analysis, or thinking about how the values assumed by one variable are related to those assumed by another variable. Examples abound:
1. Is human activity, measured perhaps by global economic output, related to
the atmospheric temperature of the earth?
2. Is the rate of taxation related to economic growth?
3. Is drug X an effective treatment for disease Y?
4. Is the value of the Dow Jones Industrial Average stock market index on one
day linked to its value on another day?
5. Is average household income in a state related to average household income
in neighboring states? Alternatively, do states with high (low) average incomes tend to cluster in space?
Just as we used graphical and numerical techniques to explore the distribution of values of a single variable, we can use related techniques to investigate and describe the
relationship between variables. While the nature of a relationship, or the statistical dependence, between two quantitative variables will be our focus, we will also explore
how to examine the relationship between a quantitative variable and a qualitative variable. Investigation of the interaction between more than two variables, or what we
call multivariate analysis, occupies later chapters of the book.
Bivariate data comprise a set of observations each one of which is associated
with a pair of values for two variables, X and Y. For example, college applicants often
are characterized by their SAT scores and their GPA. Atmospheric scientists are interested in tracking levels of carbon dioxide and the temperature of the earth’s atmosphere over a series of observational units, typically years. Development economists
might compare per capita income with life expectancy across countries. In this case,
the observations would be individual countries.
STATISTICAL RELATIONSHIPS
The analysis of bivariate data in this chapter is organized as follows. Section
4.1 reviews the question of statistical dependence and highlights some of the main
issues that will concern us throughout much of the rest of the chapter. In Section 4.2
we show how the graphical techniques introduced in Chapter 2 can be used to compare two distributions. These distributions may represent observations from different
variables, or they may represent different samples on a single variable. We end the
section looking at scatterplots that display the joint distributions of two variables. The
correlation coefficient is introduced in Section 4.3 as a measure of the strength of
the relationship between two quantitative variables. Section 4.4 discusses the closely
related concept of regression. In Section 4.5 we extend the concept of correlation to
investigate the relationship between successive values of a single time-series variable,
what we call temporal or serial autocorrelation. Exploration of spatial autocorrelation,
the relationship between values of a single variable distributed over space, is left to
Chapter 14.
It is important to look at your data before engaging in bivariate analysis. This can
be done efficiently using the graphical methods discussed in Chapter 2. Examining
the distributions of each of your variables independently often provides insights into
how they might be employed in bivariate studies. Such examination can also alert
you to potential issues that limit which bivariate techniques can be used, for example,
if a variable has a non-normal distribution.
4.1. Relationships and Dependence
In statistics, when we say that two variables are related, we mean that the value(s) assumed by one of the variables provides at least some information about the value(s)
assumed by the other variable. It might be that one variable has a causal influence on
the other variable, or it might be that both variables are influenced by yet another variable or by a combination of other variables. Statistics cannot speak to how relationships
arise; it merely provides methods for detecting and analyzing relationships. In what follows, therefore, we will use the word “influence” simply to mean that one variable is
connected to another, either directly or indirectly through the action of other variables.
Consider Table 4-1, which shows the values of two discrete, integer-valued,
quantitative variables, X and Y. Variable X can assume the values 1, 2, and 3, while
variable Y can assume the values 1, 2, 3, and 4. For each value of X there are 100
TABLE 4-1
Dependent Variables: A Bivariate Sample of 300 Observations

          Y = 1    Y = 2    Y = 3    Y = 4    All Y
X = 1        5       15       25       55      100
X = 2       25       30       30       15      100
X = 3       35       30       30        5      100
All X       65       75       85       75      300
TABLE 4-2
Independent Variables: A Bivariate Sample of 300 Observations

          Y = 1    Y = 2    Y = 3    Y = 4    All Y
X = 1       25       30       30       15      100
X = 2       25       30       30       15      100
X = 3       25       30       30       15      100
All X       75       90       90       45      300
observations as shown in the row totals. Notice that for X = 1, the most common value
of Y is 4, occurring 55 out of 100 times. If we knew that X = 1, we could guess Y = 4
and be right more often than not. By contrast, for X = 3, the value Y = 4 occurs only 5%
of the time. Guessing that Y = 4 when X = 3 would be a mistake; we would be better off
guessing that Y = 1. If we had no information about X, we would use the bottom row
of the table. In that case, the best guess we can make is Y = 3, but that value occurs only
slightly more often than Y = 2 and Y = 4. This is an example where the value of one
variable (X) provides information about another (Y). These variables are dependent, in
the sense that the distribution of one variable depends on the value of the other variable.
Suppose that the two variables X and Y are independent. In that case we might
see the distribution shown in Table 4-2. Notice that the distribution of Y is the same
for all values of X. Thus, the value of X provides no information about the value of Y:
the variables are statistically independent. Note that this does not mean that all values
of Y are equally likely. Indeed, in Table 4-2 a value of Y = 3 occurs twice as often as
Y = 4. Independence means that there is no change in the relative occurrence of the
different values of one variable as values of the other variable change.
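This row-by-row comparison can be carried out directly in code. A minimal Python sketch, using the counts of Table 4-2 (the variable and function names are illustrative choices, not taken from the text):

```python
from fractions import Fraction

# Counts from Table 4-2: rows are X = 1, 2, 3; columns are Y = 1..4.
table = [
    [25, 30, 30, 15],  # X = 1
    [25, 30, 30, 15],  # X = 2
    [25, 30, 30, 15],  # X = 3
]

def conditional_distribution(row):
    """Relative frequency of each Y value, given one value of X."""
    total = sum(row)
    return [Fraction(c, total) for c in row]

conditionals = [conditional_distribution(r) for r in table]
# Independence: the conditional distribution of Y is identical in every row.
independent = all(c == conditionals[0] for c in conditionals)
```

Running the same check on the counts of Table 4-1 fails, since there the distribution of Y shifts as X changes.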
DEFINITION: STATISTICAL DEPENDENCE
When the probability of a variable taking a particular value is influenced by
the value assumed by another variable, then the two variables are statistically
dependent.
Bivariate analysis typically is undertaken to determine whether or not a statistical relationship exists between two variables, and if it does, to explore the direction and the
strength of that relationship. The statistical dependence between two variables might
be positive or negative or too complex to describe with a single word, and it might be
strong or weak. In the next few sections we examine how to look for relationships
between variables and how to characterize them.
4.2. Looking for Relationships in Graphs and Tables
When we explore relationships, or statistical dependence between variables, often we
begin by using some of the principles of describing and summarizing data outlined in
Chapters 2 and 3:
1. Display the data in a graph or table that allows comparison between the
variables.
2. Summarize the general patterns in the data and look for obvious departures,
or outliers, from that pattern.
3. Describe the direction and strength of statistical dependence numerically.
We start here with the first two of these tasks, examining how to use some of the
graphical techniques presented in Chapter 2 to explore the relationship between two
variables. In the following examples, we examine how to use qualitative information
to get a better understanding of the distribution of a quantitative variable. We then move
on to investigate scatterplots, the most common way of exploring the relationship between two quantitative variables.
Qualitative versus Quantitative Variables
We commonly encounter sets of data that include both qualitative and quantitative
variables. Areal spatial data provide an example, where we might have quantitative
information on a variable X, distributed across a number of areas or regions. Geographers use the qualitative, areal data to map the quantitative variable in the search
for spatial patterns. In the biomedical field, the health-care workers often have quantitative data on patient health, together with categorical or qualitative data on whether
a particular drug has been administered. By examining outcomes across a series of patients,
researchers can determine the effectiveness of a drug in combating an illness. Qualitative variables are frequently used to stratify or to group observations on a quantitative variable, in order to gain understanding of the nature of the distribution of
that variable. The graphical techniques of Chapter 2 can be used for this task, as we
demonstrate below.
EXAMPLE 4-1. Fisher’s Irises. Ronald Fisher was a geneticist and statistician who
developed many of the foundations of modern mathematical statistics. In 1936, he
published a dataset on the characteristics of three different species of iris
to illustrate the principles of discriminant analysis. Table 4-3 contains a portion of
Fisher’s iris data set. (The complete dataset can be found on the book’s website.)
TABLE 4-3
A Portion of Fisher’s Iris Data

Species name     Petal width    Petal length    Sepal width    Sepal length
I. setosa             2              14              33              50
I. virginica         24              56              31              67
I. virginica         23              51              31              69
I. versicolor        20              52              30              65
I. versicolor        19              51              27              58

Note: n = 150; only five observations are shown here; all measurements are in millimeters.
Source: Fisher (1936).
TABLE 4-4
Stem-and-Leaf Plot of Sepal Length

a. All species combined

Stems   Leaves
4*      3444
4⋅      56666778888899999
5*      000000000111111111222223444444
5⋅      5555555666666777777778888888999
6*      00000011111122223333333334444444
6⋅      5555566777777778889999
7*      0122234
7⋅      677779

b. Species separated

I. setosa                    I. virginica               I. versicolor
Stems  Leaves                Stems  Leaves              Stems  Leaves
4T     3                     4T                         4T
4F     4445                  4F                         4F
4S     666677                4S                         4S
4⋅     888889999             4⋅     9                   4⋅     9
5*     000000011111111       5*                         5*     001
5T     22223                 5T                         5T     2
5F     4444455               5F                         5F     455555
5S     77                    5S     67                  5S     6666677777
5⋅     8                     5⋅     8889                5⋅     88899
6*                           6*     0011                6*     00001111
6T                           6T     22333333            6T     22333
6F                           6F     444445555           6F     445
6S                           6S     77777               6S     66777
6⋅                           6⋅     88999               6⋅     89
7*                           7*     1                   7*     0
7T                           7T     2233                7T
7F                           7F     4                   7F
7S                           7S     67777               7S
7⋅                           7⋅     9                   7⋅

Note: Display digits = millimeters.
Sepal length, taken from Fisher’s iris dataset, is displayed in stem-and-leaf plots
in Table 4-4. The sepal is the small leaf-like structure at the base of flower petals. The
stem-and-leaf plot of Table 4-4a, for all iris species combined, shows that sepal length
ranges from a minimum value of 43 mm to a maximum value of 79 mm. Most of the
observations in the table are found between 50 mm and 64 mm. In Table 4-4b the sepal
length observations are separated by species. It should be clear from the table that iris
characteristics vary from one species to the next. The species I. setosa has a considerably shorter and less dispersed sepal length than the other two species. In turn, the
species I. virginica has a sepal length that is on average a little longer and more dispersed than that of the species I. versicolor. Using a qualitative variable to separate the
observations in a data set can reveal useful information, as this example illustrates.
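The mechanics of a stem-and-leaf display are simple enough to sketch in code. The sketch below is a simplified, hypothetical rendering: it uses whole tens-digit stems rather than the split stems (4T, 4F, and so on) of Table 4-4, and the function name is an illustrative choice.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return a simple stem-and-leaf display: stems are tens digits,
    leaves are units digits, sorted within each stem.
    (Table 4-4 splits each stem further; this sketch does not.)"""
    rows = defaultdict(list)
    for v in sorted(values):
        rows[v // 10].append(v % 10)
    return {stem: "".join(str(leaf) for leaf in leaves)
            for stem, leaves in sorted(rows.items())}

# The five sepal lengths (mm) from Table 4-3:
print(stem_and_leaf([50, 67, 69, 65, 58]))
```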
Calculating the mean and standard deviation of sepal length for the three samples of iris species confirms the differences alluded to above:
Species          Mean (X̄)    Standard deviation (s)
I. setosa          50.10              3.54
I. virginica       65.88              6.36
I. versicolor      59.36              5.16
In Chapter 9, we show how to test whether differences between samples, or between different groups of observations in a dataset, are statistically significant.
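Computationally, stratifying a quantitative variable by a qualitative one reduces to grouping and summarizing. A minimal Python sketch, using only the five observations of Table 4-3 (so its results differ from the full-sample figures above; the dictionary layout is an illustrative choice):

```python
import statistics

# Sepal lengths (mm) from the five rows of Table 4-3, keyed by species;
# the complete 150-flower dataset is on the book's website.
sepal = {
    "I. setosa": [50],
    "I. virginica": [67, 69],
    "I. versicolor": [65, 58],
}

for species, lengths in sepal.items():
    mean = statistics.mean(lengths)
    # The sample standard deviation needs at least two observations.
    sd = statistics.stdev(lengths) if len(lengths) > 1 else float("nan")
    print(species, mean, sd)
```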
Box plots provide another way of separating observations on a quantitative
variable by the categories of a qualitative variable. We look for differences between
groups of observations in order to gain a better understanding of the distribution of a
quantitative variable, or because theory suggests a particular pattern of dependence
between two variables. For example, theoretical arguments within economic geography
support the claim that productivity (output per unit of input) should increase with city
size. In Figure 4-1, box plots illustrate the distribution of U.S. manufacturing productivity levels in 1963 across four, somewhat arbitrary, categories of city size.
In order to construct Figure 4-1, the population sizes and productivity values
for 230 U.S. cities were recorded. The cities were then divided into four size classes.
For each of these four size classes, a boxplot was produced. Figure 4-1 shows that the
distribution of labor productivity does indeed vary for cities of different size. The box
FIGURE 4-1. Labor productivity and city size. n =
230. City size classes are based on population: 1 = 0–
333,333; 2 = 333,334–666,666; 3 = 666,667–999,999;
4 = 1,000,000 or larger. denotes outside values; * denotes far outside values or outliers. Source: U.S. Census
Bureau.