Appendix 11a. Derivation of Equation 11-11 from Equation 11-10
458
INFERENTIAL STATISTICS
since the first term does not enter the summation over j. One of the properties of the
mean is that the sum of the deviations about the mean is always 0. Therefore, the cross-product term must be 0, since the last part of this cross-product term is the sum of the deviations of Yij around their mean Ȳi. This means we can now reexpress (11-10) as
Σ_{i=1}^{t} Σ_{j=1}^{n} (Yij − Ȳ)² = Σ_{i=1}^{t} Σ_{j=1}^{n} (Ȳi − Ȳ)² + Σ_{i=1}^{t} Σ_{j=1}^{n} (Yij − Ȳi)²
Now let us examine the first term on the right-hand side of this new equation. Since the expression Σ_{i=1}^{t} (Ȳi − Ȳ)² does not depend on the summation index j, summing over j simply adds the quantity n times. This means we can rewrite the decomposition as Equation (11-11):
Σ_{i=1}^{t} Σ_{j=1}^{n} (Yij − Ȳ)² = n Σ_{i=1}^{t} (Ȳi − Ȳ)² + Σ_{i=1}^{t} Σ_{j=1}^{n} (Yij − Ȳi)²     (11-11)
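The identity holds for any balanced data set, so it can be checked numerically. The sketch below (with arbitrary, randomly generated stand-in values for the Yij) computes both sides of the decomposition for t treatments with n observations each:

```python
import numpy as np

rng = np.random.default_rng(0)
t, n = 4, 6                              # t treatments, n observations per treatment
Y = rng.normal(50, 10, size=(t, n))      # hypothetical balanced data Y_ij

grand_mean = Y.mean()                    # Ybar
group_means = Y.mean(axis=1)             # Ybar_i

# Left-hand side: total sum of squares
sst = ((Y - grand_mean) ** 2).sum()
# First right-hand term: summing (Ybar_i - Ybar)^2 over j just adds it n times
ssb = n * ((group_means - grand_mean) ** 2).sum()
# Second right-hand term: within-treatment (error) sum of squares
ssw = ((Y - group_means[:, None]) ** 2).sum()

print(np.isclose(sst, ssb + ssw))        # the identity holds for any data
```

Here ssb is the between (treatment) sum of squares and ssw the within (error) sum of squares that appear in the ANOVA table.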
FURTHER READING
For another presentation of the basic one- and two-way ANOVA, students may wish to consult the textbook on elementary statistics by Allan Bluman (2007). Advanced topics in this area, including the development of MANOVA, or Multivariate ANalysis of VAriance, are clearly explained in Hair et al. (2006). In MANOVA, there are two or more dependent variables to be explained
by two or more independent treatments. Three examples of the use of ANOVA in contemporary research include a study of forestation, stream sediment, and fish abundance by Sutherland
et al. (2002); river channel type and in-channel river characteristics by Burge (2004); and seed
size and seed depth on germination and seedling emergence by Chen and Maun (1999).
A. Bluman, Elementary Statistics: A Step by Step Approach, 6th ed. (New York: McGraw-Hill, 2007).
L. M. Burge, “Testing Links between River Patterns and In-Channel Characteristics Using MRPP and ANOVA,” Geomorphology 63 (2004), 115–130.
H. Chen and M. A. Maun, “Effects of Sand Burial Depth on Seed Germination and Seedling Emergence of Cirsium pitcheri,” Plant Ecology 140 (1999), 53–60.
J. F. Hair, Jr., W. C. Black, B. J. Babin, R. E. Anderson, and R. L. Tatham, Multivariate Data Analysis, 6th ed. (Upper Saddle River, NJ: Prentice Hall, 2006).
A. B. Sutherland, J. L. Meyer, and E. P. Gardiner, “Effects of Land Cover on Sediment Regime and Fish Assemblage Structure in Four Southern Appalachian Streams,” Freshwater Biology 47 (2002), 1791–1805.
PROBLEMS
1. Explain the meaning of the following terms:
• One-factor, completely randomized design
• Homogeneity of variance
• Mean Square
• Scheffé’s contrast method
• Interaction
• Population census
• Treatments
• Pooled estimate of the variance
• Contrast
• Factorial experiment
• ANOVA table
• Secondary data
459
ANALYSIS OF VARIANCE
2. Differentiate between the following:
a. A one-way (or one-factor) and a two-way ANOVA
b. A balanced and an unbalanced design
c. Between sum of squares and error sum of squares
3. Explain the meaning of the assumptions of the ANOVA model.
4. An urban geographer interested in the relationship between crime rates and city size randomly selects eight cities in each of three city size categories and notes the number of assaults
per month per 1,000 population. The data are summarized in the following table:
Small cities     Medium cities        Large cities
(<99,999)        (100,000–499,999)    (>500,000)
 27               58                   64
 31               74                   63
 93               21                   74
 86               71                  118
 91               76                   72
 28              146                   92
 42              166                  108
100              118                   73
a. Calculate the ANOVA table for this problem.
b. Test the hypothesis of equal crime rates at α = 0.05
c. Use Scheffé’s method to determine which sample means are different.
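For parts (a) and (b), the F test can be carried out with scipy.stats.f_oneway. The grouping below assumes each row of the table reads (small, medium, large); check this against your copy of the table before relying on the result:

```python
from scipy import stats

# Assaults per month per 1,000 population, grouped by city size
small  = [27, 31, 93, 86, 91, 28, 42, 100]
medium = [58, 74, 21, 71, 76, 146, 166, 118]
large  = [64, 63, 74, 118, 72, 92, 108, 73]

f_stat, p_value = stats.f_oneway(small, medium, large)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")  # reject equal means if p < 0.05
```

The same F statistic can be built by hand from the between and within sums of squares, dividing each by its degrees of freedom (k − 1 = 2 and N − k = 21 here) before taking their ratio.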
5. Noise levels in decibels are taken 100 meters behind noise barriers that have been placed
parallel to a high-volume urban expressway. In order to test the effectiveness of the height
of the barrier and the material used in its construction, a two-factor experiment is undertaken. The 90th-percentile decibel levels for a representative day are shown in the following table:
                      Construction material
Barrier height        Earth Berm     Wood           Concrete
Low (2 meters)        82, 72, 92     86, 76, 96     90, 80, 100
High (4 meters)       82, 94, 78     86, 90, 82     90, 86, 86
Note that there are three observations for each treatment.
a. Calculate the 2-factor ANOVA table for this problem.
b. Calculate the two one-factor ANOVA tables for this problem.
c. Are there significant interaction effects?
6. A cartographer wishes to test the ability of different individuals to use maps by determining
the amount of time it takes them to extract information from a standard topographic map.
She then classifies them by the discipline of their academic major into four categories:
Geography, Engineering, Humanities, or Physical Sciences. The results of the test expressed
in seconds are as follows:
Major
Geography    Engineering    Humanities    Sciences
32           27             65            60
80           72             30            42
34           24             58            74
90           78             83            49
a. Test the hypothesis of equal test times at α = 0.05.
b. Use the Scheffé method to determine which majors have significantly different times.
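Scheffé's method compares each pairwise difference of sample means against a critical value built from the pooled error mean square and the critical F. The sketch below assumes each table row reads (Geography, Engineering, Humanities, Sciences); verify that grouping before using the conclusions:

```python
import itertools
import numpy as np
from scipy import stats

# Map-reading times (seconds) grouped by academic major (grouping assumed)
groups = {
    "Geography":   [32, 80, 34, 90],
    "Engineering": [27, 72, 24, 78],
    "Humanities":  [65, 30, 58, 83],
    "Sciences":    [60, 42, 74, 49],
}

k = len(groups)                                   # number of treatments
N = sum(len(v) for v in groups.values())
means = {g: np.mean(v) for g, v in groups.items()}

# Pooled (error) mean square from the one-way ANOVA
ss_err = sum(((np.asarray(v) - means[g]) ** 2).sum() for g, v in groups.items())
mse = ss_err / (N - k)

alpha = 0.05
f_crit = stats.f.ppf(1 - alpha, k - 1, N - k)

# Scheffe: |ybar_i - ybar_j| is significant when it exceeds
# sqrt((k - 1) * F_crit * MSE * (1/n_i + 1/n_j))
for g1, g2 in itertools.combinations(groups, 2):
    n1, n2 = len(groups[g1]), len(groups[g2])
    crit = np.sqrt((k - 1) * f_crit * mse * (1 / n1 + 1 / n2))
    diff = abs(means[g1] - means[g2])
    verdict = "different" if diff > crit else "not different"
    print(f"{g1} vs {g2}: |diff| = {diff:.1f}, critical = {crit:.1f} -> {verdict}")
```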
7. Suppose that the city observations in Problem 4 come from four different regions of the
country. Within each city-size category, the first two observations come from the North, the
next pair from the South, the next two from the East, and the last two from the West. Use a
two-way ANOVA to test for differences in crime rates in regions and by city size.
8. A scaling technique is used to estimate the familiarity of a set of neighborhood residents
with a set of randomly selected retail stores in a city. Familiarity is given a score from 1 to
5. The retail stores are classified according to the sector in which they lie in relation to the
neighborhood. The goal is to ascertain the possible directional bias in familiarity of the residents. The scores are:
North    South    East    West
3.2      3.0      3.4     3.2
2.7      2.6      2.5     2.6
2.4      2.9      2.8     2.3
4.0      4.0      3.6     3.6
a. Test the hypothesis that store familiarity is not directionally biased using ANOVA.
b. Use the Scheffé method to determine which directions have significantly different levels
of familiarity.
12
Inferential Aspects of Linear Regression
In Chapter 4, we introduced the notions of correlation and regression in the description of the relationship between two variables. We showed how it was possible to
measure the correlation or association between any two variables X and Y using Pearson’s product–moment correlation coefficient r based on the concept of covariance.
Moreover, we found we could fit linear functions known as regression lines to two-variable scatter plots using the technique of least squares analysis. In the next two
chapters we discuss two important extensions to this framework. In this chapter, we
explain the inferential aspects of these two tools, and in Chapter 13 we discuss the
construction of regression models with multiple independent variables. We first present an overview of the model-building process, outlining the context in which to
view the logical steps of regression and correlation analysis.
12.1. Overview of the Steps in a Regression Analysis
Figure 12-1 illustrates the overall strategy for building a regression model. It is convenient to divide the model-building process into three phases:
1. Data collection and preparation
2. Model construction and refinement
3. Model validation and prediction
Let us consider each of these phases in turn.
Data Collection and Preparation
Normally, we undertake our correlation and regression analysis in the context of
some overall research task that begins with the collection of data. The particular data
collection requirements depend on the nature of the study we are undertaking. Our
[Figure 12-1 is a flowchart. Stage 1: collect data → data cleaning → preliminary data examination → diagnostics and remedial procedures → OK to proceed? (if no, return to data collection). Stage 2: estimation of preliminary model → examination of results → model refinement and remedial measures → select tentative model (or set). Stage 3: model validation (with remedial measures if the model is not acceptable) → predictions based on regression model → conclusions.]
FIGURE 12-1. Stages in building a regression model.
research topic might be described as a controlled experiment in which we manipulate
a number of explanatory variables and study their effects on a response or dependent
variable. In our study of crop yields in Chapter 11, for example, we were interested
in examining the role of seed hybrid and amount of fertilizer on crop yields. Data collection for such statistical experiments is generally straightforward, though by no
means simple. There may be difficult measurement problems as well as the issue of
determining the exact levels and combination of levels of our independent variables
to be used in the experiment. If our study is based on observational or survey data,
we must collect data for explanatory variables based on prior knowledge and previous studies as well as new variables that we now feel may be applicable or important
to the study. Normally, these studies are undertaken to confirm (or not to confirm as
the case may be) hypotheses derived from theoretical models. The variables are not
controlled as in an experimental study, though sometimes we are able to ensure that
our data include sufficient numbers of observations on particularly important explanatory variables. For example, if we are studying shopping patterns, we may wish to include gender if it is shown to be an important variable explaining individual shopping
trip behavior.
Sometimes there is inadequate knowledge about a particular research topic, and
the data collection phase of the study might be described as exploratory in nature.
That is, we may be searching for explanatory variables and include a large number
of potential variables in our study. Or one of the variables of interest might be loosely
defined, and we may use several different measures of the general concept in our
search for a useful variable for our model. For example, we may have different measures of income such as family or household, individual, before tax, or income net of
tax. We may be able to obtain data on these variables from different sources, and each
may have particular advantages and disadvantages. At other times we may be seeking
out a proxy variable for a variable that is part of a theoretical model but not directly
measurable. For example, any of our measures of income might be used as a proxy
for the theoretical income stream of a household in the next 10 years. In a sense, we
are prospecting. It is clear that the set of variables can be large, and we may be interested in ensuring that we also obtain data on many different combinations of these
variables. It could be important to obtain observations not only on households with
low, middle, or high incomes, but also households of varying sizes and household
composition in each income class.
Once our data have been collected, many different editing checks and plots can be used to identify not only outright errors but also data outliers. The investigator should
always try to minimize data errors. This often requires the investigator to closely examine an extremely large number of histograms, boxplots, and scatterplots. If we are
interested in undertaking inferential tests on our regression equation or our correlation
coefficients, we may also find it useful to see whether or not we satisfy the assumptions of the model. We will return to this topic later in this chapter.
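A minimal sketch of one such editing check, using hypothetical income values with one deliberately planted bad entry: the standard boxplot rule flags observations outside the 1.5 × IQR "fences" for closer inspection.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical household incomes, plus one planted data error / outlier
incomes = np.append(rng.normal(45_000, 9_000, 50), [250_000])

# Boxplot rule: flag values outside the 1.5 x IQR fences
q1, q3 = np.percentile(incomes, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = incomes[(incomes < low) | (incomes > high)]
print(f"fences: [{low:.0f}, {high:.0f}], flagged: {outliers}")
```

Flagged values are candidates for correction or further scrutiny, not automatic deletion; an outlier may be a genuine observation.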
Notice in Figure 12-1 that we sometimes find results that force us to collect new
or improved data (if this is possible), and we don’t proceed to the next major phase
of model building until these tasks are complete.
Model Construction and Refinement
At this stage in the model-building process, we may evaluate one or more tentative
regression models and perform any number of diagnostic tests on each equation. We
may find that some of these models violate key assumptions of regression or correlation analysis, and we must take remedial steps to correct these problems. The key assumptions of regression analysis are detailed in Section 12.2, and various graphical
diagnostics are described in Section 12.4. In some cases, more than one independent
variable is related to the dependent variable, or a nonlinear function is more appropriate for the systematic part of the relationship. These two issues are explored in
Chapter 13. The model itself may not be completely specified until the data have been
collected. If we are unsure of the nature of the function linking the two variables, we
might explore different functional forms using the available empirical data. To do this
we undertake transformations of our variables.
To estimate the parameters of the regression model, we use the least squares
procedure. This procedure has been fully explained in Chapter 4 for the case of simple
regression, and it is explored further in Chapter 13 for the case of multiple regression.
Throughout the model-building process, we may be continually fitting different regression models, using the results of one equation to make subsequent improvements
in our model. We may find that we can identify a single best model, or we might find
several candidate models that are more or less equally useful. In this phase of model
building, we are often faced with the task of making inferences about the parameters
of our regression model. In parametric statistics, we used the sample mean X¯ to estimate the population mean μ, and we developed a statistical test to determine whether
or not a particular sample could have come from a population with a particular parameter value. Later in this chapter we will see that we can make similar inferences
about the slope and intercept of our calibrated regression model.
Figure 12-1 shows that this component of model building is also not necessarily a simple set of tasks that unambiguously lead to a single model. At times, the
results of fitting one model may suggest we need more data or that we have to undertake some remedial actions based on the diagnostic tests we apply. For example, we
may find we have outliers that are extremely influential and may be providing suspicious results. This stage in model building and refinement may be a lengthy exercise
in which many options are explored, some discarded, and we use large doses of pragmatic judgment as we assess our results. At the end of this phase of model building,
we hope to identify a single model or a series of candidate models that we can explore
further in the final phase of our analysis: validation.
Model Validation and Prediction
Model validity refers to the stability and reasonableness of the regression models we
have created. Are the regression coefficients plausible? Is it possible to generalize the
results of our analysis to other situations or places? Sometimes we can compare our
results to theoretical arguments or to previous empirical studies. If our results tend to
support theoretical expectations, or if they are in agreement with similar studies, this
tends to support the notion that we have created a valid regression model. If not, we
are less likely to believe our model is a valid model for the data.
Sometimes there may be no studies that exactly parallel ours, and it is difficult
to validate our model in this way. We can validate the model in two other ways:
1. Collect a new set of data and check the model and its predictive ability on
this new set.
2. Use a holdout sample to check the model and its predictive ability.
By far the best technique to validate our model is to collect new data and test
the ability of the model to predict the values of our dependent variable for this data.
To do this, take the simple regression equation, substitute in the value of X from an
observation in the new dataset, calculate Yˆ, and then compare it to the actual observed
value of Y for the new data. If our model is reasonable, our predicted values of Yˆ
should be close to the observed values of Y in the new set.
Unfortunately, this technique is rarely feasible because it would involve another
lengthy sampling and data collection exercise. An alternative approach is to split our
data beforehand into two parts. The first portion, or training set, is used to develop
our regression model. The second portion of the dataset is called the validation or prediction set and is used to evaluate the reasonableness of the developed model. This
type of analysis is often called cross-validation. We can also reverse the process by
developing a model using the second dataset and validating it on the first dataset.
When both procedures are used, we are performing a double cross-validation. In Section 13.3 we discuss the specific ways in which this comparison can be made.
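The holdout idea can be sketched in a few lines. The data here are hypothetical (a roughly linear X–Y relation with random noise): the model is fitted on a training portion only, and its predictions Ŷ are then compared with the observed Y values in the held-out validation set.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical (X, Y) observations with a roughly linear relation
X = rng.uniform(10, 60, 40)
Y = 1.5 + 0.12 * X + rng.normal(0, 0.8, 40)

# Split the data beforehand: a training set to fit, a validation set to check
idx = rng.permutation(40)
train, valid = idx[:28], idx[28:]

# Least squares fit on the training portion only
b, a = np.polyfit(X[train], Y[train], 1)   # returns (slope, intercept)

# Predict the held-out observations and compare with what was observed
Y_hat = a + b * X[valid]
mspr = np.mean((Y[valid] - Y_hat) ** 2)    # mean squared prediction error
print(f"slope = {b:.3f}, intercept = {a:.3f}, MSPR = {mspr:.3f}")
```

Swapping the roles of the two portions and repeating the comparison gives the double cross-validation described above.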
Once it is developed, we often wish to use the regression equation to predict
values of the dependent variable. Normally, confidence intervals are used to establish
a range within which the dependent variable can be predicted at any desired level of
confidence. This issue is explored in detail in Section 12.3 and again in Chapter 13.
Whenever a regression analysis is undertaken, it may not necessarily include
all of these steps. Sometimes we may not be interested in prediction at all, and the
ultimate goal is only to evaluate the strength of the empirical evidence in support of
some prespecified theoretical model. Other times, the overriding goal may be to use
the model for prediction, and interest in hypothesis testing may be secondary. For
example, we might wish to develop a regression equation using data collected for
one period, and then evaluate its predictive capabilities using data from another period. Quite often both issues are important. The remainder of this chapter explores the
computational techniques and issues involved in both hypothesis testing and prediction. First, however, it is necessary to enumerate the assumptions that are necessary
in order to make valid inferences in regression analysis.
12.2. Assumptions of the Simple Linear Regression Model
In order to fully understand the inferential framework for regression analysis, it is essential to consider the specification of the simple linear regression function in a more
formal way.
Linear Regression Model
A linear regression model formally expresses the two components of a statistical relationship. First, the dependent variable varies systematically with the independent
variable or variables. Second, there is a scattering of the observations around this
systematic component. Although the systematic component can take on any form and
we can have as many variables as is necessary, we restrict ourselves to the simple two-variable regression model. We conceptualize the problem in the following way.
FIXED-X MODEL
Suppose we are able to control the value of variable X at some particular value, say
at X = X1. At X = X1, we have a distribution of Y values and a mean of Y for this
value of X. There are similar distributions of Y and mean responses at each other fixed
value of X, X = X2, X = X3, . . . , X = XM . The regression relationship of Y on X is derived by tracing the path of the mean value of Y for these fixed values of X. Figure 12-2
illustrates a regression relation for such a fixed-X model. Although the regression
function that describes the systematic part of this statistical relation may be of any
form whatsoever, we will restrict ourselves to situations where the path of the means
is linear. More complicated nonlinear functions are described in Section 13.2. The
function that passes through the means has the equation
Yi = a + bXi                                     (12-1)
which is the familiar slope-intercept form of a straight line.
EXAMPLE 12-1. To examine the rate of noise attenuation with distance from an urban expressway, sound recorders are placed at intervals of 50 meters over a range of
50 to 300 meters from the centerline of an expressway. Sound recordings are then analyzed, and the sound level in decibels, which is exceeded 10% of the day, is calculated. This measure, known as L10, is a general indicator of the highest typical sound
levels at a location. Recordings are then repeated for five consecutive weekdays using the same sound recorders placed at the identical locations. A plot of the data generated from this experiment is shown in Figure 12-3.
The graph reveals the typical pattern of distance decay. The variation in decibel levels at any fixed location either is due to the effects of other variables—perhaps
variations in traffic volumes or relative elevation—or can be thought of as random
error. Let us presume that all other variables that might have caused sound-level variations are more or less constant and that the variations do represent random error. Note
that the path of the mean sound levels with distance seems to follow the linear function superimposed on the scattergram. One would feel quite justified in fitting a linear
regression equation to this scatter of points.
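A fixed-X design like this one is easy to mimic numerically. The readings below are simulated, loosely patterned on the example (an assumed baseline near 88 dBA and an assumed linear decay rate); the code traces the path of the mean L10 at each fixed distance and then fits the line through the scatter.

```python
import numpy as np

rng = np.random.default_rng(3)
# Fixed-X design: recorders at 50 m intervals, five repeat days at each site
distances = np.repeat(np.arange(50, 301, 50), 5)
# Hypothetical L10 readings: assumed linear decay plus random error
l10 = 88.0 - 0.09 * distances + rng.normal(0, 2.0, distances.size)

# The regression of Y on X traces the path of the mean L10 at each fixed X
for d in np.unique(distances):
    print(f"{d:3d} m: mean L10 = {l10[distances == d].mean():.1f} dBA")

slope, intercept = np.polyfit(distances, l10, 1)
print(f"fitted line: L10 = {intercept:.1f} + ({slope:.3f}) x distance")
```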
RANDOM-X MODEL
If situations in which we can control the values of X so that they are set at predetermined levels were the only conditions in which regression analysis could be applied,
[Figure 12-2 is a scatterplot of Y against X at the fixed values X1, X2, X3, and X4, showing the observations (x), the mean value of Y for each given X (o), and the regression of Y on X passing through those means.]
FIGURE 12-2. The fixed-X regression model.
[Figure 12-3 is a scattergram of L10 dBA (roughly 50 to 90) against distance from the expressway (50 to 300 m), showing the observations (x), the mean dB at each given distance (o), and the path of mean noise levels declining with distance.]
FIGURE 12-3. Scattergram of Example 12-1.
it would be of limited value in geographical research. Only rarely do geographers
gather data of this form; such data are more commonly collected in experimental settings.
All the mechanical operations presented in this chapter can be performed on any set
of paired observations of two variables X and Y. Virtually all the inferential results of
regression analysis apply both to data characteristic of the fixed-X model and to data in
which both X and Y can be considered random variables. This greatly generalizes the
applicability of the regression model. Let us consider an example of a random-X model.
EXAMPLE 12-2. One of the key tasks in metropolitan area transportation planning
is to estimate the total number of trips made by households during a typical day. Data
collected from individual households are usually aggregated to the level of the traffic zone. Traffic zones are small subareas of the city with relatively homogeneous
characteristics. The dependent variable is the number of trips made by a household
per day, known as the trip generation rate. One of the most important determinants
of variations in household trip generation rates is household income. We would expect the total number of trips made by a household in a day to be positively related to
the income of the household.
Consider the hypothetical city of Figure 12-4 which has been divided into 12
traffic zones. Values of the average trip generation rates and household income for
each of 12 zones are listed in Table 12-1. The data are typical for a North American
city. Inner-city traffic zones 1 to 4 are composed of lower-income households that
make few trips per day. Suburban zones 5 to 12 are populated with higher-income
households with higher trip generation rates. The scattergram of Figure 12-5 verifies
the positive relation between trip generation and household income. Notice how these
data differ from the data of a fixed-X model. First, we do not have more than one observation of trip generation rates for any single level of household income. Second,
we do not have observations of trip generation rates for systematic values of household income, say every $1,000 or $5,000 of income. Nevertheless, it appears that the
simple linear regression of household trip generation on income represents a useful
statistical relation. We shall use this example throughout this chapter as well as in
Chapter 13.
But first let us examine the nature of these data. In what sense can we think of
variable X as being a random variable? We might think of household income as a random variable since it is derived from one particular spatial aggregation of households
into traffic zones. The traffic zones of Figure 12-4 can be thought of as one random
choice among all the possible ways in which we might build traffic zones from the
household data. Other aggregations to different traffic zones would yield different
average incomes and different trip generation rates and therefore potentially different regression equations. This is not a very persuasive argument. We would think of
income as being a random variable if the original observations from which we generated Table 12-1 were a random sample taken from all the households in the city. If
this were the case, then we might have some justification in treating income as a random variable. But what if the aggregated data represent all the households in the
city? Can we make a case for treating income as a random variable? As you see, many