16.5.2 Testing whether the intercept of the regression line is significantly different to zero
The value for the intercept a calculated from a sample is only an estimate of
the population statistic α. Consequently, a positive or negative value of a
might be obtained in a sample from a population where α is zero. The
standard deviation of the points scattered around the regression line can be
used to calculate the 95% confidence interval for a, and a single-sample t test
can be used to compare the value of a to zero or any other expected value.
Once again, most statistical packages include a test to determine if a differs
significantly from zero.
16.5.3 The coefficient of determination $r^2$
The coefficient of determination, symbolized by $r^2$, is a statistic that shows
the proportion of the total variation of the values of Y from the average Y
that is explained by the regression line. It is the regression sum of squares
divided by the total sum of squares:

$$r^2 = \frac{\text{Sum of squares explained by the regression ((a) above)}}{\text{Total sum of squares ((a) + (b) above)}} \qquad (16.10)$$

which will only ever be a number from zero to 1.0. If the points all lie along the
regression line and it has a slope that is different from zero, the unexplained
component (quantity (b)) will be zero and $r^2$ will be 1. If the explained sum of
squares is small in relation to the unexplained, $r^2$ will be a small number.
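The ratio in Equation (16.10) can be sketched in a few lines of Python. This is only an illustration: the function name `r_squared` and the two small data sets are made up for the purpose and do not come from the text.

```python
def r_squared(x, y):
    """Return r^2 = explained sum of squares / total sum of squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx                # least-squares slope
    a = mean_y - b * mean_x      # least-squares intercept
    fitted = [a + b * xi for xi in x]
    ss_explained = sum((f - mean_y) ** 2 for f in fitted)   # quantity (a)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)          # quantities (a) + (b)
    return ss_explained / ss_total

# Made-up points lying exactly on Y = 1 + 2X: nothing is left unexplained.
perfect = r_squared([1, 2, 3, 4], [3, 5, 7, 9])

# Made-up points with no linear trend: the fitted slope is zero.
flat = r_squared([1, 2, 3, 4], [5, 7, 7, 5])

print(perfect, flat)
```

The first data set should give a value of 1 (to within floating-point rounding) and the second a value of zero, matching the two extremes described above.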
16.6 An example: school cancellations and snow
In places at high latitudes, heavy snowfalls are the dream of every young
student, because they bring the possibility of school closures (called “snow
days”), not to mention sledding, hot chocolate and additional sleep! A school
administrator looking for a way to predict the number of school closures on any
day in the city of St Paul, Minnesota hypothesized that it would be related to
the amount of snow that had fallen during the previous 24 hours. To test this,
they examined data from 10 snowfalls. These bivariate data for snowfall
(in cm) and the number of school closures on the following day are given in
Table 16.1.
Table 16.1 Data for 24-hour snowfall and the number of school closure days for each of 10 snowfalls.

Snowfall (cm)    School closures
3                5
6                13
9                16
12               14
15               18
18               23
21               20
24               32
27               29
30               28
Table 16.2 An example of the table of results from a regression analysis. The value
of the intercept a (5.733) is given in the first row, labeled “(Constant)” under the
heading “Value”. The slope b (0.853) is given in the second row (labeled as the
independent variable “Snowfall”) under the heading “Value.” The final two columns
give the results of t tests comparing a and b to zero. These show the intercept, a, is
significantly different to zero (P = 0.035) and the slope b is also significantly different
to zero (P < 0.001). The significant value of the intercept suggests that there may be
other reasons for school closures (e.g. ice storms, frozen pipes), or perhaps the
regression model is not very accurate.

Model       Value    Std error    t        Significance
Constant    5.733    2.265        2.531    0.035
Snowfall    0.853    0.122        7.006    0.001
From a regression analysis of these data a statistical package will give
values for the equation for the regression line, plus a test of the hypotheses
that the intercept, a, and slope, b, are from a population where α and β are
zero. The output will be similar in format to Table 16.2.
From the results in Table 16.2 the equation for the regression line is
school closures = 5.733 + 0.853 × snowfall. The slope is significantly different
to zero (in this case it is positive) and the intercept is also significantly
different to zero. You could use the regression equation to predict the
number of school closures based on any snowfall between 3 and 30 cm.
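As a check on the values a statistical package reports, the intercept and slope can be computed directly from the data in Table 16.1. The sketch below uses the ordinary least-squares formulas in plain Python; the function name `fit_line` is illustrative and not taken from any package.

```python
snowfall = [3, 6, 9, 12, 15, 18, 21, 24, 27, 30]      # cm, from Table 16.1
closures = [5, 13, 16, 14, 18, 23, 20, 32, 29, 28]    # from Table 16.1

def fit_line(x, y):
    """Return the least-squares intercept a and slope b for Y = a + bX."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

a, b = fit_line(snowfall, closures)
print(f"school closures = {a:.3f} + {b:.3f} x snowfall")
# prints: school closures = 5.733 + 0.853 x snowfall
```

The t tests and P values in the output table need the t distribution as well, which is why a statistical package is normally used for the full analysis.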
Table 16.3 An example of the results of an analysis of the slope of a regression. The
significant F ratio shows the slope is significantly different to zero.

              Sum of squares    df    Mean square    F         Significance
Regression    539.648           1     539.648        49.086    0.000
Residual      87.952            8     10.994
Total         627.600           9
Most statistical packages will give an ANOVA of the slope. For the data in
Table 16.1 there is a significant relationship between school closures and
snowfall (Table 16.3).
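The sums of squares, mean squares and F ratio in the ANOVA can be reproduced from the data in Table 16.1. The following is a minimal sketch in plain Python with illustrative variable names. It does not compute the significance of F, which requires the F distribution.

```python
snowfall = [3, 6, 9, 12, 15, 18, 21, 24, 27, 30]      # cm, from Table 16.1
closures = [5, 13, 16, 14, 18, 23, 20, 32, 29, 28]    # from Table 16.1

n = len(snowfall)
mean_x = sum(snowfall) / n
mean_y = sum(closures) / n
sxx = sum((x - mean_x) ** 2 for x in snowfall)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(snowfall, closures))
b = sxy / sxx                  # slope
a = mean_y - b * mean_x        # intercept
fitted = [a + b * x for x in snowfall]

# Partition the total variation into explained and unexplained components.
ss_regression = sum((f - mean_y) ** 2 for f in fitted)
ss_residual = sum((y - f) ** 2 for y, f in zip(closures, fitted))
ss_total = sum((y - mean_y) ** 2 for y in closures)

df_regression = 1              # one independent variable
df_residual = n - 2            # two estimated quantities: a and b
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_ratio = ms_regression / ms_residual

print(f"Regression  SS = {ss_regression:.3f}  df = {df_regression}  MS = {ms_regression:.3f}  F = {f_ratio:.3f}")
print(f"Residual    SS = {ss_residual:.3f}  df = {df_residual}  MS = {ms_residual:.3f}")
print(f"Total       SS = {ss_total:.3f}  df = {n - 1}")
```

The printed values should agree with the ANOVA table for these data to three decimal places.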
Finally, the value of $r^2$ is also given. Sometimes there are two values: $r^2$,
which is the statistic for the sample and a value called “Adjusted” $r^2$, which is
an estimate for the population from which the sample has been taken. The $r^2$
value is usually the one reported in the results of the regression. For the
example above you would get the following values:

$$r = 0.927; \quad r^2 = 0.860; \quad \text{adjusted } r^2 = 0.842$$

This shows that 86% of the variation in school closures with snowfall can be
predicted by the regression line.
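These three values can be checked from the data in Table 16.1. The sketch below assumes the usual adjustment for a regression with one independent variable, adjusted $r^2 = 1 - (1 - r^2)(n - 1)/(n - 2)$. The text does not give this formula, so treat it as an assumption about what a statistical package reports.

```python
import math

snowfall = [3, 6, 9, 12, 15, 18, 21, 24, 27, 30]      # cm, from Table 16.1
closures = [5, 13, 16, 14, 18, 23, 20, 32, 29, 28]    # from Table 16.1

n = len(snowfall)
mean_x = sum(snowfall) / n
mean_y = sum(closures) / n
sxx = sum((x - mean_x) ** 2 for x in snowfall)
syy = sum((y - mean_y) ** 2 for y in closures)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(snowfall, closures))

r = sxy / math.sqrt(sxx * syy)     # correlation coefficient
r2 = r ** 2                        # coefficient of determination
# Assumed adjustment for one independent variable (n - 2 residual df).
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)

print(f"r = {r:.3f}; r2 = {r2:.3f}; adjusted r2 = {adjusted_r2:.3f}")
# prints: r = 0.927; r2 = 0.860; adjusted r2 = 0.842
```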
16.7 Predicting a value of Y from a value of X
Because the regression line has the average slope through a set of scattered
points, the predicted value of Y is only the average expected for a given value
of X. If the $r^2$ value is 1.0, the value of Y will be predicted without error,
because all the data points will lie on the regression line. Usually, however,
the points will be scattered around the line. More advanced texts describe
how you can calculate the 95% confidence interval for a value of Y and thus
predict its likely range.
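For readers who want a feel for such a range, here is a hedged sketch using the school closure data from Table 16.1. It assumes the standard textbook formula for a 95% interval around a single new value of Y, $\hat{Y} \pm t \, s \sqrt{1 + 1/n + (X_0 - \bar{X})^2 / \sum (X_i - \bar{X})^2}$, where s is the standard deviation of the points about the line. The critical value 2.306 (two-tailed, 8 degrees of freedom) is hard-coded as an assumption, and the function name is illustrative.

```python
import math

snowfall = [3, 6, 9, 12, 15, 18, 21, 24, 27, 30]      # cm, from Table 16.1
closures = [5, 13, 16, 14, 18, 23, 20, 32, 29, 28]    # from Table 16.1

n = len(snowfall)
mean_x = sum(snowfall) / n
mean_y = sum(closures) / n
sxx = sum((x - mean_x) ** 2 for x in snowfall)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(snowfall, closures))
b = sxy / sxx
a = mean_y - b * mean_x

# Standard deviation of the points scattered around the regression line.
ss_residual = sum((y - (a + b * x)) ** 2 for x, y in zip(snowfall, closures))
s = math.sqrt(ss_residual / (n - 2))

T_CRIT = 2.306   # assumed: two-tailed 5% critical value of t for n - 2 = 8 df

def predict_with_interval(x0):
    """Return (predicted Y, lower, upper) for one new observation at x0."""
    y_hat = a + b * x0
    half_width = T_CRIT * s * math.sqrt(1 + 1 / n + (x0 - mean_x) ** 2 / sxx)
    return y_hat, y_hat - half_width, y_hat + half_width

y_hat, lower, upper = predict_with_interval(15)
print(f"predicted closures at 15 cm: {y_hat:.1f} (95% range {lower:.1f} to {upper:.1f})")
```

The interval is narrowest at the mean of X and widens toward the ends of the measured range, which is one reason predictions near the middle of the data are the most reliable.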
16.8 Predicting a value of X from a value of Y
Often you might want to estimate a value of the independent variable X
from the dependent variable Y. Here is an example. Many elements absorb
energy of a very specific wavelength because the movement of electrons or
neutrons from one energy level to another within atoms is related to the
vibrational modes of crystal lattices. Therefore, the amount of energy
absorbed at that wavelength is dependent on the concentration of the
element present in a sample. Here it is tempting to designate the
concentration of the element as the dependent variable and absorption as the
independent one and use regression in order to estimate the concentration
of the element present. This is inappropriate because concentration of an
element does not depend on the amount of energy absorbed or given off, so
one of the assumptions of regression would be violated.
Predicting X from Y can be done by rearranging the regression equation
for any point from:

$$Y_i = a + bX_i \qquad (16.11)$$

to:

$$X_i = \frac{Y_i - a}{b} \qquad (16.12)$$

but here too the 95% confidence interval around the estimated value of X
must also be calculated because the measurement of Y is likely to include
some error. Methods for doing this are given in more advanced texts.
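The rearrangement in Equation (16.12) is simple to apply once a and b are known. The sketch below uses the intercept and slope from Table 16.2 purely as convenient numbers. The function names are illustrative, and no confidence interval is computed.

```python
def predict_y(x, a, b):
    """Equation (16.11): Y = a + bX."""
    return a + b * x

def predict_x(y, a, b):
    """Equation (16.12): X = (Y - a) / b."""
    if b == 0:
        raise ValueError("slope is zero, so X cannot be recovered from Y")
    return (y - a) / b

a, b = 5.733, 0.853              # intercept and slope from Table 16.2

y_at_15 = predict_y(15, a, b)    # predicted Y for X = 15
x_back = predict_x(y_at_15, a, b)  # rearranged equation should return 15
print(y_at_15, x_back)
```

The check for a zero slope matters because Equation (16.12) divides by b: when the slope is zero, Y carries no information about X.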
16.9 The danger of extrapolating beyond the range of data available
Although regression analysis draws a line of best fit through a set of data,
it is dangerous to make predictions beyond the measured range of X.
Figure 16.8 illustrates that a predicted regression line may not be a correct
estimation of the value of Y outside this range.
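One practical safeguard is to make a prediction routine refuse values of X outside the measured range. The sketch below does this for the school closure example, using the coefficients from Table 16.2 and the 3 to 30 cm range of Table 16.1. The function name and the choice to raise an error are illustrative.

```python
A, B = 5.733, 0.853        # intercept and slope from Table 16.2
X_MIN, X_MAX = 3, 30       # measured range of snowfall (cm) in Table 16.1

def predict_closures(snowfall_cm):
    """Predict school closures, but only inside the measured range of X."""
    if not X_MIN <= snowfall_cm <= X_MAX:
        raise ValueError(
            f"{snowfall_cm} cm is outside the measured range "
            f"{X_MIN}-{X_MAX} cm; refusing to extrapolate"
        )
    return A + B * snowfall_cm

inside = predict_closures(12)      # within the data: a prediction is returned

try:
    predict_closures(60)           # beyond the data: no prediction is made
    refused = False
except ValueError:
    refused = True

print(inside, refused)
```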
16.10 Assumptions of linear regression analysis
The procedure for linear regression analysis described in this chapter is
often described as a Model I regression, and makes several assumptions.
First, the values of Y are assumed to be from a population of values that
are normally and evenly distributed about the regression line, with no
gross heteroscedasticity. One easy way to check for this is to plot a graph
showing the residuals. For each data point its vertical displacement on the Y
axis either above or below the fitted regression line is the amount of residual
variation that cannot be explained by the regression line, as described in
Section 16.5.1. The residuals are calculated by subtraction (Table 16.4) and
plotted on the Y axis against the values of X for each point. This will always
give a plot where the regression line is re-expressed as a horizontal line with
an intercept of zero.

Figure 16.8 It is risky to use a regression line to extrapolate values of Y
beyond the measured range of X. The regression line (a) based on the data for
values of X ranging from 1 to 5 does not necessarily give an accurate prediction
(b) of the values of Y beyond that range. A classic example of such behavior is
found in plots of the geothermal gradient of the Earth’s interior. At shallow
depths, there is generally a linear increase in temperature of ~20 K/km depth,
depending on location, but the rate decreases as you go deeper into the mantle.
If the original data points are uniformly scattered about the original
regression line, the scatter plot of the residuals will be evenly dispersed in
a band above and below zero (Figure 16.9). If there is heteroscedasticity the
band will vary in width as X increases or decreases. Most statistical packages
will give a plot of the residuals for a set of bivariate data.
Table 16.4 Original data and fitted regression line of Y = 10.8 + 0.9X. The residual for
each point is its vertical displacement from the regression line. Each residual is
plotted on the Y axis against the original value of X for that point to give a graph
showing the spread of the points about a line of zero slope and intercept.

Original data    Calculated value of $\hat{Y}$    Data for the plot of residuals
X      Y         from regression equation         Value of X (from original data)    Value of $(Y - \hat{Y})$
1      13        11.7                             1                                  1.3
3      12        13.5                             3                                  −1.5
4      14        14.4                             4                                  −0.4
5      17        15.3                             5                                  1.7
6      17        16.2                             6                                  0.8
7      15        17.1                             7                                  −2.1
8      17        18.0                             8                                  −1.0
9      21        18.9                             9                                  2.1
10     20        19.8                             10                                 0.2
11     19        20.7                             11                                 −1.7
12     21        21.6                             12                                 −0.6
14     25        23.4                             14                                 1.6
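The subtraction used to build Table 16.4 can be sketched as follows. The data and the fitted line Y = 10.8 + 0.9X are taken from the table. The variable names are illustrative, and rounding to one decimal place is done only to match the table.

```python
x = [1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14]               # Table 16.4
y = [13, 12, 14, 17, 17, 15, 17, 21, 20, 19, 21, 25]       # Table 16.4

a, b = 10.8, 0.9                                           # fitted line from Table 16.4

fitted = [a + b * xi for xi in x]                          # calculated value of Y
residuals = [round(yi - fi, 1) for yi, fi in zip(y, fitted)]   # Y minus fitted Y

# Each (X, residual) pair is one point on the residual plot.
for xi, ri in zip(x, residuals):
    print(f"{xi:>3} {ri:>5.1f}")
```

Plotting the second column against the first gives the residual plot: a band of points scattered about a horizontal line at zero.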
Second, it is assumed the independent variable X is measured without
error. This is often difficult to achieve, so many texts note instead that X
should be measured with little error. For example, levels of an independent
variable determined by the experimenter, such as the percent relative
humidity, are usually measured with very little error indeed. In contrast,
variables such as the depth of snowfall from a windy blizzard, or the in situ
temperature of a violently erupting magma, are likely to be measured with a
great deal of error. When the independent variable is subject to error, a
different analysis called Model II regression is appropriate. Again, this is
described in more advanced texts.
Third, it is assumed that the dependent variable is determined by the
independent variable. This was discussed in Section 16.2.
Fourth, the relationship between X and Y is assumed to be linear and it is
important to be confident of this before carrying out the analysis. A scatter
plot of the data should be drawn to look for any obvious departures from
linearity. In some cases it may be possible to transform the Y variable
(see Chapter 13) to give a linear relationship and proceed with a regression
analysis on the transformed data.

Figure 16.9 (a) Plot of original data in Table 16.4, with fitted regression line
Y = 10.8 + 0.9X. (b) The plot of the residual $(Y - \hat{Y})$ against the value of X for
each data point shows a relatively even scatter about the horizontal line.
(c) General form of residual plot for data that are homoscedastic. (d) Residual
plot showing one example of heteroscedasticity, where the variance of the
residuals decreases with X.
16.11 Multiple linear regression
Multiple linear regression is a straightforward extension of simple linear
regression. The simple linear regression equation:

$$Y_i = a + bX_i \qquad (16.13, \text{ copied from } 16.1)$$

examines the relationship between the value of a variable Y and another
variable X. Often, however, the value of Y might depend upon more than
one variable. For example, the sediment yield of a river may be dependent
on its drainage area plus other factors such as topographic relief,
precipitation and flow rate.
Therefore, the regression equation could be extended:

$$Y_i = a + b_1 X_{1i} + b_2 X_{2i} \qquad (16.14)$$

which is just Equation (16.13) plus a second independent variable with its
own coefficient of $b_2$ and values ($X_{2i}$). You will notice that there is now a
double subscript after the two values of X, in order to specify the first variable
(e.g. drainage area) as $X_1$ and the second (e.g. topographic relief) as $X_2$.
Equation (16.14) can be further extended to include additional variables
such as precipitation and flow rate:

$$Y_i = a + b_1 X_{1i} + b_2 X_{2i} + b_3 X_{3i} + b_4 X_{4i} + \dots \qquad (16.15)$$

The mathematics of multiple linear regression is complex, but a statistical
package will give the overall significance of the regression and, more
importantly, the value for the slope (and its significance) for each of the
independent variables.
If the initial analysis shows that an independent variable has no
significant effect on Y, the variable can be removed from the equation and the
analysis rerun. This process of refining the model can be repeated until only
significant independent variables remain, thereby giving the best possible
model for predicting Y. There are several procedures for refining, but the
one most frequently recommended is to initially include all independent
variables, run the analysis and examine the results for the ones that do not
appear to affect the value of Y (i.e. variables with non-significant values
of b). The least significant is removed and the analysis rerun. This process,
called backward elimination, is repeated until only significant variables
remain.
16.12 Further topics in regression
This chapter is an introduction to linear regression analysis. More advanced
analyses include procedures for comparing the slopes and intercepts of two
or more regression lines. Non-linear regression models can be fitted to data
where the relationship between X and Y is exponential, logarithmic or even
more complex. The understanding of simple linear regression developed
here is an essential introduction to these methods, which will be discussed
further in relation to sequence analysis (Chapter 21).
16.13 Questions
(1) An easy way to help understand regression is to work through a simple
contrived example. The set of data below will give a regression with
a slope of 0 and an intercept of 10, so the line will have the equation
Y = 10 + 0X:

X    Y
0    10
0    9
0    11
1    10
1    9
1    11
2    10
2    9
2    11

(a) Use a statistical package to run the regression. What is the value of $r^2$ for
this relationship? Is the slope of the regression significant? (b) Next, modify
the data to give an intercept of 20, but with a slope that is still zero. (c)
Finally, modify the data to give a negative slope that is significant.
(2) The table below gives data for the weight of alluvial gold recovered from
different volumes of stream gravel.

Volume of gravel processed (m³)    Weight of gold recovered (grams)
1                                  0.025
2                                  0.042
3                                  0.071
4                                  0.103
5                                  0.111
6                                  0.142
7                                  0.164
8                                  0.191
9                                  0.220

(a) Run a regression analysis, where the volume of gravel is the independent
variable. What is the value of $r^2$? Is the relationship significant?
What is the equation for the relationship between the weight of gold
recovered and the volume of gravel processed? Does the intercept of the
regression line differ significantly from zero? Would you expect it to?
Why?
17 Non-parametric statistics
17.1 Introduction
Parametric tests are designed for analyzing data from normally distributed
populations. Although these tests are quite robust to departures from
normality, and major ones can often be reduced by transformation, there
are some cases where the population is so grossly non-normal that parametric testing is unwise. In these cases a powerful analysis can often still be
done by using a non-parametric test.
Non-parametric tests are not just alternatives to the parametric procedures for analyzing ratio, interval and ordinal data described in
Chapters 8 to 16. Often geoscientists obtain data that have been measured
on a nominal scale. For example, Table 3.2 gave data for the locations
of 594 tornadoes during the period from 1998–2007 in the southeastern
states of the US. This is a sample containing frequencies in several
discrete and mutually exclusive categories and there are non-parametric
tests for analyzing these types of data (Chapter 18).
17.2 The danger of assuming normality when a population is grossly non-normal
Parametric tests have been specifically designed for analyzing data from
populations with distributions shaped like a bell that is symmetrical about
the mean with 68.27% of values occurring within μ ± 1 standard deviation
and 95% within μ ± 1.96 standard deviations (Chapter 7). This distribution
is used to determine the range within which 95% of the values of the sample
mean, $\bar{X}$, will occur when samples of a particular size are taken from a
population. If $\bar{X}$ occurs outside the range of μ ± 1.96 SEM, the probability
the sample has come from that population is less than 5%. If the population