# 6 An example: school cancellations and snow

16.5.2 Testing whether the intercept of the regression line is significantly different to zero

The value for the intercept a calculated from a sample is only an estimate of

the population statistic α. Consequently, a positive or negative value of a

might be obtained in a sample from a population where α is zero. The

standard deviation of the points scattered around the regression line can be

used to calculate the 95% conﬁdence interval for a, and a single-sample t test

can be used to compare the value of a to zero or any other expected value.

Once again, most statistical packages include a test to determine if a diﬀers

signiﬁcantly from zero.

16.5.3 The coefficient of determination $r^2$

The coefficient of determination, symbolized by $r^2$, is a statistic that shows the proportion of the total variation of the values of Y from the average Y that is explained by the regression line. It is the regression sum of squares divided by the total sum of squares:

$$r^2 = \frac{\text{Sum of squares explained by the regression ((a) above)}}{\text{Total sum of squares ((a) + (b) above)}} \tag{16.10}$$

which will only ever be a number from zero to 1.0. If the points all lie along the regression line and it has a slope that is different from zero, the unexplained component (quantity (b)) will be zero and $r^2$ will be 1. If the explained sum of squares is small in relation to the unexplained, $r^2$ will be a small number.
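The ratio in Equation (16.10) can be sketched in a few lines of Python. This is only an illustration: the function name `r_squared` and both small data sets are hypothetical, and the slope and intercept are the ordinary least-squares values.

```python
def r_squared(x, y):
    """Regression sum of squares divided by total sum of squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx                # least-squares slope
    a = mean_y - b * mean_x      # least-squares intercept
    fitted = [a + b * xi for xi in x]
    # Quantity (a): variation explained by the regression line
    ss_regression = sum((fi - mean_y) ** 2 for fi in fitted)
    # Quantity (b): unexplained (residual) variation
    ss_residual = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return ss_regression / (ss_regression + ss_residual)

# Hypothetical points lying exactly on Y = 1 + 2X: nothing is unexplained
perfect = r_squared([1, 2, 3, 4], [3, 5, 7, 9])

# Hypothetical points scattered about a line: some variation is unexplained
noisy = r_squared([1, 2, 3, 4], [3, 4, 8, 8])

print(perfect, noisy)
```

The function divides by the total sum of squares, so it is not defined when every value of Y is identical.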

16.6 An example: school cancellations and snow

In places at high latitudes, heavy snowfalls are the dream of every young

student, because they bring the possibility of school closures (called “snow

days”), not to mention sledding, hot chocolate and additional sleep! A school

administrator looking for a way to predict the number of school closures on any

day in the city of St Paul, Minnesota hypothesized that it would be related to

the amount of snow that had fallen during the previous 24 hours. To test this,

they examined data from 10 snowfalls. These bivariate data for snowfall

(in cm) and the number of school closures on the following day are given in

Table 16.1.

Table 16.1 Data for 24-hour snowfall and the number of school closure days for each of 10 snowfalls.

| Snowfall (cm) | School closures |
|---|---|
| 3 | 5 |
| 6 | 13 |
| 9 | 16 |
| 12 | 14 |
| 15 | 18 |
| 18 | 23 |
| 21 | 20 |
| 24 | 32 |
| 27 | 29 |
| 30 | 28 |

Table 16.2 An example of the table of results from a regression analysis. The value of the intercept a (5.733) is given in the first row, labeled “(Constant)” under the heading “Value”. The slope b (0.853) is given in the second row (labeled as the independent variable “Snowfall”) under the heading “Value.” The final two columns give the results of t tests comparing a and b to zero. These show the intercept, a, is significantly different to zero (P = 0.035) and the slope b is also significantly different to zero (P < 0.001). The significant value of the intercept suggests that there may be other reasons for school closures (e.g. ice storms, frozen pipes), or perhaps the regression model is not very accurate.

| Model | Value | Std error | t | Significance |
|---|---|---|---|---|
| Constant | 5.733 | 2.265 | 2.531 | 0.035 |
| Snowfall | 0.853 | 0.122 | 7.006 | 0.001 |

From a regression analysis of these data a statistical package will give values for the equation for the regression line, plus a test of the hypotheses that the intercept, a, and slope, b, are from a population where α and β are zero. The output will be similar in format to Table 16.2.

From the results in Table 16.2 the equation for the regression line is school closures = 5.733 + 0.853 × snowfall. The slope is significantly different to zero (in this case it is positive) and the intercept is also significantly different to zero. You could use the regression equation to predict the number of school closures based on any snowfall between 3 and 30 cm.
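As a rough check on the "Value", "Std error" and "t" columns of Table 16.2, the figures can be recomputed from the data in Table 16.1. The sketch below assumes an ordinary least-squares fit and the usual standard-error formulas for a simple regression; the variable names are illustrative, and the critical value 2.306 (two-tailed, 5%, 8 degrees of freedom) is taken from a standard t table rather than from this chapter.

```python
import math

snowfall = [3, 6, 9, 12, 15, 18, 21, 24, 27, 30]    # cm, Table 16.1
closures = [5, 13, 16, 14, 18, 23, 20, 32, 29, 28]  # Table 16.1

n = len(snowfall)
mean_x = sum(snowfall) / n
mean_y = sum(closures) / n
sxx = sum((x - mean_x) ** 2 for x in snowfall)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(snowfall, closures))

b = sxy / sxx             # slope
a = mean_y - b * mean_x   # intercept

# Residual mean square: unexplained variation per degree of freedom (n - 2)
residual_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(snowfall, closures))
residual_ms = residual_ss / (n - 2)

se_b = math.sqrt(residual_ms / sxx)
se_a = math.sqrt(residual_ms * (1 / n + mean_x ** 2 / sxx))

# Single-sample t statistics comparing a and b to zero
t_a = a / se_a
t_b = b / se_b

T_CRIT = 2.306  # assumed: two-tailed 5% critical value of t for 8 df
print(f"a = {a:.3f}, SE = {se_a:.3f}, t = {t_a:.3f}, significant: {abs(t_a) > T_CRIT}")
print(f"b = {b:.3f}, SE = {se_b:.3f}, t = {t_b:.3f}, significant: {abs(t_b) > T_CRIT}")
```

A statistical package would report exact probabilities instead of comparing each t statistic to a tabulated critical value.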

Table 16.3 An example of the results of an analysis of the slope of a regression. The significant F ratio shows the slope is significantly different to zero.

| | Sum of squares | df | Mean square | F | Significance |
|---|---|---|---|---|---|
| Regression | 539.648 | 1 | 539.648 | 49.086 | 0.000 |
| Residual | 87.952 | 8 | 10.994 | | |
| Total | 627.600 | 9 | | | |

Most statistical packages will give an ANOVA of the slope. For the data in

Table 16.1 there is a signiﬁcant relationship between school closures and

snowfall (Table 16.3).
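The entries in Table 16.3 can be reproduced from the Table 16.1 data by partitioning the total sum of squares. This is a minimal sketch with illustrative variable names; the critical value 5.32 for F with 1 and 8 degrees of freedom at the 5% level is taken from a standard F table, not from this chapter.

```python
snowfall = [3, 6, 9, 12, 15, 18, 21, 24, 27, 30]    # cm, Table 16.1
closures = [5, 13, 16, 14, 18, 23, 20, 32, 29, 28]  # Table 16.1

n = len(snowfall)
mean_x = sum(snowfall) / n
mean_y = sum(closures) / n
sxx = sum((x - mean_x) ** 2 for x in snowfall)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(snowfall, closures))
b = sxy / sxx
a = mean_y - b * mean_x

# Partition the total variation in Y into explained and unexplained parts
ss_total = sum((y - mean_y) ** 2 for y in closures)
ss_regression = sum((a + b * x - mean_y) ** 2 for x in snowfall)
ss_residual = ss_total - ss_regression

df_regression = 1
df_residual = n - 2
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_ratio = ms_regression / ms_residual

F_CRIT = 5.32  # assumed: 5% critical value of F for 1 and 8 df
print(f"Regression: SS = {ss_regression:.3f}, df = {df_regression}, MS = {ms_regression:.3f}")
print(f"Residual:   SS = {ss_residual:.3f}, df = {df_residual}, MS = {ms_residual:.3f}")
print(f"Total:      SS = {ss_total:.3f}, df = {n - 1}")
print(f"F = {f_ratio:.3f}, significant: {f_ratio > F_CRIT}")
```

With a single independent variable the F ratio is the square of the t statistic for the slope, so the two tests lead to the same conclusion.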

Finally, the value of $r^2$ is also given. Sometimes there are two values: $r^2$, which is the statistic for the sample, and a value called “Adjusted” $r^2$, which is an estimate for the population from which the sample has been taken. The $r^2$ value is usually the one reported in the results of the regression. For the example above you would get the following values:

$$r = 0.927; \quad r^2 = 0.860; \quad \text{adjusted } r^2 = 0.842$$

This shows that 86% of the variation in school closures with snowfall can be

predicted by the regression line.
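These three values can be traced back to the sums of squares in Table 16.3. The adjustment shown here is the commonly used definition for a regression with n data points and p independent variables; it is an assumption that a given package uses exactly this form.

$$r^2 = \frac{539.648}{627.600} = 0.860, \qquad r = \sqrt{0.860} = 0.927$$

$$\text{adjusted } r^2 = 1 - (1 - r^2)\,\frac{n - 1}{n - p - 1} = 1 - (1 - 0.8599) \times \frac{9}{8} = 0.842$$

Here n = 10 snowfalls and p = 1 (snowfall is the only independent variable).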

16.7 Predicting a value of Y from a value of X

Because the regression line has the average slope through a set of scattered

points, the predicted value of Y is only the average expected for a given value

of X. If the $r^2$ value is 1.0, the value of Y will be predicted without error,

because all the data points will lie on the regression line. Usually, however,

the points will be scattered around the line. More advanced texts describe

how you can calculate the 95% conﬁdence interval for a value of Y and thus

predict its likely range.
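For readers who want to see what such a calculation looks like, here is a minimal sketch using the standard textbook formulas for a simple regression. These formulas are not given in this chapter; the function name `predict_with_interval` and its `individual` flag are illustrative, and the critical value of t must be supplied from a table for n − 2 degrees of freedom.

```python
import math

def predict_with_interval(x, y, x0, t_crit, individual=False):
    """Predicted Y at x0 with an interval of +/- t_crit standard errors.

    individual=False: interval for the average Y expected at x0.
    individual=True:  wider interval for a single new observation at x0.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    residual_ms = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    extra = 1.0 if individual else 0.0
    # The standard error grows as x0 moves away from the mean of X
    se = math.sqrt(residual_ms * (extra + 1 / n + (x0 - mean_x) ** 2 / sxx))
    y_hat = a + b * x0
    return y_hat, y_hat - t_crit * se, y_hat + t_crit * se

snowfall = [3, 6, 9, 12, 15, 18, 21, 24, 27, 30]    # cm, Table 16.1
closures = [5, 13, 16, 14, 18, 23, 20, 32, 29, 28]  # Table 16.1

# 2.306 is the two-tailed 5% critical value of t for 8 df (from a t table)
y_hat, low, high = predict_with_interval(snowfall, closures, 15, 2.306)
_, low_one, high_one = predict_with_interval(snowfall, closures, 15, 2.306, individual=True)
print(f"Average closures after 15 cm: {y_hat:.1f} ({low:.1f} to {high:.1f})")
print(f"Single snowfall of 15 cm:     {y_hat:.1f} ({low_one:.1f} to {high_one:.1f})")
```

For a 15 cm snowfall the predicted average is about 18.5 closures, give or take roughly 2.5. The interval for a single snowfall is much wider because it also includes the scatter of individual points about the line.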

16.8 Predicting a value of X from a value of Y

Often you might want to estimate a value of the independent variable X

from the dependent variable Y. Here is an example. Many elements absorb

energy of a very speciﬁc wavelength because the movement of electrons or

neutrons from one energy level to another within atoms is related to the

vibrational modes of crystal lattices. Therefore, the amount of energy

absorbed at that wavelength is dependent on the concentration of the

element present in a sample. Here it is tempting to designate the concentration of the element as the dependent variable and absorption as the

independent one and use regression in order to estimate the concentration

of the element present. This is inappropriate because concentration of an

element does not depend on the amount of energy absorbed or given oﬀ, so

one of the assumptions of regression would be violated.

Predicting X from Y can be done by rearranging the regression equation

for any point from:

$$Y_i = a + bX_i \tag{16.11}$$

to:

$$X_i = \frac{Y_i - a}{b} \tag{16.12}$$

but here too the 95% conﬁdence interval around the estimated value of X

must also be calculated because the measurement of Y is likely to include

some error. Methods for doing this are given in more advanced texts.
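The rearrangement in Equation (16.12) is simple to apply once a and b are known. The sketch below uses the coefficients from Table 16.2 purely as convenient numbers; the function name `predict_x` is illustrative, and the result is a point estimate only, with no confidence interval.

```python
def predict_x(y_value, a, b):
    """Estimate X from an observed Y by rearranging Y = a + bX."""
    if b == 0:
        # A zero slope means Y carries no information about X
        raise ValueError("slope is zero, so X cannot be estimated from Y")
    return (y_value - a) / b

a, b = 5.733, 0.853             # intercept and slope from Table 16.2
x_estimate = predict_x(20, a, b)
print(f"Estimated X for Y = 20: {x_estimate:.2f}")
```

Substituting the estimate back into Equation (16.11) returns the original value of Y, which is a quick way to check the arithmetic.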

16.9 The danger of extrapolating beyond the range of data available

Although regression analysis draws a line of best ﬁt through a set of data,

it is dangerous to make predictions beyond the measured range of X.

Figure 16.8 illustrates that a predicted regression line may not be a correct

estimation of the value of Y outside this range.
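A small numerical sketch shows how far wrong an extrapolation can go. The data here are hypothetical: Y is deliberately generated as the square of X, so the true relationship is curved, but only values of X from 1 to 5 are "measured" and a straight line is fitted to them.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for Y = a + bX."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

x_measured = [1, 2, 3, 4, 5]
y_measured = [xi ** 2 for xi in x_measured]   # hypothetical curved relationship

a, b = fit_line(x_measured, y_measured)

x_new = 10                      # well outside the measured range of 1 to 5
predicted = a + b * x_new       # value extrapolated from the straight line
actual = x_new ** 2             # value from the true (curved) relationship
print(f"Fitted line: Y = {a:.1f} + {b:.1f}X")
print(f"At X = {x_new}: predicted {predicted:.1f}, actual {actual}")
```

Within the measured range the line is a reasonable summary of the data, but at X = 10 it predicts about half of the true value.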

16.10 Assumptions of linear regression analysis

The procedure for linear regression analysis described in this chapter is

often described as a Model I regression, and makes several assumptions.

First, the values of Y are assumed to be from a population of values that

are normally and evenly distributed about the regression line, with no

gross heteroscedasticity. One easy way to check for this is to plot a graph

showing the residuals. For each data point, its vertical displacement on the Y axis either above or below the fitted regression line is the amount of residual variation that cannot be explained by the regression line, as described in Section 16.5.1. The residuals are calculated by subtraction (Table 16.4) and plotted on the Y axis against the values of X for each point. This will always give a plot where the regression line is re-expressed as a horizontal line with an intercept of zero.

Figure 16.8 It is risky to use a regression line to extrapolate values of Y beyond the measured range of X. The regression line (a) based on the data for values of X ranging from 1 to 5 does not necessarily give an accurate prediction (b) of the values of Y beyond that range. A classic example of such behavior is found in plots of the geothermal gradient of the Earth’s interior. At shallow depths, there is generally a linear increase in temperature of ~20 K/km depth, depending on location, but the rate increases as you go deeper into the mantle.

If the original data points are uniformly scattered about the original

regression line, the scatter plot of the residuals will be evenly dispersed in

a band above and below zero (Figure 16.9). If there is heteroscedasticity the

band will vary in width as X increases or decreases. Most statistical packages

will give a plot of the residuals for a set of bivariate data.
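The residuals in Table 16.4 can be reproduced by subtraction, as in the sketch below. The comparison of the average size of the residuals in the lower and upper halves of the X range is only a crude, informal check added here for illustration; it is not a formal test for heteroscedasticity and it assumes the data are sorted by X.

```python
# Original data from Table 16.4 and the fitted line Y = 10.8 + 0.9X
x = [1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14]
y = [13, 12, 14, 17, 17, 15, 17, 21, 20, 19, 21, 25]
A, B = 10.8, 0.9

# Residual = observed Y minus the value of Y calculated from the line
residuals = [round(yi - (A + B * xi), 1) for xi, yi in zip(x, y)]

# Crude check: is the scatter about zero similar at low and high X?
half = len(x) // 2
spread_low = sum(abs(r) for r in residuals[:half]) / half
spread_high = sum(abs(r) for r in residuals[half:]) / (len(x) - half)

for xi, r in zip(x, residuals):
    print(f"X = {xi:2d}  residual = {r:+.1f}")
print(f"Mean absolute residual: low X = {spread_low:.2f}, high X = {spread_high:.2f}")
```

The two averages are similar, which is consistent with the even band of residuals described for these data.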

Table 16.4 Original data and fitted regression line of Y = 10.8 + 0.9X. The residual for each point is its vertical displacement from the regression line. Each residual is plotted on the Y axis against the original value of X for that point to give a graph showing the spread of the points about a line of zero slope and intercept.

| Original data: X | Original data: Y | Calculated value of Ŷ from regression equation | Residual plot: value of X (from original data) | Residual plot: value of (Y − Ŷ) |
|---|---|---|---|---|
| 1 | 13 | 11.7 | 1 | 1.3 |
| 3 | 12 | 13.5 | 3 | −1.5 |
| 4 | 14 | 14.4 | 4 | −0.4 |
| 5 | 17 | 15.3 | 5 | 1.7 |
| 6 | 17 | 16.2 | 6 | 0.8 |
| 7 | 15 | 17.1 | 7 | −2.1 |
| 8 | 17 | 18.0 | 8 | −1.0 |
| 9 | 21 | 18.9 | 9 | 2.1 |
| 10 | 20 | 19.8 | 10 | 0.2 |
| 11 | 19 | 20.7 | 11 | −1.7 |
| 12 | 21 | 21.6 | 12 | −0.6 |
| 14 | 25 | 23.4 | 14 | 1.6 |

Second, it is assumed the independent variable X is measured without

error. This is often diﬃcult and many texts note that X should be measured

with little error. For example, levels of an independent variable determined

by the experimenter, such as the relative % humidity, are usually

measured with very little error indeed. In contrast, variables such as the

depth of snowfall from a windy blizzard, or the in situ temperature of a

violently erupting magma, are likely to be measured with a great deal of

error. When the independent variable is subject to error, a different analysis

called Model II regression is appropriate. Again, this is described in more advanced texts.

Third, it is assumed that the dependent variable is determined by the

independent variable. This was discussed in Section 16.2.

Fourth, the relationship between X and Y is assumed to be linear and it is

important to be conﬁdent of this before carrying out the analysis. A scatter

plot of the data should be drawn to look for any obvious departures from

linearity. In some cases it may be possible to transform the Y variable (see Chapter 13) to give a linear relationship and proceed with a regression analysis on the transformed data.

Figure 16.9 (a) Plot of original data in Table 16.4, with fitted regression line Y = 10.8 + 0.9X. (b) The plot of the residual (Y − Ŷ) against the value of X for each data point shows a relatively even scatter about the horizontal line. (c) General form of residual plot for data that are homoscedastic. (d) Residual plot showing one example of heteroscedasticity, where the variance of the residuals decreases with X.

16.11 Multiple linear regression

Multiple linear regression is a straightforward extension of simple linear

regression. The simple linear regression equation:

$$Y_i = a + bX_i \tag{16.13, copied from 16.1}$$

examines the relationship between the value of a variable Y and another

variable X. Often, however, the value of Y might depend upon more than

one variable. For example, the sediment yield of a river may be dependent

on its drainage area plus other factors such as topographic relief, precipitation and ﬂow rate.

Therefore, the regression equation could be extended:

$$Y_i = a + b_1 X_{1i} + b_2 X_{2i} \tag{16.14}$$

which is just Equation (16.13) plus a second independent variable with its own coefficient $b_2$ and values ($X_{2i}$). You will notice that there is now a double subscript after the two values of X, in order to specify the first variable (e.g. drainage area) as $X_1$ and the second (e.g. topographic relief) as $X_2$.

Equation (16.14) can be further extended to include additional variables such as precipitation and flow rate:

$$Y_i = a + b_1 X_{1i} + b_2 X_{2i} + b_3 X_{3i} + b_4 X_{4i} + \ldots \text{ etc.} \tag{16.15}$$
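To make the extended equation concrete, here is a minimal sketch that estimates a, $b_1$ and $b_2$ by least squares, using the normal equations solved by Gaussian elimination. The data are hypothetical and were generated exactly from Y = 2 + 3X1 − X2, so the fit should recover those coefficients. The function names are illustrative, and a statistical package would use a more numerically robust method and also report the significance of each slope.

```python
def solve(matrix, rhs):
    """Solve a square linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(matrix, rhs)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("independent variables are perfectly collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    solution = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (m[i][n] - known) / m[i][i]
    return solution

def fit_multiple(predictors, y):
    """Return [a, b1, b2, ...] for Y = a + b1*X1 + b2*X2 + ..."""
    # Each row of the design matrix is [1, X1i, X2i, ...]
    design = [[1.0] + [col[i] for col in predictors] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)] for p in range(k)]
    xty = [sum(row[p] * yi for row, yi in zip(design, y)) for p in range(k)]
    return solve(xtx, xty)

# Hypothetical data generated from Y = 2 + 3*X1 - 1*X2 with no scatter
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [2 + 3 * u - v for u, v in zip(x1, x2)]

coefficients = fit_multiple([x1, x2], y)
print([round(c, 6) for c in coefficients])
```

Adding a third or fourth independent variable only requires passing another list to `fit_multiple`, which mirrors the way Equation (16.15) extends Equation (16.14).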

The mathematics of multiple linear regression is complex, but a statistical

package will give the overall signiﬁcance of the regression and, more

importantly, the value for the slope (and its signiﬁcance) for each of the

independent variables.

If the initial analysis shows that an independent variable has no signiﬁcant eﬀect on Y, the variable can be removed from the equation and the

analysis rerun. This process of reﬁning the model can be repeated until only

signiﬁcant independent variables remain, thereby giving the best possible

model for predicting Y. There are several procedures for reﬁning, but the

one most frequently recommended is to initially include all independent

variables, run the analysis and examine the results for the ones that do not

appear to aﬀect the value of Y (i.e. variables with non-signiﬁcant values

of b). The least signiﬁcant is removed and the analysis rerun. This process,

called backward elimination, is repeated until only signiﬁcant variables

remain.

16.12 Further topics in regression

This chapter is an introduction to linear regression analysis. More advanced

analyses include procedures for comparing the slopes and intercepts of two

or more regression lines. Non-linear regression models can be ﬁtted to data

where the relationship between X and Y is exponential, logarithmic or even

more complex. The understanding of simple linear regression developed

here is an essential introduction to these methods, which will be discussed

further in relation to sequence analysis (Chapter 21).

16.13 Questions

(1) An easy way to help understand regression is to work through a simple

contrived example. The set of data below will give a regression with

a slope of 0 and an intercept of 10, so the line will have the equation

Y = 10 + 0X:

| X | Y |
|---|---|
| 0 | 10 |
| 0 | 9 |
| 0 | 11 |
| 1 | 10 |
| 1 | 9 |
| 1 | 11 |
| 2 | 10 |
| 2 | 9 |
| 2 | 11 |

(a) Use a statistical package to run the regression. What is the value of $r^2$ for this relationship? Is the slope of the regression significant? (b) Next, modify the data to give an intercept of 20, but with a slope that is still zero. (c) Finally, modify the data to give a negative slope that is significant.

(2) The table below gives data for the weight of alluvial gold recovered from

diﬀerent volumes of stream gravel.

| Volume of gravel processed (m³) | Weight of gold recovered (grams) |
|---|---|
| 1 | 0.025 |
| 2 | 0.042 |
| 3 | 0.071 |
| 4 | 0.103 |
| 5 | 0.111 |
| 6 | 0.142 |
| 7 | 0.164 |
| 8 | 0.191 |
| 9 | 0.220 |

(a) Run a regression analysis, where the volume of gravel is the independent variable. What is the value of $r^2$? Is the relationship significant?

What is the equation for the relationship between the weight of gold

recovered and the volume of gravel processed? Does the intercept of the

regression line diﬀer signiﬁcantly from zero? Would you expect it to?

Why?

17 Non-parametric statistics

17.1 Introduction

Parametric tests are designed for analyzing data from normally distributed

populations. Although these tests are quite robust to departures from

normality, and major ones can often be reduced by transformation, there

are some cases where the population is so grossly non-normal that parametric testing is unwise. In these cases a powerful analysis can often still be

done by using a non-parametric test.

Non-parametric tests are not just alternatives to the parametric procedures for analyzing ratio, interval and ordinal data described in

Chapters 8 to 16. Often geoscientists obtain data that have been measured

on a nominal scale. For example, Table 3.2 gave data for the locations

of 594 tornadoes during the period from 1998–2007 in the southeastern

states of the US. This is a sample containing frequencies in several

discrete and mutually exclusive categories and there are non-parametric

tests for analyzing these types of data (Chapter 18).

17.2 The danger of assuming normality when a population is grossly non-normal

Parametric tests have been speciﬁcally designed for analyzing data from

populations with distributions shaped like a bell that is symmetrical about

the mean with 68.27% of values occurring within μ ± 1 standard deviation

and 95% within μ ± 1.96 standard deviations (Chapter 7). This distribution

is used to determine the range within which 95% of the values of the sample

mean, X̄, will occur when samples of a particular size are taken from a

population. If X̄ occurs outside the range of μ ± 1.96 SEM, the probability

the sample has come from that population is less than 5%. If the population
