3.6 Assessing the Utility of the Model: Making Inferences About the Slope β1
110 Chapter 3 Simple Linear Regression
Figure 3.8 Graphing the model with β1 = 0: y = β0 + ε
The appropriate test statistic is found by considering the sampling distribution
of βˆ1 , the least squares estimator of the slope β1 .
Sampling Distribution of βˆ1

If we make the four assumptions about ε (see Section 3.4), then the sampling distribution of βˆ1, the least squares estimator of the slope, will be a normal distribution with mean β1 (the true slope) and standard deviation

σβˆ1 = σ/√SSxx

(See Figure 3.9.)
Figure 3.9 Sampling distribution of βˆ1
Since σ will usually be unknown, the appropriate test statistic will generally be a Student's t statistic formed as follows:

t = (βˆ1 − Hypothesized value of β1)/sβˆ1 = (βˆ1 − 0)/(s/√SSxx)

where sβˆ1 = s/√SSxx
Note that we have substituted the estimator s for σ, and then formed sβˆ1 by dividing s by √SSxx. The number of degrees of freedom associated with this t statistic
is the same as the number of degrees of freedom associated with s. Recall that this
will be (n − 2) df when the hypothesized model is a straight line (see Section 3.5).
The test of the utility of the model is summarized in the next box.
Test of Model Utility: Simple Linear Regression

Test statistic: t = βˆ1/sβˆ1 = βˆ1/(s/√SSxx)

                  ONE-TAILED TESTS               TWO-TAILED TEST
H0: β1 = 0        H0: β1 = 0                     H0: β1 = 0
Ha: β1 < 0        Ha: β1 > 0                     Ha: β1 ≠ 0

Rejection region:
t < −tα           t > tα                         |t| > tα/2

p-value:
P(t < tc)         P(t > tc)                      2P(t > tc) if tc is positive
                                                 2P(t < tc) if tc is negative

Decision: Reject H0 if α > p-value, or if the test statistic falls in the rejection region, where P(t > tα) = α, P(t > tα/2) = α/2, tc = calculated value of the test statistic, the t-distribution is based on (n − 2) df, and α = P(Type I error) = P(Reject H0 | H0 true).

Assumptions: The four assumptions about ε listed in Section 3.4.
For the advertising–sales example, we will choose α = .05 and, since n = 5, df =
(n − 2) = 5 − 2 = 3. Then the rejection region for the two-tailed test is
|t| > t.025 = 3.182
We previously calculated βˆ1 = .7, s = .61, and SSxx = 10. Thus,

t = βˆ1/(s/√SSxx) = .7/(.61/√10) = .7/.19 = 3.7
Since this calculated t-value falls in the upper-tail rejection region (see Figure 3.10),
we reject the null hypothesis and conclude that the slope β1 is not 0. The sample
evidence indicates that advertising expenditure x contributes information for the
prediction of sales revenue y using a linear model.
Figure 3.10 Rejection region and calculated t-value for testing whether the slope β1 = 0 (α/2 = .025 in each tail; rejection regions t < −3.182 and t > 3.182; calculated t = 3.7)
We can reach the same conclusion by using the observed signiﬁcance level
(p-value) of the test obtained from a computer printout. The SAS printout for
the advertising–sales example is reproduced in Figure 3.11. The test statistic and
two-tailed p-value are highlighted on the printout. Since p-value = .0354 is smaller
than α = .05, we will reject H0 .
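Both the test statistic and the p-value are easy to check numerically. The sketch below assumes raw data x = 1, 2, 3, 4, 5 (hundreds of dollars) and y = 1, 1, 2, 2, 4 (thousands of dollars); these values are not shown in this section, but they are consistent with the quoted statistics βˆ1 = .7, s = .61, and SSxx = 10. The closed-form tail probability used here is specific to 3 df:

```python
import math

# Advertising-sales data (an assumption, chosen to reproduce the
# text's summary statistics: beta1_hat = .7, s = .61, SS_xx = 10).
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)

ss_xx = sum(a * a for a in x) - sum(x) ** 2 / n                  # 10
ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n   # 7
ss_yy = sum(b * b for b in y) - sum(y) ** 2 / n                  # 6

beta1_hat = ss_xy / ss_xx                 # least squares slope = .7
sse = ss_yy - beta1_hat * ss_xy           # SSE = SS_yy - beta1_hat * SS_xy
s = math.sqrt(sse / (n - 2))              # s ≈ .61 with (n - 2) = 3 df
t = beta1_hat / (s / math.sqrt(ss_xx))    # ≈ 3.66, i.e., 3.7 to one decimal


def t3_sf(t_val):
    """P(T > t) for a Student's t with exactly 3 df (closed form)."""
    u = t_val / math.sqrt(3)
    return 0.5 - (u / (1 + u ** 2) + math.atan(u)) / math.pi


p_two_tailed = 2 * t3_sf(t)               # ≈ .0354, matching the SAS printout
```

With general df you would use a statistical library instead of the 3-df shortcut; the point of the sketch is only that t = βˆ1/(s/√SSxx) and its two-tailed p-value reproduce the printout values.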
Figure 3.11 SAS printout for advertising–sales regression
What conclusion can be drawn if the calculated t-value does not fall in the
rejection region? We know from previous discussions of the philosophy of hypothesis
testing that such a t-value does not lead us to accept the null hypothesis. That is,
we do not conclude that β1 = 0. Additional data might indicate that β1 differs from
0, or a more complex relationship may exist between x and y, requiring the ﬁtting
of a model other than the straight-line model. We discuss several such models in
Chapter 4.
Another way to make inferences about the slope β1 is to estimate it using a
conﬁdence interval. This interval is formed as shown in the next box.
A 100(1 − α)% Confidence Interval for the Simple Linear Regression Slope β1

βˆ1 ± (tα/2)sβˆ1

where sβˆ1 = s/√SSxx and tα/2 is based on (n − 2) df
For the advertising–sales example, a 95% confidence interval for the slope β1 is

βˆ1 ± (t.025)sβˆ1 = .7 ± (3.182)(s/√SSxx) = .7 ± (3.182)(.61/√10) = .7 ± .61 = (.09, 1.31)
This 95% conﬁdence interval for the slope parameter β1 is also shown (highlighted) at the bottom of the SAS printout, Figure 3.11.
Remembering that y is recorded in units of $1,000 and x in units of $100, we
can say, with 95% conﬁdence, that the mean monthly sales revenue will increase
between $90 and $1,310 for every $100 increase in monthly advertising expenditure.
Since all the values in this interval are positive, it appears that β1 is positive and
that the mean of y, E(y), increases as x increases. However, the rather large width of
the conﬁdence interval reﬂects the small number of data points (and, consequently,
a lack of information) in the experiment. We would expect a narrower interval if
the sample size were increased.
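The interval computation is a one-liner once the standard error is in hand. A minimal sketch, using the summary statistics quoted in the text (s is kept at its unrounded value .6055 so the endpoints match the printout; t.025 = 3.182 with 3 df is taken from the text):

```python
import math

# Summary statistics from the advertising-sales example.
beta1_hat = 0.7
s = 0.6055       # text reports .61 after rounding
ss_xx = 10
t_025 = 3.182    # t-critical value, 3 df, from the text

s_beta1 = s / math.sqrt(ss_xx)            # estimated std. error of the slope
lower = beta1_hat - t_025 * s_beta1
upper = beta1_hat + t_025 * s_beta1       # interval ≈ (.09, 1.31)
```

Since x is in units of $100 and y in units of $1,000, the endpoints translate directly into the "$90 to $1,310 per $100 of advertising" interpretation given above.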
3.6 Exercises
3.23 Learning the mechanics. Do the data provide sufﬁcient evidence to indicate that β1 differs from
0 for the least squares analyses in the following
exercises? Use α = .05.
(a) Exercise 3.6 (b) Exercise 3.7
3.24 Predicting home sales price. Refer to the data on
sale prices and total appraised values of 76 residential properties in an upscale Tampa, Florida,
neighborhood, Exercise 3.8 (p. 100). An SPSS simple linear regression printout for the analysis is
reproduced at the bottom of the page.
(a) Use the printout to determine whether there
is a positive linear relationship between
appraised property value x and sale price y
for residential properties sold in this neighborhood. That is, determine if there is sufﬁcient
evidence (at α = .01) to indicate that β1 , the
slope of the straight-line model, is positive.
(b) Find a 95% conﬁdence interval for the slope,
β1 , on the printout. Interpret the result practically.
(c) What can be done to obtain a narrower conﬁdence interval in part b?
3.25 Sweetness of orange juice. Refer to Exercise 3.13
(p. 102) and the simple linear regression relating
the sweetness index (y) of an orange juice sample
to the amount of water-soluble pectin (x) in the
juice. Find a 90% conﬁdence interval for the true
slope of the line. Interpret the result.
3.26 English as a second language reading ability.
What are the factors that allow a native Spanish-speaking person to understand and read English?
SPSS Output for Exercise 3.24
A study published in the Bilingual Research Journal (Summer 2006) investigated the relationship
of Spanish (ﬁrst language) grammatical knowledge to English (second language) reading. The
study involved a sample of n = 55 native Spanish-speaking adults who were students in an English as
a second language (ESL) college class. Each student took four standardized exams: Spanish grammar (SG), Spanish reading (SR), English grammar
(EG), and English reading (ESLR). Simple linear regressions were used to model the ESLR
score (y) as a function of each of the other exam
scores (x). The results are summarized in the
table.
INDEPENDENT VARIABLE (x)   p-VALUE FOR TESTING H0: β1 = 0
SG score                   .739
SR score                   .012
ER score                   .022
(a) At α = .05, is there sufﬁcient evidence to indicate that ESLR score is linearly related to SG
score?
(b) At α = .05, is there sufﬁcient evidence to indicate that ESLR score is linearly related to SR
score?
(c) At α = .05, is there sufﬁcient evidence to indicate that ESLR score is linearly related to ER
score?
3.27 Reaction to a visual stimulus. How do eye and
head movements relate to body movements when
reacting to a visual stimulus? Scientists at the
California Institute of Technology designed an
experiment to answer this question and reported
their results in Nature (August 1998). Adult male
rhesus monkeys were exposed to a visual stimulus
(i.e., a panel of light-emitting diodes) and their eye,
head, and body movements were electronically
recorded. In one variation of the experiment, two
variables were measured: active head movement
(x, percent per degree) and body plus head rotation (y, percent per degree). The data for n = 39
trials were subjected to a simple linear regression analysis, with the following results: βˆ1 = .88, sβˆ1 = .14.
(a) Conduct a test to determine whether the two
variables, active head movement x and body
plus head rotation y, are positively linearly
related. Use α = .05.
(b) Construct and interpret a 90% conﬁdence
interval for β1 .
(c) The scientists want to know if the true slope
of the line differs signiﬁcantly from 1. Based
on your answer to part b, make the appropriate inference.
3.28 Massage therapy for boxers. The British Journal of Sports Medicine (April 2000) published a study of the effect of massage on boxing performance. Two variables measured on the boxers were blood lactate concentration (mM) and the boxer's perceived recovery (28-point scale). Based on information provided in the article, the data in the table were obtained for 16 five-round boxing performances, where a massage was given to the boxer between rounds. Conduct a test to determine whether blood lactate level (y) is linearly related to perceived recovery (x). Use α = .10.

BOXING2

BLOOD LACTATE LEVEL   PERCEIVED RECOVERY
3.8                   7
4.2                   7
4.8                   11
4.1                   12
5.0                   12
5.3                   12
4.2                   13
2.4                   17
3.7                   17
5.3                   17
5.8                   18
6.0                   18
5.9                   21
6.3                   21
5.5                   20
6.5                   24

Source: Hemmings, B., Smith, M., Graydon, J., and Dyson, R. ‘‘Effects of massage on physiological restoration, perceived recovery, and repeated sports performance,’’ British Journal of Sports Medicine, Vol. 34, No. 2, Apr. 2000 (data adapted from Figure 3).
NAMEGAME2
3.29 Recalling names of students. Refer to the Journal of Experimental Psychology—Applied (June
2000) name retrieval study, Exercise 3.15 (p. 103).
Recall that the goal of the study was to investigate the linear trend between proportion of names
recalled (y) and position (order) of the student
(x) during the ‘‘name game.’’ Is there sufﬁcient
evidence (at α = .01) of a linear trend? Answer
the question by analyzing the data for 144 students
saved in the NAMEGAME2 ﬁle.
LIQUIDSPILL
3.30 Spreading rate of spilled liquid. Refer to the
Chemical Engineering Progress (January 2005)
study of the rate at which a spilled volatile liquid
(methanol) will spread across a surface, Exercise
3.16 (p. 104). Consider a straight-line model relating mass of the spill (y) to elapsed time of the
spill (x). Recall that the data are saved in the
LIQUIDSPILL ﬁle.
(a) Is there sufﬁcient evidence (at α = .05) to indicate that the spill mass (y) tends to diminish
linearly as time (x) increases?
(b) Give an interval estimate (with 95% conﬁdence) of the decrease in spill mass for each
minute of elapsed time.
3.31 Pain empathy and brain activity. Empathy refers
to being able to understand and vicariously feel
what others actually feel. Neuroscientists at University College London investigated the relationship between brain activity and pain-related
empathy in persons who watch others in pain
(Science, February 20, 2004). Sixteen couples participated in the experiment. The female partner
watched while painful stimulation was applied to
the ﬁnger of her male partner. Two variables were
measured for each female: y = pain-related brain
activity (measured on a scale ranging from −2 to
2) and x = score on the Empathic Concern Scale
(0–25 points). The data are listed in the next
table (p. 115). The research question of interest
was: ‘‘Do people scoring higher in empathy show
higher pain-related brain activity?’’ Use simple
linear regression analysis to answer the research
question.
BRAINPAIN

COUPLE   BRAIN ACTIVITY (y)   EMPATHIC CONCERN (x)
1        .05                  12
2        −.03                 13
3        .12                  14
4        .20                  16
5        .35                  16
6        0                    17
7        .26                  17
8        .50                  18
9        .20                  18
10       .21                  18
11       .45                  19
12       .30                  20
13       .20                  21
14       .22                  22
15       .76                  23
16       .35                  24

Source: Singer, T. et al. ‘‘Empathy for pain involves the affective but not sensory components of pain,’’ Science, Vol. 303, Feb. 20, 2004 (data adapted from Figure 4).

HEAT

3.32 Thermal characteristics of fin-tubes. Refer to the Journal of Heat Transfer study of the straight-line relationship between heat transfer enhancement (y) and unflooded area ratio (x), Exercise 3.22 (p. 109). Construct a 95% confidence interval for β1, the slope of the line. Interpret the result.

3.33 Does elevation impact hitting performance in baseball? The Colorado Rockies play their major league home baseball games in Coors Field, Denver. Each year, the Rockies are among the leaders in team batting statistics (e.g., home runs, batting average, and slugging percentage). Many baseball experts attribute this phenomenon to the ‘‘thin air’’ of Denver—called the ‘‘mile-high’’ city due to its elevation. Chance (Winter 2006) investigated the effects of elevation on slugging percentage in Major League Baseball. Data were compiled on players’ composite slugging percentage at each of 29 cities for the 2003 season, as well as each city’s elevation (feet above sea level). The data are saved in the MLBPARKS file. (Selected observations are shown in the table.) Consider a straight-line model relating slugging percentage (y) to elevation (x).
(a) The model was fit to the data using MINITAB, with the results shown in the accompanying printout. Locate the estimates of the model parameters on the printout.
(b) Is there sufficient evidence (at α = .01) of a positive linear relationship between elevation (x) and slugging percentage (y)? Use the p-value shown on the printout to make the inference.
(c) Construct a scatterplot for the data and draw the least squares line on the graph. Locate the data point for Denver on the graph. What do you observe?
(d) Remove the data point for Denver from the data set and refit the straight-line model to the remaining data. Repeat parts a and b. What conclusions can you draw about the ‘‘thin air’’ theory from this analysis?

MLBPARKS (Selected observations)

CITY            SLUG PCT.   ELEVATION
Anaheim         .480        160
Arlington       .605        616
Atlanta         .530        1050
Baltimore       .505        130
Boston          .505        20
Denver          .625        5277
Seattle         .550        350
San Francisco   .510        63
St. Louis       .570        465
Tampa           .500        10
Toronto         .535        566

Source: Schaffer, J. & Heiny, E.L. ‘‘The effects of elevation on slugging percentage in Major League Baseball,’’ Chance, Vol. 19, No. 1, Winter 2006 (adapted from Figure 2).
3.7 The Coefﬁcient of Correlation
The claim is often made that the crime rate and the unemployment rate are ‘‘highly
correlated.’’ Another popular belief is that IQ and academic performance are
‘‘correlated.’’ Some people even believe that the Dow Jones Industrial Average and
the lengths of fashionable skirts are ‘‘correlated.’’ Thus, the term correlation implies
a relationship or ‘‘association’’ between two variables.
The Pearson product moment correlation coefﬁcient r, deﬁned in the box,
provides a quantitative measure of the strength of the linear relationship between
x and y, just as does the least squares slope βˆ1 . However, unlike the slope, the
correlation coefﬁcient r is scaleless. The value of r is always between −1 and +1,
regardless of the units of measurement used for the variables x and y.
Definition 3.3 The Pearson product moment coefficient of correlation r is a measure of the strength of the linear relationship between two variables x and y. It is computed (for a sample of n measurements on x and y) as follows:

r = SSxy/√(SSxx SSyy)
Note that r is computed using the same quantities used in ﬁtting the least
squares line. Since both r and βˆ1 provide information about the utility of the model,
it is not surprising that there is a similarity in their computational formulas. In
particular, note that SSxy appears in the numerators of both expressions and, since
both denominators are always positive, r and βˆ1 will always be of the same sign
(either both positive or both negative).
A value of r near or equal to 0 implies little or no linear relationship between
y and x. In contrast, the closer r is to 1 or −1, the stronger the linear relationship
between y and x. And, if r = 1 or r = −1, all the points fall exactly on the least
squares line. Positive values of r imply that y increases as x increases; negative
values imply that y decreases as x increases. Each of these situations is portrayed in
Figure 3.12.
We demonstrate how to calculate the coefficient of correlation r using the data in Table 3.1 for the advertising–sales example. The quantities needed to calculate r are SSxy, SSxx, and SSyy. The first two quantities have been calculated previously and are repeated here for convenience:

SSxy = 7, SSxx = 10, SSyy = Σy² − (Σy)²/n = 26 − (10)²/5 = 26 − 20 = 6
We now find the coefficient of correlation:

r = SSxy/√(SSxx SSyy) = 7/√((10)(6)) = 7/√60 = .904
The fact that r is positive and near 1 in value indicates that monthly sales revenue y tends to increase as advertising expenditures x increases—for this sample of five months. This is the same conclusion we reached when we found the calculated value of the least squares slope to be positive.

Figure 3.12 Values of r and their implications
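The computation in Definition 3.3 can be sketched in a few lines. The raw data below are an assumption (values consistent with the sums of squares SSxy = 7, SSxx = 10, and SSyy = 6 quoted above, but not listed in this section):

```python
import math

# Advertising-sales data consistent with the text's sums of squares.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)

ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n   # 7
ss_xx = sum(a * a for a in x) - sum(x) ** 2 / n                  # 10
ss_yy = sum(b * b for b in y) - sum(y) ** 2 / n                  # 6

r = ss_xy / math.sqrt(ss_xx * ss_yy)    # 7/sqrt(60) ≈ .904
```

Note that r inherits its sign from SSxy, just as the slope βˆ1 = SSxy/SSxx does, which is why the two statistics always agree in sign.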
Example 3.1
Legalized gambling is available on several riverboat casinos operated by a city
in Mississippi. The mayor of the city wants to know the correlation between the
number of casino employees and yearly crime rate. The records for the past 10 years
are examined, and the results listed in Table 3.5 are obtained. Find and interpret the
coefﬁcient of correlation r for the data.
Solution
Rather than use the computing formula given in Deﬁnition 3.3, we resort to a
statistical software package. The data of Table 3.5 were entered into a computer
CASINO

Table 3.5 Data on casino employees and crime rate, Example 3.1

Year   Number of Casino Employees x (thousands)   Crime Rate y (number of crimes per 1,000 population)
2000   15                                          1.35
2001   18                                          1.63
2002   24                                          2.33
2003   22                                          2.41
2004   25                                          2.63
2005   29                                          2.93
2006   30                                          3.41
2007   32                                          3.26
2008   35                                          3.63
2009   38                                          4.15
Figure 3.13 MINITAB
correlation printout for
Example 3.1
and MINITAB was used to compute r. The MINITAB printout is shown in
Figure 3.13.
The coefﬁcient of correlation, highlighted on the printout, is r = .987. Thus,
the size of the casino workforce and crime rate in this city are very highly correlated—at least over the past 10 years. The implication is that a strong positive
linear relationship exists between these variables (see Figure 3.14). We must be
careful, however, not to jump to any unwarranted conclusions. For instance, the
Figure 3.14 MINITAB
scatterplot for Example 3.1
The Coefﬁcient of Correlation
119
mayor may be tempted to conclude that hiring more casino workers next year will
increase the crime rate—that is, that there is a causal relationship between the two
variables. However, high correlation does not imply causality. The fact is, many
things have probably contributed both to the increase in the casino workforce and
to the increase in crime rate. The city’s tourist trade has undoubtedly grown since
legalizing riverboat casinos and it is likely that the casinos have expanded both in
services offered and in number. We cannot infer a causal relationship on the basis
of high sample correlation. When a high correlation is observed in the sample data,
the only safe conclusion is that a linear trend may exist between x and y. Another
variable, such as the increase in tourism, may be the underlying cause of the high
correlation between x and y.
Warning
High correlation does not imply causality. If a large positive or negative value of
the sample correlation coefﬁcient r is observed, it is incorrect to conclude that a
change in x causes a change in y. The only valid conclusion is that a linear trend
may exist between x and y.
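The correlation reported on the MINITAB printout can be verified directly from Table 3.5. A minimal sketch using the computing formula of Definition 3.3:

```python
import math

# Casino workforce (thousands) and crime rate (crimes per 1,000), Table 3.5.
x = [15, 18, 24, 22, 25, 29, 30, 32, 35, 38]
y = [1.35, 1.63, 2.33, 2.41, 2.63, 2.93, 3.41, 3.26, 3.63, 4.15]
n = len(x)

ss_xx = sum(a * a for a in x) - sum(x) ** 2 / n
ss_yy = sum(b * b for b in y) - sum(y) ** 2 / n
ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n

r = ss_xy / math.sqrt(ss_xx * ss_yy)    # ≈ .987, as on the printout
```

The code reproduces the sample correlation; it says nothing, of course, about causation, which is exactly the warning above.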
Keep in mind that the correlation coefﬁcient r measures the correlation between
x-values and y-values in the sample, and that a similar linear coefﬁcient of correlation
exists for the population from which the data points were selected. The population
correlation coefﬁcient is denoted by the symbol ρ (rho). As you might expect, ρ is
estimated by the corresponding sample statistic, r. Or, rather than estimating ρ, we
might want to test
H0 : ρ = 0
against
Ha : ρ ≠ 0
That is, we might want to test the hypothesis that x contributes no information for
the prediction of y, using the straight-line model against the alternative that the
two variables are at least linearly related. However, we have already performed this
identical test in Section 3.6 when we tested H0 : β1 = 0 against Ha : β1 ≠ 0.
It can be shown (proof omitted) that r = βˆ1√(SSxx/SSyy). Thus, βˆ1 = 0 implies r = 0, and vice versa. Consequently, the null hypothesis H0 : ρ = 0 is equivalent to the
hypothesis H0 : β1 = 0. When we tested the null hypothesis H0 : β1 = 0 in connection
with the previous example, the data led to a rejection of the null hypothesis for
α = .05. This implies that the null hypothesis of a zero linear correlation between the
two variables, crime rate and number of employees, can also be rejected at α = .05.
The only real difference between the least squares slope βˆ1 and the coefﬁcient of
correlation r is the measurement scale.∗ Therefore, the information they provide
about the utility of the least squares model is to some extent redundant. Furthermore,
the slope βˆ1 gives us additional information on the amount of increase (or decrease)
in y for every 1-unit increase in x. For this reason, we recommend using the slope
to make inferences about the existence of a positive or negative linear relationship
between two variables.
For those who prefer to test for a linear relationship between two variables using the coefﬁcient of correlation r, we outline the procedure in the
following box.
∗ The estimated slope, βˆ1, is measured in the same units as y. However, the correlation coefficient r is independent of scale.
Test of Hypothesis for Linear Correlation

Test statistic: t = r√(n − 2)/√(1 − r²)

                  ONE-TAILED TESTS               TWO-TAILED TEST
H0: ρ = 0         H0: ρ = 0                      H0: ρ = 0
Ha: ρ < 0         Ha: ρ > 0                      Ha: ρ ≠ 0

Rejection region:
t < −tα           t > tα                         |t| > tα/2

p-value:
P(t < tc)         P(t > tc)                      2P(t > tc) if tc is positive
                                                 2P(t < tc) if tc is negative

Decision: Reject H0 if α > p-value, or if the test statistic falls in the rejection region, where P(t > tα) = α, P(t > tα/2) = α/2, tc = calculated value of the test statistic, the t-distribution is based on (n − 2) df, and α = P(Type I error) = P(Reject H0 | H0 true).

Assumptions: The sample of (x, y) values is randomly selected from a normal population.
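The equivalence with the slope test of Section 3.6 can be seen numerically: computed from the same sample, r√(n − 2)/√(1 − r²) and βˆ1/(s/√SSxx) are the same number. A sketch using the advertising–sales sums of squares quoted earlier (n = 5, SSxy = 7, SSxx = 10, SSyy = 6):

```python
import math

# Summary quantities from the advertising-sales example.
n, ss_xy, ss_xx, ss_yy = 5, 7, 10, 6

# t statistic built from the correlation coefficient.
r = ss_xy / math.sqrt(ss_xx * ss_yy)
t_from_r = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# t statistic built from the least squares slope.
beta1_hat = ss_xy / ss_xx
s = math.sqrt((ss_yy - beta1_hat * ss_xy) / (n - 2))
t_from_slope = beta1_hat / (s / math.sqrt(ss_xx))

# Both equal ≈ 3.66: testing H0: rho = 0 and H0: beta1 = 0 is the same test.
```

This is why the text recommends simply using the slope: the two tests carry the same information, but βˆ1 also quantifies the change in y per unit change in x.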
The next example illustrates how the correlation coefﬁcient r may be a misleading measure of the strength of the association between x and y in situations where
the true relationship is nonlinear.
Example 3.2
Underinﬂated or overinﬂated tires can increase tire wear and decrease gas mileage.
A manufacturer of a new tire tested the tire for wear at different pressures, with
the results shown in Table 3.6. Calculate the coefﬁcient of correlation r for the data.
Interpret the result.
TIRES

Table 3.6 Data for Example 3.2

Pressure x, pounds per sq. inch   Mileage y, thousands
30                                29.5
30                                30.2
31                                32.1
31                                34.5
32                                36.3
32                                35.0
33                                38.2
33                                37.6
34                                37.7
34                                36.1
35                                33.6
35                                34.2
36                                26.8
36                                27.4
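The calculation requested in Example 3.2 can be sketched directly from Table 3.6. As the lead-in to the example suggests, the result is the interesting part: r comes out close to 0 even though a scatterplot of these data shows a pronounced (curvilinear) relationship between pressure and mileage, because r measures only linear association:

```python
import math

# Tire pressure-mileage data from Table 3.6.
x = [30, 30, 31, 31, 32, 32, 33, 33, 34, 34, 35, 35, 36, 36]
y = [29.5, 30.2, 32.1, 34.5, 36.3, 35.0, 38.2, 37.6,
     37.7, 36.1, 33.6, 34.2, 26.8, 27.4]
n = len(x)

ss_xx = sum(a * a for a in x) - sum(x) ** 2 / n
ss_yy = sum(b * b for b in y) - sum(y) ** 2 / n
ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n

r = ss_xy / math.sqrt(ss_xx * ss_yy)
# |r| is small: little *linear* association, despite the strong curved trend.
```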