13.4 Inferences Based on the Estimated Regression Line (Optional)
varies in value with different samples is summarized by its sampling distribution.
Properties of the sampling distribution are used to obtain both a confidence interval
formula for α + βx* and a prediction interval formula for a particular y observation.
The width of the corresponding interval conveys information about the precision of
the estimate or prediction.
Properties of the Sampling Distribution of a + bx for a Fixed x Value
Let x* denote a particular value of the independent variable x. When the four
basic assumptions of the simple linear regression model are satisfied, the sampling
distribution of the statistic a + bx* has the following properties:
1. The mean value of a + bx* is α + βx*, so a + bx* is an unbiased statistic
for estimating the mean y value when x = x*.
2. The standard deviation of the statistic a + bx*, denoted by σ_{a+bx*}, is
given by
$$\sigma_{a+bx^*} = \sigma\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$
3. The distribution of a + bx* is normal.
As you can see from the formula for σ_{a+bx*}, the standard deviation of a + bx* is
larger when (x* − x̄)² is large than when (x* − x̄)² is small; that is, a + bx* tends to
be a more precise estimate of α + βx* when x* is close to the center of the x values
at which observations were made than when x* is far from the center.
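One way to see where the formula for the standard deviation of a + bx* comes from is sketched below. The sketch relies on three standard facts that are not restated in this section and are taken here as assumptions: a = ȳ − b x̄, the standard deviation of b is σ/√Sxx, and ȳ and b are uncorrelated when the model assumptions hold.

```latex
\begin{aligned}
a + bx^* &= \bar{y} + b\,(x^* - \bar{x}) \\
\operatorname{Var}(a + bx^*) &= \operatorname{Var}(\bar{y}) + (x^* - \bar{x})^2 \operatorname{Var}(b) \\
  &= \frac{\sigma^2}{n} + \frac{(x^* - \bar{x})^2\,\sigma^2}{S_{xx}} \\
\sigma_{a+bx^*} &= \sigma\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}
\end{aligned}
```

The second term vanishes when x* equals x̄ and grows with the squared distance from x̄, which is why estimates near the center of the x values are the most precise.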
The standard deviation σ_{a+bx*} cannot be calculated from the sample data, because
the value of σ is unknown. However, σ_{a+bx*} can be estimated by using s_e in place of
σ. Using the mean and estimated standard deviation to standardize a + bx* gives a
variable with a t distribution.
The estimated standard deviation of the statistic a + bx*, denoted by s_{a+bx*}, is
given by
$$s_{a+bx^*} = s_e\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$
When the four basic assumptions of the simple linear regression model are satisfied,
the probability distribution of the standardized variable
$$t = \frac{a + bx^* - (\alpha + \beta x^*)}{s_{a+bx^*}}$$
is the t distribution with df = n − 2.
Inferences About the Mean y Value α + βx*
In previous chapters, standardized variables were manipulated algebraically to give
confidence intervals of the form
(point estimate) ± (critical value)(estimated standard deviation)
A parallel argument leads to the following interval.
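The parallel argument can be sketched as follows. The symbol t* is introduced only for this sketch; it stands for the t critical value (df = n − 2) that captures the desired central area under the t curve.

```latex
\begin{aligned}
-t^* &< \frac{a + bx^* - (\alpha + \beta x^*)}{s_{a+bx^*}} < t^* \\
-t^*\, s_{a+bx^*} &< a + bx^* - (\alpha + \beta x^*) < t^*\, s_{a+bx^*} \\
a + bx^* - t^*\, s_{a+bx^*} &< \alpha + \beta x^* < a + bx^* + t^*\, s_{a+bx^*}
\end{aligned}
```

The first pair of inequalities holds with probability equal to the confidence level, and each later line is an algebraic rearrangement of it, so the last line holds with the same probability.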
Copyright 2010 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s).
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.
Chapter 13
Simple Linear Regression and Correlation: Inferential Methods
Confidence Interval for a Mean y Value
When the basic assumptions of the simple linear regression model are met, a
confidence interval for α + βx*, the mean y value when x has value x*, is
$$a + bx^* \pm (t \text{ critical value}) \cdot s_{a+bx^*}$$
where the t critical value is based on df = n − 2. Appendix Table 3 gives critical
values corresponding to the most frequently used confidence levels.
Because s_{a+bx*} is larger the farther x* is from x̄, the confidence interval becomes
wider as x* moves away from the center of the data.
EXAMPLE 13.11 Shark Length and Jaw Width
Physical characteristics of sharks are of interest to surfers and scuba divers as well as to
marine researchers. The following data on x = length (in feet) and y = jaw width (in
inches) for 44 sharks were found in various articles appearing in the magazines Skin Diver
and Scuba News:
x      y      x      y      x      y      x      y
18.7   17.5   16.4   13.8   13.2   11.6   19.1   17.9
12.3   12.3   16.7   15.2   15.8   14.3   16.2   15.7
18.6   21.8   17.8   18.2   14.3   13.3   22.8   21.2
16.4   17.2   16.2   16.7   16.6   15.8   16.8   16.3
15.7   16.2   12.6   11.6    9.4   10.2   13.6   13.0
18.3   19.9   17.8   17.4   18.2   19.0   13.2   13.3
14.6   13.9   13.8   14.2   13.2   16.8   15.7   14.3
15.8   14.7   12.2   14.8   13.6   14.2   19.7   21.3
14.9   15.1   15.2   15.9   15.3   16.9   18.7   20.8
17.6   18.5   14.7   15.3   16.1   16.0   13.2   12.2
12.1   12.0   12.4   11.9   13.5   15.9   16.8   16.9
Because it is difficult to measure jaw width in living sharks, researchers would like to
determine whether it is possible to estimate jaw width from body length, which is
more easily measured. A scatterplot of the data (Figure 13.20) shows a linear pattern
and is consistent with use of the simple linear regression model.
FIGURE 13.20 A scatterplot for the data of Example 13.11 (horizontal axis: Length; vertical axis: Jaw width).
From the accompanying Minitab output, it is easily verified that
a = .688    b = .96345    SSResid = 79.49    SSTo = 339.02    r² = .766    s_e = 1.376
Regression Analysis
The regression equation is
Jaw Width = 0.69 + 0.963 Length

Predictor      Coef    StDev      T      P
Constant      0.688    1.299   0.53  0.599
Length      0.96345  0.08228  11.71  0.000

S = 1.376    R-Sq = 76.6%    R-Sq(adj) = 76.0%

Analysis of Variance
Source          DF      SS      MS       F      P
Regression       1  259.53  259.53  137.12  0.000
Residual Error  42   79.49    1.89
Total           43  339.02
From the data, we can also compute Sxx = 279.8718. Because r² = .766, the
simple linear regression model explains 76.6% of the variability in jaw width. The
model utility test also confirms the usefulness of this model (P-value = .000).
Let’s use the data to compute a 90% confidence interval for the mean jaw width
for 15-foot-long sharks. The mean jaw width when length is 15 feet is α + β(15).
The point estimate is
a + b(15) = .688 + .96345(15) = 15.140 in.
Since
$$\bar{x} = \frac{\sum x}{n} = \frac{685.80}{44} = 15.586$$
the estimated standard deviation of a + b(15) is
$$s_{a+b(15)} = s_e\sqrt{\frac{1}{n} + \frac{(15 - \bar{x})^2}{S_{xx}}} = (1.376)\sqrt{\frac{1}{44} + \frac{(15 - 15.586)^2}{279.8718}} = .213$$
The t critical value for df = 42 is 1.68 (using the tabulated value for df = 40 from
Appendix Table 3). We now have all the relevant quantities needed to compute a
90% confidence interval:
a + b(15) ± (t critical value) · s_{a+bx*} = 15.140 ± (1.68)(.213)
= 15.140 ± .358
= (14.782, 15.498)
Based on these sample data, we can be 90% confident that the mean jaw width
for sharks of length 15 feet is between 14.782 and 15.498 inches. As with all confidence
intervals, the 90% confidence level means that we have used a method that has
a 10% error rate to construct this interval estimate.
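The arithmetic in this example can be sketched in a few lines of Python. The function names `se_of_fit` and `mean_ci` are hypothetical helpers introduced only for illustration; the inputs are the summary quantities quoted in the example, and the t critical value 1.68 is the tabled value the example uses rather than one computed by the code.

```python
import math


def se_of_fit(x_star, n, x_bar, s_xx, s_e):
    """Estimated standard deviation of the statistic a + b * x_star."""
    return s_e * math.sqrt(1.0 / n + (x_star - x_bar) ** 2 / s_xx)


def mean_ci(x_star, a, b, n, x_bar, s_xx, s_e, t_crit):
    """Confidence interval for the mean y value when x = x_star."""
    fit = a + b * x_star
    margin = t_crit * se_of_fit(x_star, n, x_bar, s_xx, s_e)
    return fit - margin, fit + margin


# Summary quantities quoted in Example 13.11
n = 44
x_bar = 685.80 / n
s_xx = 279.8718
s_e = 1.376

se_15 = se_of_fit(15, n, x_bar, s_xx, s_e)
low, high = mean_ci(15, a=0.688, b=0.96345, n=n, x_bar=x_bar,
                    s_xx=s_xx, s_e=s_e, t_crit=1.68)
print(f"estimated sd of a + b(15): {se_15:.3f}")
print(f"90% confidence interval: ({low:.3f}, {high:.3f})")
```

Because the code carries more decimal places than the hand calculation, the endpoints can differ from the text in the last digit for other inputs; here they agree to three decimals.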
We have just considered estimation of the mean y value at a fixed x = x*. When
the basic assumptions of the simple linear regression model are met, the value of this
mean is α + βx*. The reason that our point estimate a + bx* is not exactly equal to
α + βx* is that the values of α and β are not known, so they have been estimated
from sample data. As a result, the estimate a + bx* is subject to sampling variability,
and the extent to which the estimated line might differ from the population line is
reflected in the width of the confidence interval.
Prediction Interval for a Single y
We now turn our attention to the problem of predicting a single y value at a particular
x = x* (rather than estimating the mean y value when x = x*). This problem is
equivalent to trying to predict the y value of an individual point in a scatterplot of
the population. If we use the estimated regression line to obtain a point prediction
a + bx*, this prediction will probably not be exactly equal to the true y value for two
reasons. First, as was the case when estimating a mean y value, the estimated line is
not going to be exactly equal to the population regression line. But, in the case of
predicting a single y value, there is an additional source of error: e, the deviation from
the line. Even if we knew the population line, individual points would not fall exactly
on the population line. This implies that there is more uncertainty associated with
predicting a single y value at a particular x* than with estimating the mean y value at
x*. This extra uncertainty is reflected in the width of the corresponding intervals.
An interval for a single y value, y*, is called a prediction interval (to distinguish
it from the confidence interval for a mean y value). The interpretation of a prediction
interval is similar to the interpretation of a confidence interval. A 95% prediction
interval for y* is constructed using a method for which 95% of all possible samples
would yield interval limits capturing y*; only 5% of all samples would give an interval
that did not include y*.
Manipulation of a standardized variable similar to the one from which a confidence interval was obtained gives the following prediction interval.
Prediction Interval for a Single y Value
When the four basic assumptions of the simple linear regression model are met,
a prediction interval for y*, a single y observation made when x = x*, has the
form
$$a + bx^* \pm (t \text{ critical value}) \cdot \sqrt{s_e^2 + s_{a+bx^*}^2}$$
The prediction interval and the confidence interval are centered at exactly the
same place, a + bx*. The addition of s_e² under the square-root symbol makes the
prediction interval wider (often substantially so) than the confidence interval.
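A brief sketch of why the extra term appears: the prediction error combines the deviation e of the new observation from the population line with the error made in estimating that line. The sketch assumes, as the model does, that the new observation is independent of the sample used to fit the line, so the two variances add.

```latex
\begin{aligned}
y^* - (a + bx^*) &= e + \bigl[(\alpha + \beta x^*) - (a + bx^*)\bigr] \\
\operatorname{Var}\bigl(y^* - (a + bx^*)\bigr) &= \sigma^2 + \sigma_{a+bx^*}^2 \\
\text{estimated standard deviation} &= \sqrt{s_e^2 + s_{a+bx^*}^2}
\end{aligned}
```

Even with a very large sample, where the second term becomes small, the first term remains, so a prediction interval cannot shrink to zero width the way a confidence interval for the mean can.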
EXAMPLE 13.12 Jaws II
In Example 13.11, we computed a 90% confidence interval for the mean jaw width
of sharks of length 15 feet. Suppose that we are interested in predicting the jaw width
of a single shark of length 15 feet. The required calculations for a 90% prediction
interval for y* are
a + b(15) = .688 + .96345(15) = 15.140
s_e² = (1.376)² = 1.8934
s²_{a+b(15)} = (.213)² = .0454
The t critical value for df = 42 and a 90% prediction level is 1.68 (using the tabled
value for df = 40). Substitution into the prediction interval formula then gives
$$a + b(15) \pm (t \text{ critical value})\sqrt{s_e^2 + s_{a+b(15)}^2} = 15.140 \pm (1.68)\sqrt{1.9388} = 15.140 \pm 2.339 = (12.801,\ 17.479)$$
We can be 90% confident that an individual shark of length 15 feet will have a jaw
width between 12.801 and 17.479 inches. Notice that, as expected, this 90% prediction
interval is much wider than the 90% confidence interval when x* = 15 from
Example 13.11.
Figure 13.21 gives Minitab output that includes a 95% confidence interval and a
95% prediction interval when x* = 15 and when x* = 20. The intervals for x* = 20
are wider than the corresponding intervals for x* = 15 because 20 is farther from x̄ (the
center of the sample x values) than is 15. Each prediction interval is wider than the
corresponding confidence interval. Figure 13.22 is a Minitab plot that shows the estimated
regression line as well as 90% confidence limits and prediction limits.
FIGURE 13.21 Minitab output for the data of Example 13.12.

Regression Analysis
The regression equation is
Jaw Width = 0.69 + 0.963 Length

Predictor      Coef    StDev      T      P
Constant      0.688    1.299   0.53  0.599
Length      0.96345  0.08228  11.71  0.000

S = 1.376    R-Sq = 76.6%    R-Sq(adj) = 76.0%

Analysis of Variance
Source          DF      SS      MS       F      P
Regression       1  259.53  259.53  137.12  0.000
Residual Error  42   79.49    1.89
Total           43  339.02

Predicted Values
             Fit  StDev Fit            95.0% CI            95.0% PI
x* = 15   15.140      0.213  ( 14.710, 15.569)  ( 12.330, 17.949)
x* = 20   19.957      0.418  ( 19.113, 20.801)  ( 17.055, 22.859)
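The predicted values in the Minitab output can be checked with a short sketch that applies the confidence interval and prediction interval formulas at both x* values. The helper name `intervals` is hypothetical, and the 95% t critical value for df = 42 is not given in the text; the value 2.018 used below is an assumption supplied for this illustration.

```python
import math

# Summary quantities reported in Examples 13.11 and 13.12
A, B = 0.688, 0.96345      # intercept and slope of the estimated line
N = 44
X_BAR = 685.80 / N
S_XX = 279.8718
S_E = 1.376
T_CRIT_95 = 2.018          # assumption: 95% t critical value for df = 42


def intervals(x_star, t_crit):
    """Return (fit, se_fit, confidence interval, prediction interval) at x_star."""
    fit = A + B * x_star
    se_fit = S_E * math.sqrt(1 / N + (x_star - X_BAR) ** 2 / S_XX)
    ci_margin = t_crit * se_fit
    # The prediction interval adds s_e^2 under the square root.
    pi_margin = t_crit * math.sqrt(S_E ** 2 + se_fit ** 2)
    ci = (fit - ci_margin, fit + ci_margin)
    pi = (fit - pi_margin, fit + pi_margin)
    return fit, se_fit, ci, pi


for x_star in (15, 20):
    fit, se_fit, ci, pi = intervals(x_star, T_CRIT_95)
    print(f"x* = {x_star}: fit = {fit:.3f}, sd of fit = {se_fit:.3f}")
    print(f"  95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
    print(f"  95% PI: ({pi[0]:.3f}, {pi[1]:.3f})")
```

Both intervals at x* = 20 come out wider than those at x* = 15, and each prediction interval is wider than the matching confidence interval, in line with the discussion of the figure. Small differences from Minitab in the third decimal place are possible because the inputs here are rounded.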
FIGURE 13.22 Minitab plot showing estimated regression line (Y = 0.687864 + 0.963450X, R-Sq = 76.6%) and 90% confidence and prediction limits for the data of Example 13.12 (horizontal axis: Length; vertical axis: Jaw width).
EXERCISES 13.33 - 13.46

13.33 Explain the difference between a confidence interval and a prediction interval.
How can a prediction level of 95% be interpreted?

13.34 Suppose that a regression data set is given and you are asked to obtain a
confidence interval. How would you tell from the phrasing of the request whether
the interval is for β or for α + βx*?
13.35 In Exercise 13.17, we considered a regression of y = oxygen consumption on
x = time spent exercising. Summary quantities given there yield
n = 20    x̄ = 2.50    Sxx = 25
b = 97.26    a = 592.10    s_e = 16.486
a. Calculate s_{a+b(2.0)}, the estimated standard deviation of the statistic a + b(2.0).
b. Without any further calculation, what is s_{a+b(3.0)} and what reasoning did you use
to obtain it?
c. Calculate the estimated standard deviation of the statistic a + b(2.8).
d. For what value x* is the estimated standard deviation of a + bx* smallest, and why?

13.36 Example 13.3 gave data on x = proportion who judged candidate A as more
competent and y = vote difference proportion. Calculate a 95% confidence interval
for the mean vote-difference proportion for congressional races where 60% judge
candidate A as more competent.

13.37 The data of Exercise 13.25, in which x = milk temperature and y = milk pH, yield
n = 16    x̄ = 42.375    Sxx = 7325.75
b = −.00730608    a = 6.843345    s_e = .0356
a. Obtain a 95% confidence interval for α + β(40), the mean milk pH when the milk
temperature is 40°C.
b. Calculate a 99% confidence interval for the mean milk pH when the milk
temperature is 35°C.
c. Would you recommend using the data to calculate a 95% confidence interval for the
mean pH when the temperature is 90°C? Why or why not?

13.38 Return to the regression of y = milk pH on x = milk temperature described in
the previous exercise.
a. Obtain a 95% prediction interval for a single pH observation to be made when milk
temperature = 40°C.
b. Calculate a 99% prediction interval for a single pH observation when milk
temperature = 35°C.
c. When the milk temperature is 60°C, would a 99% prediction interval be wider than
the intervals of Parts (a) and (b)? You should be able to answer without calculating
the interval.

13.39 A subset of data read from a graph that appeared in the paper “Decreased Brain
Volume in Adults with Childhood Lead Exposure” (Public Library of Science
Medicine [May 27, 2008]: e112) was used to produce the following Minitab output,
where x = mean childhood blood lead level (μg/dL) and y = brain volume change
(percentage). (See Exercise 13.19 for a more complete description of the study
described in this paper.)

Regression Analysis: Response versus Mean Blood Lead Level
The regression equation is
Response = −0.00179 − 0.00210 Mean Blood Lead Level

Predictor                    Coef    SE Coef      T      P
Constant                −0.001790   0.008303  −0.22  0.830
Mean Blood Lead Level  −0.0021007  0.0005743  −3.66  0.000

a. What is the equation of the estimated regression line? ŷ = −0.001790 − 0.0021007x
b. For this dataset, n = 100, x̄ = 11.5, s_e = 0.032, and Sxx = 1764. Estimate the mean
brain volume change for people with a childhood blood lead level of 20 μg/dL, using
a 90% confidence interval.
c. Construct a 90% prediction interval for brain volume change for a person with a
childhood blood lead level of 20 μg/dL.
d. Explain the difference in interpretation of the intervals computed in Parts (b) and (c).

13.40 An experiment was carried out by geologists to see how the time necessary to
drill a distance of 5 feet in rock (y, in minutes) depended on the depth at which the
drilling began (x, in feet, between 0 and 400). We show part of the Minitab output
obtained from fitting the simple linear regression model (“Mining Information,”
American Statistician [1991]: 4–9).

The regression equation is
Time = 4.79 + 0.0144depth

Predictor      Coef     Stdev  t-ratio      p
Constant     4.7896    0.6663     7.19  0.000
depth      0.014388  0.002847     5.05  0.000

s = 1.432    R-sq = 63.0%    R-sq(adj) = 60.5%

Analysis of Variance
Source      DF      SS      MS      F      p
Regression   1  52.378  52.378  25.54  0.000
Error       15  30.768   2.051
Total       16  83.146

a. What proportion of observed variation in time can be explained by the simple
linear regression model?
b. Does the simple linear regression model appear to be useful?
c. Minitab reported that s_{a+b(200)} = .347. Calculate a 95% confidence interval for
the mean time when depth = 200 feet.
d. A single observation on time is to be made when drilling starts at a depth of
200 feet. Use a 95% prediction interval to predict the resulting value of time.
e. Minitab gave (8.147, 10.065) as a 95% confidence interval for mean time when
depth = 300. Calculate a 99% confidence interval for this mean.

13.41 According to “Reproductive Biology of the Aquatic Salamander Amphiuma
tridactylum in Louisiana” (Journal of Herpetology [1999]: 100–105), the size of a
female salamander’s snout is correlated with the number of eggs in her clutch. The
following data are consistent with summary quantities reported in the article.
Partial Minitab output is also included.

Snout-Vent Length   32   53   53   53   54
Clutch Size         45  215  160  170  190

Snout-Vent Length   57   57   58   58   59
Clutch Size        200  270  175  245  215

Snout-Vent Length   63   63   64   67
Clutch Size        170  240  245  280

The regression equation is
Y = −133 + 5.92x

Predictor     Coef  StDev      T      P
Constant   −133.02  64.30  −2.07  0.061
x            5.919  1.127   5.25  0.000

s = 33.90    R-Sq = 69.7%    R-Sq(adj) = 67.2%

Additional summary statistics are
n = 14    x̄ = 56.5    ȳ = 201.4
Σx² = 45,958    Σy² = 613,550    Σxy = 164,969
a. What is the equation of the regression line for predicting clutch size based on
snout-vent length?
b. What is the value of the estimated standard deviation of b?
c. Is there sufficient evidence to conclude that the slope of the population line is
positive?
d. Predict the clutch size for a salamander with a snout-vent length of 65 using a
95% interval.
e. Predict the clutch size for a salamander with a snout-vent length of 105 using a
90% interval.

13.42 The article first introduced in Exercise 13.29 of Section 13.3 gave data on the
dimensions of 27 representative food products.
a. Use the data set given there to test the hypothesis that there is a positive linear
relationship between x = minimum width and y = maximum width of an object.
b. Calculate and interpret s_e.
c. Calculate a 95% confidence interval for the mean maximum width of products with
a minimum width of 6 cm.
d. Calculate a 95% prediction interval for the maximum width of a food package with
a minimum width of 6 cm.

13.43 The shelf life of packaged food depends on many factors. Dry cereal is
considered to be a moisture-sensitive product (no one likes soggy cereal!) with the
shelf life determined primarily by moisture content. In a study of the shelf life of
one particular brand of cereal, x = time on shelf (days stored at 73°F and 50%
relative humidity) and y = moisture content (%) were recorded. The resulting data
are from “Computer Simulation Speeds Shelf Life Assessments” (Package
Engineering [1983]: 72–73).

x     0    3    6    8   10   13   16
y   2.8  3.0  3.1  3.2  3.4  3.4  3.5

x    20   24   27   30   34   37   41
y   3.1  3.8  4.0  4.1  4.3  4.4  4.9

a. Summary quantities are
Σx = 269    Σy = 51    Σxy = 1081.5
Σy² = 190.78    Σx² = 7745
Find the equation of the estimated regression line for predicting moisture content
from time on the shelf.
b. Does the simple linear regression model provide useful information for predicting
moisture content from knowledge of shelf time?
c. Find a 95% interval for the moisture content of an individual box of cereal that
has been on the shelf 30 days.
d. According to the article, taste tests indicate that this brand of cereal is
unacceptably soggy when the moisture content exceeds 4.1. Based on your interval in
Part (c), do you think that a box of cereal that has been on the shelf 30 days will
be acceptable? Explain.

13.44 For the cereal data of the previous exercise, the mean x value is 19.21. Would
a 95% confidence interval with x* = 20 or x* = 17 be wider? Explain. Answer the
same question for a prediction interval.

13.45 A regression of x = tannin concentration (mg/L) and y = perceived astringency
score was considered in Examples 5.2 and 5.6. The perceived astringency was computed
from expert tasters rating a wine on a scale from 0 to 10 and then standardizing the
rating by computing a z-score. Data for 32 red wines (given in Example 5.2) were used
to compute the following summary statistics and estimated regression line:
n = 32    x̄ = .6069    Σ(x − x̄)² = 1.479
SSResid = 1.936    ŷ = −1.59 + 2.59x
a. Calculate a 95% confidence interval for the mean astringency rating for red wines
with a tannin concentration of .5 mg/L.
b. When two 95% confidence intervals are computed, it can be shown that the
simultaneous confidence level is at least [100 − 2(5)]% = 90%. That is, if both
intervals are computed for a first sample, for a second sample, for a third sample,
and so on, in the long run at least 90% of the samples will result in intervals which
both capture the values of the corresponding population characteristics. Calculate
confidence intervals for the mean astringency rating when the tannin concentration is
.5 mg/L and when the tannin concentration is .7 mg/L in such a way that the
simultaneous confidence level is at least 90%.
c. If two 99% confidence intervals were computed, what do you think could be said
about the simultaneous confidence level?
d. If a 95% confidence interval were computed for the mean astringency rating when
x = .5, another confidence interval was computed for x = .6, and yet another one for
x = .7, what do you think would be the simultaneous confidence level for the three
resulting intervals?

13.46 The article “Performance Test Conducted for a Gas Air-Conditioning System”
(American Society of Heating, Refrigerating, and Air Conditioning Engineering
[1969]: 54) reported the following data on maximum outdoor temperature (x) and hours
of chiller operation per day (y) for a 3-ton residential gas air-conditioning system:

x    72   78   80    86    88    92
y   4.8  7.2  9.5  14.5  15.7  17.9

Suppose that the system is actually a prototype model, and the manufacturer does not
wish to produce this model unless the data strongly indicate that when maximum
outdoor temperature is 82°F, the true average number of hours of chiller operation is
less than 12. The appropriate hypotheses are then
H0: α + β(82) = 12 versus Ha: α + β(82) < 12
Use the statistic
$$t = \frac{a + b(82) - 12}{s_{a+b(82)}}$$
which has a t distribution based on (n − 2) df when H0 is true, to test the
hypotheses at significance level .01.

13.5 Inferences About the Population Correlation Coefficient (Optional)
The sample correlation coefficient r, defined in Chapter 5, measures how strongly the
x and y values in a sample of pairs are linearly related to one another. There is an
analogous measure of how strongly x and y are linearly related in the entire population
of pairs from which the sample (x1, y1), ... , (xn, yn) was obtained. It is called the
population correlation coefficient and is denoted by ρ. As with r, ρ must be between −1
and 1, and it assesses the extent of any linear relationship in the population. To have
ρ = 1 or ρ = −1, all (x, y) pairs in the population must lie exactly on a straight line.
The value of ρ is a population characteristic and is generally unknown. The sample
correlation coefficient r can be used as the basis for making inferences about ρ.
Test for Independence (ρ = 0)
Investigators are often interested in detecting not just linear association but also
association of any kind. When there is no association of any type between the x and
y values, statisticians say that the two variables are independent. In general, ρ = 0 is
not equivalent to the independence of x and y. However, there is one special, yet
frequently occurring, situation in which the two conditions (ρ = 0 and independence)
are identical. This is when the pairs in the population have what is called a
bivariate normal distribution. The essential feature of such a distribution is that for
any fixed x value, the distribution of associated y values is normal, and for any fixed y
value, the distribution of x values is normal.
As an example, suppose that height x and weight y have a bivariate normal distribution in the American adult male population. (There is good empirical evidence for this.)
Then, when x = 68 inches, weight y has a normal distribution; when x = 72 inches,
weight is normally distributed; when y = 160 pounds, height x has a normal distribution; when y = 175 pounds, height has a normal distribution; and so on. In this example,
of course, x and y are not independent, because large height values tend to be paired with
large weight values and small height values tend to be paired with small weight values.
There is no easy way to check the assumption of bivariate normality, especially when
the sample size n is small. A partial check can be based on the following property: If (x, y)
has a bivariate normal distribution, then x alone has a normal distribution and so does y.
This suggests constructing a normal probability plot of x1, x2, ... xn, and a separate normal
probability plot of y1, y2, ... , yn. If either plot shows a substantial departure from a straight
line, then bivariate normality is a questionable assumption. If both plots are reasonably
straight, then bivariate normality is plausible, although no guarantee can be given.
For a bivariate normal population, the test of independence (correlation = 0) is
a t test. The formula for the test statistic involves standardizing the estimate r under
the assumption that the null hypothesis H0: ρ = 0 is true.
A Test for Independence in a Bivariate Normal Population
Null hypothesis: H0: ρ = 0
Test statistic:
$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$
The test is based on df = n − 2.

Alternative hypothesis:            P-value:
Ha: ρ > 0 (positive dependence)    Area under the appropriate t curve to the right of the computed t
Ha: ρ < 0 (negative dependence)    Area under the appropriate t curve to the left of the computed t
Ha: ρ ≠ 0 (dependence)             (1) 2(area to the right of t) if t is positive, or
                                   (2) 2(area to the left of t) if t is negative

Assumptions: r is the correlation coefficient for a random sample from a bivariate normal population.
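As a rough illustration of the test statistic in the box, the sketch below evaluates t for a given r and n. The function name `correlation_t_statistic` and the values r = 0.5 and n = 27 are hypothetical, chosen only to show the arithmetic; finding the P-value would still require a t table or statistical software.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for testing H0: rho = 0, based on df = n - 2."""
    if n <= 2:
        raise ValueError("n must be greater than 2")
    if not -1 < r < 1:
        raise ValueError("r must be strictly between -1 and 1")
    return r / math.sqrt((1 - r ** 2) / (n - 2))


# Hypothetical sample values, not taken from the text
r, n = 0.5, 27
t = correlation_t_statistic(r, n)
print(f"t = {t:.3f} with df = {n - 2}")
```

For a fixed r, the statistic grows with the sample size, so even a modest sample correlation can give a large t when n is large.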
EXAMPLE 13.13 Sleepless Nights
The relationship between sleep duration and the level of the hormone leptin (a
hormone related to energy intake and energy expenditure) in the blood was investigated in the paper “Short Sleep Duration is Associated with Reduced Leptin,
Elevated Ghrelin, and Increased Body Mass Index” (Public Library of Science
Medicine, [December 2004]: 210–217). Average nightly sleep (x, in hours) and
blood leptin level (y) were recorded for each person in a sample of 716 participants
in the Wisconsin Sleep Cohort Study. The sample correlation coefficient was