13.4 Inferences Based on the Estimated Regression Line (Optional)




varies in value with different samples is summarized by its sampling distribution. Properties of the sampling distribution are used to obtain both a confidence interval formula for α + βx* and a prediction interval formula for a particular y observation. The width of the corresponding interval conveys information about the precision of the estimate or prediction.



Properties of the Sampling Distribution of a + bx for a Fixed x Value

Let x* denote a particular value of the independent variable x. When the four basic assumptions of the simple linear regression model are satisfied, the sampling distribution of the statistic a + bx* has the following properties:

1. The mean value of a + bx* is α + βx*, so a + bx* is an unbiased statistic for estimating the mean y value when x = x*.

2. The standard deviation of the statistic a + bx*, denoted by σ_{a+bx*}, is given by

$$\sigma_{a+bx^*} = \sigma\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$

3. The distribution of a + bx* is normal.

As you can see from the formula for σ_{a+bx*}, the standard deviation of a + bx* is larger when (x* − x̄)² is large than when (x* − x̄)² is small; that is, a + bx* tends to be a more precise estimate of α + βx* when x* is close to the center of the x values at which observations were made than when x* is far from the center.

The standard deviation σ_{a+bx*} cannot be calculated from the sample data, because the value of σ is unknown. However, σ_{a+bx*} can be estimated by using s_e in place of σ. Using the mean and estimated standard deviation to standardize a + bx* gives a variable with a t distribution.



The estimated standard deviation of the statistic a + bx*, denoted by s_{a+bx*}, is given by

$$s_{a+bx^*} = s_e\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$

When the four basic assumptions of the simple linear regression model are satisfied, the probability distribution of the standardized variable

$$t = \frac{a + bx^* - (\alpha + \beta x^*)}{s_{a+bx^*}}$$

is the t distribution with df = n − 2.
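The formula for the estimated standard deviation can be sketched in a few lines of Python. This is only an illustration: the function name `estimated_sd_of_fit` and the summary values below are hypothetical, chosen to show how the estimate grows as x* moves away from x̄.

```python
import math


def estimated_sd_of_fit(s_e, n, x_star, x_bar, s_xx):
    """Estimated standard deviation of a + b*x_star:
    s_e * sqrt(1/n + (x_star - x_bar)^2 / S_xx)."""
    return s_e * math.sqrt(1.0 / n + (x_star - x_bar) ** 2 / s_xx)


# Hypothetical summary values, used only for illustration.
s_e, n, x_bar, s_xx = 2.0, 25, 10.0, 100.0

for x_star in (10.0, 12.0, 20.0):
    sd = estimated_sd_of_fit(s_e, n, x_star, x_bar, s_xx)
    print(f"x* = {x_star:5.1f}  estimated sd = {sd:.4f}")
```

With these made-up values the estimate is smallest at x* = x̄, where the second term under the square root vanishes, and it increases steadily as x* moves away from the center of the data.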



Inferences About the Mean y Value α + βx*

In previous chapters, standardized variables were manipulated algebraically to give confidence intervals of the form

(point estimate) ± (critical value)(estimated standard deviation)

A parallel argument leads to the following interval.

Copyright 2010 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s).

Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.






Chapter 13 Simple Linear Regression and Correlation: Inferential Methods



Confidence Interval for a Mean y Value

When the basic assumptions of the simple linear regression model are met, a confidence interval for α + βx*, the mean y value when x has value x*, is

$$a + bx^* \pm (t \text{ critical value}) \cdot s_{a+bx^*}$$

where the t critical value is based on df = n − 2. Appendix Table 3 gives critical values corresponding to the most frequently used confidence levels.



Because s_{a+bx*} is larger the farther x* is from x̄, the confidence interval becomes wider as x* moves away from the center of the data.



EXAMPLE 13.11 Shark Length and Jaw Width



Physical characteristics of sharks are of interest to surfers and scuba divers as well as to marine researchers. The following data on x = length (in feet) and y = jaw width (in inches) for 44 sharks were found in various articles appearing in the magazines Skin Diver and Scuba News:



   x      y        x      y        x      y        x      y
 18.7   17.5     16.4   13.8     13.2   11.6     19.1   17.9
 12.3   12.3     16.7   15.2     15.8   14.3     16.2   15.7
 18.6   21.8     17.8   18.2     14.3   13.3     22.8   21.2
 16.4   17.2     16.2   16.7     16.6   15.8     16.8   16.3
 15.7   16.2     12.6   11.6      9.4   10.2     13.6   13.0
 18.3   19.9     17.8   17.4     18.2   19.0     13.2   13.3
 14.6   13.9     13.8   14.2     13.2   16.8     15.7   14.3
 15.8   14.7     12.2   14.8     13.6   14.2     19.7   21.3
 14.9   15.1     15.2   15.9     15.3   16.9     18.7   20.8
 17.6   18.5     14.7   15.3     16.1   16.0     13.2   12.2
 12.1   12.0     12.4   11.9     13.5   15.9     16.8   16.9



Because it is difficult to measure jaw width in living sharks, researchers would like to

determine whether it is possible to estimate jaw width from body length, which is

more easily measured. A scatterplot of the data (Figure 13.20) shows a linear pattern

and is consistent with use of the simple linear regression model.



FIGURE 13.20 A scatterplot for the data of Example 13.11 (vertical axis: jaw width; horizontal axis: length).



From the accompanying Minitab output, it is easily verified that

a = .688   b = .96345   SSResid = 79.49   SSTo = 339.02   s_e = 1.376   r² = .766






Regression Analysis

The regression equation is
Jaw Width = 0.69 + 0.963 Length

Predictor    Coef       StDev      T        P
Constant     0.688      1.299      0.53     0.599
Length       0.96345    0.08228    11.71    0.000

S = 1.376    R-Sq = 76.6%    R-Sq(adj) = 76.0%

Analysis of Variance
Source            DF    SS        MS        F         P
Regression         1    259.53    259.53    137.12    0.000
Residual Error    42     79.49      1.89
Total             43    339.02



From the data, we can also compute Sxx = 279.8718. Because r² = .766, the simple linear regression model explains 76.6% of the variability in jaw width. The model utility test also confirms the usefulness of this model (P-value = .000).
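As a quick consistency check, the summary quantities in the Minitab output can be recomputed from SSResid and SSTo alone. The sketch below uses only the numbers reported above; the variable names are illustrative, and small differences from the printed output are expected because the inputs are already rounded.

```python
import math

# Values taken from the Minitab output for the shark data.
n = 44
ss_resid = 79.49
ss_total = 339.02

# Coefficient of determination: proportion of variation explained.
r_squared = 1 - ss_resid / ss_total

# Estimated standard deviation about the line, with n - 2 degrees of freedom.
s_e = math.sqrt(ss_resid / (n - 2))

# Model utility F statistic: regression mean square over residual mean square.
ss_regr = ss_total - ss_resid
f_stat = (ss_regr / 1) / (ss_resid / (n - 2))

print(f"r^2 = {r_squared:.3f}")
print(f"s_e = {s_e:.3f}")
print(f"F   = {f_stat:.2f}")
```

The recomputed r² and s_e should round to the reported .766 and 1.376, and the F ratio should land within rounding error of the reported 137.12.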

Let's use the data to compute a 90% confidence interval for the mean jaw width for 15-foot-long sharks. The mean jaw width when length is 15 feet is α + β(15). The point estimate is

$$a + b(15) = .688 + .96345(15) = 15.140 \text{ in.}$$

Since

$$\bar{x} = \frac{\sum x}{n} = \frac{685.80}{44} = 15.586$$

the estimated standard deviation of a + b(15) is

$$s_{a+b(15)} = s_e\sqrt{\frac{1}{n} + \frac{(15 - \bar{x})^2}{S_{xx}}} = (1.376)\sqrt{\frac{1}{44} + \frac{(15 - 15.586)^2}{279.8718}} = .213$$

The t critical value for df = 42 is 1.68 (using the tabulated value for df = 40 from Appendix Table 3). We now have all the relevant quantities needed to compute a 90% confidence interval:

$$a + b(15) \pm (t \text{ critical value}) \cdot s_{a+bx^*} = 15.140 \pm (1.68)(.213) = 15.140 \pm .358 = (14.782, 15.498)$$

Based on these sample data, we can be 90% confident that the mean jaw width

for sharks of length 15 feet is between 14.782 and 15.498 inches. As with all confidence intervals, the 90% confidence level means that we have used a method that has

a 10% error rate to construct this interval estimate.
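The arithmetic of this example can be retraced with a short Python sketch. It uses the summary values quoted in the example and the tabled critical value 1.68; the variable names are illustrative, and the script is a check on the hand calculation rather than a general-purpose routine.

```python
import math

# Summary values quoted in Example 13.11.
a, b = 0.688, 0.96345
s_e, n, x_bar, s_xx = 1.376, 44, 15.586, 279.8718
x_star = 15.0
t_crit = 1.68  # tabled value for df = 40, as used in the example

# Point estimate of the mean jaw width at x* = 15.
fit = a + b * x_star

# Estimated standard deviation of a + b(15).
sd_fit = s_e * math.sqrt(1 / n + (x_star - x_bar) ** 2 / s_xx)

# 90% confidence interval for the mean y value at x* = 15.
half_width = t_crit * sd_fit
ci = (fit - half_width, fit + half_width)

print(f"fit        = {fit:.3f}")
print(f"sd of fit  = {sd_fit:.3f}")
print(f"half-width = {half_width:.3f}")
print(f"90% CI     = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Because the script carries unrounded intermediate values, its results may differ from the hand calculation in the last decimal place at most.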

We have just considered estimation of the mean y value at a fixed x = x*. When the basic assumptions of the simple linear regression model are met, the value of this mean is α + βx*. The reason that our point estimate a + bx* is not exactly equal to α + βx* is that the values of α and β are not known, so they have been estimated from sample data. As a result, the estimate a + bx* is subject to sampling variability, and the extent to which the estimated line might differ from the population line is reflected in the width of the confidence interval.




Prediction Interval for a Single y

We now turn our attention to the problem of predicting a single y value at a particular x = x* (rather than estimating the mean y value when x = x*). This problem is equivalent to trying to predict the y value of an individual point in a scatterplot of the population. If we use the estimated regression line to obtain a point prediction a + bx*, this prediction will probably not be exactly equal to the true y value for two reasons. First, as was the case when estimating a mean y value, the estimated line is not going to be exactly equal to the population regression line. But, in the case of predicting a single y value, there is an additional source of error: e, the deviation from the line. Even if we knew the population line, individual points would not fall exactly on the population line. This implies that there is more uncertainty associated with predicting a single y value at a particular x* than with estimating the mean y value at x*. This extra uncertainty is reflected in the width of the corresponding intervals.

An interval for a single y value, y*, is called a prediction interval (to distinguish

it from the confidence interval for a mean y value). The interpretation of a prediction

interval is similar to the interpretation of a confidence interval. A 95% prediction

interval for y* is constructed using a method for which 95% of all possible samples

would yield interval limits capturing y*; only 5% of all samples would give an interval

that did not include y*.

Manipulation of a standardized variable similar to the one from which a confidence interval was obtained gives the following prediction interval.



Prediction Interval for a Single y Value

When the four basic assumptions of the simple linear regression model are met, a prediction interval for y*, a single y observation made when x = x*, has the form

$$a + bx^* \pm (t \text{ critical value}) \cdot \sqrt{s_e^2 + s_{a+bx^*}^2}$$

The prediction interval and the confidence interval are centered at exactly the same place, a + bx*. The addition of s_e² under the square-root symbol makes the prediction interval wider, often substantially so, than the confidence interval.
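The extra s_e² term can be motivated by a short variance calculation. The sketch below is the standard argument rather than something stated in the text: it assumes the new observation y* is independent of the sample used to fit the line, and it writes e* for the random deviation of that new observation (a symbol introduced here only for illustration).

```latex
\begin{aligned}
y^* &= \alpha + \beta x^* + e^*, \\
y^* - (a + bx^*) &= e^* - \bigl[(a + bx^*) - (\alpha + \beta x^*)\bigr], \\
\operatorname{Var}\bigl(y^* - (a + bx^*)\bigr)
  &= \sigma^2 + \sigma_{a+bx^*}^2
  \quad\text{(the two terms are independent)}, \\
\text{which is estimated by}\quad & s_e^2 + s_{a+bx^*}^2 .
\end{aligned}
```

Standardizing the prediction error y* − (a + bx*) by the square root of this estimated variance gives a variable with a t distribution based on n − 2 df, which is what produces the interval above.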



EXAMPLE 13.12 Jaws II

In Example 13.11, we computed a 90% confidence interval for the mean jaw width of sharks of length 15 feet. Suppose that we are interested in predicting the jaw width of a single shark of length 15 feet. The required calculations for a 90% prediction interval for y* are

$$a + b(15) = .688 + .96345(15) = 15.140$$

$$s_e^2 = (1.376)^2 = 1.8934$$

$$s_{a+b(15)}^2 = (.213)^2 = .0454$$

The t critical value for df = 42 and a 90% prediction level is 1.68 (using the tabled value for df = 40). Substitution into the prediction interval formula then gives

$$a + b(15) \pm (t \text{ critical value})\sqrt{s_e^2 + s_{a+b(15)}^2} = 15.140 \pm (1.68)\sqrt{1.9388} = 15.140 \pm 2.339 = (12.801, 17.479)$$




We can be 90% confident that an individual shark of length 15 feet will have a jaw width between 12.801 and 17.479 inches. Notice that, as expected, this 90% prediction interval is much wider than the 90% confidence interval when x* = 15 from Example 13.11.
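The prediction interval calculation can be retraced in Python using the rounded quantities from the example. The variable names are illustrative; the inputs 15.140, 1.376, .213, and 1.68 are the values used in the hand calculation.

```python
import math

# Rounded quantities from Examples 13.11 and 13.12.
fit = 15.140      # point prediction a + b(15)
s_e = 1.376       # estimated standard deviation about the line
sd_fit = 0.213    # estimated standard deviation of a + b(15)
t_crit = 1.68     # tabled value for df = 40, as used in the example

# Variance under the square root: s_e^2 + s_{a+b(15)}^2.
pred_var = s_e ** 2 + sd_fit ** 2

# 90% prediction interval for a single y observation at x* = 15.
pi_half = t_crit * math.sqrt(pred_var)
pi = (fit - pi_half, fit + pi_half)

# 90% confidence interval half-width, for comparison.
ci_half = t_crit * sd_fit

print(f"s_e^2 + s_fit^2 = {pred_var:.4f}")
print(f"90% PI          = ({pi[0]:.3f}, {pi[1]:.3f})")
print(f"PI half-width / CI half-width = {pi_half / ci_half:.1f}")
```

The last line makes the comparison in the text concrete: almost all of the prediction interval's width comes from the s_e² term rather than from uncertainty in the estimated line.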

Figure 13.21 gives Minitab output that includes a 95% confidence interval and a 95% prediction interval when x* = 15 and when x* = 20. The intervals for x* = 20 are wider than the corresponding intervals for x* = 15 because 20 is farther from x̄ (the center of the sample x values) than is 15. Each prediction interval is wider than the corresponding confidence interval. Figure 13.22 is a Minitab plot that shows the estimated regression line as well as 90% confidence limits and prediction limits.

FIGURE 13.21 Minitab output for the data of Example 13.12.

Regression Analysis

The regression equation is
Jaw Width = 0.69 + 0.963 Length

Predictor    Coef       StDev      T        P
Constant     0.688      1.299      0.53     0.599
Length       0.96345    0.08228    11.71    0.000

S = 1.376    R-Sq = 76.6%    R-Sq(adj) = 76.0%

Analysis of Variance
Source            DF    SS        MS        F         P
Regression         1    259.53    259.53    137.12    0.000
Residual Error    42     79.49      1.89
Total             43    339.02

Predicted Values
            Fit       StDev Fit    95.0% CI             95.0% PI
x* = 15     15.140    0.213        (14.710, 15.569)     (12.330, 17.949)
x* = 20     19.957    0.418        (19.113, 20.801)     (17.055, 22.859)
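The predicted-values portion of the output can be approximately reproduced from the summary statistics. The sketch below assumes a 95% t critical value of 2.018 for df = 42; that number does not appear in the text and is supplied here as an assumption, and the helper name `intervals` is illustrative.

```python
import math

# Summary values reported in Examples 13.11 and 13.12.
a, b = 0.688, 0.96345
s_e, n, x_bar, s_xx = 1.376, 44, 15.586, 279.8718

# Assumption: 95% two-sided t critical value for df = 42 (not given in the text).
T_CRIT_95 = 2.018


def intervals(x_star, t_crit):
    """Return (fit, sd_fit, confidence interval, prediction interval) at x_star."""
    fit = a + b * x_star
    sd_fit = s_e * math.sqrt(1 / n + (x_star - x_bar) ** 2 / s_xx)
    ci_half = t_crit * sd_fit
    pi_half = t_crit * math.sqrt(s_e ** 2 + sd_fit ** 2)
    return (fit, sd_fit,
            (fit - ci_half, fit + ci_half),
            (fit - pi_half, fit + pi_half))


for x_star in (15, 20):
    fit, sd_fit, ci, pi = intervals(x_star, T_CRIT_95)
    print(f"x* = {x_star}: fit = {fit:.3f}, sd = {sd_fit:.3f}, "
          f"CI = ({ci[0]:.3f}, {ci[1]:.3f}), PI = ({pi[0]:.3f}, {pi[1]:.3f})")
```

The results should agree with the Minitab values to within about a unit in the third decimal place; the small discrepancies come from rounding in the inputs. Both intervals are visibly wider at x* = 20, which is farther from x̄ = 15.586.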



FIGURE 13.22 Minitab plot showing estimated regression line (Y = 0.687864 + 0.963450X, R-Sq = 76.6%) and 90% confidence and prediction limits for the data of Example 13.12 (vertical axis: jaw width; horizontal axis: length).



EXERCISES 13.33 - 13.46

13.33 Explain the difference between a confidence interval and a prediction interval. How can a prediction level of 95% be interpreted?

13.34 Suppose that a regression data set is given and you are asked to obtain a confidence interval. How would you tell from the phrasing of the request whether the interval is for β or for α + βx*?






13.35 In Exercise 13.17, we considered a regression of y = oxygen consumption on x = time spent exercising. Summary quantities given there yield

n = 20       x̄ = 2.50      Sxx = 25
b = 97.26    a = 592.10    s_e = 16.486

a. Calculate s_{a+b(2.0)}, the estimated standard deviation of the statistic a + b(2.0).
b. Without any further calculation, what is s_{a+b(3.0)} and what reasoning did you use to obtain it?
c. Calculate the estimated standard deviation of the statistic a + b(2.8).
d. For what value x* is the estimated standard deviation of a + bx* smallest, and why?

13.36 Example 13.3 gave data on x = proportion who judged candidate A as more competent and y = vote difference proportion. Calculate a 95% confidence interval for the mean vote-difference proportion for congressional races where 60% judge candidate A as more competent.

13.37 The data of Exercise 13.25, in which x = milk temperature and y = milk pH, yield

n = 16            x̄ = 42.375       Sxx = 7325.75
b = −.00730608    a = 6.843345     s_e = .0356

a. Obtain a 95% confidence interval for α + β(40), the mean milk pH when the milk temperature is 40°C.
b. Calculate a 99% confidence interval for the mean milk pH when the milk temperature is 35°C.
c. Would you recommend using the data to calculate a 95% confidence interval for the mean pH when the temperature is 90°C? Why or why not?

13.38 Return to the regression of y = milk pH on x = milk temperature described in the previous exercise.
a. Obtain a 95% prediction interval for a single pH observation to be made when milk temperature = 40°C.
b. Calculate a 99% prediction interval for a single pH observation when milk temperature = 35°C.
c. When the milk temperature is 60°C, would a 99% prediction interval be wider than the intervals of Parts (a) and (b)? You should be able to answer without calculating the interval.

13.39 A subset of data read from a graph that appeared in the paper “Decreased Brain Volume in Adults with Childhood Lead Exposure” (Public Library of Science Medicine [May 27, 2008]: e112) was used to produce the following Minitab output, where x = mean childhood blood lead level (μg/dL) and y = brain volume change (percentage). (See Exercise 13.19 for a more complete description of the study described in this paper.)

Regression Analysis: Response versus Mean Blood Lead Level

The regression equation is
Response = -0.00179 - 0.00210 Mean Blood Lead Level

Predictor                Coef          SE Coef      T        P
Constant                 -0.001790     0.008303     -0.22    0.830
Mean Blood Lead Level    -0.0021007    0.0005743    -3.66    0.000

a. What is the equation of the estimated regression line? ŷ = −0.001790 − 0.0021007x
b. For this dataset, n = 100, x̄ = 11.5, s_e = 0.032, and Sxx = 1764. Estimate the mean brain volume change for people with a childhood blood lead level of 20 μg/dL, using a 90% confidence interval.
c. Construct a 90% prediction interval for brain volume change for a person with a childhood blood lead level of 20 μg/dL.
d. Explain the difference in interpretation of the intervals computed in Parts (b) and (c).

13.40 An experiment was carried out by geologists to see how the time necessary to drill a distance of 5 feet in rock (y, in minutes) depended on the depth at which the drilling began (x, in feet, between 0 and 400). We show part of the Minitab output obtained from fitting the simple linear regression model (“Mining Information,” American Statistician [1991]: 4–9).

The regression equation is
Time = 4.79 + 0.0144 depth

Predictor    Coef        Stdev       t-ratio    p
Constant     4.7896      0.6663      7.19       0.000
depth        0.014388    0.002847    5.05       0.000

s = 1.432    R-sq = 63.0%    R-sq(adj) = 60.5%

Analysis of Variance
Source        DF    SS        MS        F        p
Regression     1    52.378    52.378    25.54    0.000
Error         15    30.768     2.051
Total         16    83.146

a. What proportion of observed variation in time can be explained by the simple linear regression model?
b. Does the simple linear regression model appear to be useful?
c. Minitab reported that s_{a+b(200)} = .347. Calculate a 95% confidence interval for the mean time when depth = 200 feet.
d. A single observation on time is to be made when drilling starts at a depth of 200 feet. Use a 95% prediction interval to predict the resulting value of time.
e. Minitab gave (8.147, 10.065) as a 95% confidence interval for mean time when depth = 300. Calculate a 99% confidence interval for this mean.



13.41 According to “Reproductive Biology of the Aquatic Salamander Amphiuma tridactylum in Louisiana” (Journal of Herpetology [1999]: 100–105), the size of a female salamander’s snout is correlated with the number of eggs in her clutch. The following data are consistent with summary quantities reported in the article. Partial Minitab output is also included.

Snout-Vent Length     32     53     53     53     54
Clutch Size           45    215    160    170    190

Snout-Vent Length     57     57     58     58     59
Clutch Size          200    270    175    245    215

Snout-Vent Length     63     63     64     67
Clutch Size          170    240    245    280

The regression equation is
Y = -133 + 5.92x

Predictor    Coef       StDev    T        P
Constant     -133.02    64.30    -2.07    0.061
x            5.919      1.127    5.25     0.000

s = 33.90    R-Sq = 69.7%    R-Sq(adj) = 67.2%

Additional summary statistics are

n = 14    x̄ = 56.5    ȳ = 201.4
Σx² = 45,958    Σy² = 613,550    Σxy = 164,969

a. What is the equation of the regression line for predicting clutch size based on snout-vent length?
b. What is the value of the estimated standard deviation of b?
c. Is there sufficient evidence to conclude that the slope of the population line is positive?
d. Predict the clutch size for a salamander with a snout-vent length of 65 using a 95% interval.
e. Predict the clutch size for a salamander with a snout-vent length of 105 using a 90% interval.

13.42 The article first introduced in Exercise 13.29 of Section 13.3 gave data on the dimensions of 27 representative food products.
a. Use the data set given there to test the hypothesis that there is a positive linear relationship between x = minimum width and y = maximum width of an object.
b. Calculate and interpret s_e.
c. Calculate a 95% confidence interval for the mean maximum width of products with a minimum width of 6 cm.
d. Calculate a 95% prediction interval for the maximum width of a food package with a minimum width of 6 cm.

13.43 The shelf life of packaged food depends on many factors. Dry cereal is considered to be a moisture-sensitive product (no one likes soggy cereal!) with the shelf life determined primarily by moisture content. In a study of the shelf life of one particular brand of cereal, x = time on shelf (days stored at 73°F and 50% relative humidity) and y = moisture content (%) were recorded. The resulting data are from “Computer Simulation Speeds Shelf Life Assessments” (Package Engineering [1983]: 72–73).

x      0      3      6      8     10     13     16
y    2.8    3.0    3.1    3.2    3.4    3.4    3.5

x     20     24     27     30     34     37     41
y    3.1    3.8    4.0    4.1    4.3    4.4    4.9

a. Summary quantities are

Σx = 269    Σy = 51    Σxy = 1081.5
Σy² = 190.78    Σx² = 7745

Find the equation of the estimated regression line for predicting moisture content from time on the shelf.
b. Does the simple linear regression model provide useful information for predicting moisture content from knowledge of shelf time?
c. Find a 95% interval for the moisture content of an individual box of cereal that has been on the shelf 30 days.
d. According to the article, taste tests indicate that this brand of cereal is unacceptably soggy when the moisture content exceeds 4.1. Based on your interval in Part (c), do you think that a box of cereal that has been on the shelf 30 days will be acceptable? Explain.

13.44 For the cereal data of the previous exercise, the mean x value is 19.21. Would a 95% confidence interval with x* = 20 or x* = 17 be wider? Explain. Answer the same question for a prediction interval.

13.45 A regression of x = tannin concentration (mg/L) and y = perceived astringency score was considered in Examples 5.2 and 5.6. The perceived astringency was computed from expert tasters rating a wine on a scale from 0 to 10 and then standardizing the rating by computing a z-score. Data for 32 red wines (given in Example 5.2) were used to compute the following summary statistics and estimated regression line:

n = 32    x̄ = .6069    Σ(x − x̄)² = 1.479
SSResid = 1.936    ŷ = −1.59 + 2.59x

a. Calculate a 95% confidence interval for the mean astringency rating for red wines with a tannin concentration of .5 mg/L.
b. When two 95% confidence intervals are computed, it can be shown that the simultaneous confidence level is at least [100 − 2(5)]% = 90%. That is, if both intervals are computed for a first sample, for a second sample, for a third sample, and so on, in the long run at least 90% of the samples will result in intervals which both capture the values of the corresponding population characteristics. Calculate confidence intervals for the mean astringency rating when the tannin concentration is .5 mg/L and when the tannin concentration is .7 mg/L in such a way that the simultaneous confidence level is at least 90%.
c. If two 99% confidence intervals were computed, what do you think could be said about the simultaneous confidence level?
d. If a 95% confidence interval were computed for the mean astringency rating when x = .5, another confidence interval was computed for x = .6, and yet another one for x = .7, what do you think would be the simultaneous confidence level for the three resulting intervals?

13.46 The article “Performance Test Conducted for a Gas Air-Conditioning System” (American Society of Heating, Refrigerating, and Air Conditioning Engineering [1969]: 54) reported the following data on maximum outdoor temperature (x) and hours of chiller operation per day (y) for a 3-ton residential gas air-conditioning system:

x     72     78     80     86     88     92
y    4.8    7.2    9.5   14.5   15.7   17.9

Suppose that the system is actually a prototype model, and the manufacturer does not wish to produce this model unless the data strongly indicate that when maximum outdoor temperature is 82°F, the true average number of hours of chiller operation is less than 12. The appropriate hypotheses are then

H0: α + β(82) = 12   versus   Ha: α + β(82) < 12

Use the statistic

$$t = \frac{a + b(82) - 12}{s_{a+b(82)}}$$

which has a t distribution based on (n − 2) df when H0 is true, to test the hypotheses at significance level .01.

13.5 Inferences About the Population Correlation Coefficient (Optional)

The sample correlation coefficient r, defined in Chapter 5, measures how strongly the x and y values in a sample of pairs are linearly related to one another. There is an analogous measure of how strongly x and y are linearly related in the entire population of pairs from which the sample (x1, y1), ... , (xn, yn) was obtained. It is called the population correlation coefficient and is denoted by ρ. As with r, ρ must be between −1 and 1, and it assesses the extent of any linear relationship in the population. To have ρ = 1 or ρ = −1, all (x, y) pairs in the population must lie exactly on a straight line. The value of ρ is a population characteristic and is generally unknown. The sample correlation coefficient r can be used as the basis for making inferences about ρ.



Test for Independence (ρ = 0)

Investigators are often interested in detecting not just linear association but also association of any kind. When there is no association of any type between the x and y values, statisticians say that the two variables are independent. In general, ρ = 0 is not equivalent to the independence of x and y. However, there is one special, yet frequently occurring, situation in which the two conditions (ρ = 0 and independence) are identical. This is when the pairs in the population have what is called a bivariate normal distribution. The essential feature of such a distribution is that for any fixed x value, the distribution of associated y values is normal, and for any fixed y value, the distribution of x values is normal.

As an example, suppose that height x and weight y have a bivariate normal distribution in the American adult male population. (There is good empirical evidence for this.) Then, when x = 68 inches, weight y has a normal distribution; when x = 72 inches, weight is normally distributed; when y = 160 pounds, height x has a normal distribution; when y = 175 pounds, height has a normal distribution; and so on. In this example, of course, x and y are not independent, because large height values tend to be paired with large weight values and small height values tend to be paired with small weight values.

There is no easy way to check the assumption of bivariate normality, especially when

the sample size n is small. A partial check can be based on the following property: If (x, y)

has a bivariate normal distribution, then x alone has a normal distribution and so does y.

This suggests constructing a normal probability plot of x1, x2, ... xn, and a separate normal

probability plot of y1, y2, ... , yn. If either plot shows a substantial departure from a straight

line, then bivariate normality is a questionable assumption. If both plots are reasonably

straight, then bivariate normality is plausible, although no guarantee can be given.

For a bivariate normal population, the test of independence (correlation = 0) is a t test. The formula for the test statistic involves standardizing the estimate r under the assumption that the null hypothesis H0: ρ = 0 is true.



A Test for Independence in a Bivariate Normal Population

Null hypothesis: H0: ρ = 0

Test statistic:

$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$

The test is based on df = n − 2.

Alternative hypothesis:             P-value:
Ha: ρ > 0 (positive dependence)     Area under the appropriate t curve to the right of the computed t
Ha: ρ < 0 (negative dependence)     Area under the appropriate t curve to the left of the computed t
Ha: ρ ≠ 0 (dependence)              (1) 2(area to the right of t) if t is positive, or
                                    (2) 2(area to the left of t) if t is negative

Assumptions: r is the correlation coefficient for a random sample from a bivariate normal population.
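The test statistic is simple to compute once r and n are known. The following is a minimal sketch; the function name `correlation_t_statistic` and the values r = 0.5, n = 27 are hypothetical and are not taken from any example in this section. The P-value would still be read from a t table (or software) with df = n − 2.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for testing H0: rho = 0, based on df = n - 2."""
    if n <= 2:
        raise ValueError("n must exceed 2 so that df = n - 2 is positive")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    return r / math.sqrt((1 - r ** 2) / (n - 2))


# Hypothetical sample correlation and sample size, for illustration only.
r, n = 0.5, 27
t_stat = correlation_t_statistic(r, n)
print(f"t = {t_stat:.3f} with df = {n - 2}")
```

Note that the same value of r gives a larger t statistic as n grows, so even a modest sample correlation can be strong evidence against H0 when the sample is large.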



EXAMPLE 13.13 Sleepless Nights



The relationship between sleep duration and the level of the hormone leptin (a

hormone related to energy intake and energy expenditure) in the blood was investigated in the paper “Short Sleep Duration is Associated with Reduced Leptin,



Elevated Ghrelin, and Increased Body Mass Index” (Public Library of Science

Medicine, [December 2004]: 210–217). Average nightly sleep (x, in hours) and

blood leptin level (y) were recorded for each person in a sample of 716 participants

in the Wisconsin Sleep Cohort Study. The sample correlation coefficient was
