


The calculation of the estimate of the standard deviation is based on the sum of

the squared residuals for the sample. This quantity is called the sum of squared

errors and is denoted by SSE. Synonyms for “sum of squared errors” are residual sum of squares or sum of squared residuals. To find the SSE, residuals are

calculated for all observations, then the residuals are squared and summed.

The standard deviation from the regression line is

$$s = \sqrt{\frac{\text{Sum of squared residuals}}{n - 2}} = \sqrt{\frac{\text{SSE}}{n - 2}}$$

and this sample statistic estimates the population standard deviation σ.

formula

Estimating the Standard Deviation for a Simple Regression Model

The formula for estimating the standard deviation for a simple regression model is

$$s = \sqrt{\frac{\text{SSE}}{n - 2}} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}}, \qquad \text{where SSE} = \sum (y_i - \hat{y}_i)^2 = \sum e_i^2$$

The statistic s is an estimate of the population standard deviation σ.

Technical Note: Notice the difference between the estimate of σ in the regression situation compared to what it would be if we simply had a random sample of the yᵢ's without information about the xᵢ's:

Sample of y's only: $$s = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n - 1}}$$

Sample of (x, y) pairs, linear regression: $$s = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}}$$



Remember that in the regression context, σ is the standard deviation of the y values at each x, not the standard deviation of the whole population of y values.
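To make the distinction concrete, here is a minimal Python sketch (the data and variable names are our own illustration, not from the text) that computes both versions of the estimate for the same sample:

import numpy as np

# Illustrative data only (not from the book's examples)
x = np.array([63.0, 65.0, 66.0, 68.0, 70.0, 72.0, 74.0])
y = np.array([127.0, 140.0, 136.0, 155.0, 152.0, 175.0, 170.0])
n = len(y)

# Sample of y's only: deviations from the overall mean, divisor n - 1
s_mean = np.sqrt(np.sum((y - y.mean()) ** 2) / (n - 1))

# (x, y) pairs, regression: deviations from the fitted line, divisor n - 2
b1 = np.corrcoef(x, y)[0, 1] * y.std(ddof=1) / x.std(ddof=1)  # least-squares slope
b0 = y.mean() - b1 * x.mean()                                 # least-squares intercept
sse = np.sum((y - (b0 + b1 * x)) ** 2)                        # sum of squared residuals
s_line = np.sqrt(sse / (n - 2))

print(f"SD of y about its mean (n - 1): {s_mean:.2f}")
print(f"SD of y about the line (n - 2): {s_line:.2f}")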



Example 14.3

Relationship Between Height and Weight for College Men  Figure 14.4 displays regression results from the Minitab program and a scatterplot for the relationship between y = weight (pounds) and x = height (inches) in a sample of n = 43 men in a statistics class.



The regression equation is
Weight = -318 + 7.00 Height

Predictor      Coef   SE Coef      T      P
Constant     -317.9     110.9  -2.87  0.007
Height        6.996     1.581   4.42  0.000

S = 24.00   R-Sq = 32.3%   R-Sq(adj) = 30.7%

[Scatterplot: Weight (lb) versus Height (in.), with the fitted regression line; not reproduced here.]

Figure 14.4 ❚ The relationship between weight and height for n = 43 college men






The regression line for the sample is ŷ = -318 + 7x, and this line is drawn onto the plot. We see from the plot that there is considerable variation from the line at any given height. The standard deviation, shown in the last row of the computer output, is "S = 24.00." This value roughly measures, for any given height, the general size of the deviations of individual weights from the mean weight for that height.

The standard deviation from the regression line can be interpreted in conjunction with the Empirical Rule for bell-shaped data stated in Section 2.7. Recall, for instance, that about 95% of individuals will fall within 2 standard deviations of the mean. As an example, consider men who are 72 inches tall. For men with this height, the estimated average weight determined from the regression equation is -318 + 7.00(72) = 186 pounds. The estimated standard deviation from the regression line is s = 24 pounds, so we can estimate that about 95% of men 72 inches tall have weights within 2 × 24 = 48 pounds of 186 pounds, which is 186 ± 48, or 138 to 234 pounds. Think about whether this makes sense for all the men you know who are 72 inches (6 feet) tall. ■
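The arithmetic in Example 14.3 is simple enough to script. A small sketch, using only the values reported in the Minitab output (we do not have the raw data here):

# Values reported in Figure 14.4: intercept, slope, and s
b0, b1, s = -318.0, 7.00, 24.0

height = 72
mean_weight = b0 + b1 * height                        # estimated mean weight at x = 72
low, high = mean_weight - 2 * s, mean_weight + 2 * s  # Empirical Rule: mean +/- 2 SD

print(f"Estimated mean weight at {height} in.: {mean_weight:.0f} pounds")
print(f"Approximate 95% range: {low:.0f} to {high:.0f} pounds")  # 138 to 234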



in summary

Interpreting the Standard Deviation for Regression

The standard deviation for regression estimates the standard deviation of the

differences between values of y and the regression equation that relates the

mean value of y to x. In other words, it measures the general size of the differences between actual and predicted values of y.



Thought Question 14.1  Regression equations can be used to predict the value of a response variable for an individual. What is the connection between the accuracy of predictions based on a particular regression line and the value of the standard deviation from the line? If you were deciding between two different regression models for predicting the same response variable, how would your decision be affected by the relative values of the standard deviations for the two models?*

*HINT: Read the first paragraph of this section (p. 605).



The Proportion of Variation Explained by x

In Chapter 5, we learned that the squared correlation r² is a useful statistic. It is used to measure how well the explanatory variable explains the variation in the response variable. This statistic is also denoted as R² (rather than r²), and the value is commonly expressed as a percentage. Researchers typically use the phrase "proportion of variation explained by x" in conjunction with the value of r². For example, if r² = .60 (or 60%), the researcher may write that the explanatory variable explains 60% of the variation in the response variable.









The formula for r² presented in Chapter 5 was

$$r^2 = \frac{\text{SSTO} - \text{SSE}}{\text{SSTO}}$$

The quantity SSTO is the sum of squared differences between observed y values and the sample mean ȳ. It measures the size of the deviations of the y values from the overall mean of y, whereas SSE measures the deviations of the y values from the predicted values of y.

Example 14.4

R² for Heights and Weights of College Men  In Figure 14.4 for Example 14.3 (p. 606), we can find the information "R-Sq = 32.3%" for the relationship between weight and height. A researcher might write "the variable height explains 32.3% of the variation in the weights of college men." This isn't a particularly impressive statistic. As we noted before, there is substantial deviation of individual weights from the regression line, so a prediction of a college man's weight based on height may not be particularly accurate. ■



Thought Question 14.2  Look at the formula for SSE, and explain in words under what condition SSE = 0. Now explain what happens to r² when SSE = 0, and explain whether that makes sense according to the definition of r² as "proportion of variation in y explained by x."*

*HINT: Remember that SSE stands for "sum of squared errors." The formula for r² is given just before Example 14.4.



Example 14.5

Driver Age and Highway Sign-Reading Distance  In Example 5.2 (p. 153), we examined data for the relationship between y = maximum distance (feet) at which a driver can read a highway sign and x = the age of the driver. There were n = 30 observations in the dataset. Figure 14.5 displays Minitab regression output for these data.



The regression equation is
Distance = 577 - 3.01 Age

Predictor      Coef   SE Coef      T      P
Constant     576.68     23.47  24.57  0.000
Age         -3.0068    0.4243  -7.09  0.000

S = 49.76   R-Sq = 64.2%   R-Sq(adj) = 62.9%

Analysis of Variance
Source          DF      SS      MS      F      P
Regression       1  124333  124333  50.21  0.000
Residual Error  28   69334    2476
Total           29  193667

Unusual Observations
Obs   Age  Distance     Fit  SE Fit  Residual  St Resid
27   75.0    460.00  351.17   13.65    108.83     2.27R

R denotes an observation with a large standardized residual

Figure 14.5 ❚ Minitab output: Sign-reading distance and driver age









The equation describing the linear relationship in the sample is

Average distance = 577 - 3.01 × Age

From the output, we learn that the standard deviation from the regression line is s = 49.76 and R-Sq = 64.2%. Roughly, the average deviation from the regression line is about 50 feet, and the proportion of variation in sign-reading distances explained by age is .642, or 64.2%.

The analysis of variance table provides the pieces needed to compute r² and s:

SSE = 69,334

$$s = \sqrt{\frac{\text{SSE}}{n - 2}} = \sqrt{\frac{69{,}334}{28}} = 49.76$$

SSTO = 193,667

SSTO - SSE = 193,667 - 69,334 = 124,333

$$r^2 = \frac{\text{SSTO} - \text{SSE}}{\text{SSTO}} = \frac{124{,}333}{193{,}667} = .642, \text{ or } 64.2\% \ ■$$

14.2 Exercises are on page 626.



14.3 Inference About the Slope of a Linear Regression

In this section, we will learn how to carry out a hypothesis test to determine whether we can infer that two variables are linearly related in the larger population represented by a sample. We will also learn how to use sample regression results to calculate a confidence interval estimate of a population slope.



Hypothesis Test for a Population Slope

The statistical significance of a linear relationship can be evaluated by testing whether or not the population slope is 0. If the slope is 0 in a simple linear regression model, the two variables are not related because changes in the x variable will not lead to changes in the y variable. The usual null hypothesis and alternative hypothesis about β₁, the slope of the population regression line E(Y) = β₀ + β₁x, are

H0: β₁ = 0 (the population slope is 0, so y and x are not linearly related)
Ha: β₁ ≠ 0 (the population slope is not 0, so y and x are linearly related)

The alternative hypothesis may be one-sided or two-sided, although most statistical software uses the two-sided alternative.

The test statistic used to do the hypothesis test is a t-statistic with the same general format that we used in Chapters 12 and 13. That format, and its application to this situation, is

$$t = \frac{\text{Sample statistic} - \text{Null value}}{\text{Standard error}} = \frac{b_1 - 0}{\text{s.e.}(b_1)}$$



This is a standardized statistic for the difference between the sample slope and 0, the null value. Notice that a large value of the sample slope (either positive or negative) relative to its standard error will give a large value of t. If the mathematical assumptions about the population model described in Section 14.1 are correct, the statistic has a t-distribution with n - 2 degrees of freedom. The p-value for the test is determined using that distribution.

It is important to be sure that the necessary conditions are met when using any statistical inference procedure. The necessary conditions for using this test, and how to check them, will be discussed in Section 14.5.

"By hand" calculations of the sample slope and its standard error are cumbersome. Fortunately, the regression analysis of most statistical software includes a t-statistic and a p-value for this significance test.
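For instance, in Python, scipy's linregress function reports the sample slope, its standard error, and the two-sided p-value in one call. A sketch with placeholder arrays (substitute your own data):

import numpy as np
from scipy import stats

# Placeholder data; replace with your own x and y measurements
x = np.array([18, 20, 22, 23, 25, 27, 28, 29, 32, 35], dtype=float)
y = np.array([510, 590, 516, 560, 490, 420, 569, 415, 410, 400], dtype=float)

result = stats.linregress(x, y)
t_stat = result.slope / result.stderr          # (b1 - 0) / s.e.(b1)

print(f"b1 = {result.slope:.3f}, s.e.(b1) = {result.stderr:.3f}")
print(f"t = {t_stat:.2f}, two-sided p-value = {result.pvalue:.4f}")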

formula

Formula for the Sample Slope and Its Standard Error

In case you ever need to compute the values by hand, here are the formulas for the sample slope and its standard error:

$$b_1 = r\,\frac{s_y}{s_x} \qquad \text{s.e.}(b_1) = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}} \qquad \text{where } s = \sqrt{\frac{\text{SSE}}{n - 2}}$$

In the formula for the sample slope, sₓ and s_y are the sample standard deviations of the x and y values, respectively, and r is the correlation between x and y.
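A direct translation of these formulas into Python (an illustrative helper of our own, not a library routine):

import numpy as np

def slope_and_se(x, y):
    """Sample slope b1 = r * (sy/sx) and its standard error, per the formulas above.
    x and y should be numpy arrays of equal length."""
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    b1 = r * y.std(ddof=1) / x.std(ddof=1)
    b0 = y.mean() - b1 * x.mean()
    s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))  # SD from the line
    se_b1 = s / np.sqrt(np.sum((x - x.mean()) ** 2))
    return b1, se_b1

# Example use: b1, se = slope_and_se(np.array([...]), np.array([...]))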



Example 14.6






Hypothesis Test for Driver Age and Sign-Reading Distance  Figure 14.5 (p. 608) for Example 14.5 presents Minitab output for the regression of sign-reading distance (y) and driver age. The part of the output that is used to test the statistical significance of the observed relationship is shown in bold. This line of output gives values for the sample slope, the standard error of the sample slope, the t-statistic, and the p-value for the test of

H0: β₁ = 0 (the population slope is 0, so y and x are not linearly related)
Ha: β₁ ≠ 0 (the population slope is not 0, so y and x are linearly related)

The test statistic is

$$t = \frac{\text{Sample statistic} - \text{Null value}}{\text{Standard error}} = \frac{b_1 - 0}{\text{s.e.}(b_1)} = \frac{-3.0068 - 0}{0.4243} = -7.09$$

The p-value (underlined in the output) is given to three decimal places as .000. This means that the probability is virtually 0 that the sample slope could be as far from 0 or farther than it is if the population slope really is 0. Because the p-value is so small, we can reject the null hypothesis and infer that the linear relationship observed between the two variables in the sample represents a real relationship in the population. ■

technical note

Most statistical software reports a p-value for a two-sided alternative hypothesis when doing a test for whether the slope in the population is 0. It may sometimes make sense to use a one-sided alternative hypothesis instead. In that case, the p-value for the one-sided alternative is (reported p)/2 if the sign of b₁ is consistent with Ha, but is 1 - (reported p)/2 if it is not.
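The conversion is simple enough to express in code. A sketch (the function name and arguments are our own):

def one_sided_p(reported_p, b1, ha_sign):
    """Convert a software-reported two-sided p-value to a one-sided p-value.
    ha_sign is +1 if Ha claims a positive slope, -1 if it claims a negative slope."""
    if (b1 > 0 and ha_sign > 0) or (b1 < 0 and ha_sign < 0):
        return reported_p / 2        # sample slope agrees with the direction of Ha
    return 1 - reported_p / 2        # sample slope contradicts Ha

# Example 14.6: two-sided p = .000 and b1 = -3.0068, so for Ha: slope < 0
print(one_sided_p(0.000, -3.0068, ha_sign=-1))   # essentially 0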






Confidence Interval for the Population Slope

The significance test of whether or not the population slope is 0 tells us only whether we can declare the relationship to be statistically significant. If we decide that the true slope is not 0, we might ask, "What is the value of the slope?" We can answer this question with a confidence interval for β₁, the population slope.

The format for this confidence interval is the same as the general format used in Chapters 10 and 11, which is

Sample statistic ± Multiplier × Standard error

The sample statistic is b₁, the slope of the least-squares regression line for the sample. As has been shown already, the standard error formula is complicated, and we will usually rely on statistical software to determine this value. The "multiplier" will be labeled t* and is determined by using a t-distribution with df = n - 2. Table A.2 can be used to find the multiplier for the desired confidence level.

formula

Formula for Confidence Interval for β₁, the Population Slope

A confidence interval for β₁ is

b₁ ± t* × s.e.(b₁)

The multiplier t* is found by using a t-distribution with n - 2 degrees of freedom and is such that the probability between -t* and +t* equals the confidence level for the interval.



Example 14.7

95% Confidence Interval for Slope Between Age and Sign-Reading Distance  In Figure 14.5 (p. 608), we see that the sample slope is b₁ = -3.01 and s.e.(b₁) = 0.4243. There are n = 30 observations, so df = 28 for finding t*. For a 95% confidence level, t* = 2.05 (see Table A.2). The 95% confidence interval for the population slope is

-3.01 ± 2.05 × 0.4243
-3.01 ± 0.87
-3.88 to -2.14

With 95% confidence, we can estimate that in the population of drivers represented by this sample, the mean sign-reading distance decreases somewhere between 2.14 and 3.88 feet for each one-year increase in age. ■
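A sketch verifying this interval in Python, using scipy to look up the multiplier t* instead of Table A.2:

from scipy import stats

b1, se_b1, n = -3.0068, 0.4243, 30          # from the Minitab output in Figure 14.5
t_star = stats.t.ppf(0.975, df=n - 2)       # about 2.05 for df = 28

margin = t_star * se_b1
print(f"t* = {t_star:.2f}")
print(f"95% CI for the slope: {b1 - margin:.2f} to {b1 + margin:.2f}")  # -3.88 to -2.14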

Thought Question 14.3  In previous chapters, we learned that a confidence interval can be used to determine whether a hypothesized value for a parameter can be rejected. How would you use a confidence interval for the population slope to determine whether there is a statistically significant relationship between x and y? For example, why is the interval that we just computed for the sign-reading example evidence that sign-reading distance and age are related?*

*HINT: What is the null value for the slope? Section 13.5 discusses the connection between confidence intervals and significance tests.







SPSS tip

Calculating a 95% Confidence Interval for the Slope

● Use Analyze > Regression > Linear Regression. Specify the y variable in the Dependent box and specify the x variable in the Independent(s) box.
● Click Statistics and then select "Confidence intervals" under "Regression Coefficients."



Testing Hypotheses About the Correlation Coefficient

In Chapter 5, we learned that the correlation coefficient is 0 when the regression line is horizontal. In other words, if the slope of the regression line is 0, the correlation is 0. This means that the results of a hypothesis test for the population slope can also be interpreted as applying to equivalent hypotheses about the correlation between x and y in the population.

We use different notation to distinguish between a correlation computed for a sample and a correlation within a population. It is commonplace to use the Greek letter ρ (pronounced "rho") to represent the correlation between two variables within a population. Using this notation, null and alternative hypotheses of interest are as follows:

H0: ρ = 0 (x and y are not correlated)
Ha: ρ ≠ 0 (x and y are correlated)

The results of the hypothesis test described before for the population slope β₁ can be used for these hypotheses as well. If we reject H0: β₁ = 0, we also reject H0: ρ = 0. If we decide in favor of Ha: β₁ ≠ 0, we also decide in favor of Ha: ρ ≠ 0.

Many statistical software programs, including Minitab, will give a p-value for testing whether the population correlation is 0 or not. This p-value will be the same as the p-value given for testing whether the population slope is 0 or not.

The following Minitab output is for the relationship between pulse rate and weight in a sample of 35 college women. Notice that .292 is given as the p-value for testing that the slope is 0 (look under P in the regression results) and for testing that the correlation is 0. Because this is not a small p-value, we cannot reject the null hypotheses for the slope and the correlation.



Regression Analysis: Pulse Versus Weight

The regression equation is
Pulse = 57.2 + 0.159 Weight

Predictor     Coef   SE Coef     T      P
Constant     57.17     18.51  3.09  0.004
Weight      0.1591    0.1487  1.07  0.292

Correlations: Pulse, Weight
Pearson correlation of Pulse and Weight = 0.183
P-Value = 0.292
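The equivalence is easy to demonstrate in Python: scipy's correlation test and its regression slope test return the same p-value. A sketch with simulated stand-in data (we do not have the raw pulse/weight sample):

import numpy as np
from scipy import stats

# Simulated stand-in for the pulse/weight data (illustration only)
rng = np.random.default_rng(1)
weight = rng.normal(130, 20, size=35)
pulse = 57 + 0.16 * weight + rng.normal(0, 12, size=35)

reg = stats.linregress(weight, pulse)       # slope test
r, p_corr = stats.pearsonr(weight, pulse)   # correlation test

print(f"p-value for slope = 0:       {reg.pvalue:.4f}")
print(f"p-value for correlation = 0: {p_corr:.4f}")  # identical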






The Effect of Sample Size on Significance

The size of a sample always affects whether a specific observed result achieves statistical significance. For example, r = 0.183 is not a statistically significant correlation for a sample size of n = 35, as in the pulse and weight example, but it would be statistically significant if n = 1000. With very large sample sizes, weak relationships with low correlation values can be statistically significant. The "moral of the story" here is that with a large sample size, it may not be saying much to say that two variables are significantly related. This means only that we think that the correlation is not precisely 0. To assess the practical significance of the result, we should carefully examine the observed strength of the relationship.

14.3 Exercises are on page 627.

technical note



The usual t-statistic for testing whether the population slope is 0 in a linear regression could also be found by using a formula that involves only n = sample size and r = correlation between x and y. The algebraic equivalence is

$$t = \frac{b_1}{\text{s.e.}(b_1)} = \sqrt{n - 2}\,\frac{r}{\sqrt{1 - r^2}}$$

In the output for the pulse rate and body weight example just given, notice that the t-statistic for testing whether the slope β₁ = 0 is t = 1.07. This was calculated as

$$t = \frac{b_1}{\text{s.e.}(b_1)} = \frac{0.1591}{0.1487} = 1.07$$

The sample size is n = 35, and the correlation is r = 0.183, so an equivalent calculation of the t-statistic is

$$t = \sqrt{n - 2}\,\frac{r}{\sqrt{1 - r^2}} = \sqrt{35 - 2}\,\frac{0.183}{\sqrt{1 - 0.183^2}} = 1.07$$

This second method for calculating the t-statistic illustrates two ideas. First, there is a direct link between the correlation value and the t-statistic that is used to test whether the slope is 0. Second, notice that for any fixed value of r, increasing the sample size n will increase the size of the t-statistic. And the larger the value of the t-statistic, the stronger is the evidence against the null hypothesis.
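A short sketch makes the sample-size effect visible by holding r fixed and letting n grow:

import math

def t_from_r(r, n):
    """t-statistic for testing slope = 0, computed from r and n alone."""
    return math.sqrt(n - 2) * r / math.sqrt(1 - r ** 2)

for n in (35, 100, 1000):
    print(n, round(t_from_r(0.183, n), 2))
# n = 35 gives t = 1.07, as in the output above;
# the same r = 0.183 with n = 1000 gives t of about 5.9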



14.4 Predicting y and Estimating Mean y at a Specific x

In this section, we cover two different types of intervals that are used to make inferences about the response variable (y). The first type of interval predicts the value of y for an individual with a specific value of x. For example, we may want to predict the freshman year GPA of a college applicant who has a 3.6 high school GPA. The second type of interval estimates the mean value of y for a population of individuals who all have the same specific value of x. As an example, we may want to estimate the mean (average) freshman year GPA of all college applicants who have a 3.6 high school GPA.



Predicting the Value of y for an Individual

An important use of a regression equation is to estimate or predict the unknown value of a response variable for an individual with a known specific value of the explanatory variable. Using the data described in Example 14.5 (p. 608), for instance, we can predict the maximum distance at which an individual can read a highway sign by substituting his or her age for x in the sample regression equation. Consider a person who is 21 years old. The predicted distance for such a person is approximately ŷ = 577 - 3.01(21) = 513.79, or about 514 feet.

There will be variation among 21-year-olds with regard to the sign-reading distance, so the predicted distance of 513.79 feet is not likely to be the exact distance for the next 21-year-old who views the sign. Rather than predicting that the distance will be exactly 513.79 feet, we should instead predict that the distance will be within a particular interval of values.

A 95% prediction interval describes the values of the response variable (y) for 95% of all individuals with a particular value of x. This interval can be interpreted in two equivalent ways:

1. The 95% prediction interval estimates the central 95% of the values of y for members of the population with a specified value of x.
2. The probability is .95 that a randomly selected individual from the population with a specified value of x falls into the corresponding 95% prediction interval.

We don't always have to use a 95% prediction interval. A prediction interval for the value of the response variable (y) can be found for any specified central percentage of a population with a specified value of x. For example, a 75% prediction interval describes the central 75% of a population of individuals with a particular value of the explanatory variable (x).



definition



A prediction interval estimates the value of y for an individual with a particular value of x, or equivalently, the range of values of the response variable for a

specified central percentage of a population with a particular value of x.



Notice that a prediction interval differs conceptually from a confidence interval. A confidence interval estimates an unknown population parameter,

which is a numerical characteristic or summary of the population. An example

in this chapter is a confidence interval for the slope of the population line. A

prediction interval, however, does not estimate a parameter; instead, it estimates the potential data value for an individual. Equivalently, it describes an interval into which a specified percentage of the population may fall.






Thought Question 14.4  If we knew the population parameters β₀, β₁, and σ, under the usual regression assumptions, we would know that the population of y values at a specific x value was normal with mean β₀ + β₁x and standard deviation σ. In that case, what interval would cover the central 95% of the y values for that x value? Use your answer to explain why a prediction interval would not have zero width even with complete population details.*

*HINT: Remember the Empirical Rule, and also recall that the regression equation gives the mean y for a specific x.



As with most regression calculations, the "by hand" formulas for prediction intervals are formidable. Statistical software can be used to create the interval. Figure 14.6 shows Minitab output that includes the 95% prediction intervals for three different ages (21, 30, and 45). The intervals are toward the bottom of the display in a column labeled "95% PI." The ages for which the intervals were computed are shown at the bottom of the output. (Note: The term fit is a synonym for ŷ, the estimate of the average response at the specific x value.) From Figure 14.6, here is what we can conclude:





● The probability is .95 that a randomly selected 21-year-old will read the sign at somewhere between 406.69 and 620.39 feet.
● The probability is .95 that a randomly selected 30-year-old will read the sign at somewhere between 381.26 and 591.69 feet.
● The probability is .95 that a randomly selected 45-year-old will read the sign at somewhere between 337.63 and 545.12 feet.



The regression equation is
Distance = 577 - 3.01 Age

Predictor      Coef   SE Coef      T      P
Constant     576.68     23.47  24.57  0.000
Age         -3.0068    0.4243  -7.09  0.000

S = 49.76   R-Sq = 64.2%   R-Sq(adj) = 62.9%

Analysis of Variance
Source          DF      SS      MS      F      P
Regression       1  124333  124333  50.21  0.000
Residual Error  28   69334    2476
Total           29  193667

Unusual Observations
Obs   Age  Distance     Fit  SE Fit  Residual  St Resid
27   75.0    460.00  351.17   13.65    108.83     2.27R

R denotes an observation with a large standardized residual

Predicted Values for New Observations
New Obs     Fit  SE Fit          95.0% CI          95.0% PI
1        513.54   15.64  (481.50, 545.57)  (406.69, 620.39)
2        486.48   12.73  (460.41, 512.54)  (381.26, 591.69)
3        441.37    9.44  (422.05, 460.70)  (337.63, 545.12)

Values of Predictors for New Observations
New Obs   Age
1        21.0
2        30.0
3        45.0

Figure 14.6 ❚ Minitab output showing prediction intervals of distance














We can also interpret each interval as an estimate of the sign-reading distances for the central 95% of a population of drivers with a specified age. For instance, about 95% of all drivers 21 years old will be able to read the sign at a distance somewhere between roughly 407 and 620 feet.

With Minitab, we can describe any central percentage of the population that we wish. For example, here are 50% prediction intervals for the sign-reading distance at the three specific ages we considered above.

Age     Fit          50.0% PI
21   513.54  (477.89, 549.18)
30   486.48  (451.38, 521.58)
45   441.37  (406.76, 475.98)

For each specific age, the 50% prediction interval estimates the central 50% of the maximum sign-reading distances in a population of drivers with that age. For example, we can estimate that 50% of drivers 21 years old would have a maximum sign-reading distance somewhere between about 478 feet and 549 feet. The distances for the other 50% of 21-year-old drivers would be predicted to be outside this range, with 25% above about 549 feet and 25% below about 478 feet.



technical note

The formula for the prediction interval for y at a specific x is

$$\hat{y} \pm t^* \sqrt{s^2 + [\text{s.e.(fit)}]^2}$$

where

$$\text{s.e.(fit)} = s \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$

The multiplier t* is found by using a t-distribution with n - 2 degrees of freedom and is such that the probability between -t* and +t* equals the desired level for the interval.

Note:
● The s.e.(fit), and thus the width of the interval, depends on how far the specified x value is from x̄. The farther the specific x is from the mean, the wider is the interval.
● When n is large, s.e.(fit) will be small, and the prediction interval will be approximately ŷ ± t*s.
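A sketch implementing this formula directly (an illustrative function of our own, assuming the simple-regression conditions hold):

import numpy as np
from scipy import stats

def prediction_interval(x, y, x_new, level=0.95):
    """Prediction interval for y at x_new, following the formula above."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    b1 = np.corrcoef(x, y)[0, 1] * y.std(ddof=1) / x.std(ddof=1)  # sample slope
    b0 = y.mean() - b1 * x.mean()                                 # sample intercept
    s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))       # SD from the line
    se_fit = s * np.sqrt(1 / n + (x_new - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2))
    t_star = stats.t.ppf((1 + level) / 2, df=n - 2)               # two-sided multiplier
    half_width = t_star * np.sqrt(s ** 2 + se_fit ** 2)
    fit = b0 + b1 * x_new
    return fit - half_width, fit + half_width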


