4.11 A Quadratic (Second-Order) Model with a Quantitative Predictor

202 Chapter 4 Multiple Regression Models
AEROBIC

Table 4.2 Data on immunity and fitness level of 30 subjects
(IgG, y, in milligrams; maximal oxygen uptake, x, in milliliters per kilogram)

Subject   IgG (y)   Max. Oxygen Uptake (x)     Subject   IgG (y)   Max. Oxygen Uptake (x)
   1         881          34.6                    16       1,660         52.5
   2       1,290          45.0                    17       2,121         69.9
   3       2,147          62.3                    18       1,382         38.8
   4       1,909          58.9                    19       1,714         50.6
   5       1,282          42.5                    20       1,959         69.4
   6       1,530          44.3                    21       1,158         37.4
   7       2,067          67.9                    22         965         35.1
   8       1,982          58.5                    23       1,456         43.0
   9       1,019          35.6                    24       1,273         44.1
  10       1,651          49.6                    25       1,418         49.8
  11         752          33.0                    26       1,743         54.4
  12       1,687          52.0                    27       1,997         68.5
  13       1,782          61.4                    28       2,177         69.5
  14       1,529          50.2                    29       1,965         63.0
  15         969          34.1                    30       1,264         43.2

(c) Graph the prediction equation and assess how well the model fits the data,
both visually and numerically.
(d) Interpret the β estimates.
(e) Is the overall model useful (at α = .01) for predicting IgG?
(f) Is there sufficient evidence of concave downward curvature in the immunity–fitness level relationship? Test using α = .01.

Solution
(a) A scatterplot for the data in Table 4.2, produced using SPSS, is shown in
Figure 4.13. The figure illustrates that immunity appears to increase in a
curvilinear manner with fitness level. This provides some support for the
inclusion of the quadratic term $x^2$ in the model.
(b) We also used SPSS to fit the model to the data in Table 4.2. Part of the SPSS
regression output is displayed in Figure 4.14. The least squares estimates of the
β parameters (highlighted at the bottom of the printout) are $\hat{\beta}_0 = -1{,}464.404$,
$\hat{\beta}_1 = 88.307$, and $\hat{\beta}_2 = -.536$. Therefore, the equation that minimizes the SSE
for the data is
$\hat{y} = -1{,}464.4 + 88.307x - .536x^2$
(c) Figure 4.15 is a MINITAB graph of the least squares prediction equation.
Note that the graph provides a good fit to the data in Table 4.2. A numerical
measure of fit is obtained with the adjusted coefficient of determination, $R_a^2$.
From the SPSS printout, $R_a^2 = .933$. This implies that about 93% of the sample
variation in IgG (y) can be explained by the quadratic model (after adjusting
for sample size and degrees of freedom).
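
The fit described in parts (b) and (c) can be sketched without a statistics package. The following is a minimal illustration, assuming the Table 4.2 values as transcribed here and solving the normal equations directly; the helper names `solve` and `fit_quadratic` are hypothetical and are not part of SPSS or MINITAB. If the data are transcribed correctly, the printed estimates should be close to those on the SPSS printout.

```python
# Least squares fit of y = b0 + b1*x + b2*x^2 via the normal equations.
# Data: Table 4.2, subjects 1 to 30 (IgG in milligrams, uptake in ml/kg).
igg = [881, 1290, 2147, 1909, 1282, 1530, 2067, 1982, 1019, 1651,
       752, 1687, 1782, 1529, 969, 1660, 2121, 1382, 1714, 1959,
       1158, 965, 1456, 1273, 1418, 1743, 1997, 2177, 1965, 1264]
vo2 = [34.6, 45.0, 62.3, 58.9, 42.5, 44.3, 67.9, 58.5, 35.6, 49.6,
       33.0, 52.0, 61.4, 50.2, 34.1, 52.5, 69.9, 38.8, 50.6, 69.4,
       37.4, 35.1, 43.0, 44.1, 49.8, 54.4, 68.5, 69.5, 63.0, 43.2]


def solve(a, b):
    """Solve a @ z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z


def fit_quadratic(x, y):
    """Return (b0, b1, b2, adjusted R^2) for y = b0 + b1*x + b2*x^2."""
    rows = [[1.0, xi, xi * xi] for xi in x]  # design matrix X
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    b = solve(xtx, xty)
    fitted = [sum(bi * ri for bi, ri in zip(b, r)) for r in rows]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ybar = sum(y) / len(y)
    ss_yy = sum((yi - ybar) ** 2 for yi in y)
    n, k = len(y), 2  # k = number of predictors (x and x^2)
    adj_r2 = 1 - (sse / (n - k - 1)) / (ss_yy / (n - 1))
    return b[0], b[1], b[2], adj_r2


b0, b1, b2, adj_r2 = fit_quadratic(vo2, igg)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}, adj R^2 = {adj_r2:.3f}")
```

Solving the normal equations by hand is adequate for a three-parameter model on this scale; for larger models a numerically more stable routine from a statistics library would be the safer choice.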

Figure 4.13 SPSS scatterplot for data of Example 4.7

Figure 4.14 SPSS output for quadratic model of Example 4.7

(d) The interpretation of the estimated coefficients in a quadratic model must be
undertaken cautiously. First, the estimated y-intercept, $\hat{\beta}_0$, can be meaningfully
interpreted only if the range of the independent variable includes zero, that
is, if x = 0 is included in the sampled range of x. Although $\hat{\beta}_0 = -1{,}464.4$
seems to imply that the estimated immunity level is negative when x = 0, this
zero point is not in the range of the sample (the lowest value of maximal
oxygen uptake x is 33 milliliters per kilogram), and the value is nonsensical
(a person with 0 aerobic fitness level); thus, the interpretation of $\hat{\beta}_0$ is not
meaningful.
The estimated coefficient of x is $\hat{\beta}_1 = 88.31$, but it no longer represents
a slope in the presence of the quadratic term $x^2$.∗ The estimated coefficient of
the first-order term x will not, in general, have a meaningful interpretation in
the quadratic model.
The sign of the coefficient, $\hat{\beta}_2 = -.536$, of the quadratic term, $x^2$, is
the indicator of whether the curve is concave downward (mound-shaped) or
concave upward (bowl-shaped). A negative $\hat{\beta}_2$ implies downward concavity,
as in this example (Figure 4.15), and a positive $\hat{\beta}_2$ implies upward concavity.
Rather than interpreting the numerical value of $\hat{\beta}_2$ itself, we utilize a graphical
representation of the model, as in Figure 4.15, to describe the model.

Figure 4.15 MINITAB graph of least squares fit for the quadratic model

Note that Figure 4.15 implies that the estimated immunity level (IgG)
is leveling off as the aerobic fitness levels increase beyond 70 milliliters per
kilogram. In fact, the concavity of the model would lead to decreasing IgG
estimates if we were to display the model out to x = 120 and beyond (see
Figure 4.16). However, model interpretations are not meaningful outside the
range of the independent variable, which has a maximum value of 69.9 in this
example. Thus, although the model appears to support the hypothesis that the
rate of increase of IgG with maximal oxygen uptake decreases for subjects with
aerobic fitness levels near the high end of the sampled values, the conclusion
that IgG will actually begin to decrease for very large aerobic fitness levels
would be a misuse of the model, since no subjects with x-values of 70 or more
were included in the sample.
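
To make the point about the changing slope concrete, the short sketch below evaluates the derivative of the fitted curve, b1 + 2·b2·x, at a few uptake values and locates the turning point of the fitted parabola. It uses only the estimates reported in the text; the function names `slope` and `turning_point` are illustrative assumptions.

```python
# Estimates reported on the SPSS printout (Figure 4.14).
B1 = 88.307   # coefficient of x
B2 = -0.536   # coefficient of x^2
X_MIN, X_MAX = 33.0, 69.9  # sampled range of maximal oxygen uptake


def slope(x):
    """Slope of y-hat = b0 + b1*x + b2*x^2 at x (first derivative)."""
    return B1 + 2 * B2 * x


def turning_point():
    """Value of x where the fitted parabola stops rising (slope = 0)."""
    return -B1 / (2 * B2)


for x in (X_MIN, 50.0, X_MAX):
    print(f"x = {x:5.1f}: estimated slope = {slope(x):6.2f} mg per ml/kg")

x_star = turning_point()
print(f"turning point at x = {x_star:.1f}; inside sampled range: "
      f"{X_MIN <= x_star <= X_MAX}")
```

The slope stays positive across the whole sampled range and the turning point falls above the largest observed x, which is why the data support a slowing rate of increase but say nothing about an actual decline.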
∗ For students with knowledge of calculus, note that the slope of the quadratic model is the first derivative
$\partial y/\partial x = \beta_1 + 2\beta_2 x$. Thus, the slope varies as a function of x, rather than remaining constant as in the
straight-line model.

Figure 4.16 Potential misuse of quadratic model (vertical axis: IgG in milligrams; horizontal axis: maximal oxygen uptake in ml/kg, shown out to 120; annotations: "Use model within range of independent variable... not outside range of independent variable" and "Nonsensical predictions")

(e) To test whether the quadratic model is statistically useful, we conduct the
global F-test:
$H_0$: $\beta_1 = \beta_2 = 0$
$H_a$: At least one of the above coefficients is nonzero
From the SPSS printout, Figure 4.14, the test statistic is F = 203.159 with an
associated p-value of approximately 0. For any reasonable α, we reject $H_0$ and conclude that
the overall model is a useful predictor of immunity level, y.
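
As a rough consistency check on the global F-test, the sketch below recovers $R^2$ from the reported adjusted value, rebuilds the F statistic from it, and evaluates the upper-tail probability. Two assumptions are worth flagging: the degrees of freedom (2 and 27) are inferred from n = 30 subjects and two predictors, and the closed-form tail probability used here is valid only when the numerator has exactly 2 degrees of freedom.

```python
n, k = 30, 2            # 30 subjects; predictors are x and x^2
adj_r2 = 0.933          # adjusted R^2 reported in the text
f_reported = 203.159    # F statistic reported in the text
df_error = n - (k + 1)  # 27 error degrees of freedom

# Undo the adjustment: R^2 = 1 - (1 - Ra^2) * (n - k - 1) / (n - 1).
r2 = 1 - (1 - adj_r2) * df_error / (n - 1)

# Global F statistic expressed through R^2.
f_from_r2 = (r2 / k) / ((1 - r2) / df_error)


def f_upper_tail_2df(f, df2):
    """P(F > f) for an F distribution with 2 numerator df (closed form)."""
    return (1 + 2 * f / df2) ** (-df2 / 2)


p_value = f_upper_tail_2df(f_reported, df_error)
print(f"R^2 = {r2:.4f}, F rebuilt from R^2 = {f_from_r2:.1f}")
print(f"p-value for F = {f_reported}: {p_value:.2e}")
```

The rebuilt F differs slightly from the printout value because the adjusted $R^2$ is rounded to three decimals; the tail probability is so small that software displays it as 0.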
(f) Figure 4.15 shows concave downward curvature in the relationship between
immunity level and aerobic fitness level in the sample of 30 data points. To
determine if this type of curvature exists in the population, we want to test
$H_0$: $\beta_2 = 0$ (no curvature in the response curve)
$H_a$: $\beta_2 < 0$ (downward concavity exists in the response curve)
The test statistic for testing $\beta_2$, highlighted on the SPSS printout (Figure 4.14),
is t = −3.39 and the associated two-tailed p-value is .002. Since this is a
one-tailed test, the appropriate p-value is .002/2 = .001. Because α = .01 exceeds
this p-value, we reject $H_0$. Thus, there is very strong evidence of downward curvature in
the population, that is, immunity level (IgG) increases more slowly per unit
increase in maximal oxygen uptake for subjects with high aerobic fitness than
for those with low fitness levels.
Note that the SPSS printout in Figure 4.14 also provides the t-test
statistics and corresponding two-tailed p-values for the tests of $H_0$: $\beta_0 = 0$ and
$H_0$: $\beta_1 = 0$. Since the interpretation of these parameters is not meaningful for
this model, the tests are not of interest.
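
The halving step in part (f) works only because the sign of the test statistic agrees with the direction of the alternative hypothesis. A small helper, sketched here under that reading (the function name and its `alternative` argument are hypothetical, not SPSS options), makes the rule explicit.

```python
def one_tailed_p(t_stat, two_tailed_p, alternative):
    """Convert a two-tailed p-value from a t-test to a one-tailed p-value.

    alternative: "less" for Ha: beta < 0, "greater" for Ha: beta > 0.
    """
    if alternative not in ("less", "greater"):
        raise ValueError("alternative must be 'less' or 'greater'")
    # Does the observed statistic point in the direction stated by Ha?
    agrees = t_stat < 0 if alternative == "less" else t_stat > 0
    half = two_tailed_p / 2
    return half if agrees else 1 - half


alpha = 0.01
p = one_tailed_p(t_stat=-3.39, two_tailed_p=0.002, alternative="less")
print(f"one-tailed p-value = {p:.3f}; reject H0 at alpha = {alpha}: {p < alpha}")
```

Had the alternative been upward curvature, the same printout would give a one-tailed p-value near 1, and the null hypothesis would not be rejected.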

4.11 Exercises
4.34 Assertiveness and leadership. Management professors at Columbia University examined the
relationship between assertiveness and leadership (Journal of Personality and Social Psychology,
February 2007). The sample comprised 388 people enrolled in a full-time MBA program. Based on
answers to a questionnaire, the researchers measured two variables for each subject: assertiveness
score (x) and leadership ability score (y). A quadratic regression model was fit to the data with
the following results:

INDEPENDENT VARIABLE   β ESTIMATE   t-VALUE   p-VALUE
x                         .57         2.55      .01
$x^2$                    −.88        −3.97     < .01
Model $R^2$ = .12

(a) Conduct a test of overall model utility. Use α = .05.
(b) The researchers hypothesized that leadership ability will increase at a decreasing rate with
assertiveness. Set up the null and alternative hypotheses to test this theory.
(c) Use the reported results to conduct the test, part b. Give your conclusion (at α = .05) in
the words of the problem.

4.35 Urban population estimation using satellite images. Refer to the Geographical Analysis
(January 2007) study that demonstrated the use of satellite image maps for estimating urban
population, Exercise 4.7 (p. 185). A first-order model for census block population density (y)
was fit as a function of proportion of block with low-density residential areas ($x_1$) and
proportion of block with high-density residential areas ($x_2$). Now consider a second-order
model for y.
(a) Write the equation of a quadratic model for y as a function of $x_1$.
(b) Identify the β term in the model that allows for a curvilinear relationship between y
and $x_1$.
(c) Suppose that the rate of increase of population density (y) with proportion of block
with low-density areas ($x_1$) is greater for lower proportions than for higher proportions.
Will the term you identified in part b be positive or negative? Explain.

4.36 Cars with catalytic converters. A quadratic model was applied to motor vehicle toxic
emissions data collected over 15 recent years in Mexico City (Environmental Science and
Engineering, September 1, 2000). The following equation was used to predict the percentage
(y) of motor vehicles without catalytic converters in the Mexico City fleet for a given year
(x): $\hat{y} = 325{,}790 - 321.67x + 0.794x^2$.
(a) Explain why the value $\hat{\beta}_0 = 325{,}790$ has no practical interpretation.
(b) Explain why the value $\hat{\beta}_1 = -321.67$ should not be interpreted as a slope.
(c) Examine the value of $\hat{\beta}_2$ to determine the nature of the curvature (upward or
downward) in the sample data.
(d) The researchers used the model to estimate "that just after the year 2021 the fleet of cars
with catalytic converters will completely disappear." Comment on the danger of using
the model to predict y in the year 2021.

4.37 Carp diet study. Fisheries Science (February 1995) reported on a study of the variables that
affect endogenous nitrogen excretion (ENE) in carp raised in Japan. Carp were divided
into groups of 2–15 fish each according to body weight, and each group was placed in a separate
tank. The carp were then fed a protein-free diet three times daily for a period of 20
days. One day after terminating the feeding experiment, the amount of ENE in each tank
was measured. The accompanying table gives the mean body weight (in grams) and ENE amount
(in milligrams per 100 grams of body weight per day) for each carp group.

CARP
TANK   BODY WEIGHT x   ENE y
  1        11.7         15.3
  2        25.3          9.3
  3        90.2          6.5
  4       213.0          6.0
  5        10.2         15.7
  6        17.6         10.0
  7        32.6          8.6
  8        81.3          6.4
  9       141.5          5.6
 10       285.7          6.0

Source: Watanabe, T., and Ohta, M. "Endogenous nitrogen excretion and non-fecal energy losses in
carp and rainbow trout." Fisheries Science, Vol. 61, No. 1, Feb. 1995, p. 56 (Table 5).

(a) Graph the data in a scatterplot. Do you detect a pattern?
(b) The quadratic model $E(y) = \beta_0 + \beta_1 x + \beta_2 x^2$ was fit to the data using MINITAB.
The MINITAB printout accompanies this exercise. Conduct the test $H_0$: $\beta_2 = 0$ against
$H_a$: $\beta_2 \neq 0$ using α = .10. Give the conclusion in the words of the problem.

MINITAB output for Exercise 4.37

4.38 Estimating change-point dosage. A standard method for studying toxic substances and their
effects on humans is to observe the responses of rodents exposed to various doses of the substance
over time. In the Journal of Agricultural, Biological, and Environmental Statistics (June 2005),
researchers used least squares regression to estimate the "change-point" dosage, defined as the
largest dose level that has no adverse effects. Data were obtained from a dose–response study of
rats exposed to the toxic substance aconiazide. A sample of 50 rats was evenly divided into five
dosage groups: 0, 100, 200, 500, and 750 milligrams per kilogram of body weight. The dependent
variable y measured was the weight change (in grams) after a 2-week exposure. The researchers fit
the quadratic model $E(y) = \beta_0 + \beta_1 x + \beta_2 x^2$, where x = dosage level, with the
following results: $\hat{y} = 10.25 + .0053x - .0000266x^2$.
(a) Construct a rough sketch of the least squares prediction equation. Describe the nature of
the curvature in the estimated model.
(b) Estimate the weight change (y) for a rat given a dosage of 500 mg/kg of aconiazide.
(c) Estimate the weight change (y) for a rat given a dosage of 0 mg/kg of aconiazide. (This
dosage is called the "control" dosage level.)
(d) Of the five groups in the study, find the largest dosage level x that yields an estimated weight
change that is closest to but below the estimated weight change for the control group.
This value is the "change-point" dosage.

4.39 Onset of a neurodegenerative disorder. Spinocerebellar ataxia type 1 (SCA1) is an inherited
neurodegenerative disorder characterized by dysfunction of the brain. From a DNA analysis
of SCA1 chromosomes, researchers discovered the presence of repeat gene sequences (Cell
Biology, February 1995). In general, the more repeat sequences observed, the earlier the onset
of the disease (in years of age). The following scatterplot shows this relationship for data
collected on 113 individuals diagnosed with SCA1.

[Scatterplot: age of onset (vertical axis, 0 to 80) versus number of repeats (horizontal axis, 40 to 90)]

(a) Suppose you want to model the age y of onset of the disease as a function of number
x of repeat gene sequences in SCA1 chromosomes. Propose a quadratic model for y.
(b) Will the sign of $\beta_2$ in the model, part a, be positive or negative? Base your decision on
the results shown in the scatterplot.
(c) The researchers reported a correlation of r = −.815 between age and number of repeats.
Since $r^2 = (-.815)^2 = .664$, they concluded that about "66% of the variability in the age
of onset can be accounted for by the number of repeats." Does this statement apply to the
quadratic model $E(y) = \beta_0 + \beta_1 x + \beta_2 x^2$? If not, give the equation of the model
for which it does apply.

4.40 Failure times of silicon wafer microchips. Researchers at National Semiconductor
experimented with tin-lead solder bumps used to manufacture silicon wafer integrated circuit chips
(International Wafer Level Packaging Conference, November 3–4, 2005). The failure times of
the microchips (in hours) were determined at different solder temperatures (degrees Centigrade).
The data for one experiment are given in the accompanying table. The researchers want to predict
failure time (y) based on solder temperature (x).
(a) Construct a scatterplot for the data. What type of relationship, linear or curvilinear,
appears to exist between failure time and solder temperature?
(b) Fit the model, $E(y) = \beta_0 + \beta_1 x + \beta_2 x^2$, to the data. Give the least squares
prediction equation.
(c) Conduct a test to determine if there is upward curvature in the relationship between failure
time and solder temperature. (Use α = .05.)

WAFER
TEMPERATURE (°C)   TIME TO FAILURE (hours)
      165                   200
      162                   200
      164                  1200
      158                   500
      158                   600
      159                   750
      156                  1200
      157                  1500
      152                   500
      147                   500
      149                  1100
      149                  1150
      142                  3500
      142                  3600
      143                  3650
      133                  4200
      132                  4800
      132                  5000
      134                  5200
      134                  5400
      125                  8300
      123                  9700

Source: Gee, S., & Nguyen, L. "Mean time to failure in wafer level–CSP packages with SnPb and
SnAgCu solder bumps," International Wafer Level Packaging Conference, San Jose, CA,
Nov. 3–4, 2005 (adapted from Figure 7).

4.41 Optimizing semiconductor material processing. Fluorocarbon plasmas are used in the production
of semiconductor materials. In the Journal of Applied Physics (December 1, 2000), electrical
engineers at Nagoya University (Japan) studied the kinetics of fluorocarbon plasmas in order to
optimize material processing. In one portion of the study, the surface production rate of
fluorocarbon radicals emitted from the production process was measured at various points in time
(in milliseconds) after the radio frequency power was turned off. The data are given in the
accompanying table. Consider a model relating surface production rate (y) to time (x).
(a) Graph the data in a scattergram. What trend do you observe?
(b) Fit a quadratic model to the data. Give the least squares prediction equation.
(c) Is there sufficient evidence of upward curvature in the relationship between surface
production rate and time after turnoff? Use α = .05.

RADICALS
RATE     TIME
 1.00     0.1
 0.80     0.3
 0.40     0.5
 0.20     0.7
 0.05     0.9
 0.00     1.1
−0.05     1.3
−0.02     1.5
 0.00     1.7
−0.10     1.9
−0.15     2.1
−0.05     2.3
−0.13     2.5
−0.08     2.7
 0.00     2.9

Source: Takizawa, K., et al. "Characteristics of C3 radicals in high-density C4F8 plasmas studied
by laser-induced fluorescence spectroscopy," Journal of Applied Physics, Vol. 88, No. 11,
Dec. 1, 2000 (Figure 7). Reprinted with permission from Journal of Applied Physics. Copyright ©
2000, American Institute of Physics.

4.42 Public perceptions of health risks. In the Journal of Experimental Psychology: Learning,
Memory, and Cognition (July 2005), University of Basel (Switzerland) psychologists tested the
ability of people to judge risk of an infectious disease. The researchers asked German college
students to estimate the number of people who are infected with a certain disease in a typical
year. The median estimates as well as the actual incidence rate for each in a sample of 24
infections are provided in the accompanying table. Consider the quadratic model,
$E(y) = \beta_0 + \beta_1 x + \beta_2 x^2$, where y = actual incidence rate and x = estimated rate.
(a) Fit the quadratic model to the data, then conduct a test to determine if incidence rate is
curvilinearly related to estimated rate. (Use α = .05.)
(b) Construct a scatterplot for the data. Locate the data point for Botulism on the graph.
What do you observe?
(c) Repeat part a, but omit the data point for Botulism from the analysis. Has the fit of the
model improved? Explain.

INFECTION
INFECTION            INCIDENCE RATE   ESTIMATE
Polio                       0.25         300
Diphtheria                  1           1000
Trachoma                    1.75         691
Rabbit Fever                2            200
Cholera                     3             17.5
Leprosy                     5              0.8
Tetanus                     9           1000
Hemorrhagic Fever          10            150
Trichinosis                22            326.5
Undulant Fever             23            146.5
Well’s Disease             39            370
Gas Gangrene               98            400
Parrot Fever              119            225
Typhoid                   152            200
Q Fever                   179            200
Malaria                   936            400
Syphilis                 1514           1500
Dysentery                1627           1000
Gonorrhea                2926           6000
Meningitis               4019           5000
Tuberculosis            12619           1500
Hepatitis               14889          10000
Gastroenteritis        203864          37000
Botulism                   15          37500

Source: Hertwig, R., Pachur, T., & Kurzenhauser, S. "Judgments of risk frequencies: Tests of
possible cognitive mechanisms," Journal of Experimental Psychology: Learning, Memory, and
Cognition, Vol. 31, No. 4, July 2005 (Table 1). Copyright © 2005 American Psychological
Association, reprinted with permission.

4.12 More Complex Multiple Regression Models (Optional)
In the preceding sections, we have demonstrated the methods of multiple regression
analysis by fitting several basic models, including a first-order model, a quadratic
model, and an interaction model. In this optional section we introduce more
advanced models for those who do not cover the more thorough discussion of model
building in Chapter 5.

Models with Quantitative x’s We begin with a discussion of models using
quantitative independent variables. We have already encountered several basic
models of this type in previous sections. These models are summarized in the
following boxes.

A First-Order Model Relating E(y) to Five Quantitative x’s
$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_5 x_5$

A Quadratic (Second-Order) Model Relating E(y) to One Quantitative x
$E(y) = \beta_0 + \beta_1 x + \beta_2 x^2$

An Interaction Model Relating E(y) to Two Quantitative x’s
$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$

Now, we consider a model for E(y) that incorporates both interaction and
curvature. Suppose E(y) is related to two quantitative x’s, $x_1$ and $x_2$, by the
equation:
$E(y) = 1 + 7x_1 - 10x_2 + 5x_1 x_2 - x_1^2 + 3x_2^2$
Note that this model contains all of the terms in the interaction model, plus
the second-order terms, $x_1^2$ and $x_2^2$. Figure 4.17 shows a graph of the relationship
between E(y) and $x_1$ for $x_2$ = 0, 1, and 2. You can see that there are three curvilinear
relationships, one for each value of $x_2$ held fixed, and the curves have different
shapes. The model $E(y) = 1 + 7x_1 - 10x_2 + 5x_1 x_2 - x_1^2 + 3x_2^2$ is an example of a
complete second-order model in two quantitative independent variables. A complete
second-order model contains all of the terms in a first-order model and, in addition,
the second-order terms involving cross-products (interaction terms) and squares of
the independent variables. (Note that an interaction model is a special case of a
second-order model, where the β coefficients of $x_1^2$ and $x_2^2$ are both equal to 0.)

Figure 4.17 Graph of $E(y) = 1 + 7x_1 - 10x_2 + 5x_1 x_2 - x_1^2 + 3x_2^2$ (vertical axis: y, from −30 to 70; horizontal axis: $x_1$, from 0 to 10; one curve each for $x_2$ = 0, 1, and 2)
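
The three curves in Figure 4.17 can be reproduced numerically. The sketch below simply evaluates the example equation at fixed values of $x_2$; the function names `mean_response` and `peak_x1` are illustrative choices, and the peak location comes from setting the derivative with respect to $x_1$ (7 + 5·x2 − 2·x1) to zero.

```python
def mean_response(x1, x2):
    """E(y) = 1 + 7*x1 - 10*x2 + 5*x1*x2 - x1^2 + 3*x2^2 from the text."""
    return 1 + 7 * x1 - 10 * x2 + 5 * x1 * x2 - x1 ** 2 + 3 * x2 ** 2


def peak_x1(x2):
    """Value of x1 that maximizes E(y) when x2 is held fixed."""
    return (7 + 5 * x2) / 2


for x2 in (0, 1, 2):
    curve = [mean_response(x1, x2) for x1 in range(0, 11, 2)]
    print(f"x2 = {x2}: E(y) at x1 = 0, 2, ..., 10 -> {curve}; "
          f"peak at x1 = {peak_x1(x2)}")
```

Because of the interaction term, the peak shifts to the right as $x_2$ increases, which is why the three curves have different shapes rather than being vertical shifts of one another.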

A Complete Second-Order Model with Two Quantitative x’s
$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2$

How can you choose an appropriate model to fit a set of quantitative data?
Since most relationships in the real world are curvilinear (at least to some extent),
a good first choice would be a second-order model. If you are fairly certain that the
relationships between E(y) and the individual quantitative independent variables
are approximately first-order and that the independent variables do not interact,
you could select a first-order model for the data. If you have prior information
that suggests there is moderate or very little curvature over the region in which the
independent variables are measured, you could use the interaction model described
previously. However, keep in mind that for all multiple regression models, the
number of data points must exceed the number of parameters in the model. Thus,
you may be forced to use a first-order model rather than a second-order model
simply because you do not have sufficient data to estimate all of the parameters in
the second-order model.
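
The data requirement mentioned above can be checked by counting β parameters. The following sketch assumes the usual forms with k quantitative predictors: a first-order model has 1 + k parameters, an interaction model adds one cross-product per pair of predictors, and a complete second-order model also adds k squared terms. The function names are illustrative.

```python
from math import comb


def params_first_order(k):
    """Intercept plus one linear term per predictor."""
    return 1 + k


def params_interaction(k):
    """First-order terms plus all two-way cross-products."""
    return 1 + k + comb(k, 2)


def params_complete_second_order(k):
    """Interaction model plus one squared term per predictor."""
    return 1 + 2 * k + comb(k, 2)


def enough_data(n_points, n_params):
    """The number of data points must exceed the number of parameters."""
    return n_points > n_params


for k in (1, 2, 3, 5):
    print(f"k = {k}: first-order {params_first_order(k)}, "
          f"interaction {params_interaction(k)}, "
          f"complete second-order {params_complete_second_order(k)}")
```

For k = 2 this gives the six parameters $\beta_0$ through $\beta_5$ shown in the box above; the count grows quickly with k, which is the practical reason a first-order model is sometimes the only feasible choice.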
A practical example of choosing and fitting a linear model with two quantitative
independent variables follows.

Example 4.8

Although a regional express delivery service bases the charge for shipping a package
on the package weight and distance shipped, its profit per package depends on
the package size (volume of space that it occupies) and the size and nature of the
load on the delivery truck. The company recently conducted a study to investigate
the relationship between the cost, y, of shipment (in dollars) and the variables
that control the shipping charge—package weight, x1 (in pounds), and distance
shipped, x2 (in miles). Twenty packages were randomly selected from among the
large number received for shipment and a detailed analysis of the cost of shipment
was made for each package, with the results shown in Table 4.3.

EXPRESS

Table 4.3 Cost of shipment data for Example 4.8
(weight $x_1$ in pounds; distance $x_2$ in miles; cost y in dollars)

Package   Weight x1   Distance x2   Cost y     Package   Weight x1   Distance x2   Cost y
   1        5.9           47         2.60        11        5.1          240        11.00
   2        3.2          145         3.90        12        2.4          209         5.00
   3        4.4          202         8.00        13         .3          160         2.00
   4        6.6          160         9.20        14        6.2          115         6.00
   5         .75         280         4.40        15        2.7           45         1.10
   6         .7           80         1.50        16        3.5          250         8.00
   7        6.5          240        14.50        17        4.1           95         3.30
   8        4.5           53         1.90        18        8.1          160        12.10
   9         .60         100         1.00        19        7.0          260        15.50
  10        7.5          190        14.00        20        1.1           90         1.70

(a) Give an appropriate linear model for the data.
(b) Fit the model to the data and give the prediction equation.
(c) Find the value of s and interpret it.
(d) Find the value of $R_a^2$ and interpret it.
(e) Is the model statistically useful for the prediction of shipping cost y? Find the
value of the F statistic on the printout and give the observed significance level
(p-value) for the test.
(f) Find a 95% prediction interval for the cost of shipping a 5-pound package a
distance of 100 miles.

Solution
(a) Since we have no reason to expect that the relationship between y and $x_1$ and
$x_2$ would be first-order, we will allow for curvature in the response surface and
fit the complete second-order model
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2 + \varepsilon$
The mean value of the random error term ε is assumed to equal 0. Therefore,
the mean value of y is
$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2$

(b) The SAS printout for fitting the model to the n = 20 data points is shown in
Figure 4.18. The parameter estimates (highlighted on the printout) are:
$\hat{\beta}_0 = .82702$   $\hat{\beta}_1 = -.60914$   $\hat{\beta}_2 = .00402$
$\hat{\beta}_3 = .00733$   $\hat{\beta}_4 = .08975$   $\hat{\beta}_5 = .00001507$
Therefore, the prediction equation that relates the predicted shipping cost, $\hat{y}$,
to weight of package, $x_1$, and distance shipped, $x_2$, is
$\hat{y} = .82702 - .60914x_1 + .00402x_2 + .00733x_1 x_2 + .08975x_1^2 + .00001507x_2^2$

Figure 4.18 SAS multiple regression output for Example 4.8

(c) The value of s (shaded on the printout) is .44278. Since s estimates the standard
deviation σ of the random error term, our interpretation is that approximately
95% of the sampled shipping cost values fall within 2s = .886, or about 89¢, of
their respective predicted values.
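
To see how the prediction equation and s are used together, the sketch below evaluates the fitted equation for the 5-pound, 100-mile package mentioned in part (f) and forms a rough ±2s band around it. This band is only the informal approximation described in part (c); it is not the 95% prediction interval requested in part (f), which also accounts for uncertainty in the estimated coefficients and is read from the printout. The dictionary keys and function name are illustrative.

```python
# Estimates and s as reported on the SAS printout (Figure 4.18).
BETA = {
    "intercept": 0.82702,
    "x1": -0.60914,      # weight (lbs)
    "x2": 0.00402,       # distance (miles)
    "x1x2": 0.00733,     # interaction
    "x1sq": 0.08975,     # weight squared
    "x2sq": 0.00001507,  # distance squared
}
S = 0.44278  # estimated standard deviation of the random error


def predicted_cost(weight_lb, distance_mi):
    """Evaluate the complete second-order prediction equation (dollars)."""
    return (BETA["intercept"]
            + BETA["x1"] * weight_lb
            + BETA["x2"] * distance_mi
            + BETA["x1x2"] * weight_lb * distance_mi
            + BETA["x1sq"] * weight_lb ** 2
            + BETA["x2sq"] * distance_mi ** 2)


y_hat = predicted_cost(5, 100)
low, high = y_hat - 2 * S, y_hat + 2 * S
print(f"predicted cost = ${y_hat:.2f}; rough 2s band = (${low:.2f}, ${high:.2f})")
```

Both inputs lie inside the sampled ranges of Table 4.3 (weights from .3 to 8.1 pounds, distances from 45 to 280 miles), so the prediction does not extrapolate beyond the data.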