7.5 Extrapolation: Predicting Outside the Experimental Region

370 Chapter 7 Some Regression Pitfalls
Identifying the experimental region for a regression model that includes a number of independent variables may be more
difficult. For example, consider a model for GDP (y) using inflation rate (x1) and
prime interest rate (x2 ) as predictor variables. Suppose a sample of size n = 5 was
observed, and the values of x1 and x2 corresponding to the ﬁve values for GDP
were (1, 10), (1.25, 12), (2.25, 10.25), (2.5, 13), and (3, 11.5). Notice that x1 ranges
from 1% to 3% and x2 ranges from 10% to 13% in the sample data. You may think
that the experimental region is deﬁned by the ranges of the individual variables
(i.e., 1 ≤ x1 ≤ 3 and 10 ≤ x2 ≤ 13). However, the levels of x1 and x2 jointly deﬁne
the region. Figure 7.8 shows the experimental region for our hypothetical data. You
can see that an observation with levels x1 = 3 and x2 = 10 clearly falls outside the
experimental region, yet is within the ranges of the individual x-values. Using the
model to predict GDP for this observation—called hidden extrapolation—may lead
to unreliable results.
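The joint experimental region here is the convex hull of the observed (x1, x2) pairs, so hidden extrapolation can be detected by testing whether a new point falls inside that hull. A minimal pure-Python sketch using the five sample points from the text (the hull-construction and containment routines are standard computational-geometry helpers, not part of the original example):

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); its sign gives the turn direction."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain: hull vertices in counterclockwise order."""
    pts = sorted(points)
    def build(seq):
        h = []
        for p in seq:
            while len(h) >= 2 and cross(h[-2], h[-1], p) <= 0:
                h.pop()
            h.append(p)
        return h[:-1]
    return build(pts) + build(pts[::-1])

def in_experimental_region(hull, p, eps=1e-9):
    """True if p lies inside (or on the boundary of) the hull of the sample x's."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= -eps for i in range(n))

# (inflation rate x1, prime interest rate x2) pairs from the text
sample = [(1.0, 10.0), (1.25, 12.0), (2.25, 10.25), (2.5, 13.0), (3.0, 11.5)]
hull = convex_hull(sample)

# Within both individual ranges, yet outside the joint region: hidden extrapolation
print(in_experimental_region(hull, (3.0, 10.0)))   # False
print(in_experimental_region(hull, (2.0, 11.35)))  # True (near the sample centroid)
```

The same check scales to more than two predictors with a library hull routine; the point is that the region is defined jointly, not variable by variable.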

Figure 7.8 Experimental
region for modeling GDP
(y) as a function of
inﬂation rate (x1 ) and
prime interest rate (x2 )

Example 7.6

In Section 4.14 (p. 235), we presented a study of bid collusion in the Florida highway
construction industry. Recall that we modeled the cost of a road construction contract
(y) as a function of the Department of Transportation (DOT) engineer’s cost
estimate (x1 , in thousands of dollars) and bid status (x2 ), where x2 = 1 if the contract
was ﬁxed, 0 if competitive. Based on data collected for n = 235 contracts (and
saved in the FLAG ﬁle), the interaction model E(y) = β0 + β1 x1 + β2 x2 + β3 x1 x2
was found to be the best model for predicting contract cost. Find the experimental
region for the model, then give values of the independent variables that fall outside
this region.

Solution
For this regression analysis, the experimental region is deﬁned as the values of the
independent variables, DOT estimate (x1 ) and bid status (x2 ), that span the sample
data in the FLAG ﬁle. Since bid status is a qualitative variable at two levels, we can
ﬁnd the experimental region by examining descriptive statistics for DOT estimate
at each level of bid status. These descriptive statistics, produced using MINITAB,
are shown in Figure 7.9.
Figure 7.9 MINITAB
descriptive statistics for
independent variables,
Example 7.6


Examining Figure 7.9, you can see that when the bids are ﬁxed (x2 = 1), DOT
estimate (x1 ) ranges from a minimum of 66 thousand dollars to a maximum of
5,448 thousand dollars. In contrast, for competitive bids (x2 = 0), DOT estimate
(x1 ) ranges from 28 thousand dollars to 10,744 thousand dollars. These two ranges
deﬁne the experimental region for the analysis. Consequently, the DOT should
avoid making cost predictions for ﬁxed contracts that have DOT estimates outside
the interval ($66,000, $5,448,000) and for competitive contracts that have DOT
estimates outside the interval ($28,000, $10,744,000).
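Because bid status is qualitative, the joint region here reduces to a per-level range check on the DOT estimate. A small sketch encoding the ranges read from Figure 7.9 (the function and dictionary names are ours, not from the text):

```python
# Experimental region from Figure 7.9: bid status -> (min, max) DOT estimate,
# in thousands of dollars
REGION = {"fixed": (66, 5448), "competitive": (28, 10744)}

def in_experimental_region(dot_estimate, status):
    """True if a (DOT estimate, bid status) pair falls inside the sample region."""
    lo, hi = REGION[status]
    return lo <= dot_estimate <= hi

print(in_experimental_region(1000, "fixed"))        # True
print(in_experimental_region(6000, "fixed"))        # False: avoid predicting here
print(in_experimental_region(6000, "competitive"))  # True
```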

7.6 Variable Transformations
The word transform means to change the form of some object or thing. Consequently, the phrase variable transformation means that we have done, or plan to do,
something to change the form of the variable. For example, if one of the independent
variables in a model is the price p of a commodity, we might choose to introduce

this variable into the model as x = 1/p, x = √p, or x = e−p. Thus, if we were to let
x = √p, we would compute the square root of each price value, and these square
roots would be the values of x that would be used in the regression analysis.
Transformations are performed on the y-values to make them more nearly
satisfy the assumptions of Section 4.2 and, sometimes, to make the deterministic
portion of the model a better approximation to the mean value of the transformed
response. Transformations of the values of the independent variables are performed
solely for the latter reason—that is, to achieve a model that provides a better
approximation to E(y). In this section, we discuss transformations on the dependent
and independent variables to achieve a good approximation to E(y). (Transformations on the y-values for the purpose of satisfying the assumptions are discussed in
Chapter 8.)
Suppose you want to ﬁt a model relating the demand y for a product to its price
p. Also, suppose the product is a nonessential item, and you expect the mean demand
to decrease as price p increases and then to decrease more slowly as p gets larger
(see Figure 7.10). What function of p will provide a good approximation to E(y)?
To answer this question, you need to know the graphs of some elementary
mathematical functions—there is a one-to-one relationship between mathematical
functions and graphs. If we want to model a relationship similar to the one indicated
in Figure 7.10, we need to be able to select a mathematical function that will possess
a graph similar to the curve shown.

Figure 7.10 Hypothetical
relation between demand y
and price p

Portions of some curves corresponding to mathematical functions that decrease
as p increases are shown in Figure 7.11. Of the seven models shown, the curves in

Figure 7.11c, 7.11d, 7.11f, and 7.11g will probably provide the best approximations
to E(y). These four graphs all show E(y) decreasing and approaching (but never
reaching) 0 as p increases. Figures 7.11c and 7.11d suggest that the independent
variable, price, should be transformed using either x = 1/p or x = e−p . Then you
might try ﬁtting the model
E(y) = β0 + β1 x
using the transformed data. Or, as suggested by Figures 7.11f and 7.11g, you might
try the transformation x = ln(p) and ﬁt either of the models
E(y) = β0 + β1 x
or
E{ln(y)} = β0 + β1 x
The functions shown in Figure 7.11 produce curves that either rise or fall depending
on the sign of the parameter β1 in parts a, c, d, e, f, and g, and on β2 and the portion
of the curve used in part b. When you choose a model for a regression analysis, you
do not have to specify the sign of the parameter(s). The least squares procedure will
choose as estimates of the parameters those that minimize the sum of squares of
the residuals. Consequently, if you were to ﬁt the model shown in Figure 7.11c to
a set of y-values that increase in value as p increases, your least squares estimate
of β1 would be negative, and a graph of y would produce a curve similar to curve 2
in Figure 7.11c. If the y-values decrease as p increases, your estimate of β1 will be
positive and the curve will be similar to curve 1 in Figure 7.11c. All the curves in
Figure 7.11 shift upward or downward depending on the value of β0 .
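The point that least squares picks the sign of β1 automatically can be illustrated with the transformation x = 1/p from Figure 7.11c; the simulated demand values below are illustrative only:

```python
import numpy as np

p = np.linspace(1.0, 5.0, 40)
x = 1.0 / p  # transformation suggested by Figure 7.11c

def fitted_slope(y):
    """Least squares fit of E(y) = b0 + b1*x; returns b1-hat."""
    b1, b0 = np.polyfit(x, y, 1)
    return b1

y_falling = 2 + 3 * x   # y decreases as p increases (curve 1)
y_rising = 10 - 3 * x   # y increases as p increases (curve 2)

print(fitted_slope(y_falling) > 0)  # True: estimate of beta1 comes out positive
print(fitted_slope(y_rising) < 0)   # True: estimate of beta1 comes out negative
```

No sign needs to be specified in advance; the data determine it through the minimization.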

Figure 7.11 Graphs of
some mathematical
functions relating E(y) to p


Figure 7.11 (continued)

Example 7.7

Refer to the models and graphs in Figure 7.11. Consider a situation where there
is no a priori theory on the true relationship between demand (y) and price (p).
Consequently, you will ﬁt the models and compare them to determine the ‘‘best’’
model for E(y).
(a) Identify the models that are nested. How would you compare these models?
(b) Identify the non-nested models. How would you compare these models?

Solution
(a) Nested models, by deﬁnition, have the same form for the dependent variable
on the left-hand side of the equation. Also, for two nested models, the
‘‘complete’’ model has the same terms (independent variables) on the right-hand side of the equation as the ‘‘reduced’’ model, plus more. Thus, the only
two nested models in Figure 7.11 are Models (a) and (b).
Model (a): E(y) = β0 + β1p   (Reduced model)
Model (b): E(y) = β0 + β1p + β2p²   (Complete model)

These two models can be compared by testing H0: β2 = 0 using a partial F-test
(or a t-test).
(b) Any two of the remaining models shown in Figure 7.11 are non-nested. For
example, Models (c) and (d) are non-nested models. Some other non-nested
models are Models (a) and (c), Models (e) and (g), and Models (g) and (f).
The procedure for comparing non-nested models will depend on whether
or not the dependent variable on the left-hand side of the equation is the
same. For example, for Models (a), (c), (d), and (f), the dependent variable

is untransformed demand (y). Consequently, these models can be compared
by examining overall model statistics like the global F-test, adjusted-R², and
the estimated standard deviation s. Presuming the global F-test is significant, the model with the highest adjusted-R² and smallest value of s would be
deemed the ‘‘best’’ model.
Two non-nested models with different dependent variables on the left-hand side of the equation, like Models (a) and (e), can be compared using the
method outlined in optional Section 4.12 (p. 209).
Model (a): E(y) = β0 + β1p   (Untransformed y)
Model (e): E[ln(y)] = β0 + β1p   (Transformed y)

The key is to calculate a statistic like R² or adjusted-R² that can be compared
across models. For example, the R² value for untransformed Model (a) is
compared to the pseudo-R² value for the log-transformed Model (e), where
R²_ln(y) is based on the predicted values ŷ = exp{predicted ln(y)}, that is, the
antilogs of the predicted log-responses.
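A sketch of this comparison, using the demand–price data of Table 7.3 below: both models are fit by least squares, and the log model's pseudo-R² is computed on the original y scale from the back-transformed predictions, so the two values are directly comparable.

```python
import numpy as np

p = np.array([3.00, 3.10, 3.20, 3.30, 3.40, 3.50, 3.60, 3.70])
y = np.array([1120, 999, 932, 884, 807, 760, 701, 688], dtype=float)

def r_squared(y, yhat):
    """1 - SSE/SSyy, computed on the original y scale."""
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

# Model (a): E(y) = b0 + b1*p, scored directly on y
b1, b0 = np.polyfit(p, y, 1)
r2_a = r_squared(y, b0 + b1 * p)

# Model (e): E[ln(y)] = b0 + b1*p, scored on y via yhat = exp(predicted ln y)
c1, c0 = np.polyfit(p, np.log(y), 1)
r2_e = r_squared(y, np.exp(c0 + c1 * p))

print(round(r2_a, 3), round(r2_e, 3))  # comparable because both use the y scale
```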
Example 7.8

A supermarket chain conducted an experiment to investigate the effect of price p
on the weekly demand (in pounds) for a house brand of coffee. Eight supermarket
stores that had nearly equal past records of demand for the product were used
in the experiment. Eight prices were randomly assigned to the stores and were
advertised using the same procedures. The number of pounds of coffee sold during
the following week was recorded for each of the stores and is shown in Table 7.3.
COFFEE

Table 7.3 Data for Example 7.8

Demand          Price
y, pounds       p, dollars
1,120           3.00
999             3.10
932             3.20
884             3.30
807             3.40
760             3.50
701             3.60
688             3.70

(a) Previous research by the supermarket chain indicates that weekly demand (y)
decreases with price (p), but at a decreasing rate. This implies that model (d),
Figure 7.11, is appropriate for predicting demand. Fit the model
E(y) = β0 + β1 x
to the data, letting x = 1/p.
(b) Do the data provide sufﬁcient evidence to indicate that the model contributes
information for the prediction of demand?
(c) Find a 95% conﬁdence interval for the mean demand when the price is set at
\$3.20 per pound. Interpret this interval.


Solution
(a) The first step is to calculate x = 1/p for each data point. These values are
given in Table 7.4. The MINITAB printout∗ (Figure 7.12) gives

β̂0 = −1,180.5   and   β̂1 = 6,808.1

and

ŷ = −1,180.5 + 6,808.1x = −1,180.5 + 6,808.1(1/p)

Table 7.4 Values of transformed price

y         x = 1/p
1,120     .3333
999       .3226
932       .3125
884       .3030
807       .2941
760       .2857
701       .2778
688       .2703

Figure 7.12 MINITAB
regression printout for
Example 7.8

∗ MINITAB uses full decimal accuracy for x = 1/p. Hence, the results shown in Figure 7.12 differ from results that would be calculated using the four-decimal values for x = 1/p shown in the table.

(You can verify that the formulas of Section 3.3 give the same answers.)
A graph of this prediction equation is shown in Figure 7.13.
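The fit in part (a) can be reproduced without MINITAB; a sketch using full-precision x = 1/p (as the footnote to Figure 7.12 prescribes), including the t statistic used in part (b):

```python
import numpy as np

p = np.array([3.00, 3.10, 3.20, 3.30, 3.40, 3.50, 3.60, 3.70])
y = np.array([1120, 999, 932, 884, 807, 760, 701, 688], dtype=float)
x = 1.0 / p  # full decimal accuracy, per the MINITAB footnote

b1, b0 = np.polyfit(x, y, 1)                      # beta1-hat, beta0-hat
resid = y - (b0 + b1 * x)
s = np.sqrt(np.sum(resid ** 2) / (len(y) - 2))    # estimate of sigma
se_b1 = s / np.sqrt(np.sum((x - x.mean()) ** 2))  # standard error of beta1-hat
t = b1 / se_b1                                    # t statistic for H0: beta1 = 0

print(round(b0, 1), round(b1, 1))  # approximately -1180.5 and 6808.1
print(round(t, 1))                 # approximately 19.0
```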
(b) To determine whether x contributes information for the prediction of y, we
test H0: β1 = 0 against the alternative hypothesis Ha: β1 ≠ 0. The test statistic,
shaded in Figure 7.12, is t = 19.0. We wish to detect either β1 > 0 or β1 < 0,
thus we will use a two-tailed test. Since the two-tailed p-value shown on the
printout, .000, is less than α = .05, we reject H0: β1 = 0 and conclude that
x = 1/p contributes information for the prediction of demand y.
(c) For price p = 3.20, x = 1/p = .3125. The bottom of the MINITAB printout
gives a 95% conﬁdence interval for the mean demand E(y) when price is
p = \$3.20 (i.e., x = .3125). The interval (shaded) is (925.86, 968.24). Thus, we
are 95% conﬁdent that mean demand will fall between 926 and 968 pounds
when the price is set at \$3.20.

Figure 7.13 Graph of the
demand–price curve for
Example 7.8

This discussion is intended to emphasize the importance of data transformation and to explain its role in model building. Keep in mind that the symbols
x1 , x2 , . . . , xk that appear in the linear models of this text can be transformations
on the independent variables you have observed. These transformations, coupled
with the model-building methods of Chapter 5, allow you to use a great variety of
mathematical functions to model the mean E(y) for data.

Quick Summary

KEY FORMULAS

pth-order polynomial: levels of x ≥ (p + 1)

Standardized beta for xi: β̂i* = β̂i(s_xi /s_y)

Variance inflation factor for xi: VIFi = 1/(1 − Ri²), where Ri² is R² for the model
E(xi) = β0 + β1x1 + β2x2 + · · · + βi−1xi−1 + βi+1xi+1 + · · · + βkxk

KEY IDEAS

Establishing cause and effect
1. It is dangerous to deduce a cause-and-effect relationship with observational data.
2. Only with a properly designed experiment can you establish cause and effect.

Parameter estimability
Insufficient data for levels of either a quantitative or qualitative independent variable can result in inestimable regression parameters.

Multicollinearity
1. Occurs when two or more independent variables are correlated.
2. Indicators of multicollinearity:
(a) Highly correlated x's
(b) Significant global F-test, but all t-tests on individual β's are nonsignificant
(c) Signs of β̂'s opposite from what is expected
(d) VIF exceeding 10
3. Model modifications for solving multicollinearity:
(a) Drop one or more of the highly correlated x's
(b) Keep all x's in the model, but avoid making inferences on the β's
(c) Code quantitative x's to reduce correlation between x and x²
(d) Use ridge regression to estimate the β's

Extrapolation
1. Occurs when you predict y for values of the independent variables that are outside the experimental region.
2. Be wary of hidden extrapolation (where values of the x's fall within the range of each individual x, but fall outside the experimental region defined jointly by the x's).

Variable transformations
Transforming y and/or the x's in a model can provide a better model fit.
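The VIF formula in the summary can be sketched directly from its definition: regress each x on the remaining x's and invert 1 − R². The data below are synthetic and the function name is ours:

```python
import numpy as np

def vif(X):
    """VIF_i = 1/(1 - R_i^2), where R_i^2 comes from regressing x_i on the other x's."""
    n, k = X.shape
    out = []
    for i in range(k):
        xi = X[:, i]
        A = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])  # intercept + others
        xhat = A @ np.linalg.lstsq(A, xi, rcond=None)[0]
        r2 = 1 - np.sum((xi - xhat) ** 2) / np.sum((xi - xi.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)              # unrelated predictor
v = vif(np.column_stack([x1, x2, x3]))
print([round(t, 1) for t in v])  # first two exceed the rule-of-thumb cutoff of 10
```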

Supplementary Exercises
7.1. Extrapolation. Why is it dangerous to predict y for
values of independent variables that fall outside the
experimental region?
7.2. Multicollinearity.
(a) Discuss the problems that result when multicollinearity is present in a regression analysis.
(b) How can you detect multicollinearity?
(c) What remedial measures are available when
multicollinearity is detected?
7.3. Data transformations. Refer to Example 7.8. Can
you think of any other transformations on price that
might provide a good ﬁt to the data? Try them and
answer the questions in Example 7.8 again.
7.4. Women in top management. The Journal of Organizational Culture, Communications and Conﬂict
(July 2007) published a study on women in upper-management positions at U.S. firms. Observational
data (n = 252 months) were collected for several
variables in an attempt to model the number of
females in managerial positions (y). The independent variables included the number of females
with a college degree (x1 ), the number of female
high school graduates with no college degree (x2 ),
the number of males in managerial positions
(x3 ), the number of males with a college degree
(x4 ), and the number of male high school graduates
with no college degree (x5 ).
(a) The correlation relating number of females in
managerial positions and number of females

with a college degree was determined to be
r = .983. Can the researchers conclude that
an increase in the number of females with
a college degree will cause the number of
females in managerial positions to increase?
Explain.
(b) The correlation relating number of males in
managerial positions and number of males with
a college degree was determined to be r = .722.
What potential problem can occur in the regression analysis? Explain.
7.5. Urban/rural ratings of counties. Refer to the Professional Geographer (February 2000) study of
urban and rural counties in the western United
States, Exercise 4.16 (p. 190). Recall that six independent variables—total county population (x1 ),
population density (x2 ), population concentration
(x3 ), population growth (x4 ), proportion of county
land in farms (x5 ), and 5-year change in agricultural
land base (x6 )—were used to model the urban/rural
rating (y) of a county. Prior to running the multiple
regression analysis, the researchers were concerned
about possible multicollinearity in the data. The
correlation matrix (shown on the next page) is a
table of correlations between all pairs of the independent variables.
(a) Based on the correlation matrix, is there any
evidence of extreme multicollinearity?
(b) Refer to the multiple regression results in the
table given in Exercise 4.16 (p.190). Based on

the reported tests, is there any evidence of
extreme multicollinearity?
INDEPENDENT VARIABLE            x1     x2     x3     x4     x5
x1 Total population
x2 Population density           .20
x3 Population concentration     .45    .43
x4 Population growth           −.05   −.14   −.01
x5 Farm land                   −.16   −.15   −.07   −.20
x6 Agricultural change         −.12   −.12   −.22   −.06   −.06

Source: Berry, K. A., et al. ‘‘Interpreting what is rural
and urban for western U.S. counties,’’ Professional Geographer, Vol. 52, No. 1, Feb. 2000 (Table 2).
PONDICE
7.6. Characteristics of sea ice melt ponds. Surface
albedo is deﬁned as the ratio of solar energy
directed upward from a surface over energy incident upon the surface. Surface albedo is a critical
climatological parameter of sea ice. The National
Snow and Ice Data Center (NSIDC) collects data
on the albedo, depth, and physical characteristics
of ice melt ponds in the Canadian Arctic, including ice type (classiﬁed as ﬁrst-year ice, multiyear
ice, or landfast ice). Data for 504 ice melt ponds
located in the Barrow Strait in the Canadian Arctic
are saved in the PONDICE ﬁle. Environmental
engineers want to model the broadband surface
albedo level, y, of the ice as a function of pond
depth, x1 (meters), and ice type, represented by the
dummy variables x2 = {1 if ﬁrst-year ice, 0 if not}
and x3 = {1 if multiyear ice, 0 if not}. Ultimately,
the engineers will use the model to predict the
surface albedo level of an ice melt pond. Access the
data in the PONDICE ﬁle and identify the experimental region for the engineers. What advice do you
give them about the use of the prediction equation?
7.7. Personality and aggressive behavior. Psychological Bulletin (Vol. 132, 2006) reported on a study
linking personality and aggressive behavior. Four of
the variables measured in the study were aggressive
behavior, irritability, trait anger, and narcissism.
Pairwise correlations for these four variables are
given below.
Aggressive behavior–Irritability: .77
Aggressive behavior–Trait anger: .48
Aggressive behavior–Narcissism: .50
Irritability–Trait anger: .57
Irritability–Narcissism: .16
Trait anger–Narcissism: .13
(a) Suppose aggressive behavior is the dependent variable in a regression model and the

other variables are independent variables. Is
there evidence of extreme multicollinearity?
Explain.
(b) Suppose narcissism is the dependent variable
in a regression model and the other variables
are independent variables. Is there evidence of
extreme multicollinearity? Explain.
7.8. Steam processing of peat. A bioengineer wants to
model the amount (y) of carbohydrate solubilized
during steam processing of peat as a function of
temperature (x1 ), exposure time (x2 ), and pH value
(x3 ). Data collected for each of 15 peat samples
were used to ﬁt the model
E(y) = β0 + β1 x1 + β2 x2 + β3 x3
A summary of the regression results follows:
ŷ = −3,000 + 3.2x1 − .4x2 − 1.1x3
s_β̂1 = 2.4   s_β̂2 = .6   s_β̂3 = .8
r12 = .92   r13 = .87   r23 = .81   R² = .93

Based on these results, the bioengineer concludes
that none of the three independent variables, x1 , x2 ,
and x3 , is a useful predictor of carbohydrate amount,
y. Do you agree with this statement? Explain.
7.9. Salaries of top university researchers. The provost
of a top research university wants to know what
salaries should be paid to the college’s top
researchers, based on years of experience. An
independent consultant has proposed the quadratic
model
E(y) = β0 + β1x + β2x²
where
y = Annual salary (thousands of dollars)
x = Years of experience
To ﬁt the model, the consultant randomly sampled
three researchers at other research universities and
recorded the information given in the accompanying table. Give your opinion regarding the adequacy
of the proposed model.

                y     x
Researcher 1    60    2
Researcher 2    45    1
Researcher 3    82    5

7.10. FDA investigation of a meat-processing plant. A
particular meat-processing plant slaughters steers
and cuts and wraps the beef for its customers. Suppose a complaint has been ﬁled with the Food and


Drug Administration (FDA) against the processing
plant. The complaint alleges that the consumer
does not get all the beef from the steer he purchases. In particular, one consumer purchased a
300-pound steer but received only 150 pounds of
cut and wrapped beef. To settle the complaint, the
FDA collected data on the live weights and dressed
weights of nine steers processed by a reputable
meat-processing plant (not the ﬁrm in question).
The results are listed in the table.
STEERS

LIVE WEIGHT     DRESSED WEIGHT
x, pounds       y, pounds
420             280
380             250
480             310
340             210
450             290
460             280
430             270
370             240
390             250

(a) Fit the model E(y) = β0 + β1 x to the data.
(b) Construct a 95% prediction interval for the
dressed weight y of a 300-pound steer.
(c) Would you recommend that the FDA use
the interval obtained in part b to determine
whether the dressed weight of 150 pounds is a
reasonable amount to receive from a 300-pound
steer? Explain.
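One way to attack parts (a) and (b) is a straight-line fit plus the usual prediction-interval formula; this is a sketch (the t critical value with 7 df is hardcoded rather than looked up from a library). Note for part (c) that x0 = 300 lies below the sampled live weights (340–480), so the interval itself involves extrapolation, the very pitfall this chapter warns about:

```python
import numpy as np

x = np.array([420, 380, 480, 340, 450, 460, 430, 370, 390], dtype=float)
y = np.array([280, 250, 310, 210, 290, 280, 270, 240, 250], dtype=float)
n = len(x)

b1, b0 = np.polyfit(x, y, 1)                              # part (a)
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))   # estimate of sigma
sxx = np.sum((x - x.mean()) ** 2)

x0 = 300.0            # part (b): a 300-pound steer (note: outside 340-480)
yhat = b0 + b1 * x0
t_crit = 2.365        # t_.025 with n - 2 = 7 df
half = t_crit * s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / sxx)
lo, hi = yhat - half, yhat + half

print(round(yhat, 1), (round(lo, 1), round(hi, 1)))
print(lo <= 150 <= hi)  # False: 150 pounds falls below the interval
```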
7.11. FTC cigarette study. Refer to the FTC cigarette
data of Example 7.5 (p. 365). The data are saved in
the FTCCIGAR ﬁle.
(a) Fit the model E(y) = β0 + β1 x1 to the data. Is
there evidence that tar content x1 is useful for
predicting carbon monoxide content y?
(b) Fit the model E(y) = β0 + β2 x2 to the data. Is
there evidence that nicotine content x2 is useful
for predicting carbon monoxide content y?
(c) Fit the model E(y) = β0 + β3 x3 to the data.
Is there evidence that weight x3 is useful for
predicting carbon monoxide content y?
(d) Compare the signs of β̂1, β̂2, and β̂3 in the models of parts a, b, and c, respectively, to the signs of the β̂'s in the multiple regression model fit in Example 7.5. Is the fact that the β̂'s change dramatically when the independent variables are removed from the model an indication of a serious multicollinearity problem?
7.12. Demand for car motor fuel. An economist
wants to model annual per capita demand, y, for


passenger car motor fuel in the United States
as a function of the two quantitative independent variables, average real weekly earnings (x1 )
and average price of regular gasoline (x2 ). Data
on these three variables for the years 1985–2008
are available in the 2009 Statistical Abstract of
the United States. Suppose the economist ﬁts the
model E(y) = β0 + β1 x1 + β2 x2 to the data. Would
you recommend that the economist use the least
squares prediction equation to predict per capita
consumption of motor fuel in 2011? Explain.
7.13. Accuracy of software effort estimates. Refer to
the Journal of Empirical Software Engineering
(Vol. 9, 2004) study of software engineers’ effort
in developing new software, Exercise 6.2 (p. 339).
Recall that the researcher modeled the relative
error in estimating effort (y) as a function of two
qualitative independent variables: company role of
estimator (x1 = 1 if developer, 0 if project leader)
and previous accuracy (x8 = 1 if more than 20%
accurate, 0 if less than 20% accurate). A stepwise regression yielded the following prediction
equation:
yˆ = .12 − .28x1 + .27x8
(a) The researcher is concerned that the sign of
βˆ1 in the model is the opposite from what is
expected. (The researcher expects a project
leader to have a smaller relative error of estimation than a developer.) Give at least one
reason why this phenomenon occurred.
(b) Now, consider the interaction model E(y) =
β0 + β1 x1 + β2 x8 + β3 x1 x8 . Suppose that there
is no data collected for project leaders with
less than 20% accuracy. Are all the β’s in the
interaction model estimable? Explain.
7.14. Yield strength of steel alloy. Refer to Exercise 6.4
(p. 340) and the Modelling and Simulation in Materials Science and Engineering (Vol. 13, 2005) study
in which engineers built a regression model for the
tensile yield strength (y) of a new steel alloy. The
potential important predictors of yield strength are
listed below. The engineers discovered that the independent variable Nickel (x4 ) was highly correlated
with each of the other 10 potential independent
variables. Consequently, Nickel was dropped from
the model. Do you agree with this decision?
Explain.
x1 = Carbon amount (% weight)
x2 = Manganese amount (% weight)
x3 = Chromium amount (% weight)
x4 = Nickel amount (% weight)
x5 = Molybdenum amount (% weight)
x6 = Copper amount (% weight)

x7 = Nitrogen amount (% weight)
x8 = Vanadium amount (% weight)
x9 = Plate thickness (millimeters)
x10 = Solution treating (milliliters)
x11 = Ageing temperature (degrees, Celsius)
FLAG2
7.15. Collusive bidding in road construction. Refer to
the Florida Attorney General (FLAG) Ofﬁce’s
investigation of bid-rigging in the road construction industry, Exercise 6.8 (p. 341). Recall that
FLAG wants to model the price (y) of the contract bid by lowest bidder in hopes of preventing
price-ﬁxing in the future. Consider the independent
variables selected by the stepwise regression run in
Exercise 6.8. Do you detect any multicollinearity
in these variables? If so, do you recommend that
all of these variables be used to predict low-bid
price, y?
7.16. Fitting a quadratic model. How many levels of x are required to fit the model E(y) = β0 + β1x + β2x²? How large a sample size is required to have sufficient degrees of freedom for estimating σ²?
7.17. Fitting an interaction model. How many levels of x1 and x2 are required to fit the model E(y) = β0 + β1x1 + β2x2 + β3x1x2? How large a sample size is required to have sufficient degrees of freedom for estimating σ²?

7.18. Fitting a complete second-order model. How many levels of x1 and x2 are required to fit the model E(y) = β0 + β1x1 + β2x2 + β3x1x2 + β4x1² + β5x2²? How large a sample is required to have sufficient degrees of freedom for estimating σ²?
GASTURBINE
7.19. Cooling method for gas turbines. Refer to the
Journal of Engineering for Gas Turbines and
Power (January 2005) study of a high-pressure
inlet fogging method for a gas turbine engine,
Exercise 6.10 (p. 343). Recall that a number of
independent variables were used to predict the
heat rate (kilojoules per kilowatt per hour) for
each in a sample of 67 gas turbines augmented
with high-pressure inlet fogging. For this exercise,
consider a ﬁrst-order model for heat rate as a
function of the quantitative independent variables:
cycle speed (revolutions per minute), cycle pressure ratio, inlet temperature (◦ C), exhaust gas
temperature (◦ C), air mass ﬂow rate (kilograms
per second), and horsepower (Hp units). Theoretically, the heat rate should increase as cycle speed
increases. In contrast, theory states that the heat
rate will decrease as any of the other independent
variables increase. The model was ﬁt to the data in
the GASTURBINE ﬁle with the results shown
in the accompanying MINITAB printout. Do you
detect any signs of multicollinearity? If so, how
should the model be modiﬁed?

MINITAB Output for Exercise 7.19