# 12.11 Cross Validation, Cp, and Other Criteria for Model Selection

Chapter 12 Multiple Linear Regression and Certain Nonlinear Regression Models
to allow the use of separate methodologies for fitting and assessment of a specific
model. For assessment of a model, the "−i" indicates that the PRESS residual
gives a prediction error where the observation being predicted is independent of
the model fit.
Criteria that make use of the PRESS residuals are given by

$$\sum_{i=1}^{n} |\delta_i| \quad \text{and} \quad \text{PRESS} = \sum_{i=1}^{n} \delta_i^2.$$

(The term PRESS is an acronym for prediction sum of squares.) We suggest
that both of these criteria be used. It is possible for PRESS to be dominated by
one or only a few large PRESS residuals. Clearly, the criterion $\sum_{i=1}^{n} |\delta_i|$ is less
sensitive to a small number of large values.
In addition to the PRESS statistic itself, the analyst can simply compute an
$R^2$-like statistic reflecting prediction performance. The statistic is often called
$R^2_{\text{pred}}$ and is given as follows:

**$R^2$ of Prediction:** Given a fitted model with a specific value for PRESS, $R^2_{\text{pred}}$ is given by

$$R^2_{\text{pred}} = 1 - \frac{\text{PRESS}}{\sum_{i=1}^{n}(y_i - \bar{y})^2}.$$

Note that $R^2_{\text{pred}}$ is merely the ordinary $R^2$ statistic with SSE replaced by the
PRESS statistic.
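These quantities need not be computed by refitting the model n times: for least squares, the PRESS residual satisfies the identity δi = ei/(1 − hii), where hii is the ith HAT diagonal. A minimal sketch (numpy assumed; the data here are synthetic, not from the text) that checks the shortcut against explicit leave-one-out refits:

```python
import numpy as np

rng = np.random.default_rng(7)

# Small synthetic data set: y depends linearly on two regressors plus noise.
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

# Fit by ordinary least squares and form the hat matrix H = X (X'X)^{-1} X'.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
H = X @ np.linalg.solve(X.T @ X, X.T)
e = y - X @ b                       # ordinary residuals
delta = e / (1.0 - np.diag(H))      # PRESS residuals, delta_i = e_i / (1 - h_ii)

press = np.sum(delta**2)
r2_pred = 1.0 - press / np.sum((y - y.mean())**2)

# Brute-force check: refit n times, each time leaving out observation i.
press_loo = 0.0
for i in range(n):
    keep = np.arange(n) != i
    bi, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    press_loo += (y[i] - X[i] @ bi)**2
```

The two computations agree exactly (up to floating-point error), which is why software reports PRESS from a single fit.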
In the following case study, an illustration is provided in which many candidate
models are fit to a set of data and the best model is chosen. The sequential
procedures described in Section 12.9 are not used. Rather, the role of the PRESS
residuals and other statistical values in selecting the best regression equation is
illustrated.

Case Study 12.2: Football Punting: Leg strength is a necessary characteristic of a successful punter
in American football. One measure of the quality of a good punt is the "hang time."
This is the time that the ball hangs in the air before being caught by the punt
returner. To determine what leg strength factors influence hang time and to develop an empirical model for predicting this response, a study on The Relationship
Between Selected Physical Performance Variables and Football Punting Ability was
conducted by the Department of Health, Physical Education, and Recreation at
Virginia Tech. Thirteen punters were chosen for the experiment, and each punted
a football 10 times. The average hang times, along with the strength measures
used in the analysis, were recorded in Table 12.12.

Each regressor variable is defined as follows:

1. RLS, right leg strength (pounds)
2. LLS, left leg strength (pounds)
3. RHF, right hamstring muscle flexibility (degrees)
4. LHF, left hamstring muscle flexibility (degrees)


5. Power, overall leg strength (foot-pounds)
Determine the most appropriate model for predicting hang time.
Table 12.12: Data for Case Study 12.2

| Punter | Hang Time, y (sec) | RLS, x1 | LLS, x2 | RHF, x3 | LHF, x4 | Power, x5 |
|---|---|---|---|---|---|---|
| 1 | 4.75 | 170 | 170 | 106 | 106 | 240.57 |
| 2 | 4.07 | 140 | 130 | 92 | 93 | 195.49 |
| 3 | 4.04 | 180 | 170 | 93 | 78 | 152.99 |
| 4 | 4.18 | 160 | 160 | 103 | 93 | 197.09 |
| 5 | 4.35 | 170 | 150 | 104 | 93 | 266.56 |
| 6 | 4.16 | 150 | 150 | 101 | 87 | 260.56 |
| 7 | 4.43 | 170 | 180 | 108 | 106 | 219.25 |
| 8 | 3.20 | 110 | 110 | 86 | 92 | 132.68 |
| 9 | 3.02 | 120 | 110 | 90 | 86 | 130.24 |
| 10 | 3.64 | 130 | 120 | 85 | 80 | 205.88 |
| 11 | 3.68 | 120 | 140 | 89 | 83 | 153.92 |
| 12 | 3.60 | 140 | 130 | 92 | 94 | 154.64 |
| 13 | 3.85 | 160 | 150 | 95 | 95 | 240.57 |

Solution: In the search for the best of the candidate models for predicting hang time, the
information in Table 12.13 was obtained from a regression computer package. The
models are ranked in ascending order of the values of the PRESS statistic. This
display provides enough information on all possible models to enable the user to
eliminate from consideration all but a few models. The model containing x2 and
x5 (LLS and Power), denoted by x2x5, appears to be superior for predicting punter
hang time. Also note that all models with low PRESS, low $s^2$, low $\sum_{i=1}^{n}|\delta_i|$, and
high $R^2$-values contain these two variables.

In order to gain some insight from the residuals of the fitted regression

$$\hat{y}_i = b_0 + b_2 x_{2i} + b_5 x_{5i},$$

the residuals and PRESS residuals were generated. The actual prediction model
(see Exercise 12.47 on page 494) is given by

$$\hat{y} = 1.10765 + 0.01370 x_2 + 0.00429 x_5.$$

Residuals, HAT diagonal values, and PRESS values are listed in Table 12.14.
Note the relatively good fit of the two-variable regression model to the data.
The PRESS residuals reflect the capability of the regression equation to predict
hang time if independent predictions were to be made. For example, for punter
number 4, the hang time of 4.180 would encounter a prediction error of 0.039 if the
model constructed by using the remaining 12 punters were used. For this model,
the average prediction error or cross-validation error is

$$\frac{1}{13}\sum_{i=1}^{13} |\delta_i| = 0.1489 \text{ second},$$


Table 12.13: Comparing Different Regression Models

| Model | s² | Σ\|δi\| | PRESS | R² |
|---|---|---|---|---|
| x2x5 | 0.036907 | 1.93583 | 0.54683 | 0.871300 |
| x1x2x5 | 0.041001 | 2.06489 | 0.58998 | 0.871321 |
| x2x4x5 | 0.037708 | 2.18797 | 0.59915 | 0.881658 |
| x2x3x5 | 0.039636 | 2.09553 | 0.66182 | 0.875606 |
| x1x2x4x5 | 0.042265 | 2.42194 | 0.67840 | 0.882093 |
| x1x2x3x5 | 0.044578 | 2.26283 | 0.70958 | 0.875642 |
| x2x3x4x5 | 0.042421 | 2.55789 | 0.86236 | 0.881658 |
| x1x3x5 | 0.053664 | 2.65276 | 0.87325 | 0.831580 |
| x1x4x5 | 0.056279 | 2.75390 | 0.89551 | 0.823375 |
| x1x5 | 0.059621 | 2.99434 | 0.97483 | 0.792094 |
| x2x3 | 0.056153 | 2.95310 | 0.98815 | 0.804187 |
| x1x3 | 0.059400 | 3.01436 | 0.99697 | 0.792864 |
| x1x2x3x4x5 | 0.048302 | 2.87302 | 1.00920 | 0.882096 |
| x2 | 0.066894 | 3.22319 | 1.04564 | 0.743404 |
| x3x5 | 0.065678 | 3.09474 | 1.05708 | 0.770971 |
| x1x2 | 0.068402 | 3.09047 | 1.09726 | 0.761474 |
| x3 | 0.074518 | 3.06754 | 1.13555 | 0.714161 |
| x1x3x4 | 0.065414 | 3.36304 | 1.15043 | 0.794705 |
| x2x3x4 | 0.062082 | 3.32392 | 1.17491 | 0.805163 |
| x2x4 | 0.063744 | 3.59101 | 1.18531 | 0.777716 |
| x1x2x3 | 0.059670 | 3.41287 | 1.26558 | 0.812730 |
| x3x4 | 0.080605 | 3.28004 | 1.28314 | 0.718921 |
| x1x4 | 0.069965 | 3.64415 | 1.30194 | 0.756023 |
| x1 | 0.080208 | 3.31562 | 1.30275 | 0.692334 |
| x1x3x4x5 | 0.059169 | 3.37362 | 1.36867 | 0.834936 |
| x1x2x4 | 0.064143 | 3.89402 | 1.39834 | 0.798692 |
| x3x4x5 | 0.072505 | 3.49695 | 1.42036 | 0.772450 |
| x1x2x3x4 | 0.066088 | 3.95854 | 1.52344 | 0.815633 |
| x5 | 0.111779 | 4.17839 | 1.72511 | 0.571234 |
| x4x5 | 0.105648 | 4.12729 | 1.87734 | 0.631593 |
| x4 | 0.186708 | 4.88870 | 2.82207 | 0.283819 |

which is small compared to the average hang time for the 13 punters.
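The two-variable fit can be reproduced directly from the columns of Table 12.12. A short sketch (numpy assumed; values transcribed from the table), which should recover the prediction equation ŷ = 1.10765 + 0.01370x2 + 0.00429x5 quoted above, up to the rounding of the printed coefficients:

```python
import numpy as np

# Punting data from Table 12.12: LLS (x2), Power (x5), and hang time y.
x2 = np.array([170, 130, 170, 160, 150, 150, 180,
               110, 110, 120, 140, 130, 150], dtype=float)
x5 = np.array([240.57, 195.49, 152.99, 197.09, 266.56, 260.56, 219.25,
               132.68, 130.24, 205.88, 153.92, 154.64, 240.57])
y = np.array([4.75, 4.07, 4.04, 4.18, 4.35, 4.16, 4.43,
              3.20, 3.02, 3.64, 3.68, 3.60, 3.85])

# Least squares fit of y-hat = b0 + b2*x2 + b5*x5.
X = np.column_stack([np.ones_like(y), x2, x5])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(b, 5))
```

With an intercept in the model, the ordinary residuals of this fit sum to zero; the PRESS residuals of Table 12.14 do not, which is exactly what makes them useful for assessment.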
We indicated in Section 12.9 that the use of all possible subset regressions is
often advisable when searching for the best model. Most commercial statistics
software packages contain an all possible regressions routine. These algorithms
compute various criteria for all subsets of model terms. Obviously, criteria such as
$R^2$, $s^2$, and PRESS are reasonable for choosing among candidate subsets. Another
very popular and useful statistic, particularly for areas in the physical sciences and
engineering, is the $C_p$ statistic, described below.
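An all possible regressions pass like the one behind Table 12.13 is straightforward to sketch: enumerate every nonempty subset of the candidate regressors and record the criteria for each. A hedged example (numpy assumed; the data here are synthetic, not the punting data):

```python
from itertools import combinations

import numpy as np

def all_subsets_criteria(X, y):
    """Compute s^2, R^2, PRESS, and sum |delta_i| for every nonempty subset.

    X: (n, k) matrix of candidate regressors (no intercept column);
    an intercept is added to every candidate model.
    """
    n, k = X.shape
    syy = np.sum((y - y.mean())**2)
    out = {}
    for size in range(1, k + 1):
        for cols in combinations(range(k), size):
            Xs = np.column_stack([np.ones(n), X[:, cols]])
            b, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            e = y - Xs @ b
            # HAT diagonal of the candidate model.
            h = np.sum(Xs * np.linalg.solve(Xs.T @ Xs, Xs.T).T, axis=1)
            delta = e / (1 - h)                   # PRESS residuals
            p = size + 1                          # parameters incl. intercept
            out[cols] = {"s2": e @ e / (n - p),
                         "R2": 1 - e @ e / syy,
                         "PRESS": delta @ delta,
                         "sum_abs_press": np.abs(delta).sum()}
    return out

rng = np.random.default_rng(1)
X = rng.normal(size=(13, 5))
y = 2 + X[:, 1] - X[:, 4] + rng.normal(scale=0.2, size=13)
crit = all_subsets_criteria(X, y)
print(len(crit))   # 2^5 - 1 = 31 candidate models, as in Table 12.13
```

With five candidate regressors there are 31 models, matching the 31 rows of Table 12.13.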


Table 12.14: PRESS Residuals

| Punter | ŷi | yi | ei = yi − ŷi | hii | δi |
|---|---|---|---|---|---|
| 1 | 4.470 | 4.750 | 0.280 | 0.198 | 0.349 |
| 2 | 3.728 | 4.070 | 0.342 | 0.118 | 0.388 |
| 3 | 4.094 | 4.040 | −0.054 | 0.444 | −0.097 |
| 4 | 4.146 | 4.180 | 0.034 | 0.132 | 0.039 |
| 5 | 4.307 | 4.350 | 0.043 | 0.286 | 0.060 |
| 6 | 4.281 | 4.160 | −0.121 | 0.250 | −0.161 |
| 7 | 4.515 | 4.430 | −0.085 | 0.298 | −0.121 |
| 8 | 3.184 | 3.200 | 0.016 | 0.294 | 0.023 |
| 9 | 3.174 | 3.020 | −0.154 | 0.301 | −0.220 |
| 10 | 3.636 | 3.640 | 0.004 | 0.231 | 0.005 |
| 11 | 3.687 | 3.680 | −0.007 | 0.152 | −0.008 |
| 12 | 3.553 | 3.600 | 0.047 | 0.142 | 0.055 |
| 13 | 4.196 | 3.850 | −0.346 | 0.154 | −0.409 |
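The columns of Table 12.14 can be checked against one another: each PRESS residual is the ordinary residual inflated by 1/(1 − hii), and averaging |δi| reproduces (to table rounding) the 0.1489-second cross-validation error quoted earlier. A plain-Python check, with the values transcribed from the table:

```python
# Check Table 12.14: the PRESS residual is delta_i = e_i / (1 - h_ii),
# and the cross-validation error is the average |delta_i| over the 13 punters.
e = [0.280, 0.342, -0.054, 0.034, 0.043, -0.121, -0.085,
     0.016, -0.154, 0.004, -0.007, 0.047, -0.346]
h = [0.198, 0.118, 0.444, 0.132, 0.286, 0.250, 0.298,
     0.294, 0.301, 0.231, 0.152, 0.142, 0.154]
d = [0.349, 0.388, -0.097, 0.039, 0.060, -0.161, -0.121,
     0.023, -0.220, 0.005, -0.008, 0.055, -0.409]

for ei, hi, di in zip(e, h, d):
    assert abs(ei / (1 - hi) - di) < 0.002   # identity holds to table rounding

cv_error = sum(abs(di) for di in d) / 13
print(cv_error)   # about 0.149 second, consistent with the 0.1489 in the text
```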

The Cp Statistic

Quite often, the choice of the most appropriate model involves many considerations.
Obviously, the number of model terms is important; the matter of parsimony is
a consideration that cannot be ignored. On the other hand, the analyst cannot
be pleased with a model that is too simple, to the point where there is serious
underspecification. A single statistic that represents a nice compromise in this
regard is the Cp statistic. (See Mallows, 1973, in the Bibliography.)

The Cp statistic appeals nicely to common sense and is developed from considerations of the proper compromise between excessive bias incurred when one
underfits (chooses too few model terms) and excessive prediction variance produced when one overfits (has redundancies in the model). The Cp statistic is a
simple function of the total number of parameters in the candidate model and the
mean square error $s^2$.
We will not present the entire development of the Cp statistic. (For details, the
reader is referred to Myers, 1990, in the Bibliography.) The Cp for a particular
subset model is an estimate of the following:

$$\Gamma(p) = \frac{1}{\sigma^2}\sum_{i=1}^{n} \text{Var}(\hat{y}_i) + \frac{1}{\sigma^2}\sum_{i=1}^{n} (\text{Bias } \hat{y}_i)^2.$$

It turns out that under the standard least squares assumptions indicated earlier
in this chapter, and assuming that the "true" model is the model containing all
candidate variables,

$$\frac{1}{\sigma^2}\sum_{i=1}^{n} \text{Var}(\hat{y}_i) = p \quad (\text{number of parameters in the candidate model})$$

(see Review Exercise 12.63) and an unbiased estimate of $\frac{1}{\sigma^2}\sum_{i=1}^{n} (\text{Bias } \hat{y}_i)^2$ is given by

$$\frac{1}{\sigma^2}\sum_{i=1}^{n} (\text{Bias } \hat{y}_i)^2 = \frac{(s^2 - \sigma^2)(n - p)}{\sigma^2}.$$

In the above, $s^2$ is the mean square error for the candidate model and $\sigma^2$ is the
population error variance. Thus, if we assume that some estimate $\hat{\sigma}^2$ is available
for $\sigma^2$, Cp is given by the following equation:

**Cp Statistic:**

$$C_p = p + \frac{(s^2 - \hat{\sigma}^2)(n - p)}{\hat{\sigma}^2},$$

where p is the number of model parameters, $s^2$ is the mean square error for the
candidate model, and $\hat{\sigma}^2$ is an estimate of $\sigma^2$.
Obviously, the scientist should adopt models with small values of Cp. The
reader should note that, unlike the PRESS statistic, Cp is scale-free. In addition,
one can gain some insight concerning the adequacy of a candidate model by observing its value of Cp. For example, Cp > p indicates a model that is biased due
to being an underfitted model, whereas Cp ≈ p indicates a reasonable model.

There is often confusion concerning where $\hat{\sigma}^2$ comes from in the formula for Cp.
Obviously, the scientist or engineer does not have access to the population quantity
$\sigma^2$. In applications where replicated runs are available, say in an experimental
design situation, a model-independent estimate of $\sigma^2$ is available (see Chapters 11
and 15). However, most software packages use for $\hat{\sigma}^2$ the mean square error from
the most complete model. Obviously, if this is not a good estimate, the bias portion
of the Cp statistic can be negative. Thus, Cp can be less than p.
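The formula is a one-liner. A sketch (plain Python; the numbers are taken from Example 12.12 below, where σ̂² = 26.2073 is the full-model mean square error and n = 15):

```python
def cp(p, s2, sigma2_hat, n):
    """Mallows' Cp = p + (s^2 - sigma2_hat) * (n - p) / sigma2_hat."""
    return p + (s2 - sigma2_hat) * (n - p) / sigma2_hat

sigma2_hat = 26.2073                                       # MSE of the full model
print(round(cp(p=4, s2=24.7956, sigma2_hat=sigma2_hat, n=15), 4))  # model x1 x2 x3
print(round(cp(p=3, s2=44.5552, sigma2_hat=sigma2_hat, n=15), 4))  # model x2 x3
print(cp(p=5, s2=sigma2_hat, sigma2_hat=sigma2_hat, n=15))         # full model: Cp = p
```

Note that the full model always has Cp = p exactly, since its own mean square error is used as σ̂² and the bias term vanishes.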
Example 12.12: Consider the data set in Table 12.15, in which a maker of asphalt shingles is
interested in the relationship between sales for a particular year and factors that
influence sales. (The data were taken from Kutner et al., 2004, in the Bibliography.)
Of the possible subset models, three are of particular interest. These three are
x2x3, x1x2x3, and x1x2x3x4. The following represents pertinent information for
comparing the three models. We include the PRESS statistics for the three models
to supplement the decision making.

| Model | R² | R²pred | s² | PRESS | Cp |
|---|---|---|---|---|---|
| x2x3 | 0.9940 | 0.9913 | 44.5552 | 782.1896 | 11.4013 |
| x1x2x3 | 0.9970 | 0.9928 | 24.7956 | 643.3578 | 3.4075 |
| x1x2x3x4 | 0.9971 | 0.9917 | 26.2073 | 741.7557 | 5.0 |

It seems clear from the information in the table that the model x1, x2, x3 is
preferable to the other two. Notice that, for the full model, Cp = 5.0. This occurs
since the bias portion is zero, and $\hat{\sigma}^2$ = 26.2073 is the mean square error from the
full model.
Figure 12.6 is a SAS PROC REG printout showing information for all possible
regressions. Here we are able to show comparisons of other models with (x1, x2, x3).
Note that (x1, x2, x3) appears to be quite good when compared to all models.
As a final check on the model (x1, x2, x3), Figure 12.7 shows a normal probability plot of the residuals for this model.


Table 12.15: Data for Example 12.12

| District | Promotional Accounts, x1 | Active Accounts, x2 | Competing Brands, x3 | Potential, x4 | Sales, y (thousands) |
|---|---|---|---|---|---|
| 1 | 5.5 | 31 | 10 | 8 | $79.3 |
| 2 | 2.5 | 55 | 8 | 6 | 200.1 |
| 3 | 8.0 | 67 | 12 | 9 | 163.2 |
| 4 | 3.0 | 50 | 7 | 16 | 200.1 |
| 5 | 3.0 | 38 | 8 | 15 | 146.0 |
| 6 | 2.9 | 71 | 12 | 17 | 177.7 |
| 7 | 8.0 | 30 | 12 | 8 | 30.9 |
| 8 | 9.0 | 56 | 5 | 10 | 291.9 |
| 9 | 4.0 | 42 | 8 | 4 | 160.0 |
| 10 | 6.5 | 73 | 5 | 16 | 339.4 |
| 11 | 5.5 | 60 | 11 | 7 | 159.6 |
| 12 | 5.0 | 44 | 4 | 12 | 86.3 |
| 13 | 6.0 | 50 | 6 | 12 | 237.5 |
| 14 | 5.0 | 39 | 10 | 6 | 107.2 |
| 15 | 3.5 | 55 | 10 | 4 | 155.0 |

Dependent Variable: sales

| Number in Model | C(p) | R-Square | Adj R-Sq | MSE | Variables in Model |
|---|---|---|---|---|---|
| 3 | 3.4075 | 0.9970 | 0.9961 | 24.79560 | x1 x2 x3 |
| 4 | 5.0000 | 0.9971 | 0.9959 | 26.20728 | x1 x2 x3 x4 |
| 2 | 11.4013 | 0.9940 | 0.9930 | 44.55518 | x2 x3 |
| 3 | 13.3770 | 0.9940 | 0.9924 | 48.54787 | x2 x3 x4 |
| 3 | 1053.643 | 0.6896 | 0.6049 | 2526.96144 | x1 x3 x4 |
| 2 | 1082.670 | 0.6805 | 0.6273 | 2384.14286 | x3 x4 |
| 2 | 1215.316 | 0.6417 | 0.5820 | 2673.83349 | x1 x3 |
| 1 | 1228.460 | 0.6373 | 0.6094 | 2498.68333 | x3 |
| 3 | 1653.770 | 0.5140 | 0.3814 | 3956.75275 | x1 x2 x4 |
| 2 | 1668.699 | 0.5090 | 0.4272 | 3663.99357 | x1 x2 |
| 2 | 1685.024 | 0.5042 | 0.4216 | 3699.64814 | x2 x4 |
| 1 | 1693.971 | 0.5010 | 0.4626 | 3437.12846 | x2 |
| 2 | 3014.641 | 0.1151 | −.0324 | 6603.45109 | x1 x4 |
| 1 | 3088.650 | 0.0928 | 0.0231 | 6248.72283 | x4 |
| 1 | 3364.884 | 0.0120 | −.0640 | 6805.59568 | x1 |

Figure 12.6: SAS printout of all possible subsets on sales data for Example 12.12.

Figure 12.7: Normal probability plot of residuals using the model x1x2x3 for Example 12.12.

Exercises

12.47 Consider the "hang time" punting data given in Case Study 12.2, using only the variables x2 and x5.
(a) Verify the regression equation shown on page 489.
(b) Predict punter hang time for a punter with LLS = 180 pounds and Power = 260 foot-pounds.
(c) Construct a 95% confidence interval for the mean hang time of a punter with LLS = 180 pounds and Power = 260 foot-pounds.
12.48 For the data of Exercise 12.15 on page 452, use the techniques of
(a) forward selection with a 0.05 level of significance to choose a linear regression model;
(b) backward elimination with a 0.05 level of significance to choose a linear regression model;
(c) stepwise regression with a 0.05 level of significance to choose a linear regression model.

12.49 Use the techniques of backward elimination with α = 0.05 to choose a prediction equation for the data of Table 12.8.
12.50 For the punter data in Case Study 12.2, an additional response, "punting distance," was also recorded. The average distance values for each of the 13 punters are given.
(a) Using the distance data rather than the hang times, estimate a multiple linear regression model of the type
$$\mu_{Y|x_1,x_2,x_3,x_4,x_5} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5$$
for predicting punting distance.
(b) Use stepwise regression with a significance level of 0.10 to select a combination of variables.
(c) Generate values for $s^2$, $R^2$, PRESS, and $\sum_{i=1}^{13}|\delta_i|$ for the entire set of 31 models. Use this information to determine the best combination of variables for predicting punting distance.
(d) For the final model you choose, plot the standardized residuals against Y and do a normal probability plot of the ordinary residuals. Comment.

| Punter | Distance, y (ft) |
|---|---|
| 1 | 162.50 |
| 2 | 144.00 |
| 3 | 147.50 |
| 4 | 163.50 |
| 5 | 192.00 |
| 6 | 171.75 |
| 7 | 162.00 |
| 8 | 104.93 |
| 9 | 105.67 |
| 10 | 117.59 |
| 11 | 140.25 |
| 12 | 150.17 |
| 13 | 165.16 |
12.51 The following is a set of data for y, the amount of money (in thousands of dollars) contributed to the alumni association at Virginia Tech by the Class of 1960, and x, the number of years following graduation:

| x | y | | x | y |
|---|---|---|---|---|
| 1 | 812.52 | | 11 | 2755.00 |
| 2 | 822.50 | | 12 | 4390.50 |
| 3 | 1211.50 | | 13 | 5581.50 |
| 4 | 1348.00 | | 14 | 5548.00 |
| 8 | 1301.00 | | 15 | 6086.00 |
| 9 | 2567.50 | | 16 | 5764.00 |
| 10 | 2526.50 | | 17 | 8903.00 |

(a) Fit a regression model of the type
$$\mu_{Y|x} = \beta_0 + \beta_1 x.$$
(b) Fit a quadratic model of the type
$$\mu_{Y|x} = \beta_0 + \beta_1 x + \beta_{11} x^2.$$
(c) Determine which of the models in (a) or (b) is preferable. Use $s^2$, $R^2$, and the PRESS residuals.
12.52 For the model of Exercise 12.50(a), test the hypothesis
H0: β4 = 0,
H1: β4 ≠ 0.
Use a P-value in your conclusion.
12.53 For the quadratic model of Exercise 12.51(b),
give estimates of the variances and covariances of the
estimates of β1 and β11 .

12.54 A client from the Department of Mechanical Engineering approached the Consulting Center at Virginia Tech for help in analyzing an experiment dealing with gas turbine engines. The voltage output of engines was measured at various combinations of blade speed and sensor extension.

| y (volts) | Speed, x1 (in./sec) | Extension, x2 (in.) |
|---|---|---|
| 1.95 | 6336 | 0.000 |
| 2.50 | 7099 | 0.000 |
| 2.93 | 8026 | 0.000 |
| 1.69 | 6230 | 0.000 |
| 1.23 | 5369 | 0.000 |
| 3.13 | 8343 | 0.000 |
| 1.55 | 6522 | 0.006 |
| 1.94 | 7310 | 0.006 |
| 2.18 | 7974 | 0.006 |
| 2.70 | 8501 | 0.006 |
| 1.32 | 6646 | 0.012 |
| 1.60 | 7384 | 0.012 |
| 1.89 | 8000 | 0.012 |
| 2.15 | 8545 | 0.012 |
| 1.09 | 6755 | 0.018 |
| 1.26 | 7362 | 0.018 |
| 1.57 | 7934 | 0.018 |
| 1.92 | 8554 | 0.018 |

(a) Fit a multiple linear regression to the data.
(b) Compute t-tests on coefficients. Give P-values.
(c) Comment on the quality of the fitted model.

12.55 Rayon whiteness is an important factor for scientists dealing in fabric quality. Whiteness is affected by pulp quality and other processing variables. Some of the variables include acid bath temperature, °C (x1); cascade acid concentration, % (x2); water temperature, °C (x3); sulfide concentration, % (x4); amount of chlorine bleach, lb/min (x5); and blanket finish temperature, °C (x6). A set of data from rayon specimens is given here. The response, y, is the measure of whiteness.

| y | x1 | x2 | x3 | x4 | x5 | x6 |
|---|---|---|---|---|---|---|
| 88.7 | 43 | 0.211 | 85 | 0.243 | 0.606 | 48 |
| 89.3 | 42 | 0.604 | 89 | 0.237 | 0.600 | 55 |
| 75.5 | 47 | 0.450 | 87 | 0.198 | 0.527 | 61 |
| 92.1 | 46 | 0.641 | 90 | 0.194 | 0.500 | 65 |
| 83.4 | 52 | 0.370 | 93 | 0.198 | 0.485 | 54 |
| 44.8 | 50 | 0.526 | 85 | 0.221 | 0.533 | 60 |
| 50.9 | 43 | 0.486 | 83 | 0.203 | 0.510 | 57 |
| 78.0 | 49 | 0.504 | 93 | 0.279 | 0.489 | 49 |
| 86.8 | 51 | 0.609 | 90 | 0.220 | 0.462 | 64 |
| 47.3 | 51 | 0.702 | 86 | 0.198 | 0.478 | 63 |
| 53.7 | 48 | 0.397 | 92 | 0.231 | 0.411 | 61 |
| 92.0 | 46 | 0.488 | 88 | 0.211 | 0.387 | 88 |
| 87.9 | 43 | 0.525 | 85 | 0.199 | 0.437 | 63 |
| 90.3 | 45 | 0.486 | 84 | 0.189 | 0.499 | 58 |
| 94.2 | 53 | 0.527 | 87 | 0.245 | 0.530 | 65 |
| 89.5 | 47 | 0.601 | 95 | 0.208 | 0.500 | 67 |

(a) Use the criteria MSE, Cp, and PRESS to find the "best" model from among all subset models.
(b) Plot standardized residuals against Y and do a normal probability plot of residuals for the "best" model. Comment.

12.56 In an effort to model executive compensation for the year 1979, 33 firms were selected, and data were gathered on compensation, sales, profits, and employment. The following data were gathered for the year 1979.

| Firm | Compensation, y (thousands) | Sales, x1 (millions) | Profits, x2 (millions) | Employment, x3 |
|---|---|---|---|---|
| 1 | $450 | $4600.6 | $128.1 | 48,000 |
| 2 | 387 | 9255.4 | 783.9 | 55,900 |
| 3 | 368 | 1526.2 | 136.0 | 13,783 |
| 4 | 277 | 1683.2 | 179.0 | 27,765 |
| 5 | 676 | 2752.8 | 231.5 | 34,000 |
| 6 | 454 | 2205.8 | 329.5 | 26,500 |
| 7 | 507 | 2384.6 | 381.8 | 30,800 |
| 8 | 496 | 2746.0 | 237.9 | 41,000 |
| 9 | 487 | 1434.0 | 222.3 | 25,900 |
| 10 | 383 | 470.6 | 63.7 | 8600 |
| 11 | 311 | 1508.0 | 149.5 | 21,075 |
| 12 | 271 | 464.4 | 30.0 | 6874 |
| 13 | 524 | 9329.3 | 577.3 | 39,000 |
| 14 | 498 | 2377.5 | 250.7 | 34,300 |
| 15 | 343 | 1174.3 | 82.6 | 19,405 |
| 16 | 354 | 409.3 | 61.5 | 3586 |
| 17 | 324 | 724.7 | 90.8 | 3905 |
| 18 | 225 | 578.9 | 63.3 | 4139 |
| 19 | 254 | 966.8 | 42.8 | 6255 |
| 20 | 208 | 591.0 | 48.5 | 10,605 |
| 21 | 518 | 4933.1 | 310.6 | 65,392 |
| 22 | 406 | 7613.2 | 491.6 | 89,400 |
| 23 | 332 | 3457.4 | 228.0 | 55,200 |
| 24 | 340 | 545.3 | 54.6 | 7800 |
| 25 | 698 | 22,862.8 | 3011.3 | 337,119 |
| 26 | 306 | 2361.0 | 203.0 | 52,000 |
| 27 | 613 | 2614.1 | 201.0 | 50,500 |
| 28 | 302 | 1013.2 | 121.3 | 18,625 |
| 29 | 540 | 4560.3 | 194.6 | 97,937 |
| 30 | 293 | 855.7 | 63.4 | 12,300 |
| 31 | 528 | 4211.6 | 352.1 | 71,800 |
| 32 | 456 | 5440.4 | 655.2 | 87,700 |
| 33 | 417 | 1229.9 | 97.5 | 14,600 |
Consider the model

$$y_i = \beta_0 + \beta_1 \ln x_{1i} + \beta_2 \ln x_{2i} + \beta_3 \ln x_{3i} + \epsilon_i, \quad i = 1, 2, \ldots, 33.$$

(a) Fit the regression with the model above.
(b) Is a model with a subset of the variables preferable to the full model?

12.57 The pull strength of a wire bond is an important characteristic. The following data give information on pull strength y, die height x1, post height x2, loop height x3, wire length x4, bond width on the die x5, and bond width on the post x6. (From Myers, Montgomery, and Anderson-Cook, 2009.)

| y | x1 | x2 | x3 | x4 | x5 | x6 |
|---|---|---|---|---|---|---|
| 8.0 | 5.2 | 19.6 | 29.6 | 94.9 | 2.1 | 2.3 |
| 8.3 | 5.2 | 19.8 | 32.4 | 89.7 | 2.1 | 1.8 |
| 8.5 | 5.8 | 19.6 | 31.0 | 96.2 | 2.0 | 2.0 |
| 8.8 | 6.4 | 19.4 | 32.4 | 95.6 | 2.2 | 2.1 |
| 9.0 | 5.8 | 18.6 | 28.6 | 86.5 | 2.0 | 1.8 |
| 9.3 | 5.2 | 18.8 | 30.6 | 84.5 | 2.1 | 2.1 |
| 9.3 | 5.6 | 20.4 | 32.4 | 88.8 | 2.2 | 1.9 |
| 9.5 | 6.0 | 19.0 | 32.6 | 85.7 | 2.1 | 1.9 |
| 9.8 | 5.2 | 20.8 | 32.2 | 93.6 | 2.3 | 2.1 |
| 10.0 | 5.8 | 19.9 | 31.8 | 86.0 | 2.1 | 1.8 |
| 10.3 | 6.4 | 18.0 | 32.6 | 87.1 | 2.0 | 1.6 |
| 10.5 | 6.0 | 20.6 | 33.4 | 93.1 | 2.1 | 2.1 |
| 10.8 | 6.2 | 20.2 | 31.8 | 83.4 | 2.2 | 2.1 |
| 11.0 | 6.2 | 20.2 | 32.4 | 94.5 | 2.1 | 1.9 |
| 11.3 | 6.2 | 19.2 | 31.4 | 83.4 | 1.9 | 1.8 |
| 11.5 | 5.6 | 17.0 | 33.2 | 85.2 | 2.1 | 2.1 |
| 11.8 | 6.0 | 19.8 | 35.4 | 84.1 | 2.0 | 1.8 |
| 12.3 | 5.8 | 18.8 | 34.0 | 86.9 | 2.1 | 1.8 |
| 12.5 | 5.6 | 18.6 | 34.2 | 83.0 | 1.9 | 2.0 |

(a) Fit a regression model using all independent variables.
(b) Use stepwise regression with input significance level 0.25 and removal significance level 0.05. Give your final model.
(c) Use all possible regression models and compute $R^2$, Cp, $s^2$, and adjusted $R^2$ for all models.
(d) Give the final model.
(e) For your model in part (d), plot studentized residuals (or R-Student) and comment.

12.58 For Exercise 12.57, test H0: β1 = β6 = 0. Give P-values and comment.

12.59 In Exercise 12.28, page 462, we have the following data concerning wear of a bearing:

| y (wear) | x1 (oil viscosity) | x2 (load) |
|---|---|---|
| 851 | 1.6 | 193 |
| 816 | 15.5 | 230 |
| 1058 | 22.0 | 172 |
| 1201 | 43.0 | 91 |
| 1357 | 33.0 | 113 |
| 1115 | 40.0 | 125 |

(a) The following model may be considered to describe the data:
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_{12} x_{1i} x_{2i} + \epsilon_i,$$
for i = 1, 2, ..., 6. The x1x2 term is an "interaction" term. Fit this model and estimate the parameters.
(b) Use the models (x1), (x1, x2), (x2), (x1, x2, x1x2) and compute PRESS, Cp, and $s^2$ to determine the "best" model.

# 12.12 Special Nonlinear Models for Nonideal Conditions
In much of the preceding material in this chapter and in Chapter 11, we have
benefited substantially from the assumption that the model errors, the $\epsilon_i$, are
normal with mean 0 and constant variance $\sigma^2$. However, there are many real-life


situations in which the response is clearly nonnormal. For example, a wealth of
applications exist where the response is binary (0 or 1) and hence Bernoulli in
nature. In the social sciences, the problem may be to develop a model to predict
whether or not an individual is a good credit risk (0 or 1) as a function of certain
socioeconomic regressors such as income, age, gender, and level of education. In
a biomedical drug trial, the response is often whether or not the patient responds
positively to a drug, while regressors may include drug dosage as well as biological
factors such as age, weight, and blood pressure. Again the response is binary
in nature. Applications are also abundant in manufacturing areas where certain
controllable factors influence whether a manufactured item is defective or not.

A second type of nonnormal application on which we will touch briefly has to do
with count data. Here the assumption of a Poisson response is often convenient.
In biomedical applications, the number of cancer cell colonies may be the response
which is modeled against drug dosages. In the textile industry, the number of
imperfections per yard of cloth may be a reasonable response which is modeled
against certain process variables.

Nonhomogeneous Variance

The reader should note the comparison of the ideal (i.e., the normal response)
situation with that of the Bernoulli (or binomial) or the Poisson response. We
have become accustomed to the fact that the normal case is very special in that
the variance is independent of the mean. Clearly this is not the case for either
Bernoulli or Poisson responses. For example, if the response is 0 or 1, suggesting a
Bernoulli response, then the model is of the form

$$p = f(\mathbf{x}, \boldsymbol{\beta}),$$

where p is the probability of a success (say response = 1). The parameter
p plays the role of $\mu_{Y|x}$ in the normal case. However, the Bernoulli variance is
p(1 − p), which, of course, is also a function of the regressor x. As a result, the
variance is not constant. This rules out the use of standard least squares, which
we have utilized in our linear regression work up to this point. The same is true
for the Poisson case, since the model is of the form

$$\lambda = f(\mathbf{x}, \boldsymbol{\beta}),$$

with Var(y) = $\mu_y$ = λ, which varies with x.
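The mean-variance link is easy to see numerically. A minimal sketch (plain Python; the coefficients b0 and b1 are illustrative only, not from the text), showing that the Bernoulli variance p(1 − p) changes with x, peaking where p = 0.5:

```python
import math

b0, b1 = 0.0, 1.5   # hypothetical linear-predictor coefficients

def p_of(x):
    """Success probability under a logistic mean model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

for x in (-3.0, 0.0, 3.0):
    p = p_of(x)
    # Variance p(1 - p) depends on x -- nonconstant, unlike the normal case.
    print(f"x = {x:+.1f}  p = {p:.3f}  Var = p(1-p) = {p * (1 - p):.3f}")
```

Because the variance depends on x through the mean, ordinary least squares loses its optimality here, which motivates the likelihood-based fitting described below.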

Binary Response (Logistic Regression)

The most popular approach to modeling binary responses is a technique entitled
logistic regression. It is used extensively in the biological sciences, biomedical
research, and engineering. Indeed, even in the social sciences binary responses are
found to be plentiful. The basic distribution for the response is either Bernoulli or
binomial. The former is found in observational studies where there are no repeated
runs at each regressor level, while the latter will be the case when an experiment
is designed. For example, in a clinical trial in which a new drug is being evaluated,
the goal might be to determine the dose of the drug that provides efficacy. So
certain doses will be employed in the experiment, and more than one subject will
be used for each dose. This case is called the grouped case.

What Is the Model for Logistic Regression?

In the case of binary responses, the mean response is a probability. In the preceding
clinical trial illustration, we might say that we wish to estimate the probability that
the patient responds properly to the drug, P(success). Thus, the model is written
in terms of a probability. Given regressors x, the logistic function is given by

$$p = \frac{1}{1 + e^{-\mathbf{x}'\boldsymbol{\beta}}}.$$

The portion $\mathbf{x}'\boldsymbol{\beta}$ is called the linear predictor, and in the case of a single regressor
x it might be written $\mathbf{x}'\boldsymbol{\beta} = \beta_0 + \beta_1 x$. Of course, we do not rule out involving
multiple regressors and polynomial terms in the so-called linear predictor. In the
grouped case, the model involves modeling the mean of a binomial rather than a
Bernoulli, and thus we have the mean given by

$$np = \frac{n}{1 + e^{-\mathbf{x}'\boldsymbol{\beta}}}.$$

Characteristics of Logistic Function

A plot of the logistic function reveals a great deal about its characteristics and
why it is utilized for this type of problem. First, the function is nonlinear. In
addition, the plot in Figure 12.8 reveals the S-shape with the function approaching
p = 1.0 as an asymptote. In this case, β1 > 0. Thus, we would never experience
an estimated probability exceeding 1.0.

Figure 12.8: The logistic function.
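These properties can be checked numerically. A minimal sketch (plain Python; β0 = −2 and β1 = 0.8 are illustrative values, not from the text) confirming that the curve is strictly increasing for β1 > 0 and never exceeds 1:

```python
import math

b0, b1 = -2.0, 0.8   # illustrative coefficients with b1 > 0

def logistic(x):
    """p = 1 / (1 + exp(-(b0 + b1 x))): the S-shaped curve of Figure 12.8."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

xs = [x / 2.0 for x in range(-20, 21)]        # grid from -10 to 10
ps = [logistic(x) for x in xs]

# Bounded strictly between 0 and 1, and monotone increasing (S-shape).
assert all(0.0 < p < 1.0 for p in ps)
assert all(a < b for a, b in zip(ps, ps[1:]))
print(f"p(-10) = {ps[0]:.4f}, p(10) = {ps[-1]:.4f}")
```

The curve crosses p = 0.5 where the linear predictor β0 + β1x equals zero, which for these coefficients is at x = 2.5.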
The regression coefficients in the linear predictor can be estimated by the
method of maximum likelihood, as described in Chapter 9. The solution to the