6.1 Introduction: Why Use a Variable-Screening Method?
$$
\begin{aligned}
&+ \beta_{36}x_8 + \beta_{37}x_9 + \beta_{38}x_{10}\\
&\quad \text{(dummy variables for qualitative variables)}\\
&+ \beta_{39}x_8x_9 + \beta_{40}x_8x_{10} + \beta_{41}x_9x_{10} + \beta_{42}x_8x_9x_{10}\\
&\quad \text{(interaction terms for qualitative variables)}\\
&+ \beta_{43}x_1x_8 + \beta_{44}x_2x_8 + \beta_{45}x_3x_8 + \beta_{46}x_4x_8 + \beta_{47}x_5x_8 + \beta_{48}x_6x_8 + \beta_{49}x_7x_8\\
&+ \beta_{50}x_1x_2x_8 + \beta_{51}x_1x_3x_8 + \beta_{52}x_1x_4x_8 + \cdots + \beta_{70}x_6x_7x_8\\
&+ \beta_{71}x_1^2x_8 + \beta_{72}x_2^2x_8 + \cdots + \beta_{77}x_7^2x_8\\
&\quad \text{(interactions between quantitative terms and qualitative variable } x_8)\\
&+ \beta_{78}x_1x_9 + \beta_{79}x_2x_9 + \beta_{80}x_3x_9 + \cdots + \beta_{112}x_7^2x_9\\
&\quad \text{(interactions between quantitative terms and qualitative variable } x_9)\\
&+ \beta_{113}x_1x_{10} + \beta_{114}x_2x_{10} + \beta_{115}x_3x_{10} + \cdots + \beta_{147}x_7^2x_{10}\\
&\quad \text{(interactions between quantitative terms and qualitative variable } x_{10})\\
&+ \beta_{148}x_1x_8x_9 + \beta_{149}x_2x_8x_9 + \beta_{150}x_3x_8x_9 + \cdots + \beta_{182}x_7^2x_8x_9\\
&\quad \text{(interactions between quantitative terms and qualitative term } x_8x_9)\\
&+ \beta_{183}x_1x_8x_{10} + \beta_{184}x_2x_8x_{10} + \beta_{185}x_3x_8x_{10} + \cdots + \beta_{217}x_7^2x_8x_{10}\\
&\quad \text{(interactions between quantitative terms and qualitative term } x_8x_{10})\\
&+ \beta_{218}x_1x_9x_{10} + \beta_{219}x_2x_9x_{10} + \beta_{220}x_3x_9x_{10} + \cdots + \beta_{252}x_7^2x_9x_{10}\\
&\quad \text{(interactions between quantitative terms and qualitative term } x_9x_{10})\\
&+ \beta_{253}x_1x_8x_9x_{10} + \beta_{254}x_2x_8x_9x_{10} + \beta_{255}x_3x_8x_9x_{10} + \cdots + \beta_{287}x_7^2x_8x_9x_{10}\\
&\quad \text{(interactions between quantitative terms and qualitative term } x_8x_9x_{10})
\end{aligned}
$$
To fit this model, we would need to collect data for, at minimum, 289 executives!
Otherwise, we will have 0 degrees of freedom for estimating σ², the variance of
the random error term. Even if we could obtain a data set this large, the task of
interpreting the β parameters in the model is a daunting one. This model, with its
numerous multivariable interactions and squared terms, is far too complex to be of
use in practice.
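The count of 288 β parameters (β0 through β287), and hence the minimum of 289 executives, can be checked with a little arithmetic. The sketch below assumes, based on the parameter numbering, that β0 through β35 belong to a complete second-order model in the seven quantitative variables; that part of the equation is not reproduced here, so treat the breakdown as an illustration rather than a restatement of the model.

```python
from math import comb

n_quant = 7   # quantitative variables x1..x7
n_qual = 3    # dummy variables x8, x9, x10

# Second-order terms in the quantitative variables (assumed structure):
# 7 first-order + C(7, 2) = 21 two-way interactions + 7 squared terms.
quant_terms = n_quant + comb(n_quant, 2) + n_quant

# Main effects and all interactions among the three dummies:
# 3 main effects + 3 two-way + 1 three-way = 2**3 - 1.
qual_terms = 2 ** n_qual - 1

# Every quantitative term crossed with every qualitative term.
cross_terms = quant_terms * qual_terms

n_betas = 1 + quant_terms + qual_terms + cross_terms   # intercept included
min_n = n_betas + 1                                     # leaves 1 df for error
print(quant_terms, qual_terms, cross_terms, n_betas, min_n)
```

The 35 quantitative terms crossed with the 7 qualitative terms account for the 245 parameters β43 through β287 listed above.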
In this chapter, we consider two systematic methods designed to reduce a large
list of potential predictors to a more manageable one. These techniques, known as
variable screening procedures, objectively determine which independent variables
in the list are the most important predictors of y and which are the least important
predictors. The most widely used method, stepwise regression, is discussed in
Section 6.2, while another popular method, the all-possible-regressions selection
procedure, is the topic of Section 6.3. In Section 6.4, several caveats of these methods are
given.
6.2 Stepwise Regression
One of the most widely used variable screening methods is known as stepwise
regression. To run a stepwise regression, the user ﬁrst identiﬁes the dependent
variable (response) y, and the set of potentially important independent variables,
328 Chapter 6 Variable Screening Methods
x1 , x2 , . . . , xk , where k is generally large. [Note: This set of variables could include
both ﬁrst-order and higher-order terms as well as interactions.] The data are entered
into the computer software, and the stepwise procedure begins.
Step 1. The software program ﬁts all possible one-variable models of the form
E(y) = β0 + β1 xi
to the data, where xi is the ith independent variable, i = 1, 2, . . . , k. For
each model, the test of the null hypothesis
H0 : β1 = 0
against the alternative hypothesis
Ha : β1 ≠ 0
is conducted using the t-test (or the equivalent F -test) for a single β parameter. The independent variable that produces the largest (absolute) t-value
is declared the best one-variable predictor of y.∗ Call this independent
variable x1 .
Step 2. The stepwise program now begins to search through the remaining (k − 1)
independent variables for the best two-variable model of the form
E(y) = β0 + β1 x1 + β2 xi
This is done by ﬁtting all two-variable models containing x1 (the variable
selected in the ﬁrst step) and each of the other (k − 1) options for the second
variable xi . The t-values for the test H0 : β2 = 0 are computed for each of
the (k − 1) models (corresponding to the remaining independent variables,
xi , i = 2, 3, . . . , k), and the variable having the largest t is retained. Call this
variable x2 .
Before proceeding to Step 3, the stepwise routine will go back and check
the t-value of β̂1 after β̂2 x2 has been added to the model. If the t-value has
become nonsignificant at some specified α level (say α = .05), the variable
x1 is removed and a search is made for the independent variable with a
β parameter that will yield the most significant t-value in the presence
of β̂2 x2.
The reason the t-value for x1 may change from step 1 to step 2 is that
the meaning of the coefficient β̂1 changes. In step 2, we are approximating
a complex response surface in two variables with a plane. The best-fitting
plane may yield a different value for β̂1 than that obtained in step 1. Thus,
both the value of β̂1 and its significance usually change from step 1 to step 2.
For this reason, stepwise procedures that recheck the t-values at each step
are preferred.
Step 3. The stepwise regression procedure now checks for a third independent
variable to include in the model with x1 and x2 . That is, we seek the best
model of the form
E(y) = β0 + β1 x1 + β2 x2 + β3 xi
To do this, the computer ﬁts all the (k − 2) models using x1 , x2 , and each of
the (k − 2) remaining variables, xi , as a possible x3 . The criterion is again to
include the independent variable with the largest t-value. Call this best third
variable x3. The better programs now recheck the t-values corresponding to
the x1 and x2 coefficients, replacing the variables that yield nonsignificant
t-values. This procedure is continued until no further independent variables
can be found that yield significant t-values (at the specified α level) in the
presence of the variables already in the model.
∗ Note that the variable with the largest t-value is also the one with the largest (absolute) Pearson product
moment correlation, r (Section 3.7), with y.
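The three steps can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not the algorithm used by MINITAB or SAS: the function names `t_values` and `stepwise` are hypothetical, a fixed critical value `t_crit = 2.0` stands in for a test at a specified α level, and the same cutoff is assumed for entering and removing variables. The simulated data are invented for the demonstration.

```python
import numpy as np

def t_values(design, y):
    """t-statistic of every coefficient in a least squares fit.

    `design` must already contain a leading column of ones.
    """
    n, m = design.shape
    xtx_inv = np.linalg.inv(design.T @ design)
    beta = xtx_inv @ design.T @ y
    resid = y - design @ beta
    mse = resid @ resid / (n - m)            # estimate of sigma^2
    return beta / np.sqrt(mse * np.diag(xtx_inv))

def stepwise(X, y, t_crit=2.0, max_steps=50):
    """Stepwise selection over the columns of X (hypothetical helper)."""
    n, k = X.shape
    ones = np.ones((n, 1))
    selected = []
    for _ in range(max_steps):               # cap guards against cycling
        # Entry: try each remaining variable alongside those already selected.
        best_j, best_t = None, t_crit
        for j in range(k):
            if j in selected:
                continue
            design = np.hstack([ones, X[:, selected + [j]]])
            t_new = abs(t_values(design, y)[-1])
            if t_new > best_t:
                best_j, best_t = j, t_new
        if best_j is None:                   # no candidate is significant
            break
        selected.append(best_j)
        # Recheck: drop the weakest variable if it is no longer significant.
        design = np.hstack([ones, X[:, selected]])
        t_abs = np.abs(t_values(design, y)[1:])
        weakest = int(np.argmin(t_abs))
        if t_abs[weakest] < t_crit:
            selected.pop(weakest)
    return selected

# Simulated data: only columns 0 and 2 truly affect y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = 3 + 2 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=100)
selected = stepwise(X, y)
print(selected)
```

Skipping the recheck block turns the routine into plain forward selection, since variables entered at an earlier step are then never reexamined.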
The result of the stepwise procedure is a model containing only those terms with
t-values that are significant at the specified α level. Thus, in most practical situations
only a few of the large number of independent variables remain. However, it is
very important not to jump to the conclusion that all the independent variables
important for predicting y have been identified or that the unimportant independent
variables have been eliminated. Remember, the stepwise procedure is using only
sample estimates of the true model coefficients (β's) to select the important variables.
An extremely large number of single β parameter t-tests have been conducted, and
the probability is very high that one or more errors have been made in including
or excluding variables. That is, we have very probably included some unimportant
independent variables in the model (Type I errors) and eliminated some important
ones (Type II errors).
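To get a rough feel for how quickly these errors accumulate, suppose (purely for illustration) that m tests were each conducted at level α and were independent of one another. The t-tests in a stepwise run are not independent, so the figure below is only an order-of-magnitude guide, and m = 50 is a hypothetical count.

```latex
P(\text{at least one Type I error}) = 1 - (1 - \alpha)^m,
\qquad \text{e.g., } \alpha = .05,\ m = 50:\quad 1 - (.95)^{50} \approx .92
```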
There is a second reason why we might not have arrived at a good model.
When we choose the variables to be included in the stepwise regression, we
may often omit high-order terms (to keep the number of variables manageable).
Consequently, we may have initially omitted several important terms from the
model. Thus, we should recognize stepwise regression for what it is: an objective
variable screening procedure.
Successful model builders will now consider second-order terms (for quantitative variables) and other interactions among variables screened by the stepwise
procedure. It would be best to develop this response surface model with a second set
of data independent of that used for the screening, so the results of the stepwise procedure can be partially veriﬁed with new data. This is not always possible, however,
because in many modeling situations only a small amount of data is available.
Do not be deceived by the impressive-looking t-values that result from the
stepwise procedure—it has retained only the independent variables with the largest
t-values. Also, be certain to consider second-order terms in systematically developing
the prediction model. Finally, if you have used a first-order model for your stepwise
procedure, remember that it may be greatly improved by the addition of higher-order terms.
Be wary of using the results of stepwise regression to make inferences about the
relationship between E(y) and the independent variables in the resulting first-order
model. First, an extremely large number of t-tests have been conducted,
leading to a high probability of making one or more Type I or Type II errors.
Second, it is typical to enter only first-order and main effect terms as candidate
variables in the stepwise model. Consequently, the final stepwise model will not
include any higher-order or interaction terms. Stepwise regression should be
used only when necessary, that is, when you want to determine which of a large
number of potentially important independent variables should be used in the
model-building process.
Example 6.1
Refer to Example 4.10 (p. 217) and the multiple regression model for executive
salary. A preliminary step in the construction of this model is the determination of
the most important independent variables. For one firm, 10 potential independent
variables (seven quantitative and three qualitative) were measured in a sample of
100 executives. The data, described in Table 6.1, are saved in the EXECSAL2 file.
Since it would be very difficult to construct a complete second-order model with all of
the 10 independent variables, use stepwise regression to decide which of the 10 variables
should be included in the building of the final model for the natural log of
executive salaries.
Table 6.1 Independent variables in the executive salary example
Independent Variable                                          Type
Gender (1 if male, 0 if female)                               qualitative
Number of employees supervised                                quantitative
Corporate assets (millions of dollars)                        quantitative
Board member (1 if yes, 0 if no)                              qualitative
Company profits (past 12 months, millions of dollars)         quantitative
Has international responsibility (1 if yes, 0 if no)          qualitative
Company's total sales (past 12 months, millions of dollars)   quantitative
We will use stepwise regression with the main effects of the 10 independent variables
to identify the most important variables. The dependent variable y is the natural
logarithm of the executive salaries. The MINITAB stepwise regression printout is
shown in Figure 6.1. MINITAB automatically enters the constant term (β0) into the
model in the first step. The remaining steps follow the procedure outlined earlier in
this section.
In Step 1, MINITAB ﬁts all possible one-variable models of the form,
E(y) = β0 + β1 xi .
You can see from Figure 6.1 that the ﬁrst variable selected is x1 , years of experience.
Thus, x1 has the largest (absolute) t-value associated with a test of H0 : β1 = 0. This
value, t = 12.62, is highlighted on the printout.
Next (step 2), MINITAB ﬁts all possible two-variable models of the form,
E(y) = β0 + β1 x1 + β2 xi .
(Note that the variable selected in the ﬁrst step, x1 , is automatically included in
the model.) The variable with the largest (absolute) t-value associated with a test
of H0 : β2 = 0 is the dummy variable for gender, x3 . This t-value, t = 7.10, is also
highlighted on the printout.
In Step 3, all possible three-variable models of the form
E(y) = β0 + β1 x1 + β2 x3 + β3 xi
are ﬁt. (Note that x1 and x3 are included in the model.) MINITAB selects x4 , number
of employees supervised, based on the value t = 7.32 (highlighted on the printout)
associated with a test of H0 : β3 = 0.
Figure 6.1 MINITAB stepwise regression results for executive salaries
In Steps 4 and 5, the variables x2 (years of education) and x5 (corporate assets),
respectively, are selected for inclusion into the model. The t-values for the tests of
the appropriate β’s are highlighted in Figure 6.1. MINITAB stopped after ﬁve steps
because none of the other independent variables met the criterion for admission to
the model. As a default, MINITAB (and most other statistical software packages)
uses α = .15 in the tests conducted. In other words, if the p-value associated with a
test of a β-coefﬁcient is greater than α = .15, then the corresponding variable is not
included in the model.
The results of the stepwise regression suggest that we should concentrate on the
ﬁve variables, x1 , x2 , x3 , x4 , and x5 , in our ﬁnal modeling effort. Models with curvilinear terms as well as interactions should be proposed and evaluated (as demonstrated
in Chapter 5) to determine the best model for predicting executive salaries.
There are several other stepwise regression techniques designed to select the
most important independent variables. One of these, called forward selection, is
nearly identical to the stepwise procedure previously outlined. The only difference is
that the forward selection technique provides no option for rechecking the t-values
corresponding to the x’s that have entered the model in an earlier step. Thus,
stepwise regression is preferred to forward selection in practice.
Figure 6.2 SAS backward stepwise regression for executive salaries
Another technique, called backward elimination, initially fits a model containing
terms for all potential independent variables. That is, for k independent variables,
the model E(y) = β0 + β1 x1 + β2 x2 + · · · + βk xk is fit in step 1. The variable with
the smallest (absolute) t (or F) statistic for testing H0 : βi = 0 is identified and dropped
from the model if the t-value is less than some specified critical value. The model
with the remaining (k − 1) independent variables is fit in step 2, and again, the
variable associated with the smallest nonsignificant t-value is dropped. This process
is repeated until no further nonsignificant independent variables can be found.
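Backward elimination is simple enough to sketch directly. The code below is an illustration under stated assumptions, not the SAS implementation: `t_values` and `backward_elimination` are hypothetical names, a fixed cutoff `t_crit = 2.0` replaces a formal test at a chosen α level, and the data are simulated.

```python
import numpy as np

def t_values(design, y):
    """t-statistic of every coefficient; `design` includes a column of ones."""
    n, m = design.shape
    xtx_inv = np.linalg.inv(design.T @ design)
    beta = xtx_inv @ design.T @ y
    resid = y - design @ beta
    mse = resid @ resid / (n - m)
    return beta / np.sqrt(mse * np.diag(xtx_inv))

def backward_elimination(X, y, t_crit=2.0):
    """Start from the full model and drop the weakest variable each step."""
    n, k = X.shape
    ones = np.ones((n, 1))
    remaining = list(range(k))
    removed = []                              # order in which variables leave
    while remaining:
        design = np.hstack([ones, X[:, remaining]])
        t_abs = np.abs(t_values(design, y)[1:])   # skip the intercept
        weakest = int(np.argmin(t_abs))
        if t_abs[weakest] >= t_crit:          # everything left is significant
            break
        removed.append(remaining.pop(weakest))
    return remaining, removed

# Simulated data: only columns 0 and 2 truly affect y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = 3 + 2 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=100)
remaining, removed = backward_elimination(X, y)
print(remaining, removed)
```

Note that the very first fit uses all k variables at once, which is why the method needs more observations than parameters in the full model.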
For example, applying the backward elimination method to the executive salary
data of Example 6.1 yields the results shown in the SAS printout in Figure 6.2. At
the bottom of the printout you can see that the variables x10 , x7 , x8 , x6 , and x9 (in that
order) were removed from the model, leaving x1 –x5 as the selected independent
variables. Thus, for this example, the backward elimination and stepwise methods
yield identical results. This will not always be the case, however. In fact, the
backward elimination method can be an advantage when at least one of the
candidate independent variables is a qualitative variable at three or more levels
(requiring at least two dummy variables), since the backward procedure tests the
contribution of each dummy variable after the others have been entered into the
model. The real disadvantage of using the backward elimination technique is that
you need a sufﬁciently large number of data points to ﬁt the initial model in Step 1.
6.3 All-Possible-Regressions Selection Procedure
In Section 6.2, we presented stepwise regression as an objective screening procedure
for selecting the most important predictors of y. Other, more subjective,
variable selection techniques have been developed in the literature for the purpose
of identifying important independent variables. The most popular of these
procedures are those that consider all possible regression models given the set
of potentially important predictors. Such a procedure is commonly known as an
all-possible-regressions selection procedure. The techniques differ with respect to
the criteria for selecting the "best" subset of variables. In this section, we describe
four criteria widely used in practice, then give an example illustrating the four
techniques.
R² Criterion
Consider the set of potentially important variables, x1, x2, x3, . . . , xk. We learned in
Section 4.7 that the multiple coefficient of determination
$$R^2 = 1 - \frac{\text{SSE}}{\text{SS(Total)}}$$
will increase when independent variables are added to the model. Therefore, the
model that includes all k independent variables
E(y) = β0 + β1 x1 + β2 x2 + · · · + βk xk
will yield the largest R². Yet, we have seen examples (Chapter 5) where adding
terms to the model does not yield a significantly better prediction equation. The
objective of the R² criterion is to find a subset model (i.e., a model containing a
subset of the k independent variables) so that adding more variables to the model
will yield only small increases in R². In practice, the best model found by the R²
criterion will rarely be the model with the largest R². Generally, you are looking
for a simple model that is as good as, or nearly as good as, the model with all k
independent variables. But unlike that in stepwise regression, the decision about
when to stop adding variables to the model is a subjective one.
Adjusted R² or MSE Criterion
One drawback to using the R² criterion, you will recall, is that the value of R² does
not account for the number of β parameters in the model. If enough variables are
added to the model so that the sample size n equals the total number of β's in the
model, you will force R² to equal 1. Alternatively, we can use the adjusted R². It is
easy to show that Rₐ² is related to MSE as follows:
$$R_a^2 = 1 - (n - 1)\,\frac{\text{MSE}}{\text{SS(Total)}}$$
Note that Rₐ² increases only if MSE decreases [since SS(Total) remains constant
for all models]. Thus, an equivalent procedure is to search for the model with the
minimum, or near minimum, MSE.
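The first two criteria are easy to compute for any candidate subset. The following sketch uses a hypothetical helper, `fit_summary`, and simulated data; it returns SSE, MSE, R², and Rₐ² so that a small model and a larger one can be compared side by side.

```python
import numpy as np

def fit_summary(X, y):
    """Return (SSE, MSE, R2, adjusted R2) for a least squares fit with intercept."""
    n, p = X.shape
    design = np.hstack([np.ones((n, 1)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    sse = float(resid @ resid)
    ss_total = float(((y - y.mean()) ** 2).sum())
    mse = sse / (n - (p + 1))                 # p + 1 betas, including beta0
    r2 = 1 - sse / ss_total
    r2_adj = 1 - (n - 1) * mse / ss_total     # the relation to MSE given above
    return sse, mse, r2, r2_adj

# Simulated data: only the first column truly affects y.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = 1 + 2 * X[:, 0] + rng.normal(size=50)

small = fit_summary(X[:, :1], y)   # one-variable model
big = fit_summary(X, y)            # all four variables
print(small)
print(big)
```

R² for the larger model can never be smaller than for the nested one-variable model, whereas Rₐ² and MSE can move in either direction when the added variables contribute little.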
Cp Criterion
A third option is based on a quantity called the total mean square error (TMSE) for
the fitted regression model:
$$\text{TMSE} = E\left\{\sum_{i=1}^{n}\left[\hat{y}_i - E(y_i)\right]^2\right\} = \sum_{i=1}^{n}\left[E(\hat{y}_i) - E(y_i)\right]^2 + \sum_{i=1}^{n}\operatorname{Var}(\hat{y}_i)$$
where E(ŷi) is the mean response for the subset (fitted) regression model and E(yi)
is the mean response for the true model. The objective is to compare the TMSE for
the subset regression model with σ², the variance of the random error for the true
model, using the ratio
$$\frac{\text{TMSE}}{\sigma^2}$$
Small values of this ratio imply that the subset regression model has a small total mean
square error relative to σ². Unfortunately, both TMSE and σ² are unknown, and we
must rely on sample estimates of these quantities. It can be shown (proof omitted)
that a good estimator of the ratio is given by
$$C_p = \frac{\text{SSE}_p}{\text{MSE}_k} + 2(p + 1) - n$$
where n is the sample size, p is the number of independent variables in the subset
model, k is the total number of potential independent variables, SSEp is the SSE for
the subset model, and MSEk is the MSE for the model containing all k independent
variables. The statistical software packages discussed in this text have routines that
calculate the Cp statistic. In fact, the Cp value is automatically printed at each step
in the SAS and MINITAB stepwise regression printouts (see Figure 6.1).
The Cp criterion selects as the best model the subset model with (1) a small
value of Cp (i.e., a small total mean square error) and (2) a value of Cp near p + 1, a
property that indicates that slight or no bias exists in the subset regression model.∗
Thus, the Cp criterion focuses on minimizing total mean square error and the
regression bias. If you are mainly concerned with minimizing total mean square
error, you will want to choose the model with the smallest Cp value, as long as the
bias is not large. On the other hand, you may prefer a model that yields a Cp value
slightly larger than the minimum but that has slight (or no) bias.
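A short sketch shows how Cp might be computed for a given subset. The helper names `sse` and `mallows_cp` are hypothetical and the data are simulated. One useful check follows directly from the formula: for the model containing all k variables, SSE divided by MSE equals n − (k + 1), so Cp equals k + 1 exactly.

```python
import numpy as np

def sse(X, y):
    """Sum of squared errors for a least squares fit with intercept."""
    design = np.hstack([np.ones((len(y), 1)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)

def mallows_cp(X_full, y, subset):
    """Cp for the model using the columns of X_full listed in `subset`."""
    n, k = X_full.shape
    p = len(subset)
    mse_k = sse(X_full, y) / (n - (k + 1))    # MSE of the model with all k x's
    return sse(X_full[:, list(subset)], y) / mse_k + 2 * (p + 1) - n

# Simulated data: only the first column truly affects y.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = 1 + 2 * X[:, 0] + rng.normal(size=50)

cp_good = mallows_cp(X, y, [0])            # contains the important variable
cp_bad = mallows_cp(X, y, [1])             # omits it, so the fit is biased
cp_full = mallows_cp(X, y, [0, 1, 2, 3])   # equals k + 1 = 5 by construction
print(cp_good, cp_bad, cp_full)
```

A subset that omits an important variable has a much larger Cp than p + 1, which is the signature of bias described above.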
PRESS Criterion
A fourth criterion used to select the best subset regression model is the PRESS
statistic, introduced in Section 5.11. Recall that the PRESS (or, prediction sum of
squares) statistic for a model is calculated as follows:
$$\text{PRESS} = \sum_{i=1}^{n}\left[y_i - \hat{y}_{(i)}\right]^2$$
where ŷ(i) denotes the predicted value for the ith observation obtained when the
regression model is fit with the data point for the ith observation omitted (or
deleted) from the sample.† Thus, the candidate model is fit to the sample data n
times, each time omitting one of the data points and obtaining the predicted value
of y for that data point. Since small differences yi − ŷ(i) indicate that the model is
predicting well, we desire a model with a small PRESS.
Computing the PRESS statistic may seem like a tiresome chore, since repeated
regression runs (a total of n runs) must be made for each candidate model.
However, most statistical software packages have options for computing PRESS.‡
∗ A model is said to be unbiased if E(ŷ) = E(y). We state (without proof) that for an unbiased regression model,
E(Cp) ≈ p + 1. In general, subset models will be biased since k − p independent variables are omitted from the
fitted model. However, when Cp is near p + 1, the bias is small and can essentially be ignored.
† The quantity yi − ŷ(i) is called the "deleted" residual for the ith observation. We discuss deleted residuals in
more detail in Chapter 8.
‡ PRESS can also be calculated using the results from a regression run on all n data points. The formula is
$$\text{PRESS} = \sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1 - h_{ii}}\right)^2$$
where hii is a function of the independent variables in the model. In Chapter 8, we show how hii (called leverage)
can be used to detect influential observations.
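Both routes to PRESS can be compared numerically. In the sketch below, `press_loo` refits the model n times with one observation deleted each time, and `press_leverage` uses the single-fit formula with the leverages hii taken from the diagonal of the hat matrix. The function names are hypothetical and the data are simulated.

```python
import numpy as np

def press_loo(X, y):
    """PRESS by brute force: n fits, each omitting one observation."""
    n = len(y)
    design = np.hstack([np.ones((n, 1)), X])
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(design[keep], y[keep], rcond=None)
        total += (y[i] - design[i] @ beta) ** 2   # squared deleted residual
    return float(total)

def press_leverage(X, y):
    """PRESS from one fit on all n points, using the leverages h_ii."""
    n = len(y)
    design = np.hstack([np.ones((n, 1)), X])
    hat = design @ np.linalg.inv(design.T @ design) @ design.T
    resid = y - hat @ y                           # ordinary residuals
    return float(np.sum((resid / (1 - np.diag(hat))) ** 2))

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3))
y = 2 + X[:, 0] - X[:, 1] + rng.normal(size=40)

p_slow = press_loo(X, y)
p_fast = press_leverage(X, y)
print(p_slow, p_fast)
```

The two values agree up to rounding error, which is why software can report PRESS without actually running n separate regressions.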
Plots aid in the selection of the best subset regression model using the
all-possible-regressions procedure. The criterion measure, either R², MSE, Cp, or
PRESS, is plotted on the vertical axis against p, the number of independent
variables in the subset model, on the horizontal axis. We illustrate all four
selection criteria in an example.
Example 6.2
Refer to Example 6.1 and the data on executive salaries. Recall that we want to
identify the most important independent variables for predicting the natural log
of salary from the list of 10 variables given in Table 6.1. Apply the
all-possible-regressions selection procedure to find the most important independent variables.
We entered the executive salary data into MINITAB and used MINITAB's
all-possible-regressions selection routine to obtain the printout shown in Figure 6.3. For
k = 10 independent variables, there exist 1,023 possible subset first-order models.
Although MINITAB fits all of these models, the output in Figure 6.3 shows only the
results for the "best" model for each value of p. From the printout, you can see that
the best one-variable model includes x1 (years of experience); the best two-variable
model includes x1 and x3 (gender); the best three-variable model includes x1, x3, and
x4 (number supervised); and so on.
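The figure of 1,023 is 2¹⁰ − 1, the number of nonempty subsets of 10 variables. A brute-force enumeration is sketched below on simulated data (not the EXECSAL2 file); the names `sse` and `best_subsets` are hypothetical. For a fixed p, the subset with the smallest SSE is also the one with the largest R², so SSE is used to pick the "best" model of each size.

```python
from itertools import combinations
import numpy as np

def sse(X, y):
    """Sum of squared errors for a least squares fit with intercept."""
    design = np.hstack([np.ones((len(y), 1)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)

def best_subsets(X, y):
    """Fit every nonempty subset; keep the smallest-SSE subset for each size p."""
    n, k = X.shape
    best = {}
    n_models = 0
    for p in range(1, k + 1):
        for subset in combinations(range(k), p):
            n_models += 1
            s = sse(X[:, list(subset)], y)
            if p not in best or s < best[p][1]:
                best[p] = (subset, s)
    return best, n_models

# Simulated data with 10 candidates; only columns 0 and 2 truly affect y.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
y = 1 + 2 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=100)

best, n_models = best_subsets(X, y)
print(n_models)
for p in sorted(best):
    print(p, best[p][0], round(best[p][1], 3))
```

Because the number of models doubles with each added candidate, exhaustive enumeration of this kind is practical only for a moderate number of variables.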
Figure 6.3 MINITAB all-possible-regressions selection results for executive salaries
These "best subset" models are summarized in Table 6.2. In addition to the
variables included in each model, the table gives the values of R², adjusted R², MSE,
Cp, and PRESS. To determine which subset model to select, we plot these quantities
against the number of variables, p. The MINITAB graphs for R², adjusted R², Cp,
and PRESS are shown in Figures 6.4a–d, respectively.
Table 6.2 Results for best subset models
p     Variables in the Model
1     x1
2     x1, x3
3     x1, x3, x4
4     x1, x2, x3, x4
5     x1, x2, x3, x4, x5
6     x1, x2, x3, x4, x5, x9
7     x1, x2, x3, x4, x5, x6, x9
8     x1, x2, x3, x4, x5, x6, x8, x9
9     x1, x2, x3, x4, x5, x6, x7, x8, x9
10    x1, x2, x3, x4, x5, x6, x7, x8, x9, x10
Figure 6.4 MINITAB plots of all-possible-regressions selection criteria for Example 6.2
In Figure 6.4a, we see that the R² values tend to increase in very small amounts
for models with more than p = 5 predictors. A similar pattern is shown in Figure 6.4b
for Rₐ². Thus, both the R² and Rₐ² criteria suggest that the model containing
the five predictors x1, x2, x3, x4, and x5 is a good candidate for the best subset
model.
Figure 6.4c shows the plotted Cp values and the line Cp = p + 1. Notice that
the subset models with p ≥ 5 independent variables all have relatively small Cp
values and vary tightly about the line Cp = p + 1. This implies that these models
have a small total mean square error and a negligible bias. The model corresponding
to p = 4, although certainly outperforming the models with p ≤ 3, appears to fall
short of the larger models according to the Cp criterion. From Figure 6.4d you can
see that the PRESS is smallest for the five-variable model with x1, x2, x3, x4, and x5
(PRESS = .610).
According to all four criteria, the variables x1 , x2 , x3 , x4 , and x5 should be
included in the group of the most important predictors.
In summary, variable selection procedures based on the all-possible-regressions
selection criterion will assist you in identifying the most important independent
variables for predicting y. Keep in mind, however, that these techniques lack the
objectivity of a stepwise regression procedure. Furthermore, you should be wary of
concluding that the best model for predicting y has been found, since, in practice,
interactions and higher-order terms are typically omitted from the list of potential
predictors.
6.4 Caveats
Both stepwise regression and the all-possible-regressions selection procedure are
useful variable screening methods. Many regression analysts, however, tend to apply
these procedures as model-building methods. Why? The stepwise (or best subset)
model will often have a high value of R² and all the β coefficients in the model
will be signiﬁcantly different from 0 with small p-values (see Figure 6.1). And, with
very little work (other than collecting the data and entering it into the computer),
you can obtain the model using a statistical software package. Consequently, it is
extremely tempting to use the stepwise model as the ﬁnal model for predicting and
making inferences about the dependent variable, y.
We conclude this chapter with several caveats and some advice on using stepwise
regression and the all-possible-regressions selection procedure. Be wary of using
the stepwise (or best subset) model as the ﬁnal model for predicting y for several
reasons. First, recall that either procedure tends to ﬁt an extremely large number
of models and perform an extremely large number of tests (objectively, in stepwise
regression, and subjectively, in best subsets regression). Thus, the probability of
making at least one Type I error or at least one Type II error is often quite high.
That is, you are very likely to either include at least one unimportant independent
variable or leave out at least one important independent variable in the final
model.
Second, analysts typically do not include higher-order terms or interactions in
the list of potential predictors for stepwise regression. Therefore, if no real model
building is performed, the ﬁnal model will be a ﬁrst-order, main effects model. Most
real-world relationships between variables are not linear, and these relationships
often are moderated by another variable (i.e., interaction exists). In Chapter 8, we
learn that higher-order terms are often revealed through residual plotting.
Third, even if the analyst includes some higher-order terms and interactions as
potential predictors, the stepwise and best subsets procedures will more than likely
select a nonsensical model. For example, consider the stepwise model
E(y) = β0 + β1 x1 + β2 x2 x5 + β3 x3².
The model includes an interaction for x2 and x5 , but omits the main effects for
these terms, and it includes a quadratic term for x3 but omits the ﬁrst-order (shift