5.12 Case Study 1: Modeling Multiple Linear Regression
174 ◾ Statistical Data Mining Using SAS Application
5. Save the _score_ and regest datasets for future use: These two datasets are created and saved as temporary SAS datasets in the work folder and are also exported to Excel worksheets saved in the user-specified output folder. The _score_ dataset contains the observed variables; the predicted scores, including observations with missing response values; residuals; and confidence-interval estimates. This dataset can serve as the base for developing scorecards for each observation. The second SAS dataset, regest, contains the parameter estimates, which can be used in the RSCORE macro for scoring different datasets containing the same variables.
6. If-then analysis and lift charts: Perform if-then analysis and construct a lift chart to estimate the differences in the predicted response when one of the continuous predictor variables is fixed at a given value.
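Steps 5 and 6 can be sketched outside SAS as well. The snippet below is a minimal Python illustration, not the RSCORE macro or the macro's lift-chart code: it applies saved parameter estimates (the role the regest dataset plays) to score new observations, then performs a simple what-if comparison with one continuous predictor fixed at a chosen value. The coefficient values and the score helper are hypothetical, constructed for the example.

```python
# Score new observations with parameter estimates saved from a fitted MLR model
# (the role the "regest" dataset plays), then compare predictions when one
# continuous predictor is fixed at a given value. All numbers are illustrative.

def score(estimates, rows):
    """Predicted response: intercept + sum of beta_i * x_i per observation."""
    preds = []
    for row in rows:
        yhat = estimates["Intercept"]
        for name, beta in estimates.items():
            if name != "Intercept":
                yhat += beta * row[name]
        preds.append(yhat)
    return preds

# Hypothetical estimates for two predictors (not the fitted CARS93 values).
estimates = {"Intercept": 2.0, "X4": 0.05, "X7": 0.3}
new_cars = [{"X4": 100, "X7": 15.0}, {"X4": 200, "X7": 20.0}]
print(score(estimates, new_cars))                # predicted responses

# If-then analysis: fix X4 at 150 for every observation and re-score.
fixed = score(estimates, [{**row, "X4": 150} for row in new_cars])
diffs = [f - b for f, b in zip(fixed, score(estimates, new_cars))]
print(diffs)                                     # change in predicted response
```

Averaging such differences across observations, or plotting them cumulatively, is the idea behind the lift chart described in step 6.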
Multiple Linear Regression Analysis of 1993 Car Attribute Data

Data name: SAS dataset CARS93
Response variable:
  Y2: Midprice
Predictor variables:
  X1: Air bags (0 = none, 1 = driver only, 2 = driver and passenger)
  X2: Number of cylinders
  X3: Engine size (liters)
  X4: HP (maximum)
  X5: RPM (revs per minute at maximum HP)
  X6: Engine revolutions per mile (in highest gear)
  X7: Fuel tank capacity (gallons)
  X8: Passenger capacity (persons)
  X9: Car length (inches)
  X10: Wheelbase (inches)
  X11: Car width (inches)
  X12: U-turn space (feet)
  X13: Rear seat room (inches)
  X14: Luggage capacity (cubic feet)
  X15: Weight (pounds)
Number of observations: 92
Data source: Lock38
© 2010 by Taylor and Francis Group, LLC
5.12.1.1 Step 1: Preliminary Model Selection
Open the REGDIAG2.SAS macro-call file in the SAS EDITOR window and click
RUN to open the REGDIAG2 macro-call window (Figure 5.4). Input the appropriate macro-input values by following the suggestions given in the help file (Appendix 2).
Leave the group variable option blank since all the predictors used are continuous.
Leave the macro field #14 BLANK to skip regression diagnostics and to run MLR.
◾◾ Special note to SAS Enterprise Guide (EG) Code Window users: Because the user-friendly SAS macro applications included in this book use SAS WINDOW/DISPLAY commands, which are not compatible with SAS EG, open the traditional REGDIAG macro-call file included in \dmsas2e\maccal\nodisplay\ into the SAS editor. Read the instructions given in Appendix 3 regarding using the traditional macro-call files in the SAS EG/SAS Learning Edition (LE) code window.
Model selection: variable selection using the MAXR2 selection method: The REGDIAG2 macro fits all possible regression models using the MAXR2 selection method and outputs the best two models for each subset size (Table 5.1). Because 15 continuous predictor variables were used in the model selection, the full model had 15 predictors, and 15 subset sizes are possible. By comparing the R2, R2(adj), RMSE, C(p), and AIC values between the full model and all subsets, we can conclude that the six-variable subset model is superior to all other subsets.
The Mallows C(p) statistic measures the total squared error for a subset, which equals the total error variance plus the bias introduced by not including the important variables in the subset. The C(p) plot (Figure 5.5) shows the C(p) statistic against the number of predictor variables for the full model and the best two models for each subset size. Additionally, the RMSE statistic for the full model and the best two regression models in each subset size is shown in the same plot, with the diameter of each bubble proportional to the magnitude of the RMSE. Dropping any variable from the six-variable model is not recommended, because the C(p), RMSE, and AIC values then increase sharply. These results clearly indicate that the C(p), RMSE, and AIC statistics are better indicators for variable selection than R2 and R2(adj). Thus, the C(p) plot and the summary table of model selection statistics produced by the REGDIAG2 macro can be used effectively in selecting the best subset in regression models with many (5 to 25) predictor variables.
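The roles of C(p) and AIC in comparing subsets can be sketched with a small all-possible-subsets search. The snippet below is a minimal Python illustration on simulated data, not the REGDIAG2 macro or the CARS93 analysis; it uses the standard definitions C(p) = SSE_p/MSE_full - (n - 2p) and AIC = n*ln(SSE/n) + 2p, with p counting parameters including the intercept.

```python
# All-possible-subsets selection sketch: for each subset, fit OLS and report
# C(p) = SSE_p/MSE_full - (n - 2p) and AIC = n*ln(SSE/n) + 2p,
# where p counts parameters including the intercept. Toy data, not CARS93.
import itertools
import math
import numpy as np

rng = np.random.default_rng(0)
n = 40
X = rng.normal(size=(n, 3))                      # three candidate predictors
y = 2.0 + 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

def fit_sse(cols):
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

full_cols = (0, 1, 2)
mse_full = fit_sse(full_cols) / (n - len(full_cols) - 1)

results = []
for k in (1, 2, 3):
    for cols in itertools.combinations(range(3), k):
        sse = fit_sse(cols)
        p = k + 1                                # parameters incl. intercept
        cp = sse / mse_full - (n - 2 * p)
        aic = n * math.log(sse / n) + 2 * p
        results.append((cols, round(cp, 2), round(aic, 2)))

best = min(results, key=lambda r: r[2])          # smallest AIC
# The selected subset contains the truly active predictors 0 and 1.
print(best)
```

For the full model C(p) equals p exactly, so subsets with C(p) near p and small AIC are the good candidates; this mirrors how the six-variable model stands out in Table 5.1.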
LASSO, the model selection method implemented in the newer SAS procedure GLMSELECT, is also utilized in the REGDIAG2 macro for screening all listed predictor variables and for examining and visualizing the contribution of each predictor to the model selection. Two informative diagnostic plots (Figures 5.6 and 5.7) generated by the ODS graphics feature in GLMSELECT can be used to visualize the importance of the predictor variables. The fit criteria plot (Figure 5.6) displays the trend plots of six model selection criteria versus the number of model parameters, and
Table 5.1 Macro REGDIAG2—Best Two Subsets in All Possible MAXR2 Selection Method

Number in Model | R-Square | Adjusted R-Square | C(p) | AIC | Root MSE | SBC | Variables in Model
 1 | 0.6670 | 0.6627 | 48.5856 | 266.1241 | 5.10669 | 270.91297 | X4
 1 | 0.6193 | 0.6145 | 66.5603 | 276.9593 | 5.45992 | 281.74819 | X15
 2 | 0.7006 | 0.6929 | 37.8970 | 259.4966 | 4.87278 | 266.67998 | X4 X7
 2 | 0.6996 | 0.6919 | 38.2671 | 259.7618 | 4.88076 | 266.94510 | X1 X4
 3 | 0.7364 | 0.7261 | 26.4105 | 251.1920 | 4.60207 | 260.76981 | X4 X11 X15
 3 | 0.7276 | 0.7169 | 29.7340 | 253.8557 | 4.67837 | 263.43350 | X4 X10 X11
 4 | 0.7710 | 0.7589 | 15.3699 | 241.8019 | 4.31775 | 253.77414 | X1 X2 X11 X15
 4 | 0.7666 | 0.7544 | 16.9950 | 243.3118 | 4.35818 | 255.28403 | X1 X4 X11 X15
 5 | 0.7960 | 0.7824 |  7.9336 | 234.4305 | 4.10214 | 248.79718 | X1 X2 X4 X7 X11
 5 | 0.7943 | 0.7805 |  8.5844 | 235.1128 | 4.11945 | 249.47949 | X1 X2 X7 X11 X15
 6 | 0.8162 | 0.8013 |  2.2959 | 227.9613 | 3.91941 | 244.72243 | X1 X2 X4 X7 X10 X11
 6 | 0.8079 | 0.7924 |  5.4282 | 231.5424 | 4.00701 | 248.30352 | X1 X2 X4 X7 X11 X15
 7 | 0.8188 | 0.8015 |  3.3136 | 228.8049 | 3.91809 | 247.96048 | X1 X2 X4 X6 X7 X10 X11
 7 | 0.8185 | 0.8011 |  3.4231 | 228.9346 | 3.92123 | 248.09024 | X1 X2 X4 X7 X10 X11 X15
 8 | 0.8245 | 0.8050 |  3.1778 | 228.2320 | 3.88305 | 249.78208 | X1 X2 X4 X6 X7 X10 X11 X15
 8 | 0.8208 | 0.8009 |  4.5708 | 229.9193 | 3.92370 | 251.46934 | X1 X2 X4 X7 X10 X11 X12 X15
 9 | 0.8259 | 0.8038 |  4.6640 | 229.6007 | 3.89509 | 253.54524 | X1 X2 X4 X6 X7 X10 X11 X12 X15
 9 | 0.8248 | 0.8026 |  5.0546 | 230.0812 | 3.90666 | 254.02565 | X1 X2 X4 X6 X7 X8 X10 X11 X15
10 | 0.8261 | 0.8013 |  6.5653 | 231.4789 | 3.91986 | 257.81784 | X1 X2 X4 X6 X7 X9 X10 X11 X12 X15
10 | 0.8261 | 0.8013 |  6.5721 | 231.4873 | 3.92007 | 257.82622 | X1 X2 X4 X6 X7 X8 X10 X11 X12 X15
11 | 0.8266 | 0.7989 |  8.4032 | 233.2784 | 3.94328 | 262.01177 | X1 X2 X4 X6 X7 X8 X10 X11 X12 X13 X15
11 | 0.8265 | 0.7988 |  8.4254 | 233.3058 | 3.94395 | 262.03921 | X1 X2 X4 X6 X7 X8 X9 X10 X11 X12 X15
12 | 0.8270 | 0.7964 | 10.2462 | 235.0837 | 3.96740 | 266.21150 | X1 X2 X4 X6 X7 X8 X9 X10 X11 X12 X13 X15
12 | 0.8268 | 0.7963 | 10.3043 | 235.1558 | 3.96917 | 266.28364 | X1 X2 X4 X6 X7 X8 X10 X11 X12 X13 X14 X15
13 | 0.8273 | 0.7938 | 12.1050 | 236.9082 | 3.99257 | 270.43044 | X1 X2 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X15
13 | 0.8272 | 0.7937 | 12.1621 | 236.9792 | 3.99432 | 270.50152 | X1 X2 X4 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15
14 | 0.8276 | 0.7911 | 14.0000 | 238.7775 | 4.01946 | 274.69420 | X1 X2 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15
14 | 0.8273 | 0.7907 | 14.1049 | 238.9081 | 4.02270 | 274.82480 | X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X15
15 | 0.8276 | 0.7878 | 16.0000 | 240.7775 | 4.05026 | 279.08865 | X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15
Figure 5.5 Model selection using SAS macro REGDIAG2: CP plot for selecting the best subset model. (The plot shows the C(p)/P-ratio against the number of predictor variables, 1 to 15; the area of each bubble is proportional to the RMSE, which ranges from 3.883 for the best eight-variable subset to 5.107 for the one-variable model.)
in this example, all six criteria identify the 13-parameter model as the best model. However, beyond six variables, no substantial gain was noted. The coefficient progression plot displayed in Figure 5.7 shows the stability of the standardized regression coefficients as new variables are added in each model-selection step. The problem of multicollinearity among the predictor variables was not evident, since all the standardized regression coefficients have values less than ±1. The following six variables, X4, X7, X1, X2, X13, and X11, were identified as the most contributing variables in the model-selection sequence. Although X15 was included in the second step, it was later excluded from the model. Thus, these features enable analysts to identify the most contributing variables and help them perform further investigations.
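The LASSO shrinkage behind the coefficient progression plot can be sketched with a small coordinate-descent solver. This is a minimal Python illustration on simulated data, not PROC GLMSELECT (which uses LARS-based steps); the data, the penalty values, and the lasso_cd helper are all constructed for the example.

```python
# Minimal coordinate-descent LASSO sketch showing how the penalty shrinks
# standardized coefficients and drops weak predictors as the penalty lam grows.
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/(2n))*||y - Xb||^2 + lam*||b||_1,
    assuming the columns of X are standardized and y is centered."""
    n, p = X.shape
    b = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]       # partial residual without x_j
            rho = (X[:, j] @ r) / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[j]
    return b

rng = np.random.default_rng(1)
n = 100
X = rng.normal(size=(n, 4))
X = (X - X.mean(0)) / X.std(0)                   # standardize predictors
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)
y = y - y.mean()

paths = {lam: lasso_cd(X, y, lam) for lam in (2.0, 0.5, 0.01)}
for lam, b in paths.items():                     # decreasing penalty
    print(lam, np.round(b, 2))
```

As the penalty shrinks toward zero, the surviving coefficients approach their least-squares values while irrelevant predictors stay at zero; tracing the coefficients over a grid of penalties reproduces the kind of progression that Figure 5.7 displays step by step.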
Because this model-selection step includes only the linear effects of the variables, it is recommended that it be used as a preliminary model-selection step rather than as the final concluding step. Furthermore, the REGDIAG2 macro also has a feature for selecting the best-candidate models using AICC and SBC (Tables 5.2 and 5.3). Next we will examine the predictor variables selected in the best-candidate models.
Figure 5.6 Model selection using SAS macro REGDIAG2: Fit criteria plots derived from using the ODS graphics feature in the GLMSELECT procedure. (The panels plot AICC, AIC, Adj R-Sq, SBC, C(p), and BIC against the selection step, 0 to 15, marking the best criterion value in each panel and the step selected by SBC.)
Both the minimum AICC and the minimum SBC criteria identified the same six-variable model (X1, X2, X4, X7, X11, and X10) as the best model. The first five variables were also selected as the best contributing variables by the LASSO method (Figure 5.6), and the C(p) method picked the same six variables as the best model (Table 5.1). The ΔSBC criterion is very conservative and picked only one model as the best candidate, whereas the ΔAICC method identified five models as the best candidates. The standardized regression coefficients of the best-candidate models' predictors were very stable, indicating that the impact of multicollinearity is minimal. Based on this preliminary model-selection step, the variables X1, X2, X4, X7, X11, and X10 were identified as the best linear predictors, and we can proceed with the second step of the analysis.
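The ΔAICC screening and the Akaike weights reported in Table 5.2 follow directly from the AICC values: Δi = AICC_i - min(AICC), and the weight of model i is w_i = exp(-Δi/2) / Σ_j exp(-Δj/2). The sketch below recomputes them in Python from the AICC column of Table 5.2; the model labels are inferred from the nonzero coefficients in that table, and the ratio column W_AICCR is exp(-Δi/2) relative to the best model.

```python
# Delta-AICC screening sketch: compute delta values and Akaike weights
# from the AICC values reported in Table 5.2 (n = 92 CARS93 observations).
import math

models = {                                       # model label -> (AICC, parameters)
    "X1 X2 X4 X7 X10 X11":        (229.279, 7),
    "X1 X2 X4 X6 X7 X10 X11 X15": (230.401, 9),
    "X1 X2 X4 X6 X7 X10 X11":     (230.519, 8),
    "X1 X2 X4 X7 X10 X11 X15":    (230.649, 8),
    "X1 X2 X4 X7 X10 X11 X12":    (231.217, 8),
}

best_aicc = min(a for a, _ in models.values())
delta = {m: a - best_aicc for m, (a, _) in models.items()}
ratio = {m: math.exp(-d / 2) for m, d in delta.items()}     # W_AICCR
total = sum(ratio.values())
weights = {m: ratio[m] / total for m in models}             # W_AICC

for m in models:
    print(f"{m:28s} delta={delta[m]:.3f} weight={weights[m]:.3f}")
```

The recomputed values match the W_AICC and W_AICCR columns of Table 5.2 (0.33421 and 1.00000 for the best six-variable model), confirming that those columns are the Akaike weight and the evidence ratio against the best model.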
5.12.1.2 Step 2: Graphical Exploratory Analysis and Regression Diagnostic Plots
Open the REGDIAG2.SAS macro-call file in the SAS EDITOR window and click RUN to open the REGDIAG2 macro-call window (Figure 5.8). Input the appropriate macro-input values by following the suggestions given in the help file
Figure 5.7 Model selection using SAS macro REGDIAG2: Standardized regression coefficient and SBC progression plots by model selection steps derived from using the ODS graphics feature in GLMSELECT procedure. (The upper panel traces the standardized coefficients, roughly -0.50 to 0.50, against the effect sequence, which begins Intercept, 1: +X4, 2: +X15, 3: +X7, 4: +X1, 5: +X2, 6: +X13, 7: -X15, 8: +X11, …; the lower panel tracks SBC, roughly 360 down to 260, with the SBC-selected step marked.)
(Appendix 2). Leave the group variable option blank, because all the predictors used are continuous. Input YES in macro field #14 to request additional regression diagnostic plots using the predictor variables selected in step 1.
The three model selection plots—the CP plot (Figure 5.9), the fit criteria plot (Figure 5.10), and the coefficient progression plot (Figure 5.11)—generated for the predictor variables selected in step 1 (six variables: X1, X2, X4, X7, X11, and X10) further confirmed that these are the best linear predictors under all model selection criteria. Thus, in the second step, data exploration and diagnostic plot analysis were carried out using these six predictor variables.
Simple linear regression and augmented partial residual (APR) plots for all six
predictor variables are presented in Figure 5.12. The linear/quadratic regression
parameter estimates for the simple and multiple linear regressions and their significance levels are also displayed in the titles of the APR plots. The simple linear
regression line describes the relationship between the response and a given predictor variable in a simple linear regression. The APR line shows the quadratic
regression effect of the ith predictor on the response variable after accounting for
Table 5.2 Macro REGDIAG2—Standardized Regression Coefficient Estimates and the Several Model Selection Criteria for the Best-Candidate Models in All Possible MAXR2 Selection Methods Using the Selection Criterion Delta AICC < 2

Standardized regression coefficient estimates (dependent variable Y2; the predictors X3, X5, X8, X9, X13, and X14 did not enter any best-candidate model):

Model | Intercept | X1      | X2      | X4      | X6      | X7      | X10     | X11     | X12    | X15     | Parameters in Model
1     | 19.3219   | 2.44057 | 3.14933 | 3.53493 | .       | 3.31439 | 2.80894 | -6.1088 | .      | .       | 7
2     | 19.4307   | 2.21502 | 3.31102 | 2.66918 | 1.27614 | 2.33768 | 2.22848 | -5.9390 | .      | 3.10061 | 9
3     | 19.3455   | 2.38315 | 3.37800 | 3.52546 | 0.77161 | 3.25036 | 2.98053 | -5.7254 | .      | .       | 8
4     | 19.3628   | 2.36426 | 3.02281 | 3.03817 | .       | 2.80552 | 2.30391 | -6.3801 | .      | 1.81189 | 8
5     | 19.3221   | 2.49279 | 3.10444 | 3.56846 | .       | 3.32973 | 2.90899 | -5.7885 | -0.523 | .       | 8

Model selection criteria:

Model | Adjusted R-Squared | SBC     | AICC    | DELTA_AICC | DELTA_SBC | W_AICC  | W_AICCR
1     | 0.80133            | 244.722 | 229.279 | 0.00000    | 0.00000   | 0.33421 | 1.00000
2     | 0.80500            | 249.782 | 230.401 | 1.12177    | 5.05964   | 0.19074 | 0.57070
3     | 0.80147            | 247.960 | 230.519 | 1.24024    | 3.23805   | 0.17977 | 0.53788
4     | 0.80115            | 248.090 | 230.649 | 1.37000    | 3.36781   | 0.16847 | 0.50409
5     | 0.79975            | 248.658 | 231.217 | 1.93826    | 3.93607   | 0.12681 | 0.37941

Key to predictors: X1 = Air bags (0 = none, 1 = driver only, 2 = driver and passenger); X2 = Number of cylinders; X4 = HP (maximum); X6 = Engine revolutions per mile (in highest gear); X7 = Fuel tank capacity (gallons); X10 = Wheelbase (inches); X11 = Car width (inches); X12 = U-turn space (feet); X15 = Weight (pounds).
Table 5.3 Macro REGDIAG2—Standardized Regression Coefficient Estimates and the Several Model-Selection Criteria for the Best-Candidate Models in All Possible MAXR2 Selection Methods Using the Selection Criterion Delta SBC < 2

Only one best-candidate model was selected (dependent variable Y2; the remaining predictors did not enter the model):

Intercept | X1      | X2      | X4      | X7      | X10     | X11     | Parameters in Model
19.3219   | 2.44057 | 3.14933 | 3.53493 | 3.31439 | 2.80894 | -6.1088 | 7

Adjusted R-Squared | SBC     | AICC    | DELTA_AICC | DELTA_SBC | W_SBC | W_SBCR
0.80133            | 244.722 | 229.279 | 0          | 0         | 1     | 1
Figure 5.8 Screen copy of REGDIAG2 macro-call window showing the macro-call parameters required for performing regression diagnostic plots in MLR.
the linear effects of the other predictors on the response. The APR plot is very effective in detecting significant outliers and nonlinear relationships. Significant outliers and/or influential observations are identified and marked on the APR plot if the absolute STUDENT value exceeds 2.5 or the DFFITS statistic exceeds 1.5. These influence statistics are derived from the MLR model involving all predictor variables. If the correlations among the predictor variables are negligible, the simple and the partial regression lines should have similar slopes.
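The outlier-flagging rule quoted above (|STUDENT| > 2.5 or |DFFITS| > 1.5) can be sketched as follows. This is a minimal Python illustration on simulated data with one planted outlier, not the REGDIAG2 implementation; the influence_flags helper and the cutoffs passed to it mirror the rule in the text.

```python
# Influence-diagnostic sketch: flag observations whose studentized residual
# exceeds 2.5 in absolute value or whose |DFFITS| exceeds 1.5. Toy data.
import numpy as np

def influence_flags(X, y, r_cut=2.5, d_cut=1.5):
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    H = A @ np.linalg.inv(A.T @ A) @ A.T         # hat matrix
    h = np.diag(H)
    e = y - H @ y                                # ordinary residuals
    dof = n - p - 1
    s2 = (e @ e) / dof
    student = e / np.sqrt(s2 * (1 - h))          # internally studentized
    # externally studentized (delete-one) residuals, used for DFFITS
    s2_i = (e @ e - e ** 2 / (1 - h)) / (dof - 1)
    t = e / np.sqrt(s2_i * (1 - h))
    dffits = t * np.sqrt(h / (1 - h))
    return np.where((np.abs(student) > r_cut) | (np.abs(dffits) > d_cut))[0]

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 2))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(scale=0.3, size=60)
y[5] += 5.0                                      # plant one gross outlier

flags = influence_flags(X, y)
print(flags)                                     # indices of flagged observations
```

The planted observation is flagged because its residual dwarfs the error scale, which is exactly how the macro marks suspect points on the APR plots.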
The APR plots for the six predictor variables showed significant linear relationships between the six predictors and the median price. A big difference in the magnitude of the partial (adjusted) and the simple (unadjusted) regression effects on the median price was clearly evident for all six predictors (Figure 5.12). The quadratic effects of all six predictor variables on the median price were not significant at the 5% level. Five significant outliers were also detected in these APR plots.
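An augmented partial residual can be computed by refitting the MLR with a quadratic term added for the predictor of interest and then adding that predictor's linear and quadratic contributions back to the residuals (Mallows' augmented partial residual idea). The sketch below is a Python illustration on simulated data, not the macro's SAS code; the data and the helper name are constructed for the example.

```python
# Augmented partial residual (APR) sketch: refit the MLR with a quadratic term
# for predictor j added, then form e + b_j*x_j + b_jj*x_j^2. Plotting this
# against x_j reveals curvature in the adjusted relationship. Toy data.
import numpy as np

def augmented_partial_residual(X, y, j):
    n = X.shape[0]
    xj = X[:, j]
    A = np.column_stack([np.ones(n), X, xj ** 2])  # full model + x_j squared
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    b_j, b_jj = beta[1 + j], beta[-1]
    return resid + b_j * xj + b_jj * xj ** 2

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 3))
y = 1.0 + 2.0 * X[:, 0] + 0.8 * X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.2, size=80)

apr = augmented_partial_residual(X, y, 0)
# A quadratic fit of apr on x_0 should recover roughly the true curvature 0.8
# and slope 2.0, which is what the APR plot displays graphically.
coef = np.polyfit(X[:, 0], apr, 2)
print(np.round(coef, 2))
```

When the true relationship is linear, the quadratic coefficient recovered this way is near zero, matching the nonsignificant quadratic effects reported for the six CARS93 predictors.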
Partial leverage (PL) plots for all six predictor variables are presented in Figure 5.13. The PL display shows three curves: (a) the horizontal reference line that goes through the response variable mean, (b) the partial regression line, which quantifies the slope of the partial regression coefficient of the ith variable in the MLR, and (c) the 95% confidence band for the partial regression line. The partial regression parameter estimates for the ith variable in the multiple linear regression