5.12 Case Study 1: Modeling Multiple Linear Regressions

174  ◾  Statistical Data Mining Using SAS Application











5. Save the _score_ and regest datasets for future use: These two datasets are created and saved as temporary SAS datasets in the WORK folder and also exported to Excel worksheets saved in the user-specified output folder. The _score_ dataset contains the observed variables; the predicted scores, including predictions for observations with missing response values; the residuals; and the confidence-interval estimates. This dataset can serve as the base for developing scorecards for each observation. The second SAS dataset, regest, contains the parameter estimates, which can be used in the RSCORE macro for scoring different datasets containing the same variables.

6. If–then analysis and lift charts: Perform an if–then analysis and construct a lift chart to estimate the differences in the predicted response when one of the continuous predictor variables is fixed at a given value.
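The scoring idea behind the saved regest dataset is simple: a table of parameter estimates is all that is needed to predict new observations carrying the same variables. The RSCORE macro does this inside SAS; purely as an illustration of the arithmetic, a Python sketch follows (the coefficient values below are made up for the example and are not the CARS93 estimates):

```python
# Hypothetical saved parameter estimates, standing in for the regest dataset.
regest = {"Intercept": 19.3, "X4": 0.06, "X7": 0.5}

def score(row, estimates):
    """Predicted response = intercept + sum of coefficient * predictor value."""
    yhat = estimates["Intercept"]
    for name, coef in estimates.items():
        if name != "Intercept":
            yhat += coef * row[name]
    return yhat

# Score a new observation that carries the same predictor variables.
new_car = {"X4": 140, "X7": 16.0}
print(round(score(new_car, regest), 2))
```

The same loop applied to every row of a new dataset reproduces the batch-scoring behavior described above.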



Multiple Linear Regression Analysis of 1993 Car Attribute Data

Data name: SAS dataset CARS93

Multiattributes:
  Y2: Midprice
  X1: Air bags (0 = none, 1 = driver only, 2 = driver and passenger)
  X2: Number of cylinders
  X3: Engine size (liters)
  X4: HP (maximum)
  X5: RPM (revs per minute at maximum HP)
  X6: Engine revolutions per mile (in highest gear)
  X7: Fuel tank capacity (gallons)
  X8: Passenger capacity (persons)
  X9: Car length (inches)
  X10: Wheelbase (inches)
  X11: Car width (inches)
  X12: U-turn space (feet)
  X13: Rear seat room (inches)
  X14: Luggage capacity (cubic feet)
  X15: Weight (pounds)

Number of observations: 92

CARS93 data source: Lock38



© 2010 by Taylor and Francis Group, LLC









5.12.1.1 Step 1: Preliminary Model Selection

Open the REGDIAG2.SAS macro-call file in the SAS EDITOR window and click RUN to open the REGDIAG2 macro-call window (Figure 5.4). Input the appropriate macro-input values by following the suggestions given in the help file (Appendix 2). Leave the group variable option blank, since all the predictors used are continuous. Leave macro field #14 BLANK to skip regression diagnostics and run MLR.

◾◾ Special note to SAS Enterprise Guide (EG) Code Window Users: Because the user-friendly SAS macro applications included in this book use SAS WINDOW/DISPLAY commands, and these commands are not compatible with SAS EG, open the traditional REGDIAG macro-call file included in \dmsas2e\maccal\nodisplay\ into the SAS editor. Read the instructions given in Appendix 3 regarding using the traditional macro-call files in the SAS EG/SAS Learning Edition (LE) code window.

Model selection—variable selection using the MAXR2 selection method: The REGDIAG2 macro fits all possible regression models using the MAXR2 selection method and outputs the best two models for each subset size (Table 5.1). Because 15 continuous predictor variables were used in model selection, the full model had 15 predictors, and 15 subset sizes are possible. By comparing the R2, R2(adj), RMSE, C(p), and AIC values between the full model and all subsets, we can conclude that the six-variable subset model is superior to all other subsets.

The Mallows C(p) statistic measures the total squared error for a subset, which equals the total error variance plus the bias introduced by not including important variables in the subset. The C(p) plot (Figure 5.5) shows the C(p) statistic against the number of predictor variables for the full model and the best two models in each subset size. Additionally, the RMSE statistic for the full model and the best two regression models in each subset size is shown in the C(p) plot: the diameter of each bubble is proportional to the magnitude of the RMSE. Consequently, dropping any variable from the six-variable model is not recommended, because the C(p), RMSE, and AIC values jump up sharply. These results clearly indicate that the C(p), RMSE, and AIC statistics are better indicators for variable selection than R2 and R2(adj). Thus, the C(p) plot and the summary table of model-selection statistics produced by the REGDIAG2 macro can be used effectively in selecting the best subset in regression models with many (5 to 25) predictor variables.
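The computations behind Table 5.1 are straightforward once the SSE of each candidate model is known: Mallows C(p) = SSE_p/MSE_full − (n − 2p), and AIC, in the form PROC REG reports, is n·ln(SSE/n) + 2p, where p counts parameters including the intercept. As a rough illustration only, here is a pure-Python sketch of all-possible-subsets selection on small synthetic data (not the CARS93 data); note the identity that the full model's C(p) always equals its own parameter count:

```python
import itertools, math

def ols_sse(X, y):
    """SSE of a least-squares fit (with intercept), via the normal equations."""
    cols = [[1.0] * len(y)] + [list(c) for c in X]
    k = len(cols)
    A = [[sum(cols[i][t] * cols[j][t] for t in range(len(y))) for j in range(k)]
         for i in range(k)]
    c = [sum(cols[i][t] * y[t] for t in range(len(y))) for i in range(k)]
    # Gaussian elimination with partial pivoting, then back-substitution.
    for i in range(k):
        piv = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[piv] = A[piv], A[i]
        c[i], c[piv] = c[piv], c[i]
        for r in range(i + 1, k):
            f = A[r][i] / A[i][i]
            for j in range(i, k):
                A[r][j] -= f * A[i][j]
            c[r] -= f * c[i]
    b = [0.0] * k
    for i in reversed(range(k)):
        b[i] = (c[i] - sum(A[i][j] * b[j] for j in range(i + 1, k))) / A[i][i]
    yhat = [sum(b[j] * cols[j][t] for j in range(k)) for t in range(len(y))]
    return sum((y[t] - yhat[t]) ** 2 for t in range(len(y)))

# Tiny synthetic data: y depends mostly on x1; x3 is pure noise.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 7]
x3 = [5, 3, 8, 1, 9, 2, 7, 4]
y = [3.0, 5.1, 6.9, 9.0, 11.1, 12.9, 15.0, 17.1]
n = len(y)
preds = {"x1": x1, "x2": x2, "x3": x3}
full_sse = ols_sse(list(preds.values()), y)
s2 = full_sse / (n - len(preds) - 1)          # full-model MSE
for size in (1, 2, 3):
    for names in itertools.combinations(preds, size):
        sse = ols_sse([preds[m] for m in names], y)
        p = size + 1                          # parameters incl. intercept
        cp = sse / s2 - (n - 2 * p)           # Mallows C(p)
        aic = n * math.log(sse / n) + 2 * p   # AIC as reported by PROC REG
        print(names, round(cp, 2), round(aic, 2))
```

Good subsets have C(p) close to p; subsets missing important variables show C(p) far above p, exactly the "jump" visible in Figure 5.5.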

LASSO, the new model-selection method implemented in the new SAS procedure GLMSELECT, is also utilized in the REGDIAG2 macro for screening all listed predictor variables and for examining and visualizing the contribution of each predictor in model selection. Two informative diagnostic plots (Figures 5.6 and 5.7) generated by the ODS graphics feature in GLMSELECT can be used to visualize the importance of the predictor variables. The fit criteria plot (Figure 5.6) displays trend plots of six model-selection criteria versus the number of model parameters, and






Table 5.1  Macro REGDIAG2—Best Two Subsets in All Possible MAXR2 Selection Method

Number in Model  R-Square  Adjusted R-Square  C(p)     AIC       Root MSE  SBC        Variables in Model
 1               0.6670    0.6627             48.5856  266.1241  5.10669   270.91297  X4
 1               0.6193    0.6145             66.5603  276.9593  5.45992   281.74819  X15
 2               0.7006    0.6929             37.8970  259.4966  4.87278   266.67998  X4 X7
 2               0.6996    0.6919             38.2671  259.7618  4.88076   266.94510  X1 X4
 3               0.7364    0.7261             26.4105  251.1920  4.60207   260.76981  X4 X11 X15
 3               0.7276    0.7169             29.7340  253.8557  4.67837   263.43350  X4 X10 X11
 4               0.7710    0.7589             15.3699  241.8019  4.31775   253.77414  X1 X2 X11 X15
 4               0.7666    0.7544             16.9950  243.3118  4.35818   255.28403  X1 X4 X11 X15
 5               0.7960    0.7824              7.9336  234.4305  4.10214   248.79718  X1 X2 X4 X7 X11
 5               0.7943    0.7805              8.5844  235.1128  4.11945   249.47949  X1 X2 X7 X11 X15
 6               0.8162    0.8013              2.2959  227.9613  3.91941   244.72243  X1 X2 X4 X7 X10 X11
 6               0.8079    0.7924              5.4282  231.5424  4.00701   248.30352  X1 X2 X4 X7 X11 X15
 7               0.8188    0.8015              3.3136  228.8049  3.91809   247.96048  X1 X2 X4 X6 X7 X10 X11
 7               0.8185    0.8011              3.4231  228.9346  3.92123   248.09024  X1 X2 X4 X7 X10 X11 X15
 8               0.8245    0.8050              3.1778  228.2320  3.88305   249.78208  X1 X2 X4 X6 X7 X10 X11 X15
 8               0.8208    0.8009              4.5708  229.9193  3.92370   251.46934  X1 X2 X4 X7 X10 X11 X12 X15
 9               0.8259    0.8038              4.6640  229.6007  3.89509   253.54524  X1 X2 X4 X6 X7 X10 X11 X12 X15
 9               0.8248    0.8026              5.0546  230.0812  3.90666   254.02565  X1 X2 X4 X6 X7 X8 X10 X11 X15
10               0.8261    0.8013              6.5653  231.4789  3.91986   257.81784  X1 X2 X4 X6 X7 X9 X10 X11 X12 X15
10               0.8261    0.8013              6.5721  231.4873  3.92007   257.82622  X1 X2 X4 X6 X7 X8 X10 X11 X12 X15
11               0.8266    0.7989              8.4032  233.2784  3.94328   262.01177  X1 X2 X4 X6 X7 X8 X10 X11 X12 X13 X15
11               0.8265    0.7988              8.4254  233.3058  3.94395   262.03921  X1 X2 X4 X6 X7 X8 X9 X10 X11 X12 X15
12               0.8270    0.7964             10.2462  235.0837  3.96740   266.21150  X1 X2 X4 X6 X7 X8 X9 X10 X11 X12 X13 X15
12               0.8268    0.7963             10.3043  235.1558  3.96917   266.28364  X1 X2 X4 X6 X7 X8 X10 X11 X12 X13 X14 X15
13               0.8273    0.7938             12.1050  236.9082  3.99257   270.43044  X1 X2 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X15
13               0.8272    0.7937             12.1621  236.9792  3.99432   270.50152  X1 X2 X4 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15
14               0.8276    0.7911             14.0000  238.7775  4.01946   274.69420  X1 X2 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15
14               0.8273    0.7907             14.1049  238.9081  4.02270   274.82480  X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X15
15               0.8276    0.7878             16.0000  240.7775  4.05026   279.08865  X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15


[Figure 5.5  Model selection using SAS macro REGDIAG2: C(p) plot for selecting the best subset model. The plot shows the Cp/_P_ ratio (y-axis, 0–30) against the number of predictor variables (x-axis, 1–15) for the full model and the best two models in each subset size; the area of each bubble is proportional to that model's RMSE (ranging from 5.107 down to 3.883).]



in this example, all six criteria identify the 13-parameter model as the best model. However, beyond six variables, no substantial gain was noted. The coefficient progression plot displayed in Figure 5.7 shows the stability of the standardized regression coefficients as new variables are added at each model-selection step. The problem of multicollinearity among the predictor variables was not evident, since all the standardized regression coefficients have values less than ±1. The following six variables—X4, X7, X1, X2, X13, and X11—were identified as the most contributing variables in the model-selection sequence. Although X15 was included in the second step, it was later excluded from the model. Thus, these features enable analysts to identify the most contributing variables and help them perform further investigations.
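GLMSELECT implements LASSO via a least-angle-regression-based algorithm; purely as an illustration of why an L1 penalty shrinks some coefficients to exactly zero as the penalty grows (the behavior traced in the coefficient progression plot), here is a minimal coordinate-descent lasso sketch on synthetic standardized data — this is not the algorithm GLMSELECT itself uses:

```python
def standardize(col):
    """Center to mean 0 and scale so that (sum of squares) / n = 1."""
    n = len(col)
    mu = sum(col) / n
    c = [v - mu for v in col]
    s = (sum(v * v for v in c) / n) ** 0.5
    return [v / s for v in c]

def soft_threshold(z, g):
    """Soft-thresholding: the closed-form solution of the 1-D lasso problem."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0

def lasso_cd(X, y, lam, iters=100):
    """Coordinate-descent lasso for standardized columns and centered y."""
    n, k = len(y), len(X)
    b = [0.0] * k
    for _ in range(iters):
        for j in range(k):
            # Partial residual leaving feature j out.
            r = [y[i] - sum(b[m] * X[m][i] for m in range(k) if m != j)
                 for i in range(n)]
            z = sum(X[j][i] * r[i] for i in range(n)) / n
            b[j] = soft_threshold(z, lam)
    return b

x1 = standardize([1, 2, 3, 4, 5, 6])
x2 = standardize([3, 1, 4, 1, 5, 9])
y = [2 * v for v in x1]                 # response driven entirely by x1
print(lasso_cd([x1, x2], y, lam=0.3))   # moderate penalty: x2 is dropped to 0
```

As the penalty grows, coefficients of unimportant predictors hit exactly zero first — which is why the LASSO path is useful for ranking predictor importance.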

Because this model-selection step includes only the linear effects of the variables, it is recommended that it be used as a preliminary model-selection step rather than the final concluding step. Furthermore, the REGDIAG2 macro also has a feature for selecting the best-candidate models using AICC and SBC (Tables 5.2 and 5.3). Next we will examine the predictor variables selected in the best-candidate models.







[Figure 5.6  Model selection using SAS macro REGDIAG2: fit criteria plots derived from using the ODS graphics feature in the GLMSELECT procedure. Six panels plot AICC, AIC, Adj R-Sq, SBC, C(p), and BIC for Y2 against the selection step (0–15); the best criterion value and the step selected by SBC are marked.]



Both the minimum AICC and the minimum SBC criteria identified the same six-variable model (X1, X2, X4, X7, X11, and X10) as the best model. The first five variables were also selected as the best contributing variables by the LASSO method (Figure 5.6), and the C(p) method picked the same six variables as the best model (Table 5.1). The ΔSBC criterion is very conservative and picked only one model as the best candidate, whereas the ΔAICC method identified five models as the best candidates. The standardized regression coefficients of the best-candidate models' predictors were very stable, indicating that the impact of multicollinearity is minimal. Based on this preliminary model-selection step, the variables X1, X2, X4, X7, X11, and X10 were identified as the best linear predictors, and we can proceed with the second step of the analysis.
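The ΔAICC, W_AICC, and W_AICCR columns in Table 5.2 follow the standard Akaike-weight formulas: Δi = AICCi − min AICC, the relative likelihood exp(−Δi/2), and the weight wi = exp(−Δi/2) / Σ exp(−Δj/2). Plugging the five AICC values from Table 5.2 into these formulas reproduces the tabled weights:

```python
import math

# The five best-candidate AICC values from Table 5.2.
aicc = [229.279, 230.401, 230.519, 230.649, 231.217]
best = min(aicc)
delta = [a - best for a in aicc]            # DELTA_AICC
rel = [math.exp(-d / 2) for d in delta]     # relative likelihoods (W_AICCR)
total = sum(rel)
w = [r / total for r in rel]                # Akaike weights (W_AICC)
for d, r, wi in zip(delta, rel, w):
    print(round(d, 3), round(r, 5), round(wi, 5))
```

The best model carries about a third of the total Akaike weight (0.334), and every other candidate's relative likelihood stays above 0.37 — which is why the ΔAICC < 2 criterion retains all five models.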



5.12.1.2 Step 2: Graphical Exploratory Analysis and Regression Diagnostic Plots

Open the REGDIAG2.SAS macro-call file in the SAS EDITOR window and click RUN to open the REGDIAG2 macro-call window (Figure 5.8). Input the appropriate macro-input values by following the suggestions given in the help file







[Figure 5.7  Model selection using SAS macro REGDIAG2: standardized regression coefficient and SBC progression plots by model-selection step, derived from using the ODS graphics feature in the GLMSELECT procedure. The upper panel tracks the standardized coefficients for Y2 (roughly between −0.50 and 0.50; X1, X2, X6, X8, X9, X10, X11, and X12 are labeled) across the effect sequence (Intercept, 1+X4, 2+X15, 3+X7, 4+X1, 5+X2, 6+X13, 7−X15, 8+X11, 9+X15, 10+X10, 11+X6, 12+X12, 13+X8, 14+X14, 15+X9, 16+X5, 17+X3); the lower panel tracks SBC (260–360), with the selected step marked.]



(Appendix 2). Leave the group variable option blank, because all the predictors used are continuous. Input YES in macro field #14 to request additional regression diagnostic plots using the predictor variables selected in step 1.

The three model-selection plots—the C(p) plot (Figure 5.9), the fit criteria plot (Figure 5.10), and the coefficient progression plot (Figure 5.11)—on the six predictor variables selected in step 1 (X1, X2, X4, X7, X11, and X10) further confirmed that these are the best linear predictors under all model-selection criteria. Thus, in the second step, data exploration and diagnostic plot analysis were carried out using these six predictor variables.

Simple linear regression and augmented partial residual (APR) plots for all six predictor variables are presented in Figure 5.12. The linear/quadratic regression parameter estimates for the simple and multiple linear regressions and their significance levels are also displayed in the titles of the APR plots. The simple linear regression line describes the relationship between the response and a given predictor variable in a simple linear regression. The APR line shows the quadratic regression effect of the ith predictor on the response variable after accounting for




Table 5.2  Macro REGDIAG2—Standardized Regression Coefficient Estimates and the Several Model-Selection Criteria for the Best-Candidate Models in All Possible MAXR2 Selection Methods Using the Selection Criterion Delta AICC < 2

(Dependent variable: Y2. A "." indicates that the variable is not in the model; variable labels are as defined for the CARS93 data. X3, X5, X8, X9, X13, and X14 did not enter any of the best-candidate models.)

Model  Intercept  X1       X2       X4       X6       X7       X10      X11      X12     X15      Parameters in Model
1      19.3219    2.44057  3.14933  3.53493  .        3.31439  2.80894  −6.1088  .       .        7
2      19.4307    2.21502  3.31102  2.66918  1.27614  2.33768  2.22848  −5.9390  .       3.10061  9
3      19.3455    2.38315  3.37800  3.52546  0.77161  3.25036  2.98053  −5.7254  .       .        8
4      19.3628    2.36426  3.02281  3.03817  .        2.80552  2.30391  −6.3801  .       1.81189  8
5      19.3221    2.49279  3.10444  3.56846  .        3.32973  2.90899  −5.7885  −0.523  .        8

Model  Adjusted R-Squared  SBC      AICC     DELTA_AICC  DELTA_SBC  W_AICC   W_AICCR
1      0.80133             244.722  229.279  0.00000     0.00000    0.33421  1.00000
2      0.80500             249.782  230.401  1.12177     5.05964    0.19074  0.57070
3      0.80147             247.960  230.519  1.24024     3.23805    0.17977  0.53788
4      0.80115             248.090  230.649  1.37000     3.36781    0.16847  0.50409
5      0.79975             248.658  231.217  1.93826     3.93607    0.12681  0.37941



Table 5.3  Macro REGDIAG2—Standardized Regression Coefficient Estimates and the Several Model-Selection Criteria for the Best-Candidate Models in All Possible MAXR2 Selection Methods Using the Selection Criterion Delta SBC < 2

(Dependent variable: Y2. Only one model met the criterion; variables not listed did not enter the model.)

Model  Intercept  X1       X2       X4       X7       X10      X11      Parameters in Model
1      19.3219    2.44057  3.14933  3.53493  3.31439  2.80894  −6.1088  7

Model  Adjusted R-Squared  SBC      AICC     DELTA_AICC  DELTA_SBC  W_SBC  W_SBCR
1      0.80133             244.722  229.279  0           0          1      1




Figure 5.8  Screen copy of REGDIAG2 macro-call window showing the macro-call parameters required for performing regression diagnostic plots in MLR.



the linear effects of the other predictors on the response. The APR plot is very effective in detecting significant outliers and nonlinear relationships. Significant outliers and/or influential observations are identified and marked on the APR plot if the absolute STUDENT value exceeds 2.5 or the DFFITS statistic exceeds 1.5. These influential statistics are derived from the MLR model involving all predictor variables. If the correlations among the predictor variables are negligible, the simple and the partial regression lines should have similar slopes.

The APR plots for the six predictor variables showed significant linear relationships between the six predictors and median price. A big difference in the magnitude of the partial (adjusted) and the simple (unadjusted) regression effects on median price was clearly evident for all six predictors (Figure 5.12). The quadratic effects of all six predictor variables on median price were not significant at the 5% level. Five significant outliers were also detected in these APR plots.
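The STUDENT statistic behind that 2.5 cutoff is the internally studentized residual: the raw residual divided by s·sqrt(1 − h), where s is the root MSE and h is the observation's leverage. REGDIAG2 derives it from the full MLR model; for a simple linear regression the computation reduces to the following sketch (synthetic data, not the CARS93 data):

```python
def studentized_residuals(x, y):
    """Internally studentized residuals for a simple linear regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((v - xbar) ** 2 for v in x)
    b1 = sum((x[i] - xbar) * (y[i] - ybar) for i in range(n)) / sxx
    b0 = ybar - b1 * xbar
    resid = [y[i] - (b0 + b1 * x[i]) for i in range(n)]
    s = (sum(e * e for e in resid) / (n - 2)) ** 0.5       # root MSE
    hat = [1 / n + (x[i] - xbar) ** 2 / sxx for i in range(n)]  # leverages
    return [resid[i] / (s * (1 - hat[i]) ** 0.5) for i in range(n)]

x = list(range(1, 13))
# y roughly follows 2x, with one gross outlier at x = 10.
y = [2.0, 4.1, 5.9, 8.0, 10.1, 11.9, 14.0, 16.1, 17.9, 30.0, 22.1, 23.9]
r = studentized_residuals(x, y)
outliers = [i for i, ri in enumerate(r) if abs(ri) > 2.5]
print(outliers)  # flags only the observation at x = 10 (index 9)
```

Dividing by sqrt(1 − h) matters because high-leverage points pull the fitted line toward themselves, deflating their raw residuals.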

Partial leverage plots (PL) for all six predictor variables are presented in

Figure  5.13. The PL display shows three curves: (a) the horizontal reference line

that goes through the response variable mean, (b) the partial regression line, which

quantifies the slope of the partial regression coefficient of the ith variable in the

MLR, and (c) the 95% confidence band for partial regression line. The partial

regression parameter estimates for the ith variable in the multiple linear regression





