Figure B.2: SAS Enterprise Miner Diagram
Tutorial B: Developing an Application Scorecard Model in SAS Enterprise Miner 141
Analysts can typically spend about 70% of their time preparing the data for a data mining process. If this data
preparation task can be centralized within the IT department, then all analysts are able to get a consistent and
clean view of the data that will then allow them to spend the majority of their time building accurate models.
B.1.3 Step 3 – Visualize the Data
Select the KGB data source node in the diagram (Figure B.3). In the Property Panel, under Train – Columns, click the ellipsis (…) button next to Variables.
Figure B.3: KGB Data Node
As illustrated in Figure B.4, select AGE, CAR, CARDS, GB, INCOME and RESID, and click Explore…
Figure B.4: Select Variables
142 Developing Credit Risk Models Using SAS Enterprise Miner and SAS/STAT
Figure B.5: Display Variable Interactions
Analysts like to get a full understanding of their data, and interactive data visualization is a quick way to do so. If you click the Without Vehicle bar in the CAR graphic (Figure B.5), the customers who do not have a car are selected in the other graphics as well. To filter further, right-click a graph, select Data Options, and use the Where tab to query the displayed data.
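For readers who prefer code, the same click-to-select and Where-tab filtering can be sketched with pandas. This is an illustrative stand-in, not the Enterprise Miner implementation; the sample rows and the "Without Vehicle" category label are assumptions, while the column names (AGE, CAR, INCOME, GB) come from the tutorial.

```python
import pandas as pd

# Hypothetical sample resembling the KGB data set (column names from the
# tutorial; the row values and category labels are made up for illustration).
kgb = pd.DataFrame({
    "AGE":    [25, 40, 33, 51],
    "CAR":    ["Without Vehicle", "Car", "Without Vehicle", "Car"],
    "INCOME": [1800, 3200, 2100, 4000],
    "GB":     [1, 0, 0, 0],   # 1 = bad (event), 0 = good
})

# Equivalent of clicking the "Without Vehicle" bar: select those customers,
# then inspect the same customers under another variable (here, AGE).
no_car = kgb[kgb["CAR"] == "Without Vehicle"]
print(no_car["AGE"].describe())

# Equivalent of the Where tab: an explicit query on the displayed data.
subset = kgb.query("CAR == 'Without Vehicle' and INCOME > 2000")
print(len(subset))
```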
By using visual interactive data exploration, analysts can quickly assess the quality of the data and spot any initial patterns. These findings can then drive the rest of the data mining project, in particular how the data is modified before the modeling process begins.
B.1.4 Step 4 – Partition the Data
Figure B.6: Property Panel for the Data Partition Node
In the process of developing a scorecard, you perform predictive modeling. Thus, it is advisable to partition
your data set into training and validation samples. If the total sample size permits, having a separate test sample
permits a more robust evaluation of the resulting scorecard. The Data Partition node, the property panel for
which is shown above in Figure B.6, is used to partition the KGB data set.
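Conceptually, the Data Partition node shuffles the data and splits it by the percentages set in its properties. A minimal sketch of a 70/20/10 train/validation/test split (the proportions here are assumptions; in the node they are properties you set):

```python
import pandas as pd
import numpy as np

# Stand-in for the KGB data set: 1,000 rows with a binary target.
rng = np.random.default_rng(12345)
kgb = pd.DataFrame({"GB": rng.integers(0, 2, size=1000)})

# Shuffle once, then slice into train/validation/test partitions.
shuffled = kgb.sample(frac=1.0, random_state=42).reset_index(drop=True)
n = len(shuffled)
train = shuffled.iloc[: int(0.7 * n)]
valid = shuffled.iloc[int(0.7 * n): int(0.9 * n)]
test  = shuffled.iloc[int(0.9 * n):]
print(len(train), len(valid), len(test))  # 700 200 100
```

Note that the Data Partition node can also stratify on the target so that the event rate is preserved in each partition; a plain shuffle, as above, does not guarantee that.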
B.1.5 Step 5 – Perform Screening and Grouping with Interactive Grouping
To perform univariate characteristic screening and grouping, an Interactive Grouping node (Figure B.7) is
used in the process flow diagram.
Figure B.7: Interactive Grouping Node
The Interactive Grouping node can automatically group the characteristics for you; however, it also enables you to perform the grouping interactively through the node's Train properties on the Property Panel.
Performing interactive grouping is important because the results of the grouping affect the predictive power of
the characteristics, and the results of the screening often indicate the need for regrouping. Thus, the process of
grouping and screening is iterative, rather than a sequential set of discrete steps.
Grouping refers to the process of purposefully censoring your data. Grouping offers the following advantages:

- It offers an easier way to deal with rare classes and outliers in interval variables.
- It makes relationships easy to understand, and therefore yields far more knowledge of the portfolio.
- Nonlinear dependencies can be modeled with linear models.
- It gives the user control over the development process: by shaping the groups, you shape the final composition of the scorecard.
- The process of grouping characteristics enables the user to develop insights into the behavior of risk predictors and to increase knowledge of the portfolio, which can help in developing better strategies for portfolio management.
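The key quantity the Interactive Grouping node computes for each group is its Weight of Evidence (WOE), the log ratio of the group's share of goods to its share of bads. A minimal sketch (the age groups and target values below are made-up illustration data):

```python
import numpy as np
import pandas as pd

# Illustrative WOE calculation for a grouped characteristic:
#   WOE_i = ln( (goods_i / total_goods) / (bads_i / total_bads) )
df = pd.DataFrame({
    "AGE_GROUP": ["<30", "<30", "<30", "30-45", "30-45", "30-45",
                  "45+", "45+", "45+", "45+"],
    "GB":        [1, 0, 1, 0, 0, 1, 0, 0, 0, 1],  # 1 = bad (event), 0 = good
})

grp = df.groupby("AGE_GROUP")["GB"].agg(bads="sum", total="count")
grp["goods"] = grp["total"] - grp["bads"]
grp["dist_good"] = grp["goods"] / grp["goods"].sum()
grp["dist_bad"]  = grp["bads"] / grp["bads"].sum()
grp["woe"] = np.log(grp["dist_good"] / grp["dist_bad"])
print(grp[["goods", "bads", "woe"]])
```

A positive WOE means the group holds proportionally more goods than bads (lower risk); a negative WOE means the reverse. Regrouping to get a sensible, monotone WOE pattern is exactly the iterative work the text describes.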
B.1.6 Step 6 – Create a Scorecard and Fit a Logistic Regression Model
The Scorecard node (Figure B.8) fits a logistic regression model and computes the scorecard points for each
attribute. With the SAS EM Scorecard node, you can use either the Weight of Evidence (WOE) variables or the group variables that are exported by the Interactive Grouping node as inputs for the logistic regression model.
Figure B.8: Scorecard Node
The Scorecard node provides four methods of model selection and seven selection criteria for the logistic
regression model. The scorecard points of each attribute are based on the coefficients of the logistic regression
model. The Scorecard node also enables you to manually assign scorecard points to attributes. The scaling of the scorecard points is controlled by three scaling options within the properties of the Scorecard node.
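The conversion from regression coefficients to points can be sketched with the commonly used odds-based scaling (a target score at target odds, plus points to double the odds, PDO). The scaling values and coefficients below are hypothetical; the exact option names and defaults are set in the Scorecard node's properties.

```python
import math

# Common scorecard scaling: 600 points at 30:1 odds, 20 points to double
# the odds (PDO). All numeric values here are hypothetical illustrations.
pdo, target_score, target_odds = 20, 600, 30
factor = pdo / math.log(2)
offset = target_score - factor * math.log(target_odds)

intercept = -2.3          # logistic model intercept (hypothetical)
coef, woe = 0.85, 0.6931  # one characteristic's coefficient and one
                          # attribute's WOE value (hypothetical)
n_chars = 5               # number of characteristics in the scorecard

# Points for this attribute: the intercept and the offset are spread
# evenly across the characteristics, and the sign is flipped so that
# higher scores mean lower risk.
points = -(coef * woe + intercept / n_chars) * factor + offset / n_chars
print(round(points))
```

Summing the attribute points a given applicant falls into across all characteristics yields that applicant's total score.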
B.1.7 Step 7 – Create a Rejected Data Source
The REJECTS data set contains records that represent previous applicants who were denied credit. The
REJECTS data set does not have a target variable.
The Reject Inference node automatically creates the target variable for the REJECTS data when it creates the
augmented data set. The REJECTS data set must include the same characteristics as the KGB data. A role of
SCORE is assigned to the REJECTS data source.
B.1.8 Step 8 – Perform Reject Inference and Create an Augmented Data Set
Credit scoring models are built with a fundamental bias (selection bias). The sample data that is used to develop
a credit scoring model is structurally different from the "through-the-door" population to which the credit
scoring model is applied. The non-event or event target variable that is created for the credit scoring model is
based on the records of applicants who were all accepted for credit. However, the population to which the credit
scoring model is applied includes applicants who would have been rejected under the scoring rules that were
used to generate the initial model. One remedy for this selection bias is to use reject inference. The reject
inference approach uses the model that was trained using the accepted applications to score the rejected
applications. The observations in the rejected data set are classified as inferred non-event and inferred event.
The inferred observations are then added to the KGB data set to form an augmented data set.
This augmented data set, which represents the "through-the-door" population, serves as the training data set for
a second scorecard model.
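One simple variant of this idea, a hard-cutoff reject inference, can be sketched as follows: score the REJECTS records with the accepted-only model, label each record as inferred event or non-event by a cutoff on the predicted probability, and append the labeled rejects to the KGB data. The probabilities and cutoff below are hypothetical stand-ins for model output; the Reject Inference node also supports other inference methods.

```python
import pandas as pd

# Hypothetical accepted (KGB) records with observed outcomes, and rejected
# records carrying made-up model scores p_bad from the first scorecard.
kgb = pd.DataFrame({"GB": [0, 0, 1, 0], "INFERRED": False})
rejects = pd.DataFrame({"p_bad": [0.10, 0.62, 0.45, 0.80]})

# Hard cutoff: classify each reject as inferred event (1) or non-event (0).
cutoff = 0.5
rejects["GB"] = (rejects["p_bad"] >= cutoff).astype(int)
rejects["INFERRED"] = True

# Augmented data set: accepted records plus inferred rejects, approximating
# the "through-the-door" population for the second scorecard model.
augmented = pd.concat([kgb, rejects.drop(columns="p_bad")],
                      ignore_index=True)
print(augmented["GB"].value_counts())
```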