Figure B.2: SAS Enterprise Miner Diagram

Analysts typically spend about 70% of their time preparing data for a data mining process. If this data preparation task can be centralized within the IT department, all analysts get a consistent and clean view of the data, which allows them to spend the majority of their time building accurate models.

B.1.3 Step 3 – Visualize the Data

Select the KGB data source node in the diagram (Figure B.3). In the Property Panel, under the Train – Columns group, click the ellipsis button next to Variables.

Figure B.3: KGB Data Node



As illustrated in Figure B.4, select AGE, CAR, CARDS, GB, INCOME, and RESID, and click Explore…

Figure B.4: Select Variables




Figure B.5: Display Variable Interactions



Analysts like to get a full understanding of their data, and a quick way to do this is through interactive data visualization. If you click the Without Vehicle bar in the CAR graphic (Figure B.5), all of the customers who do not have a car are selected in the other graphics as well. Right-click a graph, select Data Options, and use the Where tab to query the data displayed.

By using visual interactive data exploration, analysts can quickly assess the quality of the data and any initial patterns that exist. They can then use these findings to drive the rest of the data mining project, in particular how the data is modified before the modeling process begins.
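The same filtered view can also be reproduced outside the GUI in batch code. A minimal sketch, assuming the KGB data set stores the CAR characteristic with the literal value shown in the figure (the actual coded value may differ):

/* Plot RESID by good/bad status for customers without a car,        */
/* mirroring the effect of clicking the Without Vehicle bar.          */
proc sgplot data=kgb;
   where car = 'Without Vehicle';   /* assumed coded value; verify with PROC FREQ */
   vbar resid / group=gb;           /* distribution of RESID, split by GB */
run;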






B.1.4 Step 4 – Partition the Data

Figure B.6: Property Panel for the Data Partition Node



In the process of developing a scorecard, you perform predictive modeling, so it is advisable to partition your data set into training and validation samples. If the total sample size permits, a separate test sample allows a more robust evaluation of the resulting scorecard. The Data Partition node, whose property panel is shown in Figure B.6, is used to partition the KGB data set.
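For reference, an equivalent partition can be sketched in Base SAS. This is a minimal sketch, not the node's internal implementation; the 70/30 split and the seed are illustrative assumptions, and stratifying on the target GB mirrors the node's default behavior:

/* Stratified 70/30 split of the KGB data on the target GB */
proc sort data=kgb; by gb; run;            /* SURVEYSELECT requires sorted strata */

proc surveyselect data=kgb out=kgb_part
                  samprate=0.7 outall seed=12345;
   strata gb;                              /* preserve the good/bad mix */
run;

data train valid;
   set kgb_part;
   if selected then output train;          /* OUTALL flags the sampled 70% */
   else output valid;
run;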

B.1.5 Step 5 – Perform Screening and Grouping with Interactive Grouping

To perform univariate characteristic screening and grouping, an Interactive Grouping node (Figure B.7) is used in the process flow diagram.

Figure B.7: Interactive Grouping Node



The Interactive Grouping node can automatically group the characteristics for you; however, the node also enables you to perform the grouping interactively (on the Property Panel, under Train, select Interactive Grouping).




Performing interactive grouping is important because the results of the grouping affect the predictive power of the characteristics, and the results of the screening often indicate the need for regrouping. Thus, the process of grouping and screening is iterative, rather than a sequential set of discrete steps.

Grouping refers to the process of purposefully censoring your data. Grouping offers the following advantages (a sketch of the underlying weight of evidence calculation follows the list):

• It offers an easier way to deal with rare classes and outliers in interval variables.
• It makes it easy to understand relationships, and therefore gain far more knowledge of the portfolio.
• Nonlinear dependencies can be modeled with linear models.
• It gives the user control over the development process. By shaping the groups, you shape the final composition of the scorecard.
• The process of grouping characteristics enables the user to develop insights into the behavior of risk predictors and to increase knowledge of the portfolio, which can help in developing better strategies for portfolio management.
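The screening itself rests on the weight of evidence (WOE) of each group and on the characteristic's overall information value. A minimal sketch of that calculation, assuming the target GB (1 = event, 0 = non-event) and a hypothetical grouped characteristic AGE_GRP:

/* WOE and information value for one grouped characteristic */
proc sql;
   create table woe as
   select age_grp,
          sum(gb = 0) / (select sum(gb = 0) from kgb) as dist_good,
          sum(gb = 1) / (select sum(gb = 1) from kgb) as dist_bad,
          log(calculated dist_good / calculated dist_bad) as woe,
          (calculated dist_good - calculated dist_bad)
             * calculated woe as iv_part
   from kgb
   group by age_grp;

   /* total information value for the characteristic */
   select sum(iv_part) as information_value from woe;
quit;

Groups with near-zero WOE contribute little to the model, and a characteristic whose total information value is very low is a candidate for regrouping or removal.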



B.1.6 Step 6 – Create a Scorecard and Fit a Logistic Regression Model

The Scorecard node (Figure B.8) fits a logistic regression model and computes the scorecard points for each attribute. With the SAS Enterprise Miner Scorecard node, you can use either the weight of evidence (WOE) variables or the group variables that are exported by the Interactive Grouping node as inputs for the logistic regression model.

Figure B.8: Scorecard Node



The Scorecard node provides four methods of model selection and seven selection criteria for the logistic regression model. The scorecard points of each attribute are based on the coefficients of the logistic regression model. The Scorecard node also enables you to manually assign scorecard points to attributes. The scaling of the scorecard points is controlled by three scaling options within the properties of the Scorecard node.
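As a minimal sketch of what the node automates (not its internal implementation), the following fits the regression on WOE inputs and scales the resulting log-odds into points. The WOE variable names and the scaling constants (600 points at 30:1 odds, 20 points to double the odds) are illustrative assumptions, not the node's defaults:

/* Fit the logistic regression on WOE inputs from the grouping step */
proc logistic data=train descending outmodel=kgb_model;
   model gb = age_woe income_woe resid_woe / selection=stepwise;
   output out=scored p=pd;                 /* pd = predicted event probability */
run;

/* Scale log-odds into scorecard points */
data scorecard;
   set scored;
   pdo    = 20;                            /* points to double the odds       */
   factor = pdo / log(2);
   offset = 600 - factor * log(30);        /* anchor: 600 points at 30:1 odds */
   score  = offset + factor * log((1 - pd) / pd);
run;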

B.1.7 Step 7 – Create a Rejected Data Source

The REJECTS data set contains records that represent previous applicants who were denied credit. The REJECTS data set does not have a target variable.

The Reject Inference node automatically creates the target variable for the REJECTS data when it creates the augmented data set. The REJECTS data set must include the same characteristics as the KGB data. A role of SCORE is assigned to the REJECTS data source.

B.1.8 Step 8 – Perform Reject Inference and Create an Augmented Data Set

Credit scoring models are built with a fundamental selection bias: the sample data that is used to develop a credit scoring model is structurally different from the "through-the-door" population to which the model is applied. The non-event or event target variable that is created for the credit scoring model is based on the records of applicants who were all accepted for credit. However, the population to which the model is applied includes applicants who would have been rejected under the scoring rules that were used to generate the initial model. One remedy for this selection bias is reject inference. The reject inference approach uses the model that was trained on the accepted applications to score the rejected applications, and the observations in the rejected data set are classified as inferred non-events and inferred events. The inferred observations are then added to the KGB data set to form an augmented data set.

This augmented data set, which represents the "through-the-door" population, serves as the training data set for a second scorecard model.
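A minimal sketch of the hard-cutoff variant of reject inference follows; the node also supports other augmentation methods. The 0.5 cutoff and the data set names are illustrative assumptions, and the KGB model is assumed to have been saved with OUTMODEL= as in the Step 6 sketch:

/* Score the rejected applications with the accepted-applicant model */
proc logistic inmodel=kgb_model;
   score data=rejects out=rejects_scored;  /* adds P_1, the event probability */
run;

/* Classify rejects as inferred events/non-events and augment the KGB data */
data rejects_inferred;
   set rejects_scored;
   gb = (p_1 >= 0.5);                      /* hard cutoff; 0.5 is illustrative */
run;

data augmented;                            /* training data for the second scorecard */
   set kgb rejects_inferred;
run;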


