# 6…SPSS Procedures for Performing Factor Analysis on Defaulter Prediction

11.6 SPSS Procedures for Performing Factor Analysis on Defaulter Prediction

Step 2: Move repayment behaviour into the Dependent variable box and age, income, number of dependents, job, education and other loan into the Covariates box. Make sure Enter is the selected Method (this enters all the variables in the Covariates box into the logistic regression equation simultaneously). See Fig. 11.4.

Step 3: If the study has categorical independent variables, click on Categorical and move all of them from the Covariates panel on the left to the Categorical Covariates panel on the right to get Fig. 11.5. Then click on Continue to return to the Logistic Regression window.

Step 4: Click on Save to request predicted probabilities and group membership, which will give Fig. 11.6. Then click on Continue to return to the Logistic Regression window.

Step 5: Click on Options to produce Fig. 11.7. Select Classification Plots and Hosmer–Lemeshow Goodness of Fit. Then click on Continue to return to the Logistic Regression window, and click on OK to get the output window.

Table 11.3 Defaulter prediction data (First 20 samples)
Account number | Repayment behaviour | Age | Gender | Income | Number dependents | Job | Education | Other_loan

21.00
31.00
51.00
71.00
74.00
91.00
111.00
131.00
141.00
191.00
201.00
241.00
251.00
261.00
271.00
283.00
291.00
311.00
312.00

0.00
1.00
1.00
1.00
1.00
1.00
1.00
0.00
1.00
0.00
0.00
0.00
1.00
1.00
0.00
0.00
1.00
0.00
0.00

0.00
0.00
1.00
1.00
0.00
0.00
0.00
0.00
0.00
1.00
1.00
0.00
0.00
0.00
1.00
1.00
0.00
1.00
1.00

0.00
0.00
1.00
0.00
0.00
0.00
1.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.00
0.00
0.00
0.00

56.00
43.00
56.00
64.00
49.00
46.00
52.00
63.00
42.00
55.00
74.00
53.00
58.00
56.00
69.00
51.00
43.00
64.00
44.00

0.00
0.00
1.00
1.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.00
0.00
1.00
0.00

20.00
25.00
6.00
40.00
42.00
8.00
26.00
6.00
43.00
23.00
26.00
10.00
40.00
30.00
10.00
32.00
12.00
41.00
23.00

0.00
0.00
1.00
0.00
0.00
0.00
1.00
0.00
0.00
1.00
0.00
0.00
0.00
0.00
0.00
1.00
0.00
1.00
0.00

0.00
0.00
1.00
1.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.00
0.00
1.00
0.00

Fig. 11.3 IBM SPSS 20.0 binary logistic selection

Fig. 11.4 IBM SPSS logistic regression window

11 Binary Logistic Regression

Fig. 11.5 Defining of categorical independent variables

Fig. 11.6 Selection of probabilities and group membership

Fig. 11.7 Logistic regression option window

11.7 IBM SPSS 20.0 Syntax for Binary Logistic Regression
* Load the data file and name the active dataset.
GET
  FILE = 'G:\LIBRARY\I-BOOK DEVELOPMENT\Logistic Regression.sav'.
DATASET NAME DataSet1 WINDOW = FRONT.
* Binary logistic regression with all predictors entered together (ENTER).
LOGISTIC REGRESSION VARIABLES RepaymentBehaviour
  /METHOD = ENTER age Gender Income No_Dep Job Education Other_Loan
  /CONTRAST (Gender) = Indicator
  /CONTRAST (No_Dep) = Indicator
  /CONTRAST (Job) = Indicator
  /CONTRAST (Education) = Indicator
  /CONTRAST (Other_Loan) = Indicator
  /SAVE = PRED PGROUP
  /CLASSPLOT
  /PRINT = GOODFIT
  /CRITERIA = PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).

11.8 IBM SPSS 20.0 Output for Logistic Regression
Table 11.4 shows the number of observations included in the analysis (Included in Analysis) and the total number of observations (Total). The number Included in Analysis may be less than the Total if there are missing values for any variables in the equation. By default, SPSS does a listwise deletion of incomplete cases. In the current example, the two are the same because there are no missing values.

Table 11.4 Case processing summary

| Unweighted cases (a) | | N | Percent (%) |
|---|---|---|---|
| Selected cases | Included in analysis | 609 | 100.0 |
| | Missing cases | 0 | 0.0 |
| | Total | 609 | 100.0 |
| Unselected cases | | 0 | 0.0 |
| Total | | 609 | 100.0 |

(a) If weight is in effect, see classification table for the total number of cases

Table 11.5 Dependent variable encoding

| Original value | Internal value |
|---|---|
| Defaulter | 0 |
| Non-defaulter | 1 |

Table 11.6 Categorical variables codings

| Variable | Category | Frequency | Parameter coding (1) |
|---|---|---|---|
| Other loan possessed by the account holder | Yes | 511 | 1.000 |
| | No | 98 | 0.000 |
| Number of people depended on Account holder | 2 or <2 dependents | 443 | 1.000 |
| | More than 2 dependents | 166 | 0.000 |
| Fixed job versus temporary | Permanent Job | 222 | 1.000 |
| | Temporary Job | 387 | 0.000 |
| Education of the account holder (School vs. College) | School education | 354 | 1.000 |
| | College education | 255 | 0.000 |
| Gender of the account holder | Male | 487 | 1.000 |
| | Female | 122 | 0.000 |
Table 11.5 shows the coding of the dichotomous dependent variable, Repayment Behaviour: 0 is used for the defaulter category and 1 for the non-defaulter category.
Table 11.6 shows the labels and codes used by the researcher to represent the categorical independent variables.
Table 11.7 Classification table (a, b)

| Step 0 | Observed | Predicted: Defaulter | Predicted: Non-defaulter | Percentage correct |
|---|---|---|---|---|
| | Repayment behaviour: Defaulter | 538 | 0 | 100.0 |
| | Repayment behaviour: Non-defaulter | 71 | 0 | 0.0 |
| | Overall percentage | | | 88.3 |

(a) Constant is included in the model
(b) The cut value is 0.500

Table 11.8 Variables not in the equation

| Step 0 | Variable | Score | df | Significant |
|---|---|---|---|---|
| Variables | Age | 11.819 | 1 | 0.001 |
| | Gender(1) | 16.246 | 1 | 0.000 |
| | Income | 17.302 | 1 | 0.000 |
| | No_Dep(1) | 7.483 | 1 | 0.006 |
| | Job(1) | 5.429 | 1 | 0.020 |
| | Education(1) | 11.536 | 1 | 0.001 |
| | Other_Loan(1) | 6.775 | 1 | 0.009 |
| Overall statistics | | 49.121 | 7 | 0.000 |

Table 11.9 Omnibus tests of model coefficients

| Step 1 | | Chi square | df | Significant |
|---|---|---|---|---|
| | Step | 46.798 | 7 | 0.000 |
| | Block | 46.798 | 7 | 0.000 |
| | Model | 46.798 | 7 | 0.000 |

The Classification Table is shown in Table 11.7. As mentioned earlier, it is common practice to use 0.5 as the cut-off for predicting occurrence: non-occurrence of the event of interest is predicted whenever p < 0.5 and occurrence whenever p > 0.5. The Classification Table shows how many correct and incorrect predictions are made at this cut-off. Table 11.7 reports Step 0, where only the constant is in the model and every case is predicted to be a defaulter. In this case, 88.3 per cent of the cases are correctly classified using the 0.50 cut-off point, which is similar to the ‘Hit Ratio’ in discriminant analysis.
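As a quick check on the 88.3 per cent figure, the overall percentage correct can be recomputed from the four cell counts of Table 11.7. The sketch below is written in Python rather than SPSS, and the function name `hit_ratio` is an illustrative choice, not something SPSS produces.

```python
def hit_ratio(table):
    """Overall percentage correct from a 2x2 classification table.

    table[i][j] is the number of cases observed in class i
    and predicted as class j (0 = defaulter, 1 = non-defaulter).
    """
    correct = table[0][0] + table[1][1]          # diagonal cells
    total = sum(sum(row) for row in table)       # all cases
    return 100.0 * correct / total


# Step 0 counts from Table 11.7
# rows: observed defaulter, observed non-defaulter
step0 = [[538, 0],
         [71, 0]]

overall = hit_ratio(step0)
print(f"Overall percentage correct: {overall:.1f}")
```

Because the Step 0 model predicts every case as a defaulter, the figure is simply the share of defaulters in the sample (538 of 609 cases).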
Table 11.8 shows how well each independent variable, taken individually, predicts the dependent variable. In this study, all the variables are found to be significant at the 5 per cent level (p < 0.05).
SPSS offers several ways to assess how much the independent variables contribute to predicting the dichotomous dependent variable. One of them is the Omnibus Tests of Model Coefficients in Table 11.9. This test indicates whether the specified model is significant when all the independent variables are considered together. In this example, with all the variables taken together, the specified model is significant (χ² = 46.798, df = 7, N = 609, p < 0.001).

11.9 Assessing a Model’s Fit and Predictive Ability
There are several statistics printed by SPSS that can be used to assess model fit. The most important among them are as follows:
(i) The model summary (Table 11.10) reports the Cox and Snell R², a generalized coefficient of determination. The closer the value of R² is to 1, the better the fit of the model. Because the Cox and Snell R² cannot reach a maximum value of 1, the second R², the Nagelkerke R², which rescales it to the range 0 to 1, is the better one to use.
(ii) Observe the Hosmer and Lemeshow test shown in Table 11.11. SPSS computes a Chi square from the observed and expected frequencies in Table 11.12. Large Chi square values (and correspondingly small p values) indicate a lack of fit for the model. In our example, the Hosmer and Lemeshow Chi square test for the final model yields a p value of 0.092 (Table 11.11), which is above 0.05 and thus suggests a model with satisfactory fit. Note that the Hosmer and Lemeshow Chi square test is not a test of the importance of specific model parameters.
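The fit statistics in the list above can be reproduced from numbers that SPSS already prints. The sketch below is in Python rather than SPSS. It uses the standard Cox and Snell and Nagelkerke formulas, and the usual Hosmer and Lemeshow sum of (observed - expected)² / expected over both outcome columns of the ten groups. The function names are illustrative, and the closed-form tail probability is valid only for an even number of degrees of freedom (8 here).

```python
import math

N = 609                 # cases included in the analysis (Table 11.4)
MODEL_CHI2 = 46.798     # omnibus model chi-square (Table 11.9)
NEG2LL_MODEL = 391.761  # -2 log likelihood of the fitted model (Table 11.10)

# Table 11.12: (observed, expected) defaulters, (observed, expected) non-defaulters
HL_GROUPS = [
    ((62, 60.404), (0, 1.596)),
    ((59, 59.536), (3, 2.464)),
    ((60, 57.839), (1, 3.161)),
    ((57, 57.967), (5, 4.033)),
    ((51, 56.241), (10, 4.759)),
    ((55, 55.491), (6, 5.509)),
    ((54, 54.302), (7, 6.698)),
    ((57, 51.725), (4, 9.275)),
    ((46, 47.489), (15, 13.511)),
    ((37, 37.006), (20, 19.994)),
]


def cox_snell_r2(model_chi2, n):
    """Cox and Snell R2 = 1 - exp(-model chi-square / n)."""
    return 1.0 - math.exp(-model_chi2 / n)


def nagelkerke_r2(model_chi2, neg2ll_model, n):
    """Cox and Snell R2 divided by its maximum attainable value."""
    neg2ll_null = neg2ll_model + model_chi2   # -2LL of the constant-only model
    r2_max = 1.0 - math.exp(-neg2ll_null / n)
    return cox_snell_r2(model_chi2, n) / r2_max


def hosmer_lemeshow(groups):
    """Sum of (observed - expected)^2 / expected over every cell."""
    return sum((o - e) ** 2 / e for row in groups for o, e in row)


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability (closed form, even df only)."""
    if df % 2 != 0:
        raise ValueError("closed form implemented for even df only")
    half = x / 2.0
    terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * terms


cs = cox_snell_r2(MODEL_CHI2, N)
nk = nagelkerke_r2(MODEL_CHI2, NEG2LL_MODEL, N)
hl = hosmer_lemeshow(HL_GROUPS)
p_value = chi2_sf_even_df(hl, df=8)

print(f"Cox & Snell R2 = {cs:.3f}, Nagelkerke R2 = {nk:.3f}")
print(f"Hosmer-Lemeshow chi-square = {hl:.3f}, p = {p_value:.3f}")
```

The results should land close to the values reported in Tables 11.10 and 11.11. Small differences in the Hosmer and Lemeshow statistic are to be expected, because the expected frequencies in Table 11.12 are rounded to three decimals.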
Table 11.10 Model summary

| Step | -2 Log likelihood | Cox & Snell R² | Nagelkerke R² |
|---|---|---|---|
| 1 | 391.761 (a) | 0.074 | 0.144 |

(a) Estimation terminated at iteration number 6 because parameter estimates changed by <0.001

Table 11.11 Hosmer and Lemeshow test

| Step | Chi square | df | Significant |
|---|---|---|---|
| 1 | 13.636 | 8 | 0.092 |

Table 11.12 Contingency table for Hosmer and Lemeshow test

| Step 1 group | Defaulter: Observed | Defaulter: Expected | Non-defaulter: Observed | Non-defaulter: Expected | Total |
|---|---|---|---|---|---|
| 1 | 62 | 60.404 | 0 | 1.596 | 62 |
| 2 | 59 | 59.536 | 3 | 2.464 | 62 |
| 3 | 60 | 57.839 | 1 | 3.161 | 61 |
| 4 | 57 | 57.967 | 5 | 4.033 | 62 |
| 5 | 51 | 56.241 | 10 | 4.759 | 61 |
| 6 | 55 | 55.491 | 6 | 5.509 | 61 |
| 7 | 54 | 54.302 | 7 | 6.698 | 61 |
| 8 | 57 | 51.725 | 4 | 9.275 | 61 |
| 9 | 46 | 47.489 | 15 | 13.511 | 61 |
| 10 | 37 | 37.006 | 20 | 19.994 | 57 |

Table 11.13 Variables in the equation

| Step 1 (a) | B | S.E. | Wald | df | Significant | Exp(B) |
|---|---|---|---|---|---|---|
| age | 0.035 | 0.015 | 5.194 | 1 | 0.023 | 1.036 |
| Gender(1) | -1.536 | 0.526 | 8.511 | 1 | 0.004 | 0.215 |
| Income | 0.010 | 0.004 | 7.510 | 1 | 0.006 | 1.010 |
| No_Dep(1) | -0.244 | 0.298 | 0.672 | 1 | 0.412 | 0.784 |
| Job(1) | -0.681 | 0.307 | 4.922 | 1 | 0.027 | 0.506 |
| Education(1) | -0.861 | 0.331 | 6.786 | 1 | 0.009 | 0.423 |
| Other_Loan(1) | 1.749 | 0.616 | 8.053 | 1 | 0.005 | 5.749 |
| Constant | -4.118 | 1.056 | 15.195 | 1 | 0.000 | 0.016 |

(a) Variable(s) entered on step 1: age, Gender, Income, No_Dep, Job, Education, Other_Loan

In Table 11.13, the estimates (B) are the binary logit regression coefficients for the parameters in the model. The logistic regression model expresses the log odds of a positive response (the probability modelled is for Non-Defaulter = 1) as a linear combination of the predictor variables. This is written as follows:

$$
\begin{aligned}
\mathrm{LOGIT}_i &= \ln\left(\frac{\mathrm{Prob}_{\text{Non-defaulter}}}{1 - \mathrm{Prob}_{\text{Non-defaulter}}}\right) \\
&= -4.118 + 0.035 \times \text{Age} - 1.536 \times \text{Gender} + 0.010 \times \text{Income} \\
&\quad - 0.244 \times \text{Number of Dependents} - 0.681 \times \text{Job} \\
&\quad - 0.861 \times \text{Education} + 1.749 \times \text{OtherLoan}
\end{aligned}
$$
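To see how the fitted equation turns into a predicted probability, the sketch below (in Python, not SPSS) evaluates the logit with the coefficients from Table 11.13 and applies the inverse logit. The applicant values are hypothetical and chosen only for illustration. Income is assumed to be in the same units as the data file, and the 0/1 codes are assumed to follow Table 11.6.

```python
import math

# Coefficients (B) from Table 11.13
CONSTANT = -4.118
COEF = {
    "age": 0.035,
    "gender": -1.536,      # 1 = male, 0 = female
    "income": 0.010,
    "no_dep": -0.244,      # 1 = two or fewer dependents, 0 = more than two
    "job": -0.681,         # 1 = permanent, 0 = temporary
    "education": -0.861,   # 1 = school, 0 = college
    "other_loan": 1.749,   # 1 = yes, 0 = no
}


def logit(case):
    """Linear predictor: constant plus the sum of B times each predictor value."""
    return CONSTANT + sum(COEF[name] * case[name] for name in COEF)


def prob_non_defaulter(case):
    """Inverse logit: predicted probability of the outcome coded 1."""
    return 1.0 / (1.0 + math.exp(-logit(case)))


# Hypothetical applicant, for illustration only
applicant = {"age": 50, "gender": 1, "income": 20, "no_dep": 1,
             "job": 1, "education": 1, "other_loan": 1}

z = logit(applicant)
p = prob_non_defaulter(applicant)
group = "non-defaulter" if p > 0.5 else "defaulter"   # cut value 0.5
print(f"logit = {z:.3f}, P(non-defaulter) = {p:.4f}, predicted group: {group}")
```

The last step mirrors what SPSS saves when predicted probabilities and group membership are requested: a probability for each case and the group implied by the 0.5 cut value.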
SPSS outputs both the logistic coefficients (B) and the exponentiated logistic coefficients (Exp(B)). According to Hair et al. (2010), the original logistic coefficients are most appropriate for determining the direction of the relationship and less useful for determining its magnitude. Exponentiated coefficients directly reflect the magnitude of the change in the odds. Because they are exponents, they are interpreted slightly differently: exponentiated coefficients below 1.0 reflect negative relationships, while values above 1.0 denote positive relationships.
Age: This is the estimated logistic regression coefficient for the variable age, given that the other variables in the model are held constant. For a one-unit increase in age, the log-odds of being a non-defaulter rather than a defaulter are expected to increase by 0.035, holding the other variables constant. The exponentiated coefficient for age is 1.036. For assessing magnitude, the easiest way to express the change in odds from this value is:
Percentage change in odds = (Exponentiated coefficient - 1.0) * 100 = (1.036 - 1) * 100 = 3.6 %
which means that, with an exponentiated coefficient of 1.036, a one-unit increase in the independent variable increases the odds by 3.6 %.
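The same percentage-change calculation can be applied to every predictor at once. The sketch below (Python, not SPSS) exponentiates each coefficient from Table 11.13 and converts it to a percentage change in odds. The dictionary and function names are illustrative.

```python
import math

# Logistic coefficients (B) from Table 11.13
B = {
    "age": 0.035,
    "Gender(1)": -1.536,
    "Income": 0.010,
    "No_Dep(1)": -0.244,
    "Job(1)": -0.681,
    "Education(1)": -0.861,
    "Other_Loan(1)": 1.749,
}


def pct_change_in_odds(b):
    """(Exp(B) - 1) * 100: percentage change in odds per one-unit increase."""
    return (math.exp(b) - 1.0) * 100.0


for name, b in B.items():
    direction = "increase" if b > 0 else "decrease"
    print(f"{name:14s} Exp(B) = {math.exp(b):.3f}  "
          f"odds {direction} by {abs(pct_change_in_odds(b)):.1f} %")
```

A positive B gives an Exp(B) above 1 and a percentage increase in the odds, while a negative B gives an Exp(B) below 1 and a percentage decrease. This is the direction rule stated by Hair et al. (2010).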
Gender (1): This is a dichotomous independent variable coded male = 1 and female = 0 (Table 11.6), so female is the reference category. The value we estimated is the estimated logistic regression coefficient for a one-unit change in gender, given that the other variables in the model are held constant. The logit