11.6 SPSS Procedures for Performing Factor Analysis on Defaulter Prediction

Step 2  Move repayment behaviour into the Dependent variable box and age,
        income, number of dependents, job, education and other loan into the
        Covariates box. Make sure Enter is the selected Method (this enters
        all the variables in the Covariates box into the logistic regression
        equation simultaneously). See Fig. 11.4.
Step 3  If the study has categorical independent variables, click on
        Categorical and move all of them from the left panel (Covariates) to
        the right panel (Categorical Covariates) to get Fig. 11.5. Then click
        on Continue to get back to the Logistic Regression window.
Step 4  Click on Save and select Probabilities and Group membership, which
        will give Fig. 11.6. Then click on Continue to get back to the
        Logistic Regression window.
Step 5  Click on Options to produce Fig. 11.7. Select Classification Plots
        and Hosmer–Lemeshow Goodness of Fit. Then click on Continue to get
        back to the Logistic Regression window, and click on OK to get the
        output window.
Table 11.3 Defaulter prediction data (First 20 samples)
Columns: Account number, Repayment behaviour, Age, Gender, Income, Number of dependents, Job, Education, Other_loan
21.00
31.00
51.00
71.00
74.00
91.00
111.00
131.00
141.00
191.00
201.00
241.00
251.00
261.00
271.00
283.00
291.00
311.00
312.00
0.00
1.00
1.00
1.00
1.00
1.00
1.00
0.00
1.00
0.00
0.00
0.00
1.00
1.00
0.00
0.00
1.00
0.00
0.00
0.00
0.00
1.00
1.00
0.00
0.00
0.00
0.00
0.00
1.00
1.00
0.00
0.00
0.00
1.00
1.00
0.00
1.00
1.00
0.00
0.00
1.00
0.00
0.00
0.00
1.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.00
0.00
0.00
0.00
56.00
43.00
56.00
64.00
49.00
46.00
52.00
63.00
42.00
55.00
74.00
53.00
58.00
56.00
69.00
51.00
43.00
64.00
44.00
0.00
0.00
1.00
1.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.00
0.00
1.00
0.00
20.00
25.00
6.00
40.00
42.00
8.00
26.00
6.00
43.00
23.00
26.00
10.00
40.00
30.00
10.00
32.00
12.00
41.00
23.00
0.00
0.00
1.00
0.00
0.00
0.00
1.00
0.00
0.00
1.00
0.00
0.00
0.00
0.00
0.00
1.00
0.00
1.00
0.00
0.00
0.00
1.00
1.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
1.00
0.00
1.00
0.00
Fig. 11.3 IBM SPSS 20.0 binary logistic selection
Fig. 11.4 IBM SPSS logistic regression window
11 Binary Logistic Regression
Fig. 11.5 Defining of categorical independent variables
Fig. 11.6 Selection of probabilities and group membership
Fig. 11.7 Logistic regression option window
11.7 IBM SPSS 20.0 Syntax for Binary Logistic Regression
* Open the data file and name the active dataset.
GET
  FILE = 'G:\LIBRARY\I-BOOK DEVELOPMENT\Logistic Regression.sav'.
DATASET NAME DataSet1 WINDOW = FRONT.
* Binary logistic regression with all predictors entered together (Enter method).
LOGISTIC REGRESSION VARIABLES RepaymentBehaviour
  /METHOD = ENTER age Gender Income No_Dep Job Education Other_Loan
  /CONTRAST (Gender) = Indicator
  /CONTRAST (No_Dep) = Indicator
  /CONTRAST (Job) = Indicator
  /CONTRAST (Education) = Indicator
  /CONTRAST (Other_Loan) = Indicator
  /SAVE = PRED PGROUP
  /CLASSPLOT
  /PRINT = GOODFIT
  /CRITERIA = PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).
11.8 IBM SPSS 20.0 Output for Logistic Regression
Table 11.4 shows the number of observations included in the analysis (Included
in Analysis) and the total number of observations (Total). The number of
observations Included in Analysis may be less than the Total if there are
missing values for any variables in the equation. By default, SPSS does a
listwise deletion of incomplete cases. In the current example, both are the
same, because there are no missing values.

Table 11.4 Case processing summary
Unweighted cases(a)                          N      Percent (%)
Selected cases     Included in analysis      609    100.0
                   Missing cases             0      0.0
                   Total                     609    100.0
Unselected cases                             0      0.0
Total                                        609    100.0
(a) If weight is in effect, see classification table for the total number of cases

Table 11.5 Dependent variable encoding
Original value     Internal value
Defaulter          0
Non-defaulter      1

Table 11.6 Categorical variables codings
Variable                                               Category                  Frequency   Parameter coding (1)
Other loan possessed by the account holder             Yes                       511         1.000
                                                       No                        98          0.000
Number of people dependent on account holder           2 or <2 dependents        443         1.000
                                                       More than 2 dependents    166         0.000
Fixed job versus temporary                             Permanent Job             222         1.000
                                                       Temporary Job             387         0.000
Education of the account holder (School vs. College)   School education          354         1.000
                                                       College education         255         0.000
Gender of the account holder                           Male                      487         1.000
                                                       Female                    122         0.000
Table 11.5 shows the coding of the dichotomous dependent variable, Repayment
Behaviour: 0 is used for the defaulter category and 1 for the non-defaulter
category.
Table 11.6 shows the coding used by the researcher to represent the
categorical independent variables.
Table 11.7 Classification table(a,b)
                                                Predicted
                                                Repayment behaviour            Percentage
        Observed                                Defaulter    Non-defaulter     correct
Step 0  Repayment behaviour   Defaulter         538          0                 100.0
                              Non-defaulter     71           0                 0.0
        Overall percentage                                                     88.3
(a) Constant is included in the model
(b) The cut value is 0.500

Table 11.8 Variables not in the equation
                                      Score     df   Significant
Step 0  Variables   Age               11.819    1    0.001
                    Gender(1)         16.246    1    0.000
                    Income            17.302    1    0.000
                    No_Dep(1)         7.483     1    0.006
                    Job(1)            5.429     1    0.020
                    Education(1)      11.536    1    0.001
                    Other_Loan(1)     6.775     1    0.009
        Overall statistics            49.121    7    0.000

Table 11.9 Omnibus tests of model coefficients
                  Chi square   df   Significant
Step 1  Step      46.798       7    0.000
        Block     46.798       7    0.000
        Model     46.798       7    0.000

The Classification Table is shown in Table 11.7. As mentioned earlier, it is
common practice to use 0.5 as the cut-off for predicting occurrence. That is, to
predict non-occurrence of the event of interest whenever p < 0.5 and to predict
occurrence if p > 0.5. The Classification Table indicates how many correct and
incorrect predictions are made at the cut-off used for the model. Table 11.7 is
the Step 0 table, in which only the constant is included, so every case is
predicted to be a defaulter. In this case, 88.3 per cent of the cases are
correctly classified using the 0.50 cut-off point, which is similar to the
'Hit Ratio' in discriminant analysis.
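The hit ratio quoted above can be reproduced from the counts in Table 11.7. The following is only a small sketch; the function name `percent_correct` and the dictionary layout are illustrative choices, not part of the SPSS output.

```python
# Counts from Table 11.7 (Step 0, cut value 0.500).
# Outer keys are observed groups, inner keys are predicted groups.
table = {
    "Defaulter": {"Defaulter": 538, "Non-defaulter": 0},
    "Non-defaulter": {"Defaulter": 71, "Non-defaulter": 0},
}


def percent_correct(table):
    """Return (per-group percentage correct, overall percentage correct)."""
    per_group = {}
    correct = 0
    total = 0
    for observed, row in table.items():
        n_observed = sum(row.values())
        # A case is correctly classified when predicted group == observed group.
        per_group[observed] = 100.0 * row[observed] / n_observed
        correct += row[observed]
        total += n_observed
    return per_group, 100.0 * correct / total


per_group, overall = percent_correct(table)
print(per_group)
print(round(overall, 1))
```

Because no case is predicted to be a non-defaulter at Step 0, the overall figure is simply the share of defaulters in the sample (538 of 609).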
Table 11.8 shows how well each independent variable individually predicts the
dependent variable. In this study, all the variables are found to be significant
at the 5 per cent level (p < 0.05).
SPSS offers different methods to assess the contribution of the independent
variables to predicting the dichotomous dependent variable. One of the tests
that SPSS reports is the Omnibus Tests of Model Coefficients in Table 11.9. This
test indicates whether the specified model is significant when all the
independent variables are considered together. In this example, the model with
all the variables taken together is significant (χ² = 46.798, df = 7, N = 609,
p < 0.001).
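The omnibus chi-square is the drop in -2 log likelihood between the constant-only model and the fitted model. As a rough consistency check (a sketch, not SPSS's own computation), the constant-only -2 log likelihood can be derived from the group counts in Table 11.7 and compared with the -2 log likelihood of 391.761 reported in the Model summary (Table 11.10). Variable names are illustrative.

```python
import math

n_defaulter, n_non_defaulter = 538, 71   # observed group counts (Table 11.7)
n = n_defaulter + n_non_defaulter        # 609 cases (Table 11.4)
model_chi_square = 46.798                # Model chi-square (Table 11.9)

# -2 log likelihood of the constant-only model: every case gets the
# sample proportion of non-defaulters as its predicted probability.
p = n_non_defaulter / n
null_deviance = -2.0 * (
    n_non_defaulter * math.log(p) + n_defaulter * math.log(1.0 - p)
)

# Model chi-square = null deviance - fitted-model deviance.
fitted_deviance = null_deviance - model_chi_square

print(round(null_deviance, 3))
print(round(fitted_deviance, 3))  # compare with 391.761 in Table 11.10
```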
11.9 Assessing a Model’s Fit and Predictive Ability
There are several statistics printed by SPSS that can be used to assess model fit.
The most important among them are as follows:
(i) The R² values in the Model summary (Table 11.10). The first is the Cox and
Snell R², a generalized coefficient of determination. The closer the value
of R² is to 1, the better the fit of the model. However, Cox and Snell R²
cannot achieve a maximum value of 1, so the second R², Nagelkerke R², is the
better one to use.
(ii) The Hosmer and Lemeshow test shown in Table 11.11. SPSS computes a Chi
square from the observed and expected frequencies in Table 11.12. Large Chi
square values (and correspondingly small p-values) indicate a lack of fit for
the model. In our example, the Hosmer and Lemeshow Chi square test for the
final model yields a p value of 0.092 (Table 11.11), which is above 0.05 and
thus suggests a model with satisfactory fit. Note that the Hosmer and Lemeshow
Chi square test is not a test of the importance of specific model parameters.
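Both fit measures in the list above can be recomputed from the reported output. The sketch below uses the standard Cox and Snell and Nagelkerke formulas and the usual Hosmer and Lemeshow statistic (sum of (observed - expected)² / expected over all cells, with groups - 2 degrees of freedom); the helper `chi2_sf_even_df` is an illustrative closed form that is valid only for even degrees of freedom.

```python
import math

n = 609                                      # cases in the analysis (Table 11.4)
model_chi_square = 46.798                    # Table 11.9
null_deviance = 391.761 + model_chi_square   # fitted -2LL (Table 11.10) + model chi-square

# Pseudo R-squared values.
cox_snell = 1.0 - math.exp(-model_chi_square / n)
max_cox_snell = 1.0 - math.exp(-null_deviance / n)   # upper bound of Cox and Snell
nagelkerke = cox_snell / max_cox_snell                # rescaled to reach 1

# Observed and expected frequencies from Table 11.12 (ten groups).
obs_def = [62, 59, 60, 57, 51, 55, 54, 57, 46, 37]
exp_def = [60.404, 59.536, 57.839, 57.967, 56.241,
           55.491, 54.302, 51.725, 47.489, 37.006]
obs_non = [0, 3, 1, 5, 10, 6, 7, 4, 15, 20]
exp_non = [1.596, 2.464, 3.161, 4.033, 4.759,
           5.509, 6.698, 9.275, 13.511, 19.994]

hl_chi_square = sum(
    (o - e) ** 2 / e
    for o, e in zip(obs_def + obs_non, exp_def + exp_non)
)
hl_df = len(obs_def) - 2


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i      # (x/2)^i / i!
        total += term
    return math.exp(-half) * total


hl_p_value = chi2_sf_even_df(hl_chi_square, hl_df)

print(round(cox_snell, 3), round(nagelkerke, 3))
print(round(hl_chi_square, 3), hl_df, round(hl_p_value, 3))
```

Small differences from the printed Chi square are expected, because the expected frequencies in Table 11.12 are rounded to three decimals.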
Table 11.10 Model summary
Step   -2 Log likelihood   Cox & Snell R²   Nagelkerke R²
1      391.761(a)          0.074            0.144
(a) Estimation terminated at iteration number 6 because parameter estimates changed by <0.001

Table 11.11 Hosmer and Lemeshow test
Step   Chi square   df   Significant
1      13.636       8    0.092

Table 11.12 Contingency table for Hosmer and Lemeshow test
             Repayment behaviour = defaulter   Repayment behaviour = non-defaulter
             Observed    Expected              Observed    Expected                 Total
Step 1   1   62          60.404                0           1.596                    62
         2   59          59.536                3           2.464                    62
         3   60          57.839                1           3.161                    61
         4   57          57.967                5           4.033                    62
         5   51          56.241                10          4.759                    61
         6   55          55.491                6           5.509                    61
         7   54          54.302                7           6.698                    61
         8   57          51.725                4           9.275                    61
         9   46          47.489                15          13.511                   61
         10  37          37.006                20          19.994                   57

Table 11.13 Variables in the equation
                          B        S.E.    Wald     df   Significant   Exp(B)
Step 1(a)  age            0.035    0.015   5.194    1    0.023         1.036
           Gender(1)      -1.536   0.526   8.511    1    0.004         0.215
           Income         0.010    0.004   7.510    1    0.006         1.010
           No_Dep(1)      -0.244   0.298   0.672    1    0.412         0.784
           Job(1)         -0.681   0.307   4.922    1    0.027         0.506
           Education(1)   -0.861   0.331   6.786    1    0.009         0.423
           Other_Loan(1)  1.749    0.616   8.053    1    0.005         5.749
           Constant       -4.118   1.056   15.195   1    0.000         0.016
(a) Variable(s) entered on step 1: age, Gender, Income, No_Dep, Job, Education, Other_Loan

In Table 11.13, the estimates (B) are the binary logit regression coefficients
for the parameters in the model. The logistic regression model expresses the
log odds of a positive response (probability modelled for Non-Defaulter = 1) as a
linear combination of the predictor variables. This is written as follows:
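\[
\begin{aligned}
\mathrm{LOGIT}_i &= \ln\left(\frac{\mathrm{Prob}_{\text{Non-defaulter}}}{1 - \mathrm{Prob}_{\text{Non-defaulter}}}\right) \\
&= -4.118 + 0.035 \times \text{Age} - 1.536 \times \text{Gender} + 0.010 \times \text{Income} \\
&\quad - 0.244 \times \text{Number of Dependents} - 0.681 \times \text{Job} \\
&\quad - 0.861 \times \text{Education} + 1.749 \times \text{OtherLoan}
\end{aligned}
\]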
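To see how the fitted equation turns into a prediction, the sketch below evaluates the logit for one case and converts it to a probability with the inverse logit. The applicant's values are hypothetical and chosen only for illustration; the dummy variables are assumed to follow the parameter coding of Table 11.6 (1 for the first listed category, 0 otherwise), and the 0.5 cut-off is the one used in the SPSS run.

```python
import math

# Coefficients (B column) from Table 11.13.
coefficients = {
    "Age": 0.035,
    "Gender": -1.536,
    "Income": 0.010,
    "No_Dep": -0.244,
    "Job": -0.681,
    "Education": -0.861,
    "Other_Loan": 1.749,
}
constant = -4.118


def predict_non_defaulter(case):
    """Return (logit, probability of being a non-defaulter) for one case."""
    logit = constant + sum(b * case[name] for name, b in coefficients.items())
    probability = 1.0 / (1.0 + math.exp(-logit))   # inverse logit
    return logit, probability


# Hypothetical applicant, not taken from the data set.
applicant = {
    "Age": 50, "Gender": 1, "Income": 30, "No_Dep": 1,
    "Job": 1, "Education": 1, "Other_Loan": 1,
}

logit, probability = predict_non_defaulter(applicant)
predicted_group = "Non-defaulter" if probability > 0.5 else "Defaulter"
print(round(logit, 3), round(probability, 4), predicted_group)
```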
SPSS gives both the logistic coefficients and the exponentiated logistic
coefficients in its output. According to Hair et al. (2010), the original
logistic coefficients are most appropriate for determining the direction of the
relationship and less useful in determining the magnitude of relationships.
Exponentiated coefficients directly reflect the magnitude of the change in the
odds value. Because they are exponents, they are interpreted slightly
differently: exponentiated coefficients less than 1.0 reflect negative
relationships, while values above 1.0 denote positive relationships.
Age: This is the estimated logistic regression coefficient for the variable age,
given the other variables are held constant in the model. For a one unit
increase in age, the log-odds of being a non-defaulter rather than a defaulter
are expected to be 0.035 units higher, while holding the other variables
constant in the model. We got an exponentiated coefficient value of 1.036 for
age. For assessing magnitude, the easier approach to determine the change in
odds from these values is:
Percentage change in odds = (Exponentiated coefficient - 1.0) * 100
= (1.036 - 1) * 100 = 3.6 %
which means that, with an exponentiated coefficient of 1.036, a one unit
increase in the independent variable will increase the odds by 3.6 %.
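The same percentage-change formula can be applied to every coefficient in Table 11.13. This is a minimal sketch; the function name `percent_change_in_odds` is illustrative, and Exp(B) is recomputed from the rounded B values, so the last digit may differ slightly from the table.

```python
import math

# Coefficients (B column) from Table 11.13.
coefficients = {
    "age": 0.035,
    "Gender(1)": -1.536,
    "Income": 0.010,
    "No_Dep(1)": -0.244,
    "Job(1)": -0.681,
    "Education(1)": -0.861,
    "Other_Loan(1)": 1.749,
}


def percent_change_in_odds(b):
    """Percentage change in odds for a one unit increase: (Exp(B) - 1) * 100."""
    return (math.exp(b) - 1.0) * 100.0


changes = {name: percent_change_in_odds(b) for name, b in coefficients.items()}

for name, change in changes.items():
    # Positive values raise the odds of being a non-defaulter, negative values lower them.
    print(f"{name}: {change:+.1f} %")
```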
Gender (1): This is a dichotomous independent variable and we considered the
male group (male = 1, female = 0) as our reference category. The value we
estimated is the estimated logistic regression coefficient for a one unit change
in gender, given the other variables in the model are held constant. The logit