10.2 SPSS Output Interpretation for Hierarchical Clustering
Table 10.2 Agglomeration schedule for first 20 stages of clustering

        Cluster combined                       Stage cluster first appears
Stage   Cluster 1   Cluster 2   Coefficients   Cluster 1   Cluster 2   Next stage
1       8           10          0.000          0           0           61
2       14          99          0.500          0           0           34
3       11          98          1.000          0           0           28
4       32          93          1.500          0           0           18
5       90          91          2.000          0           0           29
6       19          88          2.500          0           0           50
7       74          86          3.000          0           0           55
8       77          83          3.500          0           0           30
9       30          81          4.000          0           0           38
10      37          80          4.500          0           0           37
11      55          73          5.000          0           0           52
12      26          60          5.500          0           0           76
13      57          59          6.000          0           0           60
14      1           58          6.500          0           0           41
15      38          54          7.000          0           0           36
16      2           46          7.500          0           0           29
17      20          35          8.000          0           0           49
18      16          32          8.833          0           4           67
19      84          85          9.833          0           0           42
20      69          76          10.833         0           0           56
clustered in fourth stage. In the last column, Next Stage indicates the stage at
which the newly formed cluster is combined again. For example, in stage 1 the
Next Stage value is 61, which is the stage at which the cluster formed in stage 1
(cases 8 and 10) merges with another case or cluster.
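As a rough illustration of how the columns of an agglomeration schedule relate to one another, the sketch below rebuilds an SPSS-style schedule from a list of merges. It assumes the merge list uses the row layout of SciPy's scipy.cluster.hierarchy.linkage output ([cluster a, cluster b, coefficient, size], with clusters formed by merging numbered n, n + 1, and so on). The function name agglomeration_schedule, the dictionary keys, and the five-case merge list Z are hypothetical and are not taken from the SPSS output.

```python
def agglomeration_schedule(Z):
    """Convert linkage-style merge rows into SPSS-style schedule rows."""
    n = len(Z) + 1  # number of cases
    # Every cluster is labelled by its lowest 1-based case number.
    label = {i: i + 1 for i in range(n)}
    schedule = []
    for stage, (a, b, coef, _size) in enumerate(Z, start=1):
        members = []
        for cid in (int(a), int(b)):
            # Ids >= n are clusters formed at an earlier stage; 0 marks a single case.
            first = cid - n + 1 if cid >= n else 0
            if first:
                # The earlier cluster is used again here, so this is its "next stage".
                schedule[first - 1]["next_stage"] = stage
            members.append((label[cid], first))
        members.sort()
        (c1, f1), (c2, f2) = members
        label[n + stage - 1] = c1  # the merged cluster keeps the lower case number
        schedule.append({
            "stage": stage, "cluster_1": c1, "cluster_2": c2,
            "coefficient": coef, "first_1": f1, "first_2": f2,
            "next_stage": 0,  # stays 0 for the final merge
        })
    return schedule


# Hypothetical merge list for five cases (not the 99-case data of Table 10.2).
Z = [
    [0, 1, 0.1, 2],   # stage 1: cases 1 and 2
    [2, 3, 0.3, 2],   # stage 2: cases 3 and 4
    [5, 6, 5.1, 4],   # stage 3: the stage-1 and stage-2 clusters
    [4, 7, 17.4, 5],  # stage 4: case 5 joins the stage-3 cluster
]
schedule = agglomeration_schedule(Z)
for row in schedule:
    print(row)
```

In this toy schedule, stage 3 reports 1 and 2 under "stage cluster first appears", because both of its clusters were formed earlier, in the same way that stage 18 of Table 10.2 reports 4 for case 32.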
Another way to follow the formation of clusters in hierarchical cluster analysis
is the icicle plot shown in Fig. 10.1. The plot resembles a row of icicles
hanging from eaves. Figure 10.1 is a vertical icicle plot, in which the
rows represent the cases being clustered (here, 99 cases) and the
columns represent the stages involved in the formation of clusters. The vertical
icicle plot should be read from left to right. Reading from left to right, there
is no white space between case numbers 8 and 10, which supports the
agglomeration schedule: cases 8 and 10 are combined in the first stage. Moving
further to the right, there is only a little white space between case numbers 14
and 99, so we can infer that these two cases are combined in the second stage.
This process continues until all the cases belong to a single cluster.
Yet another graphical way to identify the number of clusters is the dendrogram
shown in Fig. 10.2. It is a tree diagram and is considered a critical component
of hierarchical clustering output. This graphical output displays the relative
similarity between the cases considered for cluster analysis. In
10
Cluster Analysis
Fig. 10.2 Dendrogram
Fig. 10.2, the dendrogram is read from left to right. The upper part of the
diagram is labelled "Rescaled Distance Cluster Combine". This shows that cluster
distances are rescaled so that the output ranges from 0 to 25, in which 0
represents no distance and 25 represents the highest distance.
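To make the idea of a 0 to 25 scale concrete, the following minimal sketch rescales a list of fusion distances proportionally, so that 0 stays at 0 and the largest distance maps to 25. This proportional rule is only an assumption used for illustration; the exact formula SPSS applies is not stated in the text. The function name rescale_distances and the sample distances are hypothetical.

```python
def rescale_distances(coefficients, top=25.0):
    """Scale distances proportionally so that the largest one equals `top`."""
    largest = max(coefficients)
    if largest == 0:
        # All merges happen at zero distance; nothing to rescale.
        return [0.0 for _ in coefficients]
    return [top * c / largest for c in coefficients]


# Hypothetical fusion distances, not taken from Fig. 10.2.
fusion = [0.0, 2.0, 5.0, 10.0]
rescaled = rescale_distances(fusion)
print(rescaled)  # [0.0, 5.0, 12.5, 25.0]
```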
10.2.1 Step 4: Decide Number of Clusters to be Retained in the Final Cluster Solution
In hierarchical cluster analysis, a key step is deciding the number of
clusters. There are no hard and fast rules for the final cluster solution, but a
few guidelines can be followed when deciding the number of clusters. The
following are some of the general guidelines for deciding the number of clusters
(Table 10.3):
Tables 10.4, 10.5, and 10.6 show the relative cluster size (frequency
distribution) of the two-, three-, and four-cluster solutions, respectively. From
the tables, it is evident that the distribution of cases in the four-cluster
solution is more or less equal, and more even than in the two- and three-cluster
solutions.
Table 10.3 Cluster solution determination

Theoretical base: In this method, the theoretical base or the experience of the
researcher is used to decide the number of clusters.

Agglomeration schedule: This method of cluster determination is not possible in
all cases. In a good cluster solution, a sudden jump appears when reading down
the agglomeration schedule coefficients. The point just before the sudden jump
in the coefficient column is the stopping point for merging clusters.

Dendrogram: Drawing an imaginary vertical line through the dendrogram shows the
number of clusters in the solution. The number of cut points indicates the
number of clusters: if there are four cut points, there are four clusters.

Relative cluster size: In this case, one can restrict the solutions to a limited
number of clusters (e.g., 3, 4, or 5), which gives a series of cluster solutions
from 2 up to that number. Then draw up the relative frequency of cases in each
cluster. Finally, select the solution in which the relative frequency
distribution of cases is almost equal across clusters.
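The agglomeration schedule guideline can be sketched in a few lines: find the largest increase between consecutive coefficients and stop at the stage just before it. Because each stage merges two clusters into one, stopping after stage s leaves n - s clusters for n cases. The function names and the coefficient values below are hypothetical and serve only to illustrate the rule; they are not the coefficients of the worked example.

```python
def stage_before_largest_jump(coefficients):
    """Return the 1-based stage just before the largest increase in the coefficients."""
    jumps = [later - earlier
             for earlier, later in zip(coefficients, coefficients[1:])]
    return jumps.index(max(jumps)) + 1


def clusters_remaining(n_cases, stage):
    """Each stage merges two clusters into one, so n_cases - stage clusters remain."""
    return n_cases - stage


# Hypothetical schedule for 7 cases (6 stages), not the data of Table 10.2.
coefficients = [1.0, 1.5, 2.1, 2.8, 9.5, 10.1]
stop = stage_before_largest_jump(coefficients)
k = clusters_remaining(7, stop)
print(stop, k)  # 4 3
```

A largest-jump rule is only one way to operationalise "sudden jump"; when the coefficients rise smoothly, no clear stopping point exists, which matches the remark that this method is not possible in all cases.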
Table 10.4 Two clusters

          Frequency   Percent   Valid percent   Cumulative percent
Group 1   23          23.2      23.2            23.2
Group 2   76          76.8      76.8            100.0
Total     99          100.0     100.0
Table 10.5 Three clusters

          Frequency   Percent   Valid percent   Cumulative percent
Group 1   23          23.2      23.2            23.2
Group 2   60          60.6      60.6            83.8
Group 3   16          16.2      16.2            100.0
Total     99          100.0     100.0

Table 10.6 Four clusters

          Frequency   Percent   Valid percent   Cumulative percent
Group 1   23          23.2      23.2            23.2
Group 2   24          24.2      24.2            47.5
Group 3   16          16.2      16.2            63.6
Group 4   36          36.4      36.4            100.0
Total     99          100.0     100.0
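The percent and cumulative percent columns of Tables 10.4 to 10.6 follow directly from the cluster frequencies, as the short sketch below shows. The helper size_spread (largest minus smallest cluster share) is only one illustrative way to compare how even the solutions are; it is an assumption of this sketch and not a statistic reported by SPSS. All function names are hypothetical.

```python
def frequency_table(counts):
    """Return (group, frequency, percent, cumulative percent) rows."""
    total = sum(counts)
    rows = []
    running = 0
    for group, count in enumerate(counts, start=1):
        running += count
        rows.append((group, count,
                     round(100 * count / total, 1),
                     round(100 * running / total, 1)))
    return rows


def size_spread(counts):
    """Difference between the largest and smallest cluster share, in percent."""
    total = sum(counts)
    shares = [100 * c / total for c in counts]
    return max(shares) - min(shares)


# Cluster frequencies from Tables 10.4, 10.5, and 10.6.
solutions = {2: [23, 76], 3: [23, 60, 16], 4: [23, 24, 16, 36]}

four_cluster = frequency_table(solutions[4])
for row in four_cluster:
    print(row)

# A smaller spread means a more even distribution of cases across clusters.
spreads = {k: size_spread(counts) for k, counts in solutions.items()}
print(min(spreads, key=spreads.get))  # 4
```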
10.2.2 Step 5: Calculate Cluster Centroid and Give Meaning to Cluster Solution
After the final cluster solution has been determined (in this example, the final
solution has four clusters), it is very important to find the meaning of the
cluster solution in terms of the importance of the cluster variate. This can be
achieved through the determination of cluster centroids. Here, we have generated
the cluster centroids using a multiple discriminant analysis. Table 10.7 shows
the results generated through discriminant analysis. From the results, it is
found that the people in group 1 (cluster 1) attach high importance to all seven
variables. We can call them "severe exploratory consumers". The second group of
consumers shows a buying tendency that is above average, so we can call them
"superior exploratory consumers". The third group of people shows a buying
behaviour tendency that is mediocre. Therefore, we will call them "mediocre
consumers". Finally, the last group is the lowest in this category, so we will
call them "stumpy consumers".
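A cluster centroid is simply the vector of variable means within a cluster. One quick consistency check, sketched below, is that the size-weighted average of the four centroids should reproduce the Total row of the group statistics. The means and cluster sizes are copied from Table 10.7; the function name pooled_mean and the variable names are hypothetical.

```python
# Cluster sizes and centroids (means of V1..V7) copied from Table 10.7.
sizes = {"Group 1": 23, "Group 2": 24, "Group 3": 16, "Group 4": 36}
centroids = {
    "Group 1": [4.6957, 5.3043, 5.7826, 5.1739, 5.0000, 5.4348, 5.0435],
    "Group 2": [3.6250, 4.0833, 4.0417, 3.0000, 3.9583, 4.5417, 4.0000],
    "Group 3": [2.9375, 3.7500, 3.9375, 3.1875, 3.8750, 1.7500, 2.8750],
    "Group 4": [4.5000, 4.8056, 5.1944, 3.8889, 4.6111, 3.3889, 4.1111],
}
# Total row of Table 10.7, used only for comparison.
reported_total = [4.0808, 4.5758, 4.8485, 3.8586, 4.4242, 3.8788, 4.1010]


def pooled_mean(centroids, sizes):
    """Size-weighted average of the cluster centroids, one value per variable."""
    total_n = sum(sizes.values())
    n_vars = len(next(iter(centroids.values())))
    return [
        sum(sizes[g] * centroids[g][j] for g in centroids) / total_n
        for j in range(n_vars)
    ]


overall = pooled_mean(centroids, sizes)
for name, value, reported in zip(
        ["V1", "V2", "V3", "V4", "V5", "V6", "V7"], overall, reported_total):
    print(name, round(value, 4), reported)
```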
10.2.3 Step 6: Assess the Cluster Validity and Model Fit
The following are some suggested procedures for confirming validity and
model fit in cluster analysis.
1. Run cluster analysis on the same data with different distance measures and
compare the results across distance measures. In addition, the researcher can
use different clustering methods on the same data and then analyse and compare
the results.
Table 10.7 Group statistics

                                                            Valid N (listwise)
Four clusters   Variable   Mean     Standard deviation   Unweighted   Weighted
Group 1         V1         4.6957   0.82212              23           23.000
                V2         5.3043   0.70290              23           23.000
                V3         5.7826   0.42174              23           23.000
                V4         5.1739   0.65033              23           23.000
                V5         5.0000   0.95346              23           23.000
                V6         5.4348   0.66237              23           23.000
                V7         5.0435   0.63806              23           23.000
Group 2         V1         3.6250   0.57578              24           24.000
                V2         4.0833   0.71728              24           24.000
                V3         4.0417   0.69025              24           24.000
                V4         3.0000   0.93250              24           24.000
                V5         3.9583   0.62409              24           24.000
                V6         4.5417   1.14129              24           24.000
                V7         4.0000   0.78019              24           24.000
Group 3         V1         2.9375   0.77190              16           16.000
                V2         3.7500   1.23828              16           16.000
                V3         3.9375   1.18145              16           16.000
                V4         3.1875   0.98107              16           16.000
                V5         3.8750   0.95743              16           16.000
                V6         1.7500   0.77460              16           16.000
                V7         2.8750   1.14746              16           16.000
Group 4         V1         4.5000   0.73679              36           36.000
                V2         4.8056   0.82183              36           36.000
                V3         5.1944   0.66845              36           36.000
                V4         3.8889   1.00791              36           36.000
                V5         4.6111   1.20185              36           36.000
                V6         3.3889   1.04957              36           36.000
                V7         4.1111   0.91894              36           36.000
Total           V1         4.0808   0.96549              99           99.000
                V2         4.5758   1.01107              99           99.000
                V3         4.8485   1.03375              99           99.000
                V4         3.8586   1.21227              99           99.000
                V5         4.4242   1.06991              99           99.000
                V6         3.8788   1.54704              99           99.000
                V7         4.1010   1.09260              99           99.000
2. Divide the data into two parts and perform cluster analysis on each half.
The cluster centroids of the split samples can then be compared for consistency.
3. Add variables to, or delete variables from, the original set, perform cluster
analysis, and compare the results for each set of variables.
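When two cluster solutions for the same cases are compared (for example, solutions obtained with different distance measures or clustering methods), some measure of agreement is needed. The sketch below uses the Rand index, the share of case pairs that both solutions treat the same way (together in both or apart in both). The choice of the Rand index is an assumption of this example, not a procedure prescribed in the text, and the function name and label vectors are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Share of case pairs on which two cluster solutions agree (0 to 1)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both solutions must cover the same cases")
    if len(labels_a) < 2:
        raise ValueError("at least two cases are needed to form a pair")
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Hypothetical memberships of four cases under two clustering runs.
run_1 = [1, 1, 2, 2]
run_2 = [2, 2, 1, 1]   # same grouping, different cluster numbers
run_3 = [1, 2, 1, 2]   # a different grouping

same_grouping = rand_index(run_1, run_2)
different_grouping = rand_index(run_1, run_3)
print(same_grouping, different_grouping)
```

Because the index compares pairs of cases rather than cluster numbers, it is unaffected by how the clusters happen to be numbered in each run.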
10.3 SPSS Procedure for Hierarchical Cluster Analysis
=> Open the data
=> Analyse => Classify => Hierarchical Cluster Analysis
=> Select all seven variables and place them in the Variables box
=> Statistics => Click on Agglomeration Schedule and Proximity matrix,
then click Continue