1…Steps for Conducting the Cluster Analysis
Fig. 10.1 ICICLE plot
10 Cluster Analysis
10.1 Steps for Conducting the Cluster Analysis
1. Problem Definition
2. Selection of Similarity Measure
3. Selection of Type of Clustering
4. Decide Number of Final Cluster Solution
5. Calculate Cluster Centroid and Give a Meaning to Clusters
6. Assess Cluster Validity and Model Fit
The following section details the steps for performing cluster analysis using an
example.
10.1.1 Step 1: Problem Definition
The first and crucial step in cluster analysis is to define or formulate the
problem precisely, in terms of properly defined variables. To understand cluster
analysis in detail, let us formulate a problem. A famous supermarket chain in the
country wants to identify the exploratory buying behaviour tendencies¹ of its
loyal customers. As part of its research, the company identified and prepared a
list of 99 customers from its customer database.
¹ Exploratory buying behaviour tendencies can be defined as the differentiating
variable of people’s disposition to engage in buying behaviour.
To understand the exploratory buying behaviour of these customers, the company
administered a questionnaire to the 99 loyal customers. The questionnaire
consists of seven exploratory buying behaviour statements, each measured on a
7-point rating scale where 1 means strongly disagree and 7 means strongly agree.
The statements used in the survey are as follows:
V1: Even though certain food products are available in a number of different
flavours, I tend to buy the same flavour which I have been using (Reverse coded).
V2: When I see a new brand on the shelf, I am not afraid of giving it a try
V3: I am very cautious in trying new or different products (Reverse coded)
V4: I enjoy taking chances in buying unfamiliar brands just to get some variety in
my purchase
V5: I do not like to shop around just out of curiosity
V6: I like to shop around and look at displays
V7: I do not like to talk to my friends about my purchases.
The data file attached with this note contains the scores of these 99 loyal
customers on the seven statements used to determine exploratory buying behaviour.
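Statements V1 and V3 are marked as reverse coded. A minimal sketch of what reverse coding means on a 7-point scale is given below; it assumes the raw file holds the responses as answered (which the text does not state), and the function names and the sample respondent are hypothetical.

```python
# Sketch only: assumes raw, unreversed responses on a 1-7 scale.
SCALE_MIN, SCALE_MAX = 1, 7
REVERSE_CODED = {"V1", "V3"}  # items marked "(Reverse coded)" in the list above


def reverse_code(score, low=SCALE_MIN, high=SCALE_MAX):
    """Mirror a score on the scale: 1 becomes 7, 2 becomes 6, and so on."""
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside the {low}-{high} scale")
    return low + high - score


def recode(responses):
    """Return a copy of one respondent's answers with reverse-coded items mirrored."""
    return {
        item: reverse_code(score) if item in REVERSE_CODED else score
        for item, score in responses.items()
    }


# Hypothetical respondent, not taken from the data file.
respondent = {"V1": 2, "V2": 6, "V3": 1, "V4": 5, "V5": 3, "V6": 6, "V7": 4}
recoded = recode(respondent)
```

After recoding, a high score on V1 or V3 points in the same direction as a high score on the items that are not reverse coded.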
10.1.2 Step 2: Selection of Appropriate Distance or Similarity Measure
As discussed in the introduction, cluster analysis works on the basis of distance
or similarity. It is therefore essential to select an appropriate distance
measure by which the similarity between cases or subjects can be identified, so
that similar objects are grouped into the same cluster.
Two widely accepted distance measures for assessing similarity between objects
are the Squared Euclidean Distance and the Euclidean Distance. The squared
Euclidean distance is the sum of squared differences between the values for each
variable. The Euclidean distance is the square root of that sum. Other distance
measures are available, but they are beyond the scope of this book; therefore, in
the example above, we limit the distance measure to either Squared Euclidean
Distance or Euclidean Distance.
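The two definitions above can be written directly as code. The sketch below is illustrative: the function names are our own, and the two respondents are hypothetical 7-point answers, not rows from the data file.

```python
from math import sqrt


def squared_euclidean(x, y):
    """Sum of squared differences between the values for each variable."""
    if len(x) != len(y):
        raise ValueError("both cases must have the same number of variables")
    return sum((a - b) ** 2 for a, b in zip(x, y))


def euclidean(x, y):
    """Square root of the squared Euclidean distance."""
    return sqrt(squared_euclidean(x, y))


# Hypothetical answers of two respondents to statements V1-V7.
case_a = [1, 2, 3, 4, 5, 6, 7]
case_b = [2, 2, 5, 4, 5, 4, 7]

d2 = squared_euclidean(case_a, case_b)  # 1 + 0 + 4 + 0 + 0 + 4 + 0 = 9
d = euclidean(case_a, case_b)           # sqrt(9) = 3.0
```

Both measures rank pairs of cases in the same order; the squared version simply gives more weight to large differences.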
10.1.3 Step 3: Selection of Clustering Type
There are two major types of cluster analysis: (1) Hierarchical Cluster Analysis
and (2) Non-Hierarchical Cluster Analysis. Hierarchical cluster analysis
involves n − 1 clustering decisions, where n is the number of observations. This
type of cluster analysis combines observations into a hierarchy or tree-like
structure. Hierarchical clustering techniques can be further classified into
Agglomerative Clustering and Divisive Clustering. In the agglomerative method,
clustering starts with each object or observation as a single cluster, that is,
with n clusters, and ends with a single cluster. For example, if we cluster 100
observations using the agglomerative hierarchical procedure, clustering starts
with 100 individual clusters, each containing a single observation. At the first
step, the two most similar objects are combined, leaving 99 clusters. At the next
step, the next most similar pair of objects or clusters is combined, leaving 98
clusters. This process continues until the last step, where the final two
remaining clusters are combined into a single cluster. Divisive clustering is the
opposite of agglomerative clustering: it starts with a single cluster containing
all objects and ends with n clusters, each object being a single-member cluster.
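The agglomerative process described above can be sketched in a few lines. This is a naive illustration, not the procedure SPSS uses: the merging rule (shortest Euclidean distance between any two members of different clusters) is an assumption made here to keep the example short, and the points are hypothetical.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def agglomerate(points):
    """Merge the two closest clusters until one remains; return the merges."""
    clusters = [[i] for i in range(len(points))]  # start: one cluster per case
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Assumed rule: shortest distance between any pair of members.
                d = min(dist(points[p], points[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Five hypothetical cases measured on two variables.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
merges = agglomerate(points)  # n = 5 cases, so n - 1 = 4 merges
```

Each pass through the loop is one clustering decision, which is why n cases always produce n − 1 stages.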
There are five major approaches to agglomerative clustering:
1. Single Linkage or Nearest Neighbour Method
This method of agglomerative clustering is based on the principle of shortest
distance. The similarity between two clusters is defined as the shortest
distance from any object in one cluster to any object in the other.
2. Complete Linkage or Farthest Neighbour Method
This method is similar to the single linkage algorithm, except that cluster
similarity is based on the maximum distance between observations in the two
clusters. In this method, all objects in a cluster are linked to each other at
some maximum distance.
3. Average Linkage
In this method, the similarity of any two clusters is the average similarity of
all individuals in one cluster with all individuals in the other. This is the
most preferred of the linkage methods because it does not rely on two extreme
members when measuring the distance between two clusters; instead, it considers
all the objects of the clusters.
4. Centroid Method
In this method, the similarity between two clusters is the distance between the
cluster centroids. The cluster centroid is the vector of mean values of the
observations on the variables in the cluster variate. In this technique, a new
centroid is computed every time individuals are grouped.
5. Ward’s Method
This method follows a variance-based approach in which the within-cluster
variance is minimised. In Ward’s method, the distance is the total sum of squared
deviations (error sum of squares). At each stage, the two clusters joined are
those that produce the smallest increase in the overall within-cluster sum of
squared distances.
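To make the five definitions concrete, the sketch below computes each between-cluster measure for two small clusters. It is illustrative only: the function names are our own, plain Euclidean distance is assumed for the three linkage methods and the centroid method, and the two clusters are hypothetical.

```python
from itertools import product
from math import dist  # Euclidean distance, available from Python 3.8


def centroid(cluster):
    """Mean value of each variable over the members of the cluster."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]


def ess(cluster):
    """Error sum of squares: squared deviations of members from the centroid."""
    c = centroid(cluster)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in cluster)


def single_linkage(a, b):
    return min(dist(p, q) for p, q in product(a, b))  # shortest pairwise distance


def complete_linkage(a, b):
    return max(dist(p, q) for p, q in product(a, b))  # largest pairwise distance


def average_linkage(a, b):
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def centroid_distance(a, b):
    return dist(centroid(a), centroid(b))


def ward_increase(a, b):
    """Increase in total error sum of squares if a and b were merged."""
    return ess(a + b) - ess(a) - ess(b)


# Two hypothetical clusters of two cases each, measured on two variables.
a = [(0, 0), (0, 2)]
b = [(3, 0), (3, 2)]
```

For these clusters the single linkage and centroid distances are both 3, the complete linkage distance is the square root of 13, and merging would raise the error sum of squares by 9. At each stage an agglomerative procedure merges the pair of clusters for which the chosen measure is smallest.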
10.2 SPSS Output Interpretation for Hierarchical Clustering
Table 10.1 shows the proximity matrix for the first eight cases. It is a data
matrix that contains the squared Euclidean distance for each pair of objects.
The value for case 1 versus case 2 is 29.0; it is the sum of squared differences
between the values of these two cases on each variable, that is, the extent of
dissimilarity between the two cases. A larger value indicates a larger difference
or dissimilarity between the pair of objects. This table gives the initial
information about the clustering of cases.
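As a small illustration of how such a matrix is read, the sketch below stores the Table 10.1 values and looks up the most similar pair among the first eight cases. The helper names are our own. Note that this covers only the eight cases shown; pairs involving the other 91 cases can be closer still.

```python
# Squared Euclidean distances for cases 1-8, copied from Table 10.1.
D = [
    [0.0, 29.0, 30.0, 26.0, 35.0, 13.0, 61.0, 11.0],
    [29.0, 0.0, 5.0, 9.0, 16.0, 12.0, 22.0, 18.0],
    [30.0, 5.0, 0.0, 10.0, 11.0, 11.0, 13.0, 19.0],
    [26.0, 9.0, 10.0, 0.0, 13.0, 5.0, 15.0, 9.0],
    [35.0, 16.0, 11.0, 13.0, 0.0, 10.0, 14.0, 12.0],
    [13.0, 12.0, 11.0, 5.0, 10.0, 0.0, 20.0, 4.0],
    [61.0, 22.0, 13.0, 15.0, 14.0, 20.0, 0.0, 32.0],
    [11.0, 18.0, 19.0, 9.0, 12.0, 4.0, 32.0, 0.0],
]


def is_symmetric(m):
    """A distance matrix must satisfy d(i, j) == d(j, i)."""
    n = len(m)
    return all(m[i][j] == m[j][i] for i in range(n) for j in range(n))


def closest_pair(m):
    """Return (distance, case_i, case_j) for the smallest off-diagonal entry."""
    n = len(m)
    return min((m[i][j], i + 1, j + 1) for i in range(n) for j in range(i + 1, n))


distance, case_i, case_j = closest_pair(D)
```

Among these eight cases, the smallest distance is 4.0, between cases 6 and 8, while cases 1 and 7 are the most dissimilar at 61.0.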
Table 10.2 presents the agglomeration schedule for the first 20 stages of
clustering. The agglomeration schedule is a table that shows how the subjects are
joined at each stage of the cluster analysis. The first column indicates the
stage of clustering; for 99 cases, there are 98 stages. The second column,
Cluster Combined, has two subcolumns, Cluster 1 and Cluster 2, which show the two
cases or clusters combined at each stage. For example, in this study, the cluster
analysis starts by combining case numbers 8 and 10. The third column,
Coefficients, shows the distance. In our example, the distance between case
numbers 8 and 10 is 0.000, the lowest value among all pairs; therefore, cases 8
and 10 form the first cluster. In the second stage, case numbers 14 and 99 are
combined to form the second cluster, because their coefficient value of 0.500 is
the second lowest. The fourth column, Stage Cluster First Appears, has two
subcolumns that indicate the stage at which the respective cluster was previously
formed. For example, in the first row (stage 1), the two zeros under Cluster 1
and Cluster 2 signify that the cases (8 in Cluster 1 and 10 in Cluster 2) had not
been clustered previously. In stage 18, the value 4 appears under subcolumn
Cluster 2: clusters 16 and 32 are combined at stage 18, and Cluster 2 (case 32)
had already been clustered in the fourth stage. The last column, Next Stage,
indicates the stage at which the newly formed cluster is next combined. For
example, in stage 1 the Next Stage value is 61, which is the stage at which the
cluster formed in the first stage (cases 8 and 10) is combined again with another
case or cluster.

Table 10.1 Distance measure for first eight cases
Proximity matrix (squared Euclidean distance)
Case        1        2        3        4        5        6        7        8
1       0.000   29.000   30.000   26.000   35.000   13.000   61.000   11.000
2      29.000    0.000    5.000    9.000   16.000   12.000   22.000   18.000
3      30.000    5.000    0.000   10.000   11.000   11.000   13.000   19.000
4      26.000    9.000   10.000    0.000   13.000    5.000   15.000    9.000
5      35.000   16.000   11.000   13.000    0.000   10.000   14.000   12.000
6      13.000   12.000   11.000    5.000   10.000    0.000   20.000    4.000
7      61.000   22.000   13.000   15.000   14.000   20.000    0.000   32.000
8      11.000   18.000   19.000    9.000   12.000    4.000   32.000    0.000

Table 10.2 Agglomeration schedule for first 20 stages of clustering
Agglomeration schedule
       Cluster combined                    Stage cluster first appears
Stage  Cluster 1  Cluster 2  Coefficients  Cluster 1  Cluster 2  Next stage
1      8          10         0.000         0          0          61
2      14         99         0.500         0          0          34
3      11         98         1.000         0          0          28
4      32         93         1.500         0          0          18
5      90         91         2.000         0          0          29
6      19         88         2.500         0          0          50
7      74         86         3.000         0          0          55
8      77         83         3.500         0          0          30
9      30         81         4.000         0          0          38
10     37         80         4.500         0          0          37
11     55         73         5.000         0          0          52
12     26         60         5.500         0          0          76
13     57         59         6.000         0          0          60
14     1          58         6.500         0          0          41
15     38         54         7.000         0          0          36
16     2          46         7.500         0          0          29
17     20         35         8.000         0          0          49
18     16         32         8.833         0          4          67
19     84         85         9.833         0          0          42
20     69         76         10.833        0          0          56
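The Stage Cluster First Appears column can be reproduced from the Cluster Combined column alone. The sketch below does this for the 20 stages listed in Table 10.2. It assumes that a merged cluster keeps the label shown under Cluster 1, which is consistent with the table but is not stated in the text; the function name is our own.

```python
# (Cluster 1, Cluster 2) for stages 1-20, copied from Table 10.2.
combined = [
    (8, 10), (14, 99), (11, 98), (32, 93), (90, 91),
    (19, 88), (74, 86), (77, 83), (30, 81), (37, 80),
    (55, 73), (26, 60), (57, 59), (1, 58), (38, 54),
    (2, 46), (20, 35), (16, 32), (84, 85), (69, 76),
]


def first_appears(schedule):
    """For each stage, give the earlier stage at which each side was formed.

    A value of 0 means that side is still a single, unclustered case.
    """
    formed_at = {}  # cluster label -> stage at which that cluster was last formed
    rows = []
    for stage, (c1, c2) in enumerate(schedule, start=1):
        rows.append((formed_at.get(c1, 0), formed_at.get(c2, 0)))
        formed_at[c1] = stage    # assumption: merged cluster keeps the Cluster 1 label
        formed_at.pop(c2, None)  # the Cluster 2 label is absorbed
    return rows


rows = first_appears(combined)
stage_18 = rows[17]  # clusters 16 and 32 are combined at stage 18
```

Every stage yields (0, 0) except stage 18, which yields (0, 4) because case 32 was already joined with case 93 at stage 4. The Next Stage column is the same bookkeeping read forwards, but reproducing it in full would need all 98 stages rather than the 20 shown.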
Another way to identify the formation of clusters in hierarchical cluster analysis
is the icicle plot shown in Fig. 10.1. The plot resembles a row of icicles
hanging from eaves. Figure 10.1 shows a vertical icicle plot, in which the rows
represent the cases (here 99 cases) being clustered and the columns represent the
stages involved in the formation of clusters. The vertical icicle plot should be
read from left to right. Reading from left to right, there is no white space
between case numbers 8 and 10, which supports the agglomeration schedule: cases 8
and 10 are clustered in the first stage. Moving further to the right, there is a
small amount of white space between case numbers 14 and 99, so we can infer that
case numbers 14 and 99 are clustered in the second stage. This process continues
until all the cases have been merged into a single cluster.
Yet another graphical way to identify the number of clusters is to look at the
dendrogram shown in Fig. 10.2. It is a tree diagram and is considered a critical
component of hierarchical clustering output. This graphical output displays the
relative similarity between the cases considered for cluster analysis. In