
Fig. 10.1 ICICLE plot

10 Cluster Analysis

10.1 Steps for Conducting the Cluster Analysis

1. Problem Definition
2. Selection of Similarity Measure
3. Selection of Type of Clustering
4. Decide Number of Final Cluster Solution
5. Calculate Cluster Centroid and Give a Meaning to Clusters
6. Assess Cluster Validity and Model Fit

The following sections describe the steps for performing cluster analysis in detail, using an example.

10.1.1 Step 1: Problem Definition

The first and most crucial step in cluster analysis is to formulate the problem precisely, in terms of properly defined variables. To illustrate cluster analysis in detail, let us formulate a problem. A well-known supermarket chain in the country wants to identify the exploratory buying behaviour tendencies¹ of its loyal customers. As part of its research, the company prepared a list of 99 customers from its customer database.

¹ Exploratory buying behaviour tendencies can be defined as the differentiating variable of people’s disposition to engage in buying behaviour.

To understand the exploratory buying behaviour of these customers, the company administered a questionnaire to all 99 of them. The questionnaire consists of seven exploratory buying behaviour statements, each measured on a 7-point rating scale where 1 means strongly disagree and 7 means strongly agree. The statements used in the survey are as follows:

V1: Even though certain food products are available in a number of different flavours, I tend to buy the same flavour which I have been using (reverse coded).
V2: When I see a new brand on the shelf, I am not afraid of giving it a try.
V3: I am very cautious in trying new or different products (reverse coded).
V4: I enjoy taking chances in buying unfamiliar brands just to get some variety in my purchase.
V5: I do not like to shop around just out of curiosity.
V6: I like to shop around and look at displays.
V7: I do not like to talk to my friends about my purchases.

The data file accompanying this note gives the scores of these 99 loyal customers on the seven statements, which are used to determine their exploratory buying behaviour.
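Items marked as reverse coded are normally flipped before analysis, so that a high score points in the same direction on every statement. The following is a minimal sketch in Python, assuming the common convention for a 7-point scale (recoded score = 8 minus raw score). The function names and the respondent's scores are hypothetical and are not taken from the data file.

```python
# Reverse-code selected items on a 7-point scale
# (1 = strongly disagree, 7 = strongly agree).
# Assumption: recoded score = (scale_min + scale_max) - raw score, i.e. 8 - raw.

SCALE_MIN, SCALE_MAX = 1, 7
REVERSE_CODED = {"V1", "V3"}  # the items the text marks as reverse coded


def reverse_code(score, scale_min=SCALE_MIN, scale_max=SCALE_MAX):
    """Mirror a rating around the midpoint of the scale."""
    if not scale_min <= score <= scale_max:
        raise ValueError(f"score {score} is outside the {scale_min}-{scale_max} scale")
    return scale_min + scale_max - score


def recode_respondent(responses):
    """Return a copy of one respondent's answers with reverse-coded items flipped."""
    return {
        item: reverse_code(score) if item in REVERSE_CODED else score
        for item, score in responses.items()
    }


# Hypothetical respondent (illustrative values only).
raw = {"V1": 2, "V2": 6, "V3": 1, "V4": 5, "V5": 3, "V6": 7, "V7": 4}
recoded = recode_respondent(raw)
print(recoded)
```

Only V1 and V3 change here; the midpoint of the scale (4) maps to itself.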

10.1.2 Step 2: Selection of Appropriate Distance or Similarity Measure

As discussed in the introduction, cluster analysis works on the basis of distance or similarity. It is therefore essential to select an appropriate distance measure with which the similarity between cases or subjects can be assessed, so that similar objects are grouped in the same cluster.

Two distance measures are widely accepted and used for measuring similarity between objects: Squared Euclidean Distance and Euclidean Distance. The squared Euclidean distance is the sum of the squared differences between the values for each variable. The Euclidean distance is the square root of that sum. Other distance measures are available, but they are beyond the scope of this book; in the example above, we therefore limit the distance measure to either Squared Euclidean Distance or Euclidean Distance.
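The two definitions can be written out directly. The sketch below is an illustration in Python; the function names are hypothetical and the two seven-item response vectors are made-up values, not cases from the data file.

```python
import math


def squared_euclidean(x, y):
    """Sum of squared differences between two cases across all variables."""
    if len(x) != len(y):
        raise ValueError("both cases must have the same number of variables")
    return sum((a - b) ** 2 for a, b in zip(x, y))


def euclidean(x, y):
    """Square root of the squared Euclidean distance."""
    return math.sqrt(squared_euclidean(x, y))


# Two hypothetical respondents scored on V1..V7 (illustrative values only).
case_a = [2, 6, 1, 5, 3, 7, 4]
case_b = [4, 5, 3, 5, 2, 4, 6]

print(squared_euclidean(case_a, case_b))  # 4 + 1 + 4 + 0 + 1 + 9 + 4 = 23
print(euclidean(case_a, case_b))
```

Because the square root is monotonic, both measures rank pairs of cases in the same order; they differ only in how strongly large differences are weighted when distances are averaged or summed.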

10.1.3 Step 3: Selection of Clustering Type

There are two major types of cluster analysis: (1) Hierarchical Cluster Analysis and (2) Non-Hierarchical Cluster Analysis. Hierarchical cluster analysis involves n-1 clustering decisions, where n is the number of observations. This type of cluster analysis combines observations into a hierarchy or tree-like structure. Hierarchical clustering techniques can be further classified into Agglomerative Clustering and Divisive Clustering. In the agglomerative method, clustering starts with each object or observation as a single cluster, that is, with N clusters, and ends with a single cluster. For example, if we cluster 100 observations using the agglomerative hierarchical procedure, clustering starts with 100 individual clusters, each containing a single observation. In the first step, the two most similar objects are combined, leaving 99 clusters. In the next step, the two most similar of the remaining clusters are combined, leaving 98 clusters. This process continues until the last step, where the final two remaining clusters are combined into a single cluster. Divisive clustering is the opposite of agglomerative clustering: it starts with a single cluster containing all objects and ends with N clusters, each object forming a single-member cluster.
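The agglomerative process described above can be sketched in a few lines. This is a naive illustration in Python, not the procedure SPSS uses internally. It assumes, for simplicity, that the distance between two clusters is the shortest squared Euclidean distance between their members; the points and function names are hypothetical.

```python
from itertools import combinations


def sq_dist(x, y):
    """Squared Euclidean distance between two observations."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def cluster_distance(c1, c2, points):
    """Shortest squared distance between any member of c1 and any member of c2."""
    return min(sq_dist(points[i], points[j]) for i in c1 for j in c2)


def agglomerate(points):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [(i,) for i in range(len(points))]  # every observation starts alone
    history = []
    while len(clusters) > 1:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], points),
        )
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(tuple(sorted(c1 + c2)))
        history.append((c1, c2, len(clusters)))
    return history


# Five hypothetical observations on two variables.
points = [(1, 1), (1, 2), (5, 5), (7, 5), (9, 2)]
history = agglomerate(points)
for step, (c1, c2, remaining) in enumerate(history, start=1):
    print(f"step {step}: merge {c1} + {c2} -> {remaining} cluster(s) left")
```

With five observations the loop runs four times, which matches the n-1 clustering decisions mentioned in the text.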

There are five major approaches in agglomerative clustering:

1. Single Linkage or Nearest Neighbour Method
This method of agglomerative clustering is based on the principle of shortest distance. The similarity between two clusters is defined as the shortest distance from any object in one cluster to any object in the other.

2. Complete Linkage or Farthest Neighbour Method
This method is similar to the single linkage algorithm, except that cluster similarity is based on the maximum distance between observations in each cluster. In this method, all objects in a cluster are linked to each other at some maximum distance.

3. Average Linkage
In this method, the similarity of any two clusters is the average similarity of all individuals in one cluster with all individuals in the other. This method is often preferred among the linkage methods, because it does not depend on two extreme members when measuring the distance between two clusters; instead, it takes all the objects in each cluster into account.

4. Centroid Method
In this method, the similarity between two clusters is the distance between the cluster centroids. The cluster centroid is the set of mean values of the observations on the variables in the cluster variate. In this technique, a new centroid is computed every time individuals are grouped.

5. Ward’s Method
This method follows a variance-based approach in which the within-cluster variance is minimised. The distance measure is the total sum of squared deviations (error sum of squares). At each stage, the two clusters whose merger produces the smallest increase in the overall within-cluster sum of squared distances are joined.
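The five criteria listed above differ only in how the distance between two clusters is computed. The sketch below writes each one as a small Python function. It is an illustration under stated assumptions: Euclidean distance is used for the linkage and centroid methods, Ward's criterion is expressed as the increase in the error sum of squares caused by a merge, and all function names and data points are hypothetical.

```python
import math
from itertools import product


def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def dist(x, y):
    return math.sqrt(sq_dist(x, y))


def centroid(cluster):
    """Mean of the observations on each variable."""
    n = len(cluster)
    return tuple(sum(values) / n for values in zip(*cluster))


def error_sum_of_squares(cluster):
    """Sum of squared deviations of the members from the cluster centroid."""
    c = centroid(cluster)
    return sum(sq_dist(p, c) for p in cluster)


def single_linkage(a, b):
    return min(dist(p, q) for p, q in product(a, b))


def complete_linkage(a, b):
    return max(dist(p, q) for p, q in product(a, b))


def average_linkage(a, b):
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def centroid_distance(a, b):
    return dist(centroid(a), centroid(b))


def ward_increase(a, b):
    """Increase in the total error sum of squares if clusters a and b are merged."""
    return error_sum_of_squares(a + b) - error_sum_of_squares(a) - error_sum_of_squares(b)


# Two hypothetical clusters of two observations each.
cluster_a = [(0, 0), (0, 2)]
cluster_b = [(4, 0), (4, 2)]

for name, fn in [
    ("single", single_linkage),
    ("complete", complete_linkage),
    ("average", average_linkage),
    ("centroid", centroid_distance),
    ("ward increase", ward_increase),
]:
    print(f"{name}: {fn(cluster_a, cluster_b):.3f}")
```

For any pair of clusters the single linkage distance is never larger than the average linkage distance, which in turn is never larger than the complete linkage distance.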


10.2 SPSS Output Interpretation for Hierarchical Clustering

Table 10.1 shows the proximity matrix for the first eight cases. It is a data matrix that gives the squared Euclidean distance between each pair of objects. The value for case 1 versus case 2 is 29.0; it is the sum of squared differences between the values of these two cases on each variable, that is, the extent of dissimilarity between them. A larger value indicates a larger difference or dissimilarity between the pair of objects. This table gives the initial information about the clustering of cases.
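To make the reading of the proximity matrix concrete, the values of Table 10.1 can be searched for the most and least similar pairs. The sketch below is an illustration in Python, not SPSS output, and its function names are hypothetical. It covers only the first eight cases, so the closest pair it finds need not be the first merge in the full 99-case analysis.

```python
# Squared Euclidean distances for cases 1-8, copied from Table 10.1.
PROXIMITY = [
    [0, 29, 30, 26, 35, 13, 61, 11],
    [29, 0, 5, 9, 16, 12, 22, 18],
    [30, 5, 0, 10, 11, 11, 13, 19],
    [26, 9, 10, 0, 13, 5, 15, 9],
    [35, 16, 11, 13, 0, 10, 14, 12],
    [13, 12, 11, 5, 10, 0, 20, 4],
    [61, 22, 13, 15, 14, 20, 0, 32],
    [11, 18, 19, 9, 12, 4, 32, 0],
]


def is_symmetric(matrix):
    """A distance matrix must satisfy d(i, j) == d(j, i)."""
    n = len(matrix)
    return all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(n))


def pairs(matrix):
    """Yield (distance, case_i, case_j) for every pair above the diagonal (1-based)."""
    n = len(matrix)
    for i in range(n):
        for j in range(i + 1, n):
            yield matrix[i][j], i + 1, j + 1


closest = min(pairs(PROXIMITY))
farthest = max(pairs(PROXIMITY))

print("symmetric:", is_symmetric(PROXIMITY))
print("most similar pair:", closest)      # (distance, case, case)
print("most dissimilar pair:", farthest)  # (distance, case, case)
```

Among these eight cases, cases 6 and 8 are the most similar (distance 4) and cases 1 and 7 the most dissimilar (distance 61).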

Table 10.2 presents the agglomeration schedule for the first 20 stages of clustering. The agglomeration schedule is a table that shows how the subjects are joined at each stage of the cluster analysis. The first column, Stage, indicates the stage of clustering; for 99 cases there are 98 stages. The second column, Cluster Combined, has two subcolumns, Cluster 1 and Cluster 2, which show the two cases or clusters that are combined at each stage. In this study, for example, the cluster analysis starts by combining case numbers 8 and 10. The third column, Coefficients, shows the distance at which the clusters are combined. In our example, the distance between case numbers 8 and 10 is 0.000, the lowest value among all pairs, so cases 8 and 10 form the first cluster. In the second stage, case numbers 14 and 99 are combined to form the second cluster, because their coefficient of 0.500 is the second lowest. The fourth column, Stage Cluster First Appears, also has two subcolumns, which indicate the stage at which the respective clusters were formed previously. For example, in the first row (stage 1), the two zeros under Cluster 1 and Cluster 2 signify that cases 8 and 10 had not been clustered before the first stage. In stage 18, the value 4 appears under subcolumn Cluster 2: in this stage, clusters 16 and 32 are combined, and cluster 2 (case 32) had already been clustered in the fourth stage. The last column, Next Stage, indicates the stage at which the newly formed cluster is combined again. For example, the Next Stage value for stage 1 is 61, which is the stage at which the first-stage cluster (cases 8 and 10) is combined with another case or cluster.

Table 10.1 Distance measure for first eight cases

Proximity matrix (squared Euclidean distance)

Case        1        2        3        4        5        6        7        8
1       0.000   29.000   30.000   26.000   35.000   13.000   61.000   11.000
2      29.000    0.000    5.000    9.000   16.000   12.000   22.000   18.000
3      30.000    5.000    0.000   10.000   11.000   11.000   13.000   19.000
4      26.000    9.000   10.000    0.000   13.000    5.000   15.000    9.000
5      35.000   16.000   11.000   13.000    0.000   10.000   14.000   12.000
6      13.000   12.000   11.000    5.000   10.000    0.000   20.000    4.000
7      61.000   22.000   13.000   15.000   14.000   20.000    0.000   32.000
8      11.000   18.000   19.000    9.000   12.000    4.000   32.000    0.000

Table 10.2 Agglomeration schedule for first 20 stages of clustering

Agglomeration schedule

         Cluster combined                       Stage cluster first appears
Stage    Cluster 1   Cluster 2   Coefficients   Cluster 1   Cluster 2   Next stage
1            8          10           0.000          0           0           61
2           14          99           0.500          0           0           34
3           11          98           1.000          0           0           28
4           32          93           1.500          0           0           18
5           90          91           2.000          0           0           29
6           19          88           2.500          0           0           50
7           74          86           3.000          0           0           55
8           77          83           3.500          0           0           30
9           30          81           4.000          0           0           38
10          37          80           4.500          0           0           37
11          55          73           5.000          0           0           52
12          26          60           5.500          0           0           76
13          57          59           6.000          0           0           60
14           1          58           6.500          0           0           41
15          38          54           7.000          0           0           36
16           2          46           7.500          0           0           29
17          20          35           8.000          0           0           49
18          16          32           8.833          0           4           67
19          84          85           9.833          0           0           42
20          69          76          10.833          0           0           56
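The Stage Cluster First Appears column can be derived from the Cluster Combined columns alone, which is a useful check on how the schedule is read. The following is a minimal sketch in Python. It assumes that a merged cluster keeps the smaller of the two case numbers as its label; this is an assumption made for illustration, and the function names are hypothetical. The data are the 20 rows of Table 10.2.

```python
# (Cluster 1, Cluster 2) for stages 1-20, copied from Table 10.2.
CLUSTER_COMBINED = [
    (8, 10), (14, 99), (11, 98), (32, 93), (90, 91),
    (19, 88), (74, 86), (77, 83), (30, 81), (37, 80),
    (55, 73), (26, 60), (57, 59), (1, 58), (38, 54),
    (2, 46), (20, 35), (16, 32), (84, 85), (69, 76),
]


def stage_first_appears(combined):
    """For each stage, give the earlier stage at which each cluster was formed.

    A value of 0 means the cluster is still a single, previously unmerged case.
    """
    formed_at = {}  # surviving cluster label -> stage at which it was last formed
    rows = []
    for stage, (c1, c2) in enumerate(combined, start=1):
        rows.append((formed_at.get(c1, 0), formed_at.get(c2, 0)))
        # Assumption: the merged cluster carries the smaller case number as its label.
        formed_at.pop(c1, None)
        formed_at.pop(c2, None)
        formed_at[min(c1, c2)] = stage
    return rows


def next_stage_within(combined):
    """For each stage, the later stage (among the rows given) that reuses its cluster.

    Returns None when the cluster is not merged again within the rows supplied.
    """
    result = []
    for stage, (c1, c2) in enumerate(combined, start=1):
        label = min(c1, c2)
        later = [
            t for t, pair in enumerate(combined, start=1)
            if t > stage and label in pair
        ]
        result.append(later[0] if later else None)
    return result


first_appears = stage_first_appears(CLUSTER_COMBINED)
next_stages = next_stage_within(CLUSTER_COMBINED)

print("stage 18:", first_appears[17])       # cluster 32 was formed at stage 4
print("stage 4 next stage:", next_stages[3])
```

With only the first 20 rows available, just one link can be reproduced: the cluster formed at stage 4 (cases 32 and 93) is merged again at stage 18, which agrees with both the Stage Cluster First Appears and Next Stage columns. All other Next Stage values in Table 10.2 refer to stages beyond the rows shown.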

Another way to identify the formation of clusters in hierarchical cluster analysis is through the icicle plot shown in Fig. 10.1. The plot resembles a row of icicles hanging from eaves. Figure 10.1 shows a vertical icicle plot, in which the rows represent the cases being clustered (here 99 cases) and the columns represent the stages involved in the formation of clusters. The vertical icicle plot should be read from left to right. Reading in this direction, there is no white space between case numbers 8 and 10, which supports the agglomeration schedule: cases 8 and 10 are clustered in the first stage. Moving further to the right, there is a small white space between case numbers 14 and 99, from which we can infer that these cases are clustered in the second stage. This process continues until all the cases have been merged into a single cluster.

Yet another graphical way to identify the number of clusters in the solution is to look at the dendrogram shown in Fig. 10.2. It is a tree diagram and is considered a critical component of hierarchical clustering output. This graphical output displays the relative similarity between the cases considered for cluster analysis.

Source: S. Sreejesh, Sanjay Mohapatra, M. R. Anusree, Business Research Methods: An Applied Orientation, Springer International Publishing (2014)
