Chapter 19. Clustering: Dendrograms and Heat Maps
> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

We would like to put the various models into clusters, such that similar cars are in the same cluster. There are two general approaches. The agglomerative method begins by making a cluster of the most closely matched pair of observations, then forms the next cluster from the next most closely matched pair, whether that is two single observations or a single observation and an existing cluster, and so on, until all the observations are in one big cluster. The other approach, the divisive method, breaks the total group into subgroups, those subgroups into further subgroups, and so on. The hclust() function uses agglomeration and offers several linkage methods; we will use the default method, "complete."
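For example, here is a minimal sketch that makes the linkage method explicit (it anticipates the dist() call explained below; "complete" is the default, so naming it changes nothing):

d = dist(scale(as.matrix(mtcars)))    # distance matrix, discussed next
fit = hclust(d, method = "complete")  # alternatives include "single" and "average"
plot(fit)                             # dendrogram, as in Figure 19-2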
How should we measure the similarity, or distance, between two observations? This involves finding a measure, combining all the available information, to determine the "distance" between one car model and another. If we had only one variable to consider, the absolute difference between the values of that variable for each car model would be the obvious choice. In our example, however, there are 11 variables, so we would like to have a distance measure that takes all 11 into account. Let's begin with a simpler example. Suppose that there are two cars, Car-1 and Car-2, each with measurements on two variables, x and y. So, Car-1 is a point (x1,y1) and Car-2 is a point (x2,y2). These two points are represented in the graph in the upper left of Figure 19-1. The shortest distance between the points is displayed by the solid line in Figure 19-1, in the graph in the upper-right corner.

Figure 19-1. Measuring distances in two-dimensional space.
Note that this line is the hypotenuse of a right triangle, so it is easy
to calculate the distance:
distance = sqrt[(x2 - x1)^2 + (y2 - y1)^2]

Remember that the square of the hypotenuse equals the sum of the squares of the two sides. This distance, "as the crow flies," is called the Euclidean distance; it is the default distance measure used by the dist() function in R. There are several other ways to measure distance. One of them is shown in the graph in the lower left of Figure 19-1. This is the Manhattan option, also called "taxi cab" or "city block" distance. Depending on the particular problem you are solving, this measure might be more appropriate. R makes this option available as well as several others, but we will stick with Euclidean distance for this problem. If there are three variables, you can extend the Euclidean method as follows:

Euclidean distance = sqrt[(x2 - x1)^2 + (y2 - y1)^2 + (z2 - z1)^2]

Likewise, you can extend the measure to as many variables as needed.
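As a quick numerical check of these formulas, you can hand dist() a small matrix of made-up coordinates (the point values below are invented, echoing Figure 19-1):

pts = rbind(car1 = c(2, 3), car2 = c(5, 6))
dist(pts)                        # Euclidean (default): sqrt(3^2 + 3^2), about 4.243
dist(pts, method = "manhattan")  # city block: |5 - 2| + |6 - 3| = 6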

Putting Mathematical Expressions in Graphs
There are times when a mathematical formula or expression greatly enhances a graph. Fortunately, R accommodates this: you can pass the result of expression() as the text or label argument to any of the functions text(), mtext(), axis(), and legend(). The following script produced Figure 19-1, with the mathematical expressions supplied in the text() commands:
# script for Figure 19-1
par(mfrow = c(2, 2))
x = c(2, 5)
y = c(3, 6)
yp = c(0, 6)
xp = c(0, 8)
plot(x, y, pch = 16, xlim = xp, ylim = yp,
     xaxt = "n", yaxt = "n", bty = "l",
     main = "Variables x and y, Observations 1 and 2",
     cex.main = .9,
     ylab = "")
text(x = 3.2, y = 3,
     labels = expression(group("(", list(x[1], y[1]), ")")))
text(x = 6.2, y = 6,
     labels = expression(group("(", list(x[2], y[2]), ")")))
mtext(text = "y",
      side = 2, las = 1,
      cex = .8, line = 3)
plot(x, y, pch = 16, type = "o", xlim = xp, ylim = yp,
     main = "Euclidean distance",
     xaxt = "n", yaxt = "n", bty = "l", ylab = "")
text(3.6, 1.5, labels =
     expression(sqrt((x[2] - x[1])^2 + (y[2] - y[1])^2)))
lines(x, y, type = "s", lty = "dotted")
mtext(text = "y", side = 2, las = 1, cex = .8, line = 3)
plot(x, y,
     pch = 16, xlim = xp, ylim = yp,
     main = "Manhattan distance",
     xaxt = "n", yaxt = "n", bty = "l", ylab = "")
lines(x, y, type = "s")
text(3.6, 1.5,
     labels = expression(group("|", x[2] - x[1], "|") +
                         group("|", y[2] - y[1], "|")))
mtext(text = "y", side = 2, las = 1, cex = .8, line = 3)

For usage details, refer to the plotmath help file.

Notice that the values of the variables in mtcars vary widely. For example, disp has values well in excess of 100, but cyl is all in single digits. This means that disp will play a much more important role in determining the distance than cyl will, if only because of the scale on which it is measured. Imagine that two variables were measurements of length, but one was expressed in inches whereas the other was in feet. The exact same distance would be recorded as very different numbers, giving the variable with the higher numbers more influence on the Euclidean distance. For this reason, it makes sense to convert all the variables to a comparable measurement scale.
We can normalize (or "standardize") the data by applying a simple transformation that gives each variable a mean of 0 and a standard deviation of 1. Let's try this with mpg. First, attach the data frame so that its columns can be referenced by name, and get the mean and standard deviation of mpg:
> attach(mtcars)
> mean(mpg)
[1] 20.09062
> sd(mpg)
[1] 6.026948

If we subtract the mean from each value of mpg and divide that by
the standard deviation, we will have an mpg variable that has a mean
of 0 and standard deviation of 1:
> mpg2 = (mpg - 20.09)/6.026948
> mean(mpg2)
[1] 0.0001037009   # tiny round-off error!
> sd(mpg2)
[1] 1

This normalization process happens so frequently that R provides a function that makes it a one-step operation:
> mpg3 = scale(mpg)
> mean(mpg3)
[1] 7.112366e-17   # tiny, tiny; for all practical purposes = 0
> sd(mpg3)
[1] 1

Fortunately, we do not need to scale each variable: we can do an
entire matrix at once. Let's now convert the data frame to a matrix,
make the distance measurements on a scaled matrix, compute the
clusters, and plot the dendrogram:
# Figure 19-2
attach(mtcars)
cars = as.matrix(mtcars)   # convert to matrix; dist() requires it
h = dist(scale(cars))      # scale cars matrix & compute distance matrix
h2 = hclust(h)             # compute clusters
plot(h2)                   # plot dendrogram

The dendrogram in Figure 19-2 shows the results of the clustering
procedure.

Figure 19-2. Dendrogram of clusters in mtcars dataset.
The vertical scale, called “Height,” will help us to understand what
has happened. The figures that look rather like staples connect
observations in the same cluster. The lower down on the Height
scale the horizontal part of the staple is, the earlier that cluster was
formed. Thus, the staples that have a Height near zero were the first
ones formed and therefore are the closest in Euclidean distance.
Conversely, clusters with a Height close to eight were among the last formed and thus are relatively far apart. Clusters that are next to each other are not necessarily close! For example, look at the right-hand side of the graph. The two Mazda models are very close, having formed a cluster at a height of about 1. The Ford Pantera L, which is next to the Mazda cluster, is not especially close to the Mazdas, because it did not become part of a cluster with them until a height of about 5.
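The dendrogram is a picture; if you also want concrete cluster memberships, the base R function cutree() will cut the tree either at a chosen height (h) or into a chosen number of groups (k). A minimal sketch, with k picked arbitrarily for illustration:

groups = cutree(h2, k = 5)   # five clusters; any k could be tried
groups[c("Mazda RX4", "Mazda RX4 Wag", "Ford Pantera L")]
table(groups)                # how many models fell into each cluster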
It is also possible to cluster the variables, rather than the observations, by transposing the cars matrix; that is, making the first row become the first column, the first column become the first row, and so on:
newcars = t(cars) # newcars is the transpose of cars
h = dist(scale(newcars))
h2 = hclust(h)
plot(h2)
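An aside: a common alternative when clustering variables is to base the distance on correlations, so that variables that move together are treated as close. This is a different technique from the scaled Euclidean distance used above, sketched here under that assumption:

d_cor = as.dist(1 - cor(cars))   # correlation-based distance: 1 - r
plot(hclust(d_cor))              # dendrogram of the 11 variables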

Heat Maps
Another way to get an overview of all the numbers in the mtcars
dataset is to look at a heat map. In this kind of visualization, every
number in the standardized matrix is transformed into a colored
rectangle. This is done in a systematic way so that a color represents
the approximate value, or intensity, of the number. For instance, one
possible range of colors we might use runs from dark red for very
low numbers, through ever lighter shades of red, orange, yellow, and
finally white as the numbers become higher. This range of colors is
the default for the image() function, but many other color sets are
possible. A simple heat map on scaled values in the mtcars dataset
appears in Figure 19-3. The code that produced it follows:
# Figure 19-3
attach(mtcars)
cars = as.matrix(mtcars)
image(scale(cars))   # simple heat map

Figure 19-3. Heat map of mtcars dataset in default colors.
The col argument controls the color range in the image() function, and the rainbow() function is a convenient way to build a palette for it. Another reasonable color scheme is a range of blues, from very dark to very light. The following command shows how to invoke the blue range of colors:
# Figure 19-4: heat map with a range of blues
image(scale(cars), col = rainbow(256, start = .5, end = .6))

The result appears in Figure 19-4.

Figure 19-4. Heat map of mtcars dataset in a range of blue colors.
Not all color sets are easily interpreted. If you used all possible colors, it would be difficult to know whether, for instance, dark green was more positive or more negative than dark blue. The color schemes in Figure 19-3 and Figure 19-4, however, are relatively easy for most people to grasp. The start and end values passed to rainbow() must each be at least 0 but no larger than 1, and the two values must be different. You might experiment with different values and see if you find a combination that works as well for you as the two demonstrated here. For more information, type ?rainbow.
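For instance, one such experiment might be a band running from red to yellow (the start and end values here are just an illustrative pair, not a recommendation):

image(scale(cars), col = rainbow(256, start = 0, end = .15))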
The heat map in Figure 19-3 (also Figure 19-4) is turned on its side,
as if the data matrix fell to the left. If you count, you can find 11
rows and 32 columns, instead of the 11 columns and 32 rows in the
original dataset. Even though the colors show a wide range of values,
with many dark red (low) values and some pale yellow and white
(high) values, there does not seem to be any obvious pattern in the
graph.
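One way to confirm the orientation is to draw the row and column names yourself, since image() omits axis labels for a matrix. This sketch relies on image()'s default 0-to-1 coordinates for the at positions:

s = scale(cars)
image(s, xaxt = "n", yaxt = "n")
axis(1, at = seq(0, 1, length.out = nrow(s)),
     labels = rownames(s), las = 2, cex.axis = 0.5)   # 32 car models
axis(2, at = seq(0, 1, length.out = ncol(s)),
     labels = colnames(s), las = 1, cex.axis = 0.7)   # 11 variables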
We would like to find patterns in the data, just as we did with the cluster analysis. It is actually possible to combine the dendrogram and the heat map into one visual display to aid in understanding the relationships among the variables and particular car models. The heatmap() function can both perform clustering and make a heat map at the same time. Rows and/or columns are reordered to put like items together, and cells are colored appropriately. The command that follows produced Figure 19-5, using the default options:
> heatmap(scale(cars))   # Figure 19-5

See the help file for more information about the many options available, such as whether to include a row and/or column dendrogram, methods for measuring distances, how to weight rows and columns, and more.
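As one small illustration of those options, the following sketch keeps the row clustering but suppresses the column dendrogram:

heatmap(scale(cars), Colv = NA)   # NA means no column reordering or dendrogram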


Figure 19-5. A heat map of clusters in mtcars.
You can see some striking patterns in Figure 19-5. Notice how the colors set off some groups of car models from others. Compare those clusters to the ones indicated by the dendrogram on the left-hand side. We can see not only that certain models are in the same clusters, but that models within clusters, especially in the ones that were among the earliest formed, have similar color patterns among the variables.
A similar heat map is shown in Figure 19-6.

Figure 19-6. Heat map of mtcars, using heatmap.2() from the gplots
package.
Figure 19-6 was made by using the function heatmap.2() from the gplots package. There are a couple of extra features provided by heatmap.2() that make interpretation of the map a bit easier. First, there is a key in the upper-left corner that shows the relation of the colors to variable values. Second, there is a system of vertical lines running through each of the columns. The dotted line represents the value 0. The solid line shows how much the value in a particular cell varies, positively or negatively, from 0. This reinforces the key, giving a confirmation in each cell. The code to produce this figure follows:
# Figure 19-6
library(gplots)
heatmap.2(scale(cars))
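If the trace lines are more of a distraction than a help, heatmap.2() accepts a trace argument; a quick sketch:

heatmap.2(scale(cars), trace = "none")   # the default is trace = "column"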

Clustering is not an exact science; rather, it is a way of searching for order in complex data. Clustering algorithms, dendrograms, and heat maps are all tools to aid in that search.