Chapter 16. Scatter plot Matrices and Corrgrams
> library(Sleuth2)
> attach(ex1713)
> head(ex1713)
          Denomination Distinct Attend NonChurch StrongPct AnnInc
1     American Baptist      2.5   25.6      1.01      50.6  24000
2    Assemblies of God      4.8   35.4      0.68      58.6  27100
3             Catholic      3.0   26.4      1.43      40.0  32900
4  Disciples of Christ      2.1   24.3      2.58      47.0  28600
5            Episcopal      1.1   17.3      1.93      32.0  39000
6 Evangelical Lutheran      2.7   23.0      1.71      41.5  33700
To see the codebook for this data, type:
> ?ex1713
Here’s a brief summary of the codebook:
Distinct
The distinctiveness/strictness of discipline, on a seven-point scale
Attend
The average percentage of weekly attendance
NonChurch
The average number of secular organizations to which members belong
StrongPct
The average percentage of members who consider themselves strong
church members
AnnInc
The average annual income
The scatter plot matrix shown in Figure 16-1 was produced by using
the pairs() function. Note that the variable names are typed as a
formula, beginning with the ~ symbol, followed by the variable
names in the order in which they will appear on the graph, separated
by the + symbol. Further, you can add any of a number of special
arguments for this function, as well as par() arguments. For the
code to produce Figure 16-1, only the pch and col arguments have
been used:
# Figure 16-1: produce scatter plot matrix of denomination data
library(Sleuth2)
attach(ex1713)
pairs(~ Distinct + Attend + NonChurch + StrongPct + AnnInc,
      pch = 16,
      col = "deepskyblue")
Figure 16-1. A scatter plot matrix of the church denomination data.
In the scatter plot matrix in Figure 16-1, each variable is plotted
against every other variable, twice. In each pair, a given variable is
once the x-variable and once the y-variable. For example, in the
second row, the variable Attend is the y-variable in each of the four
scatter plots, and each of the other four variables is the x-variable
once. In the second column, Attend is the x-variable in each of the
four scatter plots, and each of the other variables is the y-variable
one time.
Looking across the second row, we can see that Attend has a positive
association with Distinct; that is, as one of these increases, the
other does also. Likewise, there is a positive association between
Attend and StrongPct. In contrast, Attend has negative associations
with NonChurch and AnnInc; as one increases, the other decreases.
However, these negative associations are not as strong as the positive
ones. In other words, the points in the negative associations do not
hug a straight line as tightly as the positive association plots do. This
is clearer in Figure 16-3, in which least-squares lines are placed on
each scatter plot. Of course, associations, even strong ones, do not
imply causation. Put another way, knowing that greater strictness
and higher attendance usually go together does not prove that
one causes the other. It does, however, suggest that this relationship
might be an interesting one to study further.
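Least-squares lines like those just mentioned can also be drawn with base R alone. The following is a minimal sketch, not the code behind the book's figure. It assumes only the six rows printed by head(ex1713) earlier, retyped here as ex6 (a name invented for this illustration), and a hypothetical helper, panel_ls(), passed to the panel argument of pairs(). Supplying data = also removes the need for attach().

```r
# Only the six rows shown by head(ex1713), NOT the full 18-row dataset.
ex6 <- data.frame(
  Distinct  = c(2.5, 4.8, 3.0, 2.1, 1.1, 2.7),
  Attend    = c(25.6, 35.4, 26.4, 24.3, 17.3, 23.0),
  NonChurch = c(1.01, 0.68, 1.43, 2.58, 1.93, 1.71),
  StrongPct = c(50.6, 58.6, 40.0, 47.0, 32.0, 41.5),
  AnnInc    = c(24000, 27100, 32900, 28600, 39000, 33700)
)

# Hypothetical helper: draw the points, then overlay the least-squares line.
panel_ls <- function(x, y, ...) {
  points(x, y, ...)
  abline(lm(y ~ x), col = "gray40")
}

if (!interactive()) pdf(NULL)  # when run as a script, do not write Rplots.pdf

pairs(~ Distinct + Attend + NonChurch + StrongPct + AnnInc,
      data  = ex6,             # data = takes the place of attach()
      panel = panel_ls,
      pch   = 16,
      col   = "deepskyblue")

# Slope of the least-squares line in the Attend (y) vs. Distinct (x) panel
ls_slope <- coef(lm(Attend ~ Distinct, data = ex6))[["Distinct"]]
```

With so few rows the fitted lines are only illustrative; the same call should work on the full data frame by replacing ex6 with ex1713.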
The car package has a function called scatterplotMatrix() that
adds some useful features to the scatter plot matrix. First, it is easy to
plot the distribution of each of the variables on the diagonal of the
matrix as a histogram, density plot, box plot, Q-Q plot, or 1D
(diagonal) strip chart. In addition, you can easily add a least-squares
line to each plot.
Smoothers are also available for each plot. As we saw in Chapter 12,
a smoother is a tool for making patterns in scatter plot data a little
easier to see. There are several types of smoothers, but they all show
the center of the y’s at a given value of x (or several close x’s) and do
it in such a way that the (usually curved) line formed by connecting
all such points is relatively smooth. Figure 16-2 shows a scatter plot
matrix with smoothers, represented as red lines. You can use the
smoother argument to select a smoothing method, but Figure 16-2
uses the default method, “loess,” or locally weighted regression.
Figure 16-2. A scatter plot matrix produced by scatterplotMatrix() in
the car package. The default options add kernel density plots and rug
plots in the diagonal as well as least-squares lines and smoothers in
each of the plot windows.
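To make the idea of a smoother concrete, here is a small sketch on a single scatter plot using loess() from base R's stats package. The data are simulated purely for illustration (they are not the denomination data), and the span value of 0.5 is an arbitrary choice, not a recommendation.

```r
set.seed(1)                                # reproducible simulated data
x <- seq(0, 10, length.out = 100)
y <- sin(x) + rnorm(length(x), sd = 0.3)   # curved trend plus noise

# Locally weighted regression: span sets the share of points used in each local fit
fit      <- loess(y ~ x, span = 0.5)
smoothed <- predict(fit)                   # estimated center of y at each x

if (!interactive()) pdf(NULL)  # when run as a script, do not write Rplots.pdf

plot(x, y, pch = 16, col = "deepskyblue")
lines(x, smoothed, col = "red", lwd = 2)   # the smoother, drawn as a red line
```

A smaller span follows the points more closely and gives a wigglier line; a larger span gives a smoother one.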
Here is the code to produce Figure 16-2:
# Figure 16-2: scatter plot matrix w/ smoother & diagonal density
library(car)
library(Sleuth2)
attach(ex1713)
scatterplotMatrix(~ Distinct + Attend + NonChurch + StrongPct +
                    AnnInc)
The lines produced by the smoother in Figure 16-2 show some
interesting things. The associations between Attend and Distinct
and between Attend and StrongPct are close to straight lines and
suggest that these relationships may be described as simple linear
correlations. Certain other associations that looked close to linear
on the simple scatter plot—for example, that between Attend and
AnnInc—now appear more complex. It should be noted, however,
that this dataset only has 18 denominations in it, which is a rather
small number from which to draw conclusions about the shape of
the relation between any two variables. This example is merely an
illustration of the features available in the package. In most cases,
you will probably find it useful to look at a display like Figure 16-1
first; after getting a feel for the data, you might find some of the
other features helpful.
You can customize the matrix produced by scatterplotMatrix() quite a
bit. You can omit the smoother by using the smoother = NULL argument,
as shown in the code that follows. Likewise, you could remove the
regression line by using the reg.line = FALSE argument. It is also
possible to change the type of graph on the diagonals by using the
diagonal argument. To see the options, type ?scatterplotMatrix.
Figure 16-3 illustrates the customized scatterplotMatrix() display
created by the following code:
# Figure 16-3: scatter plot matrix w/out smoother & with histograms
scatterplotMatrix(~ Distinct + Attend + NonChurch + StrongPct + AnnInc,
                  diagonal = "histogram",
                  smoother = NULL)
Figure 16-3. A scatter plot matrix produced by scatterplotMatrix() in
the car package. Smoothers have been left out and the diagonal density
plots replaced with histograms.
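The smoother, reg.line, and diagonal = "histogram" spellings used in this chapter belong to older releases of car. In car 3.0 and later the corresponding arguments are smooth, regLine, and a list passed to diagonal. The sketch below shows what the same customization might look like there; it assumes car (version 3.0 or newer) and Sleuth2 are installed and simply does nothing otherwise. The car_ready flag is a name introduced only for this example.

```r
# Assumption: car >= 3.0 and Sleuth2 are installed; otherwise the plot is skipped.
car_ready <- requireNamespace("car", quietly = TRUE) &&
  requireNamespace("Sleuth2", quietly = TRUE) &&
  utils::packageVersion("car") >= "3.0.0"

if (car_ready) {
  if (!interactive()) pdf(NULL)  # when run as a script, do not write Rplots.pdf

  ex1713 <- Sleuth2::ex1713      # no attach() needed when data = is supplied

  car::scatterplotMatrix(
    ~ Distinct + Attend + NonChurch + StrongPct + AnnInc,
    data     = ex1713,
    diagonal = list(method = "histogram"),  # histograms on the diagonal
    smooth   = FALSE,                       # leave out the smoothers
    regLine  = TRUE                         # keep the least-squares lines
  )
}
```

If in doubt about which spelling your installation expects, ?scatterplotMatrix lists the arguments for the version you have.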
Figure 16-3 shows a matrix with diagonal histograms. This might be
a better choice than the density plots that are produced by default, at
least in this instance, given that the sample size is only 18. The
distribution of a couple of variables, Attend and NonChurch, is less
smooth than the density plots might lead us to think. Further, the
two especially large values of NonChurch can cause the relationship
between that variable and Attend to appear stronger and more linear
than it really is. You can probably see that by looking carefully at
the scatter plot of those two variables, but you might have missed it
had not the histogram flagged the plot first.
When examining a scatter plot matrix, it is important to remember
that you are actually being presented with many separate plots. Do
not let yourself become overwhelmed by the amount of information
on the page. Look at each plot by itself. After you have done this for
many of the plots, you will probably find it enlightening to compare
them.
Corrgram
The corrgram (sometimes called a “correlogram,” although that term
more properly refers to a plot of autocorrelations in time-series
analysis) is a type of graph related to the scatter plot matrix. In
this type of graph, the individual scatter plots are replaced by
symbols that represent numbers measuring the amount of linear
correlation between two quantitative variables. The Pearson
correlation coefficient, usually denoted as r, can vary between –1
and 1. A perfect positive correlation is 1, meaning that all the points
on the scatter plot of two quantitative variables lie exactly on an
ascending straight line. A perfect negative correlation is –1,
indicating that all points lie exactly on a descending straight line.
Values near 0 indicate little or no linear association between two
variables. Take note that the correlation coefficient is not a measure
of the steepness of a line’s slope. It is, instead, a measure of how
tightly the points cluster around a straight line. Figure 16-4
illustrates the meaning of the correlation coefficient. A further
caution: the correlation coefficient is useful only if the relationship
between the variables is roughly linear; that is to say, if the points
tend to fall along a straight line. In other situations, the correlation
coefficient can be misleading or even deceptive.
Figure 16-4. A perfect positive correlation of 1 has all the points falling
exactly on an upward-sloping line. A perfect negative correlation of –1
has all the points falling exactly on a downward-sloping line. A
correlation of 0 shows no discernible pattern. A positive correlation
of .79 shows points falling “close” to a straight line.
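Two of the cautions above, that r does not measure slope and that r can mislead when the pattern is not linear, are easy to check numerically. The following sketch uses simulated values invented for this illustration; none of the variable names come from the dataset.

```r
set.seed(42)                           # reproducible simulated data
x     <- 1:50
noise <- rnorm(50, sd = 5)

y_steep   <- 10  * x + 10  * noise     # steep line
y_shallow <- 0.1 * x + 0.1 * noise     # shallow line, same relative scatter

r_steep   <- cor(x, y_steep)
r_shallow <- cor(x, y_shallow)         # same r: rescaling y changes the slope, not r

# An exact but U-shaped (nonlinear) relationship
u   <- -5:5
r_u <- cor(u, u^2)                     # essentially zero, although u^2 is determined by u
```

The slopes of the two lines differ by a factor of 100, yet the two coefficients agree. The U-shaped pair has a correlation of essentially zero even though one variable is completely determined by the other, which is why the scatter plot should be inspected before the coefficient is trusted.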
To make a corrgram, it is first necessary to make a correlation matrix,
that is, a matrix containing the correlation coefficients of all the
variable pairs in the dataset. This is accomplished by using the cor()
function:
> library(Sleuth2)
> attach(ex1713)
> y = cor(ex1713[, 2:6]) # use all rows and columns 2 through 6
> y
            Distinct     Attend  NonChurch  StrongPct     AnnInc
Distinct   1.0000000  0.7891067 -0.6585883  0.8127124 -0.6003892
Attend     0.7891067  1.0000000 -0.6107342  0.8649691 -0.6766143
NonChurch -0.6585883 -0.6107342  1.0000000 -0.4218525  0.6458747
StrongPct  0.8127124  0.8649691 -0.4218525  1.0000000 -0.6146261
AnnInc    -0.6003892 -0.6766143  0.6458747 -0.6146261  1.0000000
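To see what cor() is computing, the sketch below works out one coefficient from the definition of Pearson's r (the sum of the products of deviations, divided by the square root of the product of the summed squared deviations) and compares it with the built-in result. It uses only the six rows printed by head(ex1713) earlier, retyped as ex6 (a name invented here), so its values are not expected to match the 18-row matrix above.

```r
# Only the six rows shown by head(ex1713), NOT the full 18-row dataset.
ex6 <- data.frame(
  Distinct  = c(2.5, 4.8, 3.0, 2.1, 1.1, 2.7),
  Attend    = c(25.6, 35.4, 26.4, 24.3, 17.3, 23.0),
  NonChurch = c(1.01, 0.68, 1.43, 2.58, 1.93, 1.71),
  StrongPct = c(50.6, 58.6, 40.0, 47.0, 32.0, 41.5),
  AnnInc    = c(24000, 27100, 32900, 28600, 39000, 33700)
)

# Pearson's r for one pair, computed from its definition
dx <- ex6$Distinct - mean(ex6$Distinct)   # deviations from the mean of x
dy <- ex6$Attend   - mean(ex6$Attend)     # deviations from the mean of y
r_manual  <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

r_builtin <- cor(ex6$Distinct, ex6$Attend)

# The full matrix: every column of ex6 is numeric, so no column selection is needed
y6 <- cor(ex6)
round(y6, 2)
```

Selecting columns by name, for example ex1713[, c("Distinct", "Attend")], is an alternative to positional indexing such as 2:6 and keeps working if the column order changes.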
When the correlation matrix has been produced, you can use the
corrplot() function from the corrplot package to make several
types of corrgram. Some examples appear in Figure 16-5. All of the
examples use color to depict size of correlation. You can also use size
of object, orientation of object, or numbers to show how large the
correlation of a given pair of variables is.
Figure 16-5. Visualizations of the correlation matrix. This is a type of
summary, or approximation, of the scatter plot matrix, produced by
the corrplot() function in the corrplot package. Upper left: method =
"circle"; upper right: method = "color"; lower left: method = "number";
lower right: method = "ellipse", type = "lower".
The correlation between variables A and B is the same as the
correlation between B and A, so the complete corrgram is redundant.
That is to say, the correlations in the upper half are exactly the same
as the correlations in the lower half. For this reason, some prefer to
display only the upper half or only the lower half of the matrix. An
example of this appears in the lower-right corner of Figure 16-5. You
can do this by using the argument type = "lower". The code to
produce the corrgrams in Figure 16-5 follows:
# Figure 16-5: various corrgrams
library(corrplot)
library(Sleuth2)
attach(ex1713)
y = cor(ex1713[, 2:6])
par(mfrow = c(2,2))
corrplot(y) # default method is "circle"
corrplot(y, method = "color")
corrplot(y, method = "number")
corrplot(y, method = "ellipse", type = "lower")
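The redundancy that type = "lower" removes can also be seen directly in the matrix. As a rough sketch in base R, again assuming only the six rows printed by head(ex1713) (retyped as ex6, with pairs_tbl a name made up for this example), the lower triangle can be pulled out as a table with one row per distinct pair of variables:

```r
# Only the six rows shown by head(ex1713), NOT the full 18-row dataset.
ex6 <- data.frame(
  Distinct  = c(2.5, 4.8, 3.0, 2.1, 1.1, 2.7),
  Attend    = c(25.6, 35.4, 26.4, 24.3, 17.3, 23.0),
  NonChurch = c(1.01, 0.68, 1.43, 2.58, 1.93, 1.71),
  StrongPct = c(50.6, 58.6, 40.0, 47.0, 32.0, 41.5),
  AnnInc    = c(24000, 27100, 32900, 28600, 39000, 33700)
)
y6 <- cor(ex6)

# Number of distinct variable pairs among the five variables
n_pairs <- choose(ncol(y6), 2)

# Row/column positions of the cells below the diagonal
idx <- which(lower.tri(y6), arr.ind = TRUE)

# One row per distinct pair; the upper triangle would only repeat these values
pairs_tbl <- data.frame(
  var1 = rownames(y6)[idx[, "row"]],
  var2 = colnames(y6)[idx[, "col"]],
  r    = y6[lower.tri(y6)]
)
pairs_tbl
```

Five variables give 25 cells in the full matrix but only 10 distinct pairs, which is all that a half-matrix corrgram needs to show.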
Despite all the warnings about correlation coefficients, the corrgram
can be an effective way to present and screen data, if you first take
the time to look at the scatter plots (and possibly smoothers) to see
whether correlation coefficients make sense. Compare the corrgrams in
Figure 16-5 to the scatter plot matrices in earlier figures in this
chapter to see how consistent the conclusions from these varied
displays may be. Corrgrams are also available through the cor.plot()
function in the psych package.
All of the plots in Figure 16-5 use color to indicate the strength of
the correlation, with a color gradient either on the right or the
bottom to show the color meanings. Shades of blue show a positive
relationship, with darker colors being stronger (i.e., closer to 1).
Shades of red show a negative correlation, with darker colors being
closer to –1. In two graphs (the upper-left corner and the lower-right
corner), size also indicates strength, but in opposite ways. In
the upper-left graph, larger size shows larger absolute value. In the
lower-right graph, orientation indicates positive or negative
correlation, with narrow ovals showing points close to a line (i.e.,
strong correlation). Fat ovals indicate a lot of variation around a
line, or weaker correlation. You probably picked this up without my
telling you, but it feels better to have your suspicions confirmed, right?
It is also possible to combine the scatter plot matrix with the
corrgram by putting one of these graphs in the lower half of the matrix,
and the other in the upper half. The ggscatmat() function in the
GGally package does exactly that:
# Figure 16-6
library(GGally)
library(Sleuth2)
ggscatmat(ex1713, columns = 2:6)
Figure 16-6 shows the results.
Figure 16-6. A combination of scatter plot matrix and corrgram
produced by the ggscatmat() function in the GGally package. Note the
overprinting on the x-axis in the lower-right corner. This can be fixed!
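A similar combined display can be approximated without GGally by giving pairs() different panel functions for the two halves of the matrix. This is only a sketch of the idea, not a reproduction of ggscatmat(): panel_cor() and r_label() are hypothetical helpers written for this example, and the data are again just the six rows printed by head(ex1713).

```r
# Only the six rows shown by head(ex1713), NOT the full 18-row dataset.
ex6 <- data.frame(
  Distinct  = c(2.5, 4.8, 3.0, 2.1, 1.1, 2.7),
  Attend    = c(25.6, 35.4, 26.4, 24.3, 17.3, 23.0),
  NonChurch = c(1.01, 0.68, 1.43, 2.58, 1.93, 1.71),
  StrongPct = c(50.6, 58.6, 40.0, 47.0, 32.0, 41.5),
  AnnInc    = c(24000, 27100, 32900, 28600, 39000, 33700)
)

# Hypothetical helper: format a correlation to two decimal places
r_label <- function(x, y) format(round(cor(x, y), 2), nsmall = 2)

# Hypothetical panel: print the correlation in the middle of the cell
panel_cor <- function(x, y, ...) {
  old <- par(usr = c(0, 1, 0, 1))   # temporary 0-1 coordinates for the text
  on.exit(par(old))                 # restore the panel's own coordinates
  text(0.5, 0.5, r_label(x, y), cex = 1.5)
}

if (!interactive()) pdf(NULL)  # when run as a script, do not write Rplots.pdf

pairs(ex6,
      lower.panel = points,         # scatter plots below the diagonal
      upper.panel = panel_cor,      # correlations above the diagonal
      pch = 16,
      col = "deepskyblue")
```

Because the correlations are printed as text rather than encoded by color or shape, this is closer to a numeric corrgram than to the circle or ellipse styles shown earlier.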
Note a small problem in Figure 16-6. The x-axis values in the
lower-right corner are overprinting because the numbers are too big to fit
in a small space. There is a pretty simple fix for this. Change the
scale of the values of AnnInc from dollars to thousands of dollars,
and redo the graph with this new variable. Accomplishing this will
require one new command and a small change in another one. First,
create a new variable, Inc, that is AnnInc divided by one thousand.
This new variable becomes the seventh column of the data frame.
Next, modify the ggscatmat() command to include the desired
columns, leaving out AnnInc and including Inc:
# Figure 16-7: fix a bug in Figure 16-6
library(GGally)
library(Sleuth2)