15.3 Spearman’s Coefficient of Correlation
Fig. 15.4 Charles Edward Spearman (1863–1945).
Spearman (1904) proposed the rank correlation coefficient long before statistics became a scientific discipline. For bivariate data, an observation has two coupled components $(X, Y)$ that may or may not be related to each other. Let $\rho = \mathrm{Corr}(X, Y)$ represent the unknown correlation between the two components. In a sample of size $n$, let $R_1, \dots, R_n$ denote the ranks for the first component $X$ and $S_1, \dots, S_n$ denote the ranks for $Y$. For example, if $x_1 = x_{(n)}$ is the largest value from $x_1, \dots, x_n$ and $y_1 = y_{(1)}$ is the smallest value from $y_1, \dots, y_n$, then $(R_1, S_1) = (n, 1)$. Corresponding to Pearson’s (parametric) coefficient of correlation, the Spearman coefficient of correlation is defined as
$$\hat{\rho} = \frac{\sum_{i=1}^{n} (R_i - \bar{R})(S_i - \bar{S})}{\sqrt{\sum_{i=1}^{n} (R_i - \bar{R})^2 \cdot \sum_{i=1}^{n} (S_i - \bar{S})^2}}. \tag{15.1}$$
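This expression can be simplified. In the absence of ties the ranks are a permutation of $1, \dots, n$, so $\bar{R} = \bar{S} = (n+1)/2$ and $\sum_{i=1}^{n} (R_i - \bar{R})^2 = \sum_{i=1}^{n} (S_i - \bar{S})^2 = n\,\mathrm{Var}(R_i) = n(n^2 - 1)/12$. Define $D$ as the difference between ranks, i.e., $D_i = R_i - S_i$. With $\bar{R} = \bar{S}$, we can see that
$$D_i = (R_i - \bar{R}) - (S_i - \bar{S})$$
and
$$\sum_{i=1}^{n} D_i^2 = \sum_{i=1}^{n} (R_i - \bar{R})^2 + \sum_{i=1}^{n} (S_i - \bar{S})^2 - 2 \sum_{i=1}^{n} (R_i - \bar{R})(S_i - \bar{S}),$$
i.e.,
$$\sum_{i=1}^{n} (R_i - \bar{R})(S_i - \bar{S}) = \frac{n(n^2 - 1)}{12} - \frac{1}{2} \sum_{i=1}^{n} D_i^2.$$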
By dividing both sides of the equation by $\sqrt{\sum_{i=1}^{n} (R_i - \bar{R})^2 \cdot \sum_{i=1}^{n} (S_i - \bar{S})^2} = \sum_{i=1}^{n} (R_i - \bar{R})^2 = n(n^2 - 1)/12$, we obtain
$$\hat{\rho} = 1 - \frac{6 \sum_{i=1}^{n} D_i^2}{n(n^2 - 1)}. \tag{15.2}$$
Consistent with Pearson’s coefficient of correlation (the standard parametric measure of linear dependence), Spearman’s coefficient of correlation ranges between −1 and 1. If there is perfect agreement, i.e., all the differences $D_i$ are 0, then $\hat{\rho} = 1$. The scenario that maximizes $\sum_{i=1}^{n} D_i^2$ occurs when ranks are perfectly opposite, $R_i = n - S_i + 1$, in which case $\hat{\rho} = -1$.
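The lower bound can be checked directly from (15.2). The short derivation below is a worked sketch that assumes no ties, so that $\sum_{i=1}^{n} (R_i - \bar{R})^2 = n(n^2-1)/12$ as above.

```latex
\begin{align*}
S_i = n + 1 - R_i \quad &\Longrightarrow \quad
D_i = R_i - S_i = 2R_i - (n+1) = 2\,(R_i - \bar{R}), \\
\sum_{i=1}^{n} D_i^2 &= 4 \sum_{i=1}^{n} (R_i - \bar{R})^2
  = 4 \cdot \frac{n(n^2-1)}{12} = \frac{n(n^2-1)}{3}, \\
\hat{\rho} &= 1 - \frac{6}{n(n^2-1)} \cdot \frac{n(n^2-1)}{3} = 1 - 2 = -1 .
\end{align*}
```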
If the sample is large enough, then Spearman’s statistic can be approximated using the normal distribution. It was shown that if $n > 10$, then
$$Z = (\hat{\rho} - \rho)\sqrt{n - 1} \sim N(0, 1).$$
Example 15.8. Stichler et al. (1953) list tread wear for tires, each tire measured by two methods based on (a) weight loss and (b) groove wear.
Weight   Groove      Weight   Groove
 45.9     35.7        41.9     39.2
 37.5     31.1        33.4     28.1
 31.0     24.0        30.5     28.7
 30.9     25.9        31.9     23.3
 30.4     23.1        27.3     23.7
 20.4     20.9        24.5     16.1
 20.9     19.9        18.9     15.2
 13.7     11.5        11.4     11.2
For these data, $\hat{\rho} = 0.9265$. Note that if we opt for the parametric measure of correlation, the Pearson coefficient is 0.948.
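The value of the Spearman coefficient in Example 15.8 can be reproduced from (15.2). The following is a minimal sketch, not the book's own script: the variable names are illustrative, the ranks are obtained by sorting (which is valid here only because neither variable has ties), and `normcdf` assumes the Statistics Toolbox is available. The test of $H_0: \rho = 0$ at the end is an illustrative use of the normal approximation.

```matlab
% Tread wear data from Example 15.8 (16 tires, no ties in either variable)
weight = [45.9 37.5 31.0 30.9 30.4 20.4 20.9 13.7 ...
          41.9 33.4 30.5 31.9 27.3 24.5 18.9 11.4]';
groove = [35.7 31.1 24.0 25.9 23.1 20.9 19.9 11.5 ...
          39.2 28.1 28.7 23.3 23.7 16.1 15.2 11.2]';
n = numel(weight);

% Ranks by sorting; this shortcut is only correct when there are no ties
[~, ix] = sort(weight);  R = zeros(n,1);  R(ix) = (1:n)';
[~, iy] = sort(groove);  S = zeros(n,1);  S(iy) = (1:n)';

D = R - S;                               % rank differences D_i
sumD2 = sum(D.^2)                        % 50
rho_hat = 1 - 6*sumD2/(n*(n^2 - 1))      % 0.9265, as in (15.2)

% Normal approximation under H0: rho = 0 (n = 16 > 10)
Z = rho_hat*sqrt(n - 1)                  % 3.5882
p = 2*(1 - normcdf(abs(Z)))              % two-sided p-value
```

Because the data contain no ties, `rho_hat` should agree with the Pearson correlation computed on the rank vectors `R` and `S`.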
Ties in the Data: The statistics in (15.1) and (15.2) are not designed for paired data that include tied measurements. If ties exist in the data, a simple adjustment should be made. Define $u' = \sum u(u^2 - 1)/12$ and $v' = \sum v(v^2 - 1)/12$, where the $u$s and $v$s are the sizes of the groups of tied values among the $X$s and the $Y$s, respectively, and the ranks are adjusted (e.g., averaged) for ties. Then
$$\hat{\rho} = \frac{n(n^2 - 1) - 6 \sum_{i=1}^{n} D_i^2 - 6(u' + v')}{\{[n(n^2 - 1) - 12u'][n(n^2 - 1) - 12v']\}^{1/2}},$$
and it holds that, for large $n$,
$$Z = (\hat{\rho} - \rho)\sqrt{n - 1} \sim N(0, 1).$$
The MATLAB function corr(x,y,'type','Spearman') computes the Spearman correlation coefficient for column vectors x and y.
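As a small check of the tie adjustment, the sketch below evaluates the tie-corrected formula on a made-up four-point data set and compares it with the built-in function. The data are hypothetical, and the sketch assumes that $u$ and $v$ denote the sizes of the tied groups and that `tiedrank` and `corr` from the Statistics Toolbox are available.

```matlab
% Hypothetical toy data with one tie in x (value 2) and one tie in y (value 3)
x = [1 2 2 3]';
y = [1 3 2 3]';
n = numel(x);

R = tiedrank(x);                  % midranks: [1 2.5 2.5 4]'
S = tiedrank(y);                  % midranks: [1 3.5 2 3.5]'
sumD2 = sum((R - S).^2);          % 1.5

% Sizes of the groups of equal values (size 1 means "not tied")
[~, ~, gx] = unique(x);  u = accumarray(gx(:), 1);
[~, ~, gy] = unique(y);  v = accumarray(gy(:), 1);
uprime = sum(u.*(u.^2 - 1))/12;   % 0.5
vprime = sum(v.*(v.^2 - 1))/12;   % 0.5

N = n*(n^2 - 1);
rho_ties = (N - 6*sumD2 - 6*(uprime + vprime)) / ...
           sqrt((N - 12*uprime)*(N - 12*vprime))      % 0.8333

% Built-in version (Pearson correlation of the midranks)
rho_builtin = corr(x, y, 'type', 'Spearman')          % 0.8333
```

Groups of size 1 contribute zero to `uprime` and `vprime`, so when no ties are present the adjusted formula reduces to (15.2).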
15.4 Kendall’s Tau
M. G. Kendall (Fig. 15.5) formalized an alternative measure of dependence (originally proposed and used in the nineteenth century) by finding out how many pairs in a bivariate sample are “concordant,” which means that the signs of the differences in $X$ and in $Y$ agree within the pair. Pairs for which one sign is plus and the other is minus are “discordant.” From $(X_i, Y_i)$, $i = 1, \dots, n$, one can choose $\binom{n}{2}$ different pairs. The pair $(X_i, Y_i)$, $(X_j, Y_j)$ is concordant if either $X_i < X_j$ and $Y_i < Y_j$ or $X_i > X_j$ and $Y_i > Y_j$. The pair is called discordant if either $X_i < X_j$ and $Y_i > Y_j$ or $X_i > X_j$ and $Y_i < Y_j$. A pair that is tied in either coordinate is neither concordant nor discordant. For example, the pairs (2, 4) and (1, −1) are concordant, while the pairs (−2, 4) and (1, −1) are discordant.
Fig. 15.5 Sir Maurice George Kendall (1907–1983).
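Kendall’s $\hat{\tau}$-statistic (Kendall, 1938) is defined as
$$\hat{\tau} = \frac{2 S_\tau}{n(n - 1)}, \qquad S_\tau = \sum_{i=1}^{n} \sum_{j=i+1}^{n} \operatorname{sign}\{r_j - r_i\},$$
where the $r_i$s are defined via ranks of the second sample corresponding to the ordered ranks of the first sample, $\{1, 2, \dots, n\}$, i.e.,
$$\begin{pmatrix} 1 & 2 & \dots & n \\ r_1 & r_2 & \dots & r_n \end{pmatrix},$$
so that each concordant pair contributes $+1$ and each discordant pair contributes $-1$ to $S_\tau$. In this notation $\sum_{i=1}^{n} D_i^2$ from Spearman’s coefficient of correlation becomes $\sum_{i=1}^{n} (r_i - i)^2$. In terms of the number of concordant ($n_C$) and discordant ($n_D = \binom{n}{2} - n_C$) pairs,
$$\hat{\tau} = \frac{2(n_C - n_D)}{n(n - 1)},$$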
and in the case of ties, use
$$\hat{\tau} = \frac{n_C - n_D}{n_C + n_D}.$$
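To make the counting concrete, the sketch below computes Kendall’s statistic for a small hypothetical rank sequence `r` (the ranks of the second sample after the pairs are ordered by the first), using the sign convention under which a concordant pair contributes +1. The same sequence is also plugged into Spearman’s formula through $\sum (r_i - i)^2$. The data and variable names are made up for illustration.

```matlab
% Hypothetical ranks of Y after the pairs have been sorted by X (no ties)
r = [2 1 4 3 5];
n = numel(r);

% S_tau: +1 for each concordant pair (r_i < r_j, i < j), -1 for each discordant one
S_tau = 0;
for i = 1:n-1
    for j = i+1:n
        S_tau = S_tau + sign(r(j) - r(i));
    end
end

npairs = nchoosek(n, 2);             % 10 pairs in total
nC = (npairs + S_tau)/2              % 8 concordant pairs
nD = npairs - nC                     % 2 discordant pairs
tau_hat = 2*S_tau/(n*(n - 1))        % 0.6

% Spearman's coefficient from the same ranks, with D_i = r_i - i
rho_hat = 1 - 6*sum((r - (1:n)).^2)/(n*(n^2 - 1))   % 0.8
```

The two coefficients measure the same monotone association on different scales, so they need not coincide; here $\hat{\tau} = 0.6$ while $\hat{\rho} = 0.8$.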
Example 15.9. Prevention of Vitreous Loss. Limbal incisions were made
in rabbit eyes to mirror the initial steps of lens extraction. The vitreous body
loses water when the eye is open and decreases in weight, as reported by
Galin et al. (1971). The results had implications in the context of cataract
surgery. The authors measured the vitreous body weight for each eye of 15
New Zealand albino rabbits. One eye was kept open for 5 min (y), while the
other served as a control (x). The measurements of vitreous weight (in mg) are
provided next:
Rabbit #            1     2     3     4     5     6     7     8
Control eye (x)  1848  1532  1460  1947  1810  1718  1686  1617
Open eye (y)     1738  1440  1388  1756  1692  1629  1583  1499

Rabbit #            9    10    11    12    13    14    15
Control eye (x)  1724  1873  1928  2226  1708  1605  1822
Open eye (y)     1596  1794  1785  2044  1602  1491  1702
We will find Kendall’s τˆ and provide an approximate 95% confidence interval.
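The sample variance of $\hat{\tau}$ when no ties are present is approximately
$$s^2(\hat{\tau}) = 4 \sum_{i=1}^{n} c_i^2 - 2 \sum_{i=1}^{n} c_i - \frac{2n - 3}{\binom{n}{2}} \left( \sum_{i=1}^{n} c_i \right)^2,$$
where $c_i$ is the number of pairs that are concordant with the $i$th pair $(X_i, Y_i)$. With the presence of ties the expression for sample variance is more complicated, but the above expression can serve as an approximation. Then the $(1 - \alpha)100\%$ confidence interval is
$$\left[ \hat{\tau} - \frac{z_{1-\alpha/2}}{\binom{n}{2}}\, s(\hat{\tau}),\ \hat{\tau} + \frac{z_{1-\alpha/2}}{\binom{n}{2}}\, s(\hat{\tau}) \right] \cap [-1, 1].$$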
Using rabbits.m we found no ties, $n_C = 100$, $n_D = 5$, $\hat{\tau} = 0.9048$, and a 95% confidence interval of [0.8091, 1.0000]. Using the difference $T = n_C - n_D$ one can test for independence of the two components. The test has a p-value of
$$p = 2 P\left( Z \ge \frac{3(|T| + 1)\sqrt{2}}{\sqrt{n(n - 1)(2n - 5)}} \right),$$
where $Z$ is standard normal. In our example a strong dependence between control and open eye measurements is found.
n = 15; nc = 100; nd = 5;   % sample size and pair counts reported above
T = nc - nd
% 95
p = 2 * (1 - normcdf(3*(abs(T)+1)*sqrt(2)/sqrt(n*(n-1)*(2*n-5))))
% 1.8965e-008
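The counts and the confidence interval quoted for Example 15.9 can be recomputed from the rabbit data. The sketch below is not the book's rabbits.m; it is an independent illustration that takes $c_i$ to be the number of pairs concordant with the $i$th pair, uses `bsxfun` to build all pairwise differences, and assumes `norminv` from the Statistics Toolbox.

```matlab
% Vitreous weights (mg) for rabbits 1-15: control eye (x) and open eye (y)
x = [1848 1532 1460 1947 1810 1718 1686 1617 1724 1873 1928 2226 1708 1605 1822]';
y = [1738 1440 1388 1756 1692 1629 1583 1499 1596 1794 1785 2044 1602 1491 1702]';
n = numel(x);

% Q(i,j) = +1 if pairs i and j are concordant, -1 if discordant, 0 on the diagonal
Q = sign(bsxfun(@minus, x, x')) .* sign(bsxfun(@minus, y, y'));

nC = sum(Q(:) ==  1)/2          % 100 (each pair appears twice in Q)
nD = sum(Q(:) == -1)/2          % 5
T  = nC - nD;
tau_hat = 2*T/(n*(n - 1))       % 0.9048

% Variance approximation (no ties) and 95% confidence interval
c  = sum(Q == 1, 2);            % c(i): number of pairs concordant with pair i
s2 = 4*sum(c.^2) - 2*sum(c) - (2*n - 3)/nchoosek(n, 2)*sum(c)^2;
half = norminv(0.975)*sqrt(s2)/nchoosek(n, 2);
ci = [max(tau_hat - half, -1), min(tau_hat + half, 1)]   % [0.8091 1.0000]
```

The upper limit is truncated at 1 by the intersection with $[-1, 1]$; the untruncated value slightly exceeds 1.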
15.5 Cum hoc ergo propter hoc
We conclude this chapter with a discussion on the misuses of correlations. The fallacy that correlation implies causation is summarized by Gould’s quote at the chapter’s beginning (Latin Cum hoc ergo propter hoc meaning “With this, therefore because of this”). We already mentioned the “link” between ice-cream sold on the beach and the number of drowning accidents, but the fallacy causes more serious damage to science. Spurious correlations are often misused in medical and health science and interpreted as causation. The number of published studies with voodoo causations, often conflicting from study to study, is stunning.
Fig. 15.6 Fresh lemons imported to USA from Mexico (in metric tons; U.S. Department of Agriculture) and total U.S. highway fatality rate (per 100,000; U.S. NHTSA, DOT HS 810 780). [Scatterplot: horizontal axis, import of fresh lemons (metric tons); vertical axis, highway fatality rate (per 100,000); points labeled by year, 1996–2000.]
As an extreme case of spurious correlation we give an example (popular
among bloggers on the Web) involving data on imports of fresh lemons from
Mexico (1996–2000) and U.S. highway fatality rates (1996–2000), Fig. 15.6.
The correlation is r = −0.986 and is highly significant (p < 0.0002) even with
sample size n = 5. Some bloggers provided “causal links” citing less expensive
car air-fresheners that make drivers happy or slower traffic caused by trucks
from Mexico transporting lemons.
There are two possible errors in correlation inference caused by grouping data. The first one occurs if two separate groups are combined. Within each group there may be no correlation, but when the groups are combined, the correlation may be significant and, of course, spurious. Figure 15.7 illustrates this point. Observations represented by red circles (group 1, r = 0.0643), as well as the pairs represented by blue circles (group 2, r = 0.0079), show no significant correlation. However, when the groups are combined, the correlation increases to r = 0.7031, and it is significant with a p-value of $1.5 \times 10^{-5}$. Details can be found in spur.m.
Fig. 15.7 Spurious correlation when two groups of uncorrelated pairs are combined. [Scatterplot of y against x; Group 1 and Group 2 shown with different markers.]
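The mechanism behind this first error can be seen in a tiny deterministic example. The sketch below does not reproduce spur.m or the data of Fig. 15.7; the eight points are made up so that each group has exactly zero correlation while the pooled sample is strongly correlated.

```matlab
% Two hypothetical groups, each with exactly zero within-group correlation
g1 = [0 0; 0 1; 1 0; 1 1];      % columns: x, y
g2 = g1 + 5;                    % same configuration shifted by 5 in both coordinates

c1 = corrcoef(g1(:,1), g1(:,2));   r1 = c1(1,2)    % 0
c2 = corrcoef(g2(:,1), g2(:,2));   r2 = c2(1,2)    % 0

% Pooling the groups produces a large correlation driven only by the group shift
pooled = [g1; g2];
cp = corrcoef(pooled(:,1), pooled(:,2));
r_pooled = cp(1,2)                                 % 0.9615
```

The pooled correlation reflects only the difference between the group means, which is why it should not be read as a relationship between x and y within either group.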
The second error is more subtle. Often, repeated bivariate measurements are treated as independent, and an artificial correlation due to a blocking factor is introduced. For example, if for 15 subjects one measures weight (X) and skinfold thickness (Y) before and after a diet and combines the measurements, then, due to the “increased sample size,” the correlation between X and Y is more likely to be declared significant.
15.6 Exercises
15.1. Correlation Between Uniforms and Their Squares. Generate 10,000 uniform random numbers between −1 and 1 in the form of a vector x. Demonstrate that y = x.^2 has a small correlation with x, regardless of their perfect functional relationship.
15.2. Muscle Strength of “Ethanol Abusers.” It is estimated that 10% of European and North American adults, and up to one-third of acute hospital admissions, are alcoholics. Obviously, the high proportion of alcoholics in the
hospitalized population imposes severe financial constraints on health authorities and emphasizes the need for primary caretakers to focus on minimizing alcohol misuse. A staggering two-thirds of chronic ethanol abusers
have skeletal muscle myopathy (Martin et al., 1985; Worden, 1976).
Hickish et al. (1989) provide height, quadriceps muscle strength, and age
data in 41 male alcoholics, as in the table below. The data are available as
alcos.xls or alcos.ascii.
Height   Quadriceps muscle   Age        Height   Quadriceps muscle   Age
(cm)     strength (N)        (years)    (cm)     strength (N)        (years)
155      196                 55         172      147                 32
159      196                 62         173      441                 39
159      216                 53         173      343                 28
160      392                 32         173      441                 40
160       98                 58         173      294                 53
161      387                 39         175      304                 27
162      270                 47         175      404                 28
162      216                 61         175      402                 34
166      466                 24         175      392                 53
167      294                 50         175      196                 37
167      491                 35         176      368                 51
168      137                 65         177      441                 49
168      343                 41         177      368                 48
168       74                 65         177      412                 32
170      304                 55         178      392                 49
171      294                 47         178      540                 41
172      294                 31         178      417                 42
172      343                 38         178      324                 55
172      147                 31         179      270                 32
172      319                 39         180      368                 34
172      466                 53
(a) Find the sample correlation between Height and Strength, $r_{HS}$. Test the hypothesis that the population correlation coefficient between Height and Strength, $\rho_{HS}$, is positive at the level α = 0.01.
(b) Since an increase in Age is expected to decrease the Strength (negative correlation), find the correlation between Height and Strength when Age is accounted for, that is, find $r_{HS.A}$. Test the hypothesis that $\rho_{HS.A}$ is positive at the level α = 0.01.
(c) Find an approximate 95% confidence interval for $\rho_{HS}$.
15.3. Vending Machine and Pharmacy Errors. Mr. Joseph Bentley, the
owner of a pharmacy store, wants to remove the Coke vending machine
standing in front of his store because he believes the vending machine influences the number of errors the store employees make. More precisely, as
more Coke is sold outside his store, more errors are made. He provided the
following data:
Errors made     5     3    10     9     5     7     8     4
Coke sold     112   100   220   250   100   200   160   100
Find the coefficient of correlation. Comment on why this correlation is high.
Is there a causation – are Coke sales by themselves influencing the pharmacy employees?
15.4. Vending Machine and Pharmacy Errors Revisited. Refer to Exercise
15.3. In addition to Errors and Coke, Mr. Bentley provided the count of
people that pass by his store (and the vending machine):
Errors made       5      3     10      9      5      7      8      4
Coke            112    100    220    250    100    200    160    100
People        10000   6000  17000  20000   9000  15000  14000   8000
Find the coefficient of correlation between Errors and Coke sales while accounting for the number of people. Comment.
15.5. Corn Yields and Rainfall. The following table published by Misner (1928) has been analyzed by Ezekiel and Fox (1959). The variables are
year: the years from 1890 to 1927; year 1 in the data below corresponds to 1890.
rain (X): rainfall measurements in inches, in the six states.
yield (Y): yearly corn yield in bushels per acre, in six Corn Belt states (Iowa, Illinois, Nebraska, Missouri, Indiana, and Ohio).
year     X      Y       year     X      Y
  1     9.6   24.5       20    12.0   32.3
  2    12.9   33.7       21     9.3   34.9
  3     9.9   27.9       22     7.7   30.1
  4     8.7   27.5       23    11.0   36.9
  5     6.8   21.7       24     6.9   26.8
  6    12.5   31.9       25     9.5   30.5
  7    13.0   36.8       26    16.5   33.3
  8    10.1   29.9       27     9.3   29.7
  9    10.1   30.2       28     9.4   35.0
 10    10.1   32.0       29     8.7   29.9
 11    10.8   34.0       30     9.5   35.2
 12     7.8   19.4       31    11.6   38.3
 13    16.2   36.0       32    12.1   35.2
 14    14.1   30.2       33     8.0   35.5
 15    10.6   32.4       34    10.7   36.7
 16    10.0   36.4       35    13.9   26.8
 17    11.5   36.9       36    11.3   38.0
 18    13.6   31.5       37    11.6   31.7
 19    12.1   30.5       38    10.4   32.6
Find the sample correlation coefficient r and a 95% confidence interval for
the population coefficient ρ .
15.6. Drosophilæ. Sokoloff (1966) reported the correlation between body weight and wing length in Drosophila pseudoobscura as 0.52 in a sample of $n_1 = 39$ at the Grand Canyon, and as 0.67 in a sample of $n_2 = 20$ at Flagstaff, Arizona. Do the correlations in these two populations differ significantly? Use α = 0.05.
15.7. Confidence Interval for the Difference of Two Correlation Coefficients. Using the results on testing the equality of two correlation coefficients, develop a (1 − α)100% confidence interval for their difference.
15.8. Oxygen Intake. The human body takes in more oxygen when exercising than when it is at rest, and to deliver the oxygen to the muscles, the heart must beat faster. Heart rate is easy to measure, but the measurement of oxygen uptake requires elaborate equipment. If oxygen uptake (VO2) is strongly correlated with the heart rate (HR) under a particular set of exercise conditions, then its predicted, rather than measured, values could be used for various research purposes.¹

HR     VO2        HR     VO2
 94    0.473      108    1.403
 96    0.753      110    1.499
 95    0.929      113    1.529
 95    0.939      113    1.599
 94    0.832      118    1.749
 95    0.983      115    1.746
 94    1.049      121    1.897
104    1.178      127    2.040
104    1.176      135    2.231
106    1.292

Find the sample correlation r and calculate a 95% confidence interval for its population counterpart, ρ.

¹ Data provided by Paul Waldsmith from experiments conducted in Don Corrigan’s lab at Purdue University, West Lafayette, Indiana.

15.9. Obesity and Pain. Khimich (1997) found that the pain threshold increases in obese subjects and increases with age. Obesity is measured as the percentage over ideal weight (X). The response to pain is measured by using the threshold of the nociceptive flexion reflex (Y), which is a measure of the pricking pain sensation in an individual. Measurements X and Y are considered to be normal. We are interested in an inference about the correlation between X and Y. The following data were obtained:

X    89    90    75    30    51    75    62    45    90    20
Y     2     3     4   4.5   5.5     7     9    13    15    14

(a) Using the results $n = 10$, $\sum_i X_i Y_i = 4461.5$, $\sum_i X_i = 627$, $\sum_i X_i^2 = 45141$, $\sum_i Y_i = 77$, $\sum_i Y_i^2 = 799.5$, calculate the Pearson coefficient of linear correlation, r.
(b) Test the hypothesis that the population coefficient of correlation, ρ, is 0, against the alternative $H_1: \rho < 0$. Use α = 0.05.
(c) Let the age Z (in years) of the individuals from the table be as follows (in the corresponding order): 20 18 23 19 44 51 36 47 60 55. Find the partial coefficient of correlation $r_{xy.z}$ if $r_{xz} = -0.2089$ and $r_{yz} = 0.8627$.
(d) Find a 95% confidence interval for ρ.
MATLAB AND WINBUGS FILES AND DATA SETS USED IN THIS CHAPTER
http://springer.bme.gatech.edu/Ch15.Corr/
corrs.m, errorscoke.m, fisherzsimu.m, histo.m, iriscorr.m, lemon.m,
nanoprism.m, ObesityPain.m, rabbits.m, spur.m, variouscorrs.m
corr.odc
nanoprism.dat
CHAPTER REFERENCES
Anderson, T. W., (1984). An Introduction to Multivariate Statistical Analysis, 2nd Ed. Wiley,
New York.
Arvin, D. V. and Spaeth, R. (1998). Trends in Indiana’s water use, 1986–1996. Indiana Department of Natural Resources Special Report, No. 1.
Brower, L. P. (1959). Speciation in butterflies of the Papilio glaucus group. I: Morphological
relationships and hybridizations. Evolution, 13, 40–63.
Ezekiel, M. and Fox, K. A. (1959). Methods of Correlation and Regression Analysis. Wiley,
New York.
Galin, M. A., Robbins, R., and Obstbaum, S. (1971). Prevention of vitreous loss. Brit. J.
Ophthal., 55, 533–537.
Hickish, T., Colston, K., Bland, J. M., and Maxwell, J. D. (1989). Vitamin D deficiency and
muscle strength in male alcoholics. Clin. Sci., 77, 171–176.
Kendall, M. (1938). A new measure of rank correlation. Biometrika, 30, 1–2, 81–89.
Khimich, S. (1997). Level of sensitivity of pain in patients with obesity. Acta Chir. Hung., 36,
166–167.
Martin, R., Ward, K., Slavin, G., Levi, J., and Peters, T. J. (1985). Alcoholic skeletal myopathy, a clinical and pathological study. Q. J. Med., 55, 233–251.
Misner, E. G. (1928). Studies of the relationship of weather to the production and price of
farm products. I: Corn, mimeographed publication, Cornell University, March 1928.
Sokoloff, A. (1966). Morphological variation in natural and experimental populations of
Drosophila pseudoobscura and Drosophila persimilis. Evolution, 20, 49–71.
Spearman, C. (1904). General intelligence: objectively determined and measured. Am. J.
Psychol., 15, 201–293.
Stichler, R. G., Richey, G. G., and Mandel, J. (1953). Measurement of treadwear of commercial tires. Rubber Age, 73, 2.
Worden, R. E. (1976). Pattern of muscle and nerve pathology in alcoholism. N.Y. Acad. Sci.,
273, 351–359.