# 9.7 Kruskal–Wallis one-way analysis of variance

CH 9

NON-PARAMETRIC TESTS

Table 9.9 Ranks are to the right of each data column. At the bottom of each rank column are
the total ranks, R. Tied ranks are marked with an asterisk (*)

Cats    Rank        Dogs    Rank        Humans  Rank
35      20          8       11.5*       2       3*
10      13          36      21          13      14
2       3*          42      22          1       1
25      17          30      18.5*       4       6.5*
30      18.5*       5       8           2       3*
6       9.5*        50      26          20      15.5*
4       6.5*        3       5           6       9.5*
48      24          47      23          8       11.5*
20      15.5*       49      25
        Rc = 127            Rd = 160            Rh = 64

did for the Mann–Whitney test. Again, we need to rank the values irrespective
of their group membership: the smallest rank of 1 is given to the human flea
height of 1 cm, and the largest rank of 26 is given to the dog flea height of
50 cm (Table 9.9).
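The ranking step can be checked in R. This is a minimal sketch rather than part of the original analysis: the object names `ranks` and `R_totals` are our own, and we rely on the default behaviour of `rank()`, which gives tied values the average of the ranks they occupy.

```r
# Flea jump heights (cm) from Table 9.9, in the order cats, dogs, humans
height <- c(35, 10, 2, 25, 30, 6, 4, 48, 20,
            8, 36, 42, 30, 5, 50, 3, 47, 49,
            2, 13, 1, 4, 2, 20, 6, 8)
animal <- factor(rep(c("Cat", "Dog", "Human"), times = c(9, 9, 8)))

# Rank all 26 values together, irrespective of group membership;
# ties receive the mean of the ranks they span (the default method)
ranks <- rank(height)

# Total rank for each group (Rc, Rd, Rh in Table 9.9)
R_totals <- tapply(ranks, animal, sum)
print(R_totals)
```

The three totals should agree with the bottom row of Table 9.9, and together they must add up to N(N + 1)/2, which is a handy check when ranking by hand.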
The statistic we calculate is known as H:
$$H = \frac{12}{N (N + 1)} \sum \frac{R_k^2}{n_k} - 3 (N + 1)$$


where the sum is taken over the k groups, R_k is the rank total and n_k the number of
fleas in each group, and N is the total number of fleas in the study.
$$H = \frac{12}{26 \times (26 + 1)} \left( \frac{127^2}{9} + \frac{160^2}{9} + \frac{64^2}{8} \right) - 3 \times (26 + 1)$$

$$H = 7.009496676$$
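The same arithmetic can be done in R directly from the rank totals. This is a sketch for checking the hand calculation only; the vector names `R` and `n` are our own labels for the rank totals and group sizes in Table 9.9.

```r
# Rank totals and group sizes taken from Table 9.9
R <- c(Cat = 127, Dog = 160, Human = 64)
n <- c(Cat = 9, Dog = 9, Human = 8)
N <- sum(n)                      # total number of fleas (26)

# H = 12 / (N (N + 1)) * sum(R_k^2 / n_k) - 3 (N + 1), before correcting for ties
H <- 12 / (N * (N + 1)) * sum(R^2 / n) - 3 * (N + 1)
print(H, digits = 10)
```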
This has to be corrected for ties. The correction factor is calculated by:

$$1 - \frac{\sum T_i}{N^3 - N}$$

where $T_i = t_i^3 - t_i$ and $t_i$ is the number of tied values in the ith group of tied scores.
For example, the first tied group comprises three values of two, giving
$T = 3^3 - 3 = 27 - 3 = 24$. The remaining five groups consist of pairs of ties, each
$T = 2^3 - 2 = 6$.

So that the correction factor is

$$1 - \frac{24 + 6 + 6 + 6 + 6 + 6}{26^3 - 26} = 0.996923077$$

The H value calculated earlier is then divided by this correction factor:

$$H = 7.009496676 / 0.996923077 = 7.031130925$$
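The tie correction is easy to get wrong by hand, so here is a short R sketch of it. It is an illustration only: the names `t_i`, `T_i` and `correction` are ours, and the uncorrected H is simply typed in from the calculation in the text.

```r
# Flea jump heights (cm) from Table 9.9
height <- c(35, 10, 2, 25, 30, 6, 4, 48, 20,
            8, 36, 42, 30, 5, 50, 3, 47, 49,
            2, 13, 1, 4, 2, 20, 6, 8)
N <- length(height)

# Count how often each value occurs; only values occurring more than once are ties
t_i <- as.vector(table(height))
t_i <- t_i[t_i > 1]

# T_i = t_i^3 - t_i for each group of ties, then the correction factor
T_i <- t_i^3 - t_i
correction <- 1 - sum(T_i) / (N^3 - N)

H_uncorrected <- 7.009496676     # value calculated in the text
H_corrected <- H_uncorrected / correction
print(c(correction = correction, H_corrected = H_corrected), digits = 10)
```

Because the correction factor can never exceed 1, dividing by it can only leave H unchanged or make it slightly larger.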
This value is equivalent to a χ² value with k − 1 degrees of freedom (here
df = 3 − 1 = 2), which can be compared with the critical values shown in
Table 8.1, or with those obtained from Excel or R. The R code and output are given in Box 9.4.
Box 9.4 R code for calculating medians, sample sizes, and performing the
Kruskal–Wallis one-way analysis of variance
> height <- c(35,10,2,25,30,6,4,48,20,8,36,42,30,5,50,3,47,49,2,13,1,
+             4,2,20,6,8)
> animal <- as.factor(c(rep("Cat",9),rep("Dog",9),rep("Human",8)))
> dat1 <- data.frame(height,animal)
> tapply(height,animal,median)          #gives medians
  Cat   Dog Human
   20    36     5
> tapply(height,animal,length)          #gives sample sizes
  Cat   Dog Human
    9     9     8
> kruskal.test(height~animal)           #Kruskal-Wallis test

        Kruskal-Wallis rank sum test

data:  height by animal
Kruskal-Wallis chi-squared = 7.0311, df = 2, p-value = 0.02973

> pchisq(7.031,df=2,lower.tail=FALSE)   #p value for chi-square= 7.031, 2 df
[1] 0.02973293
> qchisq(0.950, 2)                      #critical chi-squared 5% signif 2 df
[1] 5.991465

Results summary: A Kruskal–Wallis one-way analysis of variance compared the jumping heights of three different species of flea. Two or more
species were significantly different from each other, χ²(2) = 7.03, p = .030.
Dog fleas jumped the highest (median = 36 cm, n = 9) > cat fleas
(median = 20 cm, n = 9) > human fleas (median = 5 cm, n = 8). A boxplot
is given in Figure 9.4.

[Figure 9.4: boxplot. Vertical axis: Jump height in cm (0 to 50); horizontal axis: Species of host (Cat, Dog, Human).]

Figure 9.4 Boxplot of the jumping heights of three different species of flea.

With such a low jump height for human fleas, the experimenters now feel
fairly safe from catching the escaped flea.
We know from the results of the analysis that two or more flea species differ from each other, but we don’t know which ones. It is possible that all three
species differ from each other. We can use a Mann–Whitney test pairwise to
test for differences, that is, dog fleas versus cat fleas, dog fleas versus human
fleas, cat fleas versus human fleas. See if you can do them, and report the
results (our answers are in footnote 2). Since there are only three groups and the overall analysis was statistically significant, it is not necessary to adjust the procedure to guard against
Type I errors. However, if there are more than three groups then an adjustment should be made to the significance level using Bonferroni correction, so
that α = .05/number of comparisons. We did a single overall test (Kruskal–Wallis)
to see if there were any differences at all; had there been none, it would have
saved us the time of running lots of multiple comparisons.
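One way of running the three pairwise comparisons in R is sketched below. The helper `mann_whitney()` is a hypothetical convenience function of our own, not something supplied by R. It assumes that `wilcox.test()` is an acceptable implementation of the Mann–Whitney test and converts its W statistic to the smaller of the two possible U values. Because the data contain ties, R uses a normal approximation for the p values, so they need not agree exactly with p values from other software.

```r
height <- c(35, 10, 2, 25, 30, 6, 4, 48, 20,      # cat fleas
            8, 36, 42, 30, 5, 50, 3, 47, 49,      # dog fleas
            2, 13, 1, 4, 2, 20, 6, 8)             # human fleas
animal <- factor(rep(c("Cat", "Dog", "Human"), times = c(9, 9, 8)))
groups <- split(height, animal)

# Hypothetical helper: Mann-Whitney U (smaller of W and n1*n2 - W) and p value
mann_whitney <- function(x, y) {
  # suppressWarnings(): ties make R warn that an exact p value is unavailable
  res <- suppressWarnings(wilcox.test(x, y))
  w <- unname(res$statistic)
  c(U = min(w, length(x) * length(y) - w), p = res$p.value)
}

results <- rbind(
  dog_vs_cat   = mann_whitney(groups$Dog, groups$Cat),
  dog_vs_human = mann_whitney(groups$Dog, groups$Human),
  cat_vs_human = mann_whitney(groups$Cat, groups$Human)
)
print(round(results, 3))

# Bonferroni-adjusted significance level, for designs where an adjustment is wanted
alpha_bonferroni <- 0.05 / nrow(results)
```

With only three groups the adjusted level is shown purely for illustration; with four groups there would be six comparisons and the level would become .05/6.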
A parametric one-way ANOVA produces F(2, 23) = 4.80, p = .018, a value that
is slightly more statistically significant than that obtained above. However, we
decided to use the non-parametric test because the data in each group did not
look normally distributed. By using a non-parametric test we were adopting
a cautious approach.
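For comparison, the parametric one-way ANOVA quoted above can be reproduced with a few lines of R. This is a sketch under the assumption that an ordinary fixed-effects one-way ANOVA was used; we fit it with `lm()` and read the F ratio from the `anova()` table.

```r
height <- c(35, 10, 2, 25, 30, 6, 4, 48, 20,
            8, 36, 42, 30, 5, 50, 3, 47, 49,
            2, 13, 1, 4, 2, 20, 6, 8)
animal <- factor(rep(c("Cat", "Dog", "Human"), times = c(9, 9, 8)))

# One-way ANOVA of jump height on host species
tab <- anova(lm(height ~ animal))
print(tab)

F_value <- tab[["F value"]][1]   # F ratio for the species effect
p_value <- tab[["Pr(>F)"]][1]    # its p value
```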

Footnote 2: Only the comparison of dog fleas with human fleas was statistically significant. Dog fleas (median = 36,
N = 9) jumped significantly higher than human fleas (median = 5, N = 8), U = 11.5, p = .015. Dog
fleas were not significantly different from cat fleas (median = 20, N = 9), U = 26.5, p = .222. The difference between cat fleas
and human fleas narrowly missed statistical significance, U = 16.5, p = .059.

9.8 Friedman test for correlated samples
This test is the non-parametric equivalent of the repeated measures analysis mentioned in Chapter 6. It is used where we have more than two treatments and
where the scores are related across the different treatments. Typically, the
scores are related because the same participants are tested in each of the different treatments. We could imagine that our surgeons and coffee example
above had another treatment, decaffeinated coffee (Table 9.10). This would
act as a more appropriate placebo than hot water since it would (presumably)
taste and look the same as the coffee treatment. The order in which the three
treatments were administered would also need to be randomised or counterbalanced to control for order effects.
The ranking is then performed within each participant’s set of scores, with
the highest value for each participant ranked 3 and the lowest ranked 1
(Table 9.11).

Table 9.10 The data from Table 9.6 with a decaffeinated
coffee treatment added (third column)

Surgeon    Hot water    Decaf coffee    Coffee
1          2            3               8
2          3            1               4
3          1            3               7
4          0            8               9
5          4            0               10
6          2            0               6
7          10           2               1
8          1            2               10

Table 9.11 Data from Table 9.10 converted to ranks within
each participant

Surgeon    Hot water    Decaf coffee    Coffee
1          1            2               3
2          2            1               3
3          1            2               3
4          1            2               3
5          2            1               3
6          2            1               3
7          3            2               1
8          1            2               3
Totals     13           13              22
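The within-participant ranking can be reproduced in R as a check on Table 9.11. This is a minimal sketch: `within_ranks` and `rank_totals` are names of our own choosing, and the data frame mirrors the three treatment columns of Table 9.10.

```r
# Hand tremor ratings from Table 9.10, one row per surgeon
coffee3 <- data.frame(
  hotwater   = c(2, 3, 1, 0, 4, 2, 10, 1),
  decaffeine = c(3, 1, 3, 8, 0, 0, 2, 2),
  coffee     = c(8, 4, 7, 9, 10, 6, 1, 10)
)

# Rank the three scores within each surgeon (row). apply() returns the
# surgeons as columns, so t() turns the result back into one row per surgeon
within_ranks <- t(apply(coffee3, 1, rank))
print(within_ranks)

# Rank totals for each treatment (bottom row of Table 9.11)
rank_totals <- colSums(within_ranks)
print(rank_totals)
```

Each surgeon's ranks must add up to 1 + 2 + 3 = 6, so the three totals together should come to 8 × 6 = 48.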

Friedman’s χ² can be calculated using the following formula:

$$\chi_F^2 = \left( \frac{12}{N \times k \times (k + 1)} \sum R_i^2 \right) - \left( 3 \times N \times (k + 1) \right)$$

where N is the sample size, k is the number of treatments and $R_i^2$ are each of
the treatment rank totals squared. So we get
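$$\chi_F^2 = \left( \frac{12}{8 \times 3 \times (3 + 1)} \times (13^2 + 13^2 + 22^2) \right) - \left( 3 \times 8 \times (3 + 1) \right)$$

$$= \left( \frac{12}{96} \times 822 \right) - 96$$

$$= 6.75$$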
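As a check on the hand calculation, the formula can be evaluated in R from the rank totals in Table 9.11. The object names below are our own, and the sketch assumes there are no tied scores within any surgeon, which is true of these data.

```r
# Treatment rank totals from Table 9.11
R <- c(hotwater = 13, decaffeine = 13, coffee = 22)
N <- 8    # number of surgeons
k <- 3    # number of treatments

# Friedman's chi-squared: 12 / (N k (k + 1)) * sum(R_i^2) - 3 N (k + 1)
chi_F <- 12 / (N * k * (k + 1)) * sum(R^2) - 3 * N * (k + 1)
print(chi_F)
```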
This is equivalent to a χ² value and should again be evaluated on k − 1 =
3 − 1 = 2 degrees of freedom, either using Table 8.1 or with Excel or R. The
R code and outputs are given in Box 9.5.
Box 9.5 R code for entering data, obtaining medians and sample size, and running
the Friedman test for related samples
> hotwater <- c(2,3,1,0,4,2,10,1)              #enter data
> decaffeine <- c(3,1,3,8,0,0,2,2)
> coffee <- c(8,4,7,9,10,6,1,10)
> coffee3 <- data.frame(hotwater,decaffeine,coffee)
> sapply(coffee3,median)                       #gives medians
  hotwater decaffeine     coffee
       2.0        2.0        7.5
> sapply(coffee3,length)                       #gives number in each treatment
  hotwater decaffeine     coffee
         8          8          8
> friedman.test(as.matrix(coffee3))            #Friedman test, note as.matrix

        Friedman rank sum test

data:  as.matrix(coffee3)
Friedman chi-squared = 6.75, df = 2, p-value = 0.03422

> pchisq(6.75,df=2,lower.tail=FALSE)           #p value for chi-square=6.75
[1] 0.03421812
> qchisq(0.950, 2)                             #critical chi-squared for 2 df
[1] 5.991465

Results summary: A Friedman test found a statistically significant difference between two or more drink treatments in the amount of hand tremor
exhibited by the surgeons, χ²(2) = 6.75, N = 8, p = .034. Figure 9.5 gives a
boxplot of the data. The coffee treatment produced the highest ratings of
tremor (median = 7.5) compared with hot water (median = 2) and decaffeinated coffee (median = 2).
[Figure 9.5: boxplot. Vertical axis: Hand tremor rating 0–10; horizontal axis: Type of drink (Hot water, Decaff coffee, Coffee).]

Figure 9.5 Boxplot of the effects of different drinks on the amount of hand tremor in
surgeons. The coffee treatment results in the highest tremor ratings (scale 0–10).

A parametric analysis of variance came to a similar conclusion, F(2, 14) = 4.21,
p = .037, close to the p value obtained with the non-parametric test. We
used the Friedman test here because in the paired samples test we identified
an outlier difference. We are again being cautious and also preventing any
outliers from interfering with a parametric test (where the variance would be
elevated).
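The parametric result quoted above can be reproduced in R along the following lines. This is only a sketch of one possible approach, since the text does not say how the analysis was run: we assume a one-way repeated measures ANOVA with no sphericity correction, which for a balanced design gives the same F ratio as a linear model with surgeon entered as a blocking factor. The data frame `long` and its column names are our own.

```r
hotwater   <- c(2, 3, 1, 0, 4, 2, 10, 1)
decaffeine <- c(3, 1, 3, 8, 0, 0, 2, 2)
coffee     <- c(8, 4, 7, 9, 10, 6, 1, 10)

# Reshape to long format: one row per surgeon-by-drink score
long <- data.frame(
  score   = c(hotwater, decaffeine, coffee),
  drink   = factor(rep(c("hotwater", "decaffeine", "coffee"), each = 8)),
  surgeon = factor(rep(1:8, times = 3))
)

# Surgeon is a blocking factor; the drink effect is tested against the residual
tab <- anova(lm(score ~ surgeon + drink, data = long))
print(tab)

F_drink <- tab["drink", "F value"]
p_drink <- tab["drink", "Pr(>F)"]
```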
For multiple comparisons the non-parametric paired samples test should
be used to compare treatments (e.g. coffee versus hot water, coffee versus
decaffeinated, hot water versus decaffeinated). If there are more than three
treatments, then it will be necessary to correct the significance level using
Bonferroni correction. As with the Kruskal–Wallis test, we did a single
overall test to see if there were any differences at all, and if there weren’t
any, it saved us the time of running lots of multiple comparisons.
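The pairwise comparisons between drinks might be run in R as sketched below. We assume here that the non-parametric paired samples test intended is the Wilcoxon signed-rank test, run with `wilcox.test(..., paired = TRUE)`, which reports the statistic V. The helper `signed_rank()` is a hypothetical function of our own. Because the differences contain ties, R falls back on a normal approximation for the p values.

```r
hotwater   <- c(2, 3, 1, 0, 4, 2, 10, 1)
decaffeine <- c(3, 1, 3, 8, 0, 0, 2, 2)
coffee     <- c(8, 4, 7, 9, 10, 6, 1, 10)

# Hypothetical helper: Wilcoxon signed-rank statistic V and p value for one pair
signed_rank <- function(x, y) {
  # suppressWarnings(): tied differences make R warn that the p value is approximate
  res <- suppressWarnings(wilcox.test(x, y, paired = TRUE))
  c(V = unname(res$statistic), p = res$p.value)
}

results <- rbind(
  coffee_vs_hotwater = signed_rank(coffee, hotwater),
  coffee_vs_decaf    = signed_rank(coffee, decaffeine),
  hotwater_vs_decaf  = signed_rank(hotwater, decaffeine)
)
print(round(results, 3))

# Bonferroni-adjusted significance level, for designs with more than three treatments
alpha_bonferroni <- 0.05 / nrow(results)
```

V is the sum of the ranks of the positive differences (first treatment minus second), so its value depends on the order in which the two treatments are given.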

9.9 Conclusion
All these non-parametric tests can be quite easily done manually, and more
easily using a spreadsheet programme like Excel. One of the difficulties in
doing the calculations by hand is that of obtaining 95% confidence intervals,
although many statistical packages will provide these. A drawback of non-parametric
procedures is that they can only analyse data for one independent
variable. Two-way designs (like those we saw in Chapter 6) are not possible.
This is a distinct drawback of these methods since many studies do involve
more than one factor/variable. It means that multiple independent, within-
and/or between-subject variables cannot be studied in the same design. Hence
interactions between variables cannot easily be detected and examined. Instead,
the comparisons have to be made individually (multiple comparisons) using Mann–Whitney
or paired sample tests. Alternatively, a successful transform of the data (e.g.
logarithmic, see Chapter 6) to satisfy parametric assumptions would allow the
use of parametric tests.
In general, non-parametric tests are useful for small data sets (sample size of
less than 30) where few assumptions need to be made about the population(s)
from which the data came.

9.10 Summary
• Non-parametric tests are useful for smaller data sets where we are unable
  or unwilling to vouch for assumptions required for parametric tests (normality
  of population, homogeneity of variances). The appropriate central
  tendency statistic is the median rather than the mean.
• A corresponding non-parametric test is available for each parametric test,
  except for a multi-way ANOVA, which means that the simultaneous
  effects of two or more factors cannot be studied in the same analysis. It also
  means that we are unable to easily detect or examine interactions between
  the two or more factors.
• The sign test can be done by hand with a calculator or Excel.
• Similarly, the Kruskal–Wallis and Friedman tests can be done manually quite
  easily, both resulting in a χ² statistic which can be looked
  up in a table, or the critical value can be obtained from Excel (e.g.
  =CHISQ.INV(0.95,1) for 2010 version or =CHIINV(0.05,1) for previous
  versions) or R (e.g. qchisq(0.950, 1)). Or better, the exact p value
  can be obtained using =CHIDIST(value,df) or =CHISQ.DIST.RT(value,df) in Excel,
  or pchisq(value,df=DF,lower.tail=FALSE) in R.
• It is more problematic doing the Mann–Whitney and Wilcoxon tests manually
  since they result in less well-known statistics (U and V). With larger
  samples (N > 20 for each treatment), z can be calculated and compared
  with the critical value 1.96 or larger, depending on the significance level
  required. It is easier to use a computer package which will give exact
  p values.
• Computer packages will also calculate 95% confidence intervals which are
  difficult to do manually.
• When more than two treatments (levels) are included then multiple
  comparisons can be performed pairwise using the corresponding non-parametric
  independent samples and paired samples tests. If more than
  three treatments are present, then a Bonferroni correction to the significance
  level should be applied.


10 Resampling Statistics comes of Age – ‘There’s always a Third Way’

10.1 Aims
Although resampling techniques have been around for decades, their popularity was initially restricted by a lack of computing power sufficient to run
thousands of iterations of a data set. Nowadays, with more powerful modern
computers and the rapid expansion of genomics, the acquisition of extremely
large data sets is reasonably common and requires statistics beyond the scope
of classical parametric or non-parametric tests. This chapter is intended simply as an introduction to the subject and will explain how resampling statistics
are an attractive alternative as they can overcome these limitations by using
non-parametric techniques without a loss of statistical power. This chapter
will be of particular relevance to biomedical sciences students, and psychology students interested in the field of cognitive neuroscience.

10.2 The age of information
DNA microarrays heralded the era of genomics and previously unimaginably large data sets. Instead of meticulously working on one gene, laboriously
investigating the factors regulating its expression, its expression patterns and
the relationship of mRNA to protein levels (if any), researchers now investigate thousands of genes at a time. And not only their expression. Next-generation
sequencing costs have diminished substantially and are now within the
reach of many labs. These techniques have changed the way we perceive the
genome. We now know that most of it is actually transcribed and it’s not ‘junk’.
Non-coding RNA is active; it does things like regulate coding genes. So suddenly
we have vast amounts of data, tens of thousands of numbers to analyse,
not just a few dozen. Similarly, cognitive neuroscientists are now able to image
activity of the whole brain using techniques such as functional magnetic resonance
imaging (fMRI) and positron emission tomography (PET) scanning.
That’s around 86 billion neurons and as many glial cells (Azevedo et al., 2009).
Although cells are not scanned individually using these techniques, the sheer
number of cells present gives some idea of the volume and complexity of data
that are generated. OK, so we have computers, but we need alternative statistical
methods of analysing these data, because the conventional ones are
not fit for purpose. The late David Carradine (of Kill Bill fame) is quoted as
saying ‘There’s an alternative. There’s always a third way, and it’s not a combination
of the other two ways. It’s a different way’. We refer to resampling
as the third way, the new statistics.

Starting Out in Statistics: An Introduction for Students of Human Health, Disease, and Psychology
First Edition. Patricia de Winter and Peter M. B. Cahusac.
© 2014 John Wiley & Sons, Ltd. Published 2014 by John Wiley & Sons, Ltd.
Companion Website: www.wiley.com/go/deWinter/spatialscale

Methods that generate large volumes of data pose a challenge for analysis.
Techniques for optical data acquisition such as fluorescence intensity measurement (collected as pixel counts) or medical scans (collected as voxels –
think of it as a three-dimensional pixel), invariably involve collection of data
that are simply background noise. What do we mean by this? Well, it means
that you will measure something even if the thing you are trying to measure
is absent. This is admirably demonstrated by Bennett et al. (2010) who performed an fMRI scan on a dead salmon to which they showed photographs of
humans, asking it to determine which emotions were being experienced. The
processed scan data reported active voxel clusters in the salmon’s brain and
spinal column. Clearly, a dead salmon does not have any brain activity and
is certainly not able to perform a cognitive task – the results were what we
call a false positive, arising from inadequate or absent correction for multiple testing. Space restrictions prevent us from analysing such massive amounts of data here, so we will use small data sets as examples to demonstrate the principles of resampling methods, which may then be applied to
more complex data such as those encountered in the real world.

10.3 Resampling
Resampling statistics is a general term that covers a wide variety of statistical methods, of which we are going to introduce two: randomisation tests and
bootstrapping. These methods are particularly useful for, but not restricted to,
large data sets that are not amenable to conventional tests, such as genomics
data. Recently, interest has also been aroused in using bootstrapping for the
analysis of fMRI data (e.g. Bellec et al., 2010). In this chapter, we will first
demonstrate how randomisation tests and bootstrapping work and then we
will discuss their application in the above fields.