6.11 Comparing multiple means – the principles of analysis of variance
Table 6.4 Systolic blood pressure (mmHg) following administration of either no treatment or red, white or blue placebo pills for 2 weeks to subjects with mild hypertension, n = 8 per group

No pills   White pills   Red pills   Blue pills
139        142           151         123
139        133           145         136
144        143           144         125
149        139           150         123
144        123           145         124
143        138           146         130
150        138           156         134
140        134           145         124
of treatment, although if we really were to conduct such a study we would
require a much larger sample size (Table 6.4). Because we have used different subjects, the four samples are independent of each other.
Systolic blood pressure varies and innumerable transient factors can affect
its values including mood, anxiety, time of day, exercise, caffeine, alcohol, and
white coat effects. The white coat effect refers to the fact that, in some people, simply having their blood pressure measured, whether or not the person measuring it is wearing a white coat, increases its value. So the variability in blood pressure
measurement could be due to any of these things, which we call random factors, or it could be due to the treatment that we apply. We want to know if
there is an effect of the factor ‘treatment’ over and above the variation due to
random factors. In the first instance, we simply ignore the fact that we have
different treatments and we pool all the observations together. We want to
separate the variation due to the fixed factor treatment (meaning all treatments together) from that due to all the random factors, which is everything
else. We call the variation in the data due to random factors the error or residual variation. Here we are going to use the term ‘error’.
The theory behind an ANOVA can be modelled by Pythagoras’ theorem for right-angled triangles, a² + b² = c² (Efron, 1978). Before you allow the previous sentence to reduce you to jelly, take a look at Figure 6.6. Now imagine that the area c² represents the total variation in the data and its value is 25 square units, the area a² represents the variation due to treatment with a value of 16 square units, and the area b² represents the variation due to error (random factors) with a value of 9 square units; then 16 + 9 = 25. There is more variation due to treatment (16 square units) than there is due to error (9 square units). Where b is short relative to a, its square will be a smaller proportion of the total area (variation), and where b is long relative to a, its square will constitute a larger proportion of the total area (variation).
So how do we use a geometric model, Pythagoras’ theorem, in analysis of variance?

Figure 6.6 Pythagoras’ theorem. A right-angled triangle has sides of length a, b and c, with c being the longest side or hypotenuse (left image). If we square each side (multiply the length of each side by itself) we will have three squares, a², b² and c² (right image). If we add together the areas a² and b² they will equal the area of c². We can test this with real numbers. If a = 4, b = 3 and c = 5, then 4² + 3² = 5², or 16 + 9 = 25.

ANOVA assumes that the observations deviate from the grand mean (overall arithmetic mean) because of variation attributable both to the effect of the fixed factor and to random factors. This is how we do it:
Step 1. Calculate the grand mean by pooling all the observations together
and ignoring the fact that they belong to different treatment groups:
The sum of the observations is 4439.
Grand mean = 4439 ∕ 32 = 138.7188
Step 2. Subtract the grand mean from each observation – these are the deviations – then square each deviation. Next, sum (add up) the squared deviations:
Treatment     Observation − grand mean   Deviation   Deviation squared
No pills      139 − 138.7188               0.2812        0.0791
No pills      139 − 138.7188               0.2812        0.0791
No pills      144 − 138.7188               5.2812       27.8911
No pills      149 − 138.7188              10.2812      105.7031
No pills      144 − 138.7188               5.2812       27.8911
No pills      143 − 138.7188               4.2812       18.3287
No pills      150 − 138.7188              11.2812      127.2655
No pills      140 − 138.7188               1.2812        1.6415
White pills   142 − 138.7188               3.2812       10.7663
White pills   133 − 138.7188              −5.7188       32.7047
White pills   143 − 138.7188               4.2812       18.3287
White pills   139 − 138.7188               0.2812        0.0791
White pills   123 − 138.7188             −15.7188      247.0807
White pills   138 − 138.7188              −0.7188        0.5167
White pills   138 − 138.7188              −0.7188        0.5167
White pills   134 − 138.7188              −4.7188       22.2671
Red pills     151 − 138.7188              12.2812      150.8279
Red pills     145 − 138.7188               6.2812       39.4535
Red pills     144 − 138.7188               5.2812       27.8911
Red pills     150 − 138.7188              11.2812      127.2655
Red pills     145 − 138.7188               6.2812       39.4535
Red pills     146 − 138.7188               7.2812       53.0159
Red pills     156 − 138.7188              17.2812      298.6399
Red pills     145 − 138.7188               6.2812       39.4535
Blue pills    123 − 138.7188             −15.7188      247.0807
Blue pills    136 − 138.7188              −2.7188        7.3919
Blue pills    125 − 138.7188             −13.7188      188.2055
Blue pills    123 − 138.7188             −15.7188      247.0807
Blue pills    124 − 138.7188             −14.7188      216.6431
Blue pills    130 − 138.7188              −8.7188       76.0175
Blue pills    134 − 138.7188              −4.7188       22.2671
Blue pills    124 − 138.7188             −14.7188      216.6431

Total sum of squares: 2638.4688
We now have the total sum of squares (SS), 2638.4688. This is the total variation in the data and is represented by area c² in Figure 6.6.
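Steps 1 and 2 are straightforward to verify in code. The following sketch (plain Python, no libraries; the group lists are simply the columns of Table 6.4) pools the 32 observations, computes the grand mean and sums the squared deviations:

```python
# Blood pressure observations from Table 6.4 (mmHg), n = 8 per group
groups = {
    "none":  [139, 139, 144, 149, 144, 143, 150, 140],
    "white": [142, 133, 143, 139, 123, 138, 138, 134],
    "red":   [151, 145, 144, 150, 145, 146, 156, 145],
    "blue":  [123, 136, 125, 123, 124, 130, 134, 124],
}

# Step 1: pool all observations and compute the grand mean
pooled = [x for obs in groups.values() for x in obs]
grand_mean = sum(pooled) / len(pooled)           # 4439 / 32 = 138.71875

# Step 2: sum of squared deviations from the grand mean = total SS
total_ss = sum((x - grand_mean) ** 2 for x in pooled)

print(round(grand_mean, 4), round(total_ss, 4))  # 138.7188 2638.4688
```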
Step 3. Calculate the arithmetic mean of each sample (use the data in Table 6.4 if that is easier):
No pills − the sum of the observations is 1148.
Mean = 1148 ∕ 8 = 143.500
White pills − the sum of the observations is 1090.
Mean = 1090 ∕ 8 = 136.250
Red pills − the sum of the observations is 1182.
Mean = 1182 ∕ 8 = 147.750
Blue pills − the sum of the observations is 1019.
Mean = 1019 ∕ 8 = 127.375
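The four sample means can be reproduced with a line of Python (a sketch; the dictionary of groups mirrors Table 6.4):

```python
# Group observations as in Table 6.4
groups = {
    "none":  [139, 139, 144, 149, 144, 143, 150, 140],
    "white": [142, 133, 143, 139, 123, 138, 138, 134],
    "red":   [151, 145, 144, 150, 145, 146, 156, 145],
    "blue":  [123, 136, 125, 123, 124, 130, 134, 124],
}

# Step 3: arithmetic mean of each sample
means = {name: sum(obs) / len(obs) for name, obs in groups.items()}
print(means)  # {'none': 143.5, 'white': 136.25, 'red': 147.75, 'blue': 127.375}
```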
Step 4. For each treatment group, subtract the sample mean calculated in Step 3 from the grand mean we obtained in Step 1, square the answer, and multiply by 8 (as there are eight observations per group).

Using: (grand mean − sample mean)² ∗ 8

No pills
(138.7188 − 143.500)² ∗ 8 = (−4.781)² ∗ 8 = 22.860 ∗ 8 = 182.879
White pills
(138.7188 − 136.250)² ∗ 8 = (2.469)² ∗ 8 = 6.095 ∗ 8 = 48.760

Red pills
(138.7188 − 147.750)² ∗ 8 = (−9.031)² ∗ 8 = 81.563 ∗ 8 = 652.501

Blue pills
(138.7188 − 127.375)² ∗ 8 = (11.344)² ∗ 8 = 128.682 ∗ 8 = 1029.454
Step 5. Sum the four answers obtained in Step 4.
182.879 + 48.760 + 652.501 + 1029.454 = 1913.594 or 1913.6
This is the treatment sum of squares. It is the variation in the data due to the fixed factor (treatment) and is represented by area a² in Figure 6.6.
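Steps 4 and 5 condense to a single sum in code. A sketch in Python (group data as in Table 6.4):

```python
# Group observations as in Table 6.4
groups = {
    "none":  [139, 139, 144, 149, 144, 143, 150, 140],
    "white": [142, 133, 143, 139, 123, 138, 138, 134],
    "red":   [151, 145, 144, 150, 145, 146, 156, 145],
    "blue":  [123, 136, 125, 123, 124, 130, 134, 124],
}
grand_mean = 4439 / 32  # from Step 1, 138.71875

# Steps 4-5: n * (grand mean - sample mean)^2, summed over the groups
ss_treatment = sum(
    len(obs) * (grand_mean - sum(obs) / len(obs)) ** 2
    for obs in groups.values()
)
print(round(ss_treatment, 1))  # 1913.6
```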
Step 6. At this point we could cheat and subtract the treatment sum of squares from the total sum of squares to obtain the error sum of squares, but we are going to show you how to calculate the error sum of squares directly from the data. You can check that you get the same result either way. For each treatment group, subtract the sample mean from each observation – these are the deviations – then square each deviation and sum (add up) the squared deviations for that sample:
No pills
Observation − sample mean   Deviation   Deviation squared
139 − 143.50                 −4.500       20.2500
139 − 143.50                 −4.500       20.2500
144 − 143.50                  0.500        0.2500
149 − 143.50                  5.500       30.2500
144 − 143.50                  0.500        0.2500
143 − 143.50                 −0.500        0.2500
150 − 143.50                  6.500       42.2500
140 − 143.50                 −3.500       12.2500
Sum:                                     126.0000
White pills
Observation − sample mean   Deviation   Deviation squared
142 − 136.25                  5.750       33.0625
133 − 136.25                 −3.250       10.5625
143 − 136.25                  6.750       45.5625
139 − 136.25                  2.750        7.5625
123 − 136.25                −13.250      175.5625
138 − 136.25                  1.750        3.0625
138 − 136.25                  1.750        3.0625
134 − 136.25                 −2.250        5.0625
Sum:                                     283.5000
Red pills
Observation − sample mean   Deviation   Deviation squared
151 − 147.75                  3.250       10.5625
145 − 147.75                 −2.750        7.5625
144 − 147.75                 −3.750       14.0625
150 − 147.75                  2.250        5.0625
145 − 147.75                 −2.750        7.5625
146 − 147.75                 −1.750        3.0625
156 − 147.75                  8.250       68.0625
145 − 147.75                 −2.750        7.5625
Sum:                                     123.5000
Blue pills
Observation − sample mean   Deviation   Deviation squared
123 − 127.375                −4.375       19.1406
136 − 127.375                 8.625       74.3906
125 − 127.375                −2.375        5.6406
123 − 127.375                −4.375       19.1406
124 − 127.375                −3.375       11.3906
130 − 127.375                 2.625        6.8906
134 − 127.375                 6.625       43.8906
124 − 127.375                −3.375       11.3906
Sum:                                     191.8750
Step 7. Sum the four answers obtained in Step 6:
126.0000 + 283.5000 + 123.5000 + 191.8750 = 724.875, or 724.9
This is the error sum of squares (SS). It is the variation in the data due to random factors (error) and is represented by area b² in Figure 6.6. We have now partitioned the total variation into that due to treatment and that due to error.
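Steps 6 and 7 can likewise be checked in code, along with the 'cheat' mentioned in Step 6 – that total SS minus error SS equals treatment SS. A Python sketch:

```python
# Group observations as in Table 6.4
groups = {
    "none":  [139, 139, 144, 149, 144, 143, 150, 140],
    "white": [142, 133, 143, 139, 123, 138, 138, 134],
    "red":   [151, 145, 144, 150, 145, 146, 156, 145],
    "blue":  [123, 136, 125, 123, 124, 130, 134, 124],
}

# Steps 6-7: squared deviations from each group's own mean, summed
ss_error = sum(
    sum((x - sum(obs) / len(obs)) ** 2 for x in obs)
    for obs in groups.values()
)

# Check the Pythagoras-style partition: total SS = treatment SS + error SS
pooled = [x for obs in groups.values() for x in obs]
grand_mean = sum(pooled) / len(pooled)
ss_total = sum((x - grand_mean) ** 2 for x in pooled)
print(round(ss_error, 3), round(ss_total - ss_error, 3))  # 724.875 1913.594
```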
Step 8. Calculate the variances. In an ANOVA the variances are equivalent
to the mean squares (MS). The sum of squares must be corrected for
the degrees of freedom (df) to produce the mean squares.
The df for the total is the number of replicates − 1, that is, 32 − 1 = 31.
The df for treatments is the number of treatments − 1, i.e. 4 − 1 = 3.
The error df is total df − treatments df, i.e. 31 − 3 = 28.
To obtain the mean squares (variances) for the treatments and error
we need to divide the sum of squares from Steps 5 and 7 by the corresponding degrees of freedom above:
Treatment MS = Treatment SS/Treatment df = 1913.6 ∕ 3 = 637.9
Error MS = Error SS/Error df = 724.9 ∕ 28 = 25.9
The treatment mean squares and the error mean squares are different estimates of the population variance. The hypothesis we are testing is whether the variance associated with the treatments is so large, compared with the error variance, that it could not have occurred if the null hypothesis were correct.
Step 9. Calculate the test statistic, F:
F = Treatment MS∕error MS
= 637.9 ∕ 25.9
= 24.64
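Steps 8 and 9 in code (a sketch using the unrounded sums of squares from the previous steps):

```python
# Sums of squares (unrounded) and degrees of freedom from Steps 1-7
ss_treatment, df_treatment = 1913.59375, 3   # number of treatments - 1
ss_error, df_error = 724.875, 28             # total df (31) - treatment df (3)

# Step 8: mean squares (variances) = SS / df
ms_treatment = ss_treatment / df_treatment
ms_error = ss_error / df_error

# Step 9: the F ratio
f_ratio = ms_treatment / ms_error
print(round(ms_treatment, 1), round(ms_error, 1), round(f_ratio, 2))  # 637.9 25.9 24.64
```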
The variance ratio is called F and is the test statistic for an ANOVA. We should be able to see that when the treatment MS is large and the error MS is small, F will be large: most of the variation in the data is due to treatment, so there will be a difference in group means. When the null hypothesis is true, the variation from the treatments will be similar in magnitude to that from error and the F ratio will be approximately equal to 1. When the treatment MS is small and the error MS is large, F will be small: the variation in the data is due as much to random factors as to any treatment difference. As the mean squares are our measures of variance, if the null hypothesis is true we would expect the variance due to the treatments to be no different from the remainder attributable to random error. In other words, the F ratio would be about 1 if the null hypothesis were correct.
So, where there is an effect of the treatments we would anticipate that treatment MS would be larger than error MS. But how much larger does it have to
be to reject the null hypothesis? As we saw for the t-test the critical values of
F have been tabulated (although nowadays these reside within the computer
Table 6.5 Statistical table for the F distribution for selected degrees of freedom (𝛼 = .05)

                      Numerator df
Denominator df     1      2      3      4      5
21               4.32   3.47   3.07   2.84   2.68
22               4.30   3.44   3.05   2.82   2.66
23               4.28   3.42   3.03   2.80   2.64
24               4.26   3.40   3.01   2.78   2.62
25               4.24   3.38   2.99   2.76   2.60
26               4.22   3.37   2.98   2.74   2.59
27               4.21   3.35   2.96   2.73   2.57
28               4.20   3.34   2.95   2.71   2.56
29               4.18   3.33   2.93   2.70   2.54

The value 2.95 (numerator df = 3, denominator df = 28, shaded in the original) is the critical value of F for the one-way ANOVA performed on data for the effect of placebo pills on blood pressure.
software) and the magnitude of the critical value of F at 𝛼 = .05 varies with the degrees of freedom, both those for the treatments and for the error. As we anticipate that F will increase if the null hypothesis is false, the numerator is always the treatment MS and the denominator the error MS. Unlike the t-test, we do not have a choice of one-tailed or two-tailed tests depending on the null hypothesis being tested; the null hypothesis is already determined for us.
The pertinent section of an F table for 𝛼 = .05 shows that with our degrees of freedom (3 for the numerator or treatment MS, and 28 for the denominator or error MS) the critical value of F is 2.95 (Table 6.5). So in this case, to reject the null hypothesis and conclude that the treatments had an effect, we require approximately three times as much variance associated with the treatments (fixed factor) as with the error. Our calculated F ratio is much bigger than this at 24.64, so p < .05 and we would reject the null hypothesis that every sample is drawn from a population(s) with the same population mean. Note that this does not tell us whether the samples are drawn from two or more populations with the same population mean or whether they are all samples from the same population with a single population mean. Note also that the smaller the sample size, and therefore the error degrees of freedom, the larger the critical value of F. For smaller samples the effect of the treatments on the variance has to be much larger to attain statistical significance. We could report this F ratio as F3,28 = 24.64, where by convention the subscripted numbers are the treatment and error degrees of freedom respectively.
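In practice the F ratio, its p-value and the critical value are obtained from software. A sketch using SciPy (an assumption – the text itself uses Minitab, and any statistics package would do):

```python
from scipy import stats

# Group observations as in Table 6.4
none  = [139, 139, 144, 149, 144, 143, 150, 140]
white = [142, 133, 143, 139, 123, 138, 138, 134]
red   = [151, 145, 144, 150, 145, 146, 156, 145]
blue  = [123, 136, 125, 123, 124, 130, 134, 124]

# One-way ANOVA: F ratio and p-value in one call
f_ratio, p_value = stats.f_oneway(none, white, red, blue)
print(round(f_ratio, 2))   # 24.64
print(p_value < 0.001)     # True

# Critical value of F at alpha = .05 for 3 and 28 df (cf. Table 6.5)
f_crit = stats.f.ppf(0.95, dfn=3, dfd=28)
print(round(f_crit, 2))    # 2.95
```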
The results of our ANOVA tell us that there is an effect of treatment, at
least one of the placebos alters blood pressure, but it does not tell us which
particular treatments are different from each other. To find this out we need
to compare the groups using a comparison test. There are several tests that
can be used in conjunction with an ANOVA and which we use depends
partly on the comparisons we wish to make. We will briefly describe how to
interpret two commonly used tests, Tukey’s and Dunnett’s. Both are based on
the t-test, but unlike Bonferroni’s correction, these tests correct for multiple
comparisons by adjusting 𝛼 in a less conservative way.
6.11.1 Tukey’s honest significant difference test
Tukey’s post hoc test can compare all possible pairs of means, where the overall (family) value for 𝛼 is .05 and the value of 𝛼 for each comparison is determined by the number of treatments, k, the error degrees of freedom, v, and a
test statistic called q. As long as the test assumptions are met (homogeneity
of variances, independence and normality) it is a robust test that maintains 𝛼
at intended values.
The output from Minitab for a one-way ANOVA is reported in Box 6.2.
The first part is the ANOVA itself, which is what we calculated above. Below
the ANOVA results are the sample descriptive statistics and 95% confidence
intervals. It looks like the white pills and blue pills might have lowered the
systolic blood pressure as the means are lower than no treatment and the
confidence intervals don’t overlap, but Tukey’s test below will tell us for
sure. Recall that the 95% confidence intervals estimate the location of the
population mean. The individual value of 𝛼, that is the significance level for
each comparison, is .0108. This is less conservative than Bonferroni’s correction where it would be .0083 for six comparisons. The critical value of the
test statistic, q, is 3.86 and for each comparison calculated q must exceed this
for the difference to be significant at the .05 level. To interpret the table of
intervals at the bottom, we need to look at each pair of numbers for a given
comparison and if the pair includes zero there is no difference between those
treatments, whereas if zero is not included there is a difference. So for no
pills (none) versus blue pills the interval is −23.069 to −9.181. Both numbers
are negative so they don’t span zero. We can conclude that blue placebo pills
lower blood pressure in patients when compared with the blood pressure of
patients given no pills. For the comparison no pills versus red pills, however,
the values are −11.194 to 2.694. As one number is negative and the other is
positive, this interval spans zero so there is no difference in the systolic blood
pressure of patients given no pills compared with those given red placebo pills.
Now try to work out for yourself whether there are differences for the other
comparisons. You should find that the results are as follows:
Blue versus red – difference, p < .05
Blue versus white – difference, p < .05
None versus white – difference, p < .05
Red versus white – difference, p < .05
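SciPy also provides Tukey's test via scipy.stats.tukey_hsd (SciPy 1.8 or later – an assumption, since the text itself uses Minitab). A sketch:

```python
from scipy.stats import tukey_hsd

# Group observations as in Table 6.4
none  = [139, 139, 144, 149, 144, 143, 150, 140]
white = [142, 133, 143, 139, 123, 138, 138, 134]
red   = [151, 145, 144, 150, 145, 146, 156, 145]
blue  = [123, 136, 125, 123, 124, 130, 134, 124]

# Groups are indexed in the order given: 0=none, 1=white, 2=red, 3=blue
res = tukey_hsd(none, white, red, blue)

# none vs red: interval spans zero, no significant difference
print(res.pvalue[0, 2] > 0.05)   # True
# none vs blue: significant difference
print(res.pvalue[0, 3] < 0.05)   # True

# 95% CIs for differences in means, comparable to the Minitab intervals
ci = res.confidence_interval(confidence_level=0.95)
# approximately 9.18 and 23.07 for none - blue
# (Minitab reports blue - none as -23.069, -9.181)
print(round(ci.low[0, 3], 3), round(ci.high[0, 3], 3))
```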
Box 6.2 Minitab output for a one-way analysis of variance with Tukey’s pairwise comparisons

Analysis of variance for placebo

Source     DF    SS        MS      F       P
Placebo     3    1913.6    637.9   24.64   0.000
Error      28     724.9     25.9
Total      31    2638.5

Level    N    Mean      StDev
blue     8    127.38    5.24
none     8    143.50    4.24
red      8    147.75    4.20
white    8    136.25    6.36

Pooled StDev = 5.09

[Individual 95% CIs for the means, based on the pooled StDev, are plotted on a scale from 128.0 to 152.0; the plot is not reproduced here.]

Tukey’s pairwise comparisons

Family error rate = 0.0500
Individual error rate = 0.0108
Critical value = 3.86

Intervals for (column level mean) − (row level mean)

          blue                 none               red
none     −23.069, −9.181
red      −27.319, −13.431    −11.194, 2.694
white    −15.819, −1.931       0.306, 14.194    4.556, 18.444
6.11.2 Dunnett’s test
Suppose we aren’t interested in the pairwise comparisons produced by
Tukey’s test but only in whether placebo pills affect blood pressure compared with a control, in this case no pills. In this instance we could use a
Dunnett’s test, in which a treatment is nominated as the control and each of
the other treatments is compared with it. This reduces the number of comparisons when compared with Tukey’s test. For our placebo data, there are
four groups and so three comparisons. The Minitab output for a Dunnett’s
test of placebo pills data is reported in Box 6.3. The first part of the output
is identical to the ANOVA we performed earlier so we need to concentrate
on the second part of the output below it, Dunnett’s comparisons with a control. Again, with the significance level, 𝛼, at .0193 for individual comparisons,
Dunnett’s test is less conservative than multiple t-tests with Bonferroni’s correction where it would be .0167 when three tests are required. The critical
value of the test statistic (also q) is 2.48 and for each comparison calculated
q must exceed this for the difference to be significant at the .05 level. The
Box 6.3 Minitab output for a one-way analysis of variance with Dunnett’s comparisons with a control

Analysis of variance for placebo

Source     DF    SS        MS      F       P
Placebo     3    1913.6    637.9   24.64   0.000
Error      28     724.9     25.9
Total      31    2638.5

Level    N    Mean      StDev
Blue     8    127.38    5.24
None     8    143.50    4.24
Red      8    147.75    4.20
White    8    136.25    6.36

Pooled StDev = 5.09

[Individual 95% CIs for the means, based on the pooled StDev, are plotted on a scale from 128.0 to 152.0; the plot is not reproduced here.]

Dunnett’s comparisons with a control

Family error rate = 0.0500
Individual error rate = 0.0193
Critical value = 2.48
Control = level of treatment (none)

Intervals for treatment mean minus control mean

Level    Lower      Center     Upper
White    −13.567    −7.250     −0.933
Red       −2.067     4.250     10.567
Blue     −22.442   −16.125     −9.808

[The intervals are plotted on a scale from −20 to 10; the plot is not reproduced here.]
diagram of 95% confidence intervals for Dunnett’s test differs from that for
the ANOVA in that the comparison is now for a difference in means between
each of the placebo pill groups and no pill group, so the scale runs from −20 to
10. If we subtract the mean of the control group ‘none’ from the mean of the
‘red’, the difference is 147.75 − 143.50 = 4.25. The mean for this comparison is
represented by the asterisk at 4.25 on the x-axis. Although the systolic blood
pressure is a little higher in patients given red placebo pills, the difference is
not statistically significant because the confidence interval for the difference
in means spans zero. For both of the other comparisons, no pills versus white
pills and no pills versus blue pills, there is a statistically significant decrease
in the means – the asterisks are at negative values on the x-axis and the confidence intervals do not overlap zero. These results mirror those we reported
for Tukey’s test.
One-way ANOVA results summary: A difference was observed in the
mean systolic blood pressure of patients given different coloured placebo
pills or no pills (F3,28 = 24.64, p < .001).
For Tukey’s comparisons:
Tukey’s pairwise comparisons revealed that the mean systolic blood pressure of patients administered red pills (147.75 mmHg) did not differ from
that of untreated patients (143.50 mmHg). The administration of white
or blue placebo pills, however, resulted in mean systolic blood pressures
of 136.25 mmHg and 127.38 mmHg respectively which were significantly
lower than systolic pressures of patients given no pills or red pills (Tukey’s
family error rate = .0500, individual error rate = .0108).
For Dunnett’s comparisons:
Dunnett’s test to compare pills versus the no-pills control patients revealed that the mean systolic blood pressure of patients given red pills was 147.75 mmHg but did not differ from that of controls, which was 143.50 mmHg (95% CI for the difference −2.067, 10.567). The mean systolic blood pressure of patients given either white or blue pills, however, was lower than that of patients given no pills (mean for white pills 136.25 mmHg, 95% CI for the difference −13.567, −0.933; mean for blue pills 127.38 mmHg, 95% CI for the difference −22.442, −9.808). Dunnett’s family error rate = .0500; individual error rate = .0193.
6.11.3 Accounting for identifiable sources of error
in one-way ANOVA: nested design
Let’s suppose we are investigating oxidative stress and we measure superoxide
levels in mice of different ages with a luminescence assay. This simply involves