describing that locality give the sand mineralogy at all of the beaches on
the island as 75% coral fragments and 25% basalt grains, and so your null
hypothesis is that the sand from the suspect’s shoes will also have these
proportions. When you examine the sample of 100 grains from the suspect’s
shoes, you ﬁnd that it contains 86 coral fragments and 14 basalt grains.
Before you go and testify in court, you will need to know the probability
that this diﬀerence between the observed frequencies in the sample and
those expected from the composition of the beach is due to chance.
Second, you may want to know the probability that two or more samples
have come from the same population. As an example, consider the handedness of quartz crystals, which is important because of its eﬀect on optical
properties. The handedness arises because there are chains of SiO4 tetrahedra
that form a helical spiral around the vertical axis, but the spiral can turn in
either a clockwise or counter-clockwise direction. A manufacturer noticed
that quartz crystals grown using a new type of alloy in the autoclave tended
to be predominantly right-handed. Consequently, 100 quartz crystals grown
with the new alloy method and 100 samples grown using the original one
were compared. For the new method 67 crystals were right-handed and
33 left-handed, while the original method produced 53 right-handed and
47 left-handed. Here too, the difference between the two samples might be
due to chance, or it might be an effect of the new procedure.
For both of these examples a method is needed that gives the probability
of obtaining the observed outcome under the null hypothesis. This chapter
describes some tests for analyzing samples of categorical data.
18.2 Comparing observed and expected frequencies: the chi-square test for goodness of fit
The chi-square test for goodness of fit compares the observed frequencies in
a sample to those expected in a population. The chi-square statistic, which was
first discussed in Chapter 6, is the sum across all categories of each observed
frequency minus its expected frequency, squared and then divided by the
expected frequency:

\chi^2 = \sum_{i=1}^{n} \frac{(o_i - e_i)^2}{e_i}                    (18.1)
Non-parametric tests for nominal scale data
This is sometimes written as:

\chi^2 = \sum_{i=1}^{n} \frac{(f_i - \hat{f}_i)^2}{\hat{f}_i}                    (18.2)

where f_i is the observed frequency and \hat{f}_i is the expected frequency.
It does not matter whether the diﬀerence between the observed and
expected frequencies is positive or negative because the square of any
diﬀerence will be positive.
If there is perfect agreement between every observed and expected frequency, the value of chi-square will be zero. Nevertheless, even if the null
hypothesis applies, samples are unlikely to always contain the exact proportions present in the population. By chance, small departures are likely and
larger departures will also occur, all of which will generate positive values of
chi-square. The most extreme 5% of departures from the expected ratio are
considered statistically signiﬁcant and will exceed a critical value of chi-square.
For example, forams can be coiled either counter-clockwise (to the left) or
clockwise (to the right). The proportion of forams that coil to the left is close
to 0.1 (10%), which can be considered the proportion in the population
because it is from a sample of several thousand specimens. A paleontologist,
who knew that the proportion of left- and right-coiled forams shows some
variation among outcrops, chose 20 forams at random from the same
locality and found that four were left-coiled and 16 right-coiled. The question is whether the proportions in the sample were signiﬁcantly diﬀerent
from the expected proportions of 0.1 and 0.9 respectively. The diﬀerence
between the population and the sample might be only due to chance, but it
might also reﬂect something about the environment in which the forams
lived, such as the water temperature. Table 18.1 gives a worked example of a
chi-square test for this sample of left- and right-coiled forams.
The value of chi-square in Table 18.1 has one degree of freedom because
the sample size is ﬁxed, so as soon as the frequency of one of the two
categories is set the other is no longer free to vary. The 5% critical value of
chi-square with one degree of freedom is 3.84 (Appendix A), so the proportions of left- and right-coiled forams in the sample are not signiﬁcantly
different from the expected proportions of 0.1 and 0.9. The chi-square test for
goodness of ﬁt can be extended to any number of categories and the degrees
of freedom will be k − 1 (where k is the number of categories). Statistical
packages will calculate the value of chi-square and its probability.
Table 18.1 A worked example using chi-square to compare the
observed frequencies in a foram sample to those expected from the
known proportions in the population. The observed frequencies in a
sample of 20 are 4:16 and the expected frequencies are 2:18.

Coil direction           Left    Right
Observed                 4       16
Expected                 2       18
Obs − Exp                2       −2
(Obs − Exp)^2            4       4
(Obs − Exp)^2 / Exp      2       0.22

\chi^2 = \sum_{i=1}^{n} \frac{(o_i - e_i)^2}{e_i} = 2.22
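The calculation in Table 18.1 is easy to reproduce. The sketch below uses only
Python's standard library: it computes the chi-square statistic from
Equation (18.1) and converts it to a probability with the closed-form survival
function for one degree of freedom, erfc(sqrt(x/2)), a standard identity; a
statistical package would report the same values.

```python
# Chi-square goodness-of-fit test for the foram data in Table 18.1.
from math import erfc, sqrt

def chi_square_gof(observed, expected):
    """Chi-square statistic for paired observed/expected frequencies."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi2_sf_1df(x):
    """Survival function (p-value) of the chi-square distribution, 1 df."""
    return erfc(sqrt(x / 2))

observed = [4, 16]   # left- and right-coiled forams in the sample of 20
expected = [2, 18]   # 0.1 and 0.9 of a sample of 20

stat = chi_square_gof(observed, expected)
p = chi2_sf_1df(stat)
print(f"chi-square = {stat:.2f}, p = {p:.3f}")
```

The statistic of 2.22 falls below the 5% critical value of 3.84, and the
probability is well above 0.05, in agreement with the conclusion above.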
18.2.1 Small sample sizes
When expected frequencies are small, the calculated chi-square statistic is
inaccurate and tends to be too large, giving a probability that is smaller
than appropriate and thereby increasing the risk of Type 1 error. It used to
be recommended that no expected frequency in a chi-square goodness of
ﬁt test should be less than ﬁve, but this has been relaxed somewhat in the
light of more recent research, and it is now recommended that no more
than 20% of expected frequencies should be less than ﬁve.
An entirely diﬀerent method, which is not subject to bias when sample
size is small, can be used to analyze these data. It is an example of a group of
procedures called randomization tests that will be discussed further in
Chapter 19. Instead of calculating a statistic that is used to estimate the
probability of an outcome, a randomization test uses a computer program
to simulate the repeated random sampling of a hypothetical population
containing the expected proportions in each category. These samples will
often contain the same proportions as the population, but departures will
occur by chance. The simulated sampling is iterated, meaning it is repeated,
several thousand times and the resultant distribution of the statistic used
to identify the most extreme 5% of departures from the expected proportions. Finally, the actual proportions in the real sample are compared to this
distribution. If the sample statistic falls within the region where the most
[Figure 18.1: a histogram of the proportion of samples (vertical axis, 0.0 to
0.3) against the number of left-coiled forams in a sample of 20 (horizontal
axis, 0 to 20).]
Figure 18.1 An example of the distribution of outcomes from a Monte Carlo
simulation where 10 000 samples of size 20 are taken at random from a
population containing 0.1 left-coiled and 0.9 right-coiled forams. Note that the
probability of obtaining four or more left-coiled forams in a sample of 20 is
greater than 0.05.
extreme 5% of departures from the expected occur, the sample is considered
signiﬁcantly diﬀerent from the population.
Repeated random sampling of a hypothetical population is an example
of a more general procedure called the Monte Carlo method that uses
the properties of the sample, or the expected properties of a population, and
takes a large number of simulated random samples to create a distribution
that would apply under the null hypothesis.
For the data in Table 18.1, where the sample size is 20 and the expected
proportions are 0.1 left-coiled to 0.9 right-coiled, a randomization test
works by taking several thousand random samples, each of size 20, from a
hypothetical population containing these proportions. This will generate a
distribution of outcomes similar to the one shown in Figure 18.1, which is
for 10 000 samples. If the procedure is repeated another 10 000 times, then
the outcome is unlikely to be exactly the same, but nevertheless will be very
similar to Figure 18.1 because so many samples have been taken. It is clear
from Figure 18.1 that the likelihood of a sample containing four or more
left-coiled forams is greater than 0.05.
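The simulation behind Figure 18.1 can be sketched in a few lines. This is a
minimal version using only Python's standard library; the seed and variable
names are arbitrary choices, and the result is an estimate that will vary
slightly from run to run.

```python
# Monte Carlo estimate of the probability of drawing 4 or more left-coiled
# forams in a random sample of 20 from a population with 0.1 left-coiled.
import random

random.seed(1)        # fixed seed so the run is reproducible
ITERATIONS = 10_000
SAMPLE_SIZE = 20
P_LEFT = 0.1
OBSERVED_LEFT = 4

extreme = 0
for _ in range(ITERATIONS):
    # draw one sample of 20 and count the left-coiled forams in it
    lefts = sum(random.random() < P_LEFT for _ in range(SAMPLE_SIZE))
    if lefts >= OBSERVED_LEFT:
        extreme += 1

p_estimate = extreme / ITERATIONS
print(f"P(4 or more left-coiled) is approximately {p_estimate:.3f}")
```

The estimate comes out near 0.13, comfortably greater than 0.05, matching the
conclusion drawn from Figure 18.1.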
18.3 Comparing proportions among two or more independent samples
Earth scientists often need to compare the proportions in categories among
two or more samples to test the null hypothesis that these have come from
the same population. Unlike the previous example, there are no expected
Table 18.2 Data for 20 water samples taken at each of three
locations to characterize the presence or absence of nitrate
contamination.

                   Townsville   Bowen   Mackay
Contaminated       12           7       14
Uncontaminated     8            13      6
proportions – instead these tests examine whether the proportions in each
category are heterogeneous among samples.
18.3.1 The chi-square test for heterogeneity
Here is an example for three samples, each containing two mutually exclusive categories. Hydrologists managing water aquifers are often concerned
about contamination from agricultural fertilizers containing nitrate (NO3−),
which is a very soluble form of nitrogen that can be absorbed by plant roots.
Unfortunately nitrate can leach into groundwater and make it unsafe for
drinking. A hydrologist hired to evaluate aquifers in three adjacent rural
areas sampled 20 wells in each for the presence/absence of detectable levels
of nitrate. The researcher did not have a preconceived hypothesis about the
expected proportions of contaminated and uncontaminated aquifers – they
simply wanted to compare the three locations. The data are shown in
Table 18.2. This format is often called a contingency table.
These data are used to calculate an expected frequency for each of the
six cells. This is done by ﬁrst calculating the row and column totals
(Table 18.3(a)) which are often called the marginal totals. The proportions
of contaminated and uncontaminated aquifers in the marginal totals
shown in the right-hand column of Table 18.3 are the overall proportions
within the sample. Therefore, under the null hypothesis of no diﬀerence in
nitrate among locations, each will have the same proportion of contaminated wells. To obtain the expected frequency for any cell under the null
hypothesis, the column total and the row total corresponding to that cell
are multiplied together and divided by the grand total. For example, in
Table 18.3(b) the expected frequency of contaminated wells in a sample
of 20 from Townsville is (20 × 33) ÷ 60 = 11 and the expected frequency
of uncontaminated wells from Mackay is (20 × 27) ÷ 60 = 9.
Table 18.3 (a) The marginal totals for the data in Table 18.2. To obtain the expected
frequency for any cell, its row and column total are multiplied together and divided
by the grand total. (b) Note that the expected frequencies at each location (11:9)
are the same and also correspond to the proportions of the marginal totals (33:27).

(a) Observed frequencies and marginal totals.

                   Townsville   Bowen   Mackay   Row totals
Contaminated       12           7       14       33
Uncontaminated     8            13      6        27
Column totals      20           20      20       Grand total = 60

(b) Expected frequencies calculated from the marginal totals.

                   Townsville   Bowen   Mackay   Row totals
Contaminated       11           11      11       33
Uncontaminated     9            9       9        27
Column totals      20           20      20       Grand total = 60
After the expected frequencies have been calculated for all cells,
Equation (18.1) is used to calculate the chi-square statistic. The number
of degrees of freedom for this analysis is one less than the number of
columns, multiplied by one less than the number of rows, because all
but one of the values within each column and each row are free to vary,
but the ﬁnal one is not because of the ﬁxed marginal total. Here, therefore,
the number of degrees of freedom is 2 × 1 = 2. The smallest contingency
table possible has two rows and two columns (this is called a 2 × 2 table),
which will give a chi-square statistic with only one degree of freedom.
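The whole procedure for Tables 18.2 and 18.3 can be sketched as follows, using
only Python's standard library. The closed-form survival function exp(−x/2)
used for the p-value is exact for two degrees of freedom only, which is why it
is safe here.

```python
# Chi-square test for heterogeneity on the contingency table in Table 18.2,
# with expected frequencies computed from the marginal totals.
from math import exp

table = [[12, 7, 14],   # contaminated wells at Townsville, Bowen, Mackay
         [8, 13, 6]]    # uncontaminated wells

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(table):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand
        chi2 += (obs - expected) ** 2 / expected

df = (len(table) - 1) * (len(table[0]) - 1)   # (rows - 1) x (cols - 1) = 2
p = exp(-chi2 / 2)    # chi-square survival function, exact for df = 2 only
print(f"chi-square = {chi2:.2f}, df = {df}, p = {p:.3f}")
```

The statistic of about 5.25 is below the 5% critical value of 5.99 for two
degrees of freedom, so the three locations are not significantly heterogeneous.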
18.3.2 The G test or log-likelihood ratio
The G test or log-likelihood ratio is another way of estimating the chi-square statistic. The formula for the G statistic is:
G = 2 \sum_{i=1}^{n} f_i \ln\!\left(\frac{f_i}{\hat{f}_i}\right)                    (18.3)
This means, “The G statistic is twice the sum of the frequency of each cell
multiplied by the natural logarithm of each observed frequency divided by
the expected frequency.” The formula will give a statistic of zero when each
expected frequency is equal to its observed frequency, but any discrepancy
will give a positive value of G. Some statisticians recommend the G test and
others recommend the chi-square test. There is a summary of tests recommended for categorical data near the end of this chapter.
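Applied to the nitrate data of Table 18.2, Equation (18.3) gives a G statistic
close to the chi-square value computed earlier; a minimal sketch, standard
library only:

```python
# G statistic (Equation 18.3) for the nitrate contingency data in Table 18.2,
# with expected frequencies taken from the marginal totals.
from math import log

table = [[12, 7, 14],   # contaminated wells
         [8, 13, 6]]    # uncontaminated wells

row_totals = [sum(r) for r in table]
col_totals = [sum(c) for c in zip(*table)]
grand = sum(row_totals)

g = 0.0
for i, row in enumerate(table):
    for j, f in enumerate(row):
        f_hat = row_totals[i] * col_totals[j] / grand   # expected frequency
        g += f * log(f / f_hat)
g *= 2
print(f"G = {g:.2f}")
```

G comes out at about 5.32, very close to the chi-square statistic of 5.25 for
the same data, and like it falls below the 5% critical value of 5.99 for two
degrees of freedom.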
18.3.3 Randomization tests for contingency tables
A randomization test procedure similar to the one discussed in
Section 18.2.1 for goodness-of-ﬁt tests can be used for any contingency
table. First, the marginal totals of the table are calculated and give the
expected proportions when there is no diﬀerence among samples. Then,
the Monte Carlo method is used to repeatedly “sample” a hypothetical
population containing these proportions, with the constraint that both
the column and row totals are ﬁxed. Randomization tests are available in
some statistical packages.
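One simple way to implement such a randomization test for the Table 18.2 data,
sketched here under the assumption of a permutation scheme in which the
contamination statuses are shuffled among the 60 wells (this automatically
keeps both the row and column totals fixed), is:

```python
# Randomization test for the Table 18.2 contingency data: shuffle the 60
# contamination statuses among the wells and recompute chi-square each time.
import random

random.seed(1)
ITERATIONS = 10_000

def chi2_of(contaminated_per_site):
    # Expected 11 contaminated and 9 uncontaminated per site of 20 wells,
    # from the fixed marginal totals (33:27 over 60 wells).
    chi2 = 0.0
    for contaminated in contaminated_per_site:
        uncontaminated = 20 - contaminated
        chi2 += (contaminated - 11) ** 2 / 11 + (uncontaminated - 9) ** 2 / 9
    return chi2

observed_chi2 = chi2_of([12, 7, 14])   # Townsville, Bowen, Mackay

statuses = [1] * 33 + [0] * 27         # 33 contaminated wells in total
extreme = 0
for _ in range(ITERATIONS):
    random.shuffle(statuses)
    # reassign the shuffled statuses to three sites of 20 wells each
    per_site = [sum(statuses[k * 20:(k + 1) * 20]) for k in range(3)]
    if chi2_of(per_site) >= observed_chi2:
        extreme += 1

p_rand = extreme / ITERATIONS
print(f"observed chi-square = {observed_chi2:.2f}, "
      f"randomization p is approximately {p_rand:.3f}")
```

The randomization p-value comes out close to the value from the chi-square
distribution, as expected when expected frequencies are not small.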
18.4 Bias when there is one degree of freedom
When there is only one degree of freedom and the total sample size is less
than 200, the calculated value of chi-square has been shown to be inaccurate
because it is too large. Consequently it gives a probability that is smaller
than appropriate, thus increasing the risk of Type 1 error. This bias
increases as sample size decreases, so the following formula, called Yates’
correction or the continuity correction, was designed to improve the
accuracy of the chi-square statistic for small samples with one degree of
freedom.
Yates' correction removes 0.5 from the absolute difference between
each observed and expected frequency. (The absolute difference is used
because it converts all differences to positive numbers, which are then
reduced by subtracting 0.5; otherwise, any negative value of o_i − e_i would
have to be increased by 0.5 to make its absolute size, and therefore its
square, smaller.) The absolute value is the positive of any number and is
indicated by enclosing the number or its symbol in two vertical bars
(e.g. |−6| = 6). The subscript "adj" after the value of chi-square means it
has been adjusted by Yates' correction.
\chi^2_{adj} = \sum_{i=1}^{n} \frac{(|o_i - e_i| - 0.5)^2}{e_i}                    (18.4)
From Equation (18.4) it is clear that the compensatory eﬀect of Yates’
correction will become less and less as sample size increases. Some authors
(e.g. Zar, 1996) recommend that Yates' correction is applied to all chi-square tests having only one degree of freedom, but others suggest it is
unnecessary for large samples and recommend the use of the Fisher Exact
Test (see Section 18.4.1 below) for smaller ones.
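As an illustration (this particular table is not worked in the text), the
quartz-crystal data from the start of this chapter can be arranged as a 2 × 2
contingency table and tested with and without Yates' correction. Note how the
correction moves a marginally significant result back past the 5% critical
value of 3.84:

```python
# Chi-square with and without Yates' correction (Equation 18.4) for the
# quartz-crystal handedness data: 67:33 right:left with the new method,
# 53:47 with the original method.
from math import erfc, sqrt

observed = [67, 33, 53, 47]   # cells of the 2 x 2 table, row by row
# Expected frequencies from the marginal totals: 120 right- and 80
# left-handed crystals over 200, i.e. 60:40 within each sample of 100.
expected = [60, 40, 60, 40]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
chi2_adj = sum((abs(o - e) - 0.5) ** 2 / e for o, e in zip(observed, expected))

p = erfc(sqrt(chi2 / 2))          # 1-df chi-square survival function
p_adj = erfc(sqrt(chi2_adj / 2))
print(f"uncorrected: chi-square = {chi2:.2f}, p = {p:.3f}")
print(f"Yates-adjusted: chi-square = {chi2_adj:.2f}, p = {p_adj:.3f}")
```

Here the uncorrected statistic (about 4.08) exceeds the critical value of 3.84
while the corrected one (about 3.52) does not, which is exactly the kind of
borderline case where the choice of correction matters.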
18.4.1 The Fisher Exact Test for 2 × 2 tables
The Fisher Exact Test accurately calculates the probability that two samples,
each containing two categories, are from the same population. This test is
not subject to bias and is recommended when sample sizes are small or
more than 20% of expected frequencies are less than ﬁve, but it can be used
for any 2 × 2 contingency table.
The Fisher Exact Test is unusual in that it does not calculate a statistic
that is used to estimate the probability of a departure from the null hypothesis. Instead, the probability is calculated directly.
The easiest way to explain the Fisher Exact Test is with an example.
Table 18.4 gives data for the presence or absence of mollusc species with
anti-predator adaptations on either side of the Cretaceous/Tertiary (K/T)
extinction boundary. A typical adaptation might include development of a
thicker, stronger shell, or perhaps a decrease in the size of the aperture
(opening) of the shell to discourage shell-peeling by predatory crabs
(e.g. Vermeij, 1978). However, during an environmental event causing
mass extinction, such adaptations might require more food or reduce
mobility, either of which may diminish the species’ ability to survive. To
test this hypothesis, a paleontologist examined ten outcrops, ﬁve below the
K/T boundary and ﬁve above it. The results for the presence or lack of
detection of thick-shelled molluscs are in Table 18.4. These frequencies are
too small for accurate analysis using a chi-square test.
Table 18.4 Data for the presence/absence of mollusc species with thick shells in ten
samples above and below the mass extinction boundary between the Cretaceous
and Tertiary periods. The sample deliberately included five samples above the
boundary layer and five below it. The marginal totals show that four samples contain
species with thick shells and six do not.

                                    Above K/T boundary   Below K/T boundary   Totals
Thick-shelled molluscs present      0                    4                    4
Thick-shelled molluscs not found    5                    1                    6
Totals                              5                    5                    10
Table 18.5 Under the null hypothesis that there is no effect of mass extinction on
the presence of molluscs with thick shells, the expected proportions of rocks with
and without thick-shelled molluscs in each sample (2:3 and 2:3) will correspond to
the marginal totals for the two rows (4:6). The proportions of samples from above
and below the K/T boundary (2:2) and (3:3) will also correspond to the marginal totals
for the two columns (5:5).

                                    Above K/T boundary   Below K/T boundary   Totals
Thick-shelled molluscs present      2                    2                    4
Thick-shelled molluscs not found    3                    3                    6
Totals                              5                    5                    10
If there were no eﬀect of mass extinction, then you would expect, under
the null hypothesis, that the proportion of samples containing molluscs
with thicker shells (representing anti-predatory adaptations) in each locality
(above and below the K/T boundary) would be the same as the marginal
totals (Table 18.5) with any departures being due to chance. The Fisher
Exact Test uses the following procedure to calculate the probability of an
outcome equal to or more extreme than the one observed, which can be used
to decide whether it is statistically signiﬁcant.
First, the four marginal totals are calculated, as shown in Table 18.5.
Second, all of the possible ways in which the data can be arranged within
the four cells of the 2 × 2 table are listed, subject to the constraint that the
marginal totals must remain unchanged. This is the total set of possible
outcomes for the sample. For these marginal totals, the most likely outcome under the null hypothesis of no diﬀerence between the samples is
shown in Table 18.5 and identiﬁed as (c) in Table 18.6.
Table 18.6 The total set of possible outcomes for the number of outcrops with and without thick-shelled molluscs, subject to the constraint that there are
five outcrops on each side of the K/T mass extinction boundary and four have thick-shelled molluscs while six lack them. The most likely outcome, where
the proportions are the same both above and below the K/T boundary, is shown in the central box (c). The actual outcome is case (e).

(a)
                                    Above K/T boundary   Below K/T boundary
Thick-shelled molluscs present      4                    0
Thick-shelled molluscs not found    1                    5

(b)
                                    Above K/T boundary   Below K/T boundary
Thick-shelled molluscs present      3                    1
Thick-shelled molluscs not found    2                    4

(c) Expected under the null hypothesis
                                    Above K/T boundary   Below K/T boundary
Thick-shelled molluscs present      2                    2
Thick-shelled molluscs not found    3                    3

(d)
                                    Above K/T boundary   Below K/T boundary
Thick-shelled molluscs present      1                    3
Thick-shelled molluscs not found    4                    2

(e) Observed outcome
                                    Above K/T boundary   Below K/T boundary
Thick-shelled molluscs present      0                    4
Thick-shelled molluscs not found    5                    1
For a sample of ten outcrops, ﬁve of which are above the K/T boundary
and ﬁve below, together with the constraint that four outcrops must have
thick-shelled molluscs and six must lack them, there are ﬁve possible outcomes (Table 18.6). To obtain these, you start with the outcome expected
under the null hypothesis (c), choose one of the four cells (it does not matter
which) and add one to that cell. Next, adjust the values in the other three
cells so the marginal totals do not change. Continue with this procedure
until the number within the cell you have chosen cannot be increased any
further without aﬀecting the marginal totals. Then go back to the expected
outcome and repeat the procedure by subtracting one from the same cell
until the number in it cannot decrease any further without aﬀecting the
marginal totals (Table 18.6).
Third, the actual outcome is identiﬁed within the total set of possible
outcomes. For this example, it is case (e) in Table 18.6. The probability of
this outcome, together with any more extreme departures in the same
direction from the one expected under the null hypothesis (here there are
none more extreme than (e)) can be calculated from the probability of
getting this particular arrangement within the four cells by sampling a set
of ten outcrops, four of which contain thick-shelled molluscs and six of
which do not, with the outcrops sampled from above and below the K/T
boundary. This is similar to the example used to introduce hypothesis
testing in Chapter 6, where you had to imagine a sample of hornblende
vs. quartz grains in a beach sand. Here, however, a very small group is
sampled without replacement, so the initial probability of selecting an outcrop with thick-shelled molluscs present is 4/10, but if one is drawn, the
probability of next drawing an outcrop with thick-shelled molluscs is now
3/9 (and 6/9 without). We deliberately have not given this calculation
because it is long and tedious, and most statistical packages do it as part
of the Fisher Exact Test.
The calculation gives the exact probability of getting the observed outcome or a more extreme departure in the same direction from that expected
under the null hypothesis. This is a one-tailed probability, because the
outcomes in the opposite direction (e.g. on the left of (c) in Table 18.6)
have been ignored. For a two-tailed hypothesis you need to double the
probability. When the probability is less than 0.05, the outcome is considered statistically signiﬁcant.
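Although the text deliberately omits the arithmetic, the probability can be
sketched directly from the hypergeometric distribution using only Python's
standard library (math.comb); a statistical package's Fisher Exact Test
reports the same values.

```python
# Fisher Exact Test probability for Table 18.4, computed directly from the
# hypergeometric distribution rather than via a test statistic.
from math import comb

# Marginal totals: 5 outcrops above and 5 below the K/T boundary;
# 4 outcrops with thick-shelled molluscs and 6 without.
above = 5
with_thick = 4
n = 10

def prob(x):
    """P(exactly x of the 4 thick-shelled outcrops lie above the boundary)."""
    return comb(with_thick, x) * comb(n - with_thick, above - x) / comb(n, above)

observed_x = 0   # no thick-shelled molluscs found above the boundary
# One-tailed: the observed outcome plus any more extreme departures in the
# same direction (here there are none more extreme, so just P(x = 0)).
p_one_tailed = sum(prob(x) for x in range(observed_x + 1))
p_two_tailed = 2 * p_one_tailed   # doubled for a two-tailed hypothesis
print(f"one-tailed p = {p_one_tailed:.4f}, two-tailed p = {p_two_tailed:.4f}")
```

The one-tailed probability is about 0.024 and the doubled, two-tailed
probability about 0.048, so the outcome in Table 18.4 is statistically
significant at the 5% level either way.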