
describing that locality give the sand mineralogy at all of the beaches on the island to be 75% coral fragments and 25% basalt grains, and so your null hypothesis is that the sand from the suspect's shoes will also have these proportions. When you examine the sample of 100 grains from the suspect's shoes, you find that it contains 86 coral fragments and 14 basalt grains. Before you go and testify in court, you will need to know the probability that this difference between the observed frequencies in the sample and those expected from the composition of the beach is due to chance.

Second, you may want to know the probability that two or more samples have come from the same population. As an example, consider the handedness of quartz crystals, which is important because of its effect on optical properties. The handedness arises because there are chains of SiO4 tetrahedra that form a helical spiral around the vertical axis, but the spiral can turn in either a clockwise or counter-clockwise direction. A manufacturer noticed that quartz crystals grown using a new type of alloy in the autoclave tended to be predominantly right-handed. Consequently, 100 quartz crystals grown with the new alloy method and 100 crystals grown using the original one were compared. For the new method 67 crystals were right-handed and 33 left-handed, while the original method produced 53 right-handed and 47 left-handed. Here too, the difference between the two samples might be due to chance, or it might be an effect of the new procedure.

For both of these examples a method is needed that gives the probability of obtaining the observed outcome under the null hypothesis. This chapter describes some tests for analyzing samples of categorical data.



18.2 Comparing observed and expected frequencies: the chi-square test for goodness of fit



The chi-square test for goodness of fit compares the observed frequencies in a sample to those expected in a population. The chi-square statistic is the sum, over all categories, of each observed frequency minus its expected frequency, squared and then divided by the expected frequency (and was first discussed in Chapter 6):

\[ \chi^2 = \sum_{i=1}^{n} \frac{(o_i - e_i)^2}{e_i} \tag{18.1} \]






Non-parametric tests for nominal scale data



This is sometimes written as:

\[ \chi^2 = \sum_{i=1}^{n} \frac{(f_i - \hat{f}_i)^2}{\hat{f}_i} \tag{18.2} \]

where $f_i$ is the observed frequency and $\hat{f}_i$ is the expected frequency.

It does not matter whether the difference between the observed and expected frequencies is positive or negative, because the square of any difference will be positive.

If there is perfect agreement between every observed and expected frequency, the value of chi-square will be zero. Nevertheless, even if the null hypothesis applies, samples are unlikely to always contain the exact proportions present in the population. By chance, small departures are likely and larger departures will also occur, all of which will generate positive values of chi-square. The most extreme 5% of departures from the expected ratio are considered statistically significant and will exceed a critical value of chi-square.

For example, forams can be coiled either counter-clockwise (to the left) or clockwise (to the right). The proportion of forams that coil to the left is close to 0.1 (10%), which can be considered the proportion in the population because it is from a sample of several thousand specimens. A paleontologist, who knew that the proportion of left- and right-coiled forams shows some variation among outcrops, chose 20 forams at random from the same locality and found that four were left-coiled and 16 right-coiled. The question is whether the proportions in the sample were significantly different from the expected proportions of 0.1 and 0.9 respectively. The difference between the population and the sample might be only due to chance, but it might also reflect something about the environment in which the forams lived, such as the water temperature. Table 18.1 gives a worked example of a chi-square test for this sample of left- and right-coiled forams.

The value of chi-square in Table 18.1 has one degree of freedom because the sample size is fixed, so as soon as the frequency of one of the two categories is set the other is no longer free to vary. The 5% critical value of chi-square with one degree of freedom is 3.84 (Appendix A), so the proportions of left- and right-coiled forams in the sample are not significantly different from the expected proportions of 0.1 to 0.9. The chi-square test for goodness of fit can be extended to any number of categories, and the degrees of freedom will be k − 1 (where k is the number of categories). Statistical packages will calculate the value of chi-square and its probability.






Table 18.1 A worked example using chi-square to compare the observed frequencies in a foram sample to those expected from the known proportions in the population. The observed frequencies in a sample of 20 are 4:16 and the expected frequencies are 2:18.

Coil direction      Left    Right
Observed            4       16
Expected            2       18
Obs − Exp           2       −2
(Obs − Exp)²        4       4
(Obs − Exp)²/Exp    2       0.22

\[ \chi^2 = \sum_{i=1}^{n} \frac{(o_i - e_i)^2}{e_i} = 2.22 \]
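The worked example in Table 18.1 can be reproduced in a few lines of plain Python. This is a hedged sketch of Equation (18.1) rather than a full statistical routine; packages such as scipy will also report the probability directly.

```python
# Chi-square goodness-of-fit test for the foram data in Table 18.1.

def chi_square_gof(observed, expected):
    """Return the chi-square statistic from Equation (18.1)."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [4, 16]                # left-coiled, right-coiled
expected = [0.1 * 20, 0.9 * 20]   # 2 and 18, from the population proportions

chi2 = chi_square_gof(observed, expected)
critical_5pct_1df = 3.84          # 5% critical value, one degree of freedom (Appendix A)

print(round(chi2, 2))             # 2.22
print(chi2 > critical_5pct_1df)   # False: not significant
```

Because 2.22 is less than the critical value of 3.84, the sample is not significantly different from the expected 0.1:0.9 proportions, as the text concludes.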



18.2.1 Small sample sizes

When expected frequencies are small, the calculated chi-square statistic is inaccurate and tends to be too large, therefore indicating a lower than appropriate probability, which increases the risk of Type 1 error. It used to be recommended that no expected frequency in a chi-square goodness of fit test should be less than five, but this has been relaxed somewhat in the light of more recent research, and it is now recommended that no more than 20% of expected frequencies should be less than five.
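The rule of thumb above is easy to check before running a test. The helper below is a hypothetical illustration (the function name is ours); the cutoff of five and the 20% threshold follow the recommendation in the text.

```python
# Check whether expected frequencies satisfy the "no more than 20% below five" rule.

def chi_square_is_reliable(expected):
    """True if no more than 20% of expected frequencies are below five."""
    n_small = sum(1 for e in expected if e < 5)
    return n_small / len(expected) <= 0.20

print(chi_square_is_reliable([11, 9, 11, 9, 11, 9]))  # True
print(chi_square_is_reliable([2, 18]))  # False: half of the expected frequencies are < 5
```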

An entirely different method, which is not subject to bias when sample size is small, can be used to analyze these data. It is an example of a group of procedures called randomization tests that will be discussed further in Chapter 19. Instead of calculating a statistic that is used to estimate the probability of an outcome, a randomization test uses a computer program to simulate the repeated random sampling of a hypothetical population containing the expected proportions in each category. These samples will often contain the same proportions as the population, but departures will occur by chance. The simulated sampling is iterated (repeated) several thousand times, and the resulting distribution of the statistic is used to identify the most extreme 5% of departures from the expected proportions. Finally, the actual proportions in the real sample are compared to this distribution. If the sample statistic falls within the region where the most extreme 5% of departures from the expected occur, the sample is considered significantly different from the population.

Figure 18.1 An example of the distribution of outcomes from a Monte Carlo simulation where 10 000 samples of size 20 are taken at random from a population containing 0.1 left-coiled and 0.9 right-coiled forams. The horizontal axis gives the number of left-coiled forams in a sample of 20, and the vertical axis the proportion of samples. Note that the probability of obtaining four or more left-coiled forams in a sample of 20 is greater than 0.05.

Repeated random sampling of a hypothetical population is an example of a more general procedure called the Monte Carlo method, which uses the properties of the sample, or the expected properties of a population, and takes a large number of simulated random samples to create a distribution that would apply under the null hypothesis.

For the data in Table 18.1, where the sample size is 20 and the expected proportions are 0.1 left-coiled to 0.9 right-coiled, a randomization test works by taking several thousand random samples, each of size 20, from a hypothetical population containing these proportions. This will generate a distribution of outcomes similar to the one shown in Figure 18.1, which is for 10 000 samples. If the procedure is repeated another 10 000 times, the outcome is unlikely to be exactly the same, but it will nevertheless be very similar to Figure 18.1 because so many samples have been taken. It is clear from Figure 18.1 that the likelihood of a sample containing four or more left-coiled forams is greater than 0.05.
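The Monte Carlo procedure behind Figure 18.1 can be sketched directly. This is a hedged illustration, not the simulation used for the figure itself: 10 000 samples of 20 forams are drawn from a hypothetical population with a 0.1 probability of left-coiling, and the proportion of samples with four or more left-coiled forams is counted.

```python
import random

random.seed(1)          # arbitrary seed so the run is repeatable
ITERATIONS = 10_000
SAMPLE_SIZE = 20
P_LEFT = 0.1

counts = []
for _ in range(ITERATIONS):
    # one simulated sample of 20 forams from the hypothetical population
    n_left = sum(1 for _ in range(SAMPLE_SIZE) if random.random() < P_LEFT)
    counts.append(n_left)

p_four_or_more = sum(1 for c in counts if c >= 4) / ITERATIONS
print(p_four_or_more > 0.05)   # True: the observed outcome is not significant
```

The estimated probability comes out at roughly 0.13, which agrees with the conclusion drawn from Figure 18.1 that the outcome of four left-coiled forams in 20 is not significant.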



18.3 Comparing proportions among two or more independent samples



Earth scientists often need to compare the proportions in categories among two or more samples to test the null hypothesis that these have come from the same population. Unlike the previous example, there are no expected proportions – instead these tests examine whether the proportions in each category are heterogeneous among samples.

Table 18.2 Data for 20 water samples taken at each of three locations to characterize the presence or absence of nitrate contamination.

                  Townsville   Bowen   Mackay
Contaminated      12           7       14
Uncontaminated    8            13      6



18.3.1 The chi-square test for heterogeneity

Here is an example for three samples, each containing two mutually exclusive categories. Hydrologists managing water aquifers are often concerned about contamination from agricultural fertilizers containing nitrate (NO3−), which is a very soluble form of nitrogen that can be absorbed by plant roots. Unfortunately nitrate can leach into groundwater and make it unsafe for drinking. A hydrologist hired to evaluate aquifers in three adjacent rural areas sampled 20 wells in each for the presence/absence of detectable levels of nitrate. The researcher did not have a preconceived hypothesis about the expected proportions of contaminated and uncontaminated aquifers – they simply wanted to compare the three locations. The data are shown in Table 18.2. This format is often called a contingency table.

These data are used to calculate an expected frequency for each of the six cells. This is done by first calculating the row and column totals (Table 18.3(a)), which are often called the marginal totals. The proportions of contaminated and uncontaminated aquifers in the marginal totals shown in the right-hand column of Table 18.3 are the overall proportions within the sample. Therefore, under the null hypothesis of no difference in nitrate among locations, each will have the same proportion of contaminated wells. To obtain the expected frequency for any cell under the null hypothesis, the column total and the row total corresponding to that cell are multiplied together and divided by the grand total. For example, in Table 18.3(b) the expected frequency of contaminated wells in a sample of 20 from Townsville is (20 × 33) ÷ 60 = 11 and the expected frequency of uncontaminated wells from Mackay is (20 × 27) ÷ 60 = 9.






Table 18.3 (a) The marginal totals for the data in Table 18.2. To obtain the expected frequency for any cell, its row and column total are multiplied together and divided by the grand total. (b) Note that the expected frequencies at each location (11:9) are the same and also correspond to the proportions of the marginal totals (33:27).

(a) Observed frequencies and marginal totals.

                  Townsville   Bowen   Mackay   Row totals
Contaminated      12           7       14       33
Uncontaminated    8            13      6        27
Column totals     20           20      20       Grand total = 60

(b) Expected frequencies calculated from the marginal totals.

                  Townsville   Bowen   Mackay   Row totals
Contaminated      11           11      11       33
Uncontaminated    9            9       9        27
Column totals     20           20      20       Grand total = 60



After the expected frequencies have been calculated for all cells, Equation (18.1) is used to calculate the chi-square statistic. The number of degrees of freedom for this analysis is one less than the number of columns, multiplied by one less than the number of rows, because all but one of the values within each column and each row are free to vary, but the final one is not because of the fixed marginal total. Here, therefore, the number of degrees of freedom is 2 × 1 = 2. The smallest contingency table possible has two rows and two columns (this is called a 2 × 2 table), which will give a chi-square statistic with only one degree of freedom.
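The expected-frequency and degrees-of-freedom calculations above can be sketched in a few lines of Python. This is a hedged illustration using the nitrate data of Table 18.2, not the book's own code.

```python
# Chi-square test for heterogeneity applied to Table 18.2.

table = [[12, 7, 14],   # contaminated wells per location
         [8, 13, 6]]    # uncontaminated wells per location

row_totals = [sum(row) for row in table]         # [33, 27]
col_totals = [sum(col) for col in zip(*table)]   # [20, 20, 20]
grand = sum(row_totals)                          # 60

chi2 = 0.0
for i, row in enumerate(table):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / grand   # e.g. (20 x 33) / 60 = 11
        chi2 += (obs - exp) ** 2 / exp

df = (len(table) - 1) * (len(table[0]) - 1)      # 2 x 1 = 2
print(df)               # 2
print(round(chi2, 2))   # 5.25
```

The 5% critical value of chi-square with two degrees of freedom is 5.99 (from standard tables), so for these data the three locations are not significantly heterogeneous.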



18.3.2 The G test or log-likelihood ratio

The G test or log-likelihood ratio is an alternative to the chi-square statistic; the value of G is compared to the same chi-square distribution. The formula for the G statistic is:

\[ G = 2 \sum_{i=1}^{n} f_i \ln\!\left(\frac{f_i}{\hat{f}_i}\right) \tag{18.3} \]






This means, "The G statistic is twice the sum, over all cells, of each observed frequency multiplied by the natural logarithm of that observed frequency divided by its expected frequency." The formula will give a statistic of zero when each expected frequency is equal to its observed frequency, but any discrepancy will give a positive value of G. Some statisticians recommend the G test and others recommend the chi-square test. There is a summary of tests recommended for categorical data near the end of this chapter.
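Equation (18.3) can be illustrated with the nitrate data of Table 18.2 and the expected frequencies from Table 18.3(b). A hedged sketch; G is compared to the chi-square distribution with the same degrees of freedom as before.

```python
import math

# G statistic (log-likelihood ratio) for the nitrate data.
observed = [12, 7, 14, 8, 13, 6]   # cells of Table 18.2, row by row
expected = [11, 11, 11, 9, 9, 9]   # from Table 18.3(b)

G = 2 * sum(f * math.log(f / f_hat) for f, f_hat in zip(observed, expected))
print(round(G, 2))   # 5.32 -- close to the chi-square value for the same data
```

As expected, G (about 5.32) is close to the chi-square statistic for the same table (about 5.25); the two tests usually lead to the same conclusion.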



18.3.3 Randomization tests for contingency tables

A randomization test procedure similar to the one discussed in Section 18.2.1 for goodness-of-fit tests can be used for any contingency table. First, the marginal totals of the table are calculated and give the expected proportions when there is no difference among samples. Then, the Monte Carlo method is used to repeatedly "sample" a hypothetical population containing these proportions, with the constraint that both the column and row totals are fixed. Randomization tests are available in some statistical packages.
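One way to sketch this procedure for Table 18.2 is a permutation approach, which is our assumption about a reasonable implementation rather than the book's own: the 60 wells (33 contaminated, 27 uncontaminated) are repeatedly shuffled among three locations of 20 wells each, which keeps both row and column totals fixed, and the simulated chi-square values are compared to the observed one.

```python
import random

def chi2_for_counts(counts):
    """Chi-square for contaminated counts per location (expected 11:9 per 20 wells)."""
    return sum((c - 11) ** 2 / 11 + (20 - c - 9) ** 2 / 9 for c in counts)

random.seed(1)
observed_chi2 = chi2_for_counts([12, 7, 14])   # about 5.25

wells = [1] * 33 + [0] * 27    # 1 = contaminated, 0 = uncontaminated
ITERATIONS = 10_000
n_extreme = 0
for _ in range(ITERATIONS):
    random.shuffle(wells)
    counts = [sum(wells[0:20]), sum(wells[20:40]), sum(wells[40:60])]
    if chi2_for_counts(counts) >= observed_chi2:
        n_extreme += 1

p = n_extreme / ITERATIONS
print(round(p, 3))
```

For these data the simulated probability comes out somewhat above 0.05, consistent with the chi-square result that the three locations are not significantly different.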



18.4 Bias when there is one degree of freedom



When there is only one degree of freedom and the total sample size is less than 200, the calculated value of chi-square has been shown to be inaccurate because it is too large. Consequently it gives a probability that is smaller than appropriate, thus increasing the risk of Type 1 error. This bias increases as sample size decreases, so the following formula, called Yates' correction or the continuity correction, was designed to improve the accuracy of the chi-square statistic for small samples with one degree of freedom.

Yates' correction removes 0.5 from the absolute difference between each observed and expected frequency. (The absolute difference is used because it converts all differences to positive numbers, which are then reduced by subtracting 0.5. Otherwise, any negative values of $o_i - e_i$ would have to be increased by 0.5 to make their absolute size, and thus their square, smaller.) The absolute value is the positive of any number and is indicated by enclosing the number or its symbol within two vertical bars (e.g. $|-6| = 6$). The subscript "adj" after the value of chi-square means it has been adjusted by Yates' correction:

\[ \chi^2_{\mathrm{adj}} = \sum_{i=1}^{n} \frac{(|o_i - e_i| - 0.5)^2}{e_i} \tag{18.4} \]



From Equation (18.4) it is clear that the compensatory effect of Yates' correction will become less and less as sample size increases. Some authors (e.g. Zar, 1996) recommend that Yates' correction is applied to all chi-square tests having only one degree of freedom, but others suggest it is unnecessary for large samples and recommend the use of the Fisher Exact Test (see Section 18.4.1 below) for smaller ones.
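Equation (18.4) can be illustrated with the foram data of Table 18.1 (observed 4:16, expected 2:18), which has the single degree of freedom the correction is designed for. A hedged sketch:

```python
# Yates-corrected chi-square, Equation (18.4).

def chi_square_yates(observed, expected):
    """Chi-square with the continuity correction for one degree of freedom."""
    return sum((abs(o - e) - 0.5) ** 2 / e for o, e in zip(observed, expected))

chi2_adj = chi_square_yates([4, 16], [2, 18])
print(chi2_adj)   # 1.25, compared with 2.22 without the correction
```

The correction shrinks the statistic from 2.22 to 1.25, illustrating how it compensates for the upward bias in small samples.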



18.4.1 The Fisher Exact Test for 2 × 2 tables

The Fisher Exact Test accurately calculates the probability that two samples, each containing two categories, are from the same population. This test is not subject to bias and is recommended when sample sizes are small or more than 20% of expected frequencies are less than five, but it can be used for any 2 × 2 contingency table.

The Fisher Exact Test is unusual in that it does not calculate a statistic that is used to estimate the probability of a departure from the null hypothesis. Instead, the probability is calculated directly.

The easiest way to explain the Fisher Exact Test is with an example. Table 18.4 gives data for the presence or absence of mollusc species with anti-predator adaptations on either side of the Cretaceous/Tertiary (K/T) extinction boundary. A typical adaptation might include development of a thicker, stronger shell, or perhaps a decrease in the size of the aperture (opening) of the shell to discourage shell-peeling by predatory crabs (e.g. Vermeij, 1978). However, during an environmental event causing mass extinction, such adaptations might require more food or reduce mobility, either of which may diminish the species' ability to survive. To test this hypothesis, a paleontologist examined ten outcrops, five below the K/T boundary and five above it. The results for the presence or absence of thick-shelled molluscs are in Table 18.4. These frequencies are too small for accurate analysis using a chi-square test.






Table 18.4 Data for the presence/absence of mollusc species with thick shells in ten samples above and below the mass extinction boundary between the Cretaceous and Tertiary periods. The sample deliberately included five samples above the boundary layer and five below it. The marginal totals show that four samples contain species with thick shells and six do not.

                                   Above K/T boundary   Below K/T boundary   Totals
Thick-shelled molluscs present     0                    4                    4
Thick-shelled molluscs not found   5                    1                    6
Totals                             5                    5                    10



Table 18.5 Under the null hypothesis that there is no effect of mass extinction on the presence of molluscs with thick shells, the expected proportions of rocks with and without thick-shelled molluscs in each sample (2:3 and 2:3) will correspond to the marginal totals for the two rows (4:6). The proportions of samples from above and below the K/T boundary (2:2) and (3:3) will also correspond to the marginal totals for the two columns (5:5).

                                   Above K/T boundary   Below K/T boundary   Totals
Thick-shelled molluscs present     2                    2                    4
Thick-shelled molluscs not found   3                    3                    6
Totals                             5                    5                    10



If there were no effect of mass extinction, then you would expect, under the null hypothesis, that the proportion of samples containing molluscs with thicker shells (representing anti-predatory adaptations) in each locality (above and below the K/T boundary) would be the same as the marginal totals (Table 18.5), with any departures being due to chance. The Fisher Exact Test uses the following procedure to calculate the probability of an outcome equal to or more extreme than the one observed, which can be used to decide whether it is statistically significant.

First, the four marginal totals are calculated, as shown in Table 18.5. Second, all of the possible ways in which the data can be arranged within the four cells of the 2 × 2 table are listed, subject to the constraint that the marginal totals must remain unchanged. This is the total set of possible outcomes for the sample. For these marginal totals, the most likely outcome under the null hypothesis of no difference between the samples is shown in Table 18.5 and identified as (c) in Table 18.6.



Table 18.6 The total set of possible outcomes for the number of outcrops with and without thick-shelled molluscs, subject to the constraint that there are five outcrops on each side of the K/T mass extinction and four have thick-shelled molluscs while six lack them. The most likely outcome, where the proportions are the same both above and below the K/T boundary, is case (c). The actual outcome is case (e).

(a)
                                   Above K/T boundary   Below K/T boundary
Thick-shelled molluscs present     4                    0
Thick-shelled molluscs not found   1                    5

(b)
                                   Above K/T boundary   Below K/T boundary
Thick-shelled molluscs present     3                    1
Thick-shelled molluscs not found   2                    4

(c) Expected under the null hypothesis
                                   Above K/T boundary   Below K/T boundary
Thick-shelled molluscs present     2                    2
Thick-shelled molluscs not found   3                    3

(d)
                                   Above K/T boundary   Below K/T boundary
Thick-shelled molluscs present     1                    3
Thick-shelled molluscs not found   4                    2

(e) Observed outcome
                                   Above K/T boundary   Below K/T boundary
Thick-shelled molluscs present     0                    4
Thick-shelled molluscs not found   5                    1



For a sample of ten outcrops, five of which are above the K/T boundary and five below, together with the constraint that four outcrops must have thick-shelled molluscs and six must lack them, there are five possible outcomes (Table 18.6). To obtain these, you start with the outcome expected under the null hypothesis (c), choose one of the four cells (it does not matter which) and add one to that cell. Next, adjust the values in the other three cells so the marginal totals do not change. Continue with this procedure until the number within the cell you have chosen cannot be increased any further without affecting the marginal totals. Then go back to the expected outcome and repeat the procedure by subtracting one from the same cell until the number in it cannot decrease any further without affecting the marginal totals (Table 18.6).
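The enumeration described above can be automated. The sketch below (the function name and cell labels are our assumptions) lists every 2 × 2 table consistent with the marginal totals of Table 18.5, where cell "a" is the number of outcrops above the K/T boundary with thick-shelled molluscs present.

```python
# Enumerate all 2 x 2 tables with fixed marginal totals.

def tables_with_margins(row1, row2, col1, col2):
    """All 2x2 tables [[a, b], [c, d]] with the given marginal totals."""
    tables = []
    for a in range(max(0, row1 - col2), min(row1, col1) + 1):
        b = row1 - a          # remainder of the first row
        c = col1 - a          # remainder of the first column
        d = row2 - c          # fixed by the other three cells
        tables.append([[a, b], [c, d]])
    return tables

outcomes = tables_with_margins(row1=4, row2=6, col1=5, col2=5)
print(len(outcomes))   # 5, matching panels (a)-(e) of Table 18.6
print(outcomes[0])     # [[0, 4], [5, 1]] -- the observed outcome, case (e)
```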

Third, the actual outcome is identified within the total set of possible outcomes. For this example, it is case (e) in Table 18.6. The probability of this outcome, together with any more extreme departures in the same direction from the one expected under the null hypothesis (here there are none more extreme than (e)), can be calculated from the probability of getting this particular arrangement within the four cells by sampling a set of ten outcrops, four of which contain thick-shelled molluscs and six of which do not, with the outcrops sampled from above and below the K/T boundary. This is similar to the example used to introduce hypothesis testing in Chapter 6, where you had to imagine a sample of hornblende vs. quartz grains in a beach sand. Here, however, a very small group is sampled without replacement, so the initial probability of selecting an outcrop with thick-shelled molluscs present is 4/10, but if one is drawn, the probability of next drawing an outcrop with thick-shelled molluscs is 3/9 (and 6/9 without). We deliberately have not given this calculation because it is long and tedious, and most statistical packages do it as part of the Fisher Exact Test.

The calculation gives the exact probability of getting the observed outcome or a more extreme departure in the same direction from that expected under the null hypothesis. This is a one-tailed probability, because the outcomes in the opposite direction (e.g. outcomes (a) and (b) in Table 18.6) have been ignored. For a two-tailed hypothesis you need to double the probability. When the probability is less than 0.05, the outcome is considered statistically significant.
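Although the text leaves out the tedious hand calculation, the direct probability can be sketched with the hypergeometric distribution, which describes exactly this sampling without replacement: choosing which five of the ten outcrops lie above the K/T boundary, what is the chance that none of the four thick-shelled outcrops is among them? This is our illustration of the calculation, not the book's own code.

```python
from math import comb

def hypergeom_prob(a, row1, row2, col1):
    """P(cell a | fixed margins) for a 2x2 table, from the hypergeometric distribution."""
    return comb(row1, a) * comb(row2, col1 - a) / comb(row1 + row2, col1)

# Observed outcome: a = 0 thick-shelled outcrops above the K/T boundary.
p_one_tailed = hypergeom_prob(0, row1=4, row2=6, col1=5)
p_two_tailed = 2 * p_one_tailed   # no more extreme outcomes exist in this direction

print(round(p_one_tailed, 4))   # 0.0238
print(p_two_tailed < 0.05)      # True: significant at the 5% level
```

The one-tailed probability is 1/42 ≈ 0.024, and doubling it for a two-tailed hypothesis gives about 0.048, which is less than 0.05.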


