2: Tests for Homogeneity and Independence in a Two-way Table




Chapter 12 The Analysis of Categorical Data and Goodness-of-Fit Tests



Bivariate categorical data of this sort can most easily be summarized by constructing a two-way frequency table, or contingency table. This is a rectangular table that

consists of a row for each possible value of x (each category specified by this variable)

and a column for each possible value of y. There is then a cell in the table for each

possible (x, y) combination. Once such a table has been constructed, the number of

times each particular (x, y) combination occurs in the data set is determined, and

these numbers (frequencies) are entered in the corresponding cells of the table. The

resulting numbers are called observed cell counts. The table for the example relating

political philosophy to preferred network contains 3 rows and 4 columns (because x

and y have 3 and 4 possible values, respectively). Table 12.2 is one possible table.



TABLE 12.2 An Example of a 3 × 4 Frequency Table

                        ABC   CBS   NBC   PBS   Row Marginal Total
Liberal                  20    20    25    15                   80
Moderate                 45    35    50    20                  150
Conservative             15    40    10     5                   70
Column Marginal Total    80    95    85    40                  300



Marginal totals are obtained by adding the observed cell counts in each row and

also in each column of the table. The row and column marginal totals, along with the

total of all observed cell counts in the table—the grand total—have been included in

Table 12.2. The marginal totals provide information on the distribution of observed

values for each variable separately. In this example, the row marginal totals reveal that

the sample consisted of 80 liberals, 150 moderates, and 70 conservatives. Similarly,

column marginal totals indicate how often each of the preferred program categories

occurred: 80 preferred ABC news, 95 preferred CBS, and so on. The grand total,

300, is the number of observations in the bivariate data set.
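
The marginal totals described above can be checked with a few lines of Python. This is a minimal sketch using the counts from Table 12.2; the helper name `marginal_totals` is an illustrative assumption, not something defined in the text.

```python
# Observed cell counts from Table 12.2 (rows: political philosophy,
# columns: preferred network).
rows = ["Liberal", "Moderate", "Conservative"]
cols = ["ABC", "CBS", "NBC", "PBS"]
observed = [
    [20, 20, 25, 15],
    [45, 35, 50, 20],
    [15, 40, 10, 5],
]

def marginal_totals(table):
    """Return (row totals, column totals, grand total) for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]  # zip(*table) transposes
    return row_totals, col_totals, sum(row_totals)

row_totals, col_totals, grand_total = marginal_totals(observed)
print(dict(zip(rows, row_totals)))   # row marginal totals by philosophy
print(dict(zip(cols, col_totals)))   # column marginal totals by network
print(grand_total)                   # number of observations
```

The row totals come out as 80, 150, and 70, the column totals as 80, 95, 85, and 40, and the grand total as 300, matching the table.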

Two-way frequency tables are often characterized by the number of rows and

columns in the table (specified in that order: rows first, then columns). Table 12.2 is

called a 3 × 4 table. The smallest two-way frequency table is a 2 × 2 table, which

has only two rows and two columns, resulting in four cells.

Two-way tables arise naturally in two different types of investigations. A researcher may be interested in comparing two or more populations or treatments on

the basis of a single categorical variable and so may obtain independent samples from

each population or treatment. For example, data could be collected at a university to

compare students, faculty, and staff on the basis of primary mode of transportation

to campus (car, bicycle, motorcycle, bus, or by foot). One random sample of 200

students, another of 100 faculty members, and a third of 150 staff members might be

chosen, and the selected individuals could be interviewed to obtain the necessary

transportation information. Data from such a study could be summarized in a 3 × 5

two-way frequency table with row categories of student, faculty, and staff and column

categories corresponding to the five possible modes of transportation. The observed

cell counts could then be used to gain insight into differences and similarities among

the three groups with respect to mode of transportation. This type of bivariate categorical data set is characterized by having one set of marginal totals predetermined

(the sample sizes for the different groups). In the 3 × 5 situation just discussed, the

row totals would be fixed at 200, 100, and 150.

A two-way table also arises when the values of two different categorical variables are

observed for all individuals or items in a single sample. For example, a sample of 500

Copyright 2010 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s).

Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.






registered voters might be selected. Each voter could then be asked both if he or she favored a particular property tax initiative and if he or she was a registered Democrat, Republican, or Independent. This would result in a bivariate data set with x representing the

variable political affiliation (with categories Democrat, Republican, and Independent) and

y representing the variable response (favors initiative or opposes initiative). The corresponding 3 × 2 frequency table could then be used to investigate any association between

position on the tax initiative and political affiliation. This type of bivariate categorical data

set is characterized by having only the grand total predetermined (by the sample size).



Comparing Two or More Populations or Treatments: A Test of Homogeneity

When the value of a categorical variable is recorded for members of independent random

samples obtained from each population or treatment under study, the question of interest is whether the category proportions are the same for all the populations or treatments.

As in Section 12.1, the test procedure uses a chi-square statistic that compares the observed counts to those that would be expected if there were no differences.



EXAMPLE 12.4 Risky Soccer?

The paper “No Evidence of Impaired Neurocognitive Performance in Collegiate

Soccer Players” (American Journal of Sports Medicine [2002]:157–162) compared

collegiate soccer players, athletes in sports other than soccer, and a group of students

who were not involved in collegiate sports with respect to history of head injuries.

Table 12.3, a 3 × 4 two-way frequency table, is the result of classifying each student

in independently selected random samples of 91 soccer players, 96 non-soccer athletes, and 53 non-athletes according to the number of previous concussions the student reported on a medical history questionnaire.



TABLE 12.3 Observed Counts for Example 12.4

                                      Number of Concussions
                        0             1             2             3 or More     Row Marginal
                        Concussions   Concussion    Concussions   Concussions   Total
Soccer Players          45            25            11            10            91
Non-Soccer Athletes     68            15            8             5             96
Non-Athletes            45            5             3             0             53
Column Marginal Total   158           45            22            15            240



Estimates of expected cell counts can be thought of in the following manner:

There were 240 responses on number of concussions, of which 158 were “0 concussions.” The proportion of the total responding “0 concussions” is then

$$\frac{158}{240} = .658$$

If there were no difference in response for the different groups, we would then

expect about 65.8% of the soccer players to have responded “0 concussions,” 65.8% of

the non-soccer athletes to have responded “0 concussions,” and so on. Therefore the

estimated expected cell counts for the three cells in the “0 concussions” column are

Expected count for soccer player and 0 concussions cell = .658(91) = 59.9

Expected count for non-soccer athlete and 0 concussions cell = .658(96) = 63.2

Expected count for non-athlete and 0 concussions cell = .658(53) = 34.9






Note that the expected cell counts need not be whole numbers. The expected cell

counts for the remaining cells can be computed in a similar manner. For example,

$$\frac{45}{240} = .188$$

of all responses were in the “1 concussion” category, so

Expected count for soccer player and 1 concussion cell = .188(91) = 17.1

Expected count for non-soccer athlete and 1 concussion cell = .188(96) = 18.0

Expected count for non-athlete and 1 concussion cell = .188(53) = 10.0

It is common practice to display the observed cell counts and the corresponding

expected cell counts in the same table, with the expected cell counts enclosed in parentheses. Expected cell counts for the remaining cells have been computed and entered into

Table 12.4. Except for small differences resulting from rounding, each marginal total for

the expected cell counts is identical to that of the corresponding observed counts.



TABLE 12.4 Observed and Expected Counts for Example 12.4

                                      Number of Concussions
                        0             1             2             3 or More     Row Marginal
                        Concussions   Concussion    Concussions   Concussions   Total
Soccer Players          45 (59.9)     25 (17.1)     11 (8.3)      10 (5.7)      91
Non-Soccer Athletes     68 (63.2)     15 (18.0)     8 (8.8)       5 (6.0)       96
Non-Athletes            45 (34.9)     5 (10.0)      3 (4.9)       0 (3.3)       53
Column Marginal Total   158           45            22            15            240



A quick comparison of the observed and expected cell counts in Table 12.4 reveals some large discrepancies, suggesting that the proportions falling into the concussion categories may not be the same for all three groups. This will be explored further

in Example 12.5.

In Example 12.4, the expected count for a cell corresponding to a particular

group–response combination was computed in two steps. First, the response marginal

proportion was computed (e.g., 158/240 for the “0 concussions” response). Then this

proportion was multiplied by a marginal group total (for example, 91(158/240) for

the soccer player group). Algebraically, this is equivalent to first multiplying the row

and column marginal totals and then dividing by the grand total:

$$\frac{(91)(158)}{240}$$



To compare two or more populations or treatments on the basis of a categorical variable, calculate an expected cell count for each cell by selecting the corresponding row

and column marginal totals and then computing

$$\text{expected cell count} = \frac{(\text{row marginal total})(\text{column marginal total})}{\text{grand total}}$$

These quantities represent what would be expected when there is no difference between the groups under study.
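
The expected-count formula above can be sketched in Python as follows. This is an illustration only, using the observed counts from Table 12.3; the function name `expected_counts` is an assumption made for this sketch.

```python
# Observed counts from Table 12.3
# (rows: group; columns: 0, 1, 2, 3 or more concussions).
observed = [
    [45, 25, 11, 10],  # soccer players
    [68, 15, 8, 5],    # non-soccer athletes
    [45, 5, 3, 0],     # non-athletes
]

def expected_counts(table):
    """Expected count per cell: (row total)(column total) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

expected = expected_counts(observed)
for row in expected:
    print([round(e, 1) for e in row])
```

Because this sketch works with the exact ratio rather than the rounded proportion .188, the non-athlete and 1 concussion cell comes out as about 9.9 instead of the 10.0 shown in Table 12.4. Differences of this size are the rounding effects the text mentions.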






The X² statistic, introduced in Section 12.1, can now be used to compare the

observed cell counts to the expected cell counts. A large value of X² results when there

are large discrepancies between the observed and expected counts and suggests that

the hypothesis of no differences between the populations should be rejected. A formal

test procedure is described in the accompanying box.



X² Test for Homogeneity

Null hypothesis: H0: The true category proportions are the same for all the populations or treatments (homogeneity of populations or treatments).

Alternative hypothesis: Ha: The true category proportions are not all the same for all of the populations or treatments.

Test statistic:

$$X^2 = \sum_{\text{all cells}} \frac{(\text{observed cell count} - \text{expected cell count})^2}{\text{expected cell count}}$$

The expected cell counts are estimated from the sample data (assuming that H0 is true) using the formula

$$\text{expected cell count} = \frac{(\text{row marginal total})(\text{column marginal total})}{\text{grand total}}$$

P-values: When H0 is true and the assumptions of the X² test are satisfied, X² has approximately a chi-square distribution with df = (number of rows − 1)(number of columns − 1). The P-value associated with the computed test statistic value is the area to the right of X² under the chi-square curve with the appropriate df. Upper-tail areas for chi-square distributions are found in Appendix Table 8.

Assumptions:



1. The data are from independently chosen random samples or from subjects who were assigned

at random to treatment groups.

2. The sample size is large: all expected counts are at least 5. If some expected counts are

less than 5, rows or columns of the table may be combined to achieve a table with

satisfactory expected counts.



EXAMPLE 12.5 Risky Soccer Revisited



The following table of observed and expected cell counts appeared in Example 12.4:

                                      Number of Concussions
                        0             1             2             3 or More     Row Marginal
                        Concussions   Concussion    Concussions   Concussions   Total
Soccer Players          45 (59.9)     25 (17.1)     11 (8.3)      10 (5.7)      91
Non-Soccer Athletes     68 (63.2)     15 (18.0)     8 (8.8)       5 (6.0)       96
Non-Athletes            45 (34.9)     5 (10.0)      3 (4.9)       0 (3.3)       53
Column Marginal Total   158           45            22            15            240



Hypotheses: H0: Proportions in each response (number of concussions) category

are the same for all three groups

Ha: The category proportions are not all the same for all three groups.




Significance level: A significance level of α = .05 will be used.

Test statistic:

$$X^2 = \sum_{\text{all cells}} \frac{(\text{observed cell count} - \text{expected cell count})^2}{\text{expected cell count}}$$



Assumptions: The random samples were independently chosen, so use of the test is

appropriate if the sample size is large enough. One of the expected cell counts (in the

3 or more concussions column) is less than 5, so we will combine the last two columns of the table prior to carrying out the chi-square test. The table we will work

with is then



                               Number of Concussions
                        0             1             2 or More     Row Marginal
                        Concussions   Concussion    Concussions   Total
Soccer Players          45 (59.9)     25 (17.1)     21 (14.0)     91
Non-Soccer Athletes     68 (63.2)     15 (18.0)     13 (14.8)     96
Non-Athletes            45 (34.9)     5 (10.0)      3 (8.2)       53
Column Marginal Total   158           45            37            240



Calculation:

$$X^2 = \frac{(45 - 59.9)^2}{59.9} + \cdots + \frac{(3 - 8.2)^2}{8.2} = 20.6$$



P-value: The two-way table for this example has 3 rows and 3 columns, so the appropriate df is (3 − 1)(3 − 1) = 4. Since 20.6 is greater than 18.46, the largest entry

in the 4-df column of Appendix Table 8,

P-value < .001

Conclusion: P-value ≤ α, so H0 is rejected. There is strong evidence to support the

claim that the proportions in the number of concussions categories are not the same

for the three groups compared. The largest differences between the observed frequencies and those that would be expected if there were no group differences occur in

the response categories for soccer players and for non-athletes, with soccer players

having higher than expected proportions in the 1 and 2 or more concussion categories

and non-athletes having a higher than expected proportion in the 0 concussion

category.
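
The steps of Example 12.5 can be sketched in Python as follows. This is a minimal standard-library illustration, not the procedure of any particular statistics package; the helper names `combine_last_two_columns` and `chi_square_statistic` are assumptions made here. The upper-tail area uses the closed form exp(−x/2)(1 + x/2), which applies only to a chi-square distribution with 4 df.

```python
import math

# Observed counts from Table 12.3
# (columns: 0, 1, 2, 3 or more concussions).
observed = [
    [45, 25, 11, 10],  # soccer players
    [68, 15, 8, 5],    # non-soccer athletes
    [45, 5, 3, 0],     # non-athletes
]

def combine_last_two_columns(table):
    """Merge the last two columns (here: 2 and 3 or more concussions)."""
    return [row[:-2] + [row[-2] + row[-1]] for row in table]

def chi_square_statistic(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            x2 += (obs - expected) ** 2 / expected
    return x2

combined = combine_last_two_columns(observed)
x2 = chi_square_statistic(combined)
df = (len(combined) - 1) * (len(combined[0]) - 1)

# Upper-tail area for a chi-square distribution with 4 df
# (this closed form is valid only for df = 4).
p_value = math.exp(-x2 / 2) * (1 + x2 / 2)

print(round(x2, 1), df, p_value < 0.001)
```

The statistic rounds to 20.6 with 4 df, and the computed upper-tail area is below .001, in line with the conclusion of the example.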



Most statistical computer packages can calculate expected cell counts, the value

of the X² statistic, and the associated P-value. This is illustrated in the following

example.



EXAMPLE 12.6 Keeping the Weight Off



The article “Daily Weigh-ins Can Help You Keep Off Lost Pounds, Experts Say”

(Associated Press, October 17, 2005) describes an experiment in which 291 people




who had lost at least 10% of their body weight in a medical weight loss program were

assigned at random to one of three groups for follow-up. One group met monthly in






person, one group “met” online monthly in a chat room, and one group received a

monthly newsletter by mail. After 18 months, participants in each group were classified according to whether or not they had regained more than 5 pounds, resulting in

the data given in Table 12.5.



TABLE 12.5 Observed and Expected Counts for Example 12.6

                    Amount of Weight Gained
              Regained        Regained More   Row Marginal
              5 Lb or Less    Than 5 Lb       Total
In-Person     52 (41.0)       45 (56.0)       97
Online        44 (41.0)       53 (56.0)       97
Newsletter    27 (41.0)       70 (56.0)       97



Does there appear to be a difference in the weight regained proportions for the

three follow-up methods? The relevant hypotheses are

H0: The proportions for the two weight-regained categories are the same for the

three follow-up methods.

Ha: The weight-regained category proportions are not the same for all three

follow-up methods.

Significance level: α = .01

Test statistic:

$$X^2 = \sum_{\text{all cells}} \frac{(\text{observed cell count} - \text{expected cell count})^2}{\text{expected cell count}}$$



Assumptions: Table 12.5 contains the computed expected counts, all of which are

greater than 5. The subjects in this experiment were assigned at random to the treatment groups.

Calculation: Minitab output follows. For each cell, the Minitab output includes the observed cell count, the expected cell count, and the value of (observed cell count − expected cell count)²/(expected cell count) for that cell (this is the contribution to the X² statistic for this cell). From the output, X² = 13.773.

Chi-Square Test

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

               <=5       >5    Total
In-person       52       45       97
             41.00    56.00
             2.951    2.161

Online          44       53       97
             41.00    56.00
             0.220    0.161

Newsletter      27       70       97
             41.00    56.00
             4.780    3.500

Total          123      168      291

Chi-Sq = 13.773, DF = 2, P-Value = 0.001






P-value: From the Minitab output, P-value = .001.

Conclusion: Since P-value ≤ α, H0 is rejected. The data indicate that the proportions who have regained more than five pounds are not the same for the three

follow-up methods. Comparing the observed and expected cell counts, we can

see that the observed number in the newsletter group who had regained more

than 5 pounds was higher than would have been expected and the observed number in the in-person group who had regained more than 5 pounds was lower than

would have been expected if there were no difference in the three follow-up

methods.
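
The per-cell layout of the Minitab output in Example 12.6 can be reproduced with a short Python sketch. This is only an illustration built on the standard library, not Minitab's own algorithm. It relies on the fact that for df = 2 the chi-square upper-tail area is exactly exp(−x/2).

```python
import math

groups = ["In-person", "Online", "Newsletter"]
# Observed counts from Table 12.5:
# [regained 5 lb or less, regained more than 5 lb].
observed = [[52, 45], [44, 53], [27, 70]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

x2 = 0.0
for name, row, r in zip(groups, observed, row_totals):
    expected = [r * c / n for c in col_totals]
    # Contribution of each cell to the X^2 statistic.
    contributions = [(o - e) ** 2 / e for o, e in zip(row, expected)]
    x2 += sum(contributions)
    print(name, row,
          [round(e, 2) for e in expected],
          [round(c, 3) for c in contributions])

df = (len(observed) - 1) * (len(observed[0]) - 1)
# For df = 2 the chi-square upper-tail area is exactly exp(-x / 2).
p_value = math.exp(-x2 / 2)
print(f"Chi-Sq = {x2:.3f}, DF = {df}, P-Value = {p_value:.3f}")
```

The final line agrees with the summary line of the Minitab output above.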



Testing for Independence of Two Categorical Variables

The X² test statistic and test procedure can also be used to investigate association

between two categorical variables in a single population. As an example, television

viewers in a particular city might be categorized with respect to both preferred network (ABC, CBS, NBC, or PBS) and favorite type of programming (comedy, drama,

or information and news). The question of interest is often whether knowledge of one

variable’s value provides any information about the value of the other variable—that

is, are the two variables independent?

Continuing the example, suppose that those who favor ABC prefer the three

types of programming in proportions .4, .5, and .1 and that these proportions are also

correct for individuals favoring any of the other three networks. Then, learning an

individual’s preferred network provides no added information about that individual’s

favorite type of programming. The categorical variables preferred network and favorite

program type would be independent.

To see how expected counts are obtained in this situation, recall from Chapter 6

that if two outcomes A and B are independent, then

P(A and B) = P(A)P(B)

so the proportion of time that the two outcomes occur together in the long run is the

product of the two individual long-run relative frequencies. Similarly, two categorical

variables are independent in a population if, for each particular category of the first

variable and each particular category of the second variable,

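$$\left(\begin{array}{c}\text{proportion of individuals}\\ \text{in a particular category}\\ \text{combination}\end{array}\right) = \left(\begin{array}{c}\text{proportion in}\\ \text{specified category}\\ \text{of first variable}\end{array}\right) \cdot \left(\begin{array}{c}\text{proportion in}\\ \text{specified category}\\ \text{of second variable}\end{array}\right)$$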

Thus, if 30% of all viewers prefer ABC and the proportions of program type preferences are as previously given, then, assuming that the two variables are independent,

the proportion of individuals who both favor ABC and prefer comedy is (.3)(.4) =

.12 (or 12%).

Multiplying the right-hand side of the expression above by the sample size gives

us the expected number of individuals in the sample who are in both specified categories of the two variables when the variables are independent. However, these expected

counts cannot be calculated, because the individual population proportions are not






known. The solution is to estimate each population proportion using the corresponding sample proportion:

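$$\left(\begin{array}{c}\text{estimated expected number}\\ \text{in specified categories}\\ \text{of the two variables}\end{array}\right) = (\text{sample size}) \cdot \frac{\left(\begin{array}{c}\text{observed number}\\ \text{in category of}\\ \text{first variable}\end{array}\right)}{\text{sample size}} \cdot \frac{\left(\begin{array}{c}\text{observed number}\\ \text{in category of}\\ \text{second variable}\end{array}\right)}{\text{sample size}}$$

$$= \frac{\left(\begin{array}{c}\text{observed number in}\\ \text{category of first variable}\end{array}\right) \cdot \left(\begin{array}{c}\text{observed number in}\\ \text{category of second variable}\end{array}\right)}{\text{sample size}}$$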



Suppose that the observed counts are displayed in a rectangular table in which

rows correspond to the categories of the first variable and columns to the categories

of the second variable. Then, the numerator in the preceding expression for expected

counts is just the product of the row and column marginal totals. This is exactly how

expected counts were computed in the test for homogeneity of several populations,

even though the reasoning used to arrive at the formula is different.



X² Test for Independence

Null hypothesis: H0: The two variables are independent.

Alternative hypothesis: Ha: The two variables are not independent.

Test statistic:

$$X^2 = \sum_{\text{all cells}} \frac{(\text{observed cell count} - \text{expected cell count})^2}{\text{expected cell count}}$$

The expected cell counts are estimated (assuming H0 is true) by the formula

$$\text{expected cell count} = \frac{(\text{row marginal total})(\text{column marginal total})}{\text{grand total}}$$

P-values: When H0 is true and the assumptions of the X² test are satisfied, X² has approximately a chi-square distribution with

df = (number of rows − 1)(number of columns − 1)

The P-value associated with the computed test statistic value is the area to the right of X² under the chi-square curve with the appropriate df. Upper-tail areas for chi-square distributions are found in Appendix Table 8.

Assumptions:



1. The observed counts are based on data from a random sample.

2. The sample size is large: All expected counts are at least 5. If some expected counts are

less than 5, rows or columns of the table should be combined to achieve a table with

satisfactory expected counts.



EXAMPLE 12.7 A Pained Expression



The paper “Facial Expression of Pain in Elderly Adults with Dementia” (Journal

of Undergraduate Research [2006]) examined the relationship between a nurse’s




assessment of a patient’s facial expression and his or her self-reported level of pain.

Data for 89 patients are summarized in Table 12.6.






TABLE 12.6 Observed Counts for Example 12.7

                         Self-Report
Facial Expression    No Pain    Pain
No Pain              17         40
Pain                 3          29



The authors were interested in determining if there is evidence of a relationship

between a facial expression that reflects pain and self-reported pain because patients

with dementia do not always give a verbal indication that they are in pain.

Using a .05 significance level, we will test

H0: Facial expression and self-reported pain are independent.

Ha: Facial expression and self-reported pain are not independent.

Significance level: α = .05

Test statistic:

$$X^2 = \sum_{\text{all cells}} \frac{(\text{observed cell count} - \text{expected cell count})^2}{\text{expected cell count}}$$



Assumptions: Before we can check the assumptions we must first compute the expected cell counts.



Cell (Row, Column)    Expected Cell Count
Row 1, Column 1       (57)(20)/89 = 12.81
Row 1, Column 2       (57)(69)/89 = 44.19
Row 2, Column 1       (32)(20)/89 = 7.19
Row 2, Column 2       (32)(69)/89 = 24.81



All expected cell counts are greater than 5. Although the participants in the study

were not randomly selected, they were thought to be representative of the population

of nursing home patients with dementia. The observed and expected counts are given

together in Table 12.7.



TABLE 12.7 Observed and Expected Counts for Example 12.7

                             Self-Report
Facial Expression    No Pain       Pain
No Pain              17 (12.81)    40 (44.19)
Pain                 3 (7.19)      29 (24.81)






Calculation:

$$X^2 = \frac{(17 - 12.81)^2}{12.81} + \cdots + \frac{(29 - 24.81)^2}{24.81} = 4.92$$



P-value: The table has 2 rows and 2 columns, so df = (2 − 1)(2 − 1) = 1. The entry

closest to 4.92 in the 1-df column of Appendix Table 8 is 5.02, so the approximate

P-value for this test is

P-value ≈ .025

Conclusion: Since P-value ≤ α, we reject H0 and conclude that there is convincing

evidence of an association between a nurse’s assessment of facial expression and self-reported pain.
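
The calculation in Example 12.7 can be checked with a short Python sketch. This is a standard-library illustration, not the method used by the authors. For df = 1, X² behaves like the square of a standard normal variable, so the upper-tail area can be written as erfc(sqrt(x/2)).

```python
import math

# Observed counts from Table 12.6 (rows: facial expression;
# columns: self-report), each ordered no pain, pain.
observed = [[17, 40], [3, 29]]

row_totals = [sum(row) for row in observed]        # 57, 32
col_totals = [sum(col) for col in zip(*observed)]  # 20, 69
n = sum(row_totals)                                # 89

expected = [[r * c / n for c in col_totals] for r in row_totals]
# Large-sample check from the box: every expected count is at least 5.
assert all(e >= 5 for row in expected for e in row)

x2 = sum((o - e) ** 2 / e
         for obs_row, exp_row in zip(observed, expected)
         for o, e in zip(obs_row, exp_row))

# For df = 1 the chi-square upper-tail area equals erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(x2 / 2))
print(round(x2, 2), round(p_value, 3))
```

The statistic rounds to 4.92. The computed upper-tail area is slightly above the table-based approximation of .025, because the table entry used (5.02) is a little larger than 4.92, but it is still below the .05 significance level.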



EXAMPLE 12.8 Stroke Mortality and Education



Table 12.8 was constructed using data from the article “Influence of Socioeconomic Status on Mortality After Stroke” (Stroke [2005]: 310–314). One of the

questions of interest to the author was whether there was an association between

survival after a stroke and level of education. Medical records for a sample of 2333

residents of Vienna, Austria, who had suffered a stroke were used to classify each individual according to two variables—survival (survived, died) and level of education

(no basic education, secondary school graduation, technical training/apprenticed,

higher secondary school degree, university graduate). Expected cell counts (computed

under the assumption of no association between survival and level of education) appear in parentheses in the table.



TABLE 12.8 Observed and Expected Counts for Example 12.8

                          Secondary       Technical       Higher
           No Basic       School          Training/       Secondary       University
           Education      Graduation      Apprenticed     School Degree   Graduate
Died       13 (17.40)     91 (77.18)      196 (182.68)    33 (41.91)      36 (49.82)
Survived   97 (92.60)     397 (410.82)    959 (972.32)    232 (223.09)    279 (265.18)



The X² test with a significance level of .01 will be used to test the relevant

hypotheses:

H0: Survival and level of education are independent.

Ha: Survival and level of education are not independent.

Significance level: α = .01

Test statistic:

$$X^2 = \sum_{\text{all cells}} \frac{(\text{observed cell count} - \text{expected cell count})^2}{\text{expected cell count}}$$



Assumptions: All expected cell counts are at least 5. Assuming that the data can be

viewed as representative of adults who suffer strokes, the X² test can be used.




Calculation: Minitab output is shown. From the Minitab output, X² = 12.219.

Chi-Square Test

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

              1         2         3         4         5     Total
1            13        91       196        33        36       369
          17.40     77.18    182.68     41.91     49.82
          1.112     2.473     0.971     1.896     3.835

2            97       397       959       232       279      1964
          92.60    410.82    972.32    223.09    265.18
          0.209     0.465     0.182     0.356     0.720

Total       110       488      1155       265       315      2333

Chi-Sq = 12.219, DF = 4, P-Value = 0.016



P-value: From the Minitab output, P-value = .016.

Conclusion: Since P-value > α, H0 is not rejected. There is not sufficient evidence

to conclude that an association exists between level of education and survival.
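
The Minitab summary line for Example 12.8 can be reproduced with the sketch below. It is an illustration under stated assumptions, not Minitab's implementation: the function name `chi_square_upper_tail` is hypothetical, and the upper-tail area is built from the closed forms for df = 1 and df = 2 using the standard incomplete-gamma recurrence, with only the Python standard library.

```python
import math

def chi_square_upper_tail(x, df):
    """Area to the right of x under the chi-square curve with df degrees of freedom.

    Starts from the closed form for df = 1 (erfc) or df = 2 (exp) and steps
    up two degrees of freedom at a time with the incomplete-gamma recurrence.
    """
    h = x / 2
    if df % 2 == 0:
        a, tail = 1.0, math.exp(-h)
    else:
        a, tail = 0.5, math.erfc(math.sqrt(h))
    while a < df / 2:
        tail += h ** a * math.exp(-h) / math.gamma(a + 1)
        a += 1
    return tail

# Observed counts from Table 12.8 (columns: the five education levels).
observed = [
    [13, 91, 196, 33, 36],     # died
    [97, 397, 959, 232, 279],  # survived
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Each term is (observed - expected)^2 / expected, with expected = r * c / n.
x2 = sum((o - r * c / n) ** 2 / (r * c / n)
         for row, r in zip(observed, row_totals)
         for o, c in zip(row, col_totals))
df = (len(observed) - 1) * (len(observed[0]) - 1)
p_value = chi_square_upper_tail(x2, df)

alpha = 0.01
print(f"X^2 = {x2:.3f}, df = {df}, P-value = {p_value:.3f}")
print("reject H0" if p_value <= alpha else "fail to reject H0")
```

With α = .01 the sketch reaches the same decision as the example: the P-value of about .016 exceeds α, so H0 is not rejected.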



In some investigations, values of more than two categorical variables are recorded

for each individual in the sample. For example, in addition to the variable survival

and level of education, the researchers in the study referenced in Example 12.8 also

collected information on occupation. A number of interesting questions could then

be explored: Are all three variables independent of one another? Is it possible that

occupation and survival are dependent but that the relationship between them does

not depend on level of education? For a particular education level group, is there an

association between survival and occupation? The X² test procedure described in this

section for analysis of bivariate categorical data can be extended for use with multivariate categorical data. Appropriate hypothesis tests can then be used to provide insight into the relationships between variables. However, the computations required

to calculate expected cell counts and to compute the value of X² are quite tedious, so

they are seldom done without the aid of a computer. Most statistical computer packages can perform this type of analysis. Consult the references by Agresti and Findlay,

Everitt, or Mosteller and Rourke listed in the back of the book for further information on the analysis of categorical data.



EXERCISES 12.14 - 12.31

12.14 A particular state university system has six campuses. On each campus, a random sample of students

will be selected, and each student will be categorized

with respect to political philosophy as liberal, moderate,

or conservative. The null hypothesis of interest is that the

proportion of students falling in these three categories is

the same at all six campuses.

a. On how many degrees of freedom will the resulting

X² test be based?

b. How does your answer in Part (a) change if there are

seven campuses rather than six?




c. How does your answer in Part (a) change if there are

four rather than three categories for political philosophy?



12.15 A random sample of 1000 registered voters in a

certain county is selected, and each voter is categorized

with respect to both educational level (four categories)

and preferred candidate in an upcoming election for

county supervisor (five possibilities). The hypothesis of

interest is that educational level and preferred candidate

are independent.
