Example 6.4: A Goodness-of-Fit Test for Blood Pressure Status Category Proportions


Applied Survey Data Analysis

Example 6.5: Pie Charts and Vertical Bar Charts of
the Estimated Blood Pressure Status Classification for
U.S. Adults from the 2005–2006 NHANES Data
Using Stata graphics, simple weighted pie charts (Figure 6.3) or weighted vertical
bar charts (Figure 6.4) produce an effective display of the estimated population
distribution across categories of a single categorical variable. The Stata command
syntax used to generate these two figures is as follows:
* Pie chart (one long command; each /// continues the command on the next line).
graph pie bp_cat1 bp_cat2 bp_cat3 bp_cat4 ///
    [pweight=wtmec2yr] if age18p==1, ///
    plabel(_all percent, format(%9.1f)) scheme(s2mono) ///
    legend(label(1 "Normal") label(2 "Pre-Hypertensive") ///
    label(3 "Stage 1 Hypertensive") label(4 "Stage 2 Hypertensive"))
* Vertical bar chart (one long command).
graph bar (mean) bp_cat1 bp_cat2 bp_cat3 bp_cat4 ///
    [pweight=wtmec2yr] if age18p==1, ///
    blabel(bar, format(%9.1f) color(none)) ///
    bar(1,color(gs12)) bar(2,color(gs4)) bar(3,color(gs8)) ///
    bar(4,color(black)) ///
    bargap(7) scheme(s2mono) over(riagendr) percentages ///
    legend(label(1 "Normal") label(2 "Pre-Hypertensive") ///
    label(3 "Stage 1 Hypertensive") label(4 "Stage 2 Hypertensive")) ///
    ytitle("Percentage")

6.4 Bivariate Analysis of Categorical Data
Bivariate analysis of categorical data may take a number of forms, ranging
from estimation of proportions for the joint distribution of the two variables
(total proportions) or estimation of “conditional” proportions based on levels
of the second categorical variable (i.e., row proportions or column proportions) to techniques for measuring and testing bivariate association between
two categorical variables. Bivariate categorical data analyses and reporting
are important in their own right, and they are also important as exploratory
tools in the development of more complex multivariate models (see the
regression model building steps in Section 8.3).
6.4.1 Response and Factor Variables
Unlike simple linear regression for continuous survey variables, there is no
requirement to differentiate variables by type (dependent or independent) or
postulate a cause–effect relationship between two or more categorical variables that are being analyzed simultaneously. An example where assigning

© 2010 by Taylor and Francis Group, LLC


Categorical Data Analysis

Figure 6.3
Pie chart of the estimated distribution of blood pressure status of U.S. adults.
[Figure not reproduced. Legend: Normal, Pre-Hypertensive, Stage 1 Hypertensive, Stage 2 Hypertensive. Slice labels: 47.1%, 41.9%, 8.6%, 2.4%.]

Figure 6.4
Bar chart of the estimated distribution of the blood pressure status of U.S. adults. (Modified
from the 2005–2006 NHANES data.)
[Figure not reproduced. Vertical axis: Percentage, 0 to 50. Legend: Normal, Pre-Hypertensive, Stage 1 Hypertensive, Stage 2 Hypertensive. Bar labels: 47.1%, 41.9%, 8.6%, 2.4%.]


the variables to dependent or independent status is not necessary would be a
bivariate analysis of the population distribution by education levels and census
region. Nevertheless, in many categorical data analysis problems it is natural to
think of one categorical variable as the response variable and the others as factor variables that may be associated with the discrete outcome for the response
variable. This section will introduce examples centered around the analysis of
the joint distribution of U.S. adults’ experience with a lifetime episode of major
depression (yes, no) and their gender (male, female), based on the National
Comorbidity Survey Replication (NCS-R) data set. In these examples, it is convenient to label the NCS-R indicator variable for a lifetime major depressive
episode (MDE) as the response variable and the respondent’s gender (SEX) as
the factor. Assigning response and factor labels to the categorical variables also
sets up the transition to later discussion of regression modeling of categorical
data where dependent and independent variables are clearly specified.
6.4.2 Estimation of Total, Row, and Column
Proportions for Two-Way Tables
Based on the weighted frequencies illustrated in Figure 6.2, estimates of population proportions can be computed as the ratio of the weighted sample frequency for the cell to the appropriate weighted total or marginal frequency
value. For example, Figure 6.5 illustrates estimation of the total proportions
of the population in each cell and margin of the table. Note that the numerator of each estimated proportion is the weighted total frequency for the cell,
such as $\hat{N}_{A1}$, and the denominator is the weighted total population frequency, $\hat{N}_{++}$. Statistical software also enables the user to condition estimates
of population proportions on the sample in particular rows or columns of the
crosstabulation. Figure 6.6 illustrates the calculations of weighted estimates
of row proportions for the estimated population distribution in Figure 6.2.
The following example uses the NCS-R data on lifetime major depressive
episode (MDE) and gender (SEX) to illustrate the Stata syntax to estimate total
proportions, row proportions, standard errors, and confidence intervals.
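                 Response
Factor           0                                        1                                        Row margin
A                $p_{A0} = \hat{N}_{A0} / \hat{N}_{++}$   $p_{A1} = \hat{N}_{A1} / \hat{N}_{++}$   $p_{A+} = p_{A0} + p_{A1}$
B                $p_{B0} = \hat{N}_{B0} / \hat{N}_{++}$   $p_{B1} = \hat{N}_{B1} / \hat{N}_{++}$   $p_{B+} = p_{B0} + p_{B1}$
Column margin    $p_{+0} = p_{A0} + p_{B0}$               $p_{+1} = p_{A1} + p_{B1}$               $p_{++} = 1.0$

Figure 6.5
Estimation of overall (total) population proportions (multinomial sampling model).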


                 Response
Factor           0                                        1                                        Row total
A                $p_{0|A} = \hat{N}_{A0} / \hat{N}_{A+}$  $p_{1|A} = \hat{N}_{A1} / \hat{N}_{A+}$  $p_{A+} = 1.0$
B                $p_{0|B} = \hat{N}_{B0} / \hat{N}_{B+}$  $p_{1|B} = \hat{N}_{B1} / \hat{N}_{B+}$  $p_{B+} = 1.0$

Figure 6.6
Estimation of row population proportions (product multinomial sampling model).
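
The calculations in Figures 6.5 and 6.6 can be sketched numerically. The short Python example below is only an illustration: the weighted cell totals are hypothetical values chosen for easy arithmetic (they are not NCS-R or NHANES estimates), and the function names are invented for this sketch. It computes point estimates only; standard errors require the design-based methods used by the survey commands.

```python
# Hypothetical weighted cell totals N-hat for a 2 x 2 table
# (factor level A/B by response 0/1). Illustrative values only.
weighted_totals = {
    ("A", 0): 400.0,
    ("A", 1): 100.0,
    ("B", 0): 300.0,
    ("B", 1): 200.0,
}


def total_proportions(cells):
    """Figure 6.5: divide each weighted cell total by the grand total N-hat_++."""
    grand_total = sum(cells.values())
    return {key: value / grand_total for key, value in cells.items()}


def row_proportions(cells):
    """Figure 6.6: divide each weighted cell total by its row total N-hat_r+."""
    row_totals = {}
    for (row, _col), value in cells.items():
        row_totals[row] = row_totals.get(row, 0.0) + value
    return {(row, col): value / row_totals[row]
            for (row, col), value in cells.items()}


total_p = total_proportions(weighted_totals)
row_p = row_proportions(weighted_totals)

for key in sorted(weighted_totals):
    print(key, round(total_p[key], 3), round(row_p[key], 3))
```

The total proportions sum to 1.0 over the whole table, whereas the row proportions sum to 1.0 within each level of the factor variable.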

Example 6.6 Estimation of Total and Row Proportions
for the Cross-Tabulation of Gender and Lifetime Major
Depression Status Using the NCS-R Data
The first of the two svy: tab commands below requests Stata's default
estimates of the total proportions for the SEX × MDE crosstabulation, along with
the corresponding standard errors, 95% CIs, and design effects. The row option in
the second svy: tab command specifies that the estimates, standard errors, CIs,
and design effects will be for the row proportions:
svyset seclustr [pweight = ncsrwtsh], strata(sestrat)
svy: tab sex mde, se ci deff
svy: tab sex mde, row se ci deff

The estimated proportions, standard errors, 95% CIs, and design effect output
from these two commands are summarized in Table 6.4.

6.4.3 Estimating and Testing Differences in Subpopulation Proportions
Estimates of row proportions in two-way tables (e.g., $\hat{p}_{1|B} = 0.230$ in Table 6.4)
are in fact subpopulation estimates in which the subpopulation is defined
by the levels of the factor variable. Analysts interested in testing differences
in response category proportions between two levels of a factor variable can
use methodology similar to that discussed in Section 5.6 for comparison of
subpopulation means.
Example 6.7: Comparing the Proportions of U.S. Adult
Men and Women with Lifetime Major Depression
This example uses data from the NCS-R to test a null hypothesis that there is no
difference in the proportions of U.S. adult men and women with a lifetime diagnosis of a major depressive episode. To compare male and female row proportions for MDE = 1, the svy: prop command with the over() option is used
to estimate the vector of row proportions (see Table 6.4) and their design-based


Table 6.4
Estimated Proportions of U.S. Adults by Gender and Lifetime Major
Depression Status

Description        Parameter      Estimated     Linearized   95% CI            Design
                                  Proportion    SE                             Effect

Total Proportions
Male, no MDE       $\pi_{A0}$     0.406         0.007        (0.393, 0.421)    1.87
Male, MDE          $\pi_{A1}$     0.072         0.003        (0.066, 0.080)    1.64
Female, no MDE     $\pi_{B0}$     0.402         0.005        (0.391, 0.413)    1.11
Female, MDE        $\pi_{B1}$     0.120         0.003        (0.114, 0.126)    0.81

Row Proportions
No MDE|Male        $\pi_{0|A}$    0.849         0.008        (0.833, 0.864)    2.08
MDE|Male           $\pi_{1|A}$    0.151         0.008        (0.136, 0.167)    2.08
No MDE|Female      $\pi_{0|B}$    0.770         0.006        (0.759, 0.782)    0.87
MDE|Female         $\pi_{1|B}$    0.230         0.006        (0.218, 0.241)    0.87

Source: Analysis based on NCS-R data.

variance–covariance matrix. Internally, Stata labels the estimated row proportions
for MDE = 1 as _prop_2. The lincom command is then executed to estimate
the contrast of the male and female proportions and the standard error of this difference. The relevant Stata commands are as follows:
svyset seclustr [pweight=ncsrwtsh], strata(sestrat) ///
vce(linearized) singleunit(missing)
svy: proportion mde, over(sex)
lincom [_prop_2]Male - [_prop_2]Female

The output of the lincom command provides the following estimate of the male–
female difference, its standard error, and a 95% CI for the contrast of proportions.
$\hat{\Delta} = \hat{p}_{male} - \hat{p}_{female}$    $se(\hat{\Delta})$    $CI_{.95}(\Delta)$
-0.079                                                0.010                 (-0.098, -0.059)

Because the design-based 95% CI for the difference in proportions does not
include 0, the data suggest that the rate of lifetime major depressive episodes for
women is significantly higher than that for men.
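
As a rough check on the lincom output, the contrast can be approximated from the rounded row-proportion estimates in Table 6.4. The Python sketch below rests on two assumptions that are not part of the original analysis: the covariance between the male and female estimates is set to zero, and a normal critical value of 1.96 is used in place of the design-based t value. Stata's lincom uses the full estimated variance–covariance matrix, so small discrepancies in the third decimal place are expected.

```python
import math

# Rounded estimates for MDE = 1 taken from Table 6.4 (row proportions).
p_male, se_male = 0.151, 0.008
p_female, se_female = 0.230, 0.006

# ASSUMPTION: zero covariance between the two subpopulation estimates.
# lincom instead uses the estimated design-based covariance.
cov_assumed = 0.0

# Contrast and its standard error:
# var(diff) = var(p_male) + var(p_female) - 2 * cov(p_male, p_female)
diff = p_male - p_female
se_diff = math.sqrt(se_male ** 2 + se_female ** 2 - 2.0 * cov_assumed)

# ASSUMPTION: normal critical value rather than a t value based on
# the design degrees of freedom.
crit = 1.96
ci = (diff - crit * se_diff, diff + crit * se_diff)

print(f"difference = {diff:.3f}, se = {se_diff:.3f}")
print(f"approximate 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Under these assumptions the difference is about -0.079 with a standard error of about 0.010, in line with the reported values, and the approximate interval also excludes 0.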

6.4.4 Chi-Square Tests of Independence of Rows and Columns
For a 2 × 2 table, the contrast of estimated subpopulation proportions examined in Example 6.7 is equivalent to a test of whether the response variable
(MDE) is independent of the factor variable (SEX). More generally, under
SRS, two categorical variables are independent of each other if the following
