Example 6.4: A Goodness-of-Fit Test for Blood Pressure Status Category Proportions
Applied Survey Data Analysis
Example 6.5: Pie Charts and Vertical Bar Charts of
the Estimated Blood Pressure Status Classification for
U.S. Adults from the 2005–2006 NHANES Data
Using Stata graphics, simple weighted pie charts (Figure 6.3) or weighted vertical
bar charts (Figure 6.4) produce an effective display of the estimated population
distribution across categories of a single categorical variable. The Stata command
syntax used to generate these two figures is as follows:
* Pie Chart (one long command).
graph pie bp_cat1 bp_cat2 bp_cat3 bp_cat4 ///
    [pweight=wtmec2yr] if age18p==1, ///
    plabel(_all percent, format(%9.1f)) scheme(s2mono) ///
    legend(label(1 "Normal") label(2 "Pre-Hypertensive") ///
    label(3 "Stage 1 Hypertensive") ///
    label(4 "Stage 2 Hypertensive"))
* Vertical Bar Chart (one long command).
graph bar (mean) bp_cat1 bp_cat2 bp_cat3 bp_cat4 ///
    [pweight=wtmec2yr] if age18p==1, ///
    blabel(bar, format(%9.1f) color(none)) ///
    bar(1, color(gs12)) bar(2, color(gs4)) bar(3, color(gs8)) ///
    bar(4, color(black)) ///
    bargap(7) scheme(s2mono) over(riagendr) percentages ///
    legend(label(1 "Normal") label(2 "Pre-Hypertensive") ///
    label(3 "Stage 1 Hypertensive") ///
    label(4 "Stage 2 Hypertensive")) ///
    ytitle("Percentage")
6.4 Bivariate Analysis of Categorical Data
Bivariate analysis of categorical data may take a number of forms, ranging
from estimation of proportions for the joint distribution of the two variables
(total proportions) or estimation of “conditional” proportions based on levels
of the second categorical variable (i.e., row proportions or column proportions) to techniques for measuring and testing bivariate association between
two categorical variables. Bivariate categorical data analyses and reporting
are important in their own right, and they are also important as exploratory
tools in the development of more complex multivariate models (see the
regression model building steps in Section 8.3).
6.4.1 Response and Factor Variables
Unlike simple linear regression for continuous survey variables, there is no
requirement to differentiate variables by type (dependent or independent) or
postulate a cause–effect relationship between two or more categorical variables
that are being analyzed simultaneously. An example where assigning the variables
to dependent or independent status is not necessary would be a bivariate analysis
of the population distribution by education levels and census region.
Nevertheless, in many categorical data analysis problems it is natural to think
of one categorical variable as the response variable and the others as factor
variables that may be associated with the discrete outcome for the response
variable. This section will introduce examples centered around the analysis of
the joint distribution of U.S. adults’ experience with a lifetime episode of major
depression (yes, no) and their gender (male, female), based on the National
Comorbidity Survey Replication (NCS-R) data set. In these examples, it is
convenient to label the NCS-R indicator variable for a lifetime major depressive
episode (MDE) as the response variable and the respondent’s gender (SEX) as
the factor. Assigning response and factor labels to the categorical variables also
sets up the transition to later discussion of regression modeling of categorical
data where dependent and independent variables are clearly specified.
© 2010 by Taylor and Francis Group, LLC
Categorical Data Analysis
Figure 6.3
Pie chart of the estimated distribution of blood pressure status of U.S. adults.
(Slice labels: 47.1%, 41.9%, 8.6%, 2.4%. Legend: Normal, Pre-Hypertensive,
Stage 1 Hypertensive, Stage 2 Hypertensive.)
Figure 6.4
Bar chart of the estimated distribution of the blood pressure status of U.S. adults. (Modified
from the 2005–2006 NHANES data.)
(Vertical axis: Percentage, 0 to 50. Bar labels: 47.1%, 41.9%, 8.6%, 2.4%. Legend:
Normal, Pre-Hypertensive, Stage 1 Hypertensive, Stage 2 Hypertensive.)
6.4.2 Estimation of Total, Row, and Column
Proportions for Two-Way Tables
Based on the weighted frequencies illustrated in Figure 6.2, estimates of
population proportions can be computed as the ratio of the weighted sample
frequency for the cell to the appropriate weighted total or marginal frequency
value. For example, Figure 6.5 illustrates estimation of the total proportions
of the population in each cell and margin of the table. Note that the numerator
of each estimated proportion is the weighted total frequency for the cell,
such as $\hat{N}_{A1}$, and the denominator is the weighted total population
frequency, $\hat{N}_{++}$. Statistical software also enables the user to condition
estimates of population proportions on the sample in particular rows or columns
of the crosstabulation. Figure 6.6 illustrates the calculations of weighted
estimates of row proportions for the estimated population distribution in Figure 6.2.
The following example uses the NCS-R data on lifetime major depressive
episode (MDE) and gender (SEX) to illustrate the Stata syntax to estimate total
proportions, row proportions, standard errors, and confidence intervals.
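\[
\begin{array}{l|cc|c}
 & \text{Response } 0 & \text{Response } 1 & \text{Margin} \\
\hline
\text{Factor } A & p_{A0} = \hat{N}_{A0}/\hat{N}_{++} & p_{A1} = \hat{N}_{A1}/\hat{N}_{++} & p_{A+} = p_{A0} + p_{A1} \\
\text{Factor } B & p_{B0} = \hat{N}_{B0}/\hat{N}_{++} & p_{B1} = \hat{N}_{B1}/\hat{N}_{++} & p_{B+} = p_{B0} + p_{B1} \\
\hline
\text{Margin} & p_{+0} = p_{A0} + p_{B0} & p_{+1} = p_{A1} + p_{B1} & p_{++} = 1.0
\end{array}
\]
Figure 6.5
Estimation of overall (total) population proportions (multinomial sampling model).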
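\[
\begin{array}{l|cc|c}
 & \text{Response } 0 & \text{Response } 1 & \text{Margin} \\
\hline
\text{Factor } A & p_{0|A} = \hat{N}_{A0}/\hat{N}_{A+} & p_{1|A} = \hat{N}_{A1}/\hat{N}_{A+} & p_{A+} = 1.0 \\
\text{Factor } B & p_{0|B} = \hat{N}_{B0}/\hat{N}_{B+} & p_{1|B} = \hat{N}_{B1}/\hat{N}_{B+} & p_{B+} = 1.0
\end{array}
\]
Figure 6.6
Estimation of row population proportions (product multinomial sampling model).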
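The cell formulas in Figures 6.5 and 6.6 can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the book's software workflow: the records and weights below are made up (they are not NCS-R data), and names such as `records`, `n_hat`, and `row_prop` are hypothetical.

```python
from collections import defaultdict

# Hypothetical respondent records: (factor level, response level, survey weight).
# These values are invented for illustration; they are not NCS-R data.
records = [
    ("A", 0, 1200.0), ("A", 1, 300.0), ("A", 0, 900.0),
    ("B", 0, 1000.0), ("B", 1, 600.0), ("B", 1, 400.0),
]

# Weighted cell totals N-hat_{rc}: sum of the weights in each cell.
n_hat = defaultdict(float)
for factor, response, weight in records:
    n_hat[(factor, response)] += weight

n_total = sum(n_hat.values())        # N-hat_{++}
row_total = defaultdict(float)       # N-hat_{r+}
for (factor, _), cell_total in n_hat.items():
    row_total[factor] += cell_total

# Total proportions (Figure 6.5): cell total over the grand total.
total_prop = {cell: n / n_total for cell, n in n_hat.items()}
# Row proportions (Figure 6.6): cell total over its own row total.
row_prop = {(f, r): n / row_total[f] for (f, r), n in n_hat.items()}

for cell in sorted(n_hat):
    print(cell, round(total_prop[cell], 3), round(row_prop[cell], 3))
```

The total proportions sum to 1 over the whole table, whereas the row proportions sum to 1 within each level of the factor. The sketch produces point estimates only; standard errors require the design-based variance estimation carried out by the survey software.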
Example 6.6: Estimation of Total and Row Proportions
for the Cross-Tabulation of Gender and Lifetime Major
Depression Status Using the NCS-R Data
The first of the two svy: tab commands below requests the default estimates of
the total proportions for the SEX × MDE crosstabulation, along with the
corresponding standard errors, 95% CIs, and design effects. The row option in
the second svy: tab command specifies that the estimates, standard errors, CIs,
and design effects are computed for the row proportions:
svyset seclustr [pweight = ncsrwtsh], strata(sestrat)
svy: tab sex mde, se ci deff
svy: tab sex mde, row se ci deff
The estimated proportions, standard errors, 95% CIs, and design effect output
from these two commands are summarized in Table 6.4.
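As a quick consistency check on Table 6.4, each row proportion should equal the corresponding total proportion divided by its row margin. The Python sketch below uses the rounded estimates reported in the table; the variable names are hypothetical, and agreement can only be expected up to rounding.

```python
# Estimated total proportions as reported in Table 6.4 (rounded to 3 decimals).
total = {("Male", 0): 0.406, ("Male", 1): 0.072,
         ("Female", 0): 0.402, ("Female", 1): 0.120}
# Row proportions for MDE = 1 as reported in Table 6.4.
reported_row = {"Male": 0.151, "Female": 0.230}

derived_row = {}
for sex in ("Male", "Female"):
    margin = total[(sex, 0)] + total[(sex, 1)]      # p_{r+}
    derived_row[sex] = total[(sex, 1)] / margin     # p_{1|r} = p_{r1} / p_{r+}
    print(sex, round(derived_row[sex], 3), reported_row[sex])
```

The same shortcut does not apply to the standard errors or design effects, which depend on the complex sample design and must come from the survey software.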
6.4.3 Estimating and Testing Differences in Subpopulation Proportions
Estimates of row proportions in two-way tables (e.g., $\hat{p}_{1|B} = 0.230$ in Table 6.4)
are in fact subpopulation estimates in which the subpopulation is defined
by the levels of the factor variable. Analysts interested in testing differences
in response category proportions between two levels of a factor variable can
use methodology similar to that discussed in Section 5.6 for comparison of
subpopulation means.
Table 6.4
Estimated Proportions of U.S. Adults by Gender and Lifetime Major
Depression Status

Description       Parameter      Estimated    Linearized   95% CI           Design
                                 Proportion   SE                            Effect
Total Proportions
Male, no MDE      $\pi_{A0}$     0.406        0.007        (0.393, 0.421)   1.87
Male, MDE         $\pi_{A1}$     0.072        0.003        (0.066, 0.080)   1.64
Female, no MDE    $\pi_{B0}$     0.402        0.005        (0.391, 0.413)   1.11
Female, MDE       $\pi_{B1}$     0.120        0.003        (0.114, 0.126)   0.81
Row Proportions
No MDE|Male       $\pi_{0|A}$    0.849        0.008        (0.833, 0.864)   2.08
MDE|Male          $\pi_{1|A}$    0.151        0.008        (0.136, 0.167)   2.08
No MDE|Female     $\pi_{0|B}$    0.770        0.006        (0.759, 0.782)   0.87
MDE|Female        $\pi_{1|B}$    0.230        0.006        (0.218, 0.241)   0.87

Source: Analysis based on NCS-R data.

Example 6.7: Comparing the Proportions of U.S. Adult
Men and Women with Lifetime Major Depression
This example uses data from the NCS-R to test a null hypothesis that there is no
difference in the proportions of U.S. adult men and women with a lifetime
diagnosis of a major depressive episode. To compare male and female row
proportions for MDE = 1, the svy: proportion command with the over() option is
used to estimate the vector of row proportions (see Table 6.4) and their
design-based variance–covariance matrix. Internally, Stata labels the estimated
row proportions for MDE = 1 as _prop_2. The lincom command is then executed to
estimate the contrast of the male and female proportions and the standard error
of this difference. The relevant Stata commands are as follows:
svyset seclustr [pweight=ncsrwtsh], strata(sestrat) ///
vce(linearized) singleunit(missing)
svy: proportion mde, over(sex)
lincom [_prop_2]Male - [_prop_2]Female
The output of the lincom command provides the following estimate of the male–
female difference, its standard error, and a 95% CI for the contrast of proportions.

$\hat{\Delta} = p_{male} - p_{female}$    $se(\hat{\Delta})$    $CI_{.95}(\Delta)$
-0.079                                    0.010                 (-0.098, -0.059)
Because the design-based 95% CI for the difference in proportions does not
include 0, the data suggest that the rate of lifetime major depressive episodes for
women is significantly higher than that for men.
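The reported contrast can be roughly reproduced by hand from the row proportions in Table 6.4. The Python sketch below is an approximation under two stated assumptions, not a description of what lincom does internally: it treats the covariance between the male and female estimates as zero, and it uses a normal critical value of 1.96 in place of the design-based t critical value.

```python
import math

# Row proportions for MDE = 1 and their linearized SEs, as reported in Table 6.4.
p_male, se_male = 0.151, 0.008
p_female, se_female = 0.230, 0.006

diff = p_male - p_female

# Assumption: the covariance between the two subpopulation estimates is
# treated as zero. lincom works from the full design-based
# variance-covariance matrix, so this is only an approximation.
se_diff = math.sqrt(se_male**2 + se_female**2)

# Assumption: a normal critical value stands in for the design-based
# t critical value used by the survey software.
z = 1.96
ci = (diff - z * se_diff, diff + z * se_diff)

print(round(diff, 3), round(se_diff, 3), tuple(round(x, 3) for x in ci))
```

With these rounded inputs the difference and standard error match the reported -0.079 and 0.010, and the interval lands within about 0.001 of the reported limits. The remaining gap reflects rounding of the inputs and the two assumptions above.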
6.4.4 Chi-Square Tests of Independence of Rows and Columns
For a 2 × 2 table, the contrast of estimated subpopulation proportions examined in Example 6.7 is equivalent to a test of whether the response variable
(MDE) is independent of the factor variable (SEX). More generally, under
SRS, two categorical variables are independent of each other if the following