Example 6.8: Testing the Independence of Alcohol Dependence and Education Level in Young Adults (Ages 18–28) Using the NCS-R Data
Categorical Data Analysis
Table 6.6
Design-Based Analysis of the Association between NCS-R Alcohol Dependence and Education Level for Young Adults Aged 18–28

Alcohol Dependence Row Proportions (Linearized SE)
Education Level (Grades)    0 = No           1 = Yes          Total
0–11                        0.909 (0.029)    0.091 (0.029)    1.000
12                          0.951 (0.014)    0.049 (0.014)    1.000
13–15                       0.951 (0.010)    0.049 (0.010)    1.000
16+                         0.931 (0.014)    0.069 (0.014)    1.000
Total                       0.940 (0.009)    0.060 (0.009)    1.000
n18–28 = 1,275

Tests of Independence
                  Unadjusted X2                        Rao–Scott F
Statistic         $X^2_{Pearson} = 27.21$              $F_{R-S,Pearson} = 1.64$
Reference         $P(\chi^2_{(3)} > X^2_{Pearson})$    $P(F_{2.75,\,115.53} > F_{R-S})$
p-value           p < 0.0001                           p = 0.18

Parameters of the Rao–Scott Design-Adjusted Test
Design df = 42    GDEFF = 6.62    a = 0.56
school, 3 = some college, 4 = college and above). The analysis is restricted to the
subpopulation of NCS-R Part II respondents 18–28 years of age. After identifying
the complex design features to Stata, we request the cross-tabulation analysis and
any related design-adjusted test statistics by using the svy: tab command:
svyset seclustr [pweight = ncsrwtlg], strata(sestrat)
svy, subpop(if age >= 18 & age < 29): tab ed4cat ald, row se ci deff
ED4CAT is specified as the row (factor) variable and ALD as the column
(response) variable. Weighted estimates of the row proportions are requested
using the row option. Table 6.6 summarizes the estimated row proportions and
standard errors for the ALD × ED4CAT crosstabulation along with the Rao–Scott
F-test of independence.
An estimated 9.1% of young adults in the lowest education group have been
diagnosed with alcohol dependence at some point in their lifetime (95% CI = 4.7%,
17.0%), while an estimated 6.9% of young adults in the highest education group
have been diagnosed with alcohol dependence (95% CI = 4.6%, 10.2%). By default,
Stata reports the standard uncorrected Pearson chi-square test statistic ($X^2_{Pearson}$ =
27.21, p < 0.0001) and then reports the (second-order) design-adjusted Rao–Scott
F-test statistic ($F_{R-S,Pearson}$ = 1.64, p = 0.18) (see Table 6.5). The standard Pearson $X^2$
test rejects the null hypothesis of independence at α = 0.05; however, once the
corrections for the complex sample design are introduced, the Rao–Scott
design-adjusted test fails to reject the null hypothesis of independence between
education and a lifetime diagnosis of alcohol dependence in this younger population.
The appropriate inference in this case is thus that there is no evidence of a
bivariate association between these two categorical factors in this subpopulation.
Multivariate analyses examining additional potential predictors of alcohol
dependence could certainly be pursued at this point (see Chapter 8 for examples).
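As a rough check on the two p-values reported for this example, the reference distributions can be evaluated directly with Stata's built-in tail functions. The sketch below only reuses the test statistics and degrees of freedom printed in Table 6.6; the scalar names are illustrative and are not part of the original analysis.

```stata
* Upper-tail probabilities for the two tests of independence in Table 6.6
* (scalar names are hypothetical; the values are copied from the table)
scalar x2_pearson = 27.21      // unadjusted Pearson chi-square statistic
scalar f_rs       = 1.64       // second-order Rao-Scott F statistic

* Unadjusted test: chi-square with (4-1)*(2-1) = 3 degrees of freedom
scalar p_unadj = chi2tail(3, x2_pearson)

* Design-adjusted test: F with the non-integer df reported by Stata
scalar p_rs = Ftail(2.75, 115.53, f_rs)

display "Unadjusted p-value:      " %8.6f p_unadj
display "Design-adjusted p-value: " %8.4f p_rs
```

The first probability should fall well below 0.0001 and the second should sit near the reported 0.18, which makes the contrast between the naive and design-adjusted inferences easy to see.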
© 2010 by Taylor and Francis Group, LLC
Applied Survey Data Analysis
We remind readers that Stata is using a second-order design correction for the
test statistic, which is why the results of these analyses may differ from those
found using other software packages (note the decimal degrees of freedom for
the design-adjusted F-statistic in Table 6.6, due to the second-order correction).
If a user specifies the deff option, Stata also reports both the mean generalized design effect (GDEFF = 6.63) used in the first-order correction and the
coefficient of variation of the generalized design effects (a = 0.56) used in the
second-order correction.
Additional test statistics, including design-adjusted likelihood ratio and Wald
test statistics, can be requested in Stata by using the lr and wald options for the
svy: tab command. These options do not lead to substantially different conclusions in this illustration and will generally not lead to different inferences about
associations between two categorical variables. As mentioned previously, Stata
developers advocate the use of the second-order design-adjusted Pearson chi-square statistic (or the Rao–Scott chi-square statistic and its F-transformed version)
in all situations involving crosstabulations of two categorical variables measured
in complex sample surveys (Sribney, 1998).
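For illustration, the additional statistics could be requested alongside the default Pearson-based test in a single call. This is a sketch only: it assumes the NCS-R data are in memory and that the design has already been declared with svyset as in Example 6.8, and it writes the age restriction with an explicit & so that both bounds are applied.

```stata
* Assumes the NCS-R data are loaded and svyset has been run as in Example 6.8
* pearson = default Rao-Scott adjusted Pearson test
* lr      = design-adjusted likelihood ratio test
* wald    = design-adjusted Wald test
svy, subpop(if age >= 18 & age < 29): tab ed4cat ald, row pearson lr wald
```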
6.4.5 Odds Ratios and Relative Risks
The odds ratio, which we denote by ψ, can be used to quantify the association
between the levels of a response variable and a categorical factor. Figure 6.7
displays NCS-R weighted estimates (row proportions) of the prevalence of
one or more lifetime episodes of major depression by gender.

The odds ratio compares the odds that the response variable takes a specific
value across two levels of the factor variable. If the response variable
is truly independent of the chosen factor, then ψ = 1.0. For example, from
Figure 6.7, the estimated male (A)/female (B) odds ratio for MDE is
$$\hat{\psi} = \frac{\text{Odds}(MDE = 1 \mid \text{Male})}{\text{Odds}(MDE = 1 \mid \text{Female})} = \frac{p_{1|A}/(1 - p_{1|A})}{p_{1|B}/(1 - p_{1|B})} = \frac{p_{1|A}/p_{0|A}}{p_{1|B}/p_{0|B}} = \frac{0.151/0.849}{0.230/0.770} = 0.595$$
SEX           MDE = 0                                          MDE = 1                                          Total
A (Male)      $p_{0|A} = \hat{N}_{A0}/\hat{N}_{A+} = 0.849$    $p_{1|A} = \hat{N}_{A1}/\hat{N}_{A+} = 0.151$    $p_{A+} = 1.0$
B (Female)    $p_{0|B} = \hat{N}_{B0}/\hat{N}_{B+} = 0.770$    $p_{1|B} = \hat{N}_{B1}/\hat{N}_{B+} = 0.230$    $p_{B+} = 1.0$

Figure 6.7
Estimates of row proportions for MDE by gender.
Note that although this estimate of ψ is computed using the estimated row
proportions for the SEX × MDE table, the same estimate would be obtained
(up to rounding in the displayed proportions) if the estimated total
proportions had been used (Table 6.4):

$$\hat{\psi} = \frac{p_{A1}/p_{A0}}{p_{B1}/p_{B0}} = \frac{0.072/0.407}{0.120/0.402} = 0.595$$
Since this odds ratio is estimated with no additional controls for other
factors such as age or education, it is labeled as an unadjusted odds ratio.
Note that a correct description of this result is the following: “The odds that
adult men experience major depression in their lifetime are estimated to be
only 59.5% as large as the odds for women.” A common mistake in reporting
results for estimated odds ratios is to make a statement like the following:
“The probability that a man experiences an episode of major depression in
their lifetime is 59% of that for women.”
The latter statement is confusing the odds ratio statistic with a related, yet
different, comparative measure, the relative risk (computed here using the
estimates in Table 6.4):

$$\widehat{RR} = \frac{\text{Prob}(MDE = 1 \mid \text{Male})}{\text{Prob}(MDE = 1 \mid \text{Female})} = \frac{p_{1|A}}{p_{1|B}} = \frac{0.151}{0.230} = 0.656$$
The relative risk is the ratio of two conditional probabilities: the probability of MDE for males and the probability of MDE for females. Although
both the odds ratio and the relative risk measure the association of a categorical response and a factor variable, they should be distinguished. Only
in instances where the prevalence of the response of interest is very small for
all levels of the factor (i.e., p1|A and p1|B < 0.01 ) will the odds ratio and relative
risk statistics converge to similar numerical values.
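The distinction between the two measures can be checked with a few lines of arithmetic. The sketch below recomputes the odds ratio and relative risk from the rounded row proportions in Figure 6.7, then repeats the calculation for a rare outcome. The rare-outcome prevalences (0.004 and 0.008) are made up purely for illustration, and all scalar names are hypothetical. Because the inputs are rounded, the displayed values may differ from the printed ones in the third decimal place.

```stata
* Odds ratio and relative risk from the row proportions in Figure 6.7
scalar p1A = 0.151             // estimated P(MDE = 1 | Male)
scalar p1B = 0.230             // estimated P(MDE = 1 | Female)
scalar or_mde = (p1A/(1 - p1A)) / (p1B/(1 - p1B))
scalar rr_mde = p1A / p1B
display "MDE:          OR = " %6.3f or_mde "   RR = " %6.3f rr_mde

* Hypothetical rare outcome (made-up prevalences, for illustration only)
scalar q1A = 0.004
scalar q1B = 0.008
scalar or_rare = (q1A/(1 - q1A)) / (q1B/(1 - q1B))
scalar rr_rare = q1A / q1B
display "Rare outcome: OR = " %6.3f or_rare "   RR = " %6.3f rr_rare
```

For MDE the two measures differ noticeably, while for the rare outcome they nearly coincide, since 1 − p is close to 1 in both groups.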
If the response and factor variables are independent, then
ψ = 1.0 (and RR = 1.0). Therefore, to test if categorical response and factor
variables are independent, it would be reasonable to construct a confidence
interval of the form $\hat{\psi} \pm t_{1-\alpha/2,\,df} \cdot se(\hat{\psi})$, and establish whether the null value
of ψ = 1 is contained within the interval. Although a TSL approximation to
$se(\hat{\psi})$ can be derived directly, a CI for ψ is generally obtained from the
technique of simple logistic regression.
6.4.6 Simple Logistic Regression to Estimate the Odds Ratio
Logistic regression for binary dependent variables will be covered in depth
in Chapter 8. Here, the logit function and simple logistic regression models
are briefly introduced to demonstrate their application to estimation of the
unadjusted odds ratio and its confidence interval.
The natural logarithm of the odds is termed a logit function. Again, using
the NCS-R MDE example in Table 6.4, the logits of the probabilities of MDE
for the male and female factor levels are

$$\text{logit}(p_{1|A}) = \ln(\text{Odds}(MDE = 1 \mid \text{Male})) = \ln\left(\frac{p_{1|A}}{1 - p_{1|A}}\right) = \ln\left(\frac{0.151}{0.849}\right) = -1.727$$

$$\text{logit}(p_{1|B}) = \ln(\text{Odds}(MDE = 1 \mid \text{Female})) = \ln\left(\frac{p_{1|B}}{1 - p_{1|B}}\right) = \ln\left(\frac{0.230}{0.770}\right) = -1.208$$
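The two logits are easy to reproduce, and their difference is the log of the odds ratio computed earlier. The following sketch uses only the rounded proportions 0.151 and 0.230 from the text; the scalar names are illustrative.

```stata
* Logits for males (A) and females (B) from the estimated row proportions
scalar logit_A = ln(0.151/(1 - 0.151))
scalar logit_B = ln(0.230/(1 - 0.230))
display "logit(p1|A) = " %7.3f logit_A
display "logit(p1|B) = " %7.3f logit_B

* The difference in logits is the log odds ratio (male vs. female)
scalar or_from_logits = exp(logit_A - logit_B)
display "exp(logit_A - logit_B) = " %6.3f or_from_logits
```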
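Consider a single indicator variable, $I_{male}$, coded 1 = male and 0 = female,
that distinguishes the two levels of SEX. The outcome MDE is coded 1 =
yes, 0 = no. A simple logistic regression model for these data is written
as follows:

$$\text{logit}(P(MDE = 1 \mid I_{male})) = \beta_0 + \beta_1 \cdot I_{male}$$

Under this model the fitted logit is $\hat{\beta}_0 + \hat{\beta}_1$ for males and $\hat{\beta}_0$ for females, so the
estimated odds ratio is

$$\hat{\psi} = \frac{p_{1|A}/(1 - p_{1|A})}{p_{1|B}/(1 - p_{1|B})} = \frac{\exp(\text{logit}(p_{1|A}))}{\exp(\text{logit}(p_{1|B}))} = \frac{\exp(\hat{\beta}_0 + \hat{\beta}_1 \cdot 1)}{\exp(\hat{\beta}_0 + \hat{\beta}_1 \cdot 0)} = \exp(\hat{\beta}_1)$$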
Then, we can derive the following result:

$$CI(\psi) = \left(\exp(\hat{\beta}_1 - t_{1-\alpha/2,\,df} \cdot se(\hat{\beta}_1)),\ \exp(\hat{\beta}_1 + t_{1-\alpha/2,\,df} \cdot se(\hat{\beta}_1))\right)$$
The resulting confidence interval is not symmetric about the estimated
odds ratio but has been shown to provide more accurate coverage of the true
population value for a specified level of Type I error (α).
Example 6.9: Simple Logistic Regression to Estimate the NCS-R Male/Female Odds Ratio for Lifetime Major Depressive Episode
As mentioned previously, logistic regression will be covered in detail in later chapters. Here, a simple logistic regression of the NCS-R MDE variable on the indicator
of male gender (SEXM) is used to illustrate the technique for estimating the unadjusted Male/Female odds ratio for MDE and a 95% CI for that odds ratio:
svyset seclustr [pweight = ncsrwtsh], strata(sestrat)
svy: logistic mde sexm
From the output provided by the svy: logistic command, the estimated
odds ratio and a 95% CI for the population odds ratio are as follows:
$\hat{\psi}_{MDE}$ (SE)        $CI_{.95}(\psi)$
0.597 (0.041)        (0.520, 0.685)
Based on this analysis, the odds that an adult male has experienced a lifetime
MDE are only 59.7% as large as the odds of MDE for adult females, which agrees
(allowing for some rounding error) with the simple direct calculation. Since the
95% CI does not include ψ = 1, we would reject the null hypothesis that MDE
status is independent of gender.
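The asymmetric interval can be approximately reconstructed from the reported odds ratio and its standard error. The sketch below makes two assumptions that are not stated in the example: that the reported SE is on the odds ratio scale (so that $se(\hat{\beta}_1) \approx se(\hat{\psi})/\hat{\psi}$ by the delta method), and that the design degrees of freedom equal 42, the value listed in Table 6.6. All scalar names are hypothetical.

```stata
* Values copied from the svy: logistic results in Example 6.9
scalar or_hat = 0.597          // estimated odds ratio
scalar se_or  = 0.041          // SE, assumed to be on the odds ratio scale
scalar ddf    = 42             // design df, assumed equal to Table 6.6 value

* Move to the log-odds scale; delta method gives se(b1) ~ se(OR)/OR
scalar b1    = ln(or_hat)
scalar se_b1 = se_or / or_hat

* 95% CI on the log-odds scale, then exponentiate both limits
scalar tcrit = invttail(ddf, 0.025)
scalar ci_lo = exp(b1 - tcrit*se_b1)
scalar ci_hi = exp(b1 + tcrit*se_b1)
display "Lower limit: " %6.3f ci_lo
display "Upper limit: " %6.3f ci_hi
```

Because the inputs are rounded to three decimals, the limits should land close to, but not necessarily exactly on, the reported (0.520, 0.685). The upper limit lies farther from 0.597 than the lower limit does, which is the asymmetry described in Section 6.4.6.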
6.4.7 Bivariate Graphical Analysis
Graphical displays also are useful tools to describe the bivariate distribution
of two categorical variables. The following Stata graphics command generates gender-specific vertical bar charts for the BP_CAT variable generated
in Example 6.3 (note that the pweight option is used to specify the survey
weights, and the over() option is used to generate a plot for each level of
gender). The output is shown in Figure 6.8.
* Weighted bar chart of blood pressure category, one panel per gender
graph bar (mean) bp_cat1 bp_cat2 bp_cat3 bp_cat4 ///
    [pweight=wtmec2yr] if age18p==1, blabel(bar, format(%9.1f) ///
    color(none)) bar(1,color(gs12)) bar(2,color(gs4)) ///
    bar(3,color(gs8)) bar(4,color(black)) ///
    bargap(7) scheme(s2mono) over(riagendr) percentages ///
    legend(label(1 "Normal") label(2 "Pre-Hypertensive") ///
    label(3 "Stage 1 Hypertensive") ///
    label(4 "Stage 2 Hypertensive")) ytitle("Percentage")
[Bar chart. Vertical axis: Percentage. Groups: Male, Female. Bars within each group: Normal, Pre-Hypertensive, Stage 1 Hypertensive, Stage 2 Hypertensive.]

Figure 6.8
Bar chart of the estimated distribution of blood pressure status of U.S. adult men and women.
(Modified from the 2005–2006 NHANES data.)