Example 8.1: Examining Predictors of a Lifetime Major Depressive Episode in the NCS-R Data


Applied Survey Data Analysis

Figure 8.3
Plot of estimated odds ratios, showing the interaction between gender and education in the
arthritis model. (Modified from the 2006 HRS data.)
[Plot not reproduced. Vertical axis: Odds Ratio (Arthritis); horizontal axis: Education Level; separate lines for Male and Female.]

value of 1 for persons who meet lifetime criteria for major depression and 0 for all
others. The following predictors are considered: AG4CAT (a categorical variable
measuring age brackets, including 18–29, 30–44, 45–59, and 60+), SEX (1 = Male,
2 = Female), ALD (an indicator of any lifetime alcohol dependence), ED4CAT
(a categorical variable measuring education brackets, including 0–11 years, 12
years, 13–15 years, and 16+ years), and MAR3CAT (a categorical variable measuring marital status, with values 1 = “married,” 2 = “separated/widowed/divorced,”
and 3 = “never married”). The primary research question of analytical interest is
whether MDE is related to alcohol dependence after adjusting for the effects of the
previously listed demographic factors.

8.7.1 Stage 1: Model Specification
The analysis session begins by specifying the complex design features of
the NCS-R sample in the Stata svyset command. Note that we specify the
“long” or Part 2 NCS-R sampling weight (NCSRWTLG) in the svyset command. This is due to the use of the alcohol dependence variable in the analysis, which was measured in Part 2 of the NCS-R survey.
There are 42 sampling error strata and 84 sampling error computation
units (two per stratum) in the NCS-R sampling error calculation model,
resulting in 42 design-based degrees of freedom.
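
The degrees-of-freedom count above can be checked with a short calculation. The following Python sketch assumes the usual rule for this kind of sampling error calculation model (number of sampling error computation units minus number of strata); the function and variable names are illustrative and do not come from the NCS-R documentation.

```python
def design_degrees_of_freedom(n_clusters: int, n_strata: int) -> int:
    """Design-based df: sampling error computation units minus strata."""
    if n_strata <= 0 or n_clusters <= n_strata:
        raise ValueError("need at least one stratum and more clusters than strata")
    return n_clusters - n_strata


# Values stated in the text: 42 strata with two computation units each.
NCSR_STRATA = 42
UNITS_PER_STRATUM = 2

ncsr_clusters = NCSR_STRATA * UNITS_PER_STRATUM
ncsr_df = design_degrees_of_freedom(ncsr_clusters, NCSR_STRATA)

print(ncsr_clusters, ncsr_df)
```

These 42 degrees of freedom are the reference value behind the t and F statistics reported in the tables of this example.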
Following the recommendations of Hosmer and Lemeshow (2000), the
model building begins by examining the bivariate associations of MDE with
each of the potential predictor variables. Since the candidate predictors are
all categorical variables, the bivariate relationship of each predictor with

© 2010 by Taylor and Francis Group, LLC


Logistic Regression and Generalized Linear Models

MDE is analyzed in Stata by using the svy: tab command and requesting
row percentages (as discussed in Chapter 6).
svy: tab ag4cat mde, row
svy: tab sex mde, row
svy: tab ald mde, row
svy: tab ed4cat mde, row
svy: tab mar3cat mde, row

Table 8.5 presents the results of these bivariate analyses, including the Rao–
Scott F-tests of association. The table also presents estimates of the percentages of each predictor category that received the lifetime MDE diagnosis.
Based on these initial tests of association, all of the predictor variables
appear to have significant bivariate associations with MDE, including ALD,
the indicator of lifetime alcohol dependence, and all of the predictors appear
to be good candidates for inclusion in the initial multivariate logistic regression model.

Table 8.5
Initial Bivariate Design-Based Tests of Association Assessing Potential Predictors of Lifetime Major Depressive Episode (MDE) for the NCS-R Adult Sample

Predictor   Rao–Scott F-test(a)                  Category     % with MDE (SE)
AG4CAT      F(2.76,115.97) = 26.39, p < 0.01     18–29        18.4 (0.9)
                                                 30–44        22.9 (1.1)
                                                 45–59        22.3 (1.3)
                                                 60+          11.1 (1.0)
SEX         F(1,42) = 44.83, p < 0.01            Male         15.3 (0.9)
                                                 Female       22.6 (0.7)
ALD         F(1,42) = 120.03, p < 0.01           Yes          45.2 (0.3)
                                                 No           17.7 (0.7)
ED4CAT      F(2.90,121.93) = 4.30, p < 0.01      <12 yrs      16.3 (1.2)
                                                 12 yrs       18.6 (0.8)
                                                 13–15 yrs    21.3 (1.0)
                                                 16+ yrs      19.7 (1.1)
MAR3CAT     F(1.90,79.74) = 11.08, p < 0.01      Married      17.3 (0.7)
                                                 Previously   23.9 (1.5)
                                                 Never        19.4 (1.1)

(a) See Chapter 6 for more details on the derivation of this test statistic, which
can be used to test the null hypothesis of no association between the predictor variable and the outcome variable (MDE).

8.7.2 Stage 2: Model Estimation
The next step in the Hosmer–Lemeshow (2000) model building procedure is
to fit the “initial” multivariate model, examining the main effects for all five
predictors. Stata provides two programs for fitting the multivariate logistic
regression model: svy: logistic and svy: logit. The default output for
the svy: logistic command is estimates of adjusted odds ratios and 95%
confidence intervals for the adjusted odds ratios. Note the use of the char
command and specification of the omitted reference category for SEX (2, or
females) prior to the svy: logistic command. This command allows the
user to specify custom reference categories rather than accepting the default
of having the lowest category omitted:
char sex[omit] 2
xi: svy: logistic mde i.ag4cat i.sex ald i.ed4cat i.mar3cat

In svy: logistic, estimates of the parameters and standard errors for
the logit model can be requested by using the coef option.
xi: svy: logistic mde i.ag4cat i.sex ald i.ed4cat ///
i.mar3cat, coef

Stata users who wish to see the estimated logistic regression coefficients
and standard errors may also use the companion program, svy: logit.
xi: svy: logit mde i.ag4cat i.sex ald i.ed4cat i.mar3cat

Estimated odds ratios and 95% CIs can be generated in svy: logit by
adding the or option:
xi: svy: logit mde i.ag4cat i.sex ald i.ed4cat ///
i.mar3cat, or

Other software systems (e.g., SAS PROC SURVEYLOGISTIC) will output
both the estimated logistic regression coefficients (and standard errors) and
the corresponding odds ratio estimates. (Readers should be aware that the
svy: logit and svy: logistic commands differ slightly in the procedures for calculation of the standard errors for odds ratios.)
In each form of the Stata svy: logistic or svy: logit command,
the dependent variable, MDE, is listed first, followed by the predictor variables (and their interactions, if applicable). Note that the xi: modifier is used
in conjunction with the svy: logistic and svy: logit commands to
indicate those predictor variables that are categorical in nature: AG4CAT,
ED4CAT, SEX, and MAR3CAT. (Because the ALD variable is already coded
as either 0 or 1, it does not require program-generated indicator variables.)
The xi: specification of the model instructs Stata to create a set of indicator
variables for each independent variable preceded by the i. prefix. Stata will
create K – 1 indicator variables to represent the K levels of the categorical
predictor. By default, Stata will select the lowest valued category as the reference category and only the K – 1 indicators for the remaining categories will
be included as predictors in the model.
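
The indicator coding described above can be mimicked in a few lines of Python. This is only a sketch of the idea (K – 1 indicators per categorical predictor, with an optional custom omitted category); it is not Stata's implementation, the helper name indicator_columns is hypothetical, the toy data values are made up for illustration, and the _I<name>_<level> naming simply imitates the indicator names that appear in the test commands of this example.

```python
def indicator_columns(name, values, omit=None):
    """Build K - 1 zero/one indicators for a categorical variable.

    The lowest level is the reference unless `omit` names another level.
    """
    levels = sorted(set(values))
    reference = levels[0] if omit is None else omit
    if reference not in levels:
        raise ValueError(f"omitted level {reference!r} not found in {name}")
    return {
        f"_I{name}_{level}": [1 if v == level else 0 for v in values]
        for level in levels
        if level != reference
    }


# Toy data (illustrative only): age category 1-4 and sex 1 = Male, 2 = Female.
ag4cat = [1, 2, 3, 4, 2, 1]
sex = [1, 2, 2, 1, 2, 1]

age_ind = indicator_columns("ag4cat", ag4cat)   # default: lowest level omitted
sex_ind = indicator_columns("sex", sex, omit=2)  # females as the reference

print(list(age_ind))
print(list(sex_ind))
```

With four age categories, three indicators are produced; with sex and females omitted, a single indicator for males remains, which matches the single SEX row (Male) in the fitted model.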

Table 8.6
Estimated Logistic Regression Model for the Lifetime MDE Outcome (Output Generated by Using the svy: logit Command)

Predictor(a)   Category      B̂        se(B̂)    t        P(t42 > t)
INTERCEPT                    –1.583    0.121    –13.12   < 0.001
AG4CAT         30–44         0.255     0.094    2.71     0.01
               45–59         0.206     0.092    2.26     0.029
               60+           –0.676    0.141    –4.78    < 0.001
SEX            Male          –0.577    0.077    –7.48    < 0.001
ALD            Yes           1.424     0.154    9.24     < 0.001
ED4CAT         12            0.079     0.097    0.82     0.418
               13–15         0.231     0.093    2.48     0.017
               16+           0.163     0.111    1.47     0.148
MAR3CAT        Previously    0.486     0.085    5.69     < 0.001
               Never         0.116     0.108    1.07     0.290

Source: Analysis based on the NCS-R data.
Notes: n = 5,692. Adjusted Wald test for all parameters: F(10,33) = 28.07, p < 0.001.
(a) Reference categories for categorical predictors are: AG4CAT (18–29);
SEX (Female); ALD (No); ED4CAT (<12 yrs); MAR3CAT (Married).

Tables 8.6 and 8.7 summarize the output generated by fitting the MDE
logistic regression model. The initial model includes main effects for the
chosen predictor variable candidates but at this point does not include any
interactions between the predictors.
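
The link between the coefficients in Table 8.6 and the odds ratios in Table 8.7 can be reproduced by hand. The Python sketch below exponentiates each estimated coefficient and the endpoints of its confidence interval on the logit scale. It assumes a critical value of about 2.018 for the 97.5th percentile of a t distribution with 42 degrees of freedom (hard-coded here rather than computed), and the function and dictionary names are illustrative. Because the published coefficients are rounded to three decimals, the results agree with Table 8.7 only up to rounding.

```python
import math

# Assumption: approximate 97.5th percentile of Student's t with 42 df.
T_CRIT_42 = 2.018

# (estimate, standard error) pairs copied from Table 8.6.
table_8_6 = {
    "AG4CAT 30-44": (0.255, 0.094),
    "AG4CAT 45-59": (0.206, 0.092),
    "AG4CAT 60+": (-0.676, 0.141),
    "SEX Male": (-0.577, 0.077),
    "ALD Yes": (1.424, 0.154),
    "ED4CAT 12": (0.079, 0.097),
    "ED4CAT 13-15": (0.231, 0.093),
    "ED4CAT 16+": (0.163, 0.111),
    "MAR3CAT Previously": (0.486, 0.085),
    "MAR3CAT Never": (0.116, 0.108),
}


def odds_ratio_with_ci(b, se, t_crit=T_CRIT_42):
    """Return (odds ratio, lower limit, upper limit) from a logit coefficient."""
    return math.exp(b), math.exp(b - t_crit * se), math.exp(b + t_crit * se)


results = {label: odds_ratio_with_ci(b, se) for label, (b, se) in table_8_6.items()}

for label, (odds_ratio, lower, upper) in results.items():
    print(f"{label:20s} {odds_ratio:5.2f}  ({lower:.3f}, {upper:.3f})")
```

An interval that excludes 1 on the odds ratio scale corresponds to a coefficient whose interval excludes 0 on the logit scale.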
8.7.3 Stage 3: Model Evaluation
The adjusted Wald tests in Stata for the AG4CAT, ED4CAT, and MAR3CAT
categorical predictors in this initial model are generated by using the
test command:
test _Iag4cat_2 _Iag4cat_3 _Iag4cat_4
test _Imar3cat_2 _Imar3cat_3
test _Ied4cat_2 _Ied4cat_3 _Ied4cat_4

Note that we do not request these multiparameter Wald tests for the SEX
and ALD predictor variables, because they are represented by single indicator variables in the regression model and the overall Wald test for each
predictor is equivalent to the t-test reported for the single estimated parameter for that predictor. Further, note in each of the test statements that we
include the K – 1 indicator variables generated by Stata for each of the categorical predictors (e.g., _Iag4cat_2) when the xi: modifier is used to
identify categorical predictor variables. Stata users can find the names of
these indicators in the Stata Variables window once the model has been fitted. Table 8.8 provides the design-adjusted F-versions of the resulting Wald
test statistics and associated p-values.

Table 8.7
Estimates of Adjusted Odds Ratios for the Lifetime MDE Outcome

Predictor(a)   Category      ψ̂       95% CI for ψ
AG4CAT         30–44         1.29     (1.067, 1.562)
               45–59         1.23     (1.022, 1.479)
               60+           0.51     (0.383, 0.677)
SEX            Male          0.56     (0.480, 0.656)
ALD            Yes           4.15     (3.042, 5.668)
ED4CAT         12            1.08     (0.890, 1.316)
               13–15         1.26     (1.044, 1.519)
               16+           1.18     (0.941, 1.471)
MAR3CAT        Previously    1.63     (1.369, 1.932)
               Never         1.12     (0.903, 1.396)

Source: Analysis based on the NCS-R data.
Notes: n = 5,692. Adjusted Wald test for all parameters: F(10,33) = 28.07, p < 0.001.
(a) Reference categories for categorical predictors
are: AG4CAT (18–29); SEX (Female); ALD (No);
ED4CAT (<12 yrs); MAR3CAT (Married).

Two of the three design-adjusted Wald tests are significant at the 0.01 level.
The exception is the Wald test for education [F(3,40) = 2.13, p = 0.112], which
suggests that the parameters associated with education in this logistic regression model are not significantly different from zero and that education may
not be an important predictor of lifetime MDE when adjusting for the relationships of the other predictor variables with the outcome. If the objective
of the model-building process is the construction of a parsimonious model,
education could probably be dropped as a predictor at this point. For the
purposes of this illustration (and because of the marginal significance), we
will retain education in the model moving forward.

Table 8.8
Design-Adjusted Wald Tests for the Parameters Associated with the Categorical Predictors in the Initial MDE Logistic Regression Model

Categorical Predictor   F-Test Statistic    P-value
AG4CAT                  F(3,40) = 19.03     < 0.001
ED4CAT                  F(3,40) = 2.13      0.112
MAR3CAT                 F(2,40) = 16.60     < 0.001

Source: Analysis based on the NCS-R data.

8.7.4 Stage 4: Model Interpretation/Inference
Based on the results in Table 8.6 and Table 8.8, it appears that each of the predictors in the multivariate model has a significant (or marginally significant)
relationship with the probability of MDE after adjusting for the relationships
of the other predictors. Focusing on the primary predictor variable of interest, we see that the odds of having had a major depressive episode at some
point in the lifetime are multiplied by 4.15 when a person has had a diagnosis of alcohol dependence at some point in his or her lifetime, when adjusting
for the relationships of age, sex, education, and marital status. Of course, this
model does not allow for any kind of causal inference, given that time ordering of the events is not available in the NCS-R data set; we can, however,
conclude that there is strong evidence of an association between the two
disorders in this finite population when adjusting for other demographic
covariates. We also note that relative to married respondents, respondents
who were previously married have significantly higher (63% higher) odds
of having had a major depressive episode in their lifetime when adjusting
for the other covariates. Further, middle-age respondents have significantly
higher odds of lifetime MDE (relative to younger respondents), while older
respondents and males have significantly reduced odds of lifetime MDE
(again relative to younger respondents and females).
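
Statements such as "63% higher odds" follow directly from the adjusted odds ratios. As a small illustration, the Python sketch below converts three odds ratios reported in Table 8.7 into percentage changes in the odds relative to the reference category; the function name and dictionary labels are illustrative.

```python
def percent_change_in_odds(odds_ratio: float) -> float:
    """Percentage change in the odds relative to the reference category."""
    return (odds_ratio - 1.0) * 100.0


# Adjusted odds ratios copied from Table 8.7.
adjusted_or = {
    "Previously married vs. married": 1.63,
    "Age 60+ vs. 18-29": 0.51,
    "Male vs. female": 0.56,
}

changes = {label: percent_change_in_odds(value) for label, value in adjusted_or.items()}

for label, change in changes.items():
    direction = "higher" if change > 0 else "lower"
    print(f"{label}: odds about {abs(change):.0f}% {direction}")
```

These percentages describe changes in the odds, not in the probability of lifetime MDE.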
Respondent age is represented in the model as four grouped categories of
age. Including grouped categories for age (or recoded categories of any continuous predictor, more generally) in a logistic regression model will result
in estimates of the expected contrasts in log-odds for respondents in each
of the defined categories, relative to the reference category. Since the model
parameters are estimated separately for each defined age group (with age
18–29 as the reference), the model will capture any nonlinearity of effect in
the ordered age groupings. Inspecting the estimated coefficients and odds
ratios for the grouped age categories in Table 8.6 and Table 8.7, it appears
that there is significant nonlinearity in the effect of age on the probability
of MDE. Relative to the 18–29-year-old group, the odds of MDE are multiplied by
factors of 1.29 (aged 30–44) and 1.23 (aged 45–59) for the middle-age ranges
but by a factor of 0.51 in the age 60 and older group. Such nonlinear effects of age are common in models of human disorders and are possibly attributable to normal processes of aging and selective mortality. If the
example model were estimated with age (in years) as a continuous predictor
variable, at this stage in the model-building process the analyst would reestimate the model including both the linear and quadratic terms for age.
Therefore, at this stage in the model-building process, we have chosen to
retain all of the candidate main effects. Next, we apply Archer and Lemeshow’s
(2006) design-adjusted test to assess the goodness of fit of this initial model
(assuming that this procedure has been downloaded and installed):

svylogitgof

The resulting design-adjusted F-statistic reported in the Stata Results
window is equal to FA-L = 1.229, with a p-value of 0.310. This suggests that
the null hypothesis that the model fits the data well is not rejected. We
therefore have confidence moving forward that the fit of this initial model
is acceptable.
Next, we consider testing some scientifically relevant two-way interactions
between the candidate predictor variables. For illustration purposes, we suppose that possible two-way interactions of sex with the other four covariates
measuring age, lifetime alcohol dependence, education, and marital status
are of interest, if sex is posited by an NCS-R analyst as being a possible moderator of the relationships of these other four covariates with lifetime MDE.
We fit a model including these two-way interactions in Stata using the following command:
xi: svy: logistic mde i.ag4cat*i.sex i.sex*ald ///
i.ed4cat*i.sex i.mar3cat*i.sex, coef

Note how the interactions are specified in this command. When the xi:
modifier is used for a regression command, the products of the two factors
listed after the dependent variable specify that the regression parameters
associated with each individual factor should be included in the regression
model in addition to the parameters associated with the relevant cross-product terms defined by the interaction (e.g., the indicator for AG4CAT = 2 × the
indicator for SEX = 1). In other words, listing AG4CAT and SEX in addition
to the previous product terms would be redundant, and the main effects are
included in the model by default when the interaction terms are specified.
Table€8.9 presents the estimates of the regression parameters in this model
generated by executing the previous command in Stata.
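
To see how a main effect and a cross-product term combine, consider the ALD and ALD × SEX estimates reported in Table 8.9. The Python sketch below is purely illustrative (the variable and function names are not part of the Stata output): it adds the interaction coefficient to the main effect to obtain the alcohol dependence contrast for males and leaves the main effect alone for females, the reference category for SEX.

```python
import math

# Estimates copied from Table 8.9 (logit scale).
b_ald = 1.553          # ALD = Yes, applies to the reference sex (female)
b_ald_x_male = -0.200  # additional ALD contrast for males (Yes x Male)


def ald_log_odds_contrast(male: bool) -> float:
    """Log-odds contrast for ALD = Yes vs. No within one sex."""
    return b_ald + (b_ald_x_male if male else 0.0)


or_female = math.exp(ald_log_odds_contrast(male=False))
or_male = math.exp(ald_log_odds_contrast(male=True))

print(f"ALD odds ratio, females: {or_female:.2f}")
print(f"ALD odds ratio, males:   {or_male:.2f}")
print(f"ratio of odds ratios:    {or_male / or_female:.3f}")
```

The ratio of the two sex-specific odds ratios equals the exponentiated interaction coefficient, which is why a test of that single coefficient is a test of whether the ALD association differs by sex.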
At this point, the statistical question is whether these two-way interactions
are making a significant additional contribution or improvement to the fit
of this model to the NCS-R data. That is, are any of the parameters associated with the two-way interaction terms significantly different from 0? We
can test this hypothesis by once again using design-adjusted Wald tests. The
relevant interaction terms for the regression model are automatically generated by Stata and included in the data set when using the xi: modifier, so
the cross-product terms in the test commands that follow can be easily
selected from the Stata Variables window:
test _Iag4Xsex_2_1 _Iag4Xsex_3_1 _Iag4Xsex_4_1
test _IsexXald_1
test _Ied4Xsex_2_1 _Ied4Xsex_3_1 _Ied4Xsex_4_1
test _ImarXsex_2_1 _ImarXsex_3_1

Table 8.9
Estimated Logistic Regression Model for Lifetime MDE, Including First-Order Interactions of the Other Predictor Variables with SEX

Predictor(a)    Category             B̂        se(B̂)    t        P(t42 > t)
INTERCEPT       Constant             –1.600    0.134    –11.94   < 0.001
AG4CAT          30–44                0.220     0.114    1.94     0.059
                45–59                0.214     0.102    2.09     0.042
                60+                  –0.646    0.175    –3.68    0.001
SEX             Male                 –0.546    0.357    –1.53    0.134
ALD             Yes                  1.553     0.211    7.36     < 0.001
ED4CAT          12                   0.131     0.084    1.56     0.126
                13–15                0.297     0.117    2.54     0.015
                16+                  0.242     0.152    1.59     0.118
MAR3CAT         Previously           0.418     0.111    3.78     < 0.001
                Never                0.017     0.130    0.13     0.894
AG4CAT × SEX    30–44 × Male         0.097     0.201    0.48     0.633
                45–59 × Male         0.002     0.213    0.01     0.990
                60+ × Male           –0.038    0.302    –0.13    0.901
ALD × SEX       Yes × Male           –0.200    0.242    –0.83    0.413
ED4CAT × SEX    12 × Male            –0.138    0.271    –0.51    0.614
                13–15 × Male         –0.169    0.269    –0.63    0.534
                16+ × Male           –0.194    0.344    –0.56    0.576
MAR3CAT × SEX   Previously × Male    0.182     0.208    0.88     0.385
                Never × Male         0.232     0.212    1.09     0.280

Source: Analysis based on the NCS-R data.
Notes: n = 5,692. Adjusted Wald test for all parameters: F(19,24) = 17.15, p < 0.001.
(a) Reference categories for categorical predictors are: AG4CAT (18–29 yrs);
SEX (female); ALD (no); ED4CAT (<12 yrs); MAR3CAT (married).

Based on test results presented in Table 8.10, we fail to reject the null
hypotheses for all four of the tests, suggesting that these two-way interactions are actually not making a significant contribution to the fit of the
model. We therefore do not consider these two-way interactions any further
and would proceed with making inferences based on the estimates from the
model presented in Table 8.6.
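
The degrees of freedom attached to the interaction tests in Table 8.10 follow a simple pattern. The Python sketch below assumes that the design-adjusted Wald statistic for k parameters is referred to an F distribution with k numerator and d – k + 1 denominator degrees of freedom, where d = 42 is the design-based degrees of freedom; this assumption is consistent with the values reported in Table 8.10 and with the overall tests in the table notes. The function name is hypothetical.

```python
def adjusted_wald_df(k: int, design_df: int = 42) -> tuple:
    """(numerator, denominator) df under the assumed adjusted Wald convention."""
    if not 1 <= k <= design_df:
        raise ValueError("k must be between 1 and the design degrees of freedom")
    return k, design_df - k + 1


# Number of parameters tested for each interaction in Table 8.10.
interaction_tests = {
    "AG4CAT x SEX": 3,
    "ALD x SEX": 1,
    "ED4CAT x SEX": 3,
    "MAR3CAT x SEX": 2,
}

for term, k in interaction_tests.items():
    num, den = adjusted_wald_df(k)
    print(f"{term:14s} F({num},{den})")

# Overall tests: 10 parameters (main effects model), 19 (with interactions).
print(adjusted_wald_df(10), adjusted_wald_df(19))
```

The denominator shrinks as more parameters are tested jointly, so a joint test of many parameters has less room to detect departures from the null hypothesis with only 42 design-based degrees of freedom.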

Table 8.10
Design-Adjusted Wald Tests of First-Order Interactions of Sex and Other Categorical Predictors in the MDE Logistic Regression Model

Interaction Term   F-Test Statistic   P(F > F)
AG4CAT × SEX       F(3,40) = 0.25     0.863
ALD × SEX          F(1,42) = 0.68     0.413
ED4CAT × SEX       F(3,40) = 0.13     0.944
MAR3CAT × SEX      F(2,41) = 0.77     0.472

Source: Analysis based on the NCS-R data.

8.8 Comparing the Logistic, Probit, and Complementary Log–Log GLMs for Binary Dependent Variables
This chapter has focused on logistic regression techniques for modeling
π(x) for a binary dependent variable. As discussed in Section 8.2, alternative
generalized linear models for a binary dependent variable may be estimated
using the probit or CLL link function. In discussing these alternative GLMs,
we noted that inferences derived from logistic, probit and CLL regression
models should generally be consistent.
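
The three link functions can be compared numerically through their inverses, which map a value of the linear predictor to a probability π(x). The Python sketch below uses only the standard library; the function names are illustrative, and the grid of linear predictor values is arbitrary.

```python
import math


def inv_logit(eta: float) -> float:
    """Inverse logit: pi = exp(eta) / (1 + exp(eta))."""
    return 1.0 / (1.0 + math.exp(-eta))


def inv_probit(eta: float) -> float:
    """Inverse probit: standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))


def inv_cloglog(eta: float) -> float:
    """Inverse complementary log-log: pi = 1 - exp(-exp(eta))."""
    return 1.0 - math.exp(-math.exp(eta))


for eta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(
        f"eta = {eta:5.1f}  logit: {inv_logit(eta):.3f}  "
        f"probit: {inv_probit(eta):.3f}  cloglog: {inv_cloglog(eta):.3f}"
    )
```

The logit and probit curves are symmetric about a probability of 0.5, whereas the CLL curve is not. The same value of the linear predictor also implies different probabilities under each link, so coefficients from the three models are on different scales even when the resulting inferences agree.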
To illustrate, consider the results in Table 8.11 for a side-by-side comparison of estimated logistic, probit, and CLL regression models. The example
used for this comparison is a model of the probability that a U.S. adult is
alcohol dependent. The data are from the NCS-R long interview (or Part 2
of the survey), and each model includes the same demographic main effects
considered in Section 8.7 for the model of MDE: SEX, AG4CAT, ED4CAT,
and MAR3CAT. The Stata commands for the estimation of the three models
follow (note the use of the char sex[omit] 2 syntax to specify the desired
omitted category for sex):
char sex[omit] 2
xi: svy: logit ald i.ag4cat i.sex i.ed4cat i.mar3cat
test _Iag4cat_2 _Iag4cat_3 _Iag4cat_4
test _Ied4cat_2 _Ied4cat_3 _Ied4cat_4
test _Imar3cat_2 _Imar3cat_3
xi: svy: probit ald i.ag4cat i.sex i.ed4cat i.mar3cat
test _Iag4cat_2 _Iag4cat_3 _Iag4cat_4
test _Ied4cat_2 _Ied4cat_3 _Ied4cat_4
test _Imar3cat_2 _Imar3cat_3
xi: svy: cloglog ald i.ag4cat i.sex i.ed4cat i.mar3cat
test _Iag4cat_2 _Iag4cat_3 _Iag4cat_4
test _Ied4cat_2 _Ied4cat_3 _Ied4cat_4
test _Imar3cat_2 _Imar3cat_3

Table 8.11 presents a summary of the estimated coefficients, standard
errors, and p-values for simple hypothesis tests of the form H0: Bj = 0.
Table 8.12 presents the results for the Wald tests of the overall age, education, and marital status effects. Note that although the coefficients and
standard errors for the probit model show the expected difference in scale
