8.6 Building the Logistic Regression Model: Stage 4, Interpretation and Inference
Applied Survey Data Analysis
associated with a level of a categorical predictor is the natural logarithm of
the odds ratio comparing the odds that y = 1 for the level represented by the
indicator to the odds that y = 1 for the reference level of the categorical variable. Consequently, the estimated coefficients are often labeled the log-odds
for the corresponding predictor in the model. A design-based confidence
interval for the logistic regression parameter is computed as
$$CI_{1-\alpha}(B_j) = \hat{B}_j \pm t_{df,\,1-\alpha/2} \cdot se(\hat{B}_j) \qquad (8.20)$$
Typically, α = 0.05 is used (along with the design-based degrees of freedom, df), and the result is a 95% confidence interval for the parameter. In
theory, the correct inference to make is that over repeated sampling, 95 of
100 confidence intervals computed in this way are expected to include the
true population value of $B_j$. If the estimated CI includes ln(1) = 0, analysts
may choose to conclude that H0: $B_j$ = 0 cannot be rejected at the α = 0.05
level of significance.
Inference concerning the significance/importance of predictors can be
performed directly for the $\hat{B}_j$s (on the log-odds scale). However, to quantify the magnitude of the effect of an individual predictor, it is more useful
to transform the inference to a scale that is easily interpreted by scientists,
policy makers, and the general public. As discussed in Section 6.4.5, in a
logistic regression model with a single predictor, x1, an estimate of the odds
ratio corresponding to a one unit increase in the value of x1 can be obtained
by exponentiating the estimated logistic regression coefficient:
$$\hat{\psi} = \exp(\hat{B}_1) \qquad (8.21)$$
If the model contains only a single predictor, the result is an estimate of
the unadjusted odds ratio. If the fitted logistic regression model includes
multiple predictors, that is,
$$\text{logit}(\hat{\pi}(x)) = \ln\left(\frac{\hat{\pi}(x)}{1-\hat{\pi}(x)}\right) = \hat{B}_0 + \hat{B}_1 x_1 + \cdots + \hat{B}_p x_p \qquad (8.22)$$
the result $\hat{\psi}_{j \mid \hat{B}_{k \neq j}} = \exp(\hat{B}_j)$ is an adjusted odds ratio. In general, the adjusted
odds ratio represents the multiplicative impact of a one-unit increase in the
predictor variable xj on the odds of the outcome variable being equal to 1,
holding all other predictor variables constant. Confidence limits can also be
computed for adjusted odds ratios:
© 2010 by Taylor and Francis Group, LLC
Logistic Regression and Generalized Linear Models
$$CI(\psi_j) = \exp\left(\hat{B}_j \pm t_{df,\,1-\alpha/2} \cdot se(\hat{B}_j)\right) \qquad (8.23)$$
Software procedures for logistic regression analysis of survey data generally offer the analyst the option to output parameter estimates and standard
errors on the log-odds scale (the original $\hat{B}_j$s) or transformed estimates of
the corresponding adjusted odds ratios and confidence intervals.
The adjusted odds ratios and confidence intervals can be estimated and
reported for any form of a predictor variable, including categorical variables, ordinal variables, and continuous variables. To illustrate, consider a
simple logistic regression model (based on the 2006 Health and Retirement
Study [HRS] data) of the probability that a U.S. adult age 50+ has arthritis.
The predictors in this main effects only model are gender, education level
(with levels less than high school, high school, and more than high school),
and age:
$$\text{logit}(\pi(x)) = B_0 + B_1 I_{Male} + B_2 I_{Educ,<HS} + B_3 I_{Educ,HS} + B_4 X_{Age(yrs)}$$
where:
$I_{Male}$ = indicator variable for male gender (female is reference);
$I_{Educ,<HS}$, $I_{Educ,HS}$ = indicators for education level (>high school is reference);
$X_{Age(yrs)}$ = respondent age in years.
The results from fitting this simple model in Stata (code not shown) are summarized in Tables 8.1 and 8.2.
Table 8.2
Estimates of Adjusted Odds Ratios in the Arthritis Model and 95% CIs for the Odds Ratios

Predictor^a    Category      $\hat{\psi}$    $CI_{.95}(\psi)$
GENDER         Male          0.552           (0.505, 0.603)
ED3CAT         <12 yrs       1.558           (1.410, 1.766)
ED3CAT         12 yrs        1.306           (1.202, 1.420)
AGE            Continuous    1.050           (1.044, 1.053)

Source: Analysis based on the 2006 HRS data.
^a Reference categories for categorical predictors are: GENDER (female); ED3CAT (>12 yrs).
Interpreting the output from this simple example, we can make the following statements:
• The estimated ratio of odds of arthritis for men relative to women is
$\hat{\psi}$ = 0.55.
• The estimated odds of arthritis for persons with less than a high
school education are $\hat{\psi}$ = 1.56 times the odds of arthritis for persons
with more than a high school education.
• The estimated odds of arthritis increase by a factor of $\hat{\psi}$ = 1.05 for
each additional year of age.
Note that for continuous predictors, the increment x to x + 1 can be a relatively small step on the full range of x. For this reason, analysts may choose
to report odds ratios for continuous predictors for a greater increment in x.
A common choice is to report the odds ratio for a one standard deviation
increase in x. For example, the standard deviation of 2006 HRS respondents’
age is approximately 10 years. The estimated odds ratio and the 95% confidence interval for the odds ratio associated with a one standard deviation
increase in age are computed as follows:
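$$\hat{\psi}_{10\,yrs} = e^{\hat{B}_4 \cdot 10} = \exp(0.047 \times 10) = 1.60$$
$$CI_{.95}(\psi_{10\,yrs}) = \left( \exp\left(\hat{B}_4 \times 10 - t_{56,\,0.975} \times 10 \times se(\hat{B}_4)\right),\; \exp\left(\hat{B}_4 \times 10 + t_{56,\,0.975} \times 10 \times se(\hat{B}_4)\right) \right)$$
$$= \exp(0.47 \pm 2.003 \times 10 \times 0.002) = (1.54,\ 1.67)$$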
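The arithmetic for a rescaled odds ratio can be reproduced with a few lines of code. The following is a minimal Python sketch, not code from the book (whose analyses were run in Stata). The function name `odds_ratio_ci` and its arguments are hypothetical, and the inputs are the rounded values quoted in the text: an age coefficient of 0.047, a standard error of 0.002, and $t_{56,0.975}$ = 2.003.

```python
import math

def odds_ratio_ci(b_hat, se, t_crit, increment=1.0):
    """Odds ratio and CI for an `increment`-unit increase in a predictor.

    Assumes increment > 0; both the log-odds estimate and its standard
    error are scaled by the increment before exponentiating.
    """
    log_or = b_hat * increment
    half_width = t_crit * increment * se
    return (math.exp(log_or),
            math.exp(log_or - half_width),
            math.exp(log_or + half_width))

# Rounded values quoted in the text for the age coefficient.
B4_HAT = 0.047   # estimated log-odds change per year of age
SE_B4 = 0.002    # design-based standard error
T_CRIT = 2.003   # t_{56, 0.975}

or_10, lo_10, hi_10 = odds_ratio_ci(B4_HAT, SE_B4, T_CRIT, increment=10)
print(f"{or_10:.2f} ({lo_10:.2f}, {hi_10:.2f})")  # 1.60 (1.54, 1.67)
```

Calling the same function with `increment=1` gives the per-year odds ratio. With these rounded inputs the result is close to, but not identical to, the AGE row of Table 8.2, which was presumably computed from unrounded estimates.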
When interactions between predictor variables are included in the specified model, analysts need to carefully consider the interpretation of the
parameter estimates. For example, consider an extension of the 2006 HRS
model for the logit of the probability of arthritis that includes the first-order
interaction of education level and gender. The estimated coefficients and
standard errors for this extended model reported by Stata (code not shown)
are shown in Table 8.3.
Note that when the interaction of gender and education is introduced
in the model, the parameter estimates for age, gender, and less than high
school education change slightly but the estimated parameter for high
school education is substantially reduced from $\hat{B}_{HS}$ = 0.267 to $\hat{B}_{HS}$ = 0.177.
This is due to the fact that this parameter now represents the contrast in
log-odds between high school education and greater than high school
education for females only (the reference level for gender) and is combined
with the parameter for the (12 yrs × Male) product term to define the same
contrast in log-odds for males. The parameter associated with the product
term therefore represents a change in this contrast for males relative to
females. Apparently linked to this decrease in “main effect” size for high
school education is a significant positive interaction between a 12th-grade
education and male gender. The estimated change in log-odds for males
with 12th-grade education relative to males with greater than 12th-grade
education is thus computed as 0.177 + 0.201 = 0.378. (Although the results
are not shown, the first-order interaction of GENDER and AGE was tested
in a separate model and was not significant.)

Table 8.3
Estimated Logistic Regression Model for Arthritis, Including the First-Order Interaction of Education and Gender

Predictor^a        Category          $\hat{B}$    $se(\hat{B})$    t         $P(t_{56} > t)$
INTERCEPT          Constant          –2.728       0.135            –20.22    < 0.01
GENDER             Male              –0.659       0.061            –10.81    < 0.01
ED3CAT             <12 yrs           0.454        0.063            7.20      < 0.01
ED3CAT             12 yrs            0.177        0.050            3.56      < 0.01
AGE                Continuous        0.047        0.002            22.11     < 0.01
ED3CAT × GENDER    <12 yrs × Male    0.004        0.102            0.04      0.970
ED3CAT × GENDER    12 yrs × Male     0.201        0.087            2.20      0.026

Source: Analysis based on the 2006 HRS data.
^a Reference categories for categorical predictors are GENDER (female); ED3CAT (>12 yrs).
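The significance pattern for the two product terms in Table 8.3 can be checked against expression 8.20 by computing a confidence interval for each coefficient on the log-odds scale. The following Python sketch is illustrative only. The helper `coef_ci` is a hypothetical name, and the critical value 2.003 is the $t_{56,0.975}$ quoted in the text.

```python
def coef_ci(b_hat, se, t_crit):
    """Design-based CI for a logistic regression coefficient (expression 8.20)."""
    half_width = t_crit * se
    return b_hat - half_width, b_hat + half_width

T_CRIT = 2.003  # t_{56, 0.975}, as quoted in the text

# (estimate, standard error) for the two product terms in Table 8.3
interactions = {
    "<12 yrs x Male": (0.004, 0.102),
    "12 yrs x Male": (0.201, 0.087),
}

results = {}
for label, (b_hat, se) in interactions.items():
    lo, hi = coef_ci(b_hat, se, T_CRIT)
    covers_zero = lo <= 0.0 <= hi
    results[label] = (lo, hi, covers_zero)
    print(f"{label}: ({lo:.3f}, {hi:.3f}) covers 0: {covers_zero}")

# Contrast in log-odds, 12 yrs vs. >12 yrs, for males: main effect + product term
hs_contrast_males = 0.177 + 0.201
```

The interval for the first product term covers zero and the interval for the second does not, which agrees with the p-values reported in the table.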
To explore the impact of the interaction of GENDER and ED3CAT on
the estimated logits and odds ratios, assume that AGE is fixed at 65 years.
Consider the patterns of covariates shown in columns 2–4 of Table 8.4.
To estimate the value of the logit for each covariate pattern, the estimated
coefficients in Table 8.3 are applied to the corresponding values of the predictor variables:
$$\text{logit}(\pi(x)) = -2.728 - 0.659\, I_{Male} + 0.454\, I_{Educ,<HS} + 0.177\, I_{Educ,HS} + 0.047\, X_{Age(yrs)}$$
$$\qquad +\ 0.004\,(I_{Male} \times I_{Educ,<HS}) + 0.201\,(I_{Male} \times I_{Educ,HS}).$$
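Applying the fitted equation to each covariate pattern is mechanical, as the following Python sketch shows. It is an illustration under stated assumptions, not the authors' code. The dictionary keys and the `logit` helper are hypothetical names, and the coefficients are the rounded values printed in Table 8.3. Because of that rounding (mainly in the age coefficient), the logits computed here differ from those printed in Table 8.4 by a roughly constant amount. Differences between patterns, and therefore the odds ratios, agree with the table to within about 0.01.

```python
import math

# Rounded coefficient estimates as printed in Table 8.3
COEF = {
    "const": -2.728, "male": -0.659, "lt_hs": 0.454, "hs": 0.177,
    "age": 0.047, "male_x_lt_hs": 0.004, "male_x_hs": 0.201,
}

def logit(male, educ, age):
    """Estimated logit for one covariate pattern; educ is '<HS', 'HS', or '>HS'."""
    lt_hs = 1 if educ == "<HS" else 0
    hs = 1 if educ == "HS" else 0
    return (COEF["const"] + COEF["male"] * male
            + COEF["lt_hs"] * lt_hs + COEF["hs"] * hs
            + COEF["age"] * age
            + COEF["male_x_lt_hs"] * male * lt_hs
            + COEF["male_x_hs"] * male * hs)

AGE = 65
# Patterns 1-6: (male indicator, education level)
patterns = [(1, "<HS"), (1, "HS"), (1, ">HS"), (0, "<HS"), (0, "HS"), (0, ">HS")]

reference = logit(0, ">HS", AGE)  # 65-year-old women with >HS education
odds_ratios = [math.exp(logit(m, e, AGE) - reference) for m, e in patterns]

for j, ((m, e), psi) in enumerate(zip(patterns, odds_ratios), start=1):
    print(j, "Male" if m else "Female", e, round(psi, 3))
```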
Table 8.4
Covariate Patterns, Logits, and Odds Ratios for the 2006 HRS Arthritis Model

Pattern (j)    Gender    Education    Age    Logit: $z_j$    Odds Ratio^a
1              Male      <HS          65     0.190           0.82
2              Male      HS           65     0.111           0.75
3              Male      >HS          65     –0.267          0.52
4              Female    <HS          65     0.845           1.57
5              Female    HS           65     0.569           1.19
6              Female    >HS          65     0.392           1.00

^a Relative to joint reference of Female with > High School Education.

Table 8.4 shows the estimated logits for the six unique covariate patterns
(with age fixed at 65). To evaluate the ratio of odds of arthritis for men and
women of different education levels, a general technique to compare the
estimated odds for two different patterns of covariates is used. Consider
two “patterns” of covariate values $x_1$ and $x_2$. Using the example in Table 8.4,
$x_1$ might be pattern 1, 65-year-old males with <HS education, and $x_2$ might
be the reference category for the GENDER × ED3CAT interaction, which
would be 65-year-old women with >HS education. To obtain the estimated
odds ratio that compares $x_1$ with $x_2$, the following five steps are required:
1. Based on the estimated model, compute the values of the logit function for the two sets of covariates:
$$\text{logit}_1 = \sum_{j=0}^{p} \hat{B}_j x_{1j}; \qquad \text{logit}_2 = \sum_{j=0}^{p} \hat{B}_j x_{2j}$$
These are shown for the six example covariate patterns in Table 8.4;
for example, logit1 = 0.190 (Pattern 1) and logit2 = 0.392 (Pattern 6).
2. Compute the difference between the two logits, $\hat{\Delta}_{1:2} = \text{logit}_1 - \text{logit}_2$
= 0.190 – 0.392 = –0.202.
3. Exponentiate the difference in the two logits to estimate the odds
ratio comparing $x_1$ with $x_2$, $\hat{\psi}_{1:2} = e^{\hat{\Delta}_{1:2}} = e^{-0.202} = 0.817$.
4. To reflect the uncertainty in the estimated odds ratios, the estimates
should be accompanied by an estimated confidence interval. The CI
of an odds ratio comparing two arbitrary covariate patterns, $x_1$ and
$x_2$, takes the general form of expression 8.23. The standard error estimate requires the algebraically complicated derivation of the standard error of the difference in the two logits:
$$se(\hat{\Delta}_{1:2}) = \sqrt{\sum_{j=0}^{p} (x_{1j} - x_{2j})^2\, var(\hat{B}_j) + 2 \sum_{j<k} (x_{1j} - x_{2j})(x_{1k} - x_{2k})\, cov(\hat{B}_j, \hat{B}_k)} \qquad (8.24)$$
Note that any common values for xj in logit1 and logit2 can be
ignored in evaluating this standard error. The calculation of the
standard error of the difference in logits requires the values of
$var(\hat{B}_j)$ and $cov(\hat{B}_j, \hat{B}_k)$. These may be obtained in Stata by issuing
the estat vce command after the model has been estimated.
5. Exponentiate the CI limits for the difference in logits to estimate the
odds ratio and its 95% CI:
$$\hat{\psi}_{1:2} = \exp(\hat{\Delta}_{1:2}); \qquad CI(\hat{\psi}_{1:2}) = \exp\left[\hat{\Delta}_{1:2} \pm t_{df,\,.975} \cdot se(\hat{\Delta}_{1:2})\right]$$
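The five steps can be collected into a single routine. The Python sketch below is a hedged illustration rather than the book's implementation: `compare_patterns` is a hypothetical helper, and the coefficient vector uses the rounded estimates from Table 8.3. The text does not report the covariance matrix of the estimates, so the example builds a PLACEHOLDER matrix from the squared standard errors with every covariance set to zero. That assumption is made only so the code runs; a real analysis would use the full matrix (for example, the one displayed by estat vce in Stata), and the resulting interval should not be read as a result for the HRS data.

```python
import math

def compare_patterns(b_hat, vcov, x1, x2, t_crit):
    """Odds ratio, CI limits, and se(delta) comparing covariate pattern x1 with x2."""
    d = [a - b for a, b in zip(x1, x2)]              # x_1j - x_2j; zero for shared values
    delta = sum(dj * bj for dj, bj in zip(d, b_hat))  # steps 1-2: logit1 - logit2
    # Step 4: the full double sum over (j, k) of a symmetric matrix equals the
    # squared-difference variance terms plus twice the j < k covariance terms.
    n = len(d)
    var_delta = sum(d[j] * d[k] * vcov[j][k] for j in range(n) for k in range(n))
    se_delta = math.sqrt(var_delta)
    # Steps 3 and 5: exponentiate the estimate and the CI limits
    return (math.exp(delta),
            math.exp(delta - t_crit * se_delta),
            math.exp(delta + t_crit * se_delta),
            se_delta)

# Order: const, male, <HS, HS, age, male x <HS, male x HS (rounded, Table 8.3)
B_HAT = [-2.728, -0.659, 0.454, 0.177, 0.047, 0.004, 0.201]
SE = [0.135, 0.061, 0.063, 0.050, 0.002, 0.102, 0.087]

# PLACEHOLDER covariance matrix: zero covariances are assumed, not reported.
VCOV = [[SE[j] ** 2 if j == k else 0.0 for k in range(7)] for j in range(7)]

x_pattern1 = [1, 1, 1, 0, 65, 1, 0]   # 65-year-old male, <HS
x_pattern6 = [1, 0, 0, 0, 65, 0, 0]   # 65-year-old female, >HS (reference)

or_hat, ci_lo, ci_hi, se_delta = compare_patterns(
    B_HAT, VCOV, x_pattern1, x_pattern6, t_crit=2.003)
print(round(or_hat, 3))  # point estimate only; the CI depends on the placeholder
```

The intercept and age entries are identical in the two patterns, so their differences are zero and they drop out of both the point estimate and the standard error, as noted in step 4.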
The technique just described for estimating odds ratios applies generally
to any observed patterns of covariates $x_1$ and $x_2$. The final column in Table 8.4
shows the estimated odds ratios comparing each of the six covariate patterns
based on the estimated logistic regression model including the interaction of
gender and education level. Holding age constant at 65 years, the coding of
gender (female reference), and education level (>High School is the reference)
results in 65 year old women with >HS education as the natural reference
group. A convenient way to analyze and report odds ratios for predictors
that have a significant interaction is to use a graphical display of the type
shown for the arthritis example in Figure 8.3.
The X (horizontal) axis in Figure 8.3 represents the three ordinal education
categories. The Y (vertical) axis is the value of the estimated odds ratio, using
women with >HS education as the comparison group. At each education
level, the estimated odds ratios are plotted separately for men and women.
Note that compared with the women, the odds of arthritis are lower for men
and, consistent with the significant interaction in the estimated model, drop
substantially for men in the >HS education group. Confidence bars may also
be added to this graphical display to enhance the presentation and visual
comparison of odds ratios for important covariate patterns.
We include detailed code for generating Figure 8.3 in Stata on the book’s
Web site.
Figure 8.3
Plot of estimated odds ratios, showing the interaction between gender and education in the
arthritis model. (Modified from the 2006 HRS data.) [Vertical axis: Odds Ratio (Arthritis);
horizontal axis: Education Level; separate series for Male and Female.]

8.7 Analysis Application
This section presents an example logistic regression analysis that follows the
four general modeling stages described in Sections 8.3 through 8.6.
Example 8.1: Examining Predictors of a Lifetime
Major Depressive Episode in the NCS-R Data
The aim of this example is to build a logistic regression model for the probability
that a U.S. adult has been diagnosed with major depressive episode (MDE) in
their lifetime. The dependent variable is the NCS-R variable MDE, which takes a
value of 1 for persons who meet lifetime criteria for major depression and 0 for all
others. The following predictors are considered: AG4CAT (a categorical variable
measuring age brackets, including 18–29, 30–44, 45–59, and 60+), SEX (1 = Male,
2 = Female), ALD (an indicator of any lifetime alcohol dependence), ED4CAT
(a categorical variable measuring education brackets, including 0–11 years, 12
years, 13–15 years, and 16+ years), and MAR3CAT (a categorical variable measuring marital status, with values 1 = “married,” 2 = “separated/widowed/divorced,”
and 3 = “never married”). The primary research question of analytical interest is
whether MDE is related to alcohol dependence after adjusting for the effects of the
previously listed demographic factors.
8.7.1 Stage 1: Model Specification
The analysis session begins by specifying the complex design features of
the NCS-R sample in the Stata svyset command. Note that we specify the
“long” or Part 2 NCS-R sampling weight (NCSRWTLG) in the svyset command. This is due to the use of the alcohol dependence variable in the analysis, which was measured in Part 2 of the NCS-R survey.
There are 42 sampling error strata and 84 sampling error computation
units (two per stratum) in the NCS-R sampling error calculation model,
resulting in 42 design-based degrees of freedom.
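The degrees-of-freedom count follows the usual fixed rule for complex sample designs: the number of sampling error computation units minus the number of sampling error strata. A minimal Python sketch of that arithmetic is given below; the function name `design_df` is hypothetical.

```python
def design_df(n_clusters, n_strata):
    """Design-based degrees of freedom: clusters (SECUs) minus strata."""
    return n_clusters - n_strata

# NCS-R sampling error calculation model: 84 SECUs in 42 strata
ncsr_df = design_df(n_clusters=84, n_strata=42)
print(ncsr_df)  # 42
```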
Following the recommendations of Hosmer and Lemeshow (2000), the
model building begins by examining the bivariate associations of MDE with
each of the potential predictor variables. Since the candidate predictors are
all categorical variables, the bivariate relationship of each predictor with