8.8 Comparing the Logistic, Probit, and Complementary Log–Log GLMs for Binary Dependent Variables
Applied Survey Data Analysis
Table 8.10
Design-Adjusted Wald Tests of First-Order Interactions of Sex and Other Categorical Predictors in the MDE Logistic Regression Model

Interaction Term    F-Test Statistic    P(ℱ > F)
AG4CAT × SEX        F(3,40) = 0.25      0.863
ALD × SEX           F(1,42) = 0.68      0.413
ED4CAT × SEX        F(3,40) = 0.13      0.944
MAR3CAT × SEX       F(2,41) = 0.77      0.472

Source: Analysis based on the NCS-R data.
generalized linear models for a binary dependent variable may be estimated
using the probit or CLL link function. In discussing these alternative GLMs,
we noted that inferences derived from logistic, probit and CLL regression
models should generally be consistent.
To illustrate, consider the results in Table 8.11 for a side-by-side comparison of estimated logistic, probit and CLL regression models. The example
used for this comparison is a model of the probability that a U.S. adult is
alcohol dependent. The data are from the NCS-R long interview (or Part 2
of the survey), and each model includes the same demographic main effects
considered in Section 8.7 for the model of MDE: SEX, AG4CAT, ED4CAT,
and MAR3CAT. The Stata commands for the estimation of the three models
follow (note the use of the char sex[omit]2 syntax to specify the desired
omitted category for sex):
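* Omit category 2 of sex (female) so that female is the reference category
char sex[omit] 2
* Logistic regression, followed by Wald tests of the age, education, and marital status effects
xi: svy: logit ald i.ag4cat i.sex i.ed4cat i.mar3cat
test _Iag4cat_2 _Iag4cat_3 _Iag4cat_4
test _Ied4cat_2 _Ied4cat_3 _Ied4cat_4
test _Imar3cat_2 _Imar3cat_3
* Probit regression with the same predictors and tests
xi: svy: probit ald i.ag4cat i.sex i.ed4cat i.mar3cat
test _Iag4cat_2 _Iag4cat_3 _Iag4cat_4
test _Ied4cat_2 _Ied4cat_3 _Ied4cat_4
test _Imar3cat_2 _Imar3cat_3
* Complementary log-log regression with the same predictors and tests
xi: svy: cloglog ald i.ag4cat i.sex i.ed4cat i.mar3cat
test _Iag4cat_2 _Iag4cat_3 _Iag4cat_4
test _Ied4cat_2 _Ied4cat_3 _Ied4cat_4
test _Imar3cat_2 _Imar3cat_3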
Table 8.11 presents a summary of the estimated coefficients, standard
errors, and p-values for simple hypothesis tests of the form H0: Bj = 0.
Table 8.12 presents the results for the Wald tests of the overall age, education, and marital status effects. Note that although the coefficients and
standard errors for the probit model show the expected difference in scale
© 2010 by Taylor and Francis Group, LLC
Logistic Regression and Generalized Linear Models

Table 8.11
Comparison of Logistic, Probit, and CLL Models of Alcohol Dependency in U.S. Adults

                            Logistic                   Probit                     C-L-L
Predictorᵃ  Category     B̂       se(B̂)   p         B̂       se(B̂)   p         B̂       se(B̂)   p
Intercept                –3.124   0.225   <0.001    –1.719   0.105   <0.001    –3.140   0.218   <0.001
SEX         Male          0.997   0.119   <0.001     0.471   0.056   <0.001     0.965   0.115   <0.001
AG4CAT      30–44         0.146   0.178    0.416     0.065   0.084    0.444     0.143   0.171    0.408
            45–59        –0.051   0.144    0.726    –0.034   0.067    0.609    –0.045   0.140    0.748
            60+          –1.120   0.212   <0.001    –0.531   0.093   <0.001    –1.083   0.209   <0.001
ED4CAT      12 yrs       –0.268   0.194    0.173    –0.124   0.095    0.200    –0.260   0.185    0.167
            13–15 yrs    –0.264   0.176    0.141    –0.124   0.085    0.152    –0.256   0.169    0.137
            16+ yrs      –0.736   0.197   <0.001    –0.339   0.092   <0.001    –0.713   0.190   <0.001
MAR3CAT     Previously    0.517   0.142   <0.001     0.255   0.069   <0.001     0.494   0.136   <0.001
            Never         0.065   0.169    0.070     0.039   0.077    0.616     0.060   0.164    0.713

Source: Analysis based on the NCS-R data.
Notes: n = 5,692.
ᵃ Reference categories for categorical predictors are AG4CAT (18–29 yrs); SEX (female); ED4CAT (<12 yrs); MAR3CAT (Married).
Table 8.12
Design-Adjusted Wald Tests of Categorical Predictors in the MDE Models

Wald F-Test Statistic (p-Value = P(ℱ > F))
Categorical Predictor   Logistic                    Probit                      CLL
AG4CAT                  F(3,40) = 12.06 (<0.001)    F(3,40) = 15.26 (<0.001)    F(3,40) = 11.52 (<0.001)
ED4CAT                  F(3,40) = 4.80 (0.006)      F(3,40) = 4.79 (0.006)      F(3,40) = 4.77 (0.006)
MAR3CAT                 F(2,41) = 6.54 (0.003)      F(2,41) = 6.66 (0.003)      F(2,41) = 6.50 (0.004)

Source: Analysis based on the NCS-R data.
from those of the logistic and CLL models, the three models produce similar p-values for the test of each parameter and would not lead to significant
differences in inferences concerning the effects of the individual parameters. Given the similarity in these results, we recommend using the logit
model for general applications. Faraway (2006, p. 38) points out three advantages of this approach: simpler mathematical formulation of the models,
ease of interpretation via odds ratios, and easier analysis of retrospectively
sampled data.
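The agreement among the three models can also be checked on the probability scale. The following Python sketch is not part of the original Stata analysis, and its function names are illustrative. It applies each inverse link function to the corresponding intercept in Table 8.11, which gives the predicted probability of alcohol dependence for the reference group (married women aged 18–29 with less than 12 years of education).

```python
import math

def inv_logit(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))

def inv_probit(eta):
    """Inverse of the probit link: the standard normal CDF."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

def inv_cloglog(eta):
    """Inverse of the complementary log-log link."""
    return 1.0 - math.exp(-math.exp(eta))

# Estimated intercepts taken from Table 8.11
intercepts = {"logistic": -3.124, "probit": -1.719, "cll": -3.140}
inverse_links = {"logistic": inv_logit, "probit": inv_probit, "cll": inv_cloglog}

# Predicted probability for the reference group under each model
predicted = {name: inverse_links[name](b0) for name, b0 in intercepts.items()}
for name, p in predicted.items():
    print(f"{name:8s} P(ald = 1 | reference group) = {p:.4f}")
```

Although the probit intercept is on a different scale, the three predicted probabilities are all roughly 0.042 to 0.043, which is the sense in which the models lead to consistent inferences.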
8.9 Exercises
1. Using the software procedure of your choice, fit the following simple
(binary) logistic regression model to the 2006 HRS data set. Model
the binary dependent variable, DIABETES (1 = yes, 0 = no) as a function of the following independent variables: AGE (KAGE), GENDER
(1 = Male, 2 = Female), RACE (0 = None Given, 1 = White, 2 = Black,
7 = Other), and ARTHRITIS (1 = yes, 0 = no). Be sure to model the
logit(P(Diabetes = 1)); that is, model the probability that a person has
diabetes. The predictors GENDER, RACE, and ARTHRITIS should
be treated as categorical, and the reference category parameterization should be used. For gender, choose females (2) as the reference
category. For race, choose white (1) as the reference category. For
arthritis, choose no (0) as the reference category. All analyses should
use the 2006 final individual sampling weight, KWGTR, as the analysis weight, and all analyses should incorporate design adjustments
for the stratification and clustering of the 2006 HRS sample (the
STRATUM and SECU variables). Prepare a table showing the parameter estimates in this model, their design-based standard errors, and
95% confidence intervals for the parameters.
2. Based on the fitted model from Exercise 1, what is the estimated
odds ratio comparing men’s odds of diabetes with that for women
(holding all other factors constant)? What is a 95% confidence interval for this odds ratio? What would you conclude about the relationship of gender with diabetes based on these results?
3. Based on the fitted model from Exercise 1, if all other variables are
held constant, what is the estimated odds ratio for diabetes associated with a 30-year increase in age? Compute the design-adjusted
95% confidence interval for this odds ratio.
4. Perform a joint design-adjusted Wald test of the null hypothesis that
B_male, B_black, and B_arthritis are all equal to zero.
Report the test statistic, the degrees of freedom, and a p-value for
this test. How would you explain the result of this test to a colleague
in plain English?
5. Perform a design-adjusted Wald test of the null hypothesis that B_BLACK
= B_OTHER. Report the test statistic, the degrees of freedom, and a p-value
for this test. What does the result of the test mean in plain English?
6. Construct a new variable for the interaction of AGE and GENDER.
Refit the original logistic regression model with this age × gender
interaction term added. Test and report whether the interaction of
age and gender significantly improves the fit of the model. Hint: HRS
codes GENDER as 1 and 2. You have been asked to use female (2)
as the reference category. To create the interaction variable, consider
recoding sex to 1 = male, 0 = female. What is your interpretation of
the interaction effect?
7. (Stata Only) Apply the Archer and Lemeshow (2006, 2007) procedure for testing the goodness of fit of a model of your choosing, and
be clear about the model being tested. What is your conclusion about
the fit of the model based on this test? Is this fit adequate or not?
8. Prepare a short discussion (two to four paragraphs) describing the
results of your analysis of the specified set of potential risk factors
for diabetes. To illustrate how one might use your estimated model
in practice, include the detailed computation of the predicted probability of having diabetes for someone with a specified set of values
on the covariates included in the model.
9 Generalized Linear Models for Multinomial, Ordinal, and Count Variables
9.1 Introduction
Chapter 8 covered generalized linear models (GLMs) for survey variables that
are measured on a binary or dichotomous scale. The aim of this chapter is
to introduce generalized linear modeling techniques for three other types of
dependent variables that are common in survey data sets: nominal categorical variables, ordinal categorical variables, and counts of events or outcomes.
Chapter 8 laid the foundation for generalized linear modeling, and this chapter will emphasize specific methods and software applications for three principal methods. Section 9.2 will introduce the “baseline” multinomial logit
regression model for a survey variable with three or more nominal response
categories. The cumulative logit model for dependent variables that are measured on an ordinal scale will be covered in Section 9.3. Regression methods
for dependent variables that are counts (e.g., number of events, attributes),
including Poisson regression models and negative binomial regression
models, are presented in Section 9.4. Stata software will be used to illustrate
the applications of these methods, but the reader is encouraged to visit the
companion Web site for this text to find each example replicated in the other
major software systems that support these advanced modeling procedures.
9.2 Analyzing Survey Data Using Multinomial Logit Regression Models
9.2.1 The Multinomial Logit Regression Model
The multinomial logit regression model is the natural extension of the simple binary logistic regression model to survey responses that have three
or more distinct categories. This technique is most appropriate for survey
variables with nominal response categories; we present examples of these
(a) NHANES HUQ.040
What kind of place do you go to most often: is it a clinic, doctor’s office, emergency room, or some other place?
1. CLINIC OR HEALTH CENTER
2. DOCTOR’S OFFICE OR HMO
3. HOSPITAL EMERGENCY ROOM
4. HOSPITAL OUTPATIENT DEPARTMENT
5. SOME OTHER PLACE
6. REFUSED
7. DON’T KNOW

(b) NCS-R EM7.1
What about your current employment situation as of today -- are you?
1. EMPLOYED
2. SELF-EMPLOYED
3. LOOKING FOR WORK; UNEMPLOYED
4. TEMPORARILY LAID OFF
5. RETIRED
6. HOMEMAKER
7. STUDENT
8. MATERNITY LEAVE
9. ILLNESS/SICK LEAVE
10. DISABLED
11. OTHER (SPECIFY)

Figure 9.1
Survey questions with multinomial response categories.
variables from the 2005–2006 National Health and Nutrition Examination
Survey (NHANES) and the National Comorbidity Survey Replication (NCS-R) in Figure 9.1. It is common practice in surveys to use a fairly detailed set
of response categories to code the respondent’s answer and then recode the
multiple categories to a smaller but still scientifically useful set of nominal
groupings. For example, the NCS-R public-use data set contains a recoded
labor force status variable, WKSTAT3, that combines the 11 questionnaire
responses for current work force status into three grouped categories: (1)
employed (EMP); (2) unemployed (UN); and (3) not in the labor force (NLF).
The multinomial logit regression model is ideally suited for multivariate
analysis of dependent variables like WKSTAT3.
Multinomial logit regression may also be applied to survey variables measured on Likert-type scales (e.g., 1 = strongly agree to 5 = strongly disagree)
or other ordered categorical response scales (e.g., self-rated health status: 1 =
excellent to 5 = poor), but the cumulative logit regression model covered in
Section 9.3 may be the more efficient technique for modeling such ordinal
dependent variables.
To understand the multinomial logit regression model for a dependent
variable y with K nominal categories, assume that category y = 1 is chosen as
the baseline category. Multinomial logit regression is a method of simultaneously estimating a set of K – 1 simple logistic regression models that model
the odds of being in category y = 2, …, K versus the baseline category y = 1.
Consider the example of the NCS-R recoded variable for labor force status,
WKSTAT3, with three nominal categories: 1 = EMP; 2 = UN; 3 = NLF. To fit
the multinomial logit regression model to this “trinomial” dependent variable, two generalized logits are needed:
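$$
\begin{aligned}
\text{logit}(\pi(\text{“UN”} \mid x)) = \text{logit}(\pi_2) &= \ln\left[\frac{\pi(y = 2 \mid x)}{\pi(y = 1 \mid x)}\right] = B_{2:0} + B_{2:1} x_1 + \cdots + B_{2:p} x_p \\
\text{logit}(\pi(\text{“NLF”} \mid x)) = \text{logit}(\pi_3) &= \ln\left[\frac{\pi(y = 3 \mid x)}{\pi(y = 1 \mid x)}\right] = B_{3:0} + B_{3:1} x_1 + \cdots + B_{3:p} x_p
\end{aligned}
\tag{9.1}
$$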
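To see how the two generalized logits determine all three category probabilities, consider the following minimal Python sketch. The values of the linear predictors are hypothetical and chosen only for illustration, and the function name is not from the text. Because the baseline category has an implicit linear predictor of zero, each probability is the exponentiated logit divided by one plus the sum of the exponentiated logits.

```python
import math

def category_probabilities(logits):
    """Convert K - 1 generalized logits (versus the baseline) into K probabilities.

    `logits` maps each non-baseline category to its linear predictor;
    the baseline category (EMP, y = 1) has an implicit linear predictor of 0.
    """
    denom = 1.0 + sum(math.exp(eta) for eta in logits.values())
    probs = {"EMP": 1.0 / denom}  # baseline category
    for category, eta in logits.items():
        probs[category] = math.exp(eta) / denom
    return probs

# Hypothetical linear predictors for one respondent (illustrative only)
logits = {"UN": -2.0, "NLF": -0.5}
probs = category_probabilities(logits)

for category, p in probs.items():
    print(f"P({category} | x) = {p:.4f}")
```

Taking the log of the ratio of any non-baseline probability to the EMP probability recovers the corresponding logit, which is exactly the relationship stated in Equation 9.1.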
A natural question to ask at this point is, “Is it possible to simply estimate
the multinomial logit regression model as a series of binary logistic regression models that consider only the response data for two categories at a time?”
Strictly speaking, the answer is no. The parameter estimates for what Agresti
(2002) labels the “separate-fitting” approach will be similar but not identical to those for simultaneous estimation of the multinomial logits. Standard
errors for the former will be greater than those for the simultaneous estimation, and only the latter yields the full variance–covariance matrix needed
to test hypotheses concerning the significance or equivalence of parameters
across the estimated logits. Fortunately, almost all software systems that
support analysis of complex sample survey data now include the capability
for the simultaneous estimation of the multinomial logit regression model.
9.2.2 Multinomial Logit Regression Model: Specification Stage
The specification stage of building a multinomial logit model parallels that
described in detail in Section 8.3 for specifying a logistic regression model
for a binary dependent variable. However, two aspects of the model specification require special emphasis:
1. Choice of the baseline category. In the example model formulation for
the two distinct logits in Equation 9.1, category y = 1 is the selected
baseline category. The survey analyst is free to choose which of the
K categories he or she prefers to use as the baseline. This choice will
not affect the overall fit of the multinomial logit model or overall
tests of significance for the parameters associated with predictors
included in the model. However, interpretation of the parameter
estimates will depend on the selected baseline category, given how
the generalized logits are defined. Stata will default to use the lowest
numbered category as the baseline category for estimating the logits and corresponding odds ratios. To choose a different category as
the baseline for the multinomial logits, the Stata analyst can use the
baseoutcome(#) option, where # represents the value of the desired
baseline category. In general, when a choice of a baseline category is
not clear based on research objectives, we recommend using the most
common category (or mode) of the nominal dependent variable.
2. Parsimony. Because each of the K – 1 logits that form the multinomial logit model will include the identical design vector of
covariates, $x = \{1, x_1, \ldots, x_p\}$, and each estimated logit will have
$B_k = \{B_{k:0}, B_{k:1}, \ldots, B_{k:p}\}$ parameters, the total number of parameter estimates will be (K – 1) × (p + 1). Consequently, to ensure efficiency in
estimation and accuracy of interpretation, the final specification of
the model should attempt to minimize the number of predictors that
are either not significant or are highly collinear with other significant covariates. Analysts can use design-adjusted multiparameter
Wald tests to determine the overall importance of predictors across
the K – 1 logit functions, and we will consider an example of this in
Section 9.2.6.
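Both points can be illustrated with a short Python sketch. The coefficients below are hypothetical, and the helper names are introduced here only for illustration. The sketch counts the parameters for a given K and p, then re-expresses a set of coefficients relative to a different baseline category and confirms that the implied category probabilities do not change. Only the interpretation of the coefficients changes with the baseline.

```python
import math

def n_parameters(K, p):
    """Number of coefficients in a K-category model with p predictors."""
    return (K - 1) * (p + 1)

def probabilities(coefs, x):
    """Category probabilities; one coefficient vector per category,
    with the baseline category's vector fixed at all zeros."""
    etas = {k: sum(b * xi for b, xi in zip(bk, x)) for k, bk in coefs.items()}
    denom = sum(math.exp(eta) for eta in etas.values())
    return {k: math.exp(eta) / denom for k, eta in etas.items()}

def rebaseline(coefs, new_base):
    """Re-express coefficients relative to a different baseline category."""
    base = coefs[new_base]
    return {k: [b - b0 for b, b0 in zip(bk, base)] for k, bk in coefs.items()}

# Hypothetical coefficients (intercept, x1) with category 1 as the baseline
coefs_base1 = {1: [0.0, 0.0], 2: [-2.0, 0.4], 3: [-0.5, -0.3]}
coefs_base3 = rebaseline(coefs_base1, new_base=3)

x = [1.0, 2.0]  # design vector {1, x1}
p1 = probabilities(coefs_base1, x)
p3 = probabilities(coefs_base3, x)

print("parameters for K = 3, p = 1:", n_parameters(3, 1))
print("probabilities, baseline 1:", p1)
print("probabilities, baseline 3:", p3)
```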
9.2.3 Multinomial Logit Regression Model: Estimation Stage
The principal difference in estimation for the multinomial logit model versus the simple binary logit model of Chapter 8 is that the pseudo-likelihood
function for the data is based on the multinomial distribution (as opposed
to the binomial) and the number of parameters and standard errors to be
estimated increases from p + 1 for the logistic model to (K – 1) × (p + 1) for the
multinomial logit regression model. When survey data are collected from a
sample with a complex design, the default in most current software systems is
to employ a multinomial version of Binder’s (1983) Taylor series linearization
(TSL) estimator to derive the estimated variance–covariance matrix of the
model parameter estimates. Most software systems also provide a balanced
repeated replication (BRR) or jackknife repeated replication (JRR) option to
compute replication variance estimates, $\widehat{Var}(\hat{B})_{rep}$. Theory Box 9.1 provides
a more mathematically oriented summary of the estimation of the multinomial logit regression parameters and their variance–covariance matrix when
working with complex sample survey data.
In Stata, the svy: mlogit command is used to estimate the multinomial logit regression coefficients and their standard errors. In SAS, analysts
employ the standard PROC SURVEYLOGISTIC procedure with the LINK=GLOGIT
option on the MODEL statement to perform a multinomial logit regression analysis. Other software
options for estimation of multinomial logit regression models are detailed
on the book’s Web site.
9.2.4 Multinomial Logit Regression Model: Evaluation Stage
Like simple logistic regression and all other forms of generalized linear models, the evaluation stage in building the multinomial logit regression model
Theory Box 9.1  Estimation for the Multinomial Logit Regression Model
Estimation of the model parameters involves maximizing the following multinomial version of the pseudo-likelihood function:
$$
PL_{Mult}(\hat{B} \mid X) = \prod_{i=1}^{n} \left\{ \prod_{k=1}^{K} \hat{\pi}_k(x_i)^{y_i^{(k)}} \right\}^{w_i} \tag{9.2}
$$
where
$y_i^{(k)}$ = 1 if y = k for sampled unit i, 0 otherwise;
$\hat{\pi}_k(x_i)$ is the estimated probability that $y_i = k \mid x_i$; and
$w_i$ is the survey weight for sampled unit i.
The maximization involves application of the Newton-Raphson algorithm to solve the following set of (K – 1) × (p + 1) estimating equations,
assuming a complex sample design with strata indexed by h and clusters within strata indexed by α:
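$$
S(B)_{Mult} = \sum_{h} \sum_{\alpha} \sum_{i} w_{h\alpha i} \left( y_{h\alpha i}^{(k)} - \pi_k(B) \right) x'_{h\alpha i} = 0 \tag{9.3}
$$
where
$y_{h\alpha i}^{(k)}$ = 1 if y = k for sampled unit i, 0 otherwise;
$x'_{h\alpha i}$ = a column vector of the p + 1 design matrix elements for case i
= $[1 \; x_{1,h\alpha i} \; \cdots \; x_{p,h\alpha i}]'$;
$B = \{B_{2,0}, \ldots, B_{2,p}, \ldots, B_{K,0}, \ldots, B_{K,p}\}$ is a (K – 1) × (p + 1) vector of
parameters; and
$$
\pi_k(B) = \frac{\exp(x'_{h\alpha i} B_k)}{1 + \sum_{j=2}^{K} \exp(x'_{h\alpha i} B_j)}
$$
with $B_1 = 0$ for k = 1 (the baseline), so that the numerator equals 1 for the baseline category.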
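The quantities in Equations 9.2 and 9.3 can be sketched in a few lines of Python. This is an illustrative sketch only, not the algorithm used by any particular software package: the toy data, weights, and function names are all hypothetical, and the stratum and cluster indices are dropped because the score is simply a weighted sum over all sampled units. A Newton-Raphson routine would iterate on B until every element of the score is zero.

```python
import math

def probs(B, x):
    """pi_k(B) for k = 1..K; B holds the vectors for k = 2..K (B_1 = 0)."""
    expo = [1.0] + [math.exp(sum(b * xi for b, xi in zip(Bk, x))) for Bk in B]
    total = sum(expo)
    return [e / total for e in expo]

def log_pseudo_likelihood(B, data):
    """Weighted log of Equation 9.2; data rows are (y, x, w) with y in 1..K."""
    return sum(w * math.log(probs(B, x)[y - 1]) for y, x, w in data)

def score(B, data):
    """Left-hand side of Equation 9.3, one vector per non-baseline category."""
    S = [[0.0] * len(Bk) for Bk in B]
    for y, x, w in data:
        pi = probs(B, x)
        for k in range(2, len(B) + 2):
            resid = (1.0 if y == k else 0.0) - pi[k - 1]
            for j, xj in enumerate(x):
                S[k - 2][j] += w * resid * xj
    return S

# Toy data: (category, design vector {1, x1}, survey weight); all hypothetical
data = [(1, [1.0, 0.0], 2.0), (2, [1.0, 1.0], 1.5),
        (3, [1.0, 0.5], 1.0), (1, [1.0, 1.0], 0.5)]
B0 = [[0.0, 0.0], [0.0, 0.0]]  # starting values for B_2 and B_3

ll0 = log_pseudo_likelihood(B0, data)
S0 = score(B0, data)
print("log pseudo-likelihood at B = 0:", ll0)
print("score at B = 0:", S0)
```

At the starting values every category has probability 1/3, so the log pseudo-likelihood equals the sum of the weights times ln(1/3), and the nonzero score indicates that B = 0 is not yet a solution of the estimating equations.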
The variance–covariance matrix of the estimated parameters takes
the now familiar sandwich form, based on Binder’s (1983) application
of Taylor series linearization to estimates derived using pseudo-maximum likelihood estimation:
$$
\widehat{Var}(\hat{B}) = (J^{-1}) \, var[S(\hat{B})] \, (J^{-1}) \tag{9.4}
$$
The matrices $J$ and $var[S(\hat{B})]$ are derived as illustrated in Theory
Box 8.3 for simple logistic regression, with the important change that
both are now symmetric matrices with (K – 1) × (p + 1) rows and columns, reflecting the full
dimension of the parameter vector for the multinomial logit regression
model.
begins with Wald tests of hypotheses concerning the model parameters.
With (K – 1) × (p + 1) parameter estimates, the number of possible hypothesis tests is almost limitless. However, a series of hypothesis tests should be
standard practice for evaluating these complex models. Standard t-tests for
single parameters and Wald tests for multiple parameters should be used to
evaluate the significance of the covariate effects in individual logits, that is,
H0: Bk:j = 0, or across all estimated logits, that is, H0: B2:j = …= BK:j = 0. Example
questions that could drive hypothesis tests include the following: Is gender
a significant predictor of the odds that a U.S. adult is unemployed versus
employed? Is gender a significant predictor in determining the labor force
status of U.S. adults regardless of category? Other multiparameter Wald tests
can be readily constructed to test custom hypotheses that are relevant for
interpretation of a given model. If gender significantly alters the odds that
an adult is unemployed or not in the labor force relative to employed, is the
gender effect equivalent for unemployment and NLF status? Examples of
these general forms of hypothesis tests will be provided in the analytical
example in Section 9.2.6.
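As a rough illustration of a multiparameter test of the form H0: B2:j = B3:j = 0, the following Python sketch computes a Wald statistic for two parameters and converts it to a design-adjusted F statistic. The estimates and covariance matrix are hypothetical. The sketch assumes the adjustment F = (df – q + 1)/(q × df) × W, referred to an F distribution with q and df – q + 1 degrees of freedom, and 42 design degrees of freedom, which is consistent with the F(2,41) and F(3,40) reference distributions reported in Table 8.12.

```python
def wald_statistic(b, V):
    """Wald chi-square b' V^{-1} b for two parameters (2 x 2 covariance V)."""
    (v11, v12), (v21, v22) = V
    det = v11 * v22 - v12 * v21
    # Inverse of a 2 x 2 matrix written out explicitly
    inv = [[v22 / det, -v12 / det], [-v21 / det, v11 / det]]
    return sum(b[i] * inv[i][j] * b[j] for i in range(2) for j in range(2))

def adjusted_wald_f(W, q, design_df):
    """Design-adjusted F statistic and its (numerator, denominator) df."""
    F = (design_df - q + 1) / (q * design_df) * W
    return F, (q, design_df - q + 1)

# Hypothetical estimates of B_{2:j} and B_{3:j} and their covariance matrix
b = [0.40, -0.30]
V = [[0.04, 0.01], [0.01, 0.09]]

W = wald_statistic(b, V)
F, df = adjusted_wald_f(W, q=2, design_df=42)
print(f"Wald chi-square = {W:.3f}; adjusted F{df} = {F:.3f}")
```

The off-diagonal element of V is what the separate-fitting approach cannot supply, which is why tests across logits require simultaneous estimation.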
We note that at the time of this writing, methods for evaluating the goodness of fit of multinomial logit models for complex sample survey data have
yet to be developed. Any developments in this area will be reported on the
companion Web site for this book.
9.2.5 Multinomial Logit Regression Model: Interpretation Stage
The interpretation of the parameter estimates in a multinomial logit
regression model is a natural extension of the interpretation of effects in
the simple logistic regression model. Simply exponentiating a parameter
estimate results in an adjusted odds ratio, corresponding to the multiplicative impact of a one-unit increase in the predictor variable, $x_j$, on the odds
that the response is equal to k relative to the odds of a response in the
baseline category:
$$
\begin{aligned}
\hat{\psi}_{k:j} &= \exp(\hat{B}_{k:j}) \\
CI(\hat{\psi}_{k:j}) &= \exp[\hat{B}_{k:j} \pm t_{df,1-\alpha/2} \cdot se(\hat{B}_{k:j})]
\end{aligned}
\tag{9.5}
$$
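The arithmetic in Equation 9.5 can be sketched in Python as follows. The coefficient and standard error are hypothetical, and the critical value 2.018 is the 0.975 quantile of a t distribution under an assumed 42 design degrees of freedom. In practice the degrees of freedom depend on the sample design.

```python
import math

def odds_ratio_ci(b, se, t_crit):
    """Odds ratio exp(b) with confidence limits exp(b -/+ t_crit * se)."""
    half_width = t_crit * se
    return math.exp(b), math.exp(b - half_width), math.exp(b + half_width)

# Hypothetical multinomial logit estimate and standard error
b_hat, se_hat = 0.50, 0.20
# Critical value t_{42, 0.975}, assuming 42 design degrees of freedom
t_crit = 2.018

psi, lower, upper = odds_ratio_ci(b_hat, se_hat, t_crit)
print(f"odds ratio = {psi:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Because the limits are computed on the logit scale and then exponentiated, the interval is not symmetric around the estimated odds ratio, and an interval that excludes 1 corresponds to a coefficient that differs significantly from zero.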