Figure 8.3 Plot of estimated odds ratios, showing the interaction between gender and education in the arthritis model. (Modified from the 2006 HRS data.)
value of 1 for persons who meet lifetime criteria for major depression and 0 for all others. The following predictors are considered: AG4CAT (a categorical variable measuring age brackets, including 18–29, 30–44, 45–59, and 60+), SEX (1 = Male, 2 = Female), ALD (an indicator of any lifetime alcohol dependence), ED4CAT (a categorical variable measuring education brackets, including 0–11 years, 12 years, 13–15 years, and 16+ years), and MAR3CAT (a categorical variable measuring marital status, with values 1 = “married,” 2 = “separated/widowed/divorced,” and 3 = “never married”). The primary research question is whether MDE is related to alcohol dependence after adjusting for the effects of the previously listed demographic factors.
8.7.1 Stage 1: Model Specification
The analysis session begins by specifying the complex design features of the NCS-R sample in the Stata svyset command. Note that we specify the “long” or Part 2 NCS-R sampling weight (NCSRWTLG) in the svyset command. This is due to the use of the alcohol dependence variable in the analysis, which was measured in Part 2 of the NCS-R survey. There are 42 sampling error strata and 84 sampling error computation units (two per stratum) in the NCS-R sampling error calculation model, resulting in 42 design-based degrees of freedom. Following the recommendations of Hosmer and Lemeshow (2000), the model building begins by examining the bivariate associations of MDE with each of the potential predictor variables. Since the candidate predictors are all categorical variables, the bivariate relationship of each predictor with
Table 8.5 presents the results of these bivariate analyses, including the Rao–Scott F-tests of association. The table also presents estimates of the percentages of each predictor category that received the lifetime MDE diagnosis. Based on these initial tests of association, all of the predictor variables appear to have significant bivariate associations with MDE, including ALD, the indicator of lifetime alcohol dependence, and all of the predictors appear to be good candidates for inclusion in the initial multivariate logistic regression model.

Table 8.5 Initial Bivariate Design-Based Tests of Association Assessing Potential Predictors of Lifetime Major Depressive Episode (MDE) for the NCS-R Adult Sample
Predictor   Category                             Rao–Scott F-test(a)                   % with MDE (SE)
AG4CAT      18–29, 30–44, 45–59, 60+             F(2.76,115.97) = 26.39, p < 0.01
SEX         Male, Female
ALD         Yes, No
ED4CAT      <12 yrs, 12 yrs, 13–15 yrs, 16+ yrs
MAR3CAT     Married, Previously, Never
a See Chapter 6 for more details on the derivation of this test statistic, which can be used to test the null hypothesis of no association between the predictor variable and the outcome variable (MDE).

8.7.2 Stage 2: Model Estimation
The next step in the Hosmer–Lemeshow (2000) model building procedure is to fit the “initial” multivariate model, examining the main effects for all five predictors. Stata provides two programs for fitting the multivariate logistic regression model: svy: logistic and svy: logit. The default output for the svy: logistic command is estimates of adjusted odds ratios and 95% confidence intervals for the adjusted odds ratios. Note the use of the char command and specification of the omitted reference category for SEX (2, or females) prior to the svy: logistic command. This command allows the user to specify custom reference categories rather than accepting the default of having the lowest category omitted:

char sex[omit] 2
xi: svy: logistic mde i.ag4cat i.sex ald i.ed4cat i.mar3cat
In svy: logistic, estimates of the parameters and standard errors for the logit model can be requested by using the coef option.

xi: svy: logistic mde i.ag4cat i.sex ald i.ed4cat ///
    i.mar3cat, coef
Stata users who wish to see the estimated logistic regression coefficients and standard errors may also use the companion program, svy: logit.

xi: svy: logit mde i.ag4cat i.sex ald i.ed4cat i.mar3cat
Estimated odds ratios and 95% CIs can be generated in svy: logit by adding the or option:

xi: svy: logit mde i.ag4cat i.sex ald i.ed4cat ///
    i.mar3cat, or
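The coefficient and odds ratio displays are two views of the same fit: an adjusted odds ratio is exp(B), and its confidence limits are obtained by exponentiating the limits for B. The Python sketch below illustrates the conversion. It is not Stata output: the coefficient is back-calculated from the ALD odds ratio of 4.15 quoted in Section 8.7.4, the standard error is hypothetical, and the t critical value (about 2.018 for 42 design-based degrees of freedom) is hard-coded as an assumption.

```python
import math


def odds_ratio_ci(b, se, t_crit):
    """Return (OR, lower, upper) by exponentiating b and b +/- t_crit * se."""
    return (
        math.exp(b),
        math.exp(b - t_crit * se),
        math.exp(b + t_crit * se),
    )


# Logit coefficient implied by the ALD odds ratio of 4.15 quoted in the text.
b_ald = math.log(4.15)
se_ald = 0.10   # hypothetical standard error, for illustration only
t_crit = 2.018  # assumed 0.975 t quantile for 42 degrees of freedom

or_ald, lower, upper = odds_ratio_ci(b_ald, se_ald, t_crit)
print(f"OR = {or_ald:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Because the interval is built on the log-odds scale and then exponentiated, it is not symmetric around the odds ratio itself.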
Other software systems (e.g., SAS PROC SURVEYLOGISTIC) will output both the estimated logistic regression coefficients (and standard errors) and the corresponding odds ratio estimates. (Readers should be aware that the svy: logit and svy: logistic commands differ slightly in the procedures for calculation of the standard errors for odds ratios.) In each form of the Stata svy: logistic or svy: logit command, the dependent variable, MDE, is listed first, followed by the predictor variables (and their interactions, if applicable). Note that the xi: modifier is used in conjunction with the svy: logistic and svy: logit commands to indicate those predictor variables that are categorical in nature: AG4CAT, ED4CAT, SEX, and MAR3CAT. (Because the ALD variable is already coded as either 0 or 1, it does not require program-generated indicator variables.) The xi: specification of the model instructs Stata to create a set of indicator variables for each independent variable preceded by the i. prefix. Stata will create K – 1 indicator variables to represent the K levels of the categorical predictor. By default, Stata will select the lowest valued category as the reference category and only the K – 1 indicators for the remaining categories will be included as predictors in the model.
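The K – 1 indicator coding described above can be mimicked outside Stata. The following Python sketch is only an illustration of the coding scheme, not of how xi: is implemented; the function names are hypothetical, and the `_I<var>_<level>` naming simply follows the pattern of the indicator names that appear in the test commands of this section.

```python
def reference_level(levels, omit=None):
    """Lowest level is the reference unless another level is named in omit."""
    return min(levels) if omit is None else omit


def indicator_names(var, levels, omit=None):
    """Names of the K - 1 indicators for a K-level categorical predictor."""
    ref = reference_level(levels, omit)
    return [f"_I{var}_{level}" for level in sorted(levels) if level != ref]


def indicator_row(value, levels, omit=None):
    """0/1 values of the K - 1 indicators for one observation."""
    ref = reference_level(levels, omit)
    return [int(value == level) for level in sorted(levels) if level != ref]


# AG4CAT has four levels; level 1 (18-29) is the default reference.
print(indicator_names("ag4cat", [1, 2, 3, 4]))
# SEX with level 2 (female) omitted, as requested by char sex[omit] 2.
print(indicator_names("sex", [1, 2], omit=2))
# A respondent in age category 3 (45-59).
print(indicator_row(3, [1, 2, 3, 4]))
```

A respondent in the reference category has zeros on all K – 1 indicators, which is why the intercept describes that group.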
Table 8.6 Estimated Logistic Regression Model for the Lifetime MDE Outcome (Output Generated by Using the svy: logit Command)
Predictor(a)   Category                B̂
INTERCEPT
AG4CAT         30–44, 45–59, 60+
SEX            Male
ALD            Yes
ED4CAT         12, 13–15, 16+
MAR3CAT        Previously, Never
Source: Analysis based on the NCS-R data.
Notes: n = 5,692, adjusted Wald test for all parameters: F(10,33) = 28.07, p < 0.001.
a Reference categories for categorical predictors are: AG4CAT (18–29); SEX (Female); ALD (No); ED4CAT (<12 yrs); MAR3CAT (Married).
Tables 8.6 and 8.7 summarize the output generated by fitting the MDE logistic regression model. The initial model includes main effects for the chosen predictor variable candidates but at this point does not include any interactions between the predictors.
8.7.3 Stage 3: Model Evaluation
The adjusted Wald tests in Stata for the AG4CAT, ED4CAT, and MAR3CAT categorical predictors in this initial model are generated by using the test command:

test _Iag4cat_2 _Iag4cat_3 _Iag4cat_4
test _Imar3cat_2 _Imar3cat_3
test _Ied4cat_2 _Ied4cat_3 _Ied4cat_4
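The degrees of freedom attached to these design-adjusted F-tests follow from the sample design. The Python sketch below assumes the commonly used adjustment F = (d – k + 1)/(kd) × W, referred to an F distribution with k and d – k + 1 degrees of freedom, where d is the design-based degrees of freedom (clusters minus strata) and k is the number of parameters tested. This form is an assumption of the sketch rather than a formula quoted from this section, and the Wald chi-square value used is hypothetical.

```python
def design_df(n_clusters, n_strata):
    """Design-based degrees of freedom: clusters minus strata."""
    return n_clusters - n_strata


def adjusted_wald_f(wald_chi2, k, d):
    """Return (F, numerator df, denominator df) for a k-parameter Wald test.

    Assumes F = (d - k + 1) / (k * d) * W with (k, d - k + 1) degrees of freedom.
    """
    denom_df = d - k + 1
    f_stat = (denom_df / (k * d)) * wald_chi2
    return f_stat, k, denom_df


d = design_df(n_clusters=84, n_strata=42)  # NCS-R design described in Stage 1

# Hypothetical Wald chi-square of 6.0 for a three-parameter predictor.
f_stat, num_df, den_df = adjusted_wald_f(wald_chi2=6.0, k=3, d=d)
print(f"F({num_df},{den_df}) = {f_stat:.2f}")

# Degrees of freedom for a test of 10 parameters.
print(adjusted_wald_f(wald_chi2=1.0, k=10, d=d)[1:])
```

Under this assumed form, three parameters give a denominator of 40 and ten parameters give 33, matching the F(3,40) and F(10,33) labels reported for this model.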
Note that we do not request these multiparameter Wald tests for the SEX and ALD predictor variables, because they are represented by single indicator variables in the regression model and the overall Wald test for each predictor is equivalent to the t-test reported for the single estimated parameter for that predictor. Further, note in each of the test statements that we include the K – 1 indicator variables generated by Stata for each of the categorical predictors (e.g., _Iag4cat_2) when the xi: modifier is used to identify categorical predictor variables. Stata users can find the names of these indicators in the Stata Variables window once the model has been fitted. Table 8.8 provides the design-adjusted F-versions of the resulting Wald test statistics and associated p-values. Two of the three design-adjusted Wald tests are significant at the 0.01 level. The exception is the Wald test for education [F(3,40) = 2.13, p = 0.112], which suggests that the parameters associated with education in this logistic regression model are not significantly different from zero and that education may not be an important predictor of lifetime MDE when adjusting for the relationships of the other predictor variables with the outcome. If the objective of the model-building process is the construction of a parsimonious model, education could probably be dropped as a predictor at this point. For the purposes of this illustration (and because of the marginal significance), we will retain education in the model moving forward.

Table 8.8 Design-Adjusted Wald Tests for the Parameters Associated with the Categorical Predictors in the Initial MDE Logistic Regression Model
Categorical Predictor   F-Test Statistic    P-value
AG4CAT                  F(3,40) = 19.03     < 0.001
ED4CAT                  F(3,40) = 2.13      0.112
MAR3CAT                 F(2,40) = 16.60     < 0.001
Source: Analysis based on the NCS-R data.

8.7.4 Stage 4: Model Interpretation/Inference
Based on the results in Table 8.6 and Table 8.8, it appears that each of the predictors in the multivariate model has a significant (or marginally significant) relationship with the probability of MDE after adjusting for the relationships of the other predictors. Focusing on the primary predictor variable of interest, we see that the odds of having had a major depressive episode at some point in the lifetime are multiplied by 4.15 when a person has had a diagnosis of alcohol dependence at some point in his or her lifetime, when adjusting for the relationships of age, sex, education, and marital status. Of course, this model does not allow for any kind of causal inference, given that time ordering of the events is not available in the NCS-R data set; we can, however, conclude that there is strong evidence of an association between the two disorders in this finite population when adjusting for other demographic covariates. We also note that relative to married respondents, respondents who were previously married have significantly higher (63% higher) odds of having had a major depressive episode in their lifetime when adjusting for the other covariates. Further, middle-age respondents have significantly higher odds of lifetime MDE (relative to younger respondents), while older respondents and males have significantly reduced odds of lifetime MDE (again relative to younger respondents and females). Respondent age is represented in the model as four grouped categories of age. Including grouped categories for age (or recoded categories of any continuous predictor, more generally) in a logistic regression model will result in estimates of the expected contrasts in log-odds for respondents in each of the defined categories, relative to the reference category. Since the model parameters are estimated separately for each defined age group (with age 18–29 as the reference), the model will capture any nonlinearity of effect in the ordered age groupings. Inspecting the estimated coefficients and odds ratios for the grouped age categories in Table 8.6 and Table 8.7, it appears that there is significant nonlinearity in the effect of age on the probability of MDE. Relative to the 18–29-year-old group, the odds of MDE are multiplied by 1.29 (aged 30–44) and 1.23 (aged 45–59) for the middle-age ranges but by 0.51 in the age 60 and older group.
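The odds ratios quoted in this discussion can be restated as percentage changes in the odds and as log-odds contrasts, which makes the non-monotone age pattern easier to see. The short Python sketch below uses only the odds ratios stated in the text (the value 1.63 is inferred from the "63% higher" statement); the labels and the helper function are hypothetical conveniences.

```python
import math


def percent_change_in_odds(odds_ratio):
    """Percentage change in the odds relative to the reference category."""
    return (odds_ratio - 1.0) * 100.0


# Odds ratios quoted in the text, each relative to its reference category.
reported = {
    "ALD: yes vs. no": 4.15,
    "MAR3CAT: previously vs. married": 1.63,
    "AG4CAT: 30-44 vs. 18-29": 1.29,
    "AG4CAT: 45-59 vs. 18-29": 1.23,
    "AG4CAT: 60+ vs. 18-29": 0.51,
}

for label, odds_ratio in reported.items():
    print(
        f"{label}: OR = {odds_ratio:.2f}, "
        f"log-odds contrast = {math.log(odds_ratio):+.3f}, "
        f"change in odds = {percent_change_in_odds(odds_ratio):+.0f}%"
    )
```

The age contrasts are positive for the two middle categories and negative for the oldest group, so no single linear trend across the ordered categories would reproduce them.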
Such nonlinear effects of age are common in models of human disorders and are possibly attributable to normal processes of aging and selective mortality. If the example model were estimated with age (in years) as a continuous predictor variable, at this stage in the model-building process the analyst would reestimate the model including both the linear and quadratic terms for age. Therefore, at this stage in the model-building process, we have chosen to retain all of the candidate main effects. Next, we apply Archer and Lemeshow’s (2006) design-adjusted test to assess the goodness of fit of this initial model (assuming that this procedure has been downloaded and installed):
The resulting design-adjusted F-statistic reported in the Stata Results window is equal to FA-L = 1.229, with a p-value of 0.310. This suggests that the null hypothesis that the model fits the data well is not rejected. We therefore have confidence moving forward that the fit of this initial model is acceptable. Next, we consider testing some scientifically relevant two-way interactions between the candidate predictor variables. For illustration purposes, we suppose that possible two-way interactions of sex with the other four covariates measuring age, lifetime alcohol dependence, education, and marital status are of interest, if sex is posited by an NCS-R analyst as being a possible moderator of the relationships of these other four covariates with lifetime MDE. We fit a model including these two-way interactions in Stata using the following command:

xi: svy: logistic mde i.ag4cat*i.sex i.sex*ald ///
    i.ed4cat*i.sex i.mar3cat*i.sex, coef
Note how the interactions are specified in this command. When the xi: modifier is used for a regression command, the products of the two factors listed after the dependent variable specify that the regression parameters associated with each individual factor should be included in the regression model in addition to the parameters associated with the relevant cross-product terms defined by the interaction (e.g., the indicator for AG4CAT = 2 × the indicator for SEX = 1). In other words, listing AG4CAT and SEX in addition to the previous product terms would be redundant, and the main effects are included in the model by default when the interaction terms are specified. Table 8.9 presents the estimates of the regression parameters in this model generated by executing the previous command in Stata. At this point, the statistical question is whether these two-way interactions are making a significant additional contribution or improvement to the fit of this model to the NCS-R data. That is, are any of the parameters associated with the two-way interaction terms significantly different from 0? We can test this hypothesis by once again using design-adjusted Wald tests. The relevant interaction terms for the regression model are automatically generated by Stata and included in the data set when using the xi: modifier, so the cross-product terms in the test commands that follow can be easily selected from the Stata Variables window:

test
test
test
test
Table 8.9 Estimated Logistic Regression Model for Lifetime MDE, Including First Order Interactions of the Other Predictor Variables with SEX
Predictor(a)     Category
INTERCEPT        Constant
AG4CAT           30–44, 45–59, 60+
SEX              Male
ALD              Yes
ED4CAT           12, 13–15, 16+
MAR3CAT          Previously, Never
AG4CAT × SEX     30–44 × Male, 45–59 × Male, 60+ × Male
ALD × SEX        Yes × Male
ED4CAT × SEX     12 × Male, 13–15 × Male, 16+ × Male
MAR3CAT × SEX    Previously × Male, Never × Male
Source: Analysis based on the NCS-R data.
Notes: n = 5,692. Adjusted Wald test for all parameters: F(19,24) = 17.15, p < 0.001.
a Reference categories for categorical predictors are: AG4CAT (18–29 yrs); SEX (female); ALD (no); ED4CAT (<12 yrs); MAR3CAT (married).
Based on the test results presented in Table 8.10, we fail to reject the null hypotheses for all four of the tests, suggesting that these two-way interactions are not making a significant contribution to the fit of the model. We therefore do not consider these two-way interactions any further and would proceed with making inferences based on the estimates from the model presented in Table 8.6.
Table 8.10 Design-Adjusted Wald Tests of First-Order Interactions of Sex and Other Categorical Predictors in the MDE Logistic Regression Model
Interaction Term

8.8 Comparing the Logistic, Probit, and Complementary Log–Log GLMs for Binary Dependent Variables
This chapter has focused on logistic regression techniques for modeling π(x) for a binary dependent variable. As discussed in Section 8.2, alternative generalized linear models for a binary dependent variable may be estimated using the probit or CLL link function. In discussing these alternative GLMs, we noted that inferences derived from logistic, probit, and CLL regression models should generally be consistent. To illustrate, consider the results in Table 8.11 for a side-by-side comparison of estimated logistic, probit, and CLL regression models. The example used for this comparison is a model of the probability that a U.S. adult is alcohol dependent. The data are from the NCS-R long interview (or Part 2 of the survey), and each model includes the same demographic main effects considered in Section 8.7 for the model of MDE: SEX, AG4CAT, ED4CAT, and MAR3CAT. The Stata commands for the estimation of the three models follow (note the use of the char sex[omit] 2 syntax to specify the desired omitted category for sex):

char sex[omit] 2

xi: svy: logit ald i.ag4cat i.sex i.ed4cat i.mar3cat
test _Iag4cat_2 _Iag4cat_3 _Iag4cat_4
test _Ied4cat_2 _Ied4cat_3 _Ied4cat_4
test _Imar3cat_2 _Imar3cat_3

xi: svy: probit ald i.ag4cat i.sex i.ed4cat i.mar3cat
test _Iag4cat_2 _Iag4cat_3 _Iag4cat_4
test _Ied4cat_2 _Ied4cat_3 _Ied4cat_4
test _Imar3cat_2 _Imar3cat_3

xi: svy: cloglog ald i.ag4cat i.sex i.ed4cat i.mar3cat
test _Iag4cat_2 _Iag4cat_3 _Iag4cat_4
test _Ied4cat_2 _Ied4cat_3 _Ied4cat_4
test _Imar3cat_2 _Imar3cat_3
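The three commands differ only in the link function that maps the linear predictor to π(x). As a minimal illustration (not part of the Stata analysis), the Python sketch below evaluates the three inverse links at a few arbitrary values of the linear predictor; the function names and the chosen values are hypothetical.

```python
import math


def inv_logit(eta):
    """Inverse logit link: pi = 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + math.exp(-eta))


def inv_probit(eta):
    """Inverse probit link: standard normal CDF evaluated at eta."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))


def inv_cloglog(eta):
    """Inverse complementary log-log link: pi = 1 - exp(-exp(eta))."""
    return 1.0 - math.exp(-math.exp(eta))


print(" eta   logit  probit     CLL")
for eta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(
        f"{eta:4.1f}  {inv_logit(eta):6.3f}  "
        f"{inv_probit(eta):6.3f}  {inv_cloglog(eta):6.3f}"
    )
```

The logit and probit curves are symmetric about a linear predictor of zero, where both give a probability of 0.5, while the CLL curve is asymmetric. The same linear predictor therefore maps to different probabilities under each link, which is why coefficients from the three models are on different scales even when the substantive conclusions agree.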
Table 8.11 presents a summary of the estimated coefficients, standard errors, and p-values for simple hypothesis tests of the form H0: Bj = 0. Table 8.12 presents the results for the Wald tests of the overall age, education, and marital status effects. Note that although the coefficients and standard errors for the probit model show the expected difference in scale