
Example 8.1: Examining Predictors of a Lifetime Major Depressive Episode in the NCS-R Data


Applied Survey Data Analysis

Figure 8.3
Plot of estimated odds ratios, showing the interaction between gender and education in the arthritis model. (Modified from the 2006 HRS data.) [Plot not reproduced: y-axis, Odds Ratio (Arthritis); x-axis, Education Level; separate lines for Male and Female.]

value of 1 for persons who meet lifetime criteria for major depression and 0 for all

others. The following predictors are considered: AG4CAT (a categorical variable

measuring age brackets, including 18–29, 30–44, 45–59, and 60+), SEX (1 = Male,

2 = Female), ALD (an indicator of any lifetime alcohol dependence), ED4CAT

(a categorical variable measuring education brackets, including 0–11 years, 12

years, 13–15 years, and 16+ years), and MAR3CAT (a categorical variable measuring marital status, with values 1 = “married,” 2 = “separated/widowed/divorced,”

and 3 = “never married”). The primary research question of analytical interest is

whether MDE is related to alcohol dependence after adjusting for the effects of the

previously listed demographic factors.

8.7.1 Stage 1: Model Specification

The analysis session begins by specifying the complex design features of

the NCS-R sample in the Stata svyset command. Note that we specify the

“long” or Part 2 NCS-R sampling weight (NCSRWTLG) in the svyset command. This is due to the use of the alcohol dependence variable in the analysis, which was measured in Part 2 of the NCS-R survey.

There are 42 sampling error strata and 84 sampling error computation

units (two per stratum) in the NCS-R sampling error calculation model,

resulting in 42 design-based degrees of freedom.
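The degrees-of-freedom figure follows the usual rule for stratified cluster designs: the number of sampling error computation units minus the number of sampling error strata. A minimal Python sketch of that arithmetic is given here; the function name design_df is hypothetical and introduced only for illustration.

```python
def design_df(n_clusters: int, n_strata: int) -> int:
    """Design-based degrees of freedom: clusters minus strata."""
    if n_clusters < n_strata:
        raise ValueError("need at least one cluster per stratum")
    return n_clusters - n_strata


# NCS-R: 84 sampling error computation units spread over 42 strata
ncsr_df = design_df(n_clusters=84, n_strata=42)
print(ncsr_df)
```

This value is the reference t distribution's degrees of freedom for the single-parameter tests reported later in the example.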

Following the recommendations of Hosmer and Lemeshow (2000), the

model building begins by examining the bivariate associations of MDE with

each of the potential predictor variables. Since the candidate predictors are

all categorical variables, the bivariate relationship of each predictor with

© 2010 by Taylor and Francis Group, LLC


Logistic Regression and Generalized Linear Models

MDE is analyzed in Stata by using the svy: tab command and requesting

row percentages (as discussed in Chapter 6).

svy: tab ag4cat mde, row
svy: tab sex mde, row
svy: tab ald mde, row
svy: tab ed4cat mde, row
svy: tab mar3cat mde, row

Table 8.5 presents the results of these bivariate analyses, including the Rao–Scott F-tests of association. The table also presents estimates of the percentages of each predictor category that received the lifetime MDE diagnosis.

Based on these initial tests of association, all of the predictor variables

appear to have significant bivariate associations with MDE, including ALD,

the indicator of lifetime alcohol dependence, and all of the predictors appear

to be good candidates for inclusion in the initial multivariate logistic regression model.

Table 8.5
Initial Bivariate Design-Based Tests of Association Assessing Potential Predictors of Lifetime Major Depressive Episode (MDE) for the NCS-R Adult Sample

Predictor   Rao–Scott F-test(a)                  Category     % with MDE (SE)
AG4CAT      F(2.76,115.97) = 26.39, p < 0.01     18–29        18.4 (0.9)
                                                 30–44        22.9 (1.1)
                                                 45–59        22.3 (1.3)
                                                 60+          11.1 (1.0)
SEX         F(1,42) = 44.83, p < 0.01            Male         15.3 (0.9)
                                                 Female       22.6 (0.7)
ALD         F(1,42) = 120.03, p < 0.01           Yes          45.2 (0.3)
                                                 No           17.7 (0.7)
ED4CAT      F(2.90,121.93) = 4.30, p < 0.01      <12 yrs      16.3 (1.2)
                                                 12 yrs       18.6 (0.8)
                                                 13–15 yrs    21.3 (1.0)
                                                 16+ yrs      19.7 (1.1)
MAR3CAT     F(1.90,79.74) = 11.08, p < 0.01      Married      17.3 (0.7)
                                                 Previously   23.9 (1.5)
                                                 Never        19.4 (1.1)

(a) See Chapter 6 for more details on the derivation of this test statistic, which can be used to test the null hypothesis of no association between the predictor variable and the outcome variable (MDE).

8.7.2 Stage 2: Model Estimation

The next step in the Hosmer–Lemeshow (2000) model building procedure is
to fit the “initial” multivariate model, examining the main effects for all five
predictors. Stata provides two programs for fitting the multivariate logistic

regression model: svy: logistic and svy: logit. The default output for

the svy: logistic command is estimates of adjusted odds ratios and 95%

confidence intervals for the adjusted odds ratios. Note the use of the char

command and specification of the omitted reference category for SEX (2, or

females) prior to the svy: logistic command. This command allows the

user to specify custom reference categories rather than accepting the default

of having the lowest category omitted:

char sex[omit] 2

xi: svy: logistic mde i.ag4cat i.sex ald i.ed4cat i.mar3cat

In svy: logistic, estimates of the parameters and standard errors for

the logit model can be requested by using the coef option.

xi: svy: logistic mde i.ag4cat i.sex ald i.ed4cat ///

i.mar3cat, coef

Stata users who wish to see the estimated logistic regression coefficients

and standard errors may also use the companion program, svy: logit.

xi: svy: logit mde i.ag4cat i.sex ald i.ed4cat i.mar3cat

Estimated odds ratios and 95% CIs can be generated in svy: logit by

adding the or option:

xi: svy: logit mde i.ag4cat i.sex ald i.ed4cat ///

i.mar3cat, or

Other software systems (e.g., SAS PROC SURVEYLOGISTIC) will output

both the estimated logistic regression coefficients (and standard errors) and

the corresponding odds ratio estimates. (Readers should be aware that the

svy: logit and svy: logistic commands differ slightly in the procedures for calculation of the standard errors for odds ratios.)
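The link between the two kinds of output is simple exponentiation: an adjusted odds ratio is exp(B), and a confidence interval for it can be obtained by exponentiating the endpoints of the interval for B. The Python sketch below illustrates this with the ALD estimate reported in Table 8.6. It assumes a critical value of about 2.018 for the 0.975 quantile of a t distribution with 42 degrees of freedom, hardcoded to avoid external dependencies; the function name odds_ratio_ci is hypothetical.

```python
import math

# Assumption: 0.975 quantile of Student's t with 42 design-based df,
# hardcoded (approximately 2.018) so the sketch needs only the stdlib.
T_CRIT_42 = 2.018


def odds_ratio_ci(b: float, se: float, t_crit: float = T_CRIT_42):
    """Return (OR, lower, upper) by exponentiating b and b +/- t_crit * se."""
    half_width = t_crit * se
    return math.exp(b), math.exp(b - half_width), math.exp(b + half_width)


# ALD coefficient and standard error as reported in Table 8.6
or_ald, lo_ald, hi_ald = odds_ratio_ci(b=1.424, se=0.154)
print(f"OR = {or_ald:.2f}, 95% CI = ({lo_ald:.2f}, {hi_ald:.2f})")
```

Because the inputs are rounded to three decimals, the result should agree with the ALD row of Table 8.7 only to about two decimal places.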

In each form of the Stata svy: logistic or svy: logit command,

the dependent variable, MDE, is listed first, followed by the predictor variables (and their interactions, if applicable). Note that the xi: modifier is used

in conjunction with the svy: logistic and svy: logit commands to

indicate those predictor variables that are categorical in nature: AG4CAT,

ED4CAT, SEX, and MAR3CAT. (Because the ALD variable is already coded

as either 0 or 1, it does not require program-generated indicator variables.)

The xi: specification of the model instructs Stata to create a set of indicator

variables for each independent variable preceded by the i. prefix. Stata will

create K – 1 indicator variables to represent the K levels of the categorical

predictor. By default, Stata will select the lowest valued category as the reference category and only the K – 1 indicators for the remaining categories will

be included as predictors in the model.
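To make the K – 1 coding concrete, the following Python sketch builds indicator columns for a categorical predictor and omits the reference level. It is only an illustration of the coding scheme, not of how Stata implements xi:; the function name indicator_columns and the toy data are hypothetical.

```python
def indicator_columns(values, levels, reference=None):
    """Build K - 1 zero/one indicator columns for a K-level predictor.

    The reference level (default: the lowest level) gets no column.
    """
    levels = sorted(levels)
    if reference is None:
        reference = levels[0]
    kept = [lev for lev in levels if lev != reference]
    return {lev: [1 if v == lev else 0 for v in values] for lev in kept}


# Hypothetical toy data: AG4CAT codes for five respondents
ag4cat = [1, 3, 2, 4, 1]
age_dummies = indicator_columns(ag4cat, levels=[1, 2, 3, 4])

# SEX with the reference switched to 2 (female), mirroring char sex[omit] 2
sex = [1, 2, 2, 1, 2]
sex_dummies = indicator_columns(sex, levels=[1, 2], reference=2)

print(age_dummies)
print(sex_dummies)
```

With four age levels the sketch yields three indicators (for levels 2, 3, and 4), and with the reference for SEX set to 2 the single remaining indicator marks males.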

Table 8.6
Estimated Logistic Regression Model for the Lifetime MDE Outcome (Output Generated by Using the svy: logit Command)

Predictor(a)   Category      B̂         se(B̂)    t         P(t42 > |t|)
INTERCEPT                    –1.583    0.121    –13.12    < 0.001
AG4CAT         30–44          0.255    0.094      2.71      0.01
               45–59          0.206    0.092      2.26      0.029
               60+           –0.676    0.141     –4.78    < 0.001
SEX            Male          –0.577    0.077     –7.48    < 0.001
ALD            Yes            1.424    0.154      9.24    < 0.001
ED4CAT         12             0.079    0.097      0.82      0.418
               13–15          0.231    0.093      2.48      0.017
               16+            0.163    0.111      1.47      0.148
MAR3CAT        Previously     0.486    0.085      5.69    < 0.001
               Never          0.116    0.108      1.07      0.290

Source: Analysis based on the NCS-R data.
Notes: n = 5,692. Adjusted Wald test for all parameters: F(10,33) = 28.07, p < 0.001.
(a) Reference categories for categorical predictors are: AG4CAT (18–29); SEX (Female); ALD (No); ED4CAT (<12 yrs); MAR3CAT (Married).

Tables 8.6 and 8.7 summarize the output generated by fitting the MDE

logistic regression model. The initial model includes main effects for the

chosen predictor variable candidates but at this point does not include any

interactions between the predictors.

8.7.3 Stage 3: Model Evaluation

The adjusted Wald tests in Stata for the AG4CAT, ED4CAT, and MAR3CAT

categorical predictors in this initial model are generated by using the

test command:

test _Iag4cat_2 _Iag4cat_3 _Iag4cat_4

test _Imar3cat_2 _Imar3cat_3

test _Ied4cat_2 _Ied4cat_3 _Ied4cat_4

Note that we do not request these multiparameter Wald tests for the SEX

and ALD predictor variables, because they are represented by single indicator variables in the regression model and the overall Wald test for each

predictor is equivalent to the t-test reported for the single estimated parameter for that predictor. Further, note in each of the test statements that we

include the K – 1 indicator variables generated by Stata for each of the categorical predictors (e.g., _Iag4cat_2) when the xi: modifier is used to
identify categorical predictor variables. Stata users can find the names of
these indicators in the Stata Variables window once the model has been fitted. Table 8.8 provides the design-adjusted F-versions of the resulting Wald
test statistics and associated p-values.

Table 8.7
Estimates of Adjusted Odds Ratios for the Lifetime MDE Outcome

Predictor(a)   Category      ψ̂       95% CI for ψ
AG4CAT         30–44         1.29    (1.067, 1.562)
               45–59         1.23    (1.022, 1.479)
               60+           0.51    (0.383, 0.677)
SEX            Male          0.56    (0.480, 0.656)
ALD            Yes           4.15    (3.042, 5.668)
ED4CAT         12            1.08    (0.890, 1.316)
               13–15         1.26    (1.044, 1.519)
               16+           1.18    (0.941, 1.471)
MAR3CAT        Previously    1.63    (1.369, 1.932)
               Never         1.12    (0.903, 1.396)

Source: Analysis based on the NCS-R data.
Notes: n = 5,692. Adjusted Wald test for all parameters: F(10,33) = 28.07, p < 0.001.
(a) Reference categories for categorical predictors are: AG4CAT (18–29); SEX (Female); ALD (No); ED4CAT (<12 yrs); MAR3CAT (Married).

Two of the three design-adjusted Wald tests are significant at the 0.01 level.

The exception is the Wald test for education [F(3,40) = 2.13, p = 0.112], which

suggests that the parameters associated with education in this logistic regression model are not significantly different from zero and that education may

not be an important predictor of lifetime MDE when adjusting for the relationships of the other predictor variables with the outcome. If the objective

of the model-building process is the construction of a parsimonious model,

education could probably be dropped as a predictor at this point. For the

purposes of this illustration (and because of the marginal significance), we

will retain education in the model moving forward.
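The degrees of freedom attached to these design-adjusted Wald tests follow a standard pattern: for a test of k parameters with d design-based degrees of freedom, the adjusted statistic is referred to an F distribution with k and d – k + 1 degrees of freedom, and the adjusted F equals the unadjusted Wald statistic multiplied by (d – k + 1)/(kd). The Python sketch below encodes that rule as we understand it; the function names are hypothetical, and d = 42 is taken from the NCS-R design described in Section 8.7.1.

```python
def adjusted_wald_df(k: int, design_df: int = 42):
    """Return (numerator df, denominator df) = (k, design_df - k + 1)."""
    if not 1 <= k <= design_df:
        raise ValueError("k must be between 1 and the design df")
    return k, design_df - k + 1


def adjusted_wald_f(wald_chi2: float, k: int, design_df: int = 42) -> float:
    """Rescale an unadjusted Wald chi-square statistic to the adjusted F."""
    return wald_chi2 * (design_df - k + 1) / (k * design_df)


for k in (1, 2, 3, 10, 19):
    print(k, adjusted_wald_df(k))
```

The rule reproduces the F(3,40) reported for the three-parameter tests and the F(10,33) reported for the overall test of the main-effects model. It gives 41 denominator degrees of freedom for a two-parameter test, so the F(2,40) printed for MAR3CAT in Table 8.8 may be a typographical slip for F(2,41).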

Table 8.8
Design-Adjusted Wald Tests for the Parameters Associated with the Categorical Predictors in the Initial MDE Logistic Regression Model

Categorical Predictor   F-Test Statistic    P-value
AG4CAT                  F(3,40) = 19.03     < 0.001
ED4CAT                  F(3,40) = 2.13      0.112
MAR3CAT                 F(2,40) = 16.60     < 0.001

Source: Analysis based on the NCS-R data.

8.7.4 Stage 4: Model Interpretation/Inference

Based on the results in Table 8.6 and Table 8.8, it appears that each of the predictors in the multivariate model has a significant (or marginally significant)
relationship with the probability of MDE after adjusting for the relationships
of the other predictors. Focusing on the primary predictor variable of interest, we see that the odds of having had a major depressive episode at some
point in the lifetime are multiplied by 4.15 when a person has had a diagnosis of alcohol dependence at some point in his or her lifetime, when adjusting
for the relationships of age, sex, education, and marital status. Of course, this

model does not allow for any kind of causal inference, given that time ordering of the events is not available in the NCS-R data set; we can, however,

conclude that there is strong evidence of an association between the two

disorders in this finite population when adjusting for other demographic

covariates. We also note that relative to married respondents, respondents

who were previously married have significantly higher (63% higher) odds

of having had a major depressive episode in their lifetime when adjusting

for the other covariates. Further, middle-age respondents have significantly

higher odds of lifetime MDE (relative to younger respondents), while older

respondents and males have significantly reduced odds of lifetime MDE

(again relative to younger respondents and females).
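Statements such as "63% higher odds" come from converting a logit coefficient into a percent change in the odds, 100 × (exp(B) – 1). A short Python sketch of that conversion, using three coefficients as reported in Table 8.6 (the function name percent_change_in_odds is hypothetical):

```python
import math


def percent_change_in_odds(b: float) -> float:
    """Percent change in the odds implied by a logit coefficient b."""
    return (math.exp(b) - 1.0) * 100.0


# Coefficients as reported in Table 8.6
coefficients = {
    "MAR3CAT: previously married": 0.486,
    "AG4CAT: 60+": -0.676,
    "SEX: male": -0.577,
}
for label, b in coefficients.items():
    print(f"{label}: {percent_change_in_odds(b):+.0f}%")
```

A positive coefficient translates into a percentage increase in the odds relative to the reference category, and a negative coefficient into a percentage decrease.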

Respondent age is represented in the model as four grouped categories of

age. Including grouped categories for age (or recoded categories of any continuous predictor, more generally) in a logistic regression model will result

in estimates of the expected contrasts in log-odds for respondents in each

of the defined categories, relative to the reference category. Since the model

parameters are estimated separately for each defined age group (with age

18–29 as the reference), the model will capture any nonlinearity of effect in

the ordered age groupings. Inspecting the estimated coefficients and odds

ratios for the grouped age categories in Table 8.6 and Table 8.7, it appears
that there is significant nonlinearity in the effect of age on the probability
of MDE. Relative to the 18–29-year-old group, the odds of MDE are multiplied by
factors of 1.29 (aged 30–44) and 1.23 (aged 45–59) for the middle-age ranges
but by a factor of 0.51 (roughly a halving of the odds) in the age 60 and older group. Such nonlinear effects of age are common in models of human disorders and are possibly attributable to normal processes of aging and selective mortality. If the
example model were estimated with age (in years) as a continuous predictor
variable, at this stage in the model-building process the analyst would reestimate the model including both the linear and quadratic terms for age.

Therefore, at this stage in the model-building process, we have chosen to

retain all of the candidate main effects. Next, we apply Archer and Lemeshow’s

(2006) design-adjusted test to assess the goodness of fit of this initial model

(assuming that this procedure has been downloaded and installed):


svylogitgof

The resulting design-adjusted F-statistic reported in the Stata Results

window is equal to F_A-L = 1.229, with a p-value of 0.310. This suggests that

the null hypothesis that the model fits the data well is not rejected. We

therefore have confidence moving forward that the fit of this initial model

is acceptable.

Next, we consider testing some scientifically relevant two-way interactions

between the candidate predictor variables. For illustration purposes, we suppose that possible two-way interactions of sex with the other four covariates

measuring age, lifetime alcohol dependence, education, and marital status

are of interest, if sex is posited by an NCS-R analyst as being a possible moderator of the relationships of these other four covariates with lifetime MDE.

We fit a model including these two-way interactions in Stata using the following command:

xi: svy: logistic mde i.ag4cat*i.sex i.sex*ald ///

i.ed4cat*i.sex i.mar3cat*i.sex, coef

Note how the interactions are specified in this command. When the xi:

modifier is used for a regression command, the products of the two factors

listed after the dependent variable specify that the regression parameters

associated with each individual factor should be included in the regression

model in addition to the parameters associated with the relevant cross-product terms defined by the interaction (e.g., the indicator for AG4CAT = 2 × the

indicator for SEX = 1). In other words, listing AG4CAT and SEX in addition

to the previous product terms would be redundant, and the main effects are

included in the model by default when the interaction terms are specified.
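Each cross-product term is simply the product of two indicator variables, so it equals 1 only for respondents who fall in both categories. The Python sketch below shows this for the term corresponding to AG4CAT = 2 and SEX = 1; the function name interaction_indicator and the toy data are hypothetical and serve only to illustrate the construction.

```python
def interaction_indicator(factor_a, factor_b, level_a, level_b):
    """Product of the indicator for factor_a == level_a and factor_b == level_b."""
    return [int(a == level_a) * int(b == level_b)
            for a, b in zip(factor_a, factor_b)]


# Hypothetical toy data for five respondents
ag4cat = [2, 2, 1, 4, 3]
sex = [1, 2, 1, 1, 2]  # 1 = male, 2 = female (the omitted reference)

# Analogue of the cross-product of the AG4CAT = 2 and SEX = 1 indicators
ag2_by_male = interaction_indicator(ag4cat, sex, level_a=2, level_b=1)
print(ag2_by_male)
```

Only the first toy respondent, a male aged 30–44, receives a value of 1 on this term.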

Table 8.9 presents the estimates of the regression parameters in this model

generated by executing the previous command in Stata.

At this point, the statistical question is whether these two-way interactions

are making a significant additional contribution or improvement to the fit

of this model to the NCS-R data. That is, are any of the parameters associated with the two-way interaction terms significantly different from 0? We

can test this hypothesis by once again using design-adjusted Wald tests. The

relevant interaction terms for the regression model are automatically generated by Stata and included in the data set when using the xi: modifier, so

the cross-product terms in the test commands that follow can be easily

selected from the Stata Variables window:

test _Iag4Xsex_2_1 _Iag4Xsex_3_1 _Iag4Xsex_4_1
test _IsexXald_1
test _Ied4Xsex_2_1 _Ied4Xsex_3_1 _Ied4Xsex_4_1
test _ImarXsex_2_1 _ImarXsex_3_1

Table 8.9
Estimated Logistic Regression Model for Lifetime MDE, Including First-Order Interactions of the Other Predictor Variables with SEX

Predictor(a)     Category             B̂         se(B̂)    t         P(t42 > |t|)
INTERCEPT        Constant             –1.600    0.134    –11.94    < 0.001
AG4CAT           30–44                 0.220    0.114      1.94      0.059
                 45–59                 0.214    0.102      2.09      0.042
                 60+                  –0.646    0.175     –3.68      0.001
SEX              Male                 –0.546    0.357     –1.53      0.134
ALD              Yes                   1.553    0.211      7.36    < 0.001
ED4CAT           12                    0.131    0.084      1.56      0.126
                 13–15                 0.297    0.117      2.54      0.015
                 16+                   0.242    0.152      1.59      0.118
MAR3CAT          Previously            0.418    0.111      3.78    < 0.001
                 Never                 0.017    0.130      0.13      0.894
AG4CAT × SEX     30–44 × Male          0.097    0.201      0.48      0.633
                 45–59 × Male          0.002    0.213      0.01      0.990
                 60+ × Male           –0.038    0.302     –0.13      0.901
ALD × SEX        Yes × Male           –0.200    0.242     –0.83      0.413
ED4CAT × SEX     12 × Male            –0.138    0.271     –0.51      0.614
                 13–15 × Male         –0.169    0.269     –0.63      0.534
                 16+ × Male           –0.194    0.344     –0.56      0.576
MAR3CAT × SEX    Previously × Male     0.182    0.208      0.88      0.385
                 Never × Male          0.232    0.212      1.09      0.280

Source: Analysis based on the NCS-R data.
Notes: n = 5,692. Adjusted Wald test for all parameters: F(19,24) = 17.15, p < 0.001.
(a) Reference categories for categorical predictors are: AG4CAT (18–29 yrs); SEX (female); ALD (no); ED4CAT (<12 yrs); MAR3CAT (married).

Based on test results presented in Table 8.10, we fail to reject the null

hypotheses for all four of the tests, suggesting that these two-way interactions are actually not making a significant contribution to the fit of the

model. We therefore do not consider these two-way interactions any further

and would proceed with making inferences based on the estimates from the

model presented in Table 8.6.
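For readers who want to see what the interaction estimates would have implied had they been retained, the sex-specific odds ratio for ALD is obtained by adding the interaction coefficient to the main effect before exponentiating. The Python sketch below uses the point estimates reported in Table 8.9; it is purely illustrative, since the corresponding Wald test gives no evidence that the two odds ratios differ.

```python
import math

# Point estimates as reported in Table 8.9
b_ald = 1.553          # ALD main effect (applies to females, the reference)
b_ald_x_male = -0.200  # ALD x Male interaction

or_ald_female = math.exp(b_ald)
or_ald_male = math.exp(b_ald + b_ald_x_male)

print(f"ALD odds ratio, females: {or_ald_female:.2f}")
print(f"ALD odds ratio, males:   {or_ald_male:.2f}")
```

The two point estimates bracket the pooled estimate of 4.15 from the main-effects model, which is consistent with the decision to drop the interaction.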

Table 8.10
Design-Adjusted Wald Tests of First-Order Interactions of Sex and Other Categorical Predictors in the MDE Logistic Regression Model

Interaction Term    F-Test Statistic    P-value
AG4CAT × SEX        F(3,40) = 0.25      0.863
ALD × SEX           F(1,42) = 0.68      0.413
ED4CAT × SEX        F(3,40) = 0.13      0.944
MAR3CAT × SEX       F(2,41) = 0.77      0.472

Source: Analysis based on the NCS-R data.

8.8 Comparing the Logistic, Probit, and Complementary Log–Log GLMs for Binary Dependent Variables

This chapter has focused on logistic regression techniques for modeling
π(x) for a binary dependent variable. As discussed in Section 8.2, alternative
generalized linear models for a binary dependent variable may be estimated

using the probit or CLL link function. In discussing these alternative GLMs,

we noted that inferences derived from logistic, probit and CLL regression

models should generally be consistent.
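The three models differ only in the function that maps the linear predictor to the probability π(x). The Python sketch below writes out the three inverse link functions using only the standard library; the function names are hypothetical. It shows why the same linear predictor corresponds to different probabilities under each link, so that coefficients from the three models are not on a common scale even when the resulting inferences agree.

```python
import math


def inv_logit(eta: float) -> float:
    """Logistic link: pi = 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + math.exp(-eta))


def inv_probit(eta: float) -> float:
    """Probit link: pi = standard normal CDF evaluated at eta."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))


def inv_cloglog(eta: float) -> float:
    """Complementary log-log link: pi = 1 - exp(-exp(eta))."""
    return 1.0 - math.exp(-math.exp(eta))


for eta in (-2.0, 0.0, 1.0):
    print(eta,
          round(inv_logit(eta), 3),
          round(inv_probit(eta), 3),
          round(inv_cloglog(eta), 3))
```

The logit and probit curves are symmetric about a probability of 0.5 at a linear predictor of zero, whereas the CLL curve is asymmetric.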

To illustrate, consider the results in Table 8.11 for a side-by-side comparison of estimated logistic, probit and CLL regression models. The example

used for this comparison is a model of the probability that a U.S. adult is

alcohol dependent. The data are from the NCS-R long interview (or Part 2

of the survey), and each model includes the same demographic main effects

considered in Section 8.7 for the model of MDE: SEX, AG4CAT, ED4CAT,

and MAR3CAT. The Stata commands for the estimation of the three models

follow (note the use of the char sex[omit] 2 syntax to specify the desired

omitted category for sex):

char sex[omit] 2

xi: svy: logit ald i.ag4cat i.sex i.ed4cat i.mar3cat

test _Iag4cat_2 _Iag4cat_3 _Iag4cat_4

test _Ied4cat_2 _Ied4cat_3 _Ied4cat_4

test _Imar3cat_2 _Imar3cat_3

xi: svy: probit ald i.ag4cat i.sex i.ed4cat i.mar3cat

test _Iag4cat_2 _Iag4cat_3 _Iag4cat_4

test _Ied4cat_2 _Ied4cat_3 _Ied4cat_4

test _Imar3cat_2 _Imar3cat_3

xi: svy: cloglog ald i.ag4cat i.sex i.ed4cat i.mar3cat

test _Iag4cat_2 _Iag4cat_3 _Iag4cat_4

test _Ied4cat_2 _Ied4cat_3 _Ied4cat_4

test _Imar3cat_2 _Imar3cat_3

Table 8.11 presents a summary of the estimated coefficients, standard
errors, and p-values for simple hypothesis tests of the form H0: Bj = 0.
Table 8.12 presents the results for the Wald tests of the overall age, education, and marital status effects. Note that although the coefficients and

standard errors for the probit model show the expected difference in scale
