Example 6.9: Simple Logistic Regression to Estimate the NCS-R Male/Female Odds Ratio for Lifetime Major Depressive Episode
Categorical Data Analysis
Based on this analysis, the odds that an adult male has experienced a lifetime
MDE are only 59.7% as large as the odds of MDE for adult females, which agrees
(allowing for some rounding error) with the simple direct calculation. Since the
95% CI does not include ψ = 1, we would reject the null hypothesis that MDE
status is independent of gender.
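As a minimal sketch of how an odds ratio and its confidence interval follow from a logistic regression coefficient, the Python snippet below exponentiates a log-odds coefficient and its Wald limits. The coefficient is back-calculated from the reported odds ratio of 0.597; the standard error (0.06) and the normal critical value (1.96) are illustrative assumptions, not values from the NCS-R output. A design-based analysis would typically use a t critical value with design-based degrees of freedom.

```python
import math

def odds_ratio_ci(coef, se, crit=1.96):
    """Return (odds ratio, lower limit, upper limit) from a log-odds coefficient."""
    return (math.exp(coef),
            math.exp(coef - crit * se),
            math.exp(coef + crit * se))

# Coefficient implied by the reported male/female odds ratio of 0.597.
coef_male = math.log(0.597)
# Hypothetical standard error, for illustration only.
se_male = 0.06

or_hat, lower, upper = odds_ratio_ci(coef_male, se_male)
# The null hypothesis of independence corresponds to an odds ratio of 1.
excludes_one = not (lower <= 1.0 <= upper)
print(f"OR = {or_hat:.3f}, 95% CI = ({lower:.3f}, {upper:.3f}), excludes 1: {excludes_one}")
```

With these assumed inputs the interval lies entirely below 1, which is the pattern the text describes.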
6.4.7 Bivariate Graphical Analysis
Graphical displays also are useful tools to describe the bivariate distribution
of two categorical variables. The following Stata graphics command generates gender-specific vertical bar charts for the BP_CAT variable generated
in Example 6.3 (note that the pweight option is used to specify the survey
weights, and the over() option is used to generate a plot for each level of
gender). The output is shown in Figure 6.8.
* Weighted percentage of adults in each blood pressure category, by gender
graph bar (mean) bp_cat1 bp_cat2 bp_cat3 bp_cat4 ///
[pweight=wtmec2yr] if age18p==1, blabel(bar, format(%9.1f) ///
color(none)) bar(1,color(gs12)) bar(2,color(gs4)) ///
bar(3,color(gs8)) bar(4,color(black)) ///
bargap(7) scheme(s2mono) over(riagendr) percentages ///
legend(label(1 "Normal") label(2 "Pre-Hypertensive") ///
label(3 "Stage 1 Hypertensive") ///
label(4 "Stage 2 Hypertensive")) ytitle("Percentage")
[Figure: grouped vertical bar chart. Vertical axis: Percentage (0 to 60). Groups: Male, Female. Bars within each group: Normal, Pre-Hypertensive, Stage 1 Hypertensive, Stage 2 Hypertensive.]
Figure 6.8
Bar chart of the estimated distribution of blood pressure status of U.S. adult men and women.
(Modified from the 2005–2006 NHANES data.)
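The quantity plotted in each bar is a weighted percentage: within a gender group, the sum of the survey weights in a category divided by the sum of the weights in the group. The Python sketch below illustrates that calculation only. The records, field names, and weights are hypothetical and are not NHANES data, and no design-based standard errors are computed.

```python
from collections import defaultdict

def weighted_percentages(records, group_key, category_key, weight_key):
    """Weighted percentage distribution of a categorical variable within each group."""
    totals = defaultdict(float)                      # sum of weights per group
    cells = defaultdict(lambda: defaultdict(float))  # sum of weights per group and category
    for rec in records:
        g, c, w = rec[group_key], rec[category_key], rec[weight_key]
        totals[g] += w
        cells[g][c] += w
    return {g: {c: 100.0 * w / totals[g] for c, w in cats.items()}
            for g, cats in cells.items()}

# Hypothetical respondents with hypothetical sampling weights.
sample = [
    {"gender": "Male", "bp_cat": "Normal", "wt": 3.0},
    {"gender": "Male", "bp_cat": "Pre-Hypertensive", "wt": 1.0},
    {"gender": "Female", "bp_cat": "Normal", "wt": 2.0},
    {"gender": "Female", "bp_cat": "Normal", "wt": 1.0},
    {"gender": "Female", "bp_cat": "Stage 1 Hypertensive", "wt": 2.0},
]

pct = weighted_percentages(sample, "gender", "bp_cat", "wt")
for gender, dist in pct.items():
    print(gender, {cat: round(p, 1) for cat, p in dist.items()})
```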
© 2010 by Taylor and Francis Group, LLC
Applied Survey Data Analysis
6.5 Analysis of Multivariate Categorical Data
During the past 25 years, multivariate analysis involving three (or more)
categorical variables has increasingly shifted to regression-based methods
for generalized linear models (Chapters 8 and 9). The regression framework
provides the flexibility to estimate the association of categorical responses
and factors as well as the ability to control for continuous covariates. In this
section, we briefly review the adaptation of two long-standing techniques for
the analysis of multivariate categorical data to complex sample survey data:
(1) the Cochran–Mantel–Haenszel test; and (2) simple log-linear modeling
of the expected proportions of counts in multiway tables defined by cross-classifications of categorical variables.
6.5.1 The Cochran–Mantel–Haenszel Test
Commonly used in epidemiology and related health sciences, the CMH test
permits tests of association between two categorical variables while controlling for the categorical levels of a third variable. For example, an analyst may
be interested in testing the association between a lifetime diagnosis of major
depressive episode and gender while controlling for age categories. Although
not widely available in software systems that support complex sample survey data analysis, design-based versions of the CMH test are available in
SUDAAN’s CROSSTAB procedure.
SUDAAN PROC CROSSTAB supports two alternative methods for estimating the adjusted or common odds ratio and common relative risk statistics: (1) the Mantel–Haenszel (M–H) method; and (2) the logit method. Both
methods adjust for the complex sample design and generally result in very
similar estimates of common odds ratios and relative risks.
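To make the two estimators concrete, the following Python sketch computes the classical Mantel–Haenszel and logit (inverse-variance weighted) common odds ratios from a set of stratum-specific 2 × 2 tables. It is a sketch under simple random sampling formulas, not a reproduction of SUDAAN's computations: in a design-based analysis the cell entries would be weighted estimates and the variances would reflect the complex design. The counts are hypothetical.

```python
import math

def mantel_haenszel_or(tables):
    """Mantel-Haenszel common odds ratio; each table is (a, b, c, d) for one stratum."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den

def logit_or(tables):
    """Logit-method common odds ratio: inverse-variance weighted mean of log odds ratios."""
    num = den = 0.0
    for a, b, c, d in tables:
        w = 1.0 / (1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)  # 1 / Var(log OR) under SRS
        num += w * math.log((a * d) / (b * c))
        den += w
    return math.exp(num / den)

# Hypothetical counts per stratum:
# (row 1 with outcome, row 1 without, row 2 with outcome, row 2 without).
strata = [
    (30, 170, 50, 150),
    (40, 160, 60, 140),
    (20, 180, 35, 165),
]

mh = mantel_haenszel_or(strata)
lg = logit_or(strata)
print(f"M-H common OR = {mh:.3f}, logit common OR = {lg:.3f}")
```

When the stratum-specific odds ratios are similar, as in this toy input, the two estimates are close to each other.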
Example 6.10: Using the NCS-R Data to Estimate and Test
the Association between Gender and Depression in the
U.S. Adult Population When Controlling for Age
In Examples 6.7 to 6.9, we found evidence of a significant overall association
between gender and a diagnosis of lifetime depression when analyzing the NCS-R
data, where females had greater odds of receiving a diagnosis of depression at
some point in their lives. This example is designed to test whether this association
holds in the U.S. adult population when controlling for age. Given that the CMH
test can be applied when the control variable is a categorical variable, a four-category age variable named AGECAT is constructed: 1 = ages 18–29, 2 = ages
30–39, 3 = ages 40–49, and 4 = ages 50+. The SUDAAN CROSSTAB procedure
is then run to derive the CMH test and to use both the Mantel–Haenszel and logit
methods to estimate the common age-adjusted male/female odds ratio and relative risk for MDE:
proc crosstab ;
nest sestrat seclustr ;
weight ncsrwtsh ;
class agecat sexm mde ;
tables agecat*sexm*mde ;
risk MHOR MHRR1 LOR LRR1 ;
TEST cmh chisq;
print nsum wsum rowper serow colper secol / tests=all
adjrisk=all cmhtest=all ;
run ;
Note that the sampling error codes (NEST statement) and the sampling weights
(WEIGHT statement) are identified first. Next, the CLASS statement specifies that
all three variables are categorical. In the TABLES statement, the AGECAT variable
is identified first, defining it as the categorical control variable for this analysis. We
define SEXM as the row variable and MDE as the column variable. The RISK statement then requests estimates of the Mantel–Haenszel common odds ratio and
relative risk ratio and the logit-based common odds and relative risk ratios. Finally,
the TEST statement requests the overall design-adjusted CMH test in addition to
the Wald chi-square tests, which will be performed for each age stratum. Table 6.7
summarizes the key elements of the SUDAAN output.
The value of the design-adjusted CMH test statistic in this case is $X^2_{CMH} = 92.46$,
which has p < 0.0001 on one degree of freedom, suggesting that there is still
a strong association between gender and lifetime depression after adjusting for
age. This is supported by the design-adjusted Wald chi-square statistics produced
by SUDAAN for each age group (see Equation 6.12). In each age stratum, there
is evidence of a significant association of gender with lifetime depression. After
adjusting for respondent’s age, both the M–H and logit estimates of the common
male/female odds ratio are about $\hat{\psi} \approx 0.60$. Males appear to have roughly 40%
lower odds of having a lifetime diagnosis of depression compared with females
when adjusting for age. The M–H and logit method estimates of the common
risk ratios also suggest a significant difference in the probability of depression for
the two groups, with the expected probability about 10% lower for males when
adjusting for age.

Table 6.7
SUDAAN Output for the Cochran–Mantel–Haenszel Test of MDE versus SEX, Controlling for AGECAT

Cochran–Mantel–Haenszel Test Results
$X^2_{CMH}$ = 94.26      df = 1      p < 0.0001

Age Category-Specific Wald Tests of Independence for MDE and SEX
Age Category                18–29         30–39         40–49         50+
$Q_{Wald}$                  23.47         23.22         9.12          38.28
$P(Q_{Wald} > \chi^2_1)$    p < 0.0001    p < 0.0001    p < 0.0043    p < 0.0001

Age-Adjusted Estimates of Common Male/Female Odds Ratios and Relative Risks
Statistic         $\psi_{\text{Logit}}$   $\psi_{\text{M–H}}$   $RR_{\text{M–H}}$   $RR_{\text{Logit}}$
Point estimate    0.59                    0.60                  0.91                0.91
95% CI            (0.51, 0.67)            (0.52, 0.68)          (0.88, 0.93)        (0.89, 0.93)

Source: Analysis based on the NCS-R data.
6.5.2 Log-Linear Models for Contingency Tables
A text on the analysis of survey data would not be complete without a mention of log-linear models for multiway contingency tables (Agresti, 2002;
Bishop, Fienberg, and Holland, 1975; Stokes, Davis, and Koch, 2002). Log-linear models permit analysts to study the association structure among
categorical variables. In a sense, a log-linear model for categorical data is
analogous to an analysis of variance (ANOVA) model for the cell means of
a continuous dependent variable. The dependent variable in the log-linear
model is the natural logarithm of the expected counts for cells of a multiway
contingency table. The model parameters are estimated effects associated
with the categorical variables and their interactions. For example, the following is the log-linear model under the null hypothesis of independence for
three categorical variables X, Y, and Z:
$$\log(m_{ijk}) = \mu + \lambda^X_i + \lambda^Y_j + \lambda^Z_k \tag{6.15}$$
where $m_{ijk}$ is the expected cell count under the model.
A model that includes a first-order interaction between the X and Y variables would be written as
$$\log(m_{ijk}) = \mu + \lambda^X_i + \lambda^Y_j + \lambda^Z_k + \lambda^{XY}_{ij} \tag{6.16}$$
Under simple random sampling assumptions, the cell counts are assumed
to follow a Poisson distribution and the model parameters are estimated
using the method of maximum likelihood or iterative procedures such as iterative proportional fitting (IPF). Tests of nested models (note that the model
in Equation 6.15 is nested in the model in Equation 6.16) are performed using
the likelihood ratio test.
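For the independence model in Equation 6.15, the maximum likelihood fitted counts have a closed form (the product of the three one-way marginal proportions times the total count), so no iteration is needed. The Python sketch below computes those fitted counts and the likelihood ratio statistic $G^2$ for a three-way table under simple random sampling assumptions. The cell counts are hypothetical, and the function names are chosen for illustration.

```python
import math

def fit_independence(counts):
    """Expected counts under mutual independence of X, Y, Z (Equation 6.15).

    counts maps (i, j, k) -> observed cell count for every cell of the table.
    The fitted value is n * p_i.. * p_.j. * p_..k.
    """
    n = sum(counts.values())
    margins = [{}, {}, {}]          # one-way marginal totals for X, Y, Z
    for cell, c in counts.items():
        for axis, level in enumerate(cell):
            margins[axis][level] = margins[axis].get(level, 0) + c
    return {cell: n * math.prod(margins[axis][level] / n
                                for axis, level in enumerate(cell))
            for cell in counts}

def likelihood_ratio_stat(counts, expected):
    """G^2 = 2 * sum(observed * log(observed / expected)) over nonzero cells."""
    return 2.0 * sum(c * math.log(c / expected[cell])
                     for cell, c in counts.items() if c > 0)

# Hypothetical 2 x 2 x 2 table of observed counts.
observed = {
    (0, 0, 0): 40, (0, 0, 1): 30, (0, 1, 0): 20, (0, 1, 1): 10,
    (1, 0, 0): 25, (1, 0, 1): 35, (1, 1, 0): 15, (1, 1, 1): 25,
}

expected = fit_independence(observed)
g2 = likelihood_ratio_stat(observed, expected)
# Residual degrees of freedom for independence in an I x J x K table:
# IJK - 1 - (I - 1) - (J - 1) - (K - 1), which is 4 for a 2 x 2 x 2 table.
df = 8 - 1 - 3
print(f"G^2 = {g2:.2f} on {df} df")
```

Comparing $G^2$ for two nested models (for example, Equations 6.15 and 6.16) gives the likelihood ratio test mentioned above.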
Log-linear models for SRS data can be analyzed in virtually every major
software package (e.g., SAS PROC CATMOD). Presently, the major software packages such as Stata, SAS, and SPSS that include programs for
analysis of complex sample survey data do not include a program to perform the traditional log-linear modeling. There are likely two explanations
for this omission. The first is that the general structure of the input data
(grouped cell counts of individual observations) does not lend itself readily to design-based estimation and inference. In Skinner, Holt, and Smith
(1989), Rao and Thomas discuss the extension of their design-based adjustment for chi-square test statistics to the conventional likelihood ratio tests for
log-linear models, but this technique would take on considerable programming complexity as the dimension and number of variations on the association structure in the model increase. Grizzle, Starmer, and Koch (GSK; 1969)
introduced the weighted least squares method of estimating log-linear and
other categorical data models. This generalized technique was programmed
in the GENCAT software (Landis et al., 1976) and requires the user to input
a design-based variance–covariance matrix for the vector of cell proportions
in the full cross-tabular array. Under the GSK method, tests of hypotheses
are performed using Wald statistics.
A second explanation for the scarcity of log-linear modeling software
for complex sample survey data is that log-linear models for expected cell
counts can be reparameterized as logistic regression models (Agresti, 2002)
and all of the major software systems have more advanced programs for fitting logistic and other generalized linear models to complex sample survey
data. These models will be considered in detail in Chapters 8 and 9.
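The reparameterization can be seen in the simplest case by taking the model in Equation 6.16 and treating Y as a binary response with levels j = 1, 2 (choosing Y as the response is an assumption made here for illustration). Subtracting the two log expected counts for fixed i and k gives:

```latex
\log\left(\frac{m_{i1k}}{m_{i2k}}\right)
  = \log(m_{i1k}) - \log(m_{i2k})
  = \left(\lambda^{Y}_{1} - \lambda^{Y}_{2}\right)
  + \left(\lambda^{XY}_{i1} - \lambda^{XY}_{i2}\right)
```

The terms $\mu$, $\lambda^X_i$, and $\lambda^Z_k$ cancel, leaving a logit model for Y with an intercept and an effect of X but no effect of Z.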
6.6 Exercises
1. Using the software procedure of your choice and the NCS-R data
set, estimate the row proportions in a two-way table where race/
ethnicity (RACECAT: 1 = Asian/Other, 2 = Hispanic, 3 = Black, 4 =
White) is the factor variable (or row variable) and major depressive
episode is the response variable (or column variable). Recall from the
Chapter 5 exercises that the sampling error stratum, sampling error
cluster, and final sampling weight variables in the NCS-R data set
are SESTRAT, SECLUSTR, and NCSRWTSH (Part 1 weight), respectively. Then, answer the following questions:
a. What is the value of the Rao–Scott F-statistic for the overall test
of the null hypothesis that race category and major depressive
episode are not associated? Don’t forget to report the degrees of
freedom for this design-adjusted F-statistic.
b. Under this null hypothesis, what is the p-value for the F reference
distribution?
c. Based on this test, what is your statistical decision regarding
the independence of race and MDE status in the NCS-R survey
population?
2. Repeat the analysis from Exercise 1, performing unconditional subclass analyses for men and women separately (SEX: 1 = Male, 2 =
Female). Do your inferences change when restricting the target population to only men or only women?
3. What proportion of white females in the NCS-R survey population
is estimated to have MDE? Compute a 95% CI for this proportion
that has been appropriately adjusted for the complex design.
4. Conduct a similar analysis of the association between U.S. REGION
(REGION: 1 = Northeast, 2 = North Central, 3 = South, and 4 = West)
and MDE status, estimating the row proportions in a two-way contingency table. Then, answer the following questions:
a. Is there a significant association between REGION of residence
and MDE status in the NCS-R target population? Provide the
Rao–Scott design-adjusted F-statistic (including the appropriate
degrees of freedom) and a p-value for the test statistic to support
your decision.
b. What proportion of the NCS-R survey population in the North
Central region has a diagnosis of MDE? Provide a point estimate
and 95% confidence interval for the proportion. What are the corresponding estimates of the proportion with MDE and the 95%
CI for the NCS-R population that resides in the South region?
5. Extend the analysis in Exercise 4 by conducting an analysis of the
association between REGION and MDE separately for each of the
four race groups (defined by the NCS-R variable RACECAT). Then,
answer these questions:
a. When the race groups are analyzed separately, does the association (or lack thereof) between REGION and MDE continue to
hold? Provide the design-adjusted F-statistic and p-value for each
of the race-specific analyses to support your answer.
b. If the answer to part a is no, how do you explain this pattern of
results?
7
Linear Regression Models
7.1 Introduction
Study regression. All of statistics is regression.
This quote came as a recommendation from a favorite professor to one of
the authors while he was in the process of choosing a concentration topic for
his comprehensive exam. The broader interpretation of the quote requires
placing the descriptor in quotes, “regression,” but ask individuals with backgrounds as varied as social science graduate students or quality control officers in a paper mill to decipher the statement and they will think first of the
linear regression model. Given the importance of the linear regression model
in the history of statistical analysis, the emphasis that it receives in applied
statistical training and its importance in real-world statistical applications,
the narrower interpretation is quite understandable.
This chapter introduces linear regression modeling for complex sample
survey data: how it resembles and how it differs (theoretically and procedurally) from standard ordinary least squares (OLS) regression analysis. We
assume that the reader is familiar with the basic theory and methods for
simple (single-predictor) and multiple (multiple-predictor) linear regression
analysis for continuous dependent variables. Readers interested in a comprehensive reference on the topic of linear regression are referred to Draper
and Smith (1981), Kleinbaum, Kupper, and Muller (1988), Neter et al. (1996),
DeMaris (2004), Faraway (2005), Fox (2008), or many other excellent texts on
the subject.
Focusing on practical approaches for complex sample survey data, we
emphasize “aggregated” design-based approaches to the linear regression
analysis of survey data (sometimes referred to as population-averaged modeling), where design-based variance estimates for weighted estimates of
regression parameters in finite populations are computed using nonparametric methods such as the Taylor series linearization (TSL) method, balanced repeated replication (BRR), or jackknife repeated replication (JRR).
Model-based approaches to the linear regression analysis of complex sample
survey data, which may explicitly include stratification or clustering effects
in the regression models and may or may not use the sampling weights (e.g.,
Skinner, Holt, and Smith, 1989; Pfeffermann et al., 1998), are introduced in
Chapter 12. Over the years, there have been many contributions to the survey
methodology literature comparing and contrasting these two approaches to
the regression analysis of survey data, including papers by DuMouchel and
Duncan (1983), Hansen, Madow, and Tepping (1983), and Kott (1991).
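As a rough illustration of the design-based approach, the Python sketch below computes a weighted estimate of a simple regression slope and a jackknife repeated replication (JRR) variance estimate in its delete-one-PSU form: each replicate drops one PSU, inflates the weights of the remaining PSUs in the same stratum by $n_h/(n_h - 1)$, and re-estimates the slope. The stratum codes, PSU codes, weights, and data values are hypothetical, and this is a sketch of the general technique rather than the implementation in any particular software package.

```python
def weighted_slope(rows):
    """Weighted least squares slope of y on x; rows are (x, y, weight)."""
    sw = sum(w for _, _, w in rows)
    xbar = sum(w * x for x, _, w in rows) / sw
    ybar = sum(w * y for _, y, w in rows) / sw
    sxy = sum(w * (x - xbar) * (y - ybar) for x, y, w in rows)
    sxx = sum(w * (x - xbar) ** 2 for x, _, w in rows)
    return sxy / sxx

def jrr_variance(data):
    """Delete-one-PSU jackknife variance of the weighted slope.

    data: list of (stratum, psu, x, y, weight); each stratum needs at least two PSUs.
    Returns (full-sample slope, JRR variance estimate).
    """
    full = weighted_slope([(x, y, w) for _, _, x, y, w in data])
    psus = {}
    for h, j, *_ in data:
        psus.setdefault(h, set()).add(j)
    var = 0.0
    for h, members in psus.items():
        n_h = len(members)
        for drop in members:
            rep = []
            for s, p, x, y, w in data:
                if s == h and p == drop:
                    continue                    # delete this PSU
                if s == h:
                    w = w * n_h / (n_h - 1)     # reweight remaining PSUs in the stratum
                rep.append((x, y, w))
            var += (n_h - 1) / n_h * (weighted_slope(rep) - full) ** 2
    return full, var

# Hypothetical sample: 2 strata, 2 PSUs per stratum, 2 respondents per PSU.
toy = [
    (1, 1, 1.0, 2.1, 10.0), (1, 1, 2.0, 3.9, 10.0),
    (1, 2, 3.0, 6.2, 12.0), (1, 2, 4.0, 7.8, 12.0),
    (2, 1, 5.0, 10.1, 8.0), (2, 1, 6.0, 12.3, 8.0),
    (2, 2, 7.0, 13.8, 9.0), (2, 2, 8.0, 16.1, 9.0),
]

slope, var = jrr_variance(toy)
print(f"slope = {slope:.3f}, JRR standard error = {var ** 0.5:.4f}")
```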
We present a brief history of important statistical developments in linear
regression analysis of complex sample survey data to begin this chapter.
Kish and Frankel (1974) were two of the first to empirically study and discuss
the impact of complex sample designs on inferences related to regression
coefficients. Fuller (1975) derived a linearization-based variance estimator
for multiple regression models with unequal weighting of observations and
introduced variance estimators for estimated regression parameters under
stratified and two-stage sampling designs. Shah, Holt, and Folsom (1977)
further discussed the violations of standard linear model assumptions when
fitting linear regression models to complex sample survey data, discussed
appropriate methods for making inferences about linear regression parameters estimated using survey data, and presented an empirical evaluation of
the performance of variance estimators based on Taylor series linearization.
Binder (1983) focused on the sampling distributions of estimators for
regression parameters in finite populations and defined related variance estimators. Skinner et al. (1989, Sections 3.3.4 and 3.4.2) summarized estimators
of the variances for regression coefficients that allowed for complex designs
(including linearization estimators) and recommended the use of linearization methods or other robust methods (e.g., JRR) for variance estimation.
Kott (1991) further discussed the advantages of using variance estimators
based on Taylor series linearization for estimates of linear regression parameters: protection against within-PSU correlation of random errors, protection
against possible nonconstant variance of the random errors, and the fact that
a within-PSU correlation structure does not need to be identified to have
a nearly unbiased estimator. Fuller (2002) provided a modern summary of
regression estimation methods for complex sample survey data.
7.2 The Linear Regression Model
Regression analysis is a study of the relationships among variables: a dependent variable and one or more independent variables. Figure 7.1 illustrates
a simple linear regression model of the relationship of a dependent variable,
y, and a single independent variable x. The regression relationship among
the observed values of y and x is expressed as a regression model, for example, $y = \beta_0 + \beta_1 x + \varepsilon$, where y is the dependent variable, x is the independent
variable, $\beta_0$ and $\beta_1$ are model parameters, and $\varepsilon$ is an error term that reflects
the difference between the observed value of y and its conditional expectation under the model, $\varepsilon = y - \hat{y} = y - \beta_0 - \beta_1 x$.

[Figure: plot of the fitted regression line. Vertical axis: Temperature (Degrees F), 0 to 120. Horizontal axis: Ozone Concentration (PPM), 0 to 40. Fitted line: Temperature = 45.16 + 1.41 * Ozone; R-Square = 0.61.]
Figure 7.1
Linear regression of air temperature on ozone level.
In statistical practice, a fitted regression model may be used to simply
predict the expected outcome for the dependent variable based on a vector
of independent variable measurements x, $E(y \mid \mathbf{x}) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_p x_p$, or
to explore the functional relationship of y and x. Across the many scientific disciplines that use regression analysis methods, dependent variables
may also be referred to as response variables, regressands, outcomes, or
even “left-hand-side variables.” Independent variables may be labeled
as predictors, regressors, covariates, factors, cofactors, explanatory variables, or “right-hand-side variables.” We primarily refer to response variables and predictor variables in this chapter, but other terms can be used
interchangeably.
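The prediction use of a fitted model can be sketched with the fitted line reported in Figure 7.1 (Temperature = 45.16 + 1.41 × Ozone). The Python function below simply evaluates that line; the function name and the ozone values passed to it are chosen for illustration.

```python
def predict_temperature(ozone_ppm, intercept=45.16, slope=1.41):
    """Expected temperature (degrees F) from the fitted line shown in Figure 7.1."""
    return intercept + slope * ozone_ppm

# Illustrative ozone concentrations within the plotted range (0 to 40 PPM).
for ozone in (0.0, 10.0, 20.0):
    print(f"Ozone {ozone:5.1f} -> expected temperature {predict_temperature(ozone):.2f}")
```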
This chapter will focus on the broad class of regression models known as
linear models, or models for which the conditional expectation of y given x,
E(y | x), is a linear function of the unknown parameters. Consider the following three specifications of linear models:
$$y = \beta_0 + \beta_1 x + \varepsilon \tag{7.1}$$
Note in this model that the dependent variable, y, is a linear function of the
unknown parameters and the independent variable x:
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon \tag{7.2}$$
In this model (Equation 7.2), the response variable y is still a linear function
of the β parameters for x and $x^2$; however, the linear model defines a nonlinear
relationship between y and x:
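$$y = \mathbf{x}\boldsymbol{\beta} + \varepsilon = \sum_{j=0}^{p} \beta_j x_j + \varepsilon = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon \tag{7.3}$$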
Here the linear model is first expressed in vector notation. Vector notation
may be used as an abbreviation to represent a complex model with many parameters and to facilitate computations using the methods of matrix algebra.
When specifying linear regression models, it is useful to be able to reference specific observations on the subjects in a survey data set:
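$$y_i = \mathbf{x}_i \boldsymbol{\beta} + \varepsilon_i \tag{7.4}$$
where $\mathbf{x}_i = [1 \; x_{1i} \; \ldots \; x_{pi}]$ and $\boldsymbol{\beta}^T = [\beta_0 \; \beta_1 \; \ldots \; \beta_p]$.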
In this notation, i refers to sampled element (or respondent) i in a given survey data set.
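To show how the vector notation supports computation, the Python sketch below builds the rows $\mathbf{x}_i$ for the quadratic model in Equation 7.2 and estimates $\boldsymbol{\beta}$ by solving the least squares normal equations $(X^T X)\boldsymbol{\beta} = X^T y$. The normal equations are the standard unweighted OLS solution and are used here as an assumption for illustration; the data are hypothetical and generated without error so that the known coefficients are recovered.

```python
def solve(a, b):
    """Solve the linear system a z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(x_rows, y):
    """Least squares estimate of beta in y_i = x_i beta + e_i via (X'X) beta = X'y."""
    p = len(x_rows[0])
    xtx = [[sum(r[j] * r[k] for r in x_rows) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yi for r, yi in zip(x_rows, y)) for j in range(p)]
    return solve(xtx, xty)

# Hypothetical data generated exactly from y = 1 + 2x + 0.5x^2 (no error term).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x + 0.5 * x ** 2 for x in xs]

# Each row is x_i = [1, x, x^2]: the model is linear in beta although quadratic in x.
design = [[1.0, x, x ** 2] for x in xs]
beta = ols(design, ys)
print([round(b, 6) for b in beta])
```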
7.2.1 The Standard Linear Regression Model
Standard procedures for unbiased estimation and inference for the linear
regression model involve the following assumptions:
1. The model for E(y | x) is linear in the parameters (see Equation 7.2).
2. Correct model specification: in short, the model includes the true
main effects and interaction terms to accurately reflect the true
model under which the data were generated.
3. $E(\varepsilon_i \mid \mathbf{x}_i) = 0$, or that the expected value of the residuals given a set of
values on the predictor variables is equal to 0.
4. Homogeneity of variance: $\mathrm{Var}(\varepsilon_i \mid \mathbf{x}_i) = \sigma^2_{y \cdot x}$, or that the variance of
the residuals given values on the predictor variables is a constant
parameter equal to $\sigma^2_{y \cdot x}$.
5. Normality of residuals (and also y): for continuous outcomes, we
assume that $\varepsilon_i \mid \mathbf{x}_i \sim N(0, \sigma^2_{y \cdot x})$, or that given values on the predictor
variables, the residuals are independently and identically distributed (i.i.d.) as normal random variables with mean 0 and constant
variance $\sigma^2_{y \cdot x}$.
6. Independence of residuals: As a consequence of the previous point,
$\mathrm{Cov}(\varepsilon_i, \varepsilon_j \mid \mathbf{x}_i, \mathbf{x}_j) = 0$, $i \neq j$, or residuals on different subjects are uncorrelated given values on their predictor variables.
There are several implications of these standard model assumptions. First,
we can write
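$$\hat{y} = E(y \mid \mathbf{x}) = E(\mathbf{x}\boldsymbol{\beta}) + E(\varepsilon) = \mathbf{x}\boldsymbol{\beta} + 0 = \mathbf{x}\boldsymbol{\beta} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p \tag{7.5}$$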
This equation for the predicted value of y is the regression function, or the
expected value of the dependent variable y conditional on a set of values on
the predictor variables (of which there are p). Further, we can write
$$\mathrm{Var}(y_i \mid \mathbf{x}_i) = \sigma^2_{y \cdot x} \tag{7.6}$$
$$\mathrm{Cov}(y_i, y_j \mid \mathbf{x}_i, \mathbf{x}_j) = 0 \tag{7.7}$$
These assumptions, therefore, imply that the dependent variable has constant variance given values on the predictors and that no two values on the
dependent variable are correlated given values on the predictors. Putting all
of the implications together, we have
$$y_i \sim N(\mathbf{x}_i \boldsymbol{\beta}, \sigma^2_{y \cdot x}) \tag{7.8}$$
Values on the dependent variable, y, are therefore assumed to be i.i.d. normally distributed random variables with a mean defined by the linear combination of the parameters and the predictor variables and a constant variance.
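The implication in Equation 7.8 can be checked informally by simulation. The Python sketch below draws independent values $y_i \sim N(\mathbf{x}_i \boldsymbol{\beta}, \sigma^2_{y \cdot x})$ and confirms that the residuals $y_i - \mathbf{x}_i \boldsymbol{\beta}$ have mean near 0 and standard deviation near $\sigma_{y \cdot x}$. The parameter values, predictor values, sample size, and seed are all hypothetical choices made for illustration.

```python
import random
import statistics

def simulate_y(x_rows, beta, sigma, rng):
    """Draw y_i ~ N(x_i beta, sigma^2) independently for each row x_i."""
    return [rng.gauss(sum(b * x for b, x in zip(beta, row)), sigma) for row in x_rows]

rng = random.Random(2010)      # fixed seed so the sketch is reproducible
beta = [2.0, 0.5]              # hypothetical beta_0 and beta_1
sigma = 1.0                    # hypothetical residual standard deviation
x_rows = [[1.0, float(i % 10)] for i in range(5000)]   # rows x_i = [1, x_1i]

y = simulate_y(x_rows, beta, sigma, rng)
residuals = [yi - sum(b * x for b, x in zip(beta, row)) for yi, row in zip(y, x_rows)]

resid_mean = statistics.mean(residuals)
resid_sd = statistics.pstdev(residuals)
print(f"residual mean = {resid_mean:.3f}, residual SD = {resid_sd:.3f}")
```

Under a complex sample design, the independence part of this assumption is the one most likely to fail, because observations from the same cluster tend to be correlated.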
7.2.2 Survey Treatment of the Regression Model
Since the late 1940s and early 1950s when economists and sociologists
(Kendall and Lazarsfeld, 1950; Klein and Morgan, 1951) first applied regression analysis to complex sample survey data, survey statisticians have sought
to relate design-based estimation of regression relationships to the standard
linear model. The result was the linked concepts of a finite population and
the superpopulation model, which are described in more detail in Chapter 3
(also see Theory Box 7.1).