12.3 Generalized Linear Mixed Models (GLMMs) in Survey Data Analysis
Applied Survey Data Analysis
[Figure 12.1: line plot of Total Assets (Thousands of Dollars; axis from 0 to 15,000) against Year (2000 to 2006), with one line for each of five households: hhidpn = 10038010, 10059030, 10004010, 10050010, and 10109030.]
Figure 12.1
Multiple line plot showing trends in total household assets from 2000 to 2006 for a small subsample of five HRS 2006 households.
ordinal, and count-type data. Generalized linear mixed models are subject-specific models that analyze the influence of both fixed effects and random
effects on individual attributes. The label subject-specific refers to the fact
that GLMMs model the effect on the individual unit (see Figure 12.1),
explicitly controlling for the random effects of randomly sampled individuals or clusters of observations. This is in contrast to the marginal modeling
or population-averaged approach employed by the generalized estimating equations (GEE) technique. GEE is an alternative for the
analysis of repeated measures or other dependent sets of observations on
population units. It does not explicitly incorporate random subject effects;
instead, it adjusts standard errors for the clustering present in the observations in
a manner similar to Taylor series linearization (TSL), using robust sandwich
estimates of standard errors.
Following Fitzmaurice, Laird, and Ware (2004), a general expression for the
GLMM is
$$
g\{E(Y_{it} \mid b_i)\} = \eta_{it} = X_{it}\beta + Z_{it}b_i \tag{12.1}
$$
where:
$g(\cdot)$ = a known link function, often identity (Normal) but also log (Poisson) or logit (logistic) (see Chapters 8 and 9);
$X_{it}$ = row vector of $j = 1, \ldots, p$ covariates for subject $i$, at observation $t = 1, \ldots, T$;
$\beta$ = vector of $j = 1, \ldots, p$ fixed effects regression parameters;
$Z_{it}$ = row vector of $k = 1, \ldots, q$ covariates for subject $i$, at observation $t = 1, \ldots, T$;
$b_i$ = vector of $k = 1, \ldots, q$ random effects.
© 2010 by Taylor and Francis Group, LLC
Advanced Topics in the Analysis of Survey Data
From this generic expression, more exact expressions for individual models may be derived. For example, a simple random intercept model for
change in a continuous random variable, y, over times t = 1, …, T can be
written as follows:
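$$
\begin{aligned}
y_{it} &= X_{it}\beta + b_i + e_{it} \\
       &= (\beta_0 + b_i) + \beta_1 x_{it,1} + \cdots + \beta_p x_{it,p} + e_{it}
\end{aligned} \tag{12.2}
$$
where:
$b_i \sim N(0, \tau^2)$ = random effect for individual $i$;
$e_{it} \sim N(0, \sigma^2)$ = random error term; and
$\mathrm{cov}(b_i, e_{it}) = 0$.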
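To make the roles of τ² and σ² in Equation 12.2 concrete, the following minimal sketch simulates data from a random intercept model with a single covariate and then recovers the two variance components with simple method-of-moments (one-way ANOVA) estimators. The function names, the use of the wave index as the covariate, the balanced design, and the treatment of β1 as known are all illustrative assumptions, not part of the original text; a real analysis would use mixed model software.

```python
import random
import statistics


def simulate_random_intercept(n_subjects, n_times, beta0, beta1, tau, sigma, seed=0):
    """Simulate y_it = (beta0 + b_i) + beta1 * x_it + e_it for a balanced panel."""
    rng = random.Random(seed)
    data = {}
    for i in range(n_subjects):
        b_i = rng.gauss(0.0, tau)            # random intercept, b_i ~ N(0, tau^2)
        data[i] = []
        for t in range(n_times):
            x_it = float(t)                  # hypothetical covariate: wave index
            e_it = rng.gauss(0.0, sigma)     # random error, e_it ~ N(0, sigma^2)
            data[i].append((x_it, beta0 + b_i + beta1 * x_it + e_it))
    return data


def variance_components(data, beta1):
    """Method-of-moments estimates of (tau^2, sigma^2), treating beta1 as known."""
    # Remove the fixed covariate effect so that only beta0 + b_i + e_it remains.
    resid = [[y - beta1 * x for x, y in obs] for obs in data.values()]
    n_times = len(resid[0])
    means = [statistics.fmean(r) for r in resid]
    # The within-subject mean square estimates sigma^2.
    within_ss = sum((v - m) ** 2 for r, m in zip(resid, means) for v in r)
    sigma2_hat = within_ss / (len(resid) * (n_times - 1))
    # The variance of subject means estimates tau^2 + sigma^2 / T.
    tau2_hat = statistics.variance(means) - sigma2_hat / n_times
    return tau2_hat, sigma2_hat


panel = simulate_random_intercept(2000, 4, beta0=10.0, beta1=0.5, tau=2.0, sigma=1.0, seed=1)
tau2_hat, sigma2_hat = variance_components(panel, beta1=0.5)
print(round(tau2_hat, 2), round(sigma2_hat, 2))
```

With τ = 2 and σ = 1, the two estimates should land near 4 and 1, illustrating how the total variability separates into a between-individual and a within-individual part.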
The notation in Equation 12.1 suggests a GLMM for modeling multiple
dependent observations on a single individual, such as longitudinal measures
in a panel survey (Fitzmaurice et al., 2009). However, by using slightly different subscripts, we can show that the GLMM also represents a model where
the dependency is not among repeated observations on a single individual
but among units that are clustered in a hierarchical data set. Consider for
example a survey such as the National Assessment of Educational Progress
(NAEP), where measures on individual students are nested within schools.
The random intercept model can be rewritten to model fixed effects on individual student test performance, yi(s), while accounting for the random effects
of the school clusters:
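$$
\begin{aligned}
y_{i(s)} &= X_{i(s)}\beta + b_s + e_{i(s)} \\
         &= (\beta_0 + b_s) + \beta_1 x_{i(s),1} + \cdots + \beta_p x_{i(s),p} + e_{i(s)}
\end{aligned} \tag{12.3}
$$
where:
$b_s \sim N(0, \tau^2)$ = random effect for school $s$;
$e_{i(s)} \sim N(0, \sigma^2)$ = random error term; and
$\mathrm{cov}(b_s, e_{i(s)}) = 0$.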
The literature generally refers to models such as Equation 12.3 that incorporate dependence among observations due to hierarchical clustering of
the ultimate observational units as hierarchical linear models or multilevel models (Goldstein, 2003; Raudenbush and Bryk, 2002). Because the
dependence among observations results from hierarchical levels of clustering, HLMs are often described by decomposing the fixed and random
components for the overall GLMM into models for the individual levels. For
example, consider the simple model (Equation 12.3) for test scores where
clusters of students are nested within schools. The level 1 model is the
model of the student test outcome, controlling for the effects (which may be
school specific) of covariates for that student. Note that the following level 1
model allows the intercept and the effects of the student-level covariates to
be unique to the sampled schools:
Level 1:
$$
y_{i(s)} = \beta_{0(s)} + \beta_{1(s)} x_{i(s),1} + \cdots + \beta_{p(s)} x_{i(s),p} + e_{i(s)} \tag{12.4}
$$
Level 2 of the multilevel specification defines equations for each of the
random school-specific coefficients in the level 1 model. In the case of
Equation 12.3, only the intercept randomly varies across sampled schools.
For example, suppose that the school-specific intercept is defined by the
fixed overall intercept, β0, and the random school effect, bs, while the
effects of the covariates are fixed across schools (i.e., defined by fixed effect
parameters only):
Level 2:
$$
\begin{aligned}
\beta_{0(s)} &= \beta_0 + b_s \\
\beta_{1(s)} &= \beta_1 \\
             &\;\;\vdots \\
\beta_{p(s)} &= \beta_p
\end{aligned} \tag{12.5}
$$
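As a brief check on how the two levels fit together, substituting the level 2 equations for the school-specific coefficients into the level 1 model gives back the single-equation random intercept model. The worked step below uses only the symbols already defined in Equations 12.3 through 12.5.

```latex
\begin{aligned}
y_{i(s)} &= \beta_{0(s)} + \beta_{1(s)} x_{i(s),1} + \cdots + \beta_{p(s)} x_{i(s),p} + e_{i(s)}
            && \text{(level 1, Equation 12.4)} \\
         &= (\beta_0 + b_s) + \beta_1 x_{i(s),1} + \cdots + \beta_p x_{i(s),p} + e_{i(s)}
            && \text{(substituting level 2, Equation 12.5)}
\end{aligned}
```

The second line is exactly Equation 12.3. If a slope were also allowed to vary across schools, its level 2 equation would carry its own random term, and the substitution would produce an additional random effect multiplying the corresponding covariate.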
A complete treatment of generalized linear mixed models is beyond the
scope of this intermediate-level text on applied survey data analysis. Many
texts describe the theory and applications of GLMMs in detail, including (in
no particular order) Diggle et al. (2002), Gelman and Hill (2007), Fitzmaurice et
al. (2004), Verbeke and Molenberghs (2000), Molenberghs and Verbeke (2005),
West, Welch, and Galecki (2007), Raudenbush and Bryk (2002), and Goldstein
(2003). Our objective here is to acquaint the reader with generalized linear
mixed models, to introduce several key issues in the application of these
models to complex sample survey data, and to describe current research
findings that address these issues. The introduction to these models and the
discussion will be developed around a specific GLMM—a model of longitudinal change in the net worth of individual households. Because repeated
measures and growth curve models of this type share their GLMM genes
with HLMs and multilevel models, the analytic issues and recommended
solutions described in the following sections are easily extended to multilevel forms. Throughout the following discussion, we will emphasize the
common roots by referring to individual observations at a specific time as
level 1 observations and the individual respondent as the level 2 observational unit.
12.3.2 Generalized Linear Mixed Models and
Complex Sample Survey Data
GLMMs are specifically designed to address nonindependence or “dependency” of observations. In HLMs the dependency arises because observational units are hierarchically clustered, such as students within classrooms
and classrooms within schools. In repeated measures or longitudinal models, the dependency arises because observations are clustered within individuals, as with daily diaries of food intake for National Health and
Nutrition Examination Survey (NHANES) respondents, or longitudinal
measures of household assets for HRS panel respondents.
This problem of lack of independence for observations should sound
familiar. The design-based survey data analysis techniques that have been
the core subject of this book are constructed to address the intraclass correlation among observations in sample clusters. It is natural to draw the analogy
between these two estimation problems. Take, for example, two parallel sets
of data. The first data set records scores on a test administered to students
who are clustered within classrooms, schools, and school districts. The second data set includes test scores for the same standardized test administered
to a probability sample of students in their homes, where households were
selected within area segments and primary sampling units (PSUs) of a multistage national sample design. In both cases, there is a hierarchical ordering
of units: districts, schools, classes, and students in the first case; and PSUs,
area segments, households, and students in the second case.
If the survey analyst were interested only in inferences concerning national
student performance on the standardized test, robust inferences could be
obtained using the design-based estimates of population parameters (e.g.,
the mean test performance) and their standard errors. School districts would
define the ultimate clusters in the first data set, and PSUs would form the
ultimate clusters of the second. However, the survey analyst using the first
data set may have broader analytic goals. Specifically, he or she may be interested in estimating the proportion of variance in test scores that is attributable to the student, the class, the school, and the district. The analyst working
with the second data set may not be especially interested in the components
of variance associated with PSUs, secondary sampling units (SSUs) within
PSUs, households, and individuals. The first analyst will likely choose an
HLM form of the generalized linear mixed model. The second analyst will
be satisfied with a standard design-based regression analysis in which the
Taylor series linearization or replication estimates of overall sampling error
are used to develop confidence intervals and test statistics.
Consider another problem in repeated measures analysis in which a
simple random sample of individuals is asked to complete a daily diary of
hay fever symptoms for each of t = 1, …, 7 consecutive days. The dependent
variable for each daily measurement is whether the subject experienced hay
fever symptoms (yit = 1) or did not (yit = 0). The independent variables in the
analysis might include age, gender, daily pollen count, and an indicator of
previous allergy diagnosis. To estimate the relationship of hay fever symptoms to pollen count levels, the survey analyst could choose among three
approaches that all use variants of logistic regression modeling:
1. A GLMM in which the logit[P(yit = 1)] is modeled as a function of
fixed effects that include the constant effects of age, gender and
time t pollen count, and random effects of individuals in which the
repeated diary measures are clustered.
2. A GEE model in which the logit model relating symptoms to the
covariates is estimated using GEE methods and the covariance
matrix for the model coefficients is separately estimated using the
robust Huber-White sandwich estimator (Diggle et al., 2002).
3. A design-based analysis using a program such as Stata svy: logit
with a single record for each daily report and the individual as the
“cluster.”
The GLMM analysis is a subject-specific analysis, explicitly controlling for
the random subject effects, and would yield estimates of the fixed effects (or
regression parameters) associated with the covariates in addition to estimates
of the variances of the random subject effects. The GEE and svy: logit
approaches are population-averaged (or marginal) modeling techniques and
would provide comparable estimates of robust standard errors for the logistic regression coefficients for the covariates. The GEE and design-based population-averaged approaches would not separately estimate the variances of
the random subject effects or their contributions to the total sampling variability. This is the key distinction between these alternative approaches to
analyzing clustered or longitudinal data: GLMMs enable analysts to make
inferences about between-subject (or between-cluster) variance, based on the
variances of the subject-specific (or cluster-specific) random effects explicitly
included in the models, while GEE and design-based modeling approaches
are concerned only with overall estimates of parameters and their total sampling variance.
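The practical difference between subject-specific and population-averaged coefficients can be illustrated numerically. The sketch below assumes a random intercept logit model with hypothetical parameter values (β0 = -1, β1 = 0.8, τ = 2) and averages the subject-specific probabilities over the random intercept distribution with a simple trapezoid rule. The function names and the integration settings are illustrative assumptions, not part of the original text.

```python
import math


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    return math.log(p / (1.0 - p))


def marginal_probability(eta, tau, n_grid=2001, width=8.0):
    """Average expit(eta + b) over b ~ N(0, tau^2) with a trapezoid rule."""
    if tau == 0.0:
        return expit(eta)
    lo, hi = -width * tau, width * tau
    step = (hi - lo) / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        b = lo + k * step
        density = math.exp(-0.5 * (b / tau) ** 2) / (tau * math.sqrt(2.0 * math.pi))
        end_weight = 0.5 if k in (0, n_grid - 1) else 1.0
        total += end_weight * expit(eta + b) * density * step
    return total


# Hypothetical subject-specific (conditional) parameters.
beta0, beta1, tau = -1.0, 0.8, 2.0

# Population-averaged change in log-odds when x moves from 0 to 1.
marginal_slope = (logit(marginal_probability(beta0 + beta1, tau))
                  - logit(marginal_probability(beta0, tau)))
print(round(beta1, 3), round(marginal_slope, 3))
```

Under these assumptions the population-averaged change in log-odds is smaller in magnitude than the subject-specific coefficient, which is why GLMM and GEE (or design-based) estimates of the "same" logistic coefficient are not expected to agree when the random effect variance is substantial.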
It is increasingly common to see survey samples designed to be optimal for
analysis under a GLMM-type model. For example, a multistage national probability sample of school districts, schools, classrooms, and students could be
consistent with a GLMM model that would enable the education researcher
to study the influence of each level in this hierarchy on student outcomes.
To achieve data collection efficiency, the sampling statistician designing the
sample of individuals for the hay fever study could select a primary stage
sample of U.S. counties and then a clustered sample of individuals within
each primary stage county. The result would be a three-level data set with
days nested within individuals and individuals nested within counties.
However, even when care is taken to design a probability sample to support multilevel or longitudinal analysis using GLMM-type models, several
theoretical and practical issues should be addressed in the application of
these models to complex sample survey data:
1. Stratification. Using the terminology introduced in Section 12.2, if the
stratification used in the sample selection (e.g., regions, urban/rural
classification, population size) is informative—that is, associated
with the survey variables of interest—the stratum identifiers, or at
a minimum the major variables used to form the strata, will need to
be included as fixed effects in the model.
2. Clustering. In the general sample design context, clustering of population elements serves to reduce the costs of data collection. The
increase in sampling variance attributable to the intraclass correlation of characteristics within the cluster groupings is considered a
“nuisance,” inflating standard errors of estimates to no particular
analytical benefit. In the context of a specific analysis involving a
hierarchical linear (or multilevel) model, the clustering of observations is necessary to obtain stable estimates of the effects and variance components at each level of the hierarchy. Ideally, the sample
design clusters may be integrated into the natural hierarchy of the
GLMM, and the cluster effects on outcomes can be directly modeled using additional levels of random effects—say, nesting students
within classrooms, classrooms within schools, and schools within
county PSUs.
3. Weighting. Conceptually, one of the more difficult problems in the
application of GLMMs to complex sample survey data is how to
handle the survey weights. Theoretically, if the weights were noninformative, they could be safely omitted from the model estimation.
If the sample design and associated weights are informative for the
analysis, the variables used to build the weights or the weight values themselves could be included as fixed effects in the model. But
there are other complications with weighting in GLMMs. Attrition
adjustments that are often included in longitudinal survey weights
can yield different weight values for each measurement time point.
Clusters in a hierarchical sample design may not enter the sample
with equal probability. Even in a national equal probability multistage sample of students, the most efficient samples require that
counties and schools are selected with probability proportionate to size (PPS). Conditional on a given
stage of sampling, the observed units would enter with varying
probability.
The following section uses the example of estimating longitudinal change
in individual attributes over time to illustrate some of the solutions that
have been proposed for incorporating complex sample design effects into a
GLMM analysis. Section 12.3.4 presents an example analysis in which longitudinal growth in HRS households’ assets from 2000 to 2006 is modeled
using a simple random intercept model.
12.3.3 GLMM Approaches to Analyzing Longitudinal Survey Data
A variety of longitudinal survey designs may be used to collect data over
time. Repeated cross-sectional surveys such as the NHANES, rotating
panel designs such as the Current Population Survey, and true panel survey
designs such as the Panel Study of Income Dynamics (PSID; Hill, 1992) and
the HRS (Juster and Suzman, 1995) all play an important role in understanding societal- and individual-level change over time. Binder (1998) and Kalton
and Citro (1993) provide sound overviews of the key issues associated with
longitudinal sample designs, both at the design and analysis stages. Menard
(2008) and Lynn (2009) address design, measurement, and analysis for longitudinal surveys.
Panel surveys, with repeated measurements on individuals over multiple
waves of data collection, enable longitudinal analysis of individual-level
change over time. The simplest type of analysis is the analysis of individual
change over two points in time denoted by t and t – k, for example,
$$
\delta_{t-k,t} = \frac{\sum_i w_i \, (y_{t,i} - y_{t-k,i})}{\sum_i w_i} \tag{12.6}
$$
where $w_i$ is a population weight for sample element $i$; or,
$$
\hat{y}_{t,i} = \hat{\beta}_0 + \hat{\gamma}\, y_{t-k,i} + \hat{\beta}_1 x
$$
where $\hat{\gamma}$ and $\hat{\beta}_1$ are weighted estimates of regression parameters.
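The weighted mean change in Equation 12.6 is straightforward to compute directly. The following minimal sketch assumes that change is taken as the later value minus the earlier value and that each element carries a single positive weight; the function name and the toy values are hypothetical. It returns a point estimate only and does not address variance estimation, which requires the design-based methods discussed elsewhere in the book.

```python
def weighted_mean_change(y_earlier, y_later, weights):
    """Weighted mean of individual changes (later minus earlier) between two waves."""
    if not (len(y_earlier) == len(y_later) == len(weights)):
        raise ValueError("all inputs must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(w * (y1 - y0) for y0, y1, w in zip(y_earlier, y_later, weights))
    return weighted_sum / total_weight


# Two sample elements: changes of 2 and 6, with weights 1 and 3.
print(weighted_mean_change([10.0, 20.0], [12.0, 26.0], [1.0, 3.0]))  # (1*2 + 3*6) / 4 = 5.0
```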
We considered an example of analyzing the mean change between two
points in time in Example 5.13.
GLMMs can extend these simple “two-point” change models to examine
trends in repeated measures or growth curves over multiple observations
on the individual sample persons. Our aim in this section is to present a
synthesis of the existing literature on suggested approaches to using GLMMs
for longitudinal analysis of complex sample survey data and to end with a
practical example using existing statistical software that is capable of fitting
some of the proposed models.
We first describe general notation for models that can be fitted to longitudinal survey data arising from complex sample designs. Let yit be the
response for individual i at time t, which may vary randomly over time.
We assume that randomly sampled individual i has some “permanent” status that partly defines the observed responses, which can be denoted by bi.
These values for the randomly sampled individuals are assumed to arise
from a normal distribution with a fixed mean (for the population) of 0 and
a variance of $\tau^2$. Then, observed values for $y_{it}$ conditional on a randomly
sampled value of $b_i$ are assumed to arise from a normal distribution with
mean $\beta_t + b_i$ (where $\beta_t$ is the population mean of the response at time $t$) and
variance $\sigma^2$. More succinctly,
$$
\begin{aligned}
b_i &\sim N(0, \tau^2) \\
y_{it} \mid b_i &\sim N(\beta_t + b_i, \sigma^2)
\end{aligned} \tag{12.7}
$$
Thinking in a multilevel modeling context, $\tau^2$ represents between-individual variance, while $\sigma^2$ represents within-individual variance over time. The
variance of $y_{it}$ is thus partitioned into variance across individuals and transitory variance and can be written as $\tau^2 + \sigma^2$.
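The partition can be written out as a short derivation. The sketch below defines the within-individual deviation as $e_{it} = y_{it} - \beta_t - b_i$ and assumes, for this illustration only, that these deviations are independent of $b_i$ and independent across waves; the last line is the familiar intraclass correlation, which is a standard consequence of the model rather than a result stated in the text.

```latex
\begin{aligned}
\mathrm{var}(y_{it}) &= \mathrm{var}(b_i) + \mathrm{var}(e_{it}) = \tau^2 + \sigma^2, \\
\mathrm{cov}(y_{it}, y_{it'}) &= \mathrm{var}(b_i) = \tau^2 \quad (t \neq t'), \\
\mathrm{corr}(y_{it}, y_{it'}) &= \frac{\tau^2}{\tau^2 + \sigma^2}.
\end{aligned}
```

Under these assumptions, two observations on the same individual are correlated only through the shared random effect, and the correlation grows as the between-individual variance becomes large relative to the within-individual variance.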
A simple model for an outcome y measured over time can then be written
as follows:
$$
y_{it} = \beta_t + b_i + e_{it} \tag{12.8}
$$
In this model, t = 1, …, T, where T represents the total number of waves
(or panels), i = 1, …, n (persons), βt is the population mean of the outcome
at time t, bi is a long-term difference from βt for person i (a random variable
corresponding to a random individual effect), and eit is a transitory effect at
time t for person i (each individual will randomly vary around his or her
permanent state at time t). It is important to note that the eit values will be
correlated over time. One possible model expressing this correlation could
be the first-order autoregressive [AR(1)] model:
$$
e_{it} = \rho\, e_{i(t-1)} + \varepsilon_{it} \tag{12.9}
$$
The model in Equation 12.8 can thus be rewritten as follows, using the
AR(1) model for the transitory effects:
$$
y_{it} = \beta_t + b_i + \rho\, e_{i(t-1)} + \varepsilon_{it} \tag{12.10}
$$
In this specification, $\varepsilon_{it}$ represents random measurement error, and the $b_i$
and $\varepsilon_{it}$ are assumed to be independent. We also assume that $E(b_i) = E(\varepsilon_{it}) = 0$,
$\mathrm{var}(b_i) = \tau^2$, $\mathrm{var}(\varepsilon_{it}) = \sigma_\varepsilon^2$, and $e_{it}$ and $\varepsilon_{it}$ are mutually independent and stationary. We can then write
$$
\mathrm{var}(e_{it}) = \frac{\sigma_\varepsilon^2}{1 - \rho^2} \tag{12.11}
$$
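Equation 12.11 follows from stationarity: taking variances on both sides of Equation 12.9 gives var(e) = ρ² var(e) + σ²ε, which solves to σ²ε / (1 - ρ²) when |ρ| < 1. The minimal sketch below checks this by simulation. The function names, the parameter values, and the choice to start the series from its stationary distribution are illustrative assumptions.

```python
import random
import statistics


def ar1_stationary_variance(rho, sigma_eps):
    """Stationary variance of e_t = rho * e_{t-1} + eps_t, as in Equation 12.11."""
    if abs(rho) >= 1.0:
        raise ValueError("the AR(1) process is stationary only for |rho| < 1")
    return sigma_eps ** 2 / (1.0 - rho ** 2)


def simulate_ar1(rho, sigma_eps, n, seed=0):
    """Simulate n values of an AR(1) series started from its stationary distribution."""
    rng = random.Random(seed)
    e = rng.gauss(0.0, ar1_stationary_variance(rho, sigma_eps) ** 0.5)
    series = []
    for _ in range(n):
        e = rho * e + rng.gauss(0.0, sigma_eps)   # Equation 12.9
        series.append(e)
    return series


rho, sigma_eps = 0.5, 1.0   # hypothetical values
series = simulate_ar1(rho, sigma_eps, n=100_000)
print(round(ar1_stationary_variance(rho, sigma_eps), 3), round(statistics.pvariance(series), 3))
```

The empirical variance of a long simulated series should be close to the theoretical value, which is 4/3 for the values used here.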
We focus on a GLMM approach that has been developed to fit models of
this form to longitudinal data from complex sample surveys (Skinner and
Holmes, 2003).
To fit this model using GLMM methods, we must first address the question of how the longitudinal survey weights should be incorporated in the
model estimation. Pfeffermann et al. (1998) laid the groundwork for appropriate methods of incorporating sampling weights into the estimation of
parameters in multilevel models for complex sample survey data. They also
proposed appropriate variance estimators (based on the delta method). The
methods proposed by these authors can be used to estimate two-level multilevel models for longitudinal survey data given sampling weights computed
for both the level 2 units (individuals) and the level 1 units (repeated observations). Theory Box 12.1 describes the procedure for constructing the weights
for the individual levels.
Unfortunately, the multilevel modeling approach proposed by Pfeffermann
et al. (1998) does not allow for autocorrelation of transitory effects within
persons, as described in Equation 12.9 (the approach was described for the
case where level 2 units were PSUs and level 1 units were sampled individuals), and our example in this chapter therefore does not consider this possible
autocorrelation. Skinner and Holmes (2003) developed a method to circumvent this problem and enable the methods proposed by Pfeffermann et al. to
be applied, but this method has yet to be programmed in any general purpose statistical software packages. A description of an alternative covariance
structure modeling approach to serially correlated observations is provided
in Theory Box 12.2.
Rabe-Hesketh and Skrondal (2006) further investigated methods of fitting GLMMs to complex sample survey data (and specifically binary outcomes). Based on their research, these authors stressed the importance in
pseudo-maximum likelihood estimation (PMLE; see Chapter 8) of incorporating appropriately
scaled weights at all levels of a multilevel study (e.g., school-level, or level 2,
weights and student-level, or level 1, weights, and not simply student-level
weights). Addressing a limitation of the approaches proposed by Pfeffermann
et al. (1998) and Skinner and Holmes (2003), these authors also discussed
Theory Box 12.1 Weighted Estimation
in Two-Level GLMMs
Following Pfeffermann et al. (1998), the sampling weights for individuals at level 2 and repeated observations at level 1 are computed as
follows. In general, given a longitudinal study design with these two
levels, the appropriate weight for the data from individual i at time t,
wit, is computed as
$$
w_{it} = \frac{1}{\pi_i \, \pi_{t|i}} \tag{12.12}
$$
where πi represents individual i’s original probability of inclusion in the
sample, and πt|i represents the probability that individual i responds at
time t. The πt|i values might be estimated using a propensity modeling approach (e.g., Lepkowski and Couper, 2002), where the probability
of responding at time t is predicted using a logistic regression model
with predictors from previous waves. We assume that some reasonable
methodology has been used for estimation of these response probabilities in this discussion.
For purposes of the multilevel modeling approach discussed in this
section, define the level 2 weight as
$$
w_i = \frac{1}{\pi_i} \tag{12.13}
$$
and define the level 1 weight as
$$
w_{t|i} = \frac{1}{\pi_{t|i}} \tag{12.14}
$$
Then, we have
$$
w_{it} = w_i \, w_{t|i} \tag{12.15}
$$
and
$$
w_{t|i} = \frac{w_{it}}{w_i} \tag{12.16}
$$
A reasonable approach suggested by Skinner and Holmes (2003) is to
let wi = wi1, or to let the level 2 weight for individual i be the inverse of
the probability of being selected for the sample and responding at the
baseline wave. Then, w1|i = wi1/wi1 = 1. This procedure will result in less
variance in the level 1 weights denoted by wt|i.
The computed level 1 weights can then be scaled, as recommended
by Pfeffermann et al. (1998) and Skinner and Holmes (2003) to minimize small-sample estimation bias. Scaling weights (e.g., normalizing weights; see Chapter 4) generally does not have an impact on the
estimates of parameters in multivariate models, but this is not the case
when fitting multilevel models to complex sample survey data. Skinner
and Holmes (p. 212) suggest rescaling the level 1 weights previously
described as follows:
$$
w^{*}_{t|i} = \frac{t^{*}(i)\, w_{t|i}}{\sum_{t=1}^{t^{*}(i)} w_{t|i}} \tag{12.17}
$$
In this notation, t*(i) represents the last wave at which individual i
provides a response. Note that this method ensures that the average
rescaled weight for a given individual i is equal to 1. Rabe-Hesketh
and Skrondal (2006) outline additional options for scaling the level 1
weights, one of which we consider in the example in this section.
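The weight construction in this theory box can be sketched in a few lines of code. The example below assumes that an individual's inclusion probability and wave-specific response probabilities are already available (for instance, from a response propensity model) and that the individual responded at every wave up to the last one observed. The function names and the numeric values are hypothetical.

```python
def level_weights(pi_i, pi_t_given_i):
    """Return (w_i, [w_t|i], [w_it]) following Equations 12.12 through 12.15.

    pi_i          -- probability that individual i is included in the sample
    pi_t_given_i  -- response probabilities for waves 1, ..., t*(i)
    """
    w_i = 1.0 / pi_i                                  # level 2 weight (12.13)
    w_t_given_i = [1.0 / p for p in pi_t_given_i]     # level 1 weights (12.14)
    w_it = [w_i * w for w in w_t_given_i]             # overall weights (12.15)
    return w_i, w_t_given_i, w_it


def rescale_level1(w_t_given_i):
    """Rescale level 1 weights so they average to 1 within an individual (12.17)."""
    t_star = len(w_t_given_i)                         # last wave with a response
    total = sum(w_t_given_i)
    return [t_star * w / total for w in w_t_given_i]


# Hypothetical individual: inclusion probability 0.01, three waves of response.
w_i, w_level1, w_overall = level_weights(0.01, [1.0, 0.8, 0.5])
w_scaled = rescale_level1(w_level1)
print(w_i, w_level1, w_scaled, sum(w_scaled) / len(w_scaled))
```

Setting the first response probability to 1 mirrors the convention of folding baseline response into the level 2 weight, so that the level 1 weight at the baseline wave equals 1 before rescaling.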
the computation of “sandwich” estimates of standard errors (based on a
linearization approach) to incorporate stratification and primary stage
clustering of a complex sample into the calculation of design-based standard errors for the parameter estimates derived using the PML approach.
Their proposed method is implemented in the Generalized Linear Latent
and Mixed Models (gllamm) program that these authors developed for the
Stata system (visit http://www.gllamm.org). The underlying assumption of
this approach is that the primary stage clusters represent the highest-level
units in the multilevel design (e.g., students at level 1 nested within schools
at level 2, and schools at level 2 nested within primary stage clusters at level
3). Via simulation, these authors found good coverage for confidence intervals based on the robust standard errors. Thinking about multilevel models for longitudinal survey data, we can apply the same models where the
repeated observations across waves define the level 1 observations, the
sampled individuals define level 2, and the primary sampling units define
level 3 of the data hierarchy. We consider an example of this type later in
Section 12.3.4.
Skinner and Vieira (2007) specifically focused on the impacts of cluster
sampling on variance estimation when analyzing longitudinal survey data
with multilevel models. They concluded that the simple method of including a random cluster effect in a three-level multilevel model for longitudinal