9.5 Practical Matters: The Limits of Interpreting Moderators in Meta‑Analysis
Moderator Analyses
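Path diagram:
[Diagram not reproduced. The transformed effect size, Zr*, is regressed on the transformed predictors EC1* (b1 = .582), EC2* (b2 = .063), EC3* (b3 = .152), C_Age* (b4 = –.020), and Intercept* (b0 = .370). The intercept of Zr* is fixed at 0 and its variance is fixed at 1.0.]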
Mplus syntax:
TITLE: Moderator analysis
DATA: File is Table9_3.txt;
VARIABLE: NAMES N r Zr W EC1 EC2 EC3 C_Age interc;
  USEVARIABLES ARE Zr EC1 EC2 EC3 C_Age interc;
DEFINE: w2 = SQRT(W);
  Zr = w2 * Zr;
  EC1 = w2 * EC1;
  EC2 = w2 * EC2;
  EC3 = w2 * EC3;
  C_Age = w2 * C_Age;
  interc = w2 * interc;
MODEL:
  [Zr@0.0]; !Fixes intercept at 0
  Zr@1.0; !Fixes variance at 1
  Zr ON EC1 EC2 EC3 C_Age interc; !Regress transformed Zr on transformed moderators
OUTPUT:
FIGURE 9.2. Path diagram and Mplus syntax to evaluate moderation.
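The transformation in this syntax can also be sketched outside of an SEM program. The following Python example is only an illustration under stated assumptions: the data values are hypothetical (they are not the values of Table 9.3), the function name is my own, and NumPy is assumed to be available. It multiplies the effect size and every predictor, including the column of 1s that serves as the intercept, by the square root of the weight, and then fits a regression with no additional intercept. With the residual variance fixed at 1, the standard errors come from the inverse of the cross-product matrix of the transformed predictors.

```python
import numpy as np

def weighted_moderator_regression(zr, w, moderators):
    """Regress sqrt(w)-transformed Zr on sqrt(w)-transformed predictors.

    The last column of the design matrix is the constant 1.0, which
    becomes the transformed intercept term after multiplying by sqrt(w).
    """
    zr = np.asarray(zr, dtype=float)
    w = np.asarray(w, dtype=float)
    X = np.column_stack([np.asarray(moderators, dtype=float), np.ones_like(zr)])
    root_w = np.sqrt(w)
    y_star = root_w * zr                 # transformed effect sizes
    X_star = X * root_w[:, None]         # transformed moderators and intercept
    coef, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)
    # Residual variance is fixed at 1, so SEs are sqrt(diag((X*'X*)^-1)).
    se = np.sqrt(np.diag(np.linalg.inv(X_star.T @ X_star)))
    return coef, se

# Hypothetical example: six studies, one dummy code and centered age.
zr = [0.10, 0.25, 0.40, 0.30, 0.55, 0.20]
w = [50, 120, 80, 200, 60, 150]          # inverse-variance weights
moderators = [[1, -2.0], [0, 1.5], [1, 0.5], [0, -1.0], [1, 3.0], [0, -2.0]]

coef, se = weighted_moderator_regression(zr, w, moderators)
print("coefficients (dummy, age, intercept):", coef)
print("standard errors:", se)
```

The estimates are identical to those of an ordinary weighted least squares regression of Zr on the untransformed moderators with weights w, which is why the transformation works.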
COMBINING AND COMPARING EFFECT SIZES
unique association of a moderator above and beyond the other highly correlated moderators. Second, if they are extremely highly correlated, you can get
inaccurate regression estimates that have large standard errors (the so-called
bouncing beta problem).
Fortunately, it is easy—though somewhat time-consuming—to evaluate
multicollinearity in meta-analytic moderator analyses. To do so, you regress
each moderator (predictor) onto the set of all other moderators, weighted by
the same weights (i.e., inverse variances of effect size estimates) as you have
used in the moderator analyses. To illustrate using the example data shown
in Table 9.3, I would regress age onto the three dummy variables representing
the four categorical methods of assessing aggression. Here, R² = .41, far less
than the .90 that is often considered too high (e.g., Cohen et al., 2003, p. 424).
I would then repeat the process for other moderator variables, successively
regressing (weighted by w) them on the other moderator variables.
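The procedure just described can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the data are hypothetical rather than those of Table 9.3, the function names are my own, and NumPy is assumed to be available. Each moderator is regressed in turn on all of the others, weighted by w, and the weighted R² is recorded.

```python
import numpy as np

def weighted_r_squared(target, others, w):
    """Weighted R-squared from regressing one moderator on the others."""
    target = np.asarray(target, dtype=float)
    w = np.asarray(w, dtype=float)
    X = np.column_stack([np.ones_like(target), np.asarray(others, dtype=float)])
    root_w = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(X * root_w[:, None], target * root_w, rcond=None)
    residuals = target - X @ coef
    weighted_mean = np.average(target, weights=w)
    ss_residual = np.sum(w * residuals ** 2)
    ss_total = np.sum(w * (target - weighted_mean) ** 2)
    return 1.0 - ss_residual / ss_total

def collinearity_check(moderators, w, names):
    """Return the weighted R-squared of each moderator on all the others."""
    moderators = np.asarray(moderators, dtype=float)
    results = {}
    for j, name in enumerate(names):
        others = np.delete(moderators, j, axis=1)
        results[name] = weighted_r_squared(moderators[:, j], others, w)
    return results

# Hypothetical data: three effects codes for method plus centered age.
ec1 = [1, 1, 0, 0, 0, 0, -1, -1]
ec2 = [0, 0, 1, 1, 0, 0, -1, -1]
ec3 = [0, 0, 0, 0, 1, 1, -1, -1]
age_c = [-3.0, -1.5, -0.5, 0.0, 1.0, 2.0, 2.5, 4.0]
w = [40, 75, 120, 60, 90, 150, 55, 80]   # inverse-variance weights

moderators = np.column_stack([ec1, ec2, ec3, age_c])
r2 = collinearity_check(moderators, w, ["EC1", "EC2", "EC3", "C_Age"])
for name, value in r2.items():
    print(f"{name}: weighted R^2 = {value:.3f}")
```

Any value approaching the .90 benchmark mentioned above would flag that moderator as too strongly predicted by the others.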
9.5.2 Conceptually Confounded (Proxy) Moderators
A more difficult situation is that of uncoded confounded moderators. These
include a large range of other study characteristics that might be correlated
across studies with the variables you have coded. For example, studying
a particular type of sample (e.g., adolescents vs. young children) might be
associated with particular methodological features (e.g., using self-reports
vs. observations; if I had failed to code this methodology, then this feature
would potentially be an uncoded confounded moderator). Here, results indicating moderation by the sample characteristics might actually be due to
moderation by methodology. Put differently, the moderator in my analysis is
only a proxy for the true moderator. Moreover, because the actual moderator
(type of measure) is conceptually very different from the moderator I actually tested (age), my conclusion would be seriously compromised if I failed to
consider this possibility.
There is no way to entirely avoid this problem of conceptually confounded, or proxy, moderators. But you can reduce the threat it presents by
coding as many alternative moderator variables as possible (see Chapter 5).
If you find evidence of moderation after controlling for a plausible alternative
moderator, then you have greater confidence that you have found the true
moderator (whereas if you did not code the alternative moderator, you could
not empirically evaluate this possibility). At the same time, a large number of
alternative possibilities might be argued to be the true moderator, of which
the predictor you have considered is just a proxy, and it is impossible to
anticipate and code all of these possibilities. For this reason, some argue that
findings of moderation in meta-analysis are merely suggestive of moderation,
but require replication in primary studies where confounding variables could
arguably be better controlled. I do not think there is a universal answer for
how informative moderator results from meta-analysis are; I think it depends
on the conceptual arguments that can be made for the analyzed moderator
versus other, unanalyzed moderators, as well as the diversity of the existing
studies in using the analyzed moderator across a range of samples, methodologies, and measures. Despite the ambiguities inherent in meta-analytic
moderator effects, assessing conceptually reasonable moderators is a worthwhile goal in most meta-analyses in which effect sizes are heterogeneous (see
Chapter 8).
9.5.3 Ensuring Adequate Coverage in Moderator Analyses
When examining and interpreting moderators, an important consideration is
the coverage, or the extent to which numerous studies represent the range of
potential moderator values considered. The literature on meta-analysis has
not provided clear guidance on what constitutes adequate coverage, so this
evaluation is more subjective than might be desired. Nevertheless, I try to
offer my advice and suggestions based on my own experience.
As a first step, I suggest creating simple tables or plots showing the number of studies at various levels of the moderator variables. If you are testing
only the main effects of the moderators, it is adequate to look at just the
univariate distributions.11 For example, in the meta-analysis of Table 9.3,
I might create frequency tables or bar charts of the methods of assessing
aggression, and similar charts of the continuous variable age categorized into
some meaningful units (e.g., early childhood, middle childhood, early adolescence, and middle adolescence; or simply into, e.g., 2-year bins). Whether or
not you report these tables or charts in a manuscript, they are extremely useful in helping you to evaluate the extent of coverage. Considering the method
of assessing aggression, I see that these data contained a reasonable number
of effect sizes from peer- (k = 17) and teacher- (k = 6) report methods, but
fewer from observations (k = 3) and only one using parent reports. Similarly,
examining the distribution of age among these effect sizes suggests a gap in
the early adolescence range (i.e., no studies between 9.5 and 14.5 years).
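The tabulation described here is simple to script. The sketch below, in Python, is only an illustration: the method counts are those reported above, but the age values, the bin boundaries, and the function names are hypothetical choices of mine. It counts studies at each level of a categorical moderator, counts studies within equal-width bins of a continuous moderator (keeping empty bins so that gaps are visible), and lists the levels that fall below a chosen minimum number of studies.

```python
from collections import Counter

def coverage_by_category(levels):
    """Count the number of effect sizes at each level of a categorical moderator."""
    return Counter(levels)

def coverage_by_bin(values, bin_width, start, stop):
    """Count effect sizes in equal-width bins, keeping empty bins as zeros."""
    edges = []
    lower = start
    while lower < stop:
        edges.append(lower)
        lower += bin_width
    counts = {edge: 0 for edge in edges}
    for value in values:
        index = int((value - start) // bin_width)
        if 0 <= index < len(edges):
            counts[edges[index]] += 1
    return counts

def sparse_levels(counts, minimum):
    """Return the levels (or bin lower bounds) with fewer than `minimum` studies."""
    return sorted(level for level, k in counts.items() if k < minimum)

# Method counts as reported in the text (27 effect sizes).
methods = ["peer"] * 17 + ["teacher"] * 6 + ["observation"] * 3 + ["parent"] * 1
# Hypothetical mean ages in years, for illustration only.
ages = [4.5, 5.0, 6.5, 7.0, 8.5, 9.0, 15.0, 16.5]

method_counts = coverage_by_category(methods)
age_counts = coverage_by_bin(ages, bin_width=2, start=4, stop=18)

print(dict(method_counts))
print("methods with fewer than 5 effect sizes:", sparse_levels(method_counts, 5))
print(age_counts)
print("empty 2-year age bins (lower bounds):", sparse_levels(age_counts, 1))
```

The minimum of five used for the methods is a judgment call rather than a formal criterion.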
What constitutes adequate coverage? Unfortunately, there are no clear
answers to this question, as it depends on the overall size of your meta-analysis, the correlations among moderators, the similarity of your included
studies on characteristics not coded, and the conceptual certainty that
the moderator considered is the true moderator rather than a proxy. At an
extreme, one study representing a level of a moderator (e.g., the single study
using parent report in this example) or one study in a broad area of a continuous moderator (e.g., if there was only one study during early childhood)
is not adequate coverage, as it is impossible to know what other features of
that study are also different from those of the rest of the studies. Conversely,
five studies covering an area of a moderator probably constitute adequate
coverage for most purposes (again, I base this recommendation on my own
experience; I do not think that any studies have more formally evaluated this
claim). Beyond these general points of reference, the best advice I can provide
is to carefully consider these studies: Do they all provide similar effect sizes?
Do they vary among one another in other characteristics (which might point
to the generalizability of these studies for this region of the moderator)? Are
the studies comparable to the studies at other levels of the moderator (if not,
then it becomes impossible to determine whether the presumed moderator is
a true or proxy moderator)?
9.6 Summary
In a meta-analysis, moderator variables are coded study characteristics that
are evaluated as predictors of effect sizes. It is possible to evaluate categorical
moderators in an approach similar to ANOVA, continuous moderators in an
approach similar to regression, and to evaluate flexible combinations of these
in either a general multiple regression or SEM framework. In this chapter, I
have described each of these approaches as well as some limitations in interpreting moderator effects in meta-analysis.
9.7 Recommended Readings
Lipsey, M. W. (2003). Those confounded moderators in meta-analysis: Good, bad, and
ugly. The Annals, 587, 69–81.—This article provides an accessible and thoughtful
conceptual consideration for interpreting moderator effects from meta-analysis.
Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA:
Sage.—This book provides a clear and concise description of the ANOVA and regression approaches to moderator analyses that I have described in this chapter (see
especially pp. 120–124 and 135–140).
Notes
1. Although using multiple effect sizes from the same study violates the assumption of independence, I believe that this practice is acceptable when analyzing
categorical moderators and the interdependent effect sizes are placed in different
groups. Because it is reasonable to expect multiple effect sizes from the same
study to be more similar (i.e., positively correlated) than independent effect
sizes, the impact of this interdependence will be to attenuate between-group
differences. Therefore, violation of the independence assumption in this case is
likely to impose a conservative bias (i.e., increase in type II error rates). I believe
that this negative consequence is outweighed by the advantage of being able to
include all relevant effect sizes in this example.
2. The formula I have provided for the number of comparisons differs from that
sometimes provided in textbooks on ANOVA (e.g., Keppel, 1991, p. 167). My
formula assumes that you are only interested in comparing two groups with each
other (i.e., pairwise mean comparisons in ANOVA terminology), so the number
of possible comparisons is G(G – 1)/2 (e.g., for 4 groups, the number of comparisons is 4(4 – 1)/2 = 6). I assume that you will not compare different combinations
of groups (e.g., whether the mean effect sizes of Groups 1 and 2 combined differ
from the mean effect sizes of Groups 3 and 4 combined). If these multigroup
comparisons are of interest, then the total number of comparisons that can be
made using G groups is 1 + (3^G – 1)/2 – 2^G. Using this correction will result in
very conservative comparisons, and I strongly recommend considering planned
comparisons rather than this approach if you are interested in these combined
group comparisons.
3. I do not use the residual sum of squares in this section, but it is useful to record.
This value represents the residual heterogeneity (Qresidual), or heterogeneity
among effect sizes not accounted for by the regression model.
4. These effects codes would assume that all groups have equal sizes (here, equal
numbers of studies). Effects codes derived from centering (described below) can
accommodate different group sizes.
5. Because not all programs readily provide this weighted average, it is useful to
keep in mind that you can compute this weighted average of the predictor by
regressing the predictor variable onto a constant, weighted by the inverse variance weights (w).
6. There is an interesting suppressor effect among these 27 effect sizes: By itself, age
is only a marginally significant predictor (Qregression(1) = 3.56, p = .059). However, when controlling for these effects for method, the effect of age is statistically significant.
7. In the hierarchical multiple regression, (Qregression(1) = 28.33, p < .001). In the
simultaneous regression, the regression coefficient was statistically significant
according to a Z-test: Z = –.0203 / .00382 = –5.32. Note that (–5.32)^2 = 28.33 (within rounding).
8. Namely, this approach allows you to use FIML techniques of missing data management (see, e.g., Arbuckle, 1996). This approach is superior to the practice of
removing studies that have missing study characteristics in that FIML will provide less biased and more statistically powerful results. This approach is especially valuable when simultaneously evaluating multiple moderators, for which
many studies might otherwise be removed for missing values on one of the several coded study characteristics (moderators).
9. For reasons I describe in the next chapter on random- and mixed-effects models,
I recommend using Mplus or Mx SEM packages.
10. Note that the Mplus syntax in this figure calculates the transformations of Equation 9.7 directly from the raw effect size (Zr) and intercept (1.0).
11. If you are interested in evaluating interactions among moderators, it would be
valuable to consider multivariate distributions. For example, if I were interested
in the interaction of age and method of assessing aggression in the example
meta-analysis, I would create a two-dimensional plot with method on one axis
and age on the other, then plot the studies within this space. Here, I would look
for any areas of this space where there are few or no studies.
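The two counts of comparisons given in Note 2 can be checked with a short Python sketch. This is an illustration only, with function names of my own choosing. It computes the number of pairwise comparisons, G(G – 1)/2, and the number of all possible comparisons between two combinations of groups, 1 + (3^G – 1)/2 – 2^G, and then verifies the latter by brute-force enumeration, in which each group is assigned to the first set, the second set, or neither.

```python
from itertools import product

def pairwise_comparisons(g):
    """Number of pairwise comparisons among g groups: g(g - 1)/2."""
    return g * (g - 1) // 2

def all_comparisons(g):
    """Number of comparisons between any two combinations of g groups."""
    return 1 + (3 ** g - 1) // 2 - 2 ** g

def enumerate_comparisons(g):
    """Count the same comparisons by listing every pair of disjoint, nonempty sets."""
    seen = set()
    # 0 = group left out, 1 = group in the first set, 2 = group in the second set
    for assignment in product((0, 1, 2), repeat=g):
        first = frozenset(i for i, side in enumerate(assignment) if side == 1)
        second = frozenset(i for i, side in enumerate(assignment) if side == 2)
        if first and second:
            seen.add(frozenset((first, second)))  # order of the two sets is irrelevant
    return len(seen)

for g in range(2, 6):
    print(g, pairwise_comparisons(g), all_comparisons(g), enumerate_comparisons(g))
```

The rapid growth of the second count with G illustrates why a correction based on it is so conservative.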
10 Fixed-, Random-, and Mixed-Effects Models
In Chapter 8, I presented an approach to computing mean effect sizes and
drawing inferences or computing confidence intervals about these means. In
Chapter 9, I described methods of evaluating moderators in the presence of initial heterogeneity. Both of these analyses assumed homogeneity at some level;
in Chapter 8, this assumption was that the effect sizes were homogeneous
(i.e., no more variability than expected due to random-sampling fluctuations),
and in Chapter 9, this assumption was that the effect sizes were homogeneous
after accounting for differences by moderator variables (i.e., conditional homogeneity). These models assuming homogeneity (or conditional homogeneity)
are termed fixed-effects models.
In this chapter, I present an alternative approach, known as random-effects models (e.g., Hedges, 1983; Hedges & Vevea, 1998; Overton,
1998; Raudenbush, 1994), in which you model this unexplained heterogeneity. In Section 10.1 I compare the fixed-effects models discussed in Chapter
8 with these random-effects models, and in Section 10.2 I describe how you
compute the mean effect size (and draw inferences and compute confidence
intervals) within these random-effects models. In Section 10.3 I describe how
to analyze moderators while also modeling unexplained heterogeneity (mixed-effects, or conditionally random, models). I then continue from the introduction
of the SEM representation of meta-analysis from Chapter 9 to discuss how
this approach can be used to estimate random- and mixed-effects models
(Section 10.4). Finally, I consider some practical matters in choosing among
these models, presenting both conceptual and statistical power considerations
(Section 10.5).
10.1 Differences among Models
It is easiest to begin with the simple case in which you are interested only
in the mean effect size among a set of studies, both in identifying the mean
effect size and in computing its standard errors for inferential testing or
for computing of confidence intervals. Even in this simple case, there are a
number of conceptual, analytic, and interpretive differences between fixed- and random-effects meta-analytic models (see also Hedges & Vevea, 1998;
Kisamore & Brannick, 2008).
10.1.1 Conceptual Differences
The conceptual differences between fixed- and random-effects models can be
illustrated through Figure 8.1, which I have reproduced in the top of Figure
10.1. As you recall, the top of Figure 10.1 displays effect sizes from five studies, all (or at least most) of which have confidence intervals that overlap with
a single population effect size, now denoted with θ using traditional symbol
conventions (e.g., Hedges & Vevea, 1998). This overlap with a single population effect size, with deviations of study effect sizes due to only sampling
fluctuations (i.e., study-specific confidence intervals), represents the fixed-effects model of meta-analysis.
The bottom portion of Figure 10.1 displays the random-effects model.
Here, the confidence intervals of the individual study effect sizes do not necessarily overlap with a single population effect size. Instead, they overlap
with a distribution of population effect sizes. In other words, random-effects
models conceptualize a population distribution of effect sizes, rather than
a single effect size as in the fixed-effects model. In a random-effects model,
you estimate not a single population effect size (θ), but rather a
distribution of population effect sizes represented by a central tendency (µ)
and standard deviation (τ).
10.1.2 Analytic Differences
These conceptual differences in fixed- versus random-effects models can also
be expressed in equation form. These equations help us understand the computational differences between these two models, described in Section 10.2.
Equation 10.1 expresses this fixed-effects model of study effect sizes
being a function of a population effect size and sampling error:
Equation 10.1: Equation for effect sizes for studies
in fixed‑effects model
ESi = θ + εi
• ESi is the (observed) effect size for study i.
• θ is the (single) population effect size.
• εi is the deviation of study i from the population effect size.
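[Figure not reproduced. Top panel, fixed-effects model: the effect sizes and confidence intervals of Studies 1 to 5 around a single population effect size, θ. Bottom panel, random-effects model: the effect sizes and confidence intervals of Studies 1 to 5 around a distribution of population effect sizes with mean µ and standard deviation τ. Horizontal axis in both panels: range of effect sizes.]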
FIGURE 10.1. Conceptual representation of fixed- versus random-effects models.
In this fixed-effects model, the effect sizes for each study (ESi) are
assumed to be a function of two components: a single population effect size
(θ) and the deviation of this study from this population effect size (εi). The
population effect size is unknown but is estimated as the weighted average
of effect sizes across studies (this is often one of the key values you want to
obtain in your meta-analysis). The deviation of any one study’s effect size
from this population effect size (εi) is unknown and unknowable, but the distribution of these deviations across studies can be inferred from the standard
errors of the studies. The test of heterogeneity (Chapter 8) is a test of the null
hypothesis that this variability in deviations is no more than what you expect
given sampling fluctuations alone (i.e., homogeneity), whereas the alternative
hypothesis is that these deviations are more than would be expected by sampling fluctuations alone (i.e., heterogeneity).
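As a brief illustration of the quantities in this paragraph, the following Python sketch computes the weighted average that estimates θ, its standard error, and the heterogeneity statistic Q. It rests on assumptions that this section does not spell out: I use the standard inverse-variance weights (w = 1/SE²) and the usual formula Q = Σw(ES – mean)² with k – 1 degrees of freedom, the function name is my own, and the effect sizes and standard errors are hypothetical.

```python
import math

def fixed_effects_summary(effect_sizes, standard_errors):
    """Weighted mean effect size, its standard error, Q, and degrees of freedom."""
    weights = [1.0 / se ** 2 for se in standard_errors]   # inverse-variance weights
    total_weight = sum(weights)
    mean_es = sum(w * es for w, es in zip(weights, effect_sizes)) / total_weight
    se_mean = math.sqrt(1.0 / total_weight)
    q = sum(w * (es - mean_es) ** 2 for w, es in zip(weights, effect_sizes))
    return mean_es, se_mean, q, len(effect_sizes) - 1

# Hypothetical effect sizes and standard errors for five studies.
effect_sizes = [0.20, 0.35, 0.28, 0.45, 0.31]
standard_errors = [0.10, 0.08, 0.12, 0.15, 0.09]

mean_es, se_mean, q, df = fixed_effects_summary(effect_sizes, standard_errors)
print(f"estimate of theta = {mean_es:.3f} (SE = {se_mean:.3f}), Q({df}) = {q:.2f}")
```

Q is then compared with a chi-square distribution with the stated degrees of freedom to test the null hypothesis of homogeneity.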
I indicated in Chapter 9 that the presence of significant heterogeneity
might prompt us to evaluate moderators to systematically explain this heterogeneity. An alternative approach would be to model this heterogeneity within
a random-effects model. Conceptually, this approach involves estimating not
only a mean population effect size, but also the variability in study effect
sizes due to the population variability in effect sizes. These two estimates are
shown in the bottom of Figure 10.1 as µ (mean population effect size) and τ
(population variability in effect sizes). In equation form, this means that you
would conceptualize each study effect size arising from three sources:
Equation 10.2: Equation for effect sizes for studies
in random‑effects model
ESi = µ + ξi + εi
• ESi is the (observed) effect size for study i.
• µ is the mean of the distribution of population effect sizes.
• ξi is the reliable (not due to sampling deviation) deviation of study i
from the mean of the distribution of population effect sizes.
• εi is the conditional deviation (sampling deviation) of study i from
the distribution of population effect sizes.
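The difference between Equations 10.1 and 10.2 can be made concrete with a small simulation. The Python sketch below is an illustration under assumptions of my own: both the population deviations (ξi) and the sampling deviations (εi) are drawn from normal distributions, all studies share the same standard error, and the parameter values and function names are hypothetical. Under the fixed-effects model the variance of the observed effect sizes should be close to the sampling variance alone, whereas under the random-effects model it should be close to τ² plus the sampling variance.

```python
import random
import statistics

def simulate_fixed_effects(theta, standard_errors, rng):
    """Equation 10.1: ES_i = theta + epsilon_i."""
    return [theta + rng.gauss(0.0, se) for se in standard_errors]

def simulate_random_effects(mu, tau, standard_errors, rng):
    """Equation 10.2: ES_i = mu + xi_i + epsilon_i."""
    effect_sizes = []
    for se in standard_errors:
        xi = rng.gauss(0.0, tau)        # reliable deviation of the study's population effect
        epsilon = rng.gauss(0.0, se)    # sampling deviation
        effect_sizes.append(mu + xi + epsilon)
    return effect_sizes

rng = random.Random(2024)               # seeded so the run is reproducible
standard_errors = [0.10] * 20000        # many hypothetical studies, equal precision

fixed_es = simulate_fixed_effects(0.30, standard_errors, rng)
random_es = simulate_random_effects(0.30, 0.20, standard_errors, rng)

fixed_var = statistics.pvariance(fixed_es)
random_var = statistics.pvariance(random_es)
print(f"fixed-effects variance of ES:  {fixed_var:.4f} (sampling variance = 0.0100)")
print(f"random-effects variance of ES: {random_var:.4f} (tau^2 + sampling variance = 0.0500)")
```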
As shown by comparing the equations for fixed- versus random-effects
models (Equation 10.1 vs. Equation 10.2, respectively), the critical difference
is that the single parameter of the fixed-effects model, the single population
effect size (θ), is decomposed into two parameters (the central tendency and
study deviation, µ and ξi) in the random-effects model. As I describe in more
detail in Section 10.2, the central tendency of this distribution of population
effect sizes is best estimated by the weighted mean of effect sizes from the
studies (though with a different weight than used in a fixed-effects model).
The challenge of the random-effects model is to determine how much of the
variability in each study’s deviation from this mean is due to the distribution
of population effect sizes (the ξi values, sometimes called the random-effects variance;
e.g., Raudenbush, 1994) versus sampling fluctuations (the εi values, sometimes called
the estimation variance). Although this cannot be determined for any single
study, random-effects models allow you to partition this variability across the
collection of studies in your meta-analysis. I describe these computations in
Section 10.2.
10.1.3 Interpretive Differences
Before turning to these analyses, however, it is useful to think of the different interpretations that are justified when using fixed- versus random-effects
models. Meta-analysts using fixed-effects models are only justified in drawing
conclusions about the specific set of studies included in their meta-analysis
(what are sometimes termed conditional inferences; e.g., Hedges & Vevea,
1998). In other words, if you use a fixed-effects model, you should limit your
conclusions to statements of the “these studies find . . . ” type.
The use of random-effects models justifies inferences that generalize
beyond the particular set of studies included in the meta-analysis to a population of potential studies of which those included are representative (what
are sometimes termed unconditional inferences; Hedges & Vevea, 1998). In
other words, random-effects models allow for more generalized statements of
the “the literature finds . . . ” or even “there is this magnitude of association
between X and Y” type (note the absence of any “these studies” qualifier).1
Although meta-analysts generally strive to be comprehensive in their inclusion of relevant studies in their meta-analyses (see Chapter 3), the truth is
that there will almost always be excluded studies about which you still might
wish to draw conclusions. These excluded studies include not only those that
exist that you were not able to locate, but also similar studies that might be
conducted in the future or even studies that contain unique permutations of
methodology, sample, and measures that are similar to your sampled studies
but simply have not been conducted.
I believe that most meta-analysts wish to make the latter, generalized
statements (unconditional inferences) most of the time, so random-effects