10.5 Practical Matters: Which Model Should I Use?
Fixed-, Random-, and Mixed-Effects Models
tions, you might use the absence versus presence of unexplained heterogeneity to inform your choice between fixed- versus random- or mixed-effects
models (respectively). Many meta-analysts take this approach. However, I
urge you to not make this your only consideration because the heterogeneity (i.e., Q) test is an inferential test that can vary in statistical power. In
meta-analyses with many studies that have large sample sizes, you might find
a significant residual heterogeneity that is trivial, whereas a meta-analysis
with few studies having small sample sizes might fail to detect potentially
meaningful heterogeneity. For this reason, I recommend against basing your
model decision only on empirical findings of unexplained heterogeneity.
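The dependence of the Q test on the number and size of studies can be sketched numerically. The example below is a minimal illustration under stated assumptions, not a procedure from this chapter: it assumes effect sizes in the Fisher's Zr metric with equal study sizes (so each sampling variance is 1/(n - 3)), evaluates Q at its expected value rather than simulating it, and approximates the chi-square tail with the Wilson-Hilferty formula. The two scenarios and the helper names `expected_q` and `chi2_upper_p` are hypothetical.

```python
from statistics import NormalDist


def expected_q(k: int, n: int, tau2: float) -> float:
    """Expected Q for k studies of equal size n (Fisher's Zr metric).

    Assumes each sampling variance is 1 / (n - 3), so every fixed-effects
    weight is w = n - 3 and E[Q] = (k - 1) * (1 + tau2 * w).
    """
    w = n - 3
    return (k - 1) * (1 + tau2 * w)


def chi2_upper_p(q: float, df: int) -> float:
    """Approximate upper-tail p value of a chi-square (Wilson-Hilferty)."""
    z = ((q / df) ** (1 / 3) - (1 - 2 / (9 * df))) / (2 / (9 * df)) ** 0.5
    return 1 - NormalDist().cdf(z)


# Hypothetical scenario A: many large studies, trivial heterogeneity
# (tau2 = .001, a population SD of about .03).
q_large = expected_q(k=100, n=1000, tau2=0.001)
p_large = chi2_upper_p(q_large, df=99)

# Hypothetical scenario B: few small studies, larger heterogeneity
# (tau2 = .01, a population SD of .10).
q_small = expected_q(k=5, n=30, tau2=0.01)
p_small = chi2_upper_p(q_small, df=4)

print(f"A: Q = {q_large:.1f}, p = {p_large:.2g}")
print(f"B: Q = {q_small:.2f}, p = {p_small:.2f}")
```

In this sketch the trivial heterogeneity of scenario A is flagged as significant, whereas the larger heterogeneity of scenario B is not, which is the pattern the paragraph above warns about.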
A third consideration is the relative statistical power of fixed- versus
random-effects models (or fixed-effects with moderators versus mixed-effects models). The statistical power of a meta-analysis depends on many
factors—number of studies, sample sizes of studies, degree to which effect
sizes must be corrected for artifacts, magnitude of population variance in
effect size, and of course true mean population effect size. Therefore, it is not
a straightforward computation (see, e.g., Cohn & Becker, 2003; Field, 2001;
Hedges & Pigott, 2001, 2004). However, to illustrate this difference in power
between fixed- and random-effects models, I have graphed some results of
a simulation by Field (2001), shown in Figure 10.4. These plots make clear
the greater statistical power of fixed-effects versus random-effects models.
More generally, fixed-effects analyses will always provide statistical power that is
as high as (when τ² = 0) or higher than (when τ² > 0) that of random-effects models.
This makes sense in light of my earlier observation that the random-effects
weights are always smaller than the fixed-effects weights; therefore, the sum
of weights is smaller and the standard error of the average effect size is larger
for random- than for fixed-effects models. Similarly, analysis of moderators
in fixed-effects models will provide statistical power as high as or higher than that of
mixed-effects models. For these reasons, it may seem that this consideration
would always favor fixed-effects models. However, this conclusion must be
tempered by the inappropriate precision associated with high statistical
power when a fixed-effects model is used inappropriately in the presence
of substantial variance in population effect sizes (see below). Nevertheless,
statistical power is one important consideration in deciding among models:
If you have questionable statistical power (small number of studies and/or
small sample sizes) to detect the effects you are interested in, and you are
comfortable with the other considerations, then you might choose a fixed-effects model.
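The link between weights, standard errors, and power can be made concrete with a small calculation. This is a rough sketch, not the simulation of Field (2001): it assumes the Fisher's Zr metric, equal study sizes, a two-tailed Z test of the mean effect, and a value of τ² treated as known (in practice it is estimated, which affects power further). The values of k, n, and τ² are hypothetical.

```python
from math import atanh, sqrt
from statistics import NormalDist

_norm = NormalDist()


def se_mean(k: int, n: int, tau2: float = 0.0) -> float:
    """Standard error of the mean Zr for k studies of equal size n.

    Each weight is 1 / (1/(n - 3) + tau2); tau2 = 0 is the fixed-effects case.
    """
    weight = 1.0 / (1.0 / (n - 3) + tau2)
    return sqrt(1.0 / (k * weight))


def power(mean_zr: float, se: float, alpha: float = 0.05) -> float:
    """Approximate two-tailed power of the Z test that the mean effect is 0."""
    crit = _norm.inv_cdf(1 - alpha / 2)
    shift = mean_zr / se
    return _norm.cdf(shift - crit) + _norm.cdf(-shift - crit)


mean_zr = atanh(0.10)   # population r = .10 expressed as Zr
k, n = 10, 40           # hypothetical: 10 studies of 40 participants each
tau2 = 0.02             # hypothetical population variance in effect sizes

power_fe = power(mean_zr, se_mean(k, n))
power_re = power(mean_zr, se_mean(k, n, tau2))
power_re_zero = power(mean_zr, se_mean(k, n, 0.0))

print(f"fixed effects:  {power_fe:.3f}")
print(f"random effects: {power_re:.3f} (tau2 = {tau2})")
```

Because adding τ² to every study's variance can only shrink the weights, the random-effects standard error is never smaller than the fixed-effects one, and the two power values coincide when τ² = 0.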
COMBINING AND COMPARING EFFECT SIZES
FIGURE 10.4. Examples of different statistical power of fixed- versus random-effects
models. Values from simulation by Field (2001) showing power to detect
mean effect, by number of studies (k) of various sample sizes (N), at population
r = .10, .30, and .50 for fixed-effects (solid lines; homogeneous population effect size)
and random-effects (dashed lines; heterogeneous population effect size) models.
[Three panels, one each for population r = .10, .30, and .50. Vertical axis: power
(0.000 to 1.000). Horizontal axis: k (number of studies), 5 to 30. Each panel plots
fixed-effects (FE) and random-effects (RE) lines for N = 20, 40, 80, and 160.]
The presence of studies that are outliers in terms of either their effect
sizes or their standard errors (e.g., sample sizes) is better managed in random-
than fixed-effects models. Outliers consisting of studies that have
extreme effect sizes have more influence on the estimated mean effect size
in fixed-effects analysis because the analyses (to anthropomorphize) must
“move the mean” substantially to fall within the confidence interval of the
extreme effect size (see top of Figure 10.1). In contrast, studies with extreme
effect sizes impact the population variance (τ²) more so than the estimated
mean effect size in random-effects models. Considering the bottom of Figure
10.1, you can imagine that an extreme effect size can be accommodated by
widening the spread of the population effect size distribution (i.e., increasing
the estimate of τ) in a random-effects model.
A second type of outlier consists of studies that are extreme in their
sample sizes, especially those with much larger sample sizes than other
studies. Because sample size is strongly connected to the standard error of
the study’s effect size, and these standard errors in turn form the weight in
fixed-effects models (see Chapter 8), you can imagine that a study with an
extremely large sample could be weighted much more heavily than other
studies. For example, in the 22-study meta-analysis I have presented (see
Table 10.1), four studies with large samples (Hawley et al., 2007; Henington,
1996; Pakaslahti & Keltikangas-Järvinen, 1998; Werner, 2000) comprise
44% of the total weight in the fixed-effects analysis (despite being only 18%
of the studies) and are given 13 to 16 times the weight of the smallest study
(Ostrov, Woods, Jansen, Casas, & Crick, 2004). Although I justified the use
of weights in Chapter 8, this degree of weighting some studies far more than
others might be too undemocratic (and I have seen meta-analyses with even
more extreme weighting, with single studies having more weight than all
other studies combined). As I have mentioned, random-effects models reduce
these discrepancies in weighting. Specifically, because a common estimate of
τ² is added to the squared standard error for each study, the weights become
more equal across studies as τ² becomes larger. This can be seen by inspecting
the random-effects weights (w*) in Table 10.1: Here the largest study is
weighted only 1.4 times the smallest study. In sum, random-effects models,
to the extent that τ² is large, use weights that are less extreme, and therefore
random- (or mixed-) effects models might be favored in the presence of
sample size outliers.
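The weighting figures in this paragraph can be checked approximately from the Zr and 1/SE columns that Table 11.1 reports for the same 22 studies. The sketch below rests on two assumptions: that the fixed-effects weight is the squared 1/SE value, and that τ² is estimated by the method of moments with truncation at zero (which I take to correspond to the chapter's Equation 10.4, not shown in this section). Because the tabled values are rounded, the results are approximate.

```python
# Values transcribed from Table 11.1 (22 studies, in table order).
inv_se = [14.43, 17.26, 7.55, 20.67, 28.91, 28.84, 8.18, 11.69, 11.83, 23.47,
          12.03, 11.21, 10.85, 7.21, 26.26, 15.59, 15.71, 13.98, 16.99, 28.47,
          21.96, 14.24]
zr = [.583, .201, .322, .624, .162, .349, .419, .721, .628, .655, .039,
      .375, .049, .000, .339, -.048, .489, .258, .162, .519, .509, .651]

w = [x ** 2 for x in inv_se]                   # fixed-effects weights, 1 / SE^2
sum_w = sum(w)
mean_fe = sum(wi * es for wi, es in zip(w, zr)) / sum_w

# Heterogeneity statistic Q and method-of-moments tau^2 (assumed estimator).
q = sum(wi * (es - mean_fe) ** 2 for wi, es in zip(w, zr))
k = len(zr)
c = sum_w - sum(wi ** 2 for wi in w) / sum_w
tau2 = max(0.0, (q - (k - 1)) / c)

w_star = [1.0 / (1.0 / wi + tau2) for wi in w]  # random-effects weights

top4_share = sum(sorted(w, reverse=True)[:4]) / sum_w
ratio_fe = max(w) / min(w)
ratio_re = max(w_star) / min(w_star)

print(f"share of weight held by the four largest studies: {top4_share:.0%}")
print(f"largest/smallest weight: FE {ratio_fe:.1f}, RE {ratio_re:.1f}")
```

Under these assumptions the four largest studies hold roughly 44% of the fixed-effects weight, and the ratio of largest to smallest weight falls from about 16 in the fixed-effects analysis to about 1.4 once τ² is added to each study's variance.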
Perhaps the least convincing consideration is the complexity of the models
(the argument is so unconvincing that I would not even raise it if it were
not so commonly put forward). The argument is that fixed-effects models,
whether for only computing mean effect sizes (Chapter 8) or for evaluating
moderators (Chapter 9), are far simpler than random- and mixed-effects models.
Although simplicity is not a compelling rationale for a model (and a rationale
that will not go far in the publication process), I acknowledge that you
should be realistic in considering how complex a model you can use and
report. I suspect that most readers will be able to perform computations for
random-effects models, so if you are not analyzing moderators and the other
considerations point you toward this model, I encourage you to use it.
Mixed-effects models, in contrast, are more complex and might not be tractable for
many readers. Because less-than-optimal answers are better than no answers
at all, I do think it is reasonable to analyze moderators within a fixed-effects
model if this is all that you feel you can do, with the caveat that you should
recognize the limitations of this model. Even better, however, is for you to
enlist the assistance of an experienced meta-analyst who can help you with
more complex (and more appropriate) models.
At this point, you might see some advantages and disadvantages to
each type of model, and you might still feel uncertain about which model to
choose. I think this decision can be aided by considering the consequences
of choosing the “wrong” model. By “wrong” model, I mean that you choose
(1) a random- or mixed-effects model when there is no population variability
among effect sizes, or (2) a fixed-effects model when there really exists substantial population variability among effect sizes. In the first situation, using
random-effects models in the absence of population variability, there is little
negative consequence other than a little extra work. Random- and mixed-effects
models will yield results similar to those of fixed-effects models when there is
little population variability in effect sizes (e.g., because estimated τ² is close
to zero, Equation 10.2 functionally reduces to Equation 10.1). If you decide on
a random- (or mixed-) effects model only to find little population variability
in effect sizes, you still have the advantage of being able to make generalizable conclusions (see the first consideration above). In contrast, the second
type of inappropriate decision (using a fixed-effects model in the presence
of unexplained population variability) is problematic. Here, the failure to
model this population variability leads to conclusions that are inappropriately
precise: significance tests that reject the null hypothesis too readily and
confidence intervals that are overly narrow.
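The asymmetry between the two kinds of "wrong" choice can be illustrated with a small simulation. This is only a sketch under hypothetical settings (20 studies of 100 participants, effects in the Zr metric, a true mean of .20, and τ of either 0 or .20), and it assumes a method-of-moments estimate of τ² truncated at zero. It counts how often the 95% confidence interval from each model contains the true mean effect.

```python
import random
from math import sqrt


def pooled(effects, variances, tau2=0.0):
    """Inverse-variance weighted mean and its standard error."""
    weights = [1.0 / (v + tau2) for v in variances]
    total = sum(weights)
    mean = sum(w * es for w, es in zip(weights, effects)) / total
    return mean, sqrt(1.0 / total)


def tau2_moments(effects, variances):
    """Method-of-moments estimate of tau^2, truncated at zero."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    mean = sum(w * es for w, es in zip(weights, effects)) / total
    q = sum(w * (es - mean) ** 2 for w, es in zip(weights, effects))
    c = total - sum(w ** 2 for w in weights) / total
    return max(0.0, (q - (len(effects) - 1)) / c)


def coverage(true_mean, tau, k=20, n=100, reps=2000, seed=1):
    """Share of 95% intervals (FE, RE) that contain the true mean effect."""
    rng = random.Random(seed)
    v = 1.0 / (n - 3)                    # sampling variance of Zr
    hits_fe = hits_re = 0
    for _ in range(reps):
        # Each study draws its own population effect, then sampling error.
        effects = [rng.gauss(rng.gauss(true_mean, tau), sqrt(v))
                   for _ in range(k)]
        variances = [v] * k
        mean_fe, se_fe = pooled(effects, variances)
        mean_re, se_re = pooled(effects, variances,
                                tau2_moments(effects, variances))
        hits_fe += abs(mean_fe - true_mean) <= 1.96 * se_fe
        hits_re += abs(mean_re - true_mean) <= 1.96 * se_re
    return hits_fe / reps, hits_re / reps


fe_cov, re_cov = coverage(true_mean=0.20, tau=0.20)
fe_cov_homog, re_cov_homog = coverage(true_mean=0.20, tau=0.0)
print(f"heterogeneous population: FE {fe_cov:.2f}, RE {re_cov:.2f}")
print(f"homogeneous population:   FE {fe_cov_homog:.2f}, RE {re_cov_homog:.2f}")
```

With a homogeneous population both models should cover the true mean at close to the nominal rate, whereas with substantial population variability the fixed-effects interval should miss far more often than 5% of the time.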
In conclusion, random-effects models offer more advantages than fixed-effects
models, and there are no disadvantages to using random-effects models in the
absence of population variability in effect sizes. For this reason, I
generally recommend random-effects models when the primary goal is estimating
and drawing conclusions about mean effect sizes. When the focus of
your meta-analysis is on evaluating moderators, then my recommendations
are more ambivalent. Here, mixed-effects models provide optimal results,
but the complexity of estimating them might not always be worth the effort
unless you are able to enlist help from an experienced meta-analyst. For
moderator analyses, I do view fixed-effects models as acceptable, provided
you examine unexplained (residual) heterogeneity and are able to show that
it is either not significant or small in magnitude.6
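The "small in magnitude" part of this recommendation can be expressed as a simple check on the residual heterogeneity. The sketch below uses the standard definition of I² as (Q − df)/Q, truncated at zero, together with the untested 25% guideline from Note 6. The function names and the Q values are hypothetical, and the check covers magnitude only, not the significance of the residual Q.

```python
def i_squared(q: float, df: int) -> float:
    """I^2 as a proportion: variability beyond sampling error, relative to Q."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q)


def residual_heterogeneity_is_small(q_residual: float, df_residual: int,
                                    threshold: float = 0.25) -> bool:
    """Apply the (untested) guideline of a residual I^2 of 25% or less."""
    return i_squared(q_residual, df_residual) <= threshold


# Hypothetical residual Q values after entering moderators (df = 20).
print(round(i_squared(24.0, 20), 3))               # 0.167
print(residual_heterogeneity_is_small(24.0, 20))   # True
print(residual_heterogeneity_is_small(80.0, 20))   # False, since I^2 = 0.75
```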
10.6 Summary
Random-effects models conceptualize the population of effect sizes as falling along a distribution with both a mean and variance, above and beyond
variance due to sampling fluctuations of individual studies. These random
effects can be contrasted with the fixed-effects models described in Chapter
8, which conceptualize a single population effect size with any variability
among effect sizes in studies due to sampling fluctuations. In this chapter, I
have highlighted the differences between these models, and I have described
how to estimate random-effects models for meta-analysis. I then described
mixed-effects models, which are the random-effects extensions of the (fixed-effects)
moderator analyses of Chapter 9. I also showed how both random- and
mixed-effects models can be represented as structural equation models
with random slopes. To assist in selecting between fixed- versus random- or
mixed-effects models, I have encouraged you to consider several factors.
10.7 Recommended Readings
Cheung, M. W.-L. (2008). A model for integrating fixed-, random-, and mixed-effects
meta-analyses in structural equation modeling. Psychological Methods, 13, 182–202.
This article presents the approach to modeling meta-analysis within an SEM framework
that I describe in this chapter.
Hedges, L. V., & Vevea, J. L. (1998). Fixed- and random-effects models in meta-analysis.
Psychological Methods, 3, 486–504. This article is one of the seminal early articles
describing fixed- versus random-effects models. Although somewhat challenging, the
paper is worth reading given that it provides the foundation for much subsequent work
on this topic.
Overton, R. C. (1998). A comparison of fixed-effects and mixed (random-effects) models
for meta-analysis tests of moderator variable effects. Psychological Methods, 3,
354–379. This is a challenging article to read; however, it is one of the best sources
of information for conducting mixed-effects analyses.
Raudenbush, S. W. (1994). Random effects models. In H. Cooper & L. V. Hedges (Eds.), The
handbook of research synthesis (pp. 301–321). New York: Russell Sage Foundation.
This chapter is the most comprehensive summary of the topic, including a nice mixture
of accessible and challenging information.
Notes
1. This does not mean that you extend conclusions beyond the general types of
studies in your meta-analysis, but that you expand beyond the specific studies.
For instance, I might perform a meta-analysis of three studies using samples that
are 10, 12, and 14 years old. Under a fixed-effects model, I can only make conclusions about studies of 10-, 12-, and 14-year-olds; I should not make conclusions
about results involving 11- or 13-year-olds. Under a random-effects model, I can
make conclusions about the more generalized period of early adolescence from
10–14 years (including 11- and 13-year-olds). Neither model would allow me to
safely extrapolate conclusions beyond these limits; so neither would inform my
understanding of 4-year-old children or 40-year-old adults.
2. The note to Equation 10.4 fixes the variance at 0 for those occasions when Q is
lower than this expected value, thus avoiding estimates of negative population
variance.
3. At the time of this writing, I am aware of only two programs that can do this:
Mplus and MX.
4. For alternate ways of representing random slopes in path diagrams, see Curran
and Bauer (2007); Mehta and Neale (2005).
5. Some meta-analysts make it their explicit goal to continue to examine moderators until the residual heterogeneity is not significant. Although I see value in this
approach—in attempting to systematically explore differences in the findings
of studies until you can systematically explain all differences beyond sampling
fluctuation—I do not think this must be the goal of every meta-analysis. If you
have evaluated all moderators that you are interested in, and residual heterogeneity still exists, I see nothing wrong with simply acknowledging that there still
remain differences among studies that you have not explained.
6. A reasonable, though untested, suggestion might be that the residual heterogeneity
produces an I² of 25% or less (see Chapter 8).
11
Publication Bias
In Chapter 2, I described publication bias as a threat to both narrative and
meta-analytic reviews. In Chapter 3, I emphasized the importance of thorough
and systematic searching of the literature as one way of reducing the likely
impact of this bias. Although search procedures are the most effective remedy
to this file drawer problem, it is also possible to evaluate the presence of publication bias after studies have been collected and coded.
In this chapter, I first revisit the problem of publication bias in more depth
than I did earlier in the book. I then review a range of analytic and graphical techniques that have been developed within the field of meta-analysis to
detect the presence of publication bias. Finally, in the “practical matters” section, I
provide what I view as a pragmatic perspective on the ever-present threat of
publication bias.
11.1 The Problem of Publication Bias
Publication bias refers to the possibility that studies finding null (absence
of a statistically significant effect) or negative (statistically significant effect
in the direction opposite to that expected) results are less likely to be published than
studies finding positive effects (statistically significant effects in expected
direction).1 This bias is likely due both to researchers being less motivated
to submit null or negative results for publication and to journals (editors
and reviewers) being less likely to accept manuscripts reporting these results
(Cooper, DeNeve, & Charlton, 1997; Coursol & Wagner, 1986; Greenwald,
1975; Olson et al., 2002).
The impact of this publication bias is that the published literature might
not be representative of the studies that have been conducted on a topic, in
that the available results likely show a stronger overall effect size than if all
studies were considered. This impact is illustrated in Figure 11.1, which is a
reproduction of Figure 3.2. The top portion of this figure shows a distribution
of effect sizes from a hypothetical population of studies. The effect sizes
from these studies center around a hypothetical mean effect size (about 0.20),
but vary around this mean due to random-sampling
error and, potentially, population-level between-study variance (i.e., heterogeneity;
see Chapters 8 and 9). Among those studies that happen to find
small effect sizes, results are less likely to be statistically significant (in this
hypothetical figure, I have denoted this area as the one in which studies find
effect sizes within ±0.10, with the exact range depending on the study sample sizes
and effect size considered). Below this population of effect sizes of all studies
conducted, I have drawn downward arrows of different thicknesses to
represent the different likelihoods of the study being published, with thicker
arrows denoting higher likelihood of publication. Consistent with the notion
of publication bias, the hypothetical studies that fail to find significant effects
are less likely to be published than those that do. This differential publication
rate results in the distribution of published studies shown in the lower part
of Figure 11.1. It can be seen that this distribution is shifted to the right, such
that the mean effect size is now approximately 0.30. If the meta-analysis only
includes this biased sample of published studies, then the estimate of the
mean effect size is going to be considerably higher (around 0.30) than that in
the true population of studies conducted. Clearly, this has serious implications
for a meta-analysis that does not consider publication bias.
FIGURE 11.1. Illustration of publication bias in a hypothetical sample drawn from
a population of studies.
[Top panel: population of effect sizes, with the region of nonsignificant results
marked. Bottom panel: sample of published effect sizes. Horizontal axis in both
panels: effect size, from -0.20 to 0.60.]
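The selective publication process described above can be mimicked with a short simulation. Everything in this sketch is hypothetical: the true mean effect of .20 (in the Zr metric), the study size of 50, the rule that a result counts as significant in the expected direction when z exceeds 1.96, and the publication probabilities of .9 and .3. It is meant only to show the direction of the bias, not to reproduce the values in Figure 11.1.

```python
import random
from math import sqrt
from statistics import mean


def simulate(n_studies=20000, true_mean=0.20, n=50,
             p_publish_sig=0.9, p_publish_other=0.3, seed=7):
    """Return (mean of all effects, mean of published effects).

    Effects are drawn on the Zr scale with standard error 1 / sqrt(n - 3).
    Studies that are significant in the expected direction (z > 1.96) are
    published with probability p_publish_sig, all others with p_publish_other.
    """
    rng = random.Random(seed)
    se = 1.0 / sqrt(n - 3)
    all_effects, published = [], []
    for _ in range(n_studies):
        effect = rng.gauss(true_mean, se)
        significant_positive = effect / se > 1.96
        p_publish = p_publish_sig if significant_positive else p_publish_other
        all_effects.append(effect)
        if rng.random() < p_publish:
            published.append(effect)
    return mean(all_effects), mean(published)


mean_all, mean_published = simulate()
print(f"all studies: {mean_all:.3f}; published only: {mean_published:.3f}")
```

Under these settings the mean of the published studies is noticeably higher than the mean of all studies conducted, and the gap widens as the two publication probabilities move further apart.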
This publication bias is sometimes referred to by alternative names. Some
have referred to it as the “file-drawer problem” (Rosenthal, 1979), conjuring
images of researchers’ file drawers containing manuscripts reporting null or
negative (i.e., in the direction opposite to that expected) results that will never be
seen by the meta-analyst (or anyone else in the research community). Another
term proposed is “dissemination bias” (see Rothstein, Sutton, & Borenstein,
2005a). This latter term is more accurate in describing the broad scope of this
problem, although the term “publication bias” is the more commonly used
one (Rothstein et al., 2005a). Regardless of terminology used, the breadth of
this bias is not limited just to significant results being published and nonsignificant results not being published (even in a probabilistic rather than
absolute sense). One source of breadth of the bias is the existence of “gray literature,” research that is between the file drawer and publication, such as in
the format of conference presentations, technical reports, or obscure publication outlets (Conn, Valentine, Cooper, & Rantz, 2003; Hopewell, Clarke, &
Mallett, 2005; also referred to as “fugitive literature” by, e.g., M. C. Rosenthal,
1994). There is evidence that null findings are more likely to be reported only
in these more obscure outlets than are positive findings (see Dickersin, 2005;
Hopewell et al., 2005). If the literature search is less exhaustive, these reports
are less likely to be found and included in the meta-analysis than reports
published in more prominent outlets.
Another source of breadth in publication bias may be in the underemphasis of null or negative results. For example, researchers are likely to make
significant findings the centerpiece of an empirical report and only report
nonsignificant findings in a table. Such publications, though containing the
effect size of interest, might not be detected in key word searches or in browsing the titles of published works. Similarly, null or counterintuitive findings
that are published may be less likely to be cited by others; thus, backward
searches are less likely to find these studies.
Finally, an additional source of breadth in considering publication bias
is due to the time lag of publication. There is evidence, at least in some fields,
that significant results are published more quickly than null or negative
results (see Dickersin, 2005). The impact on meta-analyses, especially those
focusing on topics with a more recently created empirical basis, is that the
currently published results are going to overrepresent significant positive
findings, whereas null or negative results are more likely to be published
after the meta-analysis is performed.
Recognizing the impact and breadth of publication bias is important
but does not provide guidance in managing it. Ideally, the scientific process
would change so that researchers are obligated to report the results of study
findings.2 In clinical research, the establishment of clinical trial registries (in
which researchers must register a trial before beginning the study, with some
journals motivating registration by only considering registered trials for
publication) represents a step in helping to identify studies, although there
are some concerns that registries are incomplete and that the researchers
of registered trials may be unwilling to share unexpected results (Berlin &
Ghersi, 2005). However, unless you are in the position to mandate research
and reporting practices within your field, you must deal with publication bias
without being able to prevent it or even fully know of its existence. Nevertheless, you do have several methods of evaluating the likely impact publication
bias has on your meta-analytic results.
11.2 Managing Publication Bias
In this section, I describe six approaches to managing publication bias within
meta-analysis. I also illustrate some of these approaches through the example
meta-analysis I have used throughout this book: a review of 22 studies reporting associations between relational aggression and peer rejection among children and adolescents. In Chapter 8, I presented results of a fixed-effects3
analysis of these studies indicating a mean r = .368 (SE = .0118; Z = 32.70, p
< .001; 95% confidence interval = .348 to .388). When using this example in
this section, I evaluate the extent to which this conclusion about the mean
association is threatened by potential publication bias.
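As a check on these summary values, the fixed-effects mean can be recomputed from the Zr and 1/SE columns of Table 11.1. This is a sketch under the assumption that each study's weight is the squared 1/SE value and that the mean and confidence limits are computed on the Zr scale and then back-transformed to r. Because the tabled values are rounded, the output matches the reported figures only approximately.

```python
from math import sqrt, tanh

# Values transcribed from Table 11.1 (22 studies, in table order).
inv_se = [14.43, 17.26, 7.55, 20.67, 28.91, 28.84, 8.18, 11.69, 11.83, 23.47,
          12.03, 11.21, 10.85, 7.21, 26.26, 15.59, 15.71, 13.98, 16.99, 28.47,
          21.96, 14.24]
zr = [.583, .201, .322, .624, .162, .349, .419, .721, .628, .655, .039,
      .375, .049, .000, .339, -.048, .489, .258, .162, .519, .509, .651]

w = [x ** 2 for x in inv_se]            # fixed-effects weights, 1 / SE^2
sum_w = sum(w)

mean_zr = sum(wi * es for wi, es in zip(w, zr)) / sum_w
se_mean = sqrt(1.0 / sum_w)             # standard error of the mean Zr
z_stat = mean_zr / se_mean
lo_zr = mean_zr - 1.96 * se_mean
hi_zr = mean_zr + 1.96 * se_mean

# Back-transform from Fisher's Zr to r.
mean_r, lo_r, hi_r = tanh(mean_zr), tanh(lo_zr), tanh(hi_zr)

print(f"mean r = {mean_r:.3f}, SE = {se_mean:.4f}, Z = {z_stat:.2f}, "
      f"95% CI = {lo_r:.3f} to {hi_r:.3f}")
```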
Table 11.1 displays these 22 studies. The first five columns of this table
are the citation, sample size, untransformed effect size (r), transformed effect
size (Zr), and standard error of the transformed effect size (SE). The remaining columns contain information that I explain when using these data to
illustrate methods of evaluating publication bias.
TABLE 11.1. Example Meta-Analysis Used to Illustrate Analyses for Publication Bias
Study                                        N      r     Zr     SE  Pub.     v*     ES*       z    1/SE
Blachman (2003)                            228   .525   .583   .069     0   .005    2.87    7.57   14.43
Crick & Grotpeter (1995)                   491   .198   .201   .058     1   .003   –3.28    3.42   17.26
Crick et al. (1997)                         65   .311   .322   .132     1   .017   –0.49    2.35    7.55
Geiger (2003)                              458   .554   .624   .048     0   .002    5.06   11.45   20.67
Hawley et al. (2007)                       929   .161   .162   .035     1   .001   –6.90    4.65   28.91
Henington (1996)                           904   .336   .349   .035     1   .001   –1.14    9.69   28.84
Johnson (2003)                              74   .396   .419   .122     0   .015    0.27    3.24    8.18
Leff (1995)                                151   .617   .721   .086     0   .007    3.94    7.22   11.69
Miller (2001)                              150   .557   .628   .085     0   .007    2.89    6.59   11.83
Murray-Close & Crick (2006)                590   .575   .655   .043     1   .002    6.57   13.50   23.47
Nelson et al. (2005)                       180   .039   .039   .083     1   .007   –4.22    0.47   12.03
Ostrov (under review)a                     139   .358   .375   .089     0   .008   –0.13    4.02   11.21
Ostrov & Crick (2007)                      132   .049   .049   .092     1   .008   –3.69    0.54   10.85
Ostrov et al. (2004)b                       60   .000   .000   .139     1   .019   –2.80    0.00    7.21
Pakaslahti & Keltikangas-Järvinen (1998)   839   .326   .339   .038     1   .001   –1.32    8.57   26.26
Phillipsen et al. (1999)                   262  –.048  –.048   .064     1   .004   –6.89   –0.75   15.59
Rys & Bear (1997)                          266   .454   .489   .064     1   .004    1.65    7.13   15.71
Salmivalli et al. (2000)                   209   .253   .258   .072     1   .005   –1.82    3.54   13.98
Tomada & Schneider (1997)                  314   .160   .162   .059     1   .003   –3.90    2.73   16.99
Werner (2000)                              881   .477   .519   .035     0   .001    3.99   13.57   28.47
Werner & Crick (2004)                      517   .469   .509   .046     1   .002    2.78   10.30   21.96
Zalecki & Hinshaw (2004)                   228   .572   .651   .070     1   .005    3.82    8.15   14.24
Note. N = sample size; r = effect size; Zr = transformed effect size; SE = standard
error of Zr; Pub. = published (1 = yes). SE is shown as the reciprocal of the tabled
1/SE value.
a Article was under review during the preparation of this meta-analytic review. It has
subsequently been published as Ostrov (2008).
b Effect size is lower-bound estimate based on authors’ report of only nonsignificant
associations.