6 Practical Matters: Using Effect Size Calculators and Meta‑Analysis Programs
CODING INDIVIDUAL STUDIES
gram, then you can decide if using the program is worthwhile. I offer this
same advice when combining effect sizes, which I discuss later in this book.
In this chapter, I have described effect sizes as indices of association between
two variables, a definition that is somewhat restricted but that captures the
majority of uses in meta-analysis. I also emphasized that effect sizes are not
statistical significance tests.
I also described three classes of effect sizes. Correlations (r) index associations between two continuous variables. Standardized mean differences
(such as g) index associations between dichotomous and continuous variables. Odds ratios (o) are advantageous in indexing the associations between
two dichotomous variables. I stressed that you should carefully consider the
nature of the variables of interest, recognizing that primary studies may use
other distributions (e.g., artificial dichotomization of a continuous variable).
I also suggested that your conceptualization of the distributions of the variables of interest should guide your choice of effect size index. Finally, I considered the practical matter of using available effect size calculators in meta-analysis programs. Although you should be familiar enough with effect size
computation that you can do so yourself, these effect size calculators can be
a time-saving tool.
Fleiss, J. H. (1994). Measures of effect size for categorical data. In H. Cooper & L. V.
Hedges (Eds.), The handbook of research synthesis (pp. 245–260). New York: Russell
Sage Foundation.—This chapter provides a thorough and convincing description of
the use of o as effect size for associations between two dichotomous variables. This
chapter does not provide much advice on estimating o from commonly reported data,
so readers should also look at relevant sections of Lipsey and Wilson (2001).
Grissom, R. J., & Kim, J. J. (2005). Effect sizes for research: A broad practical approach.
Mahwah, NJ: Erlbaum.—Although not specifically written for the meta-analyst, this
book provides a thorough description of methods of indexing effect sizes.
Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA:
Sage.—This short book (247 pages) provides a more thorough coverage than that
of Rosenthal (1991), but is still brief and accessible. Lipsey and Wilson frame meta-analysis in terms of analysis of effect sizes, regardless of the type of effect size used.
Although only part of one chapter (Chapter 4) is devoted to effect size computation,
Basic Effect Size Computation
the authors include computational details in an appendix (Appendix B) and the second
author provides an Excel worksheet through his website that is useful in computing r
Rosenthal, R. (1991). Meta-analytic procedures for social research (revised ed.). Newbury
Park, CA: Sage.—This book is a very short (153 pages) and accessible introduction to
basic meta-analytic procedures. Chapter 2 provides an accessible introduction to the
practice of computing effect sizes for meta-analysis, with a focus on the use of r.
1. I should note here that this is a restrictive definition of an effect size, used for
convenience here. In Chapter 7, I describe other types of effect sizes that expand
this definition. For example, an effect size might be the mean or proportion of
a single variable, or some relations among more than two variables (e.g., semipartial correlations between two variables controlling for a third, internal consistencies of many items of a scale). However, this definition of effect sizes as
indexing the association between two variables is the most widely used.
2. Although there is general support for this transformation (see, e.g., Alexander,
Scozzaro, & Borodkin, 1989; Hedges & Olkin, 1985; James, Demaree, & Mulaik,
1986), readers should be aware that some experts (see Hunter & Schmidt, 2004,
p. 83) recommend against using this transformation.
3. There has been some criticism of g and d (which also applies to gGlass) as effect
sizes. The main source of critique is that these effect size estimates are not
robust to violations of normality assumptions (see Algina, Keselman, & Penfield, 2005). Several alternatives have been suggested including indices based on
dominance statistics, Winsorized data, and bootstrapping. These alternatives
do not seem viable for use in meta-analyses, however, because you typically do
not have access to the primary data. Therefore, you will typically need to rely on
g or d (or, less often, gGlass) in computing standardized mean differences from
information commonly reported in primary research. This necessity is probably
not of too much concern for your meta-analysis given that the limits of traditional standardized mean differences lie primarily in the potential inaccuracy
of confidence intervals rather than biases in point estimation. However, future
quantitative research evaluating the impact of using nonrobust effect size estimates on conclusions of mean, confidence intervals, and heterogeneity drawn
from meta-analyses is needed to support this claim.
4. Glass’s (e.g., Glass, McGraw, & Smith, 1981) standardized mean difference has
been represented by numerous symbols. Rosenthal (1991, 1994) has denoted this
index using the Greek uppercase delta (Δ), although I avoid this practice because
others use this symbol to denote a population parameter standardized mean
difference. Hedges and Olkin (1985) use g′ (in contrast to g) to denote Glass’s
standardized mean difference, which is clear, if not intuitive. Although it could
be argued that proliferation of more symbols is unnecessary, I use the symbol
gGlass for clarity.
5. Where “pooled estimates” refers to the combination of estimates from both groups, $s_{pooled} = \sqrt{(n_1 s_1^2 + n_2 s_2^2) / (n_1 + n_2)}$ for the pooled population estimate of standard deviation, or substituting sd for s when pooling sample standard deviations.
6. Both are considered estimators of the same population parameter, $(\mu_1 - \mu_2) / \sigma$.
The difference in these two statistics is that d has a slight bias, whereas g is unbiased, in estimating this common population parameter.
7. It is also worth noting here that g and d also differ in that a correction exists for
bias when estimating g from small samples that does not exist for d. I describe
this small sample correction for g below.
8. Other effect size indices under conditions of heteroscedasticity have been proposed (see Grissom & Kim, 2001). However, these indices generally require
access to raw data from primary studies, and those that do not require raw data
have not been thoroughly enough studied to support their widespread use.
9. Alternatively, one could consider the standardized mean difference in reference
to the standard normal cumulative distribution function (denoted by Φ(g), Φ(d),
or Φ(gGlass)) to determine the percentage of members of one group falling above
the mean of the second group (Grissom & Kim, 2005; Hedges & Olkin, 1985).
To put it in more comprehensible terms, one can look up the value of the standardized mean difference as a Z-score in a normal curve table to identify the percentage of the normal distribution that falls below (to the left of) that particular
Z-score; this percentage represents the percentage of Group 1 members who are
above the mean of Group 2. For example, a standardized mean difference of 0.75
implies that 77% of Group 1 members are above the mean of Group 2, whereas a
standardized mean difference of –0.50 implies that 31% of Group 1 members are
above the mean of Group 2. This interpretation assumes a normal distribution in
10. There is some evidence that an alternative index may be superior to the odds
ratio. This alternative is to transform the natural log of the odds ratio, ln(o), to a
standardized mean difference: $d_{Cox} = \ln(o) / 1.65$.
In a simulation study (Sánchez-Meca et al., 2003), dCox exhibited little bias,
whereas ln(o) slightly underestimated associations, especially when the true
(population) association was large. However, dCox has not yet been widely used
by meta-analysts. Nevertheless, you might consider this alternative effect size
if your meta-analysis indexes associations between dichotomous variables that
you expect may be large.
11. Assignment of orthogonal contrast weights, in which successive values are equidistant, assumes that the groups themselves are equidistant with respect to the
underlying continuous construct. For example, if we assigned contrast weights of
–1, 0, and +1 to groups defined as “never,” “sometimes,” or “often” experiencing
an event, this coding would assume that the amount of difference in the underlying group variable between “never” and “sometimes” is equal to the difference
between “sometimes” and “often” groups. The extent to which this assumption
is not valid will most likely attenuate the computed effect sizes using this technique. Of course, the meta-analyst might choose different contrast weights if
there is reason to do so; the only restrictions on selecting contrast weights are
that they make sense and that they sum to zero.
12. Using the equation $F_{between} = MS_{between} / MS_{within}$, where $MS_{between} = \sum [n_g (M_g - GM)^2] / df_{between}$ and $MS_{within} = \sum (n_g s_g^2) / \sum n_g$. The grand mean (GM) can be computed from group sizes and means as $GM = \sum (n_g M_g) / \sum n_g$. The other term needed is the numerator degrees of freedom of the omnibus test, or $df_{between}$ = number of groups – 1.
13. Lipsey and Wilson (2001) recommend using t rather than Z in this equation
(where you would find the appropriate value of t given p and df). With small
sample sizes, the use of t seems more appropriate when the significance level is
from a test in which the t-distribution is the appropriate comparison distribution. However, with a large sample, the difference in values resulting from the
use of Z versus t becomes negligible, and the use of Z is likely more flexible.
14. McGrath and Meyer (2006) have pointed out that r is affected by base rates (i.e.,
relative group sizes) of the dichotomous variable, whereas standardized mean
differences are not. Specifically, more extreme group size discrepancies will
diminish values of r but not standardized mean differences. Therefore, differences in base rates across studies might contribute to heterogeneity among r but
not standardized mean differences. Based on this consideration, I believe that
standardized mean differences (e.g., g) are preferable to r when one of the variables is dichotomous, especially if the distribution of this dichotomy is extreme
(with one group more than 2 or 3 times more common) or variable across studies. However, Rosenthal (1991) maintains a preference for r.
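Three of the computations described in Notes 9, 10, and 12 can be sketched in Python. This is a minimal illustration rather than a definitive implementation: the function names are hypothetical, the constant 1.65 in the Cox transformation is assumed from Sánchez-Meca et al. (2003), and the F computation follows the formulas of Note 12 exactly as stated there (group sizes, means, and standard deviations as inputs).

```python
import math
from statistics import NormalDist


def percent_above_other_mean(smd):
    """Note 9: percentage of Group 1 above the Group 2 mean.

    Treats the standardized mean difference as a Z-score and assumes
    normally distributed scores.
    """
    return 100 * NormalDist().cdf(smd)


def d_cox(odds_ratio):
    """Note 10: Cox transformation of the log odds ratio, ln(o) / 1.65.

    The divisor 1.65 is an assumption taken from Sanchez-Meca et al. (2003).
    """
    return math.log(odds_ratio) / 1.65


def f_between(ns, means, sds):
    """Note 12: omnibus F from group sizes, means, and standard deviations."""
    total_n = sum(ns)
    grand_mean = sum(n * m for n, m in zip(ns, means)) / total_n
    df_between = len(ns) - 1
    ms_between = sum(n * (m - grand_mean) ** 2
                     for n, m in zip(ns, means)) / df_between
    ms_within = sum(n * s ** 2 for n, s in zip(ns, sds)) / total_n
    return ms_between / ms_within


# The two examples given in Note 9.
print(round(percent_above_other_mean(0.75)))   # 77
print(round(percent_above_other_mean(-0.50)))  # 31
```

The first two printed values reproduce the percentages quoted in Note 9. The other two functions are shown without worked values because the notes do not supply any.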
Corrections to Effect Sizes
Several corrections can be made to the effect sizes described in Chapter 5.
Some are made in order to produce more desirable statistical properties; for
example, Fisher’s transformation of r (to Zr; Equation 5.2 in Chapter 5) and
the log transformation of o (Section 5.1.4 in Chapter 5) aim to produce a
more normal distribution of these effect sizes. Other corrections seek to alleviate biases that are known to exist under certain conditions. For example, the
adjustment to g for small sample sizes (Equation 5.9) corrects for the systematic
overestimation of effect sizes under these conditions.
In this chapter, I describe a specific family of corrections to effect sizes,
often called artifact corrections (Hunter & Schmidt, 2004). These artifact corrections aim to correct for methodological features of primary studies that are
known to bias (typically attenuate) effect sizes. The reasons for performing
these corrections are twofold. First, the corrections provide a more accurate
estimate of what effect sizes would have been if studies had not contained
methodological imperfections. Second, the corrections may reduce heterogeneity (variability in effect sizes) across studies that is due to differences
in methodological imperfections, thus allowing for the identification of more
substantively interesting similarities or differences (i.e., moderators; see Chapter
9) across effect sizes. As promising as these reasons seem, there are critics
of artifact correction. Next, I provide a brief overview of the arguments for
and against artifact correction, and then describe several artifact corrections.
Finally, I discuss some practical considerations in deciding whether (and how)
to correct for artifacts in a meta-analysis.
6.1 The Controversy of Correction
There is some controversy about correcting effect sizes used in meta-analyses
for methodological artifacts. In this section I describe arguments for and
against correction, and then attempt to reconcile these two positions.
6.1.1 Arguments for Artifact Correction
Probably the most consistent advocates of correcting for study artifacts are
John Hunter (now deceased) and Frank Schmidt (see Hunter & Schmidt,
2004; Schmidt & Hunter, 1996; as well as, e.g., Rubin, 1990). Their argument, in a simplified form, is that individual primary studies report effect
sizes among imperfect measures of constructs, not the constructs themselves.
These imperfections in the measurement of constructs can be due to a variety
of sources including unreliability of the measures, imperfect validity of the
measures, or imperfect ways in which the variables were managed in primary
studies (e.g., artificial dichotomization). Moreover, individual studies contain
not only random sampling error (due to their finite sample sizes), but often
biased samples that do not represent the population about which you wish to
These imperfections of measurement and sampling are inherent to every
primary study and provide a limiting frame within which you must interpret the findings. For instance, a particular study does not provide a perfect
effect size of the association between X and Y, but rather an effect size of the
association between a particular measure of X with a particular measure of
Y within the particular sample of the study. The heart of the argument for
artifact correction is that we are less interested in these imperfect effect sizes
found in primary studies and more interested in the effect sizes between
latent constructs (e.g., the correlation between construct X and construct Y).
The argument seems reasonable and in fact provides much of the impetus
for the rise of such latent variable techniques as confirmatory factor analysis
(e.g., Brown, 2006) and structural equation modeling (e.g., Kline, 2005) in primary research. Our theories that we wish to evaluate are almost exclusively
about associations among constructs (e.g., aggression and rejection), rather
than about associations among measures (e.g., a particular self-report scale
of aggression and a particular peer-report method of measuring rejection). As
such, it makes sense that we would wish to draw conclusions from our meta-analyses about associations among constructs rather than associations among
imperfect measures of these constructs reported in primary studies; thus, we
should correct for artifacts within these studies in our meta-analyses.
A corollary to the focus on associations among constructs (rather than
imperfect measures) is that artifact correction results in the variability among
studies being more likely due to substantively interesting differences rather
than methodological differences. For example, studies may differ due to a
variety of features, with some of these differences being substantively interesting (e.g., characteristics of the sample such as age or income, type of intervention evaluated) and others being less so (e.g., the use of a reliable versus
unreliable measure of a variable). Correction for these study artifacts (e.g.,
unreliability of measures) reduces this variability due to likely less interesting differences (i.e., noise), thus allowing for clearer illumination of differences between studies that are substantively interesting through moderator
analyses (Chapter 9).
6.1.2 Arguments against Artifact Correction
Despite the apparent logic supporting artifact correction in meta-analysis,
there are some who argue against these corrections. Early descriptions of
meta-analysis described the goal of these efforts as integrating the findings of
individual studies (e.g., Glass, 1976); in other words, the synthesis of results
as reported in primary studies. Although one might argue that these early
descriptions simply failed to appreciate the difference between associations among measures and associations among constructs (although this seems unlikely given
the expertise Glass had in measurement and factor analysis), some modern
meta-analysts have continued to oppose artifact adjustment even after the
arguments put forth by Hunter and Schmidt. Perhaps most pointedly, Rosenthal (1991) argues that the goal of meta-analysis “is to teach us better what is,
not what might some day be in the best of all possible worlds” (p. 25, italics
in original). Rosenthal (1991) also cautions that these corrections can yield
inaccurate effect sizes, such as when corrections for unreliability yield correlations greater than 1.0.
Another, though far weaker, argument against artifact correction is simply that such corrections add another level of complexity to our meta-analytic
procedures. I agree that there is little value in making these procedures
more complex than is necessary to best answer the substantive questions
of the meta-analysis. Furthermore, additional data-analytic complexity often
requires lengthier explanation when reporting meta-analyses, and our focus
in most of these reports is typically to explain information relevant to our
content-based questions rather than data-analytic procedures. At the same
time, simplicity alone is not a good guide to our data-analytic techniques.
The more important question is whether the cost of additional data-analytic
complexity is offset by the improved value of the results yielded.
6.1.3 Reconciling Arguments Regarding Artifact Correction
Many of the critical issues surrounding the controversy of artifact correction can be summarized in terms of whether meta-analysts prefer to describe
associations among constructs (those for correction) or associations as found
among variables in the research (those against correction). In most cases, the
questions likely involve associations among latent constructs more so than
associations among imperfectly measured variables. Even when questions
involve measurement (e.g., are associations between X and Y stronger when
X is measured in certain ways than when X is measured in other ways?),
it seems likely that one would wish to base this answer on the differences
in associations among constructs between the two measurement approaches
rather than the magnitudes of imperfections that are common for these measurement approaches. Put bluntly, Hunter and Schmidt (2004) argue that
attempting to meta-analytically draw conclusions about constructs without
correcting for artifacts “is the mathematical equivalent of the ostrich with its
head in the sand: It is a pretense that if we ignore other artifacts then their
effects on study outcomes will go away” (p. 81). Thus, if you wish to draw
conclusions about constructs, which is usually the case, it would appear that
correcting for study artifacts is generally valuable.
At the same time, one must consider the likely impact of artifacts on the
results. If one is meta-analyzing a body of research that consistently uses
reliable and valid measures within representative samples, then the benefits
of artifact adjustment are likely small. In these cases, the additional complexity of artifact adjustment is likely not warranted. To adapt Rosenthal’s (1991)
argument quoted earlier, if what is matches closely with what could be, then
there is little value in correcting for study artifacts.
In sum, although I do not believe that all, or even any, artifact adjustments are necessary in every meta-analysis, I do believe it is valuable to always
consider each of the artifacts that could bias effect sizes. In meta-analyses in
which these artifacts are likely to have a substantial impact on at least some
of the included primary studies, it is valuable to at least explore some of the
6.2 Artifact Corrections to Consider
Hunter and Schmidt (2004; see also Schmidt, Le, & Oh, 2009) suggest several
corrections to methodological artifacts of primary studies. These corrections
involve unreliability of measures, poor validity of measured variables, artificial dichotomization of continuous variables, and range restriction of variables. Next I describe the conceptual justification and computational details
of each of these corrections. The computations of these artifact corrections
are summarized in Table 6.1.
Before turning to these corrections, however, let us consider the general
formula for all artifact corrections. The corrected effect size (e.g., r, g, o),
which is the estimated effect size if there were no study artifacts, is a function
of the effect size observed in the study divided by the total artifact correction1:
Equation 6.1: General equation for artifact corrections
$ES_{adjusted} = ES_{observed} / a$
• ESadjusted is the adjusted (corrected) effect size.
• ESobserved is the observed (uncorrected) effect size.
• a is the total correction for all study artifacts.
TABLE 6.1. Summary of Equations for Artifact Corrections
Artifact                          Correction
Unreliability(a)                  $a_{unreliability} = \sqrt{r_{xx}}$
Poor validity                     $a_{validity} = r_{XT}$
Artificial dichotomization(b)     (see text)
Range restriction (direct)(c)     $a_{range}$, a function of $u$ and $r$ (see Equations 6.7 and 6.8)
Range restriction (indirect)(c)   (see Equations 6.7 and 6.8)
(a) The correction for this artifact on both variables comprising the effect size is equal to the product of the correction on each variable.
(b) The correction for this artifact on both variables comprising the effect size is approximated by the product of the correction on each variable in many cases (see text for details).
(c) The correction for this artifact on both variables comprising the effect size requires special techniques described in the text.
Here, a is the total correction for all study artifacts and is simply the
product of the individual artifacts described next (i.e., a = a1 * a2 * . . . , for the
first, second, etc., artifacts for which you wish to correct).2 Each individual
artifact (a1) and the total product of all artifacts (a) have values that are 1.0
(no artifact bias) or less (with the possible exception of the correction for
range restriction, as described below). The values of these artifacts decrease
(and adjustments therefore increase) as the methodological limitations of the
studies increase (i.e., larger problems, such as very low reliability, result in
smaller values of a and larger corrections).
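The general correction can be sketched in a few lines of Python, assuming Equation 6.1 takes the form ES_adjusted = ES_observed / a, as the text above describes. The function name and the numeric values (an observed correlation of .30 and two artifact values of .90 and .80) are hypothetical and chosen only for illustration.

```python
import math


def adjust_effect_size(es_observed, artifacts):
    """Divide the observed effect size by the total artifact correction.

    artifacts: the individual artifact values (a1, a2, ...); their
    product is the total correction a.
    """
    a = math.prod(artifacts)  # a = a1 * a2 * ...
    if a <= 0:
        raise ValueError("total artifact correction must be positive")
    return es_observed / a


# Hypothetical study: observed r = .30 with two artifacts of .90 and .80,
# so a = .72 and the adjusted r is .30 / .72.
print(adjust_effect_size(0.30, [0.90, 0.80]))
```

Because each artifact value is typically at most 1.0, the product shrinks as more artifacts are included, and the adjusted effect size grows accordingly. With no artifacts listed, the product is 1 and the effect size is returned unchanged.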
Artifact adjustments to effect sizes also require adjustments to standard
errors. Because standard errors represent the imprecision in estimates of
effect sizes, it makes conceptual sense that these would increase if you must
make an additional estimate in the form of how much to correct the effect
size. Specifically, the standard errors of effect sizes (e.g., r, g, or o; see Chapter 5) are also adjusted for artifact correction using the following general formula:
Equation 6.2: Equation for adjusting standard errors for artifact corrections
$SE_{adjusted} = SE_{observed} / a$
• SEadjusted is the adjusted standard error.
• SEobserved is the observed (uncorrected) standard error.
• a is the total correction for all study artifacts.
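The standard error adjustment can be sketched in the same way, assuming Equation 6.2 has the form SE_adjusted = SE_observed / a, in parallel with Equation 6.1. The function name and the numeric values are hypothetical, and the sketch does not cover corrections for range restriction.

```python
def adjust_standard_error(se_observed, a):
    """Divide the observed standard error by the total artifact correction."""
    if a <= 0:
        raise ValueError("total artifact correction must be positive")
    return se_observed / a


# Hypothetical study: observed effect size .30, standard error .05, a = .72.
es_obs, se_obs, a = 0.30, 0.05, 0.72
es_adj = es_obs / a
se_adj = adjust_standard_error(se_obs, a)

# Both quantities are divided by the same a, so their ratio is unchanged.
print(es_obs / se_obs, es_adj / se_adj)
```

Under this form, the corrected effect size is larger but also less precise, and the ratio of effect size to standard error for a single study stays the same.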
The one exception to this equation is when one is correcting for range
restriction. This correction represents an exception to the general rule of
Equation 6.2 because the effect size is used in the computation of a, the artifact correction (see Equations 6.7 and 6.8). In this case of correcting for range
restriction, you multiply arange by ESadjusted /ESobserved prior to correcting the
6.2.1 Corrections for Unreliability
This correction is for unreliability of measurement of the variables comprising
the effect sizes (e.g., variables X and Y that comprise a correlation). Unreliability refers to nonsystematic error in the measurement process (contrast with
systematic error in measurement, or poor validity, described in Section 6.2.4).
Reliability, or the repeatability of a measure (or the part that is not unreliable),
can be indexed in at least three ways. Most commonly, reliability is considered
in terms of internal consistency, representing the repeatability of measurement
across different items of a scale. This type of reliability is indexed as a function of the associations among items of a scale, most commonly through an
index called Cronbach’s coefficient alpha, α (Cronbach, 1951; see, e.g., DeVellis, 2003). Second, reliability can be evaluated in terms of agreement between
multiple raters or reporters. This interrater reliability can be evaluated with
the correlation between sets of continuous scores produced by two raters (or
average correlations among more than two raters) or with Cohen’s kappa (κ)
representing agreement between categorical assignment between raters (for a
full description of methods of assessing interrater reliability, see von Eye &
Mun, 2005). A third index of reliability is the test–retest reliability. This test–
retest reliability is simply the correlation (r) between repeated measurements,
with the time span between measurements being short enough that the construct is not expected to change during this time. Because all three types of
reliability have a maximum of 1 and a minimum of 0, the relation between
reliability and unreliability can be expressed as reliability = 1 – unreliability.
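As an illustration of the internal consistency index, the following sketch computes Cronbach's coefficient alpha from item-level scores. The text does not give the formula; the sketch assumes the standard one, alpha = [k / (k - 1)] * [1 - (sum of item variances) / (variance of total scores)], where k is the number of items. The function name and the data layout (one list of scores per item, respondents in the same order) are hypothetical.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha from a list of items.

    item_scores: one list per item, each holding the scores of the same
    respondents in the same order.
    """
    k = len(item_scores)
    if k < 2:
        raise ValueError("alpha requires at least two items")
    # Total scale score for each respondent (sum across items).
    totals = [sum(scores) for scores in zip(*item_scores)]
    sum_item_variances = sum(variance(item) for item in item_scores)
    return (k / (k - 1)) * (1 - sum_item_variances / variance(totals))


# Two items that rank four respondents identically are perfectly consistent.
print(cronbach_alpha([[1, 2, 3, 4], [1, 2, 3, 4]]))
```

Items that covary strongly make the variance of the total score large relative to the sum of the item variances, which pushes alpha toward 1. Items that do not covary at all yield an alpha of 0.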
Regardless of whether reliability is indexed as internal consistency (e.g.,
Cronbach’s α), interrater agreement (r or κ), or test–retest reliability (r),
this reliability impacts the magnitude of effect sizes that a study can find.
If reliability is high (e.g., near perfect, or close to 1) for the measurement of
two variables, then you expect that the association (e.g., correlation, r) the
researcher finds between these variables will be an unbiased estimate of the
actual (latent) population effect size (assuming the study does not contain
other artifacts described below). However, if the reliability of measurement of one or both variables comprising the association of interest is low (far below 1, maybe even approaching 0), then the maximum (in terms of absolute value
of positive or negative associations) effect size the researcher might detect
is substantially lower than the true population effect size. This is because
the correlation (or any other effect size) between the two variables of interest is being computed not only from the true association between the two
constructs, but also between the unreliable aspects of each measure (i.e., the
noise, which typically is not correlated across the variables).
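The attenuation described above can be illustrated with a small simulation. This is a sketch under classical test theory assumptions that go beyond what the text states: true scores are standard normal, measurement error is normal and uncorrelated across variables, and the expected observed correlation equals the true correlation multiplied by the square root of the product of the two reliabilities. All function names and numeric values are hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_attenuation(rho, rel_x, rel_y, n=20000, seed=1):
    """Observed correlation when true scores correlate rho and the
    measures have reliabilities rel_x and rel_y (each in (0, 1])."""
    rng = random.Random(seed)
    # Error SD chosen so that var(true) / var(observed) equals the reliability.
    err_sd_x = math.sqrt((1 - rel_x) / rel_x)
    err_sd_y = math.sqrt((1 - rel_y) / rel_y)
    obs_x, obs_y = [], []
    for _ in range(n):
        true_x = rng.gauss(0, 1)
        true_y = rho * true_x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        obs_x.append(true_x + rng.gauss(0, err_sd_x))
        obs_y.append(true_y + rng.gauss(0, err_sd_y))
    return pearson_r(obs_x, obs_y)


observed = simulate_attenuation(0.50, 0.80, 0.70)
expected = 0.50 * math.sqrt(0.80 * 0.70)
print(f"simulated r = {observed:.3f}, expected attenuated r = {expected:.3f}")
```

With a true correlation of .50 and reliabilities of .80 and .70, the simulated correlation should fall close to the expected attenuated value of about .37, within sampling error.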
If you know (or at least have a good estimate of) the amount of unreliability in a measure, you can estimate the magnitude of this effect size
attenuation. This ability is also important for your meta-analysis because you
might wish to estimate the true (disattenuated) effect size from a primary
study reporting an observed effect size and the reliability of measures. Given
the reliability for variables X and Y, with these general reliabilities denoted as