6.3 Practical Matters: When (and How) to Correct: Conceptual, Methodological, and Disciplinary Considerations
Corrections to Effect Sizes
section, you might reasonably choose to correct only for those that seem
most pressing within the primary studies being synthesized.
How pressing a particular type of artifact is within a meta-analysis is
partly a conceptual question and partly an empirical question. First, you
must consider the collection of primary studies in light of your conceptual
expertise in the area. Relevant questions include the following: How valid
are the measures within this research in relation to the construct I am interested in? How representative are the samples relative to the population about
which I want to draw conclusions? Again, there is not a statistical answer
to such questions; rather, these questions must be answered based on your
understanding of the field.
In addition to conceptual considerations, you might also base conclusions on empirical grounds. Specifically, you can consider the data reported
in primary studies to draw conclusions about the presence of important artifacts. For example, I recommend coding the internal consistencies of relevant
measures within the primary studies, meta-analyzing these reliabilities (see
Chapter 7), and determining (1) whether the collection of studies has generally high or low reliabilities of measures and (2) whether substantial variability exists across studies in these reliabilities. Similarly, if many studies use
similar measures of a variable (i.e., with the same scale), then you could code
and evaluate standard deviations across studies (see Chapter 7) to determine
whether some studies suffer from restricted ranges. In short, for each of the
potential artifacts described in the previous section, you should consider the
available empirical evidence to determine whether this artifact is uniformly
or inconsistently present in the primary studies being analyzed. If a particular artifact is uniformly present, then correcting for it will yield more accurate
overall effect size estimates (among latent constructs). If a particular artifact
is present in some studies but not in others (or present in differing degrees
across studies), then correcting for this artifact will reduce less interesting
(i.e., artifactual) variability across studies and allow for a clearer picture of
substantively interesting variability in effect sizes.
6.3.2 Disciplinary Considerations
Whereas I view the conceptual and empirical considerations as most important in deciding whether and how to correct for artifacts, the reality is that
these corrections are more common in some fields than in others. This means
that one meta-analyst working within one field might be expected to correct
for certain artifacts, whereas another meta-analyst working within another
field might be met with skepticism if certain (or any) corrections were to be
performed. These disciplinary practices are unfortunate, especially because
CODING INDIVIDUAL STUDIES
they are due more often to those who are influential in a field than to
consideration of that field's particular needs. Nevertheless, it is useful to recognize the common practices within your particular field.
Notwithstanding recognition of these disciplinary practices, I want to
encourage you to not feel restricted by these practices. In other words, do
not base your decision to perform or not perform certain artifact corrections
only on common practices within your field. Instead, carefully consider the
conceptual and empirical basis for making certain corrections, and then use
(or not) these corrections to obtain results that best answer your research
questions.
6.4 Summary
In this chapter I have described rationales for and against corrections of
study artifacts, imperfections of primary studies that bias (typically attenuate) effect size estimates. I described methods of correcting for several types
of artifacts: unreliability of measures, artificial dichotomization of continuous variables, range restriction, poor validity of measures, and covariation
due to a third variable. Despite disciplinary differences in practices of artifact
correction, I argue that the decision to correct or not to correct for certain
artifacts should be based on conceptual and empirical grounds.
6.5 Recommended Readings
Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias
in research findings (2nd ed.). Thousand Oaks, CA: Sage.—This book provides a
complete description of meta-analysis emphasizing the artifact corrections described
in this chapter. The authors have been the most active advocates for artifact correction
in the field of meta-analysis.
Schmidt, F. L., Le, H., & Oh, I.-S. (2009). Correcting for the distorting effects of study artifacts
in meta-analysis. In H. Cooper, L. V. Hedges, & J. C. Valentine (Eds.), The handbook
of research synthesis and meta-analysis (2nd ed., pp. 317–333). New York: Russell
Sage Foundation.—This chapter represents a more concise overview of the practice of
artifact correction in meta-analysis.
Notes
1. By describing artifact corrections of effect sizes of individual studies, I am
implicitly prescribing one of two possible methods of meta-analysis with artifact
correction. Specifically, I am recommending that you correct the effect sizes of
each individual study and use these corrected effect sizes in subsequent meta-analytic computations (described in Chapters 8–12). This approach is described
in Hunter and Schmidt (2004, Ch. 3). My selection of this approach makes my
subsequent description of combining and comparing effect sizes across studies
more straightforward. However, it also requires that most studies provide sufficient information to make corrections (e.g., report internal consistency to correct
for unreliability), and it may be necessary to substitute estimates of these corrections for studies that do not provide sufficient information (e.g., meta-analytically
compute a mean reliability that is used for studies that do not report internal consistency). An alternative approach is to meta-analytically compute a distribution
of uncorrected effect sizes across studies and distributions of corrections across
studies. These techniques are more complex, yet may be useful when primary
studies are inconsistent in reporting information needed to correct for artifacts.
These techniques are described in Hunter and Schmidt (2004, Ch. 4).
2. An important caveat of this use of multiplicative combination of artifacts is that
the artifacts are assumed to be independent of one another. Violations of this
assumption can lead to inaccurate corrected effect sizes, including out-of-bounds
effect sizes (e.g., r greater than 1.0).
3. I have arranged these reasons in order from what I consider most to least justifiable.
Not correcting for unreliability of one variable is acceptable if a convincing case
can be made that it is highly reliably measured. Not correcting for reliability
of one variable because primary studies do not report this reliability is weaker
justification, though it is a reality you may have to deal with in some situations.
It is likely that some studies in a meta-analysis will report reliability estimates,
whereas others will not. In these cases it is preferable for you to seek reliability
information from primary study authors. If it is still not possible to obtain reliability estimates for some studies in the meta-analysis, I recommend performing
a meta-analysis of reliabilities among studies in the meta-analysis (see Chapter 7)
and using either the mean reliability or an estimated reliability predicted by other
study features. The final reason listed, not correcting for unreliability of one variable because you are not interested in the variable, is not acceptable. Expressing
an interest in X but not Y ignores the fact that the association between these variables necessarily depends on the measurement properties (including reliability)
of both variables, so unreliability in Y is going to adversely affect the association
involving X, which you are interested in.
4. Latent correlations can also be found within structural equation models, or
latent variable models that include directional (regression) paths. However, the
meta-analyst needs to be careful when determining latent correlations from such
models. Although nondirectional (i.e., bivariate) associations between exogenous
(predictor) variables can be interpreted as latent correlations, nondirectional
associations between endogenous variables (predicted variables) and directional
associations cannot be interpreted as latent correlations. In these instances, the
meta-analyst needs to derive the latent correlations through tracing rules, as
described by Kline (2005) and Maruyama (1998).
5. When discussing range restriction, I focus on the use of r as the index of effect
size. This is the most common situation, as range restriction is relevant only to
continuous variables and is most often encountered in naturalistic studies. However, it is also possible to correct for range restriction of the continuous variable
when considering standardized mean differences (e.g., g). For details regarding
these corrections, see Hunter and Schmidt (2004) or Lipsey and Wilson (2001).
7 Advanced and Unique Effect Size Computation
Although the three effect sizes (r, g, or other standardized mean differences,
and o) described in Chapter 5 are most commonly used, you are not restricted
to these indices of two-variable associations in your meta-analysis. Instead,
you should consider the broad range of potential effect sizes as answers to the
research questions relevant to your review. In this chapter, I describe some less
commonly used effect sizes that are useful for meta-analysis of single variables
(i.e., means, proportions, and variances or standard deviations), effect sizes
that retain the meaningful metric of the variables involved (i.e., unstandardized
effect sizes), effect sizes from multivariate regression analyses, and a variety
of different effect sizes that have received less consideration (e.g., scale reliabilities, longitudinal change scores). I then describe some of the challenges
of using less common effect sizes in your meta-analysis, as well as some of
the opportunities.
7.1 Describing Single Variables
There are relatively few instances of meta-analyzing single variables, yet this
information could be potentially valuable. At least three types of information
regarding single variables could be important: (1) the mean level of individuals on a continuous variable; (2) the proportions of individuals falling into a
particular category of a categorical variable; and (3) the amount of variability
(or standard deviation) in a continuous variable.
7.1.1 Mean Level on Variable
Meta-analysis of reported means on a single variable may have great value. One
potential benefit is that meta-analytic combination (see Chapters 8 and 9) allows you to
obtain a more precise estimate of this mean than might be obtained in primary
studies, especially when those primary studies have small sample sizes. Perhaps
more importantly, meta-analytic comparison (see Chapter 10) allows you to identify potential reasons why means differ across studies (e.g., methodological differences such as condition or reporter; sample characteristics such as age or ethnicity). Thus, the meta-analysis of means of single variables has considerable value.
At the same time, there is also an important limiting consideration in the
meta-analysis of means in that the primary studies must typically report this
value in the same metric. For example, if one study measures the variable of
interest on a 0–4 scale, whereas another uses a 1–100 scale, it usually does not
make sense to combine or compare means across these studies.¹ Some exceptions can be considered, however. First, if the different scales
arise because the primary study authors scored comparable measures in different ways, it is usually possible to transform one of the scales to the metric
of the other. For example, if two primary studies both use a 6-item scale with
items having values from 1 to 5, one study may form a composite by averaging
the items, whereas the other forms a composite by summing the items. In this
case, it would be possible to transform one of the two means to the same scale
of the other (i.e., multiplying the average by 6 to obtain the sum, or dividing
the sum by 6 to obtain the average), and the means of the two studies could
then be combined and compared. A second, more general exception is that it
is often possible to transform scores from studies using different scales into a
common metric. From the example I provided of one study using a 0–4 scale
and the other using a 1–100 scale, it is possible to transform a mean on one
scale to an equivalent mean on the other using the following equation:
Equation 7.1: Transforming scores between two different scales
$$X_2 = (X_1 - \mathrm{Min}_1)\left(\frac{\mathrm{Max}_2 - \mathrm{Min}_2}{\mathrm{Max}_1 - \mathrm{Min}_1}\right) + \mathrm{Min}_2$$
• X2 is the equivalent score on the second scale.
• X1 is the score on the first scale that you wish to transform.
• Min1 is the lowest possible score on the first scale.
• Max1 is the highest possible score on the first scale.
• Min2 is the lowest possible score on the second scale.
• Max2 is the highest possible score on the second scale.
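As an illustration, the transformation in Equation 7.1 can be sketched in a few lines of Python. The function name `rescale_score` and the example mean of 3.0 are hypothetical, chosen only to show the arithmetic for the 0–4 and 1–100 scales mentioned above.

```python
def rescale_score(x1, min1, max1, min2, max2):
    """Map a score (or mean) from scale 1 onto scale 2, following Equation 7.1."""
    if max1 == min1:
        raise ValueError("The first scale must have a nonzero range.")
    # Shift to zero, stretch by the ratio of the two ranges, shift to the new minimum.
    return (x1 - min1) * (max2 - min2) / (max1 - min1) + min2


# Hypothetical example: a study mean of 3.0 on a 0-4 scale,
# expressed on a 1-100 scale.
mean_on_second_scale = rescale_score(3.0, min1=0, max1=4, min2=1, max2=100)
print(mean_on_second_scale)  # 75.25
```

Note that the endpoints of the first scale map onto the endpoints of the second, which is a quick check that the minimum and maximum values were entered correctly.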
A caution in using different scales is that even if both studies use a common range of scores (e.g., 0–4), it is probably only meaningful to combine and
compare means if the studies used the same anchor points (e.g., if one used
response options of never, rarely, sometimes, often, and always, whereas the
other used 0 times, once, 2–3 times, 4–6 times, and 7 or more times, it would
make little sense to combine or compare these studies). This may prove an
especially difficult obstacle if you are attempting to combine multiple scales
in which scores from one scale are transformed to scores of another using
Equation 7.1. This requirement of primary studies reporting the variable on
the same—or at least a comparable—metric means that you will often include
only studies using the same measure (e.g., a particular measure of depression, such as the Children’s Depression Inventory; Kovacs, 1992) or else very
similar measures (e.g., child- and teacher-reported aggression using parallel
items and response options). I suspect that this rather restrictive requirement
is the primary reason why meta-analysis of means is not more common. If
you are using different but similar measures, or transformations to place values of different measures on a common scale, I highly recommend evaluating
the measure as a moderator (see Chapter 9).
If you do have a situation in which the combination or comparison
of means is feasible, computing this effect size (and its standard error) is
straightforward. The equation for computing a mean is well known, but I
reproduce it here:
Equation 7.2: Computing the mean ($\bar{X}$) from raw data
$$\bar{X} = \frac{\sum x_i}{N}$$
• $x_i$ is the score of individual i.
• N is the sample size.
However, it is typically not necessary (or possible) for you to compute
this mean, as this is usually reported within the primary study. Therefore,
coding the mean, which is an effect size (of the central tendency of a single
variable), is usually straightforward.
Occasionally, however, the primary studies will report frequency tables
rather than means for variables with a small number of potential options. For
example, a primary study might report the number or proportion of individuals scoring 0, the number or proportion scoring 1, and so on, on a measure
that has possible options of 0, 1, 2, 3, and 4. Here, you can use these frequencies of different scores to re-create the raw data and then compute the mean
from these data (using Equation 7.2). An easier way to compute this mean is
using the following equivalent formula provided by Lipsey and Wilson (2001,
p. 176), summing over all potential values of a variable:
Equation 7.3: Computing the mean ($\bar{X}$) from frequency data
$$\bar{X} = \frac{\sum x f}{\sum f}$$
• x is a potential value of the variable.
• f is the frequency (number, percentage, or proportion) of individuals with the value x.
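The frequency-based computation in Equation 7.3 might be sketched as follows. The function name `mean_from_frequencies` and the frequency table are hypothetical; the same function works whether the frequencies are counts, percentages, or proportions, because the denominator is the sum of whatever frequencies are supplied.

```python
def mean_from_frequencies(frequencies):
    """Compute a mean from a mapping of score value -> frequency (Equation 7.3)."""
    total = sum(frequencies.values())
    if total <= 0:
        raise ValueError("Frequencies must sum to a positive number.")
    # Sum of (value * frequency) divided by the sum of frequencies.
    return sum(x * f for x, f in frequencies.items()) / total


# Hypothetical frequency table for a measure scored 0 through 4.
counts = {0: 10, 1: 20, 2: 40, 3: 20, 4: 10}
mean_score = mean_from_frequencies(counts)
print(mean_score)  # 2.0
```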
Before ending my discussion of calculating the mean as an effect size,
it is important to consider the standard error of this estimate of the mean
(which is used for weighting in the meta-analysis; see Chapter 8). To compute
the standard error of a study’s estimate of the mean, you must obtain the
(population estimate of the) standard deviation (s) and sample size (N) from
that study, which are then used in the following equation:
Equation 7.4: Standard error of a mean ($SE_{\bar{X}}$)
$$SE_{\bar{X}} = \frac{s_X}{\sqrt{N}}$$
• $s_X$ is the standard deviation of variable X.
• N is the sample size.
After computing the mean and standard error of the mean for each study,
you can then meta-analytically combine and compare results across studies
using techniques described later in this book (see Chapters 8–10).
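A minimal sketch of Equation 7.4, assuming the standard deviation and sample size have already been coded from a primary study. The function name and the example values (s = 1.2, N = 144) are hypothetical.

```python
import math


def standard_error_of_mean(sd, n):
    """Standard error of a study's mean: s / sqrt(N) (Equation 7.4)."""
    if n <= 0:
        raise ValueError("Sample size must be positive.")
    if sd < 0:
        raise ValueError("Standard deviation cannot be negative.")
    return sd / math.sqrt(n)


# Hypothetical study: standard deviation of 1.2 with 144 participants.
se_mean = standard_error_of_mean(1.2, 144)
print(round(se_mean, 3))  # 0.1
```

Because N enters through its square root, quadrupling the sample size only halves the standard error.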
7.1.2 Proportion of Individuals in Categories
Whereas the mean is a useful effect size for the typical score (i.e., central
tendency) of a single continuous variable, the proportion is a useful effect
size for a particular category of a categorical variable. For example, we may
be interested in the proportion of children who are aggressive or the proportion of individuals who meet certain criteria for rejected social status, if we
believe the meaningful conceptualization of aggression or rejection is categorical. In these cases, we are interested in the prevalence of an affirmative
instance of a single dichotomous variable.²
This proportion is often either directly reported in primary studies (as
either a proportion or percentage, which can be divided by 100 to obtain the
proportion), or else can be computed from the reported frequency falling in
this category (k) relative to the total sample size (N):
Equation 7.5: Computing the proportion (p)
$$p = \frac{k}{N}$$
• k is the number of individuals in the category of interest.
• N is the sample size.
This proportion works well as an effect size in many situations, but is
problematic when proportions are far from 0.50.³ For this reason, it is useful
to transform proportions (p) into logits (l) prior to meta-analytic combination or comparison:
Equation 7.6: Computing logits (l) from proportions
$$l = \ln\left(\frac{p}{1 - p}\right)$$
• p is the proportion of individuals in the category of interest.
This logit has the following standard error dependent on the proportion
(p) and sample size (N) (Lipsey & Wilson, 2001, p. 40):
Equation 7.7: Standard error of a logit ($SE_l$)
$$SE_l = \sqrt{\frac{1}{Np} + \frac{1}{N(1 - p)}}$$
• p is the proportion of individuals in the category of interest.
• N is the sample size.
Analyses would then be performed on the logit (l), weighted by the standard error (SEl) as described in Chapters 8 through 10. For reporting, it is
useful to back-transform results (e.g., mean effect size) in logits (l) back to
proportions (p), using the following equation:
Equation 7.8: Transforming logits to proportions
$$p = \frac{e^l}{e^l + 1}$$
• p is the proportion of individuals in the category of interest.
• l is the logit transformation.
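The sequence of steps in Equations 7.5 through 7.8 can be sketched as follows. The function names and the example study (20 of 200 individuals in the category of interest) are hypothetical, and the sketch assumes the proportion is strictly between 0 and 1, since the logit is undefined at the boundaries.

```python
import math


def proportion_to_logit(p):
    """Logit of a proportion: ln(p / (1 - p)) (Equation 7.6)."""
    if not 0 < p < 1:
        raise ValueError("Proportion must be strictly between 0 and 1.")
    return math.log(p / (1 - p))


def logit_standard_error(p, n):
    """Standard error of a logit: sqrt(1/(Np) + 1/(N(1 - p))) (Equation 7.7)."""
    if not 0 < p < 1 or n <= 0:
        raise ValueError("Need 0 < p < 1 and a positive sample size.")
    return math.sqrt(1 / (n * p) + 1 / (n * (1 - p)))


def logit_to_proportion(logit):
    """Back-transform a logit to a proportion: e^l / (e^l + 1) (Equation 7.8)."""
    return math.exp(logit) / (math.exp(logit) + 1)


# Hypothetical study: 20 of 200 individuals fall in the category of interest.
k, n = 20, 200
p = k / n                               # Equation 7.5
logit = proportion_to_logit(p)          # about -2.197
se_logit = logit_standard_error(p, n)   # about 0.236
p_back = logit_to_proportion(logit)     # recovers the original proportion
```

In practice the back-transformation would be applied to meta-analytic results (e.g., a mean logit across studies) rather than to a single study's logit; the round trip here only confirms that the two transformations are inverses.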
7.1.3 Variances and Standard Deviations
Few meta-analyses have used variances, or the equivalent standard deviation
(the square root of the variance), as effect sizes. However, the magnitude of
interindividual difference is a potentially interesting focus, so I offer this
brief description of using these as effect sizes for meta-analysis.
The standard deviation, which is the square root of the variance, is calculated from raw data as follows:
Equation 7.9: Computing the standard deviation (s) or variance ($s^2$) from raw data
$$s_X = \sqrt{s_X^2} = \sqrt{\frac{\sum (X_i - \bar{X})^2}{N - 1}}$$
• $X_i$ is the score of individual i.
• $\bar{X}$ is the average of X across individuals.
• N is the sample size.
This equation is the unbiased estimate of population standard deviation
(and the square root of variance) from a sample (versus a description of the
sample variability, which would be computed using N rather than N – 1 in
the denominator). This is also the statistic commonly reported in primary
research. In fact, you will almost never need to calculate this standard deviation, as doing so requires raw data that are typically not available. Fortunately, standard deviations (or variances) are nearly always reported as basic
descriptive information in primary studies.⁴
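For the rare case in which raw data are available, the distinction between the N – 1 and N denominators can be illustrated with Python's standard `statistics` module, where `stdev` divides by N – 1 and `pstdev` divides by N. The scores below are hypothetical.

```python
import statistics

# Hypothetical raw scores from a single small study.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

# Population estimate (N - 1 in the denominator), as in Equation 7.9.
sample_sd = statistics.stdev(scores)

# Description of the sample's own variability (N in the denominator).
descriptive_sd = statistics.pstdev(scores)

print(round(sample_sd, 3), round(descriptive_sd, 3))  # 2.138 2.0
```

The two values converge as N grows, so the choice matters most for the small samples in which primary studies are least likely to state which version they report.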
To meta-analytically combine or compare standard deviations (or variances) across studies, you must also compute the standard error used for
weighting (see Chapter 8). The standard error of the standard deviation is a
function of the standard deviation itself and the sample size (Pigott & Wu,
2008):
Equation 7.10: Standard error of the standard deviation ($SE_s$)
$$SE_s = \frac{s}{\sqrt{2N}}$$
• s is the (population estimate of the) standard deviation.
• N is the sample size.
The standard error of a variance estimate, as you might expect, is simply
Equation 7.10 squared (i.e., $SE_{s^2} = s^2 / (2N)$).
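A minimal sketch of Equation 7.10 for the standard deviation, using a hypothetical function name and hypothetical values (s = 2.0, N = 50):

```python
import math


def standard_error_of_sd(sd, n):
    """Standard error of a standard deviation: s / sqrt(2N) (Equation 7.10)."""
    if n <= 0:
        raise ValueError("Sample size must be positive.")
    if sd < 0:
        raise ValueError("Standard deviation cannot be negative.")
    return sd / math.sqrt(2 * n)


# Hypothetical study: standard deviation of 2.0 with 50 participants.
se_sd = standard_error_of_sd(2.0, 50)
print(round(se_sd, 3))  # 0.2
```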
At this point, you may have concluded that meta-analysis of standard
deviations (and therefore variances) is straightforward. To a large extent this
is true, though three qualifiers should be noted. First, as with the mean, it is
necessary that the studies included all use the same measure, or at least measures that can be placed on the same scale. Just as it would make little sense
to combine means from studies using incomparable scales, it does not make sense
to combine magnitudes of individual difference (i.e., standard deviations)
from incomparable scales. Second, standard deviations are not exactly normally distributed, especially with small samples. Following the suggestion of
Pigott and Wu (2008), I suggest that you do not attempt to meta-analyze standard deviations if many studies have sample sizes less than 25. A third consideration involves the possibility of diminished standard deviations due to
ceiling or floor effects. Ceiling effects occur when most individuals in a study
score near the top of the scale, and floor effects occur when most individuals
score near the bottom of the scale. In both situations, estimates of standard
deviation are lowered because there is less “room” for individuals to vary
given the constraints of the scale. For example, if we administered a third-grade math test to graduate students, we would expect that most of them
would score near the maximum of the test, and the real individual variability
in math skills would not be captured by the observed variability in scores on
this test. I suggest two strategies for avoiding this potential biasing effect: (1)
2