3 Practical Matters: When (and How) to Correct: Conceptual, Methodological, and Disciplinary Considerations


Corrections to Effect Sizes


section, you might reasonably choose to correct only for those that seem

most pressing within the primary studies being synthesized.

How pressing a particular type of artifact is within a meta-­analysis is

partly a conceptual question and partly an empirical question. First, you

must consider the collection of primary studies in light of your conceptual

expertise of the area. Relevant questions include the following: How valid

are the measures within this research in relation to the construct I am interested in? How representative are the samples relative to the population about

which I want to draw conclusions? Again, there is not a statistical answer

to such questions; rather, these questions must be answered based on your

understanding of the field.

In addition to conceptual considerations, you might also base conclusions on empirical grounds. Specifically, you can consider the data reported

in primary studies to draw conclusions about the presence of important artifacts. For example, I recommend coding the internal consistencies of relevant

measures within the primary studies, meta-­analyzing these reliabilities (see

Chapter 7), and determining (1) whether the collection of studies has generally high or low reliabilities of measures and (2) whether substantial variability exists across studies in these reliabilities. Similarly, if many studies use

similar measures of a variable (i.e., with the same scale), then you could code

and evaluate standard deviations across studies (see Chapter 7) to determine

whether some studies suffer from restricted ranges. In short, for each of the

potential artifacts described in the previous section, you should consider the

available empirical evidence to determine whether this artifact is uniformly

or inconsistently present in the primary studies being analyzed. If a particular artifact is uniformly present, then correcting for it will yield more accurate

overall effect size estimates (among latent constructs). If a particular artifact

is present in some studies but not in others (or present in differing degrees

across studies), then correcting for this artifact will reduce less interesting

(i.e., artifactual) variability across studies and allow for a clearer picture of

substantively interesting variability in effect sizes.
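As a first look at the empirical evidence described above, the coded reliabilities can be summarized descriptively. This is only a minimal sketch: it uses simple unweighted statistics rather than the meta-analysis of reliabilities described in Chapter 7, and the variable name and values are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical internal consistencies coded from five primary studies
coded_alphas = [0.72, 0.85, 0.64, 0.91, 0.78]

# (1) Are reliabilities generally high or low?
print(f"mean reliability: {mean(coded_alphas):.2f}")

# (2) Is there substantial variability across studies?
print(f"SD across studies: {stdev(coded_alphas):.2f}")
print(f"range: {min(coded_alphas):.2f} to {max(coded_alphas):.2f}")
```

The same kind of summary could be applied to coded standard deviations when looking for studies with restricted ranges.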

6.3.2 Disciplinary Considerations

Whereas I view the conceptual and empirical considerations as most important in deciding whether and how to correct for artifacts, the reality is that

these corrections are more common in some fields than in others. This means

that one meta-­analyst working within one field might be expected to correct

for certain artifacts, whereas another meta-­analyst working within another

field might be met with skepticism if certain (or any) corrections were to be

performed. These disciplinary practices are unfortunate, especially because

they more often reflect who is influential in a field than a careful

consideration of that field's particular needs. Nevertheless, it is useful to recognize the common practices within your particular field.

Notwithstanding recognition of these disciplinary practices, I want to

encourage you to not feel restricted by these practices. In other words, do

not base your decision to perform or not perform certain artifact corrections

only on common practices within your field. Instead, carefully consider the

conceptual and empirical basis for making certain corrections, and then use

(or not) these corrections to obtain results that best answer your research questions.


6.4 Summary

In this chapter I have described rationales for and against corrections of

study artifacts, imperfections of primary studies that bias (typically attenuate) effect size estimates. I described methods of correcting for several types

of artifacts: unreliability of measures, artificial dichotomization of continuous variables, range restriction, poor validity of measures, and covariation

due to a third variable. Despite disciplinary differences in practices of artifact

correction, I argue that the decision to correct or not to correct for certain

artifacts should be based on conceptual and empirical grounds.

6.5 Recommended Readings

Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-­analysis: Correcting error and bias

in research findings (2nd ed.). Thousand Oaks, CA: Sage.—This book provides a

complete description of meta-­analysis emphasizing the artifact corrections described

in this chapter. The authors have been the most active advocates for artifact correction

in the field of meta-­analysis.

Schmidt, F. L., Le, H., & Oh, I.-S. (2009). Correcting for the distorting effects of study artifacts

in meta-­analysis. In H. Cooper, L. V. Hedges, & J. C. Valentine (Eds.), The handbook

of research synthesis and meta-analysis (2nd ed., pp. 317–333). New York: Russell

Sage Foundation.—This chapter represents a more concise overview of the practice of

artifact correction in meta-­analysis.




1. By describing artifact corrections of effect sizes of individual studies, I am

implicitly prescribing one of two possible methods of meta-­analysis with artifact

correction. Specifically, I am recommending that you correct the effect sizes of

each individual study and use these corrected effect sizes in subsequent meta-analytic computations (described in Chapters 8–12). This approach is described

in Hunter and Schmidt (2004, Ch. 3). My selection of this approach makes my

subsequent description of combining and comparing effect sizes across studies

more straightforward. However, it also requires that most studies provide sufficient information to make corrections (e.g., report internal consistency to correct

for unreliability), and it may be necessary to substitute estimates of these corrections for studies that do not provide sufficient information (e.g., meta-­analytically

compute a mean reliability that is used for studies that do not report internal consistency). An alternative approach is to meta-­analytically compute a distribution

of uncorrected effect sizes across studies and distributions of corrections across

studies. These techniques are more complex, yet may be useful when primary

studies are inconsistent in reporting information needed to correct for artifacts.

These techniques are described in Hunter and Schmidt (2004, Ch. 4).

2. An important caveat of this use of multiplicative combination of artifacts is that

the artifacts are assumed to be independent of one another. Violations of this

assumption can lead to inaccurate corrected effect sizes, including out-of-­bounds

effect sizes (e.g., r greater than 1.0).

3. I have arranged these reasons in what I consider the most to least justifiable.

Not correcting for unreliability of one variable is acceptable if a convincing case

can be made that it is measured with high reliability. Not correcting for unreliability

of one variable because primary studies do not report this reliability is a weaker

justification, though it is a reality you may have to deal with in some situations.

It is likely that some studies in a meta-­analysis will report reliability estimates,

whereas others will not. In these cases it is preferable for you to seek reliability

information from primary study authors. If it is still not possible to obtain reliability estimates for some studies in the meta-­analysis, I recommend performing

a meta-­analysis of reliabilities among studies in the meta-­analysis (see Chapter 7)

and using either the mean reliability or an estimated reliability predicted by other

study features. The final reason listed, not correcting for unreliability of one variable because you are not interested in the variable, is not acceptable. Expressing

an interest in X but not Y ignores the fact that the association between these variables necessarily depends on the measurement properties (including reliability)

of both variables, so unreliability in Y is going to adversely affect the association

involving X, which you are interested in.

4. Latent correlations can also be found within structural equation models, or

latent variable models that include directional (regression) paths. However, the



meta-­analyst needs to be careful when determining latent correlations from such

models. Although nondirectional (i.e., bivariate) associations between exogenous

(predictor) variables can be interpreted as latent correlations, nondirectional

associations between endogenous variables (predicted variables) and directional

associations cannot be interpreted as latent correlations. In these instances, the

meta-­analyst needs to derive the latent correlations through tracing rules, as

described by Kline (2005) and Maruyama (1998).

5. When discussing range restriction, I focus on the use of r as the index of effect

size. This is the most common situation, as range restriction is relevant only to

continuous variables and is most often encountered in naturalistic studies. However, it is also possible to correct for range restriction of the continuous variable

when considering standardized mean differences (e.g., g). For details regarding

these corrections, see Hunter and Schmidt (2004) or Lipsey and Wilson (2001).
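Notes 2 and 3 can be illustrated together with a minimal sketch. It assumes the standard correction for unreliability, in which the observed correlation is divided by the square root of the product of the two reliabilities. For studies that report no reliability it substitutes a simple unweighted mean of the reported reliabilities, which is only a stand-in for the meta-analytic mean described in Chapter 7. It then flags corrected correlations that fall outside the possible range. The function names and all data values are hypothetical.

```python
from math import sqrt
from statistics import mean


def fill_missing_reliabilities(reliabilities):
    """Replace None with the unweighted mean of the reported reliabilities."""
    reported = [r for r in reliabilities if r is not None]
    fallback = mean(reported)
    return [fallback if r is None else r for r in reliabilities]


def correct_for_unreliability(r_observed, rel_x, rel_y):
    """Disattenuate a correlation; the two artifacts are assumed independent."""
    artifact_multiplier = sqrt(rel_x) * sqrt(rel_y)
    return r_observed / artifact_multiplier


# Hypothetical coded studies; None marks a reliability that was not reported
observed_r = [0.30, 0.45, 0.62]
rel_x = fill_missing_reliabilities([0.80, None, 0.60])
rel_y = fill_missing_reliabilities([0.90, 0.70, 0.50])

for r, rxx, ryy in zip(observed_r, rel_x, rel_y):
    corrected = correct_for_unreliability(r, rxx, ryy)
    flag = "out of bounds" if abs(corrected) > 1.0 else "ok"
    print(f"observed {r:.2f} -> corrected {corrected:.3f} ({flag})")
```

In this made-up data the third study combines a fairly large observed correlation with two low reliabilities, so its corrected value exceeds 1.0, the kind of result that note 2 warns about.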


Advanced and Unique

Effect Size Computation

Although the three effect sizes (r, g, or other standardized mean differences,

and o) described in Chapter 5 are most commonly used, you are not restricted

to these indices of two-­variable associations in your meta-­analysis. Instead,

you should consider the broad range of potential effect sizes as answers to the

research questions relevant to your review. In this chapter, I describe some less

commonly used effect sizes that are useful for meta-­analysis of single variables

(i.e., means, proportions, and variances or standard deviations), effect sizes

that retain the meaningful metric of the variables involved (i.e., unstandardized

effect sizes), effect sizes from multivariate regression analyses, and a variety

of different effect sizes that have received less consideration (e.g., scale reliabilities, longitudinal change scores). I then describe some of the challenges

of using less common effect sizes in your meta-­analysis, as well as some of

the opportunities.

7.1 Describing Single Variables

There are relatively few instances of meta-­analyzing single variables, yet this

information could be potentially valuable. At least three types of information

regarding single variables could be important: (1) the mean level of individuals on a continuous variable; (2) the proportions of individuals falling into a

particular category of a categorical variable; and (3) the amount of variability

(variance or standard deviation) in a continuous variable.





7.1.1 Mean Level on Variable

Meta-­analysis of reported means on a single variable may have great value. One

potential is that meta-­analytic combination (see Chapters 8 and 9) allows you to

obtain a more precise estimate of this mean than might be obtained in primary

studies, especially when those primary studies have small sample sizes. Perhaps

more importantly, meta-­analytic comparison (see Chapter 10) allows you to identify potential reasons why means differ across studies (e.g., methodological differences such as condition or reporter; sample characteristics such as age or ethnicity). Thus, the meta-­analysis of means of single variables has considerable value.

At the same time, there is also an important limiting consideration in the

meta-­analysis of means in that the primary studies must typically report this

value in the same metric. For example, if one study measures the variable of

interest on a 0–4 scale, whereas another uses a 1–100 scale, it usually does not

make sense to combine or compare means across these studies (see note 1). Some exceptions can be considered, however. The first exception arises when the different scales

are due to the primary study authors scoring comparable measures in different ways; it is then usually possible to transform one of the scales to the metric

of the other. For example, if two primary studies both use a 6-item scale with

items having values from 1 to 5, one study may form a composite by averaging

the items, whereas the other forms a composite by summing the items. In this

case, it would be possible to transform one of the two means to the same scale

of the other (i.e., multiplying the average by 6 to obtain the sum, or dividing

the sum by 6 to obtain the average), and the means of the two studies could

then be combined and compared. A second, more general exception is that it

is usually possible to transform scores from studies using different scales into a

common metric. From the example I provided of one study using a 0–4 scale

and the other using a 1–100 scale, it is possible to transform a mean on one

scale to an equivalent mean on the other using the following equation:

Equation 7.1: Transforming scores between two different scales

$$ X_2 = (X_1 - Min_1) \left( \frac{Max_2 - Min_2}{Max_1 - Min_1} \right) + Min_2 $$


• X2 is the equivalent score on the second scale.

• X1 is the score on the first scale that you wish to transform.

• Min1 is the lowest possible score on the first scale.

• Max1 is the highest possible score on the first scale.

• Min2 is the lowest possible score on the second scale.

• Max2 is the highest possible score on the second scale.
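The transformation in Equation 7.1 can be sketched as a small function. This is an illustrative sketch rather than code from any package; the function name and the example values are hypothetical.

```python
def rescale_score(x1, min1, max1, min2, max2):
    """Map a score from scale 1 onto scale 2 (linear transformation of Equation 7.1)."""
    return (x1 - min1) * ((max2 - min2) / (max1 - min1)) + min2


# A mean of 2.0 on a 0-4 scale expressed on a 1-100 scale
print(rescale_score(2.0, 0, 4, 1, 100))  # 50.5

# An item average of 3.5 (items scored 1-5) expressed as a 6-item sum (range 6-30)
print(rescale_score(3.5, 1, 5, 6, 30))  # 21.0
```

The second call reproduces the simpler case described earlier: converting an average of six items into their sum is the same as multiplying the average by 6.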



A caution in using different scales is that even if both studies use a common range of scores (e.g., 0–4), it is probably only meaningful to combine and

compare means if the studies used the same anchor points (e.g., if one used

response options of never, rarely, sometimes, often, and always, whereas the

other used 0 times, once, 2–3 times, 4–6 times, and 7 or more times, it would

make little sense to combine or compare these studies). This may prove an

especially difficult obstacle if you are attempting to combine multiple scales

in which scores from one scale are transformed to scores of another using

Equation 7.1. This requirement of primary studies reporting the variable on

the same—or at least a comparable—­metric means that you will often include

only studies using the same measure (e.g., a particular measure of depression, such as the Children’s Depression Inventory; Kovacs, 1992) or else very

similar measures (e.g., child- and teacher-­reported aggression using parallel

items and response options). I suspect that this rather restrictive requirement

is the primary reason why meta-­analysis of means is not more common. If

you are using different but similar measures, or transformations to place values of different measures on a common scale, I highly recommend evaluating

the measure as a moderator (see Chapter 9).

If you do have a situation in which the combination or comparison

of means is feasible, computing this effect size (and its standard error) is

straightforward. The equation for computing a mean is well known, but I

reproduce it here:

Equation 7.2: Computing the mean (\bar{X}) from raw data

$$ \bar{X} = \frac{\sum x_i}{N} $$

• x_i is the score of individual i.

• N is the sample size.

However, it is typically not necessary (or possible) for you to compute

this mean, as this is usually reported within the primary study. Therefore,

coding the mean, which is an effect size (of the central tendency of a single

variable), is usually straightforward.

Occasionally, however, the primary studies will report frequency tables

rather than means for variables with a small number of potential options. For

example, a primary study might report the number or proportion of individuals scoring 0, the number or proportion scoring 1, and so on, on a measure

that has possible options of 0, 1, 2, 3, and 4. Here, you can use these frequencies

of different scores to re-create the raw data and then compute the mean

from these data (using Equation 7.2). An easier way to compute this mean is

using the following equivalent formula provided by Lipsey and Wilson (2001,

p. 176), summing over all potential values of a variable:

Equation 7.3: Computing the mean (\bar{X}) from frequency data

$$ \bar{X} = \frac{\sum x f}{\sum f} $$


• x is a potential value of the variable.

• f is the frequency (number, percentage, or proportion) of individuals with the value x.
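As a minimal sketch of Equation 7.3, the mean can be computed as the sum of each value times its frequency, divided by the sum of the frequencies. The function name and the frequency table below are hypothetical.

```python
def mean_from_frequencies(frequencies):
    """Compute a mean from a mapping of score value -> frequency."""
    total = sum(frequencies.values())
    return sum(x * f for x, f in frequencies.items()) / total


# Hypothetical frequency table for a measure scored 0-4 (counts of individuals)
counts = {0: 10, 1: 20, 2: 40, 3: 20, 4: 10}
print(mean_from_frequencies(counts))

# The same table expressed as proportions yields the same mean
proportions = {x: f / 100 for x, f in counts.items()}
print(mean_from_frequencies(proportions))
```

Dividing by the sum of the frequencies is what lets the same function accept counts, percentages, or proportions.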

Before ending my discussion of calculating the mean as an effect size,

it is important to consider the standard error of this estimate of the mean

(which is used for weighting in the meta-­analysis; see Chapter 8). To compute

the standard error of a study’s estimate of the mean, you must obtain the

(population estimate of the) standard deviation (s) and sample size (N) from

that study, which are then used in the following equation:

Equation 7.4: Standard error of a mean (SE_{\bar{X}})

$$ SE_{\bar{X}} = \frac{s}{\sqrt{N}} $$


• s is the standard deviation of variable X.

• N is the sample size.

After computing the mean and standard error of the mean for each study,

you can then meta-­analytically combine and compare results across studies

using techniques described later in this book (see Chapters 8–10).
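The coding step for means can be sketched as follows, assuming the usual formula for the standard error of a mean (the standard deviation divided by the square root of the sample size). The function name and the study values are hypothetical.

```python
from math import sqrt


def standard_error_of_mean(sd, n):
    """Standard error of a mean: s / sqrt(N)."""
    return sd / sqrt(n)


# Hypothetical coded studies: (mean, standard deviation, sample size)
studies = [(2.1, 0.9, 36), (1.8, 1.2, 144)]

for study_mean, sd, n in studies:
    se = standard_error_of_mean(sd, n)
    print(f"mean = {study_mean:.2f}, SE = {se:.3f}, N = {n}")
```

The second study has the larger standard deviation but the smaller standard error, because its sample is four times as large.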

7.1.2 Proportion of Individuals in Categories

Whereas the mean is a useful effect size for the typical score (i.e., central

tendency) of a single continuous variable, the proportion is a useful effect

size for a particular category of a categorical variable. For example, we may

be interested in the proportion of children who are aggressive or the proportion of individuals who meet certain criteria for rejected social status, if we



believe the meaningful conceptualization of aggression or rejection is categorical. In these cases, we are interested in the prevalence of an affirmative

instance of a single dichotomous variable (see note 2).

This proportion is often either directly reported in primary studies (as

either a proportion or percentage, which can be divided by 100 to obtain the

proportion), or else can be computed from the reported frequency falling in

this category (k) relative to the total sample size (N):

Equation 7.5: Computing the proportion (p)

$$ p = \frac{k}{N} $$


• k is the number of individuals in the category of interest.

• N is the sample size.

This proportion works well as an effect size in many situations, but is

problematic when proportions are far from 0.50 (see note 3). For this reason, it is useful

to transform proportions (p) into logits (l) prior to meta-­analytic combination or comparison:

Equation 7.6: Computing logits (l) from proportions

$$ l = \ln \left( \frac{p}{1 - p} \right) $$

• p is the proportion of individuals in the category of interest.

This logit has the following standard error dependent on the proportion

(p) and sample size (N) (Lipsey & Wilson, 2001, p. 40):

Equation 7.7: Standard error of a logit (SE_l)

$$ SE_l = \sqrt{\frac{1}{Np} + \frac{1}{N(1 - p)}} $$

• p is the proportion of individuals in the category of interest.

• N is the sample size.



Analyses would then be performed on the logits (l), with weights derived from the standard error (SE_l), as described in Chapters 8 through 10. For reporting, it is

useful to back-transform results in logits (e.g., the mean effect size) to

proportions (p), using the following equation:

Equation 7.8: Transforming logits to proportions

$$ p = \frac{e^l}{e^l + 1} $$

• p is the proportion of individuals in the category of interest.

• l is the logit transformation.
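The sequence for proportions (compute p from k and N, transform to a logit, compute its standard error, and back-transform for reporting) can be sketched as below. This is an illustrative sketch assuming the formulas p = k/N, l = ln(p / (1 - p)), SE_l = sqrt(1/(Np) + 1/(N(1 - p))), and p = e^l / (e^l + 1). The function names and the example counts are hypothetical.

```python
from math import exp, log, sqrt


def proportion_to_logit(p):
    """Logit of a proportion; undefined when p is exactly 0 or 1."""
    return log(p / (1 - p))


def logit_standard_error(p, n):
    """Standard error of a logit from the proportion and sample size."""
    return sqrt(1 / (n * p) + 1 / (n * (1 - p)))


def logit_to_proportion(logit):
    """Back-transform a logit to a proportion for reporting."""
    return exp(logit) / (exp(logit) + 1)


# Hypothetical study: 15 of 100 children classified as aggressive
k, n = 15, 100
p = k / n
logit = proportion_to_logit(p)
se = logit_standard_error(p, n)

print(f"p = {p:.2f}, logit = {logit:.3f}, SE = {se:.3f}")
print(f"back-transformed p = {logit_to_proportion(logit):.2f}")
```

A study reporting a proportion of exactly 0 or 1 would need separate handling, since the logit is undefined there; this sketch does not attempt that.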

7.1.3 Variances and Standard Deviations

Few meta-­analyses have used variances, or the equivalent standard deviation

(the square root of the variance), as effect sizes. However, the magnitude of

interindividual difference is a potentially interesting focus, so I offer this

brief description of using these as effect sizes for meta-­analysis.

The standard deviation, which is the square root of the variance, is calculated from raw data as follows:

Equation 7.9: Computing the standard deviation (s) or variance (s^2) from raw data

$$ s_X = \sqrt{s_X^2} = \sqrt{\frac{\sum (X_i - \bar{X})^2}{N - 1}} $$

• X_i is the score of individual i.

• \bar{X} is the average of X across individuals.

• N is the sample size.

This equation gives the usual estimate of the population standard deviation

(the square root of the unbiased variance estimate) from a sample (versus a description of the

sample variability, which would be computed using N rather than N – 1 in

the denominator). This is also the statistic commonly reported in primary

research. In fact, you will almost never need to calculate this standard deviation, as doing so requires raw data that are typically not available. Fortunately,

standard deviations (or variances) are nearly always reported as basic

descriptive information in primary studies (see note 4).

To meta-­analytically combine or compare standard deviations (or variances) across studies, you must also compute the standard error used for

weighting (see Chapter 8). The standard error of the standard deviation is a

function of the standard deviation itself and the sample size (Pigott & Wu, 2008):

Equation 7.10: Standard error of the standard deviation (SE_s)

$$ SE_s = \frac{s}{\sqrt{2N}} $$



• s is the (population estimate of the) standard deviation.

• N is the sample size.

The standard error of a variance estimate follows from Equation 7.10. Because the

variance is the square of the standard deviation, $SE_{s^2} \approx 2s \cdot SE_s = s^2 \sqrt{2/N}$.
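Coding a standard deviation as an effect size can be sketched as follows. The sketch assumes the large-sample formula for Equation 7.10, in which the standard error of a standard deviation is s divided by the square root of 2N. The function name and the study values are hypothetical.

```python
from math import sqrt


def standard_error_of_sd(sd, n):
    """Large-sample standard error of a standard deviation: s / sqrt(2N)."""
    return sd / sqrt(2 * n)


# Hypothetical coded studies: (standard deviation, sample size)
studies = [(4.0, 50), (5.5, 200)]

for sd, n in studies:
    se = standard_error_of_sd(sd, n)
    print(f"s = {sd:.2f}, N = {n}, SE_s = {se:.3f}")
```

Only the reported standard deviation and sample size are needed, so this effect size can usually be coded directly from a study's descriptive statistics.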

At this point, you may have concluded that meta-­analysis of standard

deviations (and therefore variances) is straightforward. To a large extent this

is true, though three qualifiers should be noted. First, as with the mean, it is

necessary that the studies included all use the same measure, or at least measures that can be placed on the same scale. Just as it would make little sense

to combine means from studies using incomparable scales, it does not make sense

to combine magnitudes of individual difference (i.e., standard deviations)

from incomparable scales. Second, standard deviations are not exactly normally distributed, especially with small samples. Following the suggestion of

Pigott and Wu (2008), I suggest that you do not attempt to meta-­analyze standard deviations if many studies have sample sizes less than 25. A third consideration involves the possibility of diminished standard deviations due to

ceiling or floor effects. Ceiling effects occur when most individuals in a study

score near the top of the scale, and floor effects occur when most individuals

score near the bottom of the scale. In both situations, estimates of standard

deviation are lowered because there is less “room” for individuals to vary

given the constraints of the scale. For example, if we administered a third-grade math test to graduate students, we would expect that most of them

would score near the maximum of the test, and the real individual variability

in math skills would not be captured by the observed variability in scores on

this test. I suggest two strategies for avoiding this potential biasing effect: (1)

