4.4 Practical Matters: Creating an Organized Protocol for Coding
CODING INDIVIDUAL STUDIES
to create a usable database for later meta-analyses. There are several considerations for each aspect, which I describe next.
Considering first the interface coders use to record information, three
options include using paper forms that coders complete, using a computerized form to collect this information, or coding directly into the electronic
format to be used for analyses. Part of an example paper coding form (from
a meta-analysis of the association between relational aggression and peer
rejection described throughout this book; Card et al., 2008) is shown in Figure 4.2.
Using paper forms would require coders to write information into predefined questions (e.g., “Sample age in years: ”), which would then be
transferred into an electronic database for analyses. The advantages of this
approach are (1) that coders need training only in the coding process (guided by the manual of instructions described in Section 4.4.2) rather than in procedures for entering data into a computer, and (2) that the information is checked for plausibility when entered into the computer.
A computerized form would present the same information to coders but
would require them to input the coded data electronically, perhaps using a
relational database program (e.g., Microsoft Access). This type of interface
would require only a small amount of training beyond using paper forms
and would reduce the time (and potentially errors) in transferring information from paper to the electronic format. However, this advantage is also a
disadvantage in that it bypasses the check that would occur during this entry
from paper forms.
A third option with regard to a coding interface is to code information
directly into an electronic format (e.g., Microsoft Excel, SAS, SPSS) later used
for analysis. This option is perhaps the most time-efficient of all in reducing
the number of steps, but it is also the most prone to errors. I strongly discourage this third method if multiple coders will be coding studies.
4.4.2 Coding Manual
A coding manual is a detailed collection of instructions describing how
information reported in research reports is quantified for inclusion in your
meta-analysis. Creating a detailed coding manual serves three primary purposes. First, this coding manual provides a guide for coders to transfer information in the study reports to the coding interface (e.g., paper forms). As such, it should be a clear set of instructions for coding both "typical" studies and more challenging coding situations. Second (and relatedly), this coding manual aims to ensure consistency across multiple coders9 by providing a clear, concrete set of instructions that each coder should study and have at hand during the coding process. Third, this coding manual serves as documentation of the coding process that should guide the presentation of the meta-analysis or be made available to others to ensure transparency of the coding (see beginning of this section).

Coding Study Characteristics
Date coded: 4/25
Date entered into database: 5/10
1. Study #: 104
2. Study authors: Crick & Grotpeter
3. Year: 1995
4. Sample size (N): 491
5. Sample age (years): Not reported
6. Sample grade(s): 3–6 (128 3rd grade, 126 4th grade, 126 5th grade, 111 6th grade)
7. Proportion male: .52
8. Proportion ethnic minority: .40
9. Unique characteristics of sample: public school sample
10. Aggression—source of information: peer nomination
11. Aggression—name of scale: author created
12. Rejection—source of information: peer nomination
13. Rejection—name of scale: Classified by Coie et al. criteria
. . .
FIGURE 4.2. Part of an example study coding form. This example shows part of a coding form for a meta-analysis of associations between relational aggression and peer rejection (see Card et al., 2008). This coding form, used in conjunction with a detailed coding manual, requires coders to record information from studies that is later entered into a computerized database.
With regard to the coding manual, the amount of instruction for each
study characteristic coded depends on the level of inference of the coding:
low-inference coding requires relatively little instruction, whereas high-inference coding requires more instruction. In addition, the coding manual
is most often a work in progress. Although an initial coding manual should
be developed before beginning the coding, ambiguities discovered during the
coding process likely will force ongoing revision.
Turning again to the example coding form of Figure 4.2, we should note
that this form would be accompanied by a detailed coding manual that all
coders have been trained in and have present while completing this form.
To provide illustrations of the type of information that might be included in
such a manual, we can consider two of the coded study characteristics. First,
item 5 (mean age) might be accompanied by the rather simple instruction
“Record the mean age of the sample in years.” However, even this relatively
simple (low-inference) code requires fairly extensive elaboration: “If study
analyzed a subset of the data, record the mean age of the subset used in analyses. If study reported a range of ages but not the mean, record the midpoint
of this range.” My colleagues and I also had to change the coding protocol
rather substantially when we found that many studies failed to report ages,
but did report the grades in school of participants. This led us to add the
“grade” code (item 6) along with instructions for entering this information
in the database: “If sample age is not reported in the study, then an estimated
age can be entered from grade using the formula Age = Grade + 5.”10 A second
study characteristic shown in Figure 4.2 that illustrates typical instruction
is item 10 (aggression—source of information). The coding manual for this
item specifies the choices that should be coded (self-report, peer nomination,
peer rating, teacher report, parent report, researcher observations, or other)
and definitions of each code.
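As a sketch, the manual's instructions for item 5 could be mechanized as a small helper. The function name and interface here are hypothetical (they do not appear in the text); the rules are exactly those quoted above: use the reported mean age, else the midpoint of a reported range, else estimate from grade as Grade + 5.

```python
def code_age(mean_age=None, age_range=None, grade=None):
    """Return the age value to enter, following the (quoted) manual rules.

    Hypothetical helper for illustration: use the reported mean age;
    otherwise the midpoint of a reported age range; otherwise estimate
    age from school grade as Grade + 5.
    """
    if mean_age is not None:
        return mean_age
    if age_range is not None:            # e.g., (8, 12)
        low, high = age_range
        return (low + high) / 2          # midpoint of the reported range
    if grade is not None:
        return grade + 5                 # estimated age from school grade
    return None                          # age not codable from this report
```

For example, a study reporting only "4th graders" would be coded as `code_age(grade=4)`, i.e., an estimated age of 9.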
4.4.3 Database for Meta-Analysis
The product of your coding should be an electronic file with which to conduct your meta-analysis. Table 4.2 provides an example of what this database
might look like (if complete, the table would extend far to the right to include
other coded study characteristics, coded effect sizes [Chapter 5], information
for any artifact corrections [Chapter 6], and several calculations for the actual
meta-analysis [Chapters 8–10]). Although the exact variables (columns) you
include will depend on the study characteristics you decide to code, the general layout of this file should be considered. Here, each row represents a single
coded study, and each column represents a coded study characteristic.
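To make that layout concrete, here is a minimal sketch of such a database written as a CSV file. The column names are illustrative only, loosely mirroring the form in Figure 4.2; they are not a prescribed schema, and a real file would extend to effect sizes and meta-analytic calculations as noted above.

```python
import csv
import io

# One row per coded study, one column per coded study characteristic.
# Column names are illustrative, mirroring part of Figure 4.2.
fieldnames = ["study_id", "authors", "year", "N", "age", "prop_male"]
studies = [
    {"study_id": 104, "authors": "Crick & Grotpeter", "year": 1995,
     "N": 491, "age": 9.45, "prop_male": .52},
]

buf = io.StringIO()                     # stands in for a file on disk
writer = csv.DictWriter(buf, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(studies)
```

A flat file like this can then be read directly by whatever analysis program you use for the meta-analysis.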
TABLE 4.2. Partial Example of Database of Coded Study Characteristics
Studies (one row each): Crick & Grotpeter; Crick et al.; Hawley et al.; Murray-Close & Crick; Nelson et al.; Ostrov & Crick; Ostrov et al.; Pakaslahti & Keltikangas-Järvinen; Phillipsen et al.; Rys & Bear; Salmivalli et al.; Tomada & Schneider; Werner & Crick; Zalecki & Hinshaw . . .
[The table's columns of coded study characteristics are not recoverable here; surviving fragments of the "unique characteristics of sample" column include "Summer camp . . . ," "Included only . . . ," "Headstart . . . ," and "German; included . . . ."]
aValues of age that are bold and italicized were estimated from sample grade.
bArticle was under review during the preparation of this meta-analytic review. It has subsequently been published as Ostrov (2008).
In this chapter, I have described the process of coding study characteristics.
This process spans from the initial planning stages, in which you consider
the characteristics that are most informative to your research questions, to
the coding itself, in which you strive to extract and quantify information
from the study reports. Potentially interesting study characteristics to code
include features of the sample, measurement, design, and the source itself.
Study quality is another important consideration, though I recommend coding for specific aspects of quality rather than some single dimension. It is
important that your coding process is transparent and replicable; the process
should also be reliable across coders or within the same coder, and I have
described methods of evaluating this reliability. Finally, after you decide which study characteristics to code, a well-organized coding protocol will guide the coding process.
Orwin, R. G., & Vevea, J. L. (2009). Evaluating coding decisions. In H. Cooper, L. V.
Hedges, & J. C. Valentine (Eds.), The handbook of research synthesis and metaanalysis (2nd ed., pp. 177–203). New York: Russell Sage Foundation.—This chapter
provides a thorough description of the sources of coding errors, ways to assess coder
reliability, and uses of this information in analyses.
Valentine, J. C. (2009). Judging the quality of primary research. In H. Cooper, L. V. Hedges,
& J. C. Valentine (Eds.), The handbook of research synthesis and meta-analysis (2nd
ed., pp. 129–146). New York: Russell Sage Foundation.—This chapter describes the aspects of studies that collectively constitute "study quality," as well as the relative advantages of excluding poor-quality studies versus assessing coded quality features.
Wilson, D. B. (2009). Systematic coding. In H. Cooper, L. V. Hedges, & J. C. Valentine
(Eds.), The handbook of research synthesis and meta-analysis (2nd ed., pp. 159–176).
New York: Russell Sage Foundation.—This chapter provides thorough guidance in
planning a coding strategy that is explicit and transparent.
1. The year of publication is a crude proxy for the year the study was conducted, as it
does not account for likely inconsistencies across studies in the lag between data
collection and publication. However, year of publication is almost always avail-
Coding Study Characteristics
able, whereas the year of data collection is often not reported. Closer approximations of the year that data were collected might come from coding the dates
the report was submitted for publication (which is reported in some journals),
though this date will not reflect previous submissions of the work elsewhere or
the variability in lag between data collection and submission. If accurately coding the year of data collection is critical in your meta-analysis, the best approach
is to follow two steps. First, code the year of publication and the year of data collection for all studies reporting this information, contacting study authors who
do not report year of data collection for this information. Second, based on the
likely complete data for year of publication and the partially complete information for year of data collection that you are able to obtain, impute the missing values of year of data collection (see, e.g., Schafer & Graham, 2002 for a description
of imputation approaches). If your review includes various formats and methods
of coding year (e.g., year of conference presentation, year of defense), it will be
useful to include the format as a predictor in the imputation model.
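As an illustration of the second step, here is a deliberately simplified single-imputation sketch: a least-squares fit of collection year on publication year among studies reporting both, used to fill in the missing values. This is far simpler than the multiple-imputation approaches cited (Schafer & Graham, 2002), and the function is hypothetical, not part of the text's procedure.

```python
def impute_collection_year(pub_years, coll_years):
    """Fill None entries in coll_years via a least-squares fit on
    the complete cases (a simple single-imputation sketch only)."""
    pairs = [(p, c) for p, c in zip(pub_years, coll_years) if c is not None]
    n = len(pairs)
    mean_p = sum(p for p, _ in pairs) / n
    mean_c = sum(c for _, c in pairs) / n
    sxx = sum((p - mean_p) ** 2 for p, _ in pairs)
    sxy = sum((p - mean_p) * (c - mean_c) for p, c in pairs)
    slope = sxy / sxx if sxx else 0.0            # typical publication lag
    intercept = mean_c - slope * mean_p
    return [c if c is not None else round(intercept + slope * p)
            for p, c in zip(pub_years, coll_years)]
```

For instance, if every complete case shows a two-year lag, a study published in 2004 with an unreported collection year would be imputed as 2002.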
2. A third recommendation is to give greater weight to studies of higher quality than to those of lower quality. This recommendation is problematic in my view because there is no
singularly defensible magnitude for these weights. For instance, if the quality of
studies is rated on a 3-point scale (1 = low quality, 2 = medium quality, 3 = high
quality) and these ratings are used as weights, then this weighting would assume
that high-quality studies deserve three times the weight as low-quality studies;
but this choice is as arbitrary as weighing them twice or four times as heavily.
Furthermore, these weights would need to be multiplied by the weights due to
the standard errors of effect sizes from studies (i.e., the inverse of these standard errors squared; see Chapter 8), but this would make it impossible to draw
statistically defensible (1) inferences about the mean effect size of your meta-analysis or (2) conclusions about the heterogeneity of effect sizes (see Chapter 8).
In short, any weighting by study quality is arbitrary, and I strongly recommend
against this practice.
3. Or, alternatively, are important enough to serve as inclusion/exclusion criteria. As with other study features, the decision to exclude studies with certain
problems of quality, or to code these qualities and evaluate them as moderators
of effect sizes, depends on your interest in empirically evaluating the impact
of study quality, your desire to draw conclusions about a homogeneous versus
heterogeneous population of studies, and the number of studies that would be
included in your meta-analysis (see Chapter 3).
4. I do not describe the fourth broad type of validity, statistical conclusion validity,
for two reasons. First, primary studies typically do not provide sufficient information regarding threats to this aspect of validity. Second, even if it were possible to code, the associations of these threats with effect sizes are likely small
and of little interest. One exception to these statements is the problem of artificial dichotomization of continuous variables, an unfortunately common practice that substantially impacts statistical conclusion validity (see, e.g., Hunter &
Schmidt, 1990; MacCallum, Zhang, Preacher, & Rucker, 2002). However, it is
better to correct for (see Chapter 6), rather than code, this artificial dichotomization.
5. Methods of correcting effect sizes that are biased by range restriction (or range
enhancement) are described in Chapter 6.
6. Practically, it will not always be reasonable to report the many nuanced decisions for some study characteristics, owing to page limits or limits in the likely
audience’s interest in these minutiae. In these situations, it is reasonable to favor brevity or readability at the expense of some transparency. However, I would recommend creating complete documentation of these coding rules and study-by-study decisions that you can make available to interested readers.
7. This recommendation is another reason for coding aspects of study features
(lower inference codes) rather than an overall study quality (a high-inference
code), as I described in Section 4.2.
8. An additional approach is to quantify reliability with the intraclass correlation.
This approach has certain advantages, including the ability to model between-rater variance and more realistic modeling of agreement across three or more
coders (Orwin & Vevea, 2009). However, computing the intraclass correlation is
more complicated than the three methods described in this chapter, and I believe
that you will find the approaches I describe adequate if your goal is simply to
evaluate and report the agreement of coding. Interested readers can consult
Orwin and Vevea (2009, pp. 190–191) and the references cited in this work.
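For readers who want the simpler agreement indices in code, here is a sketch of two common measures for a pair of coders: percent agreement and Cohen's kappa (agreement corrected for chance). It is an assumption here that these correspond to methods described in the chapter; the category labels below are illustrative, echoing the source-of-information codes in Figure 4.2.

```python
from collections import Counter

def percent_agreement(codes1, codes2):
    """Proportion of studies on which two coders assigned the same code."""
    return sum(a == b for a, b in zip(codes1, codes2)) / len(codes1)

def cohens_kappa(codes1, codes2):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(codes1)
    po = percent_agreement(codes1, codes2)        # observed agreement
    c1, c2 = Counter(codes1), Counter(codes2)
    pe = sum(c1[k] * c2[k] for k in c1) / n ** 2  # chance agreement
    return (po - pe) / (1 - pe)

# Two coders' source-of-information codes for four studies (illustrative).
coder1 = ["peer nomination", "self-report", "peer nomination", "teacher report"]
coder2 = ["peer nomination", "self-report", "teacher report", "teacher report"]
```

Here the coders agree on 3 of 4 studies (percent agreement .75), while kappa (≈ .64) is lower because some of that agreement would be expected by chance.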
9. This coding manual is just as important if you are coding the studies yourself
as it is if you have multiple coders. The coding process will very likely take an
extended period of time, get interrupted by other demands, and so on. In these
situations it is critical that you have a coding manual that can be used to retrain
yourself (i.e., ensure consistency of coding across time), just as it is for training multiple coders.
10. For the particular study (Crick & Grotpeter, 1995) shown in Figure 4.2, in which exact subsample sizes per grade were reported, we estimated age as the weighted average, Age = [128(3+5) + 126(4+5) + 126(5+5) + 111(6+5)]/(128 + 126 + 126 + 111) ≈ 9.45 years.
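The footnote's arithmetic can be checked directly. This brief sketch uses the per-grade subsample sizes reported in Figure 4.2 and the Grade + 5 rule from the coding manual.

```python
# Weighted-average age estimate for Crick & Grotpeter (1995):
# ages imputed from grade as Grade + 5, weighted by subsample sizes.
ns = [128, 126, 126, 111]      # 3rd-6th grade subsample sizes (N = 491)
grades = [3, 4, 5, 6]
age = sum(n * (g + 5) for n, g in zip(ns, grades)) / sum(ns)
# age ≈ 9.45 years
```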
Basic Effect Size Computation
Effect sizes represent the most important information that you will extract from
included studies. As such, carefully computing effect sizes from reported results
is critical. In this chapter, I describe three common indices for representing
effect sizes: r (Pearson correlation coefficient), g (one form of standardized
mean difference), and o (odds ratio). I also describe how you can compute
each from information commonly provided in empirical reports, such as reports
of actual effect sizes, inferential statistics (e.g., t-tests), descriptive data, and
statements of statistical significance. I then demonstrate how you can compare
and transform among these three indices of effect sizes. Finally, I discuss a
practical matter in computing effect sizes: using available effect size calculators within programs for conducting meta-analysis.
5.1 The Common Metrics: Correlation, Standardized Mean Difference, and Odds Ratio
5.1.1 Significance Tests Are Not Effect Sizes
Before describing what effect sizes are, I describe what they are not. Effect
sizes are not significance tests, and significance tests are not effect sizes.
Although you can usually derive effect sizes from the results of significance
tests, and the magnitude of the effect size influences the likelihood of finding statistically significant results (i.e., statistical power), it is important to
distinguish between indices of effect size and statistical significance.
Imagine that a researcher, Dr. A, wishes to investigate whether two
groups (male versus female, two treatment groups, etc.) differ on a particular
variable X. So she collects data from five individuals in each group (N = 10).
She finds that Group 1 members have scores of 4, 4, 3, 2, and 2, for a mean of
3.0 and (population estimated) standard deviation of 1.0, whereas Group 2
members have scores of 6, 6, 5, 4, and 4, for a mean of 5.0 and standard deviation of 1.0. Dr. A performs a t-test and finds that t(8) = 3.16, p = .013. Finding
that Group 2 was significantly higher than Group 1 (according to traditional
criteria of α = .05), she publishes the results.
Further imagine that Dr. B reads this report and is skeptical of the results.
He decides to replicate this study, but collects data from only three individuals in each group (N = 6). He finds that individuals in Group 1 had scores of
4, 3, and 2, for a mean of 3.0 and standard deviation of 1.0, whereas Group 2
members had scores of 6, 5, and 4, for a mean of 5.0 and standard deviation
of 1.0. His comparison of these groups results in t(4) = 2.45, p = .071. Dr. B
concludes that the two groups do not differ significantly and therefore that
the findings of Dr. A have failed to replicate.
Now we have a controversy on our hands. Graduate student C decides
that she will resolve this controversy through a definitive study involving 10
individuals in each group (N = 20). She finds that individuals in Group 1 had
scores of 4, 4, 4, 4, 3, 3, 2, 2, 2, and 2, for a mean of 3.0 and standard deviation of 1.0, whereas individuals in Group 2 had scores of 6, 6, 6, 6, 5, 5, 4, 4,
4, and 4, for a mean of 5.0 and a standard deviation of 1.0. Her inferential test
is highly significant, t(18) = 4.74, p = .00016. She concludes that not only do
the groups differ, but also the difference is more pronounced than previously found.
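The contrast this example builds toward can be sketched numerically: computed from the reported group means and pooled standard deviation of 1.0, the standardized mean difference is identical across the three studies, while t grows with sample size. (Note that the t for the largest study computed from these rounded summaries, about 4.47, differs slightly from the 4.74 in the text, which reflects the listed raw scores rather than a pooled SD of exactly 1.0.)

```python
from math import sqrt

def d_and_t(m1, m2, sd_pooled, n1, n2):
    """Standardized mean difference d, and the t statistic implied by d."""
    d = (m2 - m1) / sd_pooled          # same d regardless of sample size
    t = d / sqrt(1 / n1 + 1 / n2)      # t grows as the groups grow
    return d, t

# The three hypothetical studies share M1 = 3.0, M2 = 5.0, pooled SD = 1.0,
# with 5, 3, and 10 participants per group, respectively.
results = {n: d_and_t(3.0, 5.0, 1.0, n, n) for n in (5, 3, 10)}
```

Each study yields d = 2.0, so on the effect size metric the three studies replicate one another exactly; only the significance tests disagree.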
This example illustrates the limits of relying on the null hypothesis significance testing framework when comparing results across studies. In each
of the three hypothetical studies, individuals in Group 1 had a mean score
of 3.0 and a standard deviation of 1.0, whereas individuals in Group 2 had a
mean score of 5.0 and a standard deviation of 1.0. The hypothetical researchers’ focus on significance tests led them to inappropriate conclusions: Dr. B’s
conclusion of a failure to replicate is inaccurate (because it does not consider
the inadequacy of statistical power in the study), as is Student C’s conclusion
of a more pronounced difference (which mistakenly interprets a low p value
as informing the magnitude of an effect). A focus on effect sizes would have
alleviated the confusion that arose from a reliance only on statistical significance and, in fact, would have shown that these three studies provided perfectly replicating results. Moreover, if the researchers had considered effect
sizes, they could have moved beyond the question of whether the two groups