1.2 TYPE I ERROR, TYPE II ERROR, AND POWER
Recall that the formula for degrees of freedom for the t test is (n1 + n2 − 2); hence,
for this problem df = 28. If we had set α = .05, then reference to Appendix A.2 of this
book shows that the critical values are −2.048 and 2.048. They are called critical values because they are critical to the decision we will make on H0. These critical values
define critical regions in the sampling distribution. If the value of t falls in the critical
region we reject H0; otherwise we fail to reject:
[Figure: distribution of t (under H0) for df = 28]
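The decision rule just described can be sketched in a few lines of Python. This is only an illustration: the function names are ours, and the group sizes of 15 and 15 are hypothetical values chosen so that df = 28, since the text does not give the individual group sizes.

```python
def degrees_of_freedom(n1, n2):
    """Degrees of freedom for the independent-samples t test: n1 + n2 - 2."""
    return n1 + n2 - 2


def decide(t_value, critical_value=2.048):
    """Two-tailed rule: reject H0 when |t| exceeds the critical value.

    The default 2.048 is the critical value quoted in the text for
    df = 28 and alpha = .05.
    """
    if abs(t_value) > critical_value:
        return "reject H0"
    return "fail to reject H0"


# Hypothetical group sizes, chosen only so that df = 28 as in the text.
print(degrees_of_freedom(15, 15))
# Hypothetical t values on either side of the critical value.
print(decide(2.50))
print(decide(-1.10))
```

Any observed t between −2.048 and 2.048 falls outside the critical regions, so the rule returns "fail to reject H0".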
Type I error is equivalent to saying the groups differ when in fact they do not. The α
level set by the investigator is a subjective decision, but is usually set at .05 or .01 by
most researchers. There are situations, however, when it makes sense to use α levels
other than .05 or .01. For example, if making a type I error will not have serious
substantive consequences, or if sample size is small, setting α = .10 or .15 is quite
reasonable. Why this is reasonable for small sample size will be made clear shortly.
On the other hand, suppose we are in a medical situation where the null hypothesis
is equivalent to saying a drug is unsafe, and the alternative is that the drug is safe.
Here, making a type I error could be quite serious, for we would be declaring the
drug safe when it is not safe. This could cause some people to be permanently damaged or perhaps even killed. In this case it would make sense to use a very small α,
Another type of error that can be made in conducting a statistical test is called a type II
error. The type II error rate, denoted by β, is the probability of failing to reject H0 when it is
false. Thus, a type II error, in this case, is saying the groups don’t differ when they do.
Now, not only can either type of error occur, but in addition, they are inversely related
(when other factors, e.g., sample size and effect size, affecting these probabilities are
held constant). Thus, holding these factors constant, as we control type I error more tightly, type
II error increases. This is illustrated here for a two-group problem with 30 participants
per group where the population effect size d (defined later) is .5:
Notice that, with sample and effect size held constant, as we exert more stringent control over α (from .10 to .01), the type II error rate increases fairly sharply (from .37 to
.78). Therefore, the problem for the experimental planner is achieving an appropriate
balance between the two types of errors. While we do not intend to minimize the seriousness of making a type I error, we hope to convince you throughout the course of
this text that more attention should be paid to type II error. Now, the quantity in the
last column of the preceding table (1 − β) is the power of a statistical test, which is the
probability of rejecting the null hypothesis when it is false. Thus, power is the probability of making a correct decision, or of saying the groups differ when in fact they do.
Notice from the table that as the α level decreases, power also decreases (given that
effect and sample size are held constant). The diagram in Figure 1.1 should help to
make clear why this happens.
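The trade-off between α and β can also be seen numerically. The sketch below uses a large-sample normal approximation to the power of the two-tailed, two-group test; this approximation is our assumption, not the method behind the tabled values, so the results will be close to, but not identical with, the β values of .37 and .78 quoted above. The function name is hypothetical.

```python
from statistics import NormalDist


def approx_power(d, n_per_group, alpha):
    """Normal-approximation power for a two-tailed, two-group comparison.

    d is the population effect size in standard deviation units and
    n_per_group is the size of each (equal-sized) group.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)      # two-tailed critical value
    shift = d * (n_per_group / 2) ** 0.5   # expected value of the test statistic
    # Probability of landing in either critical region when H0 is false.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Setting from the text: 30 participants per group, d = .5.
for alpha in (0.10, 0.05, 0.01):
    power = approx_power(0.5, 30, alpha)
    print(f"alpha = {alpha:.2f}   beta = {1 - power:.2f}   power = {power:.2f}")
```

As α is tightened from .10 to .01, the printed β rises and power falls, which is the pattern described in the text.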
The power of a statistical test is dependent on three factors:
1. The α level set by the experimenter
2. Sample size
3. Effect size: how much of a difference the treatments make, or the extent to which
the groups differ in the population on the dependent variable(s).
Figure 1.1 has already demonstrated that power is directly dependent on the α level.
Power is heavily dependent on sample size. Consider a two-tailed test at the .05 level
for the t test for independent samples. An effect size for the t test, as defined by Cohen
(1988), is estimated as $d = (\bar{x}_1 - \bar{x}_2)/s$, where s is the standard deviation. That is,
effect size expresses the difference between the means in standard deviation units.
Thus, if $\bar{x}_1 = 6$, $\bar{x}_2 = 3$, and $s = 6$, then $d = (6 - 3)/6 = .5$, or the means differ by
one-half standard deviation. Suppose for the preceding problem we have an effect size of .5
standard deviations. Holding α (.05) and effect size constant, power increases dramatically as sample size increases (power values from Cohen, 1988):
n (Participants per group)
As the table suggests, given this effect size and α, when sample size is large (say, 100
or more participants per group), power is not an issue. In general, it is an issue when
one is conducting a study where group sizes will be small (n ≤ 20), or when one is
evaluating a completed study that had small group size. Then, it is imperative to be
very sensitive to the possibility of poor power (or conversely, a high type II error rate).
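To see how strongly power depends on group size, the following sketch tabulates approximate power for d = .5 and α = .05 at several group sizes. It relies on a normal approximation (our assumption), so the figures are rough and will differ somewhat from Cohen's tabled values, most noticeably for small groups. The group sizes listed are illustrative choices of ours, and the function name is hypothetical.

```python
from statistics import NormalDist


def power_by_group_size(d, alpha, group_sizes):
    """Return {n: approximate power} for a two-tailed, two-group test."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    table = {}
    for n in group_sizes:
        shift = d * (n / 2) ** 0.5  # grows with the square root of n
        table[n] = z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)
    return table


# Illustrative group sizes (not taken from the text's table).
for n, power in power_by_group_size(0.5, 0.05, [10, 20, 50, 100]).items():
    print(f"n per group = {n:3d}   approximate power = {power:.2f}")
```

Because the shift grows with the square root of n, power climbs quickly at first and then levels off near 1 for large groups.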
Thus, in studies with small group size, it can make sense to test at a more liberal level
(.10 or .15) to improve power, because (as mentioned earlier) power is directly related
to the α level. We explore the power issue in considerably more detail in Chapter 4.
Figure 1.1: Graph of F distribution under H0 and under H0 false showing the direct relationship
between type I error and power. Since type I error is the probability of rejecting H0 when true, it
is the area underneath the F distribution in the critical region for H0 true. Power is the probability of
rejecting H0 when false; therefore it is the area underneath the F distribution in the critical region when
H0 is false.
[Figure 1.1 labels: F (under H0); F (under H0 false); Reject for α = .01; Reject for α = .05; Power at α = .05; Power at α = .01; Type I error; Type I error for .05]
1.3 MULTIPLE STATISTICAL TESTS AND THE PROBABILITY OF SPURIOUS RESULTS
If a researcher sets α = .05 in conducting a single statistical test (say, a t test), then,
if statistical assumptions associated with the procedure are satisfied, the probability
of rejecting falsely (a spurious result) is under control. Now consider a five-group
problem in which the researcher wishes to determine whether the groups differ significantly on some dependent variable. You may recall from a previous statistics course
that a one-way analysis of variance (ANOVA) is appropriate here. But suppose our
researcher is unaware of ANOVA and decides to do 10 t tests, each at the .05 level,
comparing each pair of groups. The probability of a false rejection is no longer under
control for the set of 10 t tests. We define the overall α for a set of tests as the probability of at least one false rejection when the null hypothesis is true. There is an important
inequality called the Bonferroni inequality, which gives an upper bound on overall α:
$\text{Overall } \alpha \le .05 + .05 + \cdots + .05 = .50$
Thus, the probability of a few false rejections here could easily be 30 or 35%, that is,
In general then, if we are testing k hypotheses at the α1, α2, …, αk levels, the Bonferroni
inequality guarantees that
$\text{Overall } \alpha \le \alpha_1 + \alpha_2 + \cdots + \alpha_k$
If the hypotheses are each tested at the same alpha level, say α′, then the Bonferroni
upper bound becomes
$\text{Overall } \alpha \le k\alpha'$
This Bonferroni upper bound is conservative, and how to obtain a sharper (tighter)
upper bound is discussed next.
If the tests are independent, then an exact calculation for overall α is available. First,
(1 − α1) is the probability of no type I error for the first comparison. Similarly, (1 − α2)
is the probability of no type I error for the second, (1 − α3) the probability of no type
I error for the third, and so on. If the tests are independent, then we can multiply probabilities. Therefore, (1 − α1)(1 − α2) ⋯ (1 − αk) is the probability of no type I errors
for all k tests. Thus,
$\text{Overall } \alpha = 1 - (1 - \alpha_1)(1 - \alpha_2) \cdots (1 - \alpha_k)$
is the probability of at least one type I error. If the tests are not independent, then overall α will still be less than given here, although it is very difficult to calculate. If we set
the alpha levels equal, say to α′ for each test, then this expression becomes
$\text{Overall } \alpha = 1 - (1 - \alpha')(1 - \alpha') \cdots (1 - \alpha') = 1 - (1 - \alpha')^k$
[Table: No. of tests versus $1 - (1 - \alpha')^k$ and $k\alpha'$ for α′ = .05, .01, and .001]
This expression, that is, $1 - (1 - \alpha')^k$, is approximately equal to $k\alpha'$ for small α′. The
next table compares the two for α′ = .05, .01, and .001 across a range of numbers of tests.
First, the numbers greater than 1 in the table don’t represent probabilities, because
a probability can’t be greater than 1. Second, note that if we are testing each of a
large number of hypotheses at the .001 level, the difference between $1 - (1 - \alpha')^k$
and the Bonferroni upper bound of $k\alpha'$ is very small and of no practical consequence. Also, the differences between $1 - (1 - \alpha')^k$ and $k\alpha'$ when testing at α′ = .01
are small for up to about 30 tests. For more than about 30 tests $1 - (1 - \alpha')^k$
provides a tighter bound and should be used. When testing at the α′ = .05 level, $k\alpha'$
is okay for up to about 10 tests, but beyond that $1 - (1 - \alpha')^k$ is much tighter and
should be used.
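The comparison between the two expressions is easy to reproduce. The sketch below computes the Bonferroni bound kα′ and the independent-tests value 1 − (1 − α′)^k, first for the 10 pairwise t tests of the five-group example and then for a grid of test counts. The particular values of k in the grid are our own illustrative choices, and the function names are hypothetical.

```python
from math import comb


def bonferroni_bound(k, alpha):
    """Bonferroni upper bound on overall alpha for k tests, each at alpha."""
    return k * alpha


def overall_alpha_independent(k, alpha):
    """Probability of at least one type I error for k independent tests."""
    return 1 - (1 - alpha) ** k


# Five groups compared pairwise: C(5, 2) = 10 t tests, each at .05.
n_pairs = comb(5, 2)
print(n_pairs,
      round(bonferroni_bound(n_pairs, 0.05), 3),
      round(overall_alpha_independent(n_pairs, 0.05), 3))

# Side-by-side comparison for several alpha levels and numbers of tests.
for alpha in (0.05, 0.01, 0.001):
    for k in (5, 10, 30, 50, 100):
        print(f"alpha' = {alpha:<6} k = {k:3d}   "
              f"k*alpha' = {bonferroni_bound(k, alpha):.3f}   "
              f"1-(1-alpha')^k = {overall_alpha_independent(k, alpha):.3f}")
```

Values of kα′ above 1 appear in the output for α′ = .05 with many tests; as noted in the text, these are bounds rather than probabilities.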
You may have been alert to the possibility of spurious results in the preceding example with multiple t tests, because this problem is pointed out in texts on intermediate
statistical methods. Another frequently occurring example of multiple t tests where
overall α gets completely out of control is in comparing two groups on each item of a
scale (test); for example, comparing males and females on each of 30 items, doing 30
t tests, each at the .05 level.
Multiple statistical tests also arise in various other contexts in which you may not readily recognize that the same problem of spurious results exists. In addition, the fact that
the researcher may be using a more sophisticated design or more complex statistical
tests doesn’t mitigate the problem.
As our first illustration, consider a researcher who runs a four-way ANOVA (A × B ×
C × D). Then 15 statistical tests are being done, one for each effect in the design: A, B, C,
and D main effects, and AB, AC, AD, BC, BD, CD, ABC, ABD, ACD, BCD, and
ABCD interactions. If each of these effects is tested at the .05 level, then all we
know from the Bonferroni inequality is that overall α ≤ 15(.05) = .75, which is not
very reassuring. Hence, two or three significant results from such a study (if they
were not predicted ahead of time) could very well be type I errors, that is, spurious
results.
Let us take another common example. Suppose an investigator has a two-way ANOVA
design (A × B) with seven dependent variables. Then, there are three effects being
tested for significance: A main effect, B main effect, and the A × B interaction. The
investigator does separate two-way ANOVAs for each dependent variable. Therefore,
the investigator has done a total of 21 statistical tests, and if each of them was conducted at the .05 level, then the overall α has gotten completely out of control. This
type of thing is done very frequently in the literature, and you should be aware of it in
interpreting the results of such studies. Little faith should be placed in scattered significant results from these studies.
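The test counts in the last two illustrations follow from a simple rule: a design with f factors has 2^f − 1 effects (every nonempty combination of factors), and that count is multiplied by the number of dependent variables analyzed separately. A minimal sketch, with a hypothetical function name:

```python
from itertools import combinations


def factorial_effects(factors):
    """List every main effect and interaction for a factorial design."""
    effects = []
    for order in range(1, len(factors) + 1):
        for combo in combinations(factors, order):
            effects.append("".join(combo))
    return effects


# Four-way ANOVA: 4 main effects plus 11 interactions.
four_way = factorial_effects("ABCD")
print(len(four_way), four_way)

# Two-way ANOVA run separately on seven dependent variables.
n_tests = len(factorial_effects("AB")) * 7
print(n_tests, "tests; Bonferroni bound at .05 each:", round(n_tests * 0.05, 2))
```

With 21 tests at the .05 level the Bonferroni bound already exceeds 1, which is one way of seeing that overall α is no longer under control.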
A third example comes from survey research, where investigators are often interested
in relating demographic characteristics of the participants (sex, age, religion, socioeconomic status, etc.) to responses to items on a questionnaire. A statistical test for relating
each demographic characteristic to responses on each item is a two-way χ². Often in
such studies 20 or 30 (or many more) two-way χ² tests are run (and it is so easy to run
them on SPSS). The investigators often seem to be able to explain the frequent small
number of significant results perfectly, although seldom have the significant results
been predicted a priori.
A fourth fairly common example of multiple statistical tests is in examining the elements of a correlation matrix for significance. Suppose there were 10 variables in one
set being related to 15 variables in another set. In this case, there are 150 between
correlations, and if each of these is tested for significance at the .05 level, then
150(.05) = 7.5, or about eight significant results could be expected by chance. Thus,
if 10 or 12 of the between correlations are significant, most of them could be chance
results, and it is very difficult to separate out the chance effects from the real associations. A way of circumventing this problem is to simply test each correlation for significance at a much more stringent level, say α = .001. Then, by the Bonferroni inequality,
overall α ≤ 150(.001) = .15. Naturally, this will cause a power problem (unless n is
large), and only those associations that are quite strong will be declared significant. Of
course, one could argue that it is only such strong associations that may be of practical
importance.
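The arithmetic in this example can be checked directly. The sketch below computes the expected number of chance rejections and, under the simplifying assumption that the 150 tests are independent (which correlations from one data set generally are not), the binomial probability of seeing 10 or more significant results when every null hypothesis is true. The function names are hypothetical.

```python
from math import comb


def expected_false_positives(n_tests, alpha):
    """Expected number of type I errors when every null hypothesis is true."""
    return n_tests * alpha


def prob_at_least(m, n_tests, alpha):
    """P(at least m false rejections), ASSUMING independent tests."""
    p_fewer = sum(comb(n_tests, j) * alpha ** j * (1 - alpha) ** (n_tests - j)
                  for j in range(m))
    return 1 - p_fewer


n_tests = 10 * 15  # 150 between-set correlations
print(round(expected_false_positives(n_tests, 0.05), 2))  # expected count at .05
print(round(prob_at_least(10, n_tests, 0.05), 3))         # chance of 10 or more
print(round(n_tests * 0.001, 3))                          # Bonferroni bound at .001
```

Under these assumptions, 10 or more significant correlations is not a rare outcome even when no real association exists, which is why such results are hard to interpret.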
A fifth case of multiple statistical tests occurs when comparing the results of many
studies in a given content area. Suppose, for example, that 20 studies have been
reviewed in the area of programmed instruction and its effect on math achievement
in the elementary grades, and that only five studies show significance. Since at least
20 statistical tests were done (there would be more if there were more than a single
criterion variable in some of the studies), most of these significant results could be
spurious, that is, type I errors.
A sixth case of multiple statistical tests occurs when an investigator(s) selects
a small set of dependent variables from a much larger set (you don’t know this
has been done; this is an example of selection bias). The much smaller set is
chosen because all of the significance occurs here. This is particularly insidious.
Let us illustrate. Suppose the investigator has a three-way design and originally
15 dependent variables. Then 105 = 15 × 7 tests have been done (seven effects for each of the 15 variables). If each test is
done at the .05 level, then the Bonferroni inequality guarantees only that overall alpha
is no greater than 105(.05) = 5.25, a bound that exceeds 1 and is better read as the expected number of false rejections when every null hypothesis is true. So, if seven significant results are found, the Bonferroni procedure suggests that most (or all) of the results could be spurious. If all
the significance is confined to three of the variables, and those are the variables
selected (without your knowing this), then the bound becomes overall alpha ≤ 21(.05) = 1.05, and this
conveys a very different impression. Now, the conclusion is that perhaps a few of
the significant results are spurious.
1.4 STATISTICAL SIGNIFICANCE VERSUS PRACTICAL IMPORTANCE
You have probably been exposed to the statistical significance versus practical importance issue in a previous course in statistics, but it is sufficiently important to have us
review it here. Recall from our earlier discussion of power (probability of rejecting the
null hypothesis when it is false) that power is heavily dependent on sample size. Thus,
given very large sample size (say, group sizes > 200), most effects will be declared
statistically significant at the .05 level. If significance is found, often researchers seek
to determine whether the difference in means is large enough to be of practical importance. There are several ways of getting at practical importance; among them are
1. Confidence intervals
2. Effect size measures
3. Measures of association (variance accounted for).
Suppose you are comparing two teaching methods and decide ahead of time that the
achievement for one method must be at least 5 points higher on average for practical
importance. The results are significant, but the 95% confidence interval for the difference in the population means is (1.61, 9.45). You do not have practical importance,
because, although the difference could be as large as 9 or slightly more, it could also
be less than 2.
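The reasoning in this example can be written as two checks on the interval. The sketch below reflects our reading of the argument, namely that practical importance is established only when the entire interval lies at or above the preset threshold; the function names are hypothetical.

```python
def is_statistically_significant(ci_lower, ci_upper):
    """A confidence interval for a mean difference that excludes 0."""
    return ci_lower > 0 or ci_upper < 0


def meets_practical_threshold(ci_lower, threshold):
    """True only if even the smallest plausible difference reaches the threshold."""
    return ci_lower >= threshold


# Interval and threshold from the teaching-methods example.
lower, upper, threshold = 1.61, 9.45, 5
print(is_statistically_significant(lower, upper))
print(meets_practical_threshold(lower, threshold))
```

For the interval (1.61, 9.45) the first check succeeds and the second fails, matching the conclusion that the result is significant but not shown to be practically important.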
You can calculate an effect size measure and see if the effect is large relative to what
others have found in the same area of research. As a simple example, recall that the
Cohen effect size measure for two groups is $d = (\bar{x}_1 - \bar{x}_2)/s$, that is, it indicates how
many standard deviations the groups differ by. Suppose your t test was significant
and the estimated effect size measure was d = .63 (in the medium range according
to Cohen’s rough characterization). If this is large relative to what others have found,
then it probably is of practical importance. As Light, Singer, and Willett indicated in
their excellent text By Design (1990), “because practical significance depends upon
the research context, only you can judge if an effect is large enough to be important”
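For readers who want to compute the effect size themselves, here is a minimal sketch. The text says only that s is "the standard deviation"; using the pooled within-group standard deviation when raw scores are available is our assumption, and the two small score lists are hypothetical data made up for the illustration.

```python
from statistics import mean, variance


def d_from_summary(mean1, mean2, s):
    """Cohen's d from two means and a standard deviation."""
    return (mean1 - mean2) / s


def cohens_d(group1, group2):
    """Cohen's d from raw scores, using the pooled within-group SD (assumed)."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * variance(group1) +
                  (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / pooled_var ** 0.5


# Summary values from the earlier example: means 6 and 3, s = 6.
print(d_from_summary(6, 3, 6))  # 0.5

# Hypothetical raw scores, for illustration only.
print(cohens_d([2, 4, 6], [1, 3, 5]))
```

Whether a given d is of practical importance still has to be judged against what others have found in the same research area.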
Measures of association or strength of relationship, such as Hays’ $\hat{\omega}^2$, can also be used
to assess practical importance because they are essentially independent of sample size.
However, there are limitations associated with these measures, as O’Grady (1982)
pointed out in an excellent review on measures of explained variance. He discussed
three basic reasons that such measures should be interpreted with caution: measurement, methodological, and theoretical. We limit ourselves here to a theoretical point
O’Grady mentioned that should be kept in mind before casting aspersions on a “low”
amount of variance accounted for. The point is that most behaviors have multiple causes,
and hence it will be difficult in these cases to account for a large amount of variance
with just a single cause such as treatments. We give an example in Chapter 4 to show
that treatments accounting for only 10% of the variance on the dependent variable can
indeed be practically significant.
Sometimes practical importance can be judged by simply looking at the means and
thinking about the range of possible values. Consider the following example.
A survey researcher compares four geographic regions on their attitude toward education. The survey is sent out and 800 responses are obtained. Ten items, Likert scaled
from 1 to 5, are used to assess attitude. The group sizes, along with the means and
standard deviations for the total score scale, are given here:
An analysis of variance on these groups yields F = 5.61, which is significant at the .001
level. Examining the p value suggests that results are “highly significant,” but are the
results practically important? Very probably not. Look at the size of the mean differences for a scale that has a range from 10 to 50. The mean differences for all pairs of
groups, except for East and South, are about 2 or less. These are trivial differences on
a scale with a range of 40.
Now recall from our earlier discussion of power the problem of finding statistical significance with small sample size. That is, results in the literature that are not significant
may be simply due to poor or inadequate power, whereas results that are significant,
but have been obtained with huge sample sizes, may not be practically significant. We
illustrate this statement with two examples.
First, consider a two-group study with eight participants per group and an effect
size of .8 standard deviations. This is, in general, a large effect size (Cohen, 1988),
and most researchers would consider this result to be practically significant. However, if testing for significance at the .05 level (two-tailed test), then the chances
of finding significance are only about 1 in 3 (.31 from Cohen’s power tables).
The danger of not being sensitive to the power problem in such a study is that a
researcher may abort a promising line of research, perhaps an effective diet or type
of psychotherapy, because significance is not found. And it may also discourage
On the other hand, now consider a two-group study with 300 participants per group
and an effect size of .20 standard deviations. In this case, when testing at the .05 level,
the researcher is likely to find significance (power = .70 from Cohen’s tables). To use
a domestic analogy, this is like using a sledgehammer to “pound out” significance. Yet
the effect size here may not be considered practically significant in most cases. Based
on these results, for example, a school system may decide to implement an expensive
program that may yield only very small gains in achievement.
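The small-sample power figure can be checked by simulation rather than from tables. The sketch below repeatedly draws two normal samples of eight whose population means differ by .8 standard deviations, runs the pooled-variance t test, and counts rejections. The critical value 2.145 (two-tailed, .05 level, df = 14) comes from a standard t table rather than from the text, and the number of replications and the seed are arbitrary choices of ours.

```python
import random


def t_statistic(g1, g2):
    """Pooled-variance t statistic for two independent samples."""
    n1, n2 = len(g1), len(g2)
    m1, m2 = sum(g1) / n1, sum(g2) / n2
    ss1 = sum((x - m1) ** 2 for x in g1)
    ss2 = sum((x - m2) ** 2 for x in g2)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    return (m1 - m2) / (pooled_var * (1 / n1 + 1 / n2)) ** 0.5


def simulated_power(n, d, t_crit, reps=20000, seed=12345):
    """Proportion of simulated studies in which H0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        g1 = [rng.gauss(d, 1.0) for _ in range(n)]    # treatment group
        g2 = [rng.gauss(0.0, 1.0) for _ in range(n)]  # control group
        if abs(t_statistic(g1, g2)) > t_crit:
            rejections += 1
    return rejections / reps


# Eight participants per group, d = .8, two-tailed test at the .05 level.
print(round(simulated_power(8, 0.8, 2.145), 2))
```

The estimate should land near the value of about .31 cited from Cohen's tables, which is to say that roughly two out of three such studies would miss a large effect.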
For further perspective on the practical importance issue, there is a nice article by
Haase, Ellis, and Ladany (1989). Although that article is in the Journal of Counseling
Psychology, the implications are much broader. They suggest five different ways of
assessing the practical or clinical significance of findings:
1. Reference to previous research—the importance of context in determining whether
a result is practically important.
2. Conventional definitions of magnitude of effect—Cohen’s (1988) definitions of
small, medium, and large effect size.
3. Normative definitions of clinical significance—here they reference a special issue
of Behavioral Assessment (Jacobson, 1988) that should be of considerable interest
4. Cost-benefit analysis.
5. The good-enough principle—here the idea is to posit a form of the null hypothesis
that is more difficult to reject: for example, rather than testing whether two population means are equal, testing whether the difference between them is at least 3.
Note that many of these ideas are considered in detail in Grissom and Kim (2012).
Finally, although in a somewhat different vein, with various multivariate procedures
we consider in this text (such as discriminant analysis), unless sample size is large relative to the number of variables, the results will not be reliable—that is, they will not
generalize. A major point of the discussion in this section is that it is critically important to take sample size into account in interpreting results in the literature.
Outliers are data points that split off or are very different from the rest of the data. Specific examples of outliers would be an IQ of 160, or a weight of 350 lbs. in a group for
which the median weight is 180 lbs. Outliers can occur for two fundamental reasons:
(1) a data recording or entry error was made, or (2) the participants are simply different
from the rest. The first type of outlier can be identified by always listing the data and
checking to make sure the data have been read in accurately.
The importance of listing the data was brought home to Dr. Stevens many years ago as
a graduate student. A regression problem with five predictors, one of which was a set
of random scores, was run without checking the data. This was a textbook problem to
show students that the random number predictor would not be related to the dependent variable. However, the random number predictor was significant and accounted
for a fairly large part of the variance on y. This happened simply because one of the
scores for the random number predictor was incorrectly entered as a 300 rather than
as a 3. In this case it was obvious that something was wrong. But with large data sets
the situation will not be so transparent, and the results of an analysis could be completely thrown off by 1 or 2 errant points. The amount of time it takes to list and check
the data for accuracy (even if there are 1,000 or 2,000 participants) is well worth the
effort.
Statistical procedures in general can be quite sensitive to outliers. This is particularly
true for the multivariate procedures that will be considered in this text. It is very important to be able to identify such outliers and then decide what to do about them. Why?
Because we want the results of our statistical analysis to reflect most of the data, and
not to be highly influenced by just 1 or 2 errant data points.
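A data entry error such as a 300 keyed in place of a 3 can often be caught by a simple range check before any analysis is run. The sketch below is a minimal illustration; the scores and the plausible range of 1 to 10 are hypothetical, and the function name is ours.

```python
def out_of_range(values, low, high):
    """Return (case number, value) pairs that fall outside [low, high]."""
    return [(case, value)
            for case, value in enumerate(values, start=1)
            if not low <= value <= high]


# Hypothetical scores; the fifth entry was meant to be 3 but was keyed as 300.
scores = [4, 7, 3, 9, 300, 6, 2, 8]
print(out_of_range(scores, 1, 10))  # [(5, 300)]
```

A check like this finds only recording errors that produce impossible values; a participant who is genuinely different from the rest requires the kinds of inspection discussed next.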
In small data sets with just one or two variables, such outliers can be relatively easy to
identify. We now consider some examples.
Consider the following small data set with two variables:
Cases 6 and 10 are both outliers, but for different reasons. Case 6 is an outlier because
the score for case 6 on x1 (150) is deviant, while case 10 is an outlier because the score
for that subject on x2 (97) splits off from the other scores on x2. The graphical split-off
of cases 6 and 10 is quite vivid and is given in Figure 1.2.
Figure 1.2: Plot of outliers for two-variable example. The point (108.7, 60) marks the location of the means on x1 and x2.
In large data sets having many variables, some outliers are not so easy to spot
and could easily go undetected unless care is taken. Here, we give an example
of a somewhat more subtle outlier. Consider the following data set on four
variables:
The somewhat subtle outlier here is case 13. Notice that the scores for case 13 on none
of the xs really split off dramatically from the other participants’ scores. Yet the scores
tend to be low on x2, x3, and x4 and high on x1, and the cumulative effect of all this is
to isolate case 13 from the rest of the cases. We indicate shortly a statistic that is quite
useful in detecting multivariate outliers and pursue outliers in more detail in Chapter 3.
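The text does not name the statistic at this point. One statistic commonly used for this purpose is the squared Mahalanobis distance, and treating it as the relevant one here is our assumption. The sketch below uses its two-variable form on hypothetical data in which the last case is unremarkable on each variable separately but departs from the joint pattern; the function name is ours.

```python
from statistics import mean, stdev


def mahalanobis_sq_2d(xs, ys):
    """Squared Mahalanobis distance of each case from the centroid (2 variables)."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    n = len(xs)
    # Sample correlation between the two variables.
    r = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)
    distances = []
    for x, y in zip(xs, ys):
        zx, zy = (x - mx) / sx, (y - my) / sy
        distances.append((zx * zx - 2 * r * zx * zy + zy * zy) / (1 - r * r))
    return distances


# Hypothetical data: the last case (7, 2.0) is within the range of both variables.
x1 = [1, 2, 3, 4, 5, 6, 7, 8, 7]
x2 = [1.2, 1.9, 3.1, 4.2, 4.8, 6.1, 6.9, 8.0, 2.0]
d2 = mahalanobis_sq_2d(x1, x2)
for case, value in enumerate(d2, start=1):
    print(case, round(value, 2))
```

The last case has by far the largest distance even though its standardized score on each variable is less than 1 in absolute value, which is the sense in which a multivariate outlier can be subtle.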
Now let us consider three more examples, involving material learned in previous statistics courses, to show the effect outliers can have on some simple statistics.