Further non-parametric tests
Friedman statistic. Once this exceeds the critical value above which less than
5% of the most extreme departures from the null hypothesis occur when
samples are taken from the same population, the outcome is considered
statistically signiﬁcant.
This analysis can be up to 95% as powerful as the equivalent two-factor
ANOVA without replication for randomized blocks.
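As a concrete sketch, the Friedman test is available in most statistical packages. The following Python example uses scipy with invented data in which every block ranks the three treatments the same way, so the outcome is significant:

```python
# Hedged sketch: Friedman test on three related samples using scipy.
# The data are hypothetical: one measurement per treatment in each of five blocks.
from scipy.stats import friedmanchisquare

treatment_a = [1, 2, 3, 4, 5]
treatment_b = [2, 3, 4, 5, 6]
treatment_c = [3, 4, 5, 6, 7]

# Every block ranks the treatments identically (a < b < c), so the
# Friedman statistic is large and exceeds the 5% critical value.
statistic, p_value = friedmanchisquare(treatment_a, treatment_b, treatment_c)
```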
19.6.2 Exact tests and randomization tests for three or more
related samples
The procedures for randomization and exact tests on the ranks of three or
more related samples are extensions of the methods for two independent
samples and do not need to be explained any further.
19.6.3 A posteriori comparisons for three or more related samples
If the Friedman test shows a signiﬁcant diﬀerence among treatments and
the eﬀect is considered ﬁxed, you are likely to want to know which treatments
are signiﬁcantly diﬀerent (see 19.4.3). A posteriori testing can be done and
instructions are given in more advanced texts such as Zar (1996).
19.7 Analyzing ratio, interval or ordinal data that show gross differences in variance among treatments and cannot be satisfactorily transformed
Some data show gross diﬀerences in variance among treatments that
cannot be improved by transformation and are therefore unsuitable for
parametric or non-parametric analysis. An exploration geologist in Canada
was evaluating the economic potential of a circular depression thought to
be an impact crater. They knew that the large-scale impact structure at
nearby Sudbury was associated with valuable copper and nickel deposits,
and that other impact structures are excellent reservoirs for oil and gas.
So they set out to determine if the new locality might also be an impact
structure.
One of the key properties of impacted rocks is their high concentration of
platinum group elements. Perhaps the most diagnostic of these is iridium,
which is famously found all over the world in an ash layer that corresponds
to the end of the Cretaceous Period and the extinction of the dinosaurs.
Iridium is not normally present in crustal rocks on the Earth’s surface – it is
usually found only in the metallic cores of diﬀerentiated planets and in iron
from meteorites. So when an impact from an iron-rich object occurs on
Earth, the iridium vaporizes and is distributed among the impact ejecta in
unusually high concentrations (up to 100 parts per billion). Thus iridium
concentration can be used as a geochemical tracer to indicate that rocks
have experienced an impact event.
The exploration geologist collected 15 core samples from their suspected
new impact site, along with 15 from the Sudbury impact structure. The
concentration of iridium in the two samples of 15 is given in Table 19.7.
It is clear there are gross diﬀerences in the distributions between the two
samples, with one showing bimodality. A solution is to transform the data to
a nominal scale and reclassify both samples into two mutually exclusive
categories of “with iridium” and “no iridium” (Table 19.8) which can be
compared using a test for two or more independent samples of categorical
data (Chapter 18).
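In Python, for example, such a reclassified 2 × 2 table (the counts shown in Table 19.8) could be compared with Fisher's exact test; this is a sketch using scipy rather than the worked procedure of Chapter 18:

```python
# Fisher's exact test on the reclassified counts from Table 19.8.
from scipy.stats import fisher_exact

#                  Sudbury  New site
table = [[0, 10],  # number without detectable iridium
         [15, 5]]  # number with detectable iridium

odds_ratio, p_value = fisher_exact(table)
# The very small p-value indicates the two sites differ in the
# proportion of rocks containing detectable iridium.
```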
Table 19.7 The Ir contents (in parts per billion) of 15 rocks sampled at Sudbury crater (a classic impact site) and 15 at a new site with a circular feature suspected to be an impact crater.

Sudbury:  4, 7, 4, 10, 2, 7, 1, 9, 1, 9, 12, 1, 5, 4, 5
New site: 2, 0, 2, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0
Table 19.8 Transformation of the ratio data in Table 19.7 to a nominal scale showing the number of replicates in each sample as the two mutually exclusive categories of with and without detectable iridium.

                         Sudbury   New site
Number without iridium   0         10
Number with iridium      15        5

19.8 Non-parametric correlation analysis
Correlation analysis was introduced in Chapter 15 as an exploratory technique used to examine whether two variables are related or vary together.
Importantly, there is no expectation that the numerical value of one variable
can be predicted from the other, nor is it necessary that either variable is
determined by the other.
The parametric test for correlation gives a statistic that varies between
+1.00 and –1.00, with both of these extremes indicating a perfect positive
and negative straight line relationship respectively, while values around
zero show no relationship. Although parametric correlation analysis is
powerful, it can only detect linear relationships and also assumes that both
the X and Y variables are normally distributed. When normality of both
variables cannot be assumed, or the relationship between the two variables
does not appear to be linear and cannot be remedied by transformation,
it is not appropriate to use a parametric test for correlation. The most
commonly used non-parametric test for correlation is Spearman’s rank
correlation.
19.8.1 Spearman’s rank correlation
This test is extremely straightforward. The two variables are ranked
separately, from lowest to highest, and the (parametric) Pearson correlation coeﬃcient calculated for the ranked values. This gives a statistic called
Spearman’s rho, which for a population is symbolized by ρs and by rs for a
sample.
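The definition above can be verified directly: ranking both variables and feeding the ranks to a Pearson correlation gives the same coefficient as a dedicated Spearman routine. A minimal sketch with invented data:

```python
# rs is simply Pearson's r computed on the ranks of each variable.
from scipy.stats import pearsonr, rankdata, spearmanr

x = [2, 5, 6, 8]          # hypothetical raw scores
y = [200, 300, 700, 900]

r_on_ranks, _ = pearsonr(rankdata(x), rankdata(y))
rs, p = spearmanr(x, y)
# Both routes give the same coefficient (here 1.0, a perfect
# agreement of ranks).
```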
Spearman’s rs and Pearson’s r will not always be the same for the same set
of data. For Pearson’s r the correlation coeﬃcients of 1.00 or –1.00 were
[Figure 19.2 appears here: six scatterplots, (a)–(f), of Y against X, each accompanied by a small table of the raw scores, their ranks, and the resulting value of rs.]
Figure 19.2 Examples of raw scores, ranks and the Spearman rank
correlation coeﬃcient for data with: (a) a perfect positive relationship
(all points lie along a straight line); (b) no relationship; (c) a perfect negative
relationship (all points lie along a straight line); (d) a positive relationship
which is not a straight line but all pairs of bivariate data have the same ranks;
(e) a positive relationship with only half the pairs of bivariate data having
equal ranks; (f) a positive relationship with no pairs of bivariate data having
equal ranks. Note that the value of rs is 1.00 for case (d) even though the raw
data do not show a straight-line relationship.
only obtained when there was a perfect positive or negative straight-line
relationship between the two variables. In contrast, Spearman’s rs will give a
value of 1.00 or –1.00 whenever the ranks for the two variables are in perfect
agreement or disagreement (Figure 19.2), which occurs in more cases than a
straight-line relationship.
The probability of an observed value of rs can be obtained by comparing it to the expected distribution of this statistic, and most statistical packages report rs together with its probability.
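The contrast with Pearson's r can be seen with a monotonic but curved relationship; in this invented example the ranks agree perfectly, so rs = 1.00 while r falls short of it:

```python
# Spearman's rs versus Pearson's r for a monotonic, non-linear relationship.
from scipy.stats import pearsonr, spearmanr

x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]   # monotonic increase, but not a straight line

r, _ = pearsonr(x, y)     # less than 1: the points do not lie on a line
rs, _ = spearmanr(x, y)   # exactly 1: the ranks are in perfect agreement
```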
19.9 Other non-parametric tests
This chapter is only an introduction to some non-parametric tests for two or
more samples of independent and related data. Other non-parametric tests
are described in more specialized but nevertheless extremely well-explained
texts such as Siegel and Castellan (1988).
19.10 Questions
(1) The table below gives summary data for the depth of the water table, in
feet, for a population of 1000 wells. (a) What are the relative frequencies
and cumulative relative frequencies for each depth? (b) For a sample of
100 wells, give a distribution of water table depths that would not be
signiﬁcantly diﬀerent from the population. (c) For another sample of
100 give a distribution of water table depths you would expect to be
signiﬁcantly deeper than the population. (d) What test would be appropriate to compare these samples to the known population?
Depth (feet)    Number of wells
20–29           150
30–39           300
40–49           140
50–59           110
60–69           30
70–79           110
80–89           140
90–99           20
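Part (a) of Question 1 can be answered with a few lines of arithmetic; this sketch uses only the Python standard library:

```python
# Relative and cumulative relative frequencies for the 1000 wells.
from itertools import accumulate

counts = [150, 300, 140, 110, 30, 110, 140, 20]  # 20-29 ... 90-99 feet
total = sum(counts)

rel_freq = [c / total for c in counts]
cum_rel_freq = [c / total for c in accumulate(counts)]
# rel_freq[0] is 0.15, and cum_rel_freq rises to 1.0 in the last class.
```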
(2) An easy way to understand the process of ranking, and the tests that use
this procedure, is to use a contrived data set. The following two independent samples have very similar rank sums. (a) Rank the data across
both samples and calculate the rank sums. (b) Use a statistical package
to run a Mann–Whitney test on the data. Is there a signiﬁcant diﬀerence
between the samples? (c) Now change the data so you would expect a
signiﬁcant diﬀerence between groups. Run the Mann–Whitney test
again. Was the diﬀerence signiﬁcant?
Group 1: 4, 7, 8, 11, 12, 15, 16, 19, 20
Group 2: 5, 6, 9, 10, 13, 14, 17, 18, 21
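Parts (a) and (b) of Question 2 can be checked in Python; this sketch computes the rank sums and then runs scipy's Mann–Whitney test:

```python
# Rank sums and a Mann-Whitney U test for the two contrived groups.
from scipy.stats import mannwhitneyu, rankdata

group_1 = [4, 7, 8, 11, 12, 15, 16, 19, 20]
group_2 = [5, 6, 9, 10, 13, 14, 17, 18, 21]

ranks = rankdata(group_1 + group_2)          # rank across both samples
rank_sum_1 = sum(ranks[:len(group_1)])       # 85
rank_sum_2 = sum(ranks[len(group_1):])       # 86

u, p_value = mannwhitneyu(group_1, group_2, alternative='two-sided')
# With such similar rank sums the difference is far from significant.
```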
(3) The following set of data for the percentage of sandstone porosity shows
a gross diﬀerence in distribution between two samples. (a) How might
you compare these two samples? (b) Use your suggested method to test
the hypothesis that the two samples have diﬀerent porosities. Is there a
signiﬁcant diﬀerence?
Sample 1: 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 5, 2
Sample 2: 1, 1, 1, 1, 1, 1, 10, 11, 11, 11, 12, 12, 13, 13, 13, 13, 14, 14, 15, 17, 18, 18, 19
20 Introductory concepts of multivariate analysis
20.1 Introduction
So far, all the analyses discussed in this book have been for either univariate
or bivariate data. Often, however, earth scientists need to analyze samples of
multivariate data – where more than two variables are measured on each
sampling or experimental unit – because univariate or bivariate data do
not give enough detail to realistically describe the material or the environment being investigated.
For example, a large ore body may contain several diﬀerent metals, and
the concentrations of each of these may vary considerably within it. It would
be useful to have a good estimate of this variation because some parts of the
deposit may be particularly worth mining, others may not be worth mining
at all, or certain parts may have to be mined and processed in diﬀerent ways.
Data for only one or two metals (e.g. copper and silver) are unlikely to be
suﬃcient to estimate the full variation in composition and value within a
deposit that also includes lead and zinc.
Samples on which multivariate data have been measured are often diﬃcult to compare with one another because there are so many variables. In
contrast, samples where only univariate data are available can easily be
visualized and compared (e.g. by summary statistics such as the mean and
standard error). Bivariate data can be displayed on a two-dimensional
graph, with one axis for each variable. Even data for three variables can be
displayed in a three-dimensional graph. But as soon as you have four or
more variables, the visualization of these in a multidimensional space and
comparison among samples becomes increasingly diﬃcult. For example,
Table 20.1 gives data for the concentrations of ﬁve metals at four sites.
Although this is only a small data set, it is diﬃcult to assess which sites are
most similar or dissimilar. (Incidentally, you may be thinking this is a
Table 20.1 The concentrations of five metals at four sites (A–D). From these raw data, it is difficult to evaluate which sites are most similar or dissimilar.

Metal     Site A    Site B    Site C    Site D
Copper    12        43        26        21
Silver    11        40        28        19
Lead      46        63        26        21
Gold      32        5         19        7
Zinc      6         40        21        38
very poor sampling design, because data are only given for one sampling
unit at each site. This is true, but here we are presenting a simpliﬁed data set
for clarity.)
Earth scientists need ways of simplifying and summarizing multivariate data to compare samples. Because univariate data are so easy to visualize, the comparison among the four sites in Table 20.1 would be greatly
simpliﬁed if the data for the ﬁve metals could somehow be reduced to a
single statistic or measure. Multivariate methods do this by reducing the
complexity of the data sets while retaining as much information as possible
about each sample. The following explanations are simpliﬁed and conceptual, but they do describe how these methods work.
20.2 Simplifying and summarizing multivariate data
The methods for simplifying and comparing samples of multivariate data
can be divided into two groups.
(a) The ﬁrst group of analyses works on the variables themselves. They
reduce the number of variables by identifying the ones that have the
most inﬂuence upon the observed diﬀerences among sampling units
so that relationships among the units can be summarized and visualized more easily. These “variable-oriented” methods are often called R-mode analyses.
(b) The second group of analyses works on the sampling units. They often
summarize the multivariate data by calculating a single measure, or
statistic, that helps to quantify diﬀerences among sampling units.
These “sample-oriented” methods are often called Q-mode analyses.
This chapter will describe an example of an R-mode analysis, followed by
two Q-mode ones.
20.3 An R-mode analysis: principal components analysis
Principal components analysis (PCA) (which is called “principal component analysis” in some texts) is one of the oldest multivariate techniques.
The mathematical procedure of PCA is complex and uses matrix algebra,
but the concept of how PCA works is very easy to understand. The following
explanation only assumes an understanding of the correlation between two
variables (Chapter 15).
If you have a set of data where you have measured several variables on a
set of sampling units (e.g. a number of sites or cores), which for PCA are
often called objects, it is very diﬃcult to compare them when you have data
for more than three variables (e.g. the data in Table 20.1).
Quite often, however, a set of multivariate data shows a lot of redundancy – that is, two or more variables are highly correlated with each other.
For example, if you look at the data in Table 20.1, it is apparent that the
concentrations of copper, silver and zinc are positively correlated (when
there are relatively high concentrations of copper there are also relatively
high concentrations of silver and zinc and vice versa). Furthermore, the
concentrations of copper, silver and zinc are also correlated with gold, but
we have deliberately made these correlations negative (when there are
relatively high concentrations of gold, there are relatively low concentrations of copper, silver and zinc and vice versa) because negative correlations
are just as important as positive ones.
These correlations are an example of redundancy within the data set –
because four of the ﬁve variables are well-correlated, and knowing which
correlations are negative and which are positive, you really only need the
data for one of these variables to describe diﬀerences among the sites.
Therefore, you could reduce the data for these four metals down to only one
(copper, silver, gold or zinc) plus lead in Table 20.2 with little loss of
information about the sites.
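This redundancy can be quantified with a correlation matrix; here is a sketch using numpy on the Table 20.1 values:

```python
# Correlation matrix for the five metals in Table 20.1 (sites A-D).
import numpy as np

metals = {
    'Cu': [12, 43, 26, 21],
    'Ag': [11, 40, 28, 19],
    'Pb': [46, 63, 26, 21],
    'Au': [32, 5, 19, 7],
    'Zn': [6, 40, 21, 38],
}
r = np.corrcoef(list(metals.values()))  # rows are treated as variables

cu_ag = r[0, 1]  # strongly positive: copper and silver vary together
cu_au = r[0, 3]  # negative: gold runs opposite to copper
```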
A principal components analysis uses such cases of redundancy to reduce
the number of variables in a data set, although it does not exclude variables.
Instead, PCA identiﬁes variables that are highly correlated with each other
and combines these to construct a reduced set of new variables that still
Table 20.2 Because the concentrations of copper, silver, gold and zinc are correlated, you only need data for one of these (e.g. silver), plus the concentration of lead, to describe the differences among the sites.

Metal     Site A    Site B    Site C    Site D
Silver    11        40        28        19
Lead      46        63        26        21
describes the diﬀerences among samples. These new variables are called
principal components and are listed in decreasing order of importance
(beginning with the one that explains the most variation among sampling
units, followed by the next greatest, etc.). With a reduced number of variables,
any diﬀerences among sampling units are likely to be easier to visualize.
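Conceptually, the principal components are the eigenvectors of the correlation matrix of the data, ordered by their eigenvalues. This numpy sketch of the Table 20.1 data shows the first component accounting for most of the variation among the four sites:

```python
# A minimal PCA sketch: eigen-decomposition of the correlation matrix.
import numpy as np

# Rows are the four sites (A-D); columns are Cu, Ag, Pb, Au, Zn.
data = np.array([[12, 11, 46, 32, 6],
                 [43, 40, 63, 5, 40],
                 [26, 28, 26, 19, 21],
                 [21, 19, 21, 7, 38]])

corr = np.corrcoef(data, rowvar=False)           # 5 x 5 correlation matrix
eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]

explained = eigvals / eigvals.sum()
# explained[0] is the proportion of the total variation carried by the
# first principal component - well over half for these correlated metals.
```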
20.4 How does a PCA combine two or more variables into one?
This is a straightforward example where data for two variables are combined
into one new variable, and we are using a simpliﬁed version of the conceptual
explanation presented by Davis (2002). Imagine you need to assess variation
within a large ore body for which you have data for the concentration of silver
and gold at ten sites. It would be helpful to know which sites were most
similar (and dissimilar) and how the concentrations of silver and gold varied
among them.
The data for the ten sites have been plotted in Figure 20.1, which shows
a negative correlation between the concentrations of silver and gold. This
strong relationship between two variables can be used to construct a
single, combined variable to help make comparisons among the ten
sites. Note that you are not interested in whether the variables are
positively or negatively correlated – you only want to compare the sites.
The bivariate distribution of points for these two highly correlated
variables could be enclosed by a boundary. This is analogous to the way a
set of univariate data has a 95% conﬁdence interval (Chapter 8). For this
bivariate data set the boundary will be two dimensional, and because the
variables are correlated it will be elliptical as shown in Figure 20.2.
An ellipse is symmetrical and its relative length and width can be
described by the length of the longest line that can be drawn through it
[Figure 20.1 appears here: a scatterplot of the concentration of silver (vertical axis) against the concentration of gold (horizontal axis) at the ten sites, labelled A–J.]
Figure 20.1 The concentration of silver versus the concentration of gold at
ten sites.
[Figure 20.2 appears here: the same scatterplot of silver against gold with an ellipse enclosing the ten points.]
Figure 20.2 An ellipse drawn around the set of data for the concentration of
silver versus the concentration of gold in ore at ten sites. The elliptical
boundary can be thought of as analogous to the 95% conﬁdence interval for
this bivariate distribution.
(which is called the major axis), and the length of a line drawn halfway down
and perpendicular to the major axis (which is called the minor axis)
(Figure 20.3).
The relative lengths of the two axes describing the ellipse will depend upon
the strength of the correlation between the two variables. Highly correlated
data like those in Figure 20.3 will be enclosed by a long and narrow ellipse, but
for weakly correlated data the ellipse will be far more circular.
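The link between correlation strength and ellipse shape can be made concrete: the eigenvalues of the 2 × 2 correlation matrix give the relative lengths of the major and minor axes. A numpy sketch with two invented correlation values:

```python
# Axis lengths of the enclosing ellipse from a 2 x 2 correlation matrix.
import numpy as np

strong = np.array([[1.0, 0.9], [0.9, 1.0]])  # highly correlated pair
weak = np.array([[1.0, 0.1], [0.1, 1.0]])    # weakly correlated pair

major_s, minor_s = np.sort(np.linalg.eigvalsh(strong))[::-1]
major_w, minor_w = np.sort(np.linalg.eigvalsh(weak))[::-1]
# Strong correlation: a long, narrow ellipse (axes proportional to 1.9 and 0.1).
# Weak correlation: a nearly circular one (1.1 and 0.9).
```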
At present the ten sites are described by two variables – the concentrations
of silver and gold. But because these two variables are highly correlated, all the
sites are quite close to the major axis of the ellipse, so most of the variation
among them can be described by just that axis (Figure 20.3). Therefore, you
can think of the major axis as a new single variable that is a good indication of