


Friedman statistic. Once this exceeds the critical value above which less than

5% of the most extreme departures from the null hypothesis occur when

samples are taken from the same population, the outcome is considered

statistically significant.

This analysis can be up to 95% as powerful as the equivalent two-factor

ANOVA without replication for randomized blocks.
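Most statistical packages will calculate the Friedman statistic and its probability for you. The following sketch (not from the original text) assumes SciPy is available and uses made-up data for three treatments measured on the same five blocks:

# A minimal sketch (not from the original text) of running a Friedman test
# with SciPy, using hypothetical data: three treatments applied to the same
# five blocks (e.g. five drill cores each analyzed by three methods).
from scipy.stats import friedmanchisquare

treatment_a = [7.2, 5.1, 6.8, 7.9, 6.0]
treatment_b = [6.9, 4.8, 6.5, 7.4, 5.7]
treatment_c = [8.1, 5.9, 7.7, 8.6, 6.8]

# friedmanchisquare expects one sequence per treatment, with the blocks
# aligned by position; it returns the Friedman statistic and its probability.
statistic, p_value = friedmanchisquare(treatment_a, treatment_b, treatment_c)
print(f"Friedman statistic = {statistic:.3f}, P = {p_value:.4f}")

# If P is less than 0.05, the null hypothesis of no difference among the
# treatments is rejected.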



19.6.2 Exact tests and randomization tests for three or more related samples

The procedures for randomization and exact tests on the ranks of three or

more related samples are extensions of the methods for two independent

samples and do not need to be explained any further.



19.6.3 A posteriori comparisons for three or more related samples

If the Friedman test shows a significant difference among treatments and

the effect is considered fixed, you are likely to want to know which treatments

are significantly different (see 19.4.3). A posteriori testing can be done and

instructions are given in more advanced texts such as Zar (1996).



19.7 Analyzing ratio, interval or ordinal data that show gross differences in variance among treatments and cannot be satisfactorily transformed



Some data show gross differences in variance among treatments that

cannot be improved by transformation and are therefore unsuitable for

parametric or non-parametric analysis. An exploration geologist in Canada

was evaluating the economic potential of a circular depression thought to

be an impact crater. They knew that the large-scale impact structure at

nearby Sudbury was associated with valuable copper and nickel deposits,

and that other impact structures are excellent reservoirs for oil and gas.

So they set out to determine if the new locality might also be an impact

structure.

One of the key properties of impacted rocks is their high concentration of

platinum group elements. Perhaps the most diagnostic of these is iridium,

which is famously found all over the world in an ash layer that corresponds






to the end of the Cretaceous Period and the extinction of the dinosaurs.

Iridium is not normally present in crustal rocks on the Earth’s surface – it is

usually found only in the metallic cores of differentiated planets and in iron

from meteorites. So when an impact from an iron-rich object occurs on

Earth, the iridium vaporizes and is distributed among the impact ejecta in

unusually high concentrations (up to 100 parts per billion). Thus iridium

concentration can be used as a geochemical tracer to indicate that rocks

have experienced an impact event.

The exploration geologist collected 15 core samples from their suspected

new impact site, along with 15 from the Sudbury impact structure. The

concentration of iridium in the two samples of 15 is given in Table 19.7.

It is clear there are gross differences in the distributions between the two

samples, with one showing bimodality. A solution is to transform the data to

a nominal scale and reclassify both samples into two mutually exclusive

categories of “with iridium” and “no iridium” (Table 19.8) which can be

compared using a test for two or more independent samples of categorical

data (Chapter 18).
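As an illustration (not in the original text), the reclassified counts shown in Table 19.8 below could be compared with Fisher's Exact Test, one of the tests for independent samples of categorical data covered in Chapter 18. The following minimal sketch assumes SciPy is available:

# A minimal sketch (not from the original text) comparing the nominal-scale
# counts from Table 19.8 with Fisher's Exact Test.
from scipy.stats import fisher_exact

# Rows: without iridium, with iridium; columns: Sudbury, new site
counts = [[0, 10],
          [15, 5]]

odds_ratio, p_value = fisher_exact(counts, alternative="two-sided")
print(f"P = {p_value:.5f}")

# A very small P value indicates that the proportion of samples containing
# detectable iridium differs between Sudbury and the new site.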

Table 19.7 The Ir contents (in parts per billion) of 15 rocks sampled at Sudbury crater (a classic impact site) and 15 at a new site with a circular feature suspected to be an impact crater.

Sudbury    New site
4          2
7          0
4          2
10         0
2          0
7          0
1          0
9          0
1          1
9          0
12         1
1          0
5          0
4          1
5          0




Table 19.8 Transformation of the ratio data in Table 19.7 to a nominal scale showing the number of replicates in each sample as the two mutually exclusive categories of with and without detectable iridium.

                          Sudbury    New site
Number without iridium    0          10
Number with iridium       15         5

19.8 Non-parametric correlation analysis



Correlation analysis was introduced in Chapter 15 as an exploratory technique used to examine whether two variables are related or vary together.

Importantly, there is no expectation that the numerical value of one variable

can be predicted from the other, nor is it necessary that either variable is

determined by the other.

The parametric test for correlation gives a statistic that varies between

+1.00 and –1.00, with these extremes indicating a perfect positive and a perfect
negative straight-line relationship respectively, while values around

zero show no relationship. Although parametric correlation analysis is

powerful, it can only detect linear relationships and also assumes that both

the X and Y variables are normally distributed. When normality of both

variables cannot be assumed, or the relationship between the two variables

does not appear to be linear and cannot be remedied by transformation,

it is not appropriate to use a parametric test for correlation. The most

commonly used non-parametric test for correlation is Spearman’s rank

correlation.



19.8.1 Spearman’s rank correlation

This test is extremely straightforward. The two variables are ranked

separately, from lowest to highest, and the (parametric) Pearson correlation coefficient calculated for the ranked values. This gives a statistic called

Spearman’s rho, which for a population is symbolized by ρs and by rs for a

sample.
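The calculation is easy to reproduce with a statistical package. The following sketch (not from the original text) assumes SciPy is available and uses a small set of made-up bivariate data:

# A minimal sketch (not from the original text): Spearman's rank correlation
# obtained by ranking each variable and calculating Pearson's r on the ranks.
from scipy.stats import pearsonr, rankdata, spearmanr

x = [2.1, 3.4, 3.9, 5.6, 7.0, 9.8]   # hypothetical variable X
y = [30, 42, 38, 61, 80, 95]         # hypothetical variable Y

# Rank each variable separately (ties receive the average of their ranks)
x_ranks = rankdata(x)
y_ranks = rankdata(y)

# Pearson's correlation coefficient of the ranks is Spearman's rs
rs, _ = pearsonr(x_ranks, y_ranks)
print(f"rs calculated from the ranks = {rs:.3f}")

# The same value, together with its probability, comes directly from spearmanr
rho, p_value = spearmanr(x, y)
print(f"spearmanr: rs = {rho:.3f}, P = {p_value:.4f}")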

Spearman’s rs and Pearson’s r will not always be the same for the same set of data. For Pearson’s r the correlation coefficients of 1.00 or –1.00 were only obtained when there was a perfect positive or negative straight-line relationship between the two variables. In contrast, Spearman’s rs will give a value of 1.00 or –1.00 whenever the ranks for the two variables are in perfect agreement or disagreement (Figure 19.2), which occurs in more cases than a straight-line relationship.

[Figure 19.2: six scatter plots, (a)–(f), each accompanied by a small table of the raw scores, the ranks and the resulting value of rs.]

Figure 19.2 Examples of raw scores, ranks and the Spearman rank correlation coefficient for data with: (a) a perfect positive relationship (all points lie along a straight line); (b) no relationship; (c) a perfect negative relationship (all points lie along a straight line); (d) a positive relationship which is not a straight line but all pairs of bivariate data have the same ranks; (e) a positive relationship with only half the pairs of bivariate data having equal ranks; (f) a positive relationship with no pairs of bivariate data having equal ranks. Note that the value of rs is 1.00 for case (d) even though the raw data do not show a straight-line relationship.

The probability of the value of rs can be obtained by comparing it to the expected distribution of this statistic and most statistical packages will give rs together with its probability.



19.9 Other non-parametric tests

This chapter is only an introduction to some non-parametric tests for two or more samples of independent and related data. Other non-parametric tests are described in more specialized but nevertheless extremely well-explained texts such as Siegel and Castellan (1988).



19.10 Questions

(1) The table below gives summary data for the depth of the water table, in

feet, for a population of 1000 wells. (a) What are the relative frequencies

and cumulative relative frequencies for each depth? (b) For a sample of

100 wells, give a distribution of water table depths that would not be

significantly different from the population. (c) For another sample of

100 give a distribution of water table depths you would expect to be

significantly deeper than the population. (d) What test would be appropriate to compare these samples to the known population?

Depth (feet)    Number of wells
20–29           150
30–39           300
40–49           140
50–59           110
60–69            30
70–79           110
80–89           140
90–99            20






(2) An easy way to understand the process of ranking, and the tests that use

this procedure, is to use a contrived data set. The following two independent samples have very similar rank sums. (a) Rank the data across

both samples and calculate the rank sums. (b) Use a statistical package

to run a Mann–Whitney test on the data. Is there a significant difference

between the samples? (c) Now change the data so you would expect a

significant difference between groups. Run the Mann–Whitney test

again. Was the difference significant?

Group 1    Group 2
4          5
7          6
8          9
11         10
12         13
15         14
16         17
19         18
20         21
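For part (b), one way to run the test with a statistical package is sketched below (this code is not part of the original question and assumes SciPy is available):

# A minimal sketch (not from the original text) of running the Mann-Whitney
# test on the contrived data for question 2(b).
from scipy.stats import mannwhitneyu

group_1 = [4, 7, 8, 11, 12, 15, 16, 19, 20]
group_2 = [5, 6, 9, 10, 13, 14, 17, 18, 21]

statistic, p_value = mannwhitneyu(group_1, group_2, alternative="two-sided")
print(f"U = {statistic}, P = {p_value:.3f}")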



(3) The following set of data for the percentage of sandstone porosity shows

a gross difference in distribution between two samples. (a) How might

you compare these two samples? (b) Use your suggested method to test

the hypothesis that the two samples have different porosities. Is there a

significant difference?

Sample 1: 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 5, 2
Sample 2: 1, 1, 1, 1, 1, 1, 10, 11, 11, 11, 12, 12, 13, 13, 13, 13, 14, 14, 15, 17, 18, 18, 19



20 Introductory concepts of multivariate analysis

20.1 Introduction



So far, all the analyses discussed in this book have been for either univariate

or bivariate data. Often, however, earth scientists need to analyze samples of

multivariate data – where more than two variables are measured on each

sampling or experimental unit – because univariate or bivariate data do

not give enough detail to realistically describe the material or the environment being investigated.

For example, a large ore body may contain several different metals, and

the concentrations of each of these may vary considerably within it. It would

be useful to have a good estimate of this variation because some parts of the

deposit may be particularly worth mining, others may not be worth mining

at all, or certain parts may have to be mined and processed in different ways.

Data for only one or two metals (e.g. copper and silver) are unlikely to be

sufficient to estimate the full variation in composition and value within a

deposit that also includes lead and zinc.

Samples on which multivariate data have been measured are often difficult to compare with one another because there are so many variables. In

contrast, samples where only univariate data are available can easily be

visualized and compared (e.g. by summary statistics such as the mean and

standard error). Bivariate data can be displayed on a two-dimensional

graph, with one axis for each variable. Even data for three variables can be

displayed in a three-dimensional graph. But as soon as you have four or

more variables, the visualization of these in a multidimensional space and

comparison among samples becomes increasingly difficult. For example,

Table 20.1 gives data for the concentrations of five metals at four sites.

Although this is only a small data set, it is difficult to assess which sites are

most similar or dissimilar. (Incidentally, you may be thinking this is a very poor sampling design, because data are only given for one sampling unit at each site. This is true, but here we are presenting a simplified data set for clarity.)

Table 20.1 The concentrations of five metals at four sites (A–D). From these raw data, it is difficult to evaluate which sites are most similar or dissimilar.

Metal     Site A    Site B    Site C    Site D
Copper    12        43        26        21
Silver    11        40        28        19
Lead      46        63        26        21
Gold      32         5        19         7
Zinc       6        40        21        38

Earth scientists need ways of simplifying and summarizing multivariate data to compare samples. Because univariate data are so easy to visualize, the comparison among the four sites in Table 20.1 would be greatly

simplified if the data for the five metals could somehow be reduced to a

single statistic or measure. Multivariate methods do this by reducing the

complexity of the data sets while retaining as much information as possible

about each sample. The following explanations are simplified and conceptual, but they do describe how these methods work.



20.2 Simplifying and summarizing multivariate data



The methods for simplifying and comparing samples of multivariate data

can be divided into two groups.

(a) The first group of analyses works on the variables themselves. They

reduce the number of variables by identifying the ones that have the

most influence upon the observed differences among sampling units

so that relationships among the units can be summarized and visualized more easily. These “variable-oriented” methods are often called

R-mode analyses.

(b) The second group of analyses works on the sampling units. They often

summarize the multivariate data by calculating a single measure, or

statistic, that helps to quantify differences among sampling units.

These “sample-oriented” methods are often called Q-mode analyses.






This chapter will describe an example of an R-mode analysis, followed by

two Q-mode ones.



20.3 An R-mode analysis: principal components analysis



Principal components analysis (PCA) (which is called “principal component analysis” in some texts) is one of the oldest multivariate techniques.

The mathematical procedure of PCA is complex and uses matrix algebra,

but the concept of how PCA works is very easy to understand. The following

explanation only assumes an understanding of the correlation between two

variables (Chapter 15).

If you have a set of data where you have measured several variables on a

set of sampling units (e.g. a number of sites or cores), which for PCA are

often called objects, it is very difficult to compare them when you have data

for more than three variables (e.g. the data in Table 20.1).

Quite often, however, a set of multivariate data shows a lot of redundancy – that is, two or more variables are highly correlated with each other.

For example, if you look at the data in Table 20.1, it is apparent that the

concentrations of copper, silver and zinc are positively correlated (when

there are relatively high concentrations of copper there are also relatively

high concentrations of silver and zinc and vice versa). Furthermore, the

concentrations of copper, silver and zinc are also correlated with gold, but

we have deliberately made these correlations negative (when there are

relatively high concentrations of gold, there are relatively low concentrations of copper, silver and zinc and vice versa) because negative correlations

are just as important as positive ones.

These correlations are an example of redundancy within the data set –

because four of the five variables are well-correlated, and knowing which

correlations are negative and which are positive, you really only need the

data for one of these variables to describe differences among the sites.

Therefore, you could reduce the data for these four metals down to only one

(copper, silver, gold or zinc) plus lead in Table 20.2 with little loss of

information about the sites.
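This redundancy is easy to check numerically. The following sketch (not from the original text) assumes NumPy is available and simply prints the correlation matrix for the Table 20.1 data:

# A minimal sketch (not from the original text): the correlation matrix for
# the five metals in Table 20.1. Rows of the array are sites A-D; columns
# are Cu, Ag, Pb, Au and Zn.
import numpy as np

concentrations = np.array([
    [12, 11, 46, 32,  6],   # Site A
    [43, 40, 63,  5, 40],   # Site B
    [26, 28, 26, 19, 21],   # Site C
    [21, 19, 21,  7, 38],   # Site D
], dtype=float)

correlations = np.corrcoef(concentrations, rowvar=False)
print(np.round(correlations, 2))

# Copper, silver and zinc are positively correlated with one another, while
# gold is negatively correlated with each of them, as described above.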

A principal components analysis uses such cases of redundancy to reduce

the number of variables in a data set, although it does not exclude variables.

Instead, PCA identifies variables that are highly correlated with each other

and combines these to construct a reduced set of new variables that still



describes the differences among samples. These new variables are called principal components and are listed in decreasing order of importance (beginning with the one that explains the most variation among sampling units, followed by the next greatest, etc.). With a reduced number of variables, any differences among sampling units are likely to be easier to visualize.

Table 20.2 Because the concentrations of copper, silver, gold and zinc are correlated, you only need data for one of these (e.g. silver), plus the concentration of lead, to describe the differences among the sites.

Metal     Site A    Site B    Site C    Site D
Silver    11        40        28        19
Lead      46        63        26        21
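The idea can also be sketched numerically. The following code (not from the original text, and illustrative only because there are just four sites) assumes NumPy is available; it standardizes the Table 20.1 data, eigendecomposes the correlation matrix, and reports how much of the variation each principal component explains:

# A minimal sketch (not from the original text) of the idea behind PCA,
# applied to the Table 20.1 data. Rows are sites A-D; columns are Cu, Ag,
# Pb, Au and Zn.
import numpy as np

data = np.array([
    [12, 11, 46, 32,  6],   # Site A
    [43, 40, 63,  5, 40],   # Site B
    [26, 28, 26, 19, 21],   # Site C
    [21, 19, 21,  7, 38],   # Site D
], dtype=float)

# Standardize each variable (mean 0, standard deviation 1)
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

# Eigendecomposition of the correlation matrix gives the principal components
corr = np.corrcoef(data, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(corr)

# Sort the components from most to least variation explained
order = np.argsort(eigenvalues)[::-1]
explained = eigenvalues[order] / eigenvalues.sum()
print("Proportion of variation explained:", np.round(explained, 3))

# Scores of each site on the first principal component (the new variable)
pc1_scores = z @ eigenvectors[:, order[0]]
print("Site scores on PC1:", np.round(pc1_scores, 2))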



20.4 How does a PCA combine two or more variables into one?



This is a straightforward example where data for two variables are combined

into one new variable, and we are using a simplified version of the conceptual

explanation presented by Davis (2002). Imagine you need to assess variation

within a large ore body for which you have data for the concentration of silver

and gold at ten sites. It would be helpful to know which sites were most

similar (and dissimilar) and how the concentrations of silver and gold varied

among them.

The data for the ten sites have been plotted in Figure 20.1, which shows

a negative correlation between the concentrations of silver and gold. This

strong relationship between two variables can be used to construct a

single, combined variable to help make comparisons among the ten

sites. Note that you are not interested in whether the variables are

positively or negatively correlated – you only want to compare the sites.

The bivariate distribution of points for these two highly correlated

variables could be enclosed by a boundary. This is analogous to the way a

set of univariate data has a 95% confidence interval (Chapter 8). For this

bivariate data set the boundary will be two dimensional, and because the

variables are correlated it will be elliptical as shown in Figure 20.2.

[Figure 20.1: scatter plot of the concentration of silver against the concentration of gold for the ten sites A–J.]

Figure 20.1 The concentration of silver versus the concentration of gold at ten sites.

[Figure 20.2: the same scatter plot with an ellipse drawn around the ten points.]

Figure 20.2 An ellipse drawn around the set of data for the concentration of silver versus the concentration of gold in ore at ten sites. The elliptical boundary can be thought of as analogous to the 95% confidence interval for this bivariate distribution.

An ellipse is symmetrical and its relative length and width can be described by the length of the longest line that can be drawn through it (which is called the major axis), and the length of a line drawn halfway down and perpendicular to the major axis (which is called the minor axis) (Figure 20.3).

The relative lengths of the two axes describing the ellipse will depend upon

the strength of the correlation between the two variables. Highly correlated

data like those in Figure 20.3 will be enclosed by a long and narrow ellipse, but

for weakly correlated data the ellipse will be far more circular.

At present the ten sites are described by two variables – the concentrations

of silver and gold. But because these two variables are highly correlated, all the

sites are quite close to the major axis of the ellipse, so most of the variation

among them can be described by just that axis (Figure 20.3). Therefore, you

can think of the major axis as a new single variable that is a good indication of the variation among the ten sites.


