# 2 Variables, sampling units and types of data

Collecting and displaying data

or a deﬁned item (e.g. a square meter of the outcrop, a speciﬁc stratigraphic

unit, or a particular locality).

If you only measure one variable per sampling unit the data set is univariate. Data for two variables per unit are bivariate, while data for three or

more variables measured on the same sampling unit are multivariate.

Variables can be measured on four scales – ratio, interval, ordinal or

nominal.

A ratio scale describes a variable whose numerical values truly indicate the quantity being measured.

* There is a true zero point below which you cannot have any data (for example, if you are measuring the length of feldspar crystals in a thin section, you cannot have a crystal of negative length).

* An increase of the same numerical amount indicates the same quantity across the range of measurements (for example, a 0.2 mm and a 2 mm feldspar will have grown by the same amount if they both increase in length by 10 mm).

* A particular ratio holds across the range of the variable (for example, a 200 μm feldspar grain is twenty times longer than a 10 μm grain and a 100 μm grain is also twenty times longer than a 5 μm one).

An interval scale describes a variable that can be less than zero.

* The zero point is arbitrary (for example, temperature measured in degrees Celsius has a zero point at which water freezes), so negative values are possible. The true zero point for temperature, where there is a complete absence of heat, is zero kelvin (about –273 °C), so (unlike Celsius) the kelvin is a ratio scale.

* An increase of the same numerical amount indicates the same quantity across the range of measurements (for example, a 2 °C increase indicates the same increase in heat whatever the starting temperature).

* Because the zero point is arbitrary, a particular ratio does not hold across the range of the variable. For example, the ratio of 6 °C compared to 1 °C is not the same as 60 °C to 10 °C. The two ratios in terms of the kelvin scale are 279:274 K and 333:283 K.
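The Celsius and kelvin ratios in the last point can be checked with a few lines of Python. This is only a sketch: the helper name `to_kelvin` is made up for illustration, and the offset is rounded to 273 to match the figures quoted in the text.

```python
# Offset between the Celsius and kelvin scales, rounded to 273 as in the text.
KELVIN_OFFSET = 273


def to_kelvin(celsius):
    """Convert a Celsius temperature to kelvin using the rounded offset."""
    return celsius + KELVIN_OFFSET


# On the Celsius (interval) scale both pairs give the same number ...
celsius_ratio_low = 6 / 1
celsius_ratio_high = 60 / 10

# ... but on the kelvin (ratio) scale the two pairs are clearly different.
kelvin_ratio_low = to_kelvin(6) / to_kelvin(1)     # 279 / 274
kelvin_ratio_high = to_kelvin(60) / to_kelvin(10)  # 333 / 283

print("Celsius ratios:", celsius_ratio_low, celsius_ratio_high)
print("Kelvin ratios: ", round(kelvin_ratio_low, 3), round(kelvin_ratio_high, 3))
```

The matching Celsius ratios are an artifact of the arbitrary zero point; only the kelvin ratios compare actual quantities of heat.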

An ordinal scale applies to data where values are ranked – which means they are given a value that simply indicates their relative order. For example, five mountains with elevations of 10 000 m, 4500 m, 4300 m, 4000 m and 3984 m have been measured on a ratio scale. If you rank these in order, from highest to lowest, as 5, 4, 3, 2 and 1, the data have been reduced to an ordinal scale, but this is not very informative and does not mean that the highest mountain is five times the elevation of the lowest. For ordinal data, an increase of the same number of ranks does not necessarily indicate the same increase in quantity across the range of the variable.
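As a small illustration of how ranking discards information, the sketch below ranks the five elevations from the example and compares a one-rank step at the top of the range with a one-rank step at the bottom. The variable names are chosen here for illustration only.

```python
# Elevations (m) of the five mountains in the example (ratio scale).
elevations = [10000, 4500, 4300, 4000, 3984]

# Rank from lowest (1) to highest (5), so the highest mountain gets rank 5.
ordered = sorted(elevations)
ranks = {elevation: position for position, elevation in enumerate(ordered, start=1)}

# Compare a one-rank step at the top and at the bottom of the range.
top_step = ordered[4] - ordered[3]     # rank 5 minus rank 4, in metres
bottom_step = ordered[1] - ordered[0]  # rank 2 minus rank 1, in metres

for elevation in elevations:
    print(elevation, "m -> rank", ranks[elevation])
print("One rank at the top:", top_step, "m; one rank at the bottom:", bottom_step, "m")
```

Both steps are a difference of exactly one rank, yet they correspond to very different differences in elevation.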

A nominal scale applies to data where the values are classiﬁed according

to an attribute. For example, the breakdown of rocks at the Earth’s surface

can be classiﬁed as either chemical or mechanical weathering, so a sample of

diﬀerent sediments can be subdivided into the numbers within each of these

two categories. You might have a sample of ten, of which three fall in the

“chemical” category and the remaining seven in the “mechanical” one.

The ﬁrst three types of data described above can include either continuous or discrete data. Nominal scale data (since they are attributes) can

only be discrete.

Continuous data can have any value within a range. For example, any

value of temperature is possible within the range from 10 °C to 20 °C, such

as 15.3 °C or 17.82 °C.

Discrete data are very diﬀerent from continuous data because they can

only have ﬁxed numerical values within a range. For example, the number of

electrons in an atom increases from one ﬁxed whole number to the next,

because you cannot have a fraction of an electron.

It is important that you know what type of data you are dealing with

because this will be one of the factors that determines your choice of

statistical test.

3.3 Displaying data

A list of data may reveal very little, but a pictorial summary is a way of

exploring the data that might help you notice a pattern, which can help

generate or test hypotheses.

3.3.1 Histograms

Here is a list of the number of visits made to their lecturer’s oﬃce by a sample

of 60 students chosen at random from 320 students in the course

Introductory Geoscience. These data are univariate, ratio scaled and discrete.

1, 1, 6, 1, 12, 1, 2, 6, 2, 7, 2, 2, 5, 2, 1, 2, 1, 9, 1, 8, 1, 1, 2, 5, 1, 6, 1, 1, 1, 5, 1, 1,

1, 2, 2, 3, 2, 3, 3, 3, 3, 3, 4, 5, 6, 7, 8, 9, 4, 1, 1, 9, 10, 1, 4, 10, 11, 1, 2, 3

It is diﬃcult to see any pattern from this list of numbers, but you could

summarize and display these data by drawing a histogram. To do this you

separately count the number (the frequency) of cases for students who

visited never, once, twice, three times, through to the maximum number of

visits and plot these as a series of rectangles on a graph with the X axis

showing the number of visits and the Y axis the number of students in each

of these cases. Figure 3.1 shows a histogram of these data.
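The counting step described above can be sketched in Python with `collections.Counter` from the standard library. The list is copied from the data above; the text "histogram" printed with `#` characters is only a rough stand-in for a proper plot.

```python
from collections import Counter

# Number of office visits for the sample of 60 students (copied from the text).
visits = [
    1, 1, 6, 1, 12, 1, 2, 6, 2, 7, 2, 2, 5, 2, 1, 2, 1, 9, 1, 8,
    1, 1, 2, 5, 1, 6, 1, 1, 1, 5, 1, 1, 1, 2, 2, 3, 2, 3, 3, 3,
    3, 3, 4, 5, 6, 7, 8, 9, 4, 1, 1, 9, 10, 1, 4, 10, 11, 1, 2, 3,
]

# Frequency of each number of visits; a Counter returns 0 for absent values.
frequency = Counter(visits)

# One row per possible number of visits, from "never" (0) up to the maximum.
for n_visits in range(0, max(visits) + 1):
    print(f"{n_visits:2d} | {'#' * frequency[n_visits]}")
```

Because the variable is discrete, each whole number of visits gets its own bar and no interval width has to be chosen.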

This visual summary shows that the distribution is skewed to the right – most students made few visits for help, but there is a long upper “tail” of students who made five or more visits. Incidentally, looking at the graph you may be a little suspicious because every student made at least one visit. This was because each of them had to visit the lecturer’s office to pick up an assignment during the first three weeks of class to ensure they knew where to go if they did ever need help, so these data are somewhat misleading in terms of indicating the neediness of the group. You may be tempted to draw a line joining the midpoints of the tops of each bar to indicate the shape of the distribution, but this implies that the data on the X axis are continuous, which is not the case because visits are discrete whole numbers.

Figure 3.1 The number of visits made to their lecturer’s office by a sample of 60 students chosen at random from 320 students in the course Introductory Geoscience. [Histogram: number of visits (1 to 12) on the X axis, number of students (0 to 20) on the Y axis.]

3.3.2 Frequency polygons or line graphs

If the data are continuous, it is appropriate to draw a line linking the

midpoint of the tops of each bar in the histogram. Here is a geological

example for some continuous data that can be summarized as a histogram

or as a frequency polygon (often called a line graph). Carbon isotope data

are very useful for understanding the global distribution of carbon between

the Earth’s atmosphere, seawater and carbonate minerals. The δ13C of

carbonate minerals can provide information about variations of δ13C in

ocean water, which can be related to the global carbon cycle and palaeoceanographic circulation patterns.

A sample of 28 “muddy” limestones (wackestones) was collected from an

extended outcrop, and isotopic analyses for δ13C ‰ were obtained. Nothing

is very obvious from this list of results:

1.01, 0.59, 2.32, 0.19, −2.39, −3.76, −0.8, 1.6, 0.28, −1.62, −0.33, −1.26,

−0.01, 1.36, 0.99, 1.12, −0.45, 0.71, 1.12, −0.72, 1.36, 1.59, 2.27, 2.25, 3.05,

2.58, 1.94, 3.28

Because the data are continuous, they are not as easy to summarize as the

discrete data in Figure 3.1. To display a histogram for continuous data you

need to subdivide the data into the frequency of cases within a series of

intervals of equal width. First you need to look at the range of the data (here

δ13C ‰ varies from a minimum of −3.76 through to a maximum of 3.28)

and decide on an interval width that will give you an informative display of

the data. Here the chosen width is 1.0 ‰. Therefore, starting from −4.0 ‰,

this will give 8 intervals, the ﬁrst of which is −4 to −3.01 ‰. The chosen

interval width needs to be one that shows the shape of the distribution: there

would be no point in choosing a width that included all the data in just two

intervals because you would only have two bars on the histogram. Nor

would there be any point in choosing more than 20 intervals because this

would give a lot of bars with each containing only a few data.

Once you have decided on an appropriate interval size, you need to count

the number of cases with δ13C values that fall within each interval

(Table 3.1) and plot these frequencies on the Y axis against the intervals

(indicated by the midpoint of each interval) on the X axis. This has been

done in Figure 3.2(a). Finally, the midpoints of the tops of each rectangle

have been joined by a line to give a frequency polygon, or line graph

(Figure 3.2(b)).
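The binning procedure can be sketched as follows. This is a minimal example under one stated assumption: each interval is treated as half-open (the lower edge is included and the upper edge is excluded), which agrees with the interval labels in Table 3.1 for data recorded to two decimal places. The names `WIDTH`, `START` and `interval_index` are invented for illustration.

```python
import math
from collections import Counter

# δ13C values (per mil) for the 28 limestone samples, copied from the text.
d13c = [
    1.01, 0.59, 2.32, 0.19, -2.39, -3.76, -0.8, 1.6, 0.28, -1.62, -0.33, -1.26,
    -0.01, 1.36, 0.99, 1.12, -0.45, 0.71, 1.12, -0.72, 1.36, 1.59, 2.27, 2.25, 3.05,
    2.58, 1.94, 3.28,
]

WIDTH = 1.0      # interval width in per mil, as chosen in the text
START = -4.0     # lower edge of the first interval
N_INTERVALS = 8  # number of intervals needed to cover -4 to 4


def interval_index(value):
    """Return the 0-based index of the half-open interval containing value."""
    return math.floor((value - START) / WIDTH)


counts = Counter(interval_index(v) for v in d13c)
frequencies = [counts[i] for i in range(N_INTERVALS)]

for i, freq in enumerate(frequencies):
    lower = START + i * WIDTH
    print(f"{lower:g} to <{lower + WIDTH:g}: {freq}")
```

Changing `WIDTH` and `START` is a quick way to see how the choice of interval alters the apparent shape of the distribution.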

Table 3.1 Summary of δ13C ‰ data for limestones listed as frequencies and cumulative frequencies.

| Interval range δ13C ‰ | Cases | Cumulative frequency (total) | Cumulative frequency (percent) |
|---|---|---|---|
| −4 to −3.01 | 1 | 1 | 3.6 |
| −3 to −2.01 | 1 | 2 | 7.1 |
| −2 to −1.01 | 2 | 4 | 14.3 |
| −1 to −0.01 | 5 | 9 | 32.1 |
| 0 to 0.99 | 5 | 14 | 50.0 |
| 1 to 1.99 | 8 | 22 | 78.6 |
| 2 to 2.99 | 4 | 26 | 92.9 |
| 3 to 3.99 | 2 | 28 | 100.0 |

Figure 3.2 Carbon isotope data for 28 sampling units of limestone from the same outcrop, displayed as (a) a histogram and (b) a frequency polygon or line graph. The points on the frequency polygon (b) correspond to the midpoints of the bars on (a). [Both panels: δ13C ‰ (−4 to 4) on the X axis, frequency (0 to 8) on the Y axis.]

3.3.3 Cumulative graphs

Often it is useful to display data as a histogram of cumulative frequencies. This is a graph that displays the progressive total (starting at zero, or zero percent, and finishing at the sample size or 100%) on the Y axis against the increasing value of the variable on the X axis. Because the total is progressive, a cumulative frequency graph can never decrease. Figure 3.3 displays the data in Table 3.1 as a cumulative frequency histogram.
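The cumulative totals and percentages in Table 3.1 can be reproduced from the interval frequencies with `itertools.accumulate`. This is a short sketch with illustrative variable names, not the procedure of any particular statistics package.

```python
from itertools import accumulate

# Frequencies for the eight δ13C intervals in Table 3.1, from -4 up to 3.99.
frequencies = [1, 1, 2, 5, 5, 8, 4, 2]

# Progressive total after each interval.
cumulative = list(accumulate(frequencies))

# The same totals expressed as a percentage of the sample size.
total = cumulative[-1]
percent = [round(100 * c / total, 1) for c in cumulative]

for freq, cum, pct in zip(frequencies, cumulative, percent):
    print(f"{freq:2d}  {cum:2d}  {pct:5.1f}")
```

Since every frequency is zero or positive, each cumulative value is at least as large as the one before it.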

Figure 3.3 A cumulative frequency histogram for δ13C data for limestones. [X axis: δ13C ‰ (−4 to 4); Y axis: count (0 to 28).]

Although we have given the rather tedious manual procedures for constructing histograms, you will ﬁnd that most statistical software packages

(and spreadsheets) have excellent graphics programs for displaying your

data. These will automatically select an interval width, summarize the data

and plot the graph of your choice.

3.4 Displaying ordinal or nominal scale data

When you display data for ordinal or nominal scale variables, you need to modify the form of the graph slightly because the categories are unlikely to be continuous, so the bars need to be separated to clearly indicate the lack of continuity. Here is an example for some nominal scale data. Table 3.2 gives the locations of 594 tornadoes during the period from 1998 to 2007 in the southeastern states of the US.

These can be displayed on a bar graph with the categories in any order along

the X axis and the number of cases on the Y axis (Figure 3.4(a)). It often helps

to rank the data in order of magnitude to aid interpretation (Figure 3.4(b)).
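Ranking the categories by magnitude amounts to a sort, as this sketch shows using the counts from Table 3.2. Breaking ties alphabetically is an assumption made here so that the output is deterministic; the text bar chart (one `#` per five tornadoes) is only illustrative.

```python
# Tornado counts for 1998-2007 by state, copied from Table 3.2.
tornadoes = {
    "Texas": 95, "Oklahoma": 68, "Louisiana": 38, "Arkansas": 68,
    "Mississippi": 68, "Alabama": 64, "Georgia": 48, "Tennessee": 44,
    "North Carolina": 36, "South Carolina": 48, "Florida": 17,
}

# Sort from most to least; ties are broken alphabetically (an assumption).
ranked = sorted(tornadoes.items(), key=lambda item: (-item[1], item[0]))

# Separate rows stand in for the separated bars of a bar graph.
for state, count in ranked:
    print(f"{state:<15} {count:3d} {'#' * (count // 5)}")
```

Since the states are nominal categories, any order along the axis is valid; sorting simply makes the largest and smallest counts easier to pick out.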

Table 3.2 Preliminary data on tornado occurrence in southeastern US states from 1998–2007, according to the NOAA National Weather Service Storm Prediction Center (www.spc.noaa.gov/wcm/).

| Location | Number of tornadoes, 1998–2007 |
|---|---|
| Texas | 95 |
| Oklahoma | 68 |
| Louisiana | 38 |
| Arkansas | 68 |
| Mississippi | 68 |
| Alabama | 64 |
| Georgia | 48 |
| Tennessee | 44 |
| North Carolina | 36 |
| South Carolina | 48 |
| Florida | 17 |

3.5 Bivariate data

Data where two variables have been measured on each sampling unit can often reveal patterns that may suggest hypotheses, or be useful for testing them. Here is another case where the mineral apatite affects public health (in Chapter 2 there was an example where apatite was used to clean up lead waste – this is about hydroxylapatite in your teeth). Table 3.3 gives two lists of bivariate data for the number of dental caries (these are the holes that develop in decaying teeth) and age for 20 children between the ages of one and nine years from each of the cities of Hale and Yarvard.

Looking at these data, there is not anything that stands out, apart from an

increase in the number of caries with age. If you calculate descriptive

statistics such as the average age and average number of dental caries for

each of the two groups (Table 3.4) they are not very informative either. (You

probably know how to calculate the average for a set of data and this

procedure will be described in Chapter 7, but the average is the sum of all

the values divided by the sample size.)
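The averages in Table 3.4 can be reproduced directly from the definition just given. The four lists below are the columns of Table 3.3; the helper name `average` is chosen here for illustration.

```python
# Columns of Table 3.3: one entry per child, 20 children per city.
hale_caries = [1, 1, 4, 4, 5, 6, 2, 9, 4, 2, 7, 3, 9, 11, 1, 1, 3, 1, 1, 6]
hale_age = [3, 2, 4, 3, 6, 5, 3, 9, 5, 1, 8, 4, 8, 9, 2, 4, 7, 1, 1, 5]
yarvard_caries = [10, 1, 12, 1, 1, 11, 2, 14, 2, 8, 1, 4, 1, 1, 7, 1, 1, 1, 2, 1]
yarvard_age = [9, 5, 9, 2, 2, 9, 3, 9, 6, 9, 1, 7, 1, 5, 8, 7, 6, 4, 6, 2]


def average(values):
    """Sum of all the values divided by the sample size."""
    return sum(values) / len(values)


print("Hale:    caries", average(hale_caries), "age", average(hale_age))
print("Yarvard: caries", average(yarvard_caries), "age", average(yarvard_age))
```

The two cities come out almost identical on these summary values, which is why the text turns to a graph to look for a pattern.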

Table 3.4 shows that the sample from Yarvard had slightly more caries on average than the one from Hale, but this is not surprising because the Yarvard sample was an average of one year older. If, however, you graph these data, patterns emerge. One way of displaying bivariate data is a two-dimensional plot with increasing values of one variable on the horizontal (or X) axis and increasing values of the second variable on the vertical (or Y) axis. Figure 3.5 shows both sets of data with the number of caries (Y axis) plotted against child age (X axis) for each city.

Figure 3.4 (a) Preliminary data on tornado occurrence in southeastern US states (listed alphabetically) from 1998–2007. (b) The same data but with the number of cases ranked in order from most to least. [Two bar graphs with separated bars: states on the X axis, number of cases (0 to 100) on the Y axis. Panel (a) order: AL AR FL GA LA MS NC OK SC TN TX. Panel (b) order: TX AR MS OK AL GA SC TN LA NC FL.]

These graphs show that tooth decay increases with age, but the pattern

diﬀers between cities – in Hale the increase is fairly steady, but in Yarvard it

remains low in children up to age seven but then suddenly increases. This

led to several hypotheses including that there might have been a child dental

care program, or water ﬂuoridation, in place in Yarvard for the past eight

years compared to no action on decay in Hale.

Of course, there is always the possibility that the samples are diﬀerent due

to chance, so perhaps the ﬁrst step in any further investigation would be to

repeat the sampling using much larger numbers of children from each city.

Subsequent investigation found that the Yarvard municipal drinking water had been fluoridated for the past eight years, but this treatment had not been introduced in Hale. The fluoride program works because your teeth are made of the mineral hydroxyapatite (the same mineral that binds to heavy metals). In this case the apatite in your teeth binds fluorine ions which substitute for hydroxyls in the apatite structure, making the enamel of your teeth less soluble and therefore less prone to decay. This seems a very plausible reason, but bear in mind that these data are only correlative and there may be other reason(s) for the difference between the two cities.

Table 3.3 The number of dental caries and age of 20 children chosen at random from each of the two cities of Hale and Yarvard.

| Hale: Caries | Hale: Age | Yarvard: Caries | Yarvard: Age |
|---|---|---|---|
| 1 | 3 | 10 | 9 |
| 1 | 2 | 1 | 5 |
| 4 | 4 | 12 | 9 |
| 4 | 3 | 1 | 2 |
| 5 | 6 | 1 | 2 |
| 6 | 5 | 11 | 9 |
| 2 | 3 | 2 | 3 |
| 9 | 9 | 14 | 9 |
| 4 | 5 | 2 | 6 |
| 2 | 1 | 8 | 9 |
| 7 | 8 | 1 | 1 |
| 3 | 4 | 4 | 7 |
| 9 | 8 | 1 | 1 |
| 11 | 9 | 1 | 5 |
| 1 | 2 | 7 | 8 |
| 1 | 4 | 1 | 7 |
| 3 | 7 | 1 | 6 |
| 1 | 1 | 1 | 4 |
| 1 | 1 | 2 | 6 |
| 6 | 5 | 1 | 2 |

Table 3.4 The average number of dental caries and age of 20 children chosen at random from each of the two cities of Hale and Yarvard.

| City | Caries | Age |
|---|---|---|
| Hale | 4.05 | 4.5 years |
| Yarvard | 4.10 | 5.5 years |

Figure 3.5 The number of dental caries plotted against the age of 20 children chosen at random from each of the two cities of (a) Hale and (b) Yarvard. [Both panels: age (years, 0 to 10) on the X axis, number of caries (0 to 12) on the Y axis.]

3.6 Data expressed as proportions of a total

Data for the relative frequencies in two or more categories that sum to a total of 1.0, or 100%, can be displayed as a pie diagram – a circle in which each of the categories is displayed as a “slice,” the size of which is proportional to its value. For example, a sample containing four different minerals that are equally abundant would be shown as a circle subdivided into four equal 90° slices. Pie diagrams are easily interpreted when there are 10 or fewer categories and each contains at least 10% of the data (Figure 3.6).

When there are more than 10 categories the display will appear cluttered even when slices are distinguished by their color, and it will be even harder to differentiate among a lot of categories shown only as black, white and shades of grey. Categories representing a relatively small number or proportion of total cases will appear very narrow and may be overlooked.

The procedure for drawing a pie diagram showing either the relative

proportion of cases in several categories, or the values of two or more

variables (e.g. the concentrations of six diﬀerent ions) is straightforward.

First, the data for each category are listed, summed to give a total, and then

expressed as proportions of this total. Each proportion is then multiplied by

360 to give the width of the slice in degrees, which is used to draw the

appropriate divisions on the pie diagram.
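The slice calculation can be sketched as below. The function name `slice_angles` and the mineral counts in the second example are hypothetical, invented purely for illustration; only the case of four equally abundant minerals comes from the text.

```python
def slice_angles(values):
    """Return the width in degrees of each pie slice.

    values maps each category name to its count or measured value.
    """
    total = sum(values.values())
    # Express each value as a proportion of the total, then multiply by 360.
    return {name: (value / total) * 360 for name, value in values.items()}


# Four equally abundant minerals, as in the text: each slice should be 90 degrees.
equal = slice_angles({"Mineral A": 25, "Mineral B": 25, "Mineral C": 25, "Mineral D": 25})

# Hypothetical counts for a granite sample (invented numbers, not from the text).
counts = {"Quartz": 30, "K-feldspar": 25, "Plagioclase": 35, "Biotite": 10}
angles = slice_angles(counts)

for name, angle in angles.items():
    print(f"{name:<12} {angle:6.1f} degrees")
```

Whatever the input values, the angles always sum to 360 degrees, which is a useful check before drawing the diagram.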

Figure 3.6 Pie diagrams comparing the mineralogy of two different granites. From this type of comparison it is clear that the rock in (a) has far less K-feldspar and much more hornblende compared to the rock in (b). [Both panels show slices for quartz, K-feldspar, plagioclase, biotite and hornblende.]

3.7 Display of geographic direction or orientation

Rose diagrams are used to show a summary of the direction or orientation

of a sample of objects such as crystals or fractures in rock, or the geographic

orientation of paleocurrent directions in ancient river systems. For example,

a unimodal paleocurrent implies a river with steep slopes, but a bimodal one

suggests a meandering river with a low slope. Rose diagrams are also

commonly used by meteorologists to report the direction and magnitude

of winds. The procedures for drawing rose diagrams and analyzing data for

direction and orientation are described in Chapter 22.

3.8 Multivariate data

Often earth scientists have data for three or more variables measured on the

same sampling unit. For example, a geologist might have data for mineralogy,

chemical composition, geological age and metamorphic grade for 20 outcrops

across a zone of contact metamorphism, or a paleontologist might have data

for the numbers of several species of brachiopods from a speciﬁc formation.

Results for three variables could be shown as three-dimensional graphs,

but direct display is diﬃcult for more than this number of variables. Some

relatively new statistical techniques have made it possible to condense and

summarize multivariate data in a two-dimensional display, and these are

introduced in Chapter 20.
