2 Variables, sampling units and types of data
Collecting and displaying data
or a defined item (e.g. a square meter of the outcrop, a specific stratigraphic
unit, or a particular locality).
If you only measure one variable per sampling unit the data set is univariate. Data for two variables per unit are bivariate, while data for three or
more variables measured on the same sampling unit are multivariate.
Variables can be measured on four scales – ratio, interval, ordinal or
nominal.
A ratio scale describes a variable whose numerical values truly indicate
the quantity being measured.
* There is a true zero point below which you cannot have any data
  (for example, if you are measuring the length of feldspar crystals in a
  thin section, you cannot have a crystal of negative length).
* An increase of the same numerical amount indicates the same quantity
  across the range of measurements (for example, a 0.2 mm and a 2 mm
  feldspar will have grown by the same amount if they both increase in
  length by 10 mm).
* A particular ratio holds across the range of the variable (for example,
  a 200 μm feldspar grain is twenty times longer than a 10 μm grain and a
  100 μm grain is also twenty times longer than a 5 μm one).
An interval scale describes a variable that can be less than zero.
* The zero point is arbitrary (for example, temperature measured in
  degrees Celsius has a zero point at which water freezes), so negative
  values are possible. The true zero point for temperature, where there is
  a complete absence of heat, is zero kelvin (about –273 °C), so (unlike
  Celsius) the kelvin is a ratio scale.
* An increase of the same numerical amount indicates the same quantity
  across the range of measurements (for example, a 2 °C increase indicates
  the same increase in heat whatever the starting temperature).
* Because the zero point is arbitrary, a particular ratio does not hold across
  the range of the variable. For example, the ratio of 6 °C compared to 1 °C
  is not the same as 60 °C to 10 °C. The two ratios in terms of the kelvin
  scale are 279:274 K and 333:283 K.
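The temperature example can be checked with a few lines of Python. This is only a sketch: the names `to_kelvin` and `KELVIN_OFFSET` are made up for illustration, and the offset of 273.15 is the exact conversion value rather than the rounded 273 implied by the figures above.

```python
# Ratios on an interval scale (Celsius) compared with a ratio scale (kelvin).
KELVIN_OFFSET = 273.15  # 0 °C expressed in kelvin (illustrative constant name)


def to_kelvin(celsius):
    """Convert a temperature from degrees Celsius to kelvin."""
    return celsius + KELVIN_OFFSET


# The two pairs of temperatures used in the text: (higher, lower) in °C.
pairs_celsius = [(6.0, 1.0), (60.0, 10.0)]

# Ratio of the raw Celsius numbers, and ratio of the same temperatures in kelvin.
celsius_ratios = [high / low for high, low in pairs_celsius]
kelvin_ratios = [to_kelvin(high) / to_kelvin(low) for high, low in pairs_celsius]

for (high, low), rc, rk in zip(pairs_celsius, celsius_ratios, kelvin_ratios):
    print(f"{high:g} °C vs {low:g} °C: Celsius ratio {rc:.2f}, kelvin ratio {rk:.3f}")
```

The Celsius numbers give the same ratio for both pairs, but once the temperatures are expressed from a true zero the two ratios differ, which is why a ratio of Celsius values should not be interpreted as a ratio of the quantity itself.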
An ordinal scale applies to data where values are ranked, which means
they are given a value that simply indicates their relative order. For
example, five mountains with elevations of 10 000 m, 4500 m, 4300 m,
4000 m and 3984 m have been measured on a ratio scale. If you rank these in
order, from highest to lowest, as 5, 4, 3, 2 and 1, the data have been reduced
to an ordinal scale, but this is not very informative and does not mean that
the highest mountain is five times the elevation of the lowest. For ordinal
data, an increase of the same number of ranks does not necessarily indicate
the same quantity across the range of the variable.
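The loss of information when elevations are reduced to ranks can be seen in a short sketch. The variable names are illustrative only; the elevations are the five given in the example.

```python
# Reduce ratio-scale elevations to ordinal ranks (highest mountain = rank 5).
elevations_m = [10000, 4500, 4300, 4000, 3984]

ascending = sorted(elevations_m)
rank_of = {elevation: position for position, elevation in enumerate(ascending, start=1)}
ranks = [rank_of[elevation] for elevation in elevations_m]

# Difference in elevation between mountains that are one rank apart.
gaps_m = [higher - lower for lower, higher in zip(ascending, ascending[1:])]

# Ratio of the highest to the lowest elevation, compared with the ratio of their ranks.
elevation_ratio = max(elevations_m) / min(elevations_m)
rank_ratio = max(ranks) / min(ranks)

print("ranks:", ranks)
print("gap between adjacent ranks (m):", gaps_m)
print(f"elevation ratio {elevation_ratio:.2f}, rank ratio {rank_ratio:.0f}")
```

A step of one rank corresponds to anything from 16 m to 5500 m here, and the highest mountain is about 2.5 times the elevation of the lowest, not five times.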
A nominal scale applies to data where the values are classified according
to an attribute. For example, the breakdown of rocks at the Earth’s surface
can be classified as either chemical or mechanical weathering, so a sample of
different sediments can be subdivided into the numbers within each of these
two categories. You might have a sample of ten, of which three fall in the
“chemical” category and the remaining seven in the “mechanical” one.
The first three types of data described above can include either
continuous or discrete data. Nominal scale data (since they are attributes) can
only be discrete.
Continuous data can have any value within a range. For example, any
value of temperature is possible within the range from 10 °C to 20 °C, such
as 15.3 °C or 17.82 °C.
Discrete data are very different from continuous data because they can
only have fixed numerical values within a range. For example, the number of
electrons in an atom increases from one fixed whole number to the next,
because you cannot have a fraction of an electron.
It is important that you know what type of data you are dealing with
because this will be one of the factors that determines your choice of
statistical test.
3.3 Displaying data
A list of data may reveal very little, but a pictorial summary is a way of
exploring the data that might help you notice a pattern, which can help
generate or test hypotheses.
3.3.1 Histograms
Here is a list of the number of visits made to their lecturer’s office by a sample
of 60 students chosen at random from 320 students in the course
Introductory Geoscience. These data are univariate, ratio scaled and discrete.
1, 1, 6, 1, 12, 1, 2, 6, 2, 7, 2, 2, 5, 2, 1, 2, 1, 9, 1, 8, 1, 1, 2, 5, 1, 6, 1, 1, 1, 5, 1, 1,
1, 2, 2, 3, 2, 3, 3, 3, 3, 3, 4, 5, 6, 7, 8, 9, 4, 1, 1, 9, 10, 1, 4, 10, 11, 1, 2, 3
It is difficult to see any pattern from this list of numbers, but you could
summarize and display these data by drawing a histogram. To do this you
separately count the number (the frequency) of cases for students who
visited never, once, twice, three times, through to the maximum number of
visits and plot these as a series of rectangles on a graph with the X axis
showing the number of visits and the Y axis the number of students in each
of these cases. Figure 3.1 shows a histogram of these data.
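The counting step described above can be sketched in Python using `collections.Counter` from the standard library. The text-based bars are only a rough stand-in for a plotted histogram, and the variable names are illustrative.

```python
from collections import Counter

# Number of office visits for the sample of 60 students listed above.
visits = [
    1, 1, 6, 1, 12, 1, 2, 6, 2, 7, 2, 2, 5, 2, 1, 2, 1, 9, 1, 8,
    1, 1, 2, 5, 1, 6, 1, 1, 1, 5, 1, 1, 1, 2, 2, 3, 2, 3, 3, 3,
    3, 3, 4, 5, 6, 7, 8, 9, 4, 1, 1, 9, 10, 1, 4, 10, 11, 1, 2, 3,
]

# Frequency of each distinct number of visits.
frequency = Counter(visits)

# One row per possible number of visits, from zero up to the maximum observed.
for n_visits in range(0, max(visits) + 1):
    count = frequency[n_visits]  # Counter returns 0 for values never observed
    print(f"{n_visits:>2} | {'#' * count} {count}")

five_or_more = sum(count for n, count in frequency.items() if n >= 5)
print("students with five or more visits:", five_or_more)
```

Starting the loop at zero makes the empty "never visited" category visible, which is the feature of these data discussed in the next paragraph.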
This visual summary shows that the distribution is skewed to the right –
most students made few visits for help, but there is a long upper “tail” who
have made five or more visits. Incidentally, looking at the graph you
may be a little suspicious because every student made at least one visit.
This was because each of them had to visit the lecturer’s office to pick up
an assignment during the first three weeks of class to ensure they knew
where to go if they did ever need help, so these data are somewhat
misleading in terms of indicating the neediness of the group. You may
be tempted to draw a line joining the midpoints of the tops of each bar to
indicate the shape of the distribution, but this implies that the data on the
X axis are continuous, which is not the case because visits are discrete
whole numbers.
[Figure 3.1: histogram. X axis: Number of visits (1 to 12). Y axis: Number of students (0 to 20).]
Figure 3.1 The number of visits made to their lecturer’s office by a sample of
60 students chosen at random from 320 students in the course Introductory
Geoscience.
3.3.2 Frequency polygons or line graphs
If the data are continuous, it is appropriate to draw a line linking the
midpoint of the tops of each bar in the histogram. Here is a geological
example for some continuous data that can be summarized as a histogram
or as a frequency polygon (often called a line graph). Carbon isotope data
are very useful for understanding the global distribution of carbon between
the Earth’s atmosphere, seawater and carbonate minerals. The δ13C of
carbonate minerals can provide information about variations of δ13C in
ocean water, which can be related to the global carbon cycle and palaeoceanographic circulation patterns.
A sample of 28 “muddy” limestones (wackestones) was collected from an
extended outcrop, and isotopic analyses for δ13C ‰ were obtained. Nothing
is very obvious from this list of results:
1.01, 0.59, 2.32, 0.19, −2.39, −3.76, −0.8, 1.6, 0.28, −1.62, −0.33, −1.26,
−0.01, 1.36, 0.99, 1.12, −0.45, 0.71, 1.12, −0.72, 1.36, 1.59, 2.27, 2.25, 3.05,
2.58, 1.94, 3.28
Because the data are continuous, they are not as easy to summarize as the
discrete data in Figure 3.1. To display a histogram for continuous data you
need to subdivide the data into the frequency of cases within a series of
intervals of equal width. First you need to look at the range of the data (here
δ13C ‰ varies from a minimum of −3.76 through to a maximum of 3.28)
and decide on an interval width that will give you an informative display of
the data. Here the chosen width is 1.0 ‰. Therefore, starting from −4.0 ‰,
this will give 8 intervals, the first of which is −4 to −3.01 ‰. The chosen
interval width needs to be one that shows the shape of the distribution: there
would be no point in choosing a width that included all the data in just two
intervals because you would only have two bars on the histogram. Nor
would there be any point in choosing more than 20 intervals because this
would give a lot of bars with each containing only a few data.
Once you have decided on an appropriate interval size, you need to count
the number of cases with δ13C values that fall within each interval
(Table 3.1) and plot these frequencies on the Y axis against the intervals
(indicated by the midpoint of each interval) on the X axis. This has been
done in Figure 3.2(a). Finally, the midpoints of the tops of each rectangle
have been joined by a line to give a frequency polygon, or line graph
(Figure 3.2(b)).
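The binning procedure can be sketched as follows. The sketch assumes that each interval includes its lower edge but not its upper edge, which is consistent with the interval labels used in Table 3.1; the names `START`, `WIDTH` and `N_BINS` are illustrative.

```python
import math

# δ13C (per mil) for the 28 limestone samples listed above.
d13c = [
    1.01, 0.59, 2.32, 0.19, -2.39, -3.76, -0.8, 1.6, 0.28, -1.62, -0.33, -1.26,
    -0.01, 1.36, 0.99, 1.12, -0.45, 0.71, 1.12, -0.72, 1.36, 1.59, 2.27, 2.25,
    3.05, 2.58, 1.94, 3.28,
]

START = -4.0  # lower edge of the first interval
WIDTH = 1.0   # chosen interval width
N_BINS = 8    # intervals needed to cover -4 up to 4

# Count the cases in each interval: floor((x - START) / WIDTH) is the bin index.
counts = [0] * N_BINS
for value in d13c:
    counts[math.floor((value - START) / WIDTH)] += 1

# Interval midpoints give the X positions for the frequency polygon.
midpoints = [START + WIDTH * (i + 0.5) for i in range(N_BINS)]
polygon_points = list(zip(midpoints, counts))

for midpoint, count in polygon_points:
    print(f"{midpoint:>5.1f} | {'#' * count} {count}")
```

Changing `WIDTH` (and `START` and `N_BINS` to match) is a quick way to see how the apparent shape of the distribution depends on the interval width you choose.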
Table 3.1 Summary of δ13C ‰ data for limestones listed as frequencies
and cumulative frequencies.

Interval range               Cumulative frequency
δ13C ‰            Cases      Total      Percent
−4 to −3.01       1          1          3.6
−3 to −2.01       1          2          7.1
−2 to −1.01       2          4          14.3
−1 to −0.01       5          9          32.1
0 to 0.99         5          14         50.0
1 to 1.99         8          22         78.6
2 to 2.99         4          26         92.9
3 to 3.99         2          28         100.0
[Figure 3.2: (a) histogram and (b) frequency polygon. X axis: δ13C ‰ (−4 to 4). Y axis: Frequency (0 to 8).]
Figure 3.2 Carbon isotope data for 28 sampling units of limestone from the
same outcrop, displayed as (a) a histogram and (b) a frequency polygon or line
graph. The points on the frequency polygon (b) correspond to the midpoints
of the bars on (a).
3.3.3 Cumulative graphs
Often it is useful to display data as a histogram of cumulative frequencies.
This is a graph that displays the progressive total (starting at zero, or zero
percent and finishing at the sample size or 100%) on the Y axis against the
increasing value of the variable on the X axis. Figure 3.3 gives an example,
using the data from Table 3.1.
A cumulative frequency graph can never decrease, because each value is the
running total of all the frequencies up to that point.
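The cumulative totals and percentages in Table 3.1 can be reproduced from the interval frequencies with `itertools.accumulate`. This is a minimal sketch with illustrative variable names.

```python
from itertools import accumulate

# Cases per interval from Table 3.1, from the lowest interval to the highest.
cases = [1, 1, 2, 5, 5, 8, 4, 2]

# Running total of cases, then the same totals as a percentage of the sample.
cumulative_total = list(accumulate(cases))
sample_size = cumulative_total[-1]
cumulative_percent = [round(100 * total / sample_size, 1) for total in cumulative_total]

# A cumulative frequency can never decrease from one interval to the next.
never_decreases = all(a <= b for a, b in zip(cumulative_total, cumulative_total[1:]))

for n, total, percent in zip(cases, cumulative_total, cumulative_percent):
    print(f"cases {n:>2}  total {total:>2}  percent {percent:>5.1f}")
```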
[Figure 3.3: cumulative frequency histogram. X axis: δ13C ‰ (−4 to 4). Y axis: Count (0 to 28).]
Figure 3.3 A cumulative frequency histogram for δ13C data for limestones.
Although we have given the rather tedious manual procedures for constructing
histograms, you will find that most statistical software packages
(and spreadsheets) have excellent graphics programs for displaying your
data. These will automatically select an interval width, summarize the data
and plot the graph of your choice.
3.4 Displaying ordinal or nominal scale data
When you display data for ordinal or nominal scale variables, you need to
modify the form of the graph slightly because the categories are unlikely to
be continuous, so the bars need to be separated to clearly indicate the lack of
continuity. Here is an example for some nominal scale data. Table 3.2 gives
the locations of 594 tornadoes during the period from 1998 to 2007 in the
southeastern states of the US.
These can be displayed on a bar graph with the categories in any order along
the X axis and the number of cases on the Y axis (Figure 3.4(a)). It often helps
to rank the data in order of magnitude to aid interpretation (Figure 3.4(b)).
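Ranking the categories before plotting is a simple sort. The sketch below uses the counts from Table 3.2; breaking ties alphabetically is an assumption made here so that the order is reproducible, and the text bars only approximate a bar graph.

```python
# Tornado counts by state, 1998–2007, from Table 3.2.
tornadoes = {
    "Texas": 95, "Oklahoma": 68, "Louisiana": 38, "Arkansas": 68,
    "Mississippi": 68, "Alabama": 64, "Georgia": 48, "Tennessee": 44,
    "North Carolina": 36, "South Carolina": 48, "Florida": 17,
}

# (a) Categories in alphabetical order: for nominal data any order is valid.
alphabetical = sorted(tornadoes.items())

# (b) Categories ranked from most to least; ties are broken alphabetically.
ranked = sorted(tornadoes.items(), key=lambda item: (-item[1], item[0]))

total = sum(tornadoes.values())
for state, count in ranked:
    # One '#' per five tornadoes, as a rough text-only bar.
    print(f"{state:<15}{count:>4}  {'#' * (count // 5)}")
print("total:", total)
```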
Table 3.2 Preliminary data on tornado occurrence in
southeastern US states from 1998–2007, according to the
NOAA National Weather Service Storm Prediction Center
(www.spc.noaa.gov/wcm/).

Location            Number of tornadoes 1998–2007
Texas               95
Oklahoma            68
Louisiana           38
Arkansas            68
Mississippi         68
Alabama             64
Georgia             48
Tennessee           44
North Carolina      36
South Carolina      48
Florida             17

3.5 Bivariate data
Data where two variables have been measured on each sampling unit can
often reveal patterns that may suggest hypotheses, or be useful for testing
them. Here is another case where the mineral apatite affects public health (in
Chapter 2 there was an example where apatite was used to clean up lead
waste – this is about hydroxylapatite in your teeth). Table 3.3 gives two lists
of bivariate data for the number of dental caries (these are the holes that
develop in decaying teeth) and age for 20 children between the ages of one
and nine years from each of the cities of Hale and Yarvard.
Looking at these data, there is not anything that stands out, apart from an
increase in the number of caries with age. If you calculate descriptive
statistics such as the average age and average number of dental caries for
each of the two groups (Table 3.4) they are not very informative either. (The
average is the sum of all the values divided by the sample size; its
calculation is described in Chapter 7.)
Table 3.4 shows that the sample from Yarvard had slightly more caries on
average than the one from Hale, but this is not surprising because the
Yarvard sample was an average of one year older. If, however, you graph
these data, patterns emerge. One way of displaying bivariate data is a
two-dimensional plot with increasing values of one variable on the horizontal
(or X axis) and increasing values of the second variable on the vertical
(or Y axis). Figure 3.5 shows both sets of data with the number of caries
(Y axis) plotted against child age (X axis) for each city.
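The contrast between the averages and the pattern within each city can be sketched in Python. The lists below are taken from Table 3.3, under the assumption that the values in each column are listed in the same child order, so that the first caries value pairs with the first age, and so on. The helper `caries_by_age` is a made-up name for illustration.

```python
from collections import defaultdict
from statistics import mean

# Values from Table 3.3 (pairing of caries with age by position is assumed).
hale_caries = [1, 1, 4, 4, 5, 6, 2, 9, 4, 2, 7, 3, 9, 11, 1, 1, 3, 1, 1, 6]
hale_age = [3, 2, 4, 3, 6, 5, 3, 9, 5, 1, 8, 4, 8, 9, 2, 4, 7, 1, 1, 5]
yarvard_caries = [10, 1, 12, 1, 1, 11, 2, 14, 2, 8, 1, 4, 1, 1, 7, 1, 1, 1, 2, 1]
yarvard_age = [9, 5, 9, 2, 2, 9, 3, 9, 6, 9, 1, 7, 1, 5, 8, 7, 6, 4, 6, 2]


def caries_by_age(ages, caries):
    """Group the caries counts by child age, returned in order of increasing age."""
    groups = defaultdict(list)
    for age, n_caries in zip(ages, caries):
        groups[age].append(n_caries)
    return dict(sorted(groups.items()))


# The averages alone (compare Table 3.4) hide the difference between the cities.
print(f"Hale:    mean caries {mean(hale_caries):.2f}, mean age {mean(hale_age):.1f}")
print(f"Yarvard: mean caries {mean(yarvard_caries):.2f}, mean age {mean(yarvard_age):.1f}")

# Listing caries against age shows the pattern a scatter plot would reveal.
for city, ages, caries in [("Hale", hale_age, hale_caries),
                           ("Yarvard", yarvard_age, yarvard_caries)]:
    print(city)
    for age, values in caries_by_age(ages, caries).items():
        print(f"  age {age}: {values}")
```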
[Figure 3.4: bar graphs (a) and (b). X axis: Location of tornado (US state). Y axis: Number of tornadoes (0 to 100). Order in (a): AL AR FL GA LA MS NC OK SC TN TX. Order in (b): TX AR MS OK AL GA SC TN LA NC FL.]
Figure 3.4 (a) Preliminary data on tornado occurrence in southeastern US
states (listed alphabetically) from 1998–2007. (b) The same data but with the
number of cases ranked in order from most to least.
These graphs show that tooth decay increases with age, but the pattern
differs between cities: in Hale the increase is fairly steady, but in Yarvard it
remains low in children up to age seven but then suddenly increases. This
led to several hypotheses including that there might have been a child dental
care program, or water fluoridation, in place in Yarvard for the past eight
years compared to no action on decay in Hale.
Of course, there is always the possibility that the samples are different due
to chance, so perhaps the first step in any further investigation would be to
repeat the sampling using much larger numbers of children from each city.
Subsequent investigation found that the Yarvard municipal drinking
water had been fluoridated for the past eight years, but this treatment had
not been introduced in Hale. The fluoride program works because your
teeth are made of the mineral hydroxyapatite (the same mineral that binds
to heavy metals). In this case the apatite in your teeth binds fluorine ions
which substitute for hydroxyls in the apatite structure, making the enamel
of your teeth less soluble and therefore less prone to decay. This seems a very
plausible reason, but bear in mind that these data are only correlative and
there may be other reason(s) for the difference between the two cities.

Table 3.3 The number of dental caries and age of 20 children
chosen at random from each of the two cities of Hale and Yarvard.

Hale                 Yarvard
Caries    Age        Caries    Age
1         3          10        9
1         2          1         5
4         4          12        9
4         3          1         2
5         6          1         2
6         5          11        9
2         3          2         3
9         9          14        9
4         5          2         6
2         1          8         9
7         8          1         1
3         4          4         7
9         8          1         1
11        9          1         5
1         2          7         8
1         4          1         7
3         7          1         6
1         1          1         4
1         1          2         6
6         5          1         2

Table 3.4 The average number of dental caries and
age of 20 children chosen at random from each of the
two cities of Hale and Yarvard.

Hale                     Yarvard
Caries    Age            Caries    Age
4.05      4.5 years      4.10      5.5 years

[Figure 3.5: scatter plots (a) and (b). X axis: Age (years) (0 to 10). Y axis: Number of caries (0 to 12).]
Figure 3.5 The number of dental caries plotted against the age of 20 children
chosen at random from each of the two cities of (a) Hale and (b) Yarvard.
3.6 Data expressed as proportions of a total
Data for the relative frequencies in two or more categories that sum to a
total of 1.0, or 100%, can be displayed as a pie diagram: a circle in which
each of the categories is displayed as a “slice,” the size of which is
proportional to its value. For example, a sample containing four different
minerals that are equally abundant would be shown as a circle subdivided into
four equal 90° slices. Pie diagrams are easily interpreted when there are 10
or fewer categories and each contains at least 10% of the data (Figure 3.6).
When there are more than 10 categories the display will appear cluttered
even when slices are distinguished by their color, and it will be harder
still to differentiate among a lot of categories shown only as black, white
and shades of grey. Categories representing a relatively small number or
proportion of total cases will appear very narrow and may be overlooked.
The procedure for drawing a pie diagram showing either the relative
proportion of cases in several categories, or the values of two or more
variables (e.g. the concentrations of six different ions) is straightforward.
First, the data for each category are listed, summed to give a total, and then
expressed as proportions of this total. Each proportion is then multiplied by
360 to give the width of the slice in degrees, which is used to draw the
appropriate divisions on the pie diagram.
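The calculation of slice widths can be sketched as a small function. The name `slice_angles` is illustrative, and the two inputs reuse examples from earlier in the chapter: four equally abundant minerals (labelled A to D here as placeholders), and the sample of ten sediments split into three "chemical" and seven "mechanical" cases.

```python
def slice_angles(values):
    """Return the width in degrees of each pie slice for a dict of category values."""
    total = sum(values.values())
    angles = {}
    for category, value in values.items():
        proportion = value / total            # express as a proportion of the total
        angles[category] = proportion * 360   # convert the proportion to degrees
    return angles


# Four equally abundant minerals (placeholder names): four 90 degree slices.
equal_minerals = slice_angles({"A": 1, "B": 1, "C": 1, "D": 1})

# Ten sediment samples classified by weathering type.
weathering = slice_angles({"chemical": 3, "mechanical": 7})

for name, angles in [("equal minerals", equal_minerals), ("weathering", weathering)]:
    print(name)
    for category, angle in angles.items():
        print(f"  {category}: {angle:.1f} degrees")
```

Whatever the input, the slice widths should sum to 360 degrees (apart from floating-point rounding), which is a useful check before drawing the diagram.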
[Figure 3.6: pie diagrams (a) and (b), each with slices for quartz, K-feldspar, plagioclase, biotite and hornblende.]
Figure 3.6 Pie diagrams comparing the mineralogy of two different granites.
From this type of comparison it is clear that the rock in (a) has far less
K-feldspar and much more hornblende compared to the rock in (b).
3.7 Display of geographic direction or orientation
Rose diagrams are used to show a summary of the direction or orientation
of a sample of objects such as crystals or fractures in rock, or the geographic
orientation of paleocurrent directions in ancient river systems. For example,
a unimodal paleocurrent implies a river with steep slopes, but a bimodal one
suggests a meandering river with a low slope. Rose diagrams are also
commonly used by meteorologists to report the direction and magnitude
of winds. The procedures for drawing rose diagrams and analyzing data for
direction and orientation are described in Chapter 22.
3.8 Multivariate data
Often earth scientists have data for three or more variables measured on the
same sampling unit. For example, a geologist might have data for mineralogy,
chemical composition, geological age and metamorphic grade for 20 outcrops
across a zone of contact metamorphism, or a paleontologist might have data
for the numbers of several species of brachiopods from a specific formation.
Results for three variables could be shown as three-dimensional graphs,
but direct display is difficult for more than this number of variables. Some
relatively new statistical techniques have made it possible to condense and
summarize multivariate data in a two-dimensional display, and these are
introduced in Chapter 20.