4.8 The interquartile range; the quartile deviation
FUNDAMENTALS OF BUSINESS MATHEMATICS
DESCRIPTIVE STATISTICS
STUDY MATERIAL C3
In fact, to determine the interquartile range, we adopt the same approach as we did for the median. First of all, we assume that the wage values are evenly spread throughout their classes, and draw the ogive. The necessary cumulative frequency distribution is:
Basic weekly wage (£) (less than)    Cumulative frequency
200                                   16
225                                  169
250                                  270
275                                  362
300                                  430
It will be noted that it is unnecessary to close the final class in order to draw the ogive, and so we do not do so. The ogive is shown in Figure 4.6. Alternatively, a sensible closing value, such as £325, could be selected and an extra point, with cumulative frequency 480, added to the ogive.
Figure 4.6 Ogive: wage distribution for Example 4.8.1 (cumulative frequency plotted against weekly wage (£), with the constructions for Q1 and Q3 marked).
We now note that the total frequency is 480, and so, from the constructions shown on the ogive, we have the following approximations:
Q3 (corresponding to a cumulative frequency of 3/4 of 480, or 360) = £274
Q1 (corresponding to a cumulative frequency of 1/4 of 480, or 120) = £217
and thus:
Interquartile range = £274 − £217 = £57.
Thus, the manager could use an approximate measure of the spread of wages of the middle 50 per cent of the workforce of £57.
There is a very closely related measure here, the quartile deviation, which is half the interquartile range. In the above example, the quartile deviation is £28.50. In practice, the quartile deviation is used rather more than the interquartile range. If you rearrange Q3 − Q1 as (Q3 − M) + (M − Q1) you will see that the two expressions in brackets give the distances from the quartiles to the median and then dividing by two gives the average distance from the quartiles to the median. So we can say that approximately 50 per cent of the observations lie within ± one quartile deviation of the median.
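As a cross-check on the graphical readings for Example 4.8.1, the quartiles can also be estimated by linear interpolation between the two ogive points that straddle the required cumulative frequency. This is only a sketch, resting on the same assumption that values are evenly spread within each class; the interpolation is not part of the original solution, and the results are rounded:

```latex
\[
Q_1 \approx 200 + \frac{120 - 16}{169 - 16} \times 25 \approx 217.0,
\qquad
Q_3 \approx 250 + \frac{360 - 270}{362 - 270} \times 25 \approx 274.5
\]
```

Both values agree with the readings of £217 and £274 to within the accuracy that can be expected from a graph.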
Example 4.8.2
Using the data on the output of product Q (see Example 4.3.4), find the quartiles, the interquartile range and the quartile deviation from the ogive (Figure 4.3).
Solution
The total frequency = 22, so the cumulative frequency of Q1 is 22/4 = 5.5, and the cumulative frequency of Q3 is (3 × 22)/4 = 16.5.
From the ogive, Q1 = 362.5 kg and Q3 = 383.5 kg.
Hence, the interquartile range = 383.5 − 362.5 = 21 kg, and the quartile deviation = 21 ÷ 2 = 10.5 kg.
4.9 Deciles
Just as quartiles divide a cumulative distribution into quarters, deciles divide a cumulative
distribution into tenths. Thus:
● the first decile has 10 per cent of values below it and 90 per cent above it,
● the second decile has 20 per cent of values below it and 80 per cent above it, and so on.
The use and evaluation of deciles can best be illustrated through an example.
Example 4.9.1
As a promotional example, a mail-order company has decided to give free gifts to its highest-spending customers.
It has been suggested that the highest-spending 30 per cent get a gift, while the highest-spending 10 per cent
get an additional special gift. The following distribution of a sample of spending patterns over the past year is
available:
Amount spent (£)     Number of customers spending this amount
under 50             37
50–under 100         59
100–under 150        42
150–under 200        20
200–under 300        13
300 and over          9
To which customers should the gift and the additional special gift be given?
Solution
We have to determine the ninth decile (90 per cent below it) and the seventh decile (70 per cent below it).
These can be found in the same way as with quartiles, by reading from an ogive.
The cumulative frequency distribution (ignoring the last open-ended class) is:
Amount spent (less than, £)    Cumulative frequency
50                              37
100                             96
150                            138
200                            158
300                            171
The ogive is shown in Figure 4.7.
Figure 4.7 Ogive for Example 4.9.1 (cumulative frequency plotted against amount spent last year (£), with the 7th and 9th deciles marked).
The ninth decile will correspond to a cumulative frequency of 162 (90 per cent of the total frequency, 180). From the ogive, this is £230.
Similarly, the seventh decile corresponds to a cumulative frequency of 70 per cent of 180, that is, 126. From the ogive, this is £135.
Hence, in order to implement the suggestion, the company should give the free gift to those customers who have spent over £135 in the past year, and the additional free gift to those who have spent over £230.
Example 4.9.2
Using the data on the output of product Q (see Example 4.3.4), find the ninth decile from the ogive (Figure 4.3).
Solution
The cumulative frequency of the ninth decile is 0.9 × 22 = 19.8. From the ogive, the ninth decile is 392.5 kg (approximately).
In your exam you cannot be asked to draw the ogive so you just have to know
how to obtain the quartiles and percentiles from it. It is possible to calculate these
statistics but this is not required in your syllabus and the formulae are not given.
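Although the calculation is outside the syllabus, the idea behind reading a decile from an ogive can be sketched as linear interpolation between plotted points. The function name `ogive_value` and the starting point (0, 0) are assumptions made for illustration; the data are those of Example 4.9.1.

```python
def ogive_value(upper_bounds, cum_freqs, target):
    """Estimate the value below which `target` observations lie,
    by linear interpolation along the ogive."""
    points = list(zip(upper_bounds, cum_freqs))
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if c0 <= target <= c1:
            return x0 + (target - c0) / (c1 - c0) * (x1 - x0)
    raise ValueError("target lies outside the plotted ogive")


# Example 4.9.1: ogive points, starting from an assumed origin (0, 0).
bounds = [0, 50, 100, 150, 200, 300]
cum_freqs = [0, 37, 96, 138, 158, 171]
total = 180  # includes the 9 customers in the open-ended final class

seventh_decile = ogive_value(bounds, cum_freqs, 7 * total / 10)
ninth_decile = ogive_value(bounds, cum_freqs, 9 * total / 10)
print(round(seventh_decile, 1), round(ninth_decile, 1))  # 135.7 230.8
```

The interpolated values are close to the graphical readings of £135 and £230; small differences are to be expected when reading from a hand-drawn curve.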
4.10 The mean absolute deviation
If the mean is the average being used, then one very good way of measuring the amount of
variability in the data is to calculate the extent to which the values differ from the mean.
This is essentially the thinking behind the mean absolute deviation and the standard deviation (for which, see Section 4.11).
Example 4.10.1
Measure the spread of shop A’s weekly takings (Example 4.7.1), given the following sample over 6 weeks. The sample has an arithmetic mean of £1,050.
£1,120    £990    £1,040    £1,030    £1,105    £1,015
Solution
A simple way of seeing how far a single value is from a (hopefully) representative average figure is to determine the difference between the two. In particular, if we are dealing with the mean, $\bar{x}$, this difference is known as the deviation from the mean or, more simply, the deviation. It is clear that, for a widely spread data set, the deviations of the individual values in the set will be relatively large. Similarly, narrowly spread data sets will have relatively small deviation values. We can therefore base our measure on the values of the deviations from the mean:
Deviation = $x - \bar{x}$
In this case, the values of $(x - \bar{x})$ are:
£70, −£60, −£10, −£20, £55, −£35
The obvious approach might now be to take the mean of these deviations as our measure. Unfortunately, it can be shown that this always turns out to be zero and so the mean deviation will not distinguish one distribution from another. The basic reason for this result is that the negative deviations, when summed, exactly cancel out the positive ones: we must therefore remove this cancellation effect.
One way to remove negative values is simply to ignore the signs, that is, to use the absolute values. In this case, the absolute deviations are:
$|x - \bar{x}|$: £70, £60, £10, £20, £55, £35
The two vertical lines are the mathematical symbol for absolute values and are often referred to as ‘modulus’, or ‘mod’, of $(x - \bar{x})$ in this case. The mean of this list is now a measure of the spread in the data. It is known as the mean absolute deviation. Hence the mean absolute deviation of weekly takings for shop A is:
(70 + 60 + 10 + 20 + 55 + 35)/6 = £41.67
Thus, our first measure of the spread of shop A’s weekly takings is £41.67.
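The steps of Example 4.10.1 can be reproduced in a few lines of code. This is a minimal sketch using plain Python; the variable names are illustrative only.

```python
takings = [1120, 990, 1040, 1030, 1105, 1015]  # shop A, £ per week

mean = sum(takings) / len(takings)
deviations = [x - mean for x in takings]

# The signed deviations cancel out, so their mean is zero.
mean_deviation = sum(deviations) / len(deviations)

# Ignoring the signs gives the mean absolute deviation.
mean_abs_deviation = sum(abs(d) for d in deviations) / len(deviations)

print(mean, mean_deviation, round(mean_abs_deviation, 2))  # 1050.0 0.0 41.67
```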
Example 4.10.2
Find the mean absolute deviation for the following data:
2    3    5    7    8
Solution
The mean, $\bar{x}$, is (2 + 3 + 5 + 7 + 8)/5 = 5. So absolute deviations are given by subtracting 5 from each of the data and ignoring any negative signs. This gives values of:
3    2    0    2    3
The mean absolute deviation is (3 + 2 + 0 + 2 + 3)/5 = 2.
The mean deviation is not explicitly mentioned in your syllabus and is unlikely to be examined. We have included it as part of the theoretical build up to the standard deviation.
4.11 The standard deviation
In the preceding section, we solved the problem of negative deviations cancelling out positive ones by using absolute values. There is another way of ‘removing’ negative signs, namely by squaring the figures. If we do that, then we get another, very important, measure of spread, the standard deviation.
Example 4.11.1
Evaluate the measure of the spread in shop A’s weekly takings (Example 4.7.1), using this new approach.
Solution
Recall that we have the deviations:
$x - \bar{x}$: £70, −£60, −£10, −£20, £55, −£35
so, by squaring, we get:
$(x - \bar{x})^2$: 4,900, 3,600, 100, 400, 3,025, 1,225
The mean of these squared deviations is:
13,250/6 = 2,208.3
This is a measure of spread whose units are the square of those of the original data, because we squared the deviations. We thus take the square root to get back to the original units (£). Our measure of spread is therefore:
√2,208.3 = £46.99
This is known as the standard deviation, denoted by ‘s’. Its square, the intermediate step before square-rooting, is called the variance, $s^2$.
The formula that has been implicitly used here is:
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$$
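The calculation in Example 4.11.1 can be checked with a short script. This is a sketch rather than a prescribed method; it divides by n, as the formula in the text does, which corresponds to `statistics.pstdev` in Python's standard library (`statistics.stdev` divides by n − 1 instead).

```python
import math
import statistics

takings = [1120, 990, 1040, 1030, 1105, 1015]  # shop A, £ per week

mean = sum(takings) / len(takings)
variance = sum((x - mean) ** 2 for x in takings) / len(takings)
std_dev = math.sqrt(variance)

print(round(variance, 1), round(std_dev, 2))  # 2208.3 46.99

# The standard library's population standard deviation gives the same figure.
assert math.isclose(std_dev, statistics.pstdev(takings))
```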
Applying the same series of steps to the data in a frequency distribution will give us the corresponding formula in this case:
● square the deviations: $(x - \bar{x})^2$
● find the mean of the $(x - \bar{x})^2$ values occurring with frequencies denoted by f:
$$\frac{\sum f(x - \bar{x})^2}{\sum f} \quad (= s^2)$$
● take the square root:
$$\sqrt{\frac{\sum f(x - \bar{x})^2}{\sum f}} \quad (= s)$$
In practice, this formula can turn out to be very tedious to apply. It can be shown that the following, more easily applicable, formula is the same as the one above:
$$s = \sqrt{\frac{\sum fx^2}{\sum f} - \left(\frac{\sum fx}{\sum f}\right)^2}$$
This formula will be given in the Business Mathematics exam, with $\bar{x}$ in place of $\sum fx/\sum f$.
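For readers who want to see why the two expressions for the variance agree, the following short derivation expands the squared deviation and uses the definition of the mean. It is offered as a supplementary sketch and is not needed for the exam.

```latex
\begin{align*}
\frac{\sum f(x-\bar{x})^2}{\sum f}
  &= \frac{\sum fx^2 - 2\bar{x}\sum fx + \bar{x}^2 \sum f}{\sum f} \\
  &= \frac{\sum fx^2}{\sum f} - 2\bar{x}\cdot\bar{x} + \bar{x}^2
     \qquad \text{since } \bar{x} = \frac{\sum fx}{\sum f} \\
  &= \frac{\sum fx^2}{\sum f} - \left(\frac{\sum fx}{\sum f}\right)^2
\end{align*}
```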
An example will now demonstrate a systematic way of setting out the computations
involved with this formula.
Example 4.11.2
An analyst is considering two categories of company, X and Y, for possible investment. One of her assistants has
compiled the following information on the price-earnings ratios of the shares of companies in the two categories
over the past year.
Price-earnings ratios    Number of category X companies    Number of category Y companies
4.95–under 8.95          3                                 4
8.95–under 12.95         5                                 8
12.95–under 16.95        7                                 8
16.95–under 20.95        6                                 3
20.95–under 24.95        3                                 3
24.95–under 28.95        1                                 4
Compute the standard deviations of these two distributions and comment. (You are given that the means of the
two distributions are 15.59 and 15.62, respectively.)
Solution
Concentrating ﬁrst of all on category X, we see that we face the same problem as when we calculated the mean
of such a distribution, namely that we have classiﬁed data, instead of individual values of x. Adopting a similar
approach as before, we take the mid-point of each class:
x (mid-point)    x²          f     fx        fx²
6.95             48.3025     3     20.85     144.9075
10.95            119.9025    5     54.75     599.5125
14.95            223.5025    7     104.65    1,564.5175
18.95            359.1025    6     113.70    2,154.6150
22.95            526.7025    3     68.85     1,580.1075
26.95            726.3025    1     26.95     726.3025
Total                        25    389.75    6,769.9625
Thus the standard deviation is:
$$s = \sqrt{\frac{\sum fx^2}{\sum f} - \left(\frac{\sum fx}{\sum f}\right)^2} = \sqrt{\frac{6{,}769.9625}{25} - \left(\frac{389.75}{25}\right)^2} = \sqrt{270.7985 - 243.0481} = \sqrt{27.7504} = 5.27$$
The standard deviation of the price-earnings ratios for category X is therefore 5.27. In the same way, you can verify that the standard deviation in the case of category Y is 6.29. These statistics again emphasise the wider spread in the category Y data than in the category X data. Note how a full degree of accuracy (four decimal places) is retained throughout the calculation in order to ensure an accurate final result.
The calculation for Y should be as for X above. In outline:
x (mid-point)    x²          f     fx        fx²
6.95             48.3025     4     27.80     193.210
…                …           …     …         …
…                …           …     …         …
26.95            726.3025    4     107.80    2,905.210
Total                        30    468.50    8,503.075
$$s = \sqrt{283.4358 - 243.8803} = 6.289$$
Example 4.11.3
Using the data from Example 4.2.3 relating to absences from work, and the mean that you have already calculated, find the standard deviation.
No. of employees absent    No. of days (frequency)
2                          2
3                          4
4                          3
5                          4
6                          3
7                          3
8                          3
Solution
It is probably easiest to calculate fx² by multiplying fx by x, for example, 2 × 4, 3 × 12, etc.
x        f     fx     fx²
2        2     4      8
3        4     12     36
4        3     12     48
5        4     20     100
6        3     18     108
7        3     21     147
8        3     24     192
Total    22    111    639
$$s = \sqrt{\frac{\sum fx^2}{\sum f} - \bar{x}^2} = \sqrt{\frac{639}{22} - \left(\frac{111}{22}\right)^2} = \sqrt{29.0455 - 25.4566} = \sqrt{3.5889} = 1.89 \text{ (to two d.p.)}$$
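The tabular method used in Examples 4.11.2 and 4.11.3 translates directly into code. The sketch below applies the computational formula to a list of values (or class mid-points) and their frequencies; the function name `grouped_std_dev` is an illustrative choice, not something defined in the text.

```python
import math


def grouped_std_dev(values, freqs):
    """Standard deviation of a frequency distribution:
    sqrt(sum(f*x^2)/sum(f) - (sum(f*x)/sum(f))^2)."""
    n = sum(freqs)
    sum_fx = sum(f * x for x, f in zip(values, freqs))
    sum_fx2 = sum(f * x * x for x, f in zip(values, freqs))
    return math.sqrt(sum_fx2 / n - (sum_fx / n) ** 2)


# Example 4.11.2: class mid-points of the price-earnings ratios.
midpoints = [6.95, 10.95, 14.95, 18.95, 22.95, 26.95]
s_x = grouped_std_dev(midpoints, [3, 5, 7, 6, 3, 1])
s_y = grouped_std_dev(midpoints, [4, 8, 8, 3, 3, 4])

# Example 4.11.3: ungrouped values with frequencies.
s_absences = grouped_std_dev([2, 3, 4, 5, 6, 7, 8], [2, 4, 3, 4, 3, 3, 3])

print(round(s_x, 2), round(s_y, 2), round(s_absences, 2))  # 5.27 6.29 1.89
```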
Example 4.11.4
Using the data from Exercise 4.2.5 relating to output of product Q, and the mean that you have already calculated, ﬁnd the standard deviation.
Output of Q (kg)    No. of days (frequency)
350–under 360       4
360–370             6
370–380             5
380–390             4
390–400             3
Solution
Mid-point x    Frequency f    fx       fx²
355            4              1,420    504,100
365            6              2,190    799,350
375            5              1,875    703,125
385            4              1,540    592,900
395            3              1,185    468,075
Total          22             8,210    3,067,550
$$s = \sqrt{\frac{\sum fx^2}{\sum f} - \bar{x}^2} = \sqrt{\frac{3{,}067{,}550}{22} - \left(\frac{8{,}210}{22}\right)^2} = \sqrt{139{,}434.0909 - 139{,}264.6694} = \sqrt{169.4215} = 13.02 \text{ (to two d.p.)}$$
4.12 The coefficient of variation
The coefficient of variation is a statistical measure of the dispersion of data points in a data series around the mean. It is calculated as follows:
$$\text{Coefficient of variation} = \frac{\text{Standard deviation}}{\text{Expected return}}$$
The coefficient of variation is the ratio of the standard deviation to the mean, and is useful when comparing the degree of variation from one data series to another, even if the means are quite different from each other.
In a financial setting, the coefficient of variation allows you to determine how much risk you are assuming in comparison to the amount of return you can expect from an investment. The lower the ratio of standard deviation to mean return, the better your risk-return tradeoff.
Note that if the expected return in the denominator of the calculation is negative or zero, the ratio will not make sense.
In Example 4.11.2, it was relatively easy to compare the spread in two sets of data by looking at the standard deviation figures alone, because the means of the two sets were so similar. Another example will show that it is not always so straightforward.
Example 4.12.1
Government statistics on the basic weekly wages of workers in two countries show the following. (All figures converted to sterling equivalent.)
Country V: $\bar{x}$ = £120, s = £55
Country W: $\bar{x}$ = £90, s = £50
Can we conclude that country V has a wider spread of basic weekly wages?
Solution
By simply looking at the two standard deviation figures, we might be tempted to answer ‘yes’. In doing so, however, we should be ignoring the fact that the two mean values indicate that wages in country V are inherently higher, and so the deviations from the mean and thus the standard deviation will tend to be higher. To make a comparison of like with like we must use the coefficient of variation:
$$\text{Coefficient of variation} = \frac{s}{\bar{x}}$$
Thus:
Coefficient of variation of wages in country V = 55/120 = 45.8%
Coefficient of variation of wages in country W = 50/90 = 55.6%
Hence we see that, in fact, it is country W that has the higher variability in basic weekly wages.
Example 4.12.2
Calculate the coefficients of variation for the data in Examples 4.11.3 and 4.11.4.
Solution
● In Example 4.11.3, $\bar{x}$ = 5.045 and s = 1.8944, so the coefficient of variation is: 100 × 1.8944/5.045 = 37.6%
● In Example 4.11.4, $\bar{x}$ = 373.18 and s = 13.0162, so the coefficient of variation is: 100 × 13.0162/373.18 = 3.5%.
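The comparison in Example 4.12.1 can be expressed as a small helper. This is an illustrative sketch: the function name is hypothetical, and the check on the mean simply reflects the point that the ratio is not meaningful when the denominator is zero or negative.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the coefficient of variation as a percentage."""
    if mean <= 0:
        raise ValueError("coefficient of variation needs a positive mean")
    return 100 * std_dev / mean


cv_v = coefficient_of_variation(55, 120)  # country V
cv_w = coefficient_of_variation(50, 90)   # country W
print(round(cv_v, 1), round(cv_w, 1))  # 45.8 55.6
```

Although country V has the larger standard deviation, country W has the larger coefficient of variation, which is the conclusion reached in the example.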
4.13 A comparison of the measures of spread
Like the mode, the range is little used except as a very quick initial view of the overall spread of the data. The problem is that it is totally dependent on the most extreme
values in the distribution, which are the ones that are particularly liable to reﬂect errors or
one-off situations. Furthermore, the range tells us nothing at all about how the data is
spread between the extremes.
The standard deviation is undoubtedly the most important measure of spread. It has a
formula that lends itself to algebraic manipulation, unlike the quartile deviation, and so,
along with the mean, it is the basis of almost all advanced statistical theory. This is a pity
because it does have some quite serious disadvantages. If data is skewed, the standard deviation will exaggerate the degree of spread because of the large squared deviations associated
with extreme values. Similarly, if a distribution has open intervals at the ends, the choice of
limits and hence of mid-points will have a marked effect on the standard deviation.
The quartile deviation, and to a lesser extent the interquartile range, is the best measure
of spread to use if the data is skewed or has open intervals. In general, these measures would
not be preferred to the standard deviation because they ignore much of the data and are little known.
Finally, it is often the case that data is intended to be compared with other data, perhaps
nationwide ﬁgures or previous year’s ﬁgures, etc. In such circumstances, unless you have
access to all the raw data, you are obliged to compare like with like, regardless perhaps of
your own better judgement.
4.14 Descriptive statistics using Excel
Many of the techniques discussed in this chapter can be facilitated through the use of
Excel. This section discusses a number of these, including the mean, the mode, the median,
the standard deviation, the variance and the range.
Figure 4.8 shows 100 observations that represent sample production weights of a product such as cereals, produced in grams. This data is the sample data from which the descriptive statistics are measured. The term sample is important as it implies that the data does
not represent the full population and this affects some of the spreadsheet functions used.
A population is the complete data set from which a conclusion is to be made.
Figure 4.8 100 observations of production weights in grams
The data has been entered into a spreadsheet and the range a3 through e22 has been named data. Any rectangular range of cells in Excel can be given a name, which can be easier to reference than depending on cell references. To name the range, first select the area to be named (a3:e22 in this case), double-click on the name box at the top of the screen and type in the required name (data in this case).
The mean, median and mode are described as measures of central tendency and offer different ways of presenting a typical or representative value of a group of values. The range, the standard deviation and the variance are measures of dispersion and refer to the degree to which the observations in a given data set are spread about the arithmetic mean. The mean is the most frequently used measure of central tendency, and statisticians frequently use the mean together with the standard deviation to describe a data set. Figure 4.9 shows the result of the descriptive statistic functions. Each statistic is explained in detail below.
Figure 4.9 Results of descriptive statistics
Mean
To calculate the sample arithmetic mean of the production weights the average function is used as follows in cell b4.
=AVERAGE(DATA)
It is important to note that the average function totals the cells containing values and divides by the number of cells that contain values. In certain situations this may not produce the required results and it might be necessary to ensure that zero has been entered in order that the function sees the cell as containing a value.
Sample median
The sample median is defined as the middle value when the data values are ranked in increasing, or decreasing, order of magnitude. The following formula in cell b5 uses the median function to calculate the median value for the production weights.
=MEDIAN(DATA)
Sample mode
The sample mode is defined as the value which occurs most frequently. The following formula is required in cell b6 to calculate the mode of the production weights.
=MODE(DATA)
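For readers working outside a spreadsheet, the same three statistics can be obtained with Python's standard `statistics` module. The sketch below uses a small set of hypothetical weights, since the 100 observations of Figure 4.8 are not reproduced in the text. It also illustrates why the distinction between a sample and a population matters: the two standard deviation functions divide by different quantities.

```python
import statistics

# Hypothetical production weights in grams (NOT the data of Figure 4.8).
weights = [498, 501, 500, 499, 500, 502, 500, 497]

mean_weight = statistics.mean(weights)      # counterpart of =AVERAGE(DATA)
median_weight = statistics.median(weights)  # counterpart of =MEDIAN(DATA)
mode_weight = statistics.mode(weights)      # counterpart of =MODE(DATA)

sample_sd = statistics.stdev(weights)       # sample: divides by n - 1
population_sd = statistics.pstdev(weights)  # population: divides by n

print(mean_weight, median_weight, mode_weight)
print(sample_sd > population_sd)  # the sample figure is slightly larger
```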