3 Bar Graphs, Histograms, and Box Plots
of the independent, categorical variable (locality) and the y-axis represents the dependent
variable (mean snowfall in meters).
Figure 3.1 Clustered bar chart comparing the mean snowfall of alpine forests between
2013 and 2015 in Mammoth, CA; Mount Baker, WA; and Alyeska, AK.
Notice that Figure 3.1 gives a clear depiction of the differences in the mean snowfall at the
three localities. By adding error bars (standard deviations), the researcher is also able to
illustrate the variance in each one of the groups of data. For instance, the snowfall in
Alyeska, AK is less variable than the snowfall in Mount Baker, WA.
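The bar heights and error bars in a chart like Figure 3.1 come from two summary statistics per group: the mean and the standard deviation. The following is a minimal sketch of that calculation; the snowfall values are hypothetical and chosen only for illustration, and the sample standard deviation (n − 1 denominator) is assumed.

```python
from statistics import mean, stdev

# Hypothetical annual snowfall totals in meters (2013-2015); not the book's data.
snowfall_m = {
    "Mammoth, CA": [9.0, 10.0, 11.0],
    "Mount Baker, WA": [14.0, 20.0, 17.0],
    "Alyeska, AK": [16.0, 17.0, 18.0],
}

# Bar height = mean; error bar length = sample standard deviation.
summary = {
    site: (mean(values), stdev(values))
    for site, values in snowfall_m.items()
}

for site, (avg, sd) in summary.items():
    print(f"{site}: mean = {avg:.2f} m, SD = {sd:.2f} m")
```

In this made-up example the two northern sites share the same mean, but the larger standard deviation at Mount Baker would show up as a longer error bar.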
Figure 3.2 Clustered bar chart comparing the mean snowfall of alpine forests between
2013 and 2015 in Mount Baker, WA and Alyeska, AK. An improperly scaled axis
exaggerates the differences between groups.
One of the most important considerations when displaying data with a bar graph is the
scaling of the axes. Unfortunately, graphs built in the programs Excel and Numbers are
often created with an improperly scaled y-axis. If the y-axis does not begin with zero, then
the differences between groups appear exaggerated: “zooming in” on a narrow range of
y-axis values makes the graph misleading. Take the previous example of snowfall
measurements. The mean snowfall values in Mount Baker, WA and Alyeska, AK
are clearly very similar. However, if the graph of these two localities is built with a modified
y-axis, as in Figure 3.2, with the minimum value set to 16.5, the differences appear dramatic
when in reality they are not.
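The size of this distortion can be checked with simple arithmetic. The sketch below compares the drawn heights of two bars when the y-axis starts at zero and when it starts at 16.5; the two means are hypothetical values, and `drawn_height` is a helper name introduced here only for illustration.

```python
def drawn_height(value, axis_min):
    """Height of a bar as drawn when the y-axis starts at axis_min."""
    return value - axis_min

# Hypothetical mean snowfall in meters, chosen only to illustrate the effect.
mount_baker = 17.0
alyeska = 17.5

for axis_min in (0.0, 16.5):
    ratio = drawn_height(alyeska, axis_min) / drawn_height(mount_baker, axis_min)
    print(f"y-axis starts at {axis_min}: taller bar is {ratio:.2f}x the shorter bar")
```

With these assumed values, the taller bar is about 1.03 times the shorter one on a zero-based axis but exactly twice as tall when the axis starts at 16.5, even though the underlying difference is only 0.5 m.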
Figure 3.3 Clumped bar chart comparing the mean snowfall of alpine forests by year
(2013, 2014, and 2015) in Mammoth, CA; Mount Baker, WA; and Alyeska, AK.
Clumped Bar Charts
If this same researcher were interested in illustrating the trends in snowfall patterns over
a 3-year period, a clumped bar chart would be useful. Figure 3.3 shows snowfall patterns
within each locality over a 3-year period. By using a clumped bar chart, the researcher can
demonstrate trends within each category of data. For example, we can see that the
snowfall was exceptionally high in 2014 at the Mount Baker, WA location; however, the
snowfall at the Mammoth, CA location was fairly stable over time.
Stacked Bar Charts
Next the researcher wants to illustrate differences in the timing of snowfall by month
within each site. For this example, a stacked bar chart is helpful in illustrating the relative
contributions of parts to the whole. Figure 3.4 shows the amount of snow that fell within
the months of January, February, and March, 2015. Notice that in Mammoth, CA there
was zero snowfall in the month of January.
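In a stacked bar chart, each segment starts where the previous one ends, so the top of the last segment equals the total for that bar. The sketch below works out those segment positions and each month's share of the total; the monthly values are hypothetical (only the zero for Mammoth in January follows the text), and `stack_segments` is an illustrative helper, not part of any charting program.

```python
months = ["January", "February", "March"]

# Hypothetical monthly snowfall in meters for 2015; not the book's data.
# Mammoth's January value is zero, matching the description in the text.
monthly_m = {
    "Mammoth, CA": [0.0, 1.5, 2.5],
    "Mount Baker, WA": [2.0, 3.0, 1.0],
    "Alyeska, AK": [2.5, 2.0, 1.5],
}

def stack_segments(values):
    """Return one (bottom, top) pair per value, each stacked on the previous."""
    segments = []
    bottom = 0.0
    for value in values:
        segments.append((bottom, bottom + value))
        bottom += value
    return segments

for site, values in monthly_m.items():
    total = sum(values)
    print(f"{site} (total {total:.1f} m)")
    for month, value, (bottom, top) in zip(months, values, stack_segments(values)):
        share = 100 * value / total
        print(f"  {month}: {bottom:.1f} to {top:.1f} m ({share:.0f}% of total)")
```

A month with zero snowfall produces a segment of zero height, which is why no January segment would be visible for Mammoth.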
Figure 3.4 Stacked bar chart comparing the mean snowfall of alpine forests by month
(January, February, and March) for 2015 in Mammoth, CA; Mount Baker, WA; and
Alyeska, AK.
Figure 3.5 Histogram of seal size.
Histograms
Histograms are another form of bar chart, used to display continuous data grouped into
consecutive ranges of values, such as age ranges. If your data are made up of quantitative
variables, then consider constructing a histogram. The format is similar to that of a bar chart;
however, each category along the bottom represents a set range of values.
Hence, both axes are numerical scales. The aesthetics are also slightly
different: adjacent bars touch, with no space between them, because the horizontal
axis represents continuous values (Figure 3.5). If a gap does appear between bars,
it means that no values fall in that range.
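The counting behind a histogram can be sketched in a few lines. The example below sorts values into equal-width ranges and prints a text version of the bars; the seal lengths are hypothetical (they are not the data behind Figure 3.5), and the bin start, width, and count are arbitrary choices made for illustration.

```python
def histogram_counts(values, start, width, n_bins):
    """Count values in n_bins consecutive ranges of equal width.

    Bin i covers start + i*width up to (but not including) start + (i+1)*width.
    Values outside all bins are ignored.
    """
    counts = [0] * n_bins
    for value in values:
        index = int((value - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts

# Hypothetical seal lengths in centimeters.
seal_lengths_cm = [112, 118, 121, 125, 127, 133, 134, 138, 161, 165]
counts = histogram_counts(seal_lengths_cm, start=110, width=10, n_bins=6)

for i, count in enumerate(counts):
    low = 110 + i * 10
    print(f"{low}-{low + 10} cm: {'#' * count}")
```

Here the ranges 140-150 cm and 150-160 cm contain no values, so the corresponding bars have zero height and appear as a gap in the histogram.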
Box Plots
The box plot (also called a box and whisker plot) is a convenient way to illustrate several
key descriptive statistics from a dataset. Box plots show the median, as well as the
distribution of the data through the use of quartiles, which divide ranked data into four
equal groups, each consisting of a quarter of the data.
Consider the following dataset:
The first step in developing a box plot for these data is to define the quartiles. Several
methods are currently debated regarding how to define quartiles; the following example
uses the simplest and most intuitive method. In the sample dataset above, the numbers
must first be rearranged so that they are in order:
Second, find the median, which is also defined as the second quartile (Q2). In the current
example, there is an even number of data points, so the median is calculated as the
average of the middle two numbers (Q2 = 24). If there were an odd number of points, the
median would be excluded for the next step. Third, calculate the median of each half of
the data (on either side of the median); these medians are the first and third quartiles (Q1
and Q3):
The box component of a box plot spans the first quartile to the third quartile; the length
of the box is known as the interquartile range (IQR = Q3 − Q1). The median is shown inside
the box at the position of the second quartile, as illustrated in Figures 3.6 and 3.7.
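The quartile steps described above (sort, find the median, then take the median of each half) can be sketched as follows. The dataset is hypothetical and is not the book's example, and `quartiles` is a helper written for this illustration; statistical packages may use other quartile definitions and return slightly different values.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # With an odd number of points, the median itself is left out of both halves.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

# Hypothetical dataset with an even number of points.
data = [31, 12, 27, 18, 40, 22, 15, 35]

q1, q2, q3 = quartiles(data)
iqr = q3 - q1
print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {iqr}")
```

For this made-up dataset the sorted values are 12, 15, 18, 22, 27, 31, 35, 40, giving Q1 = 16.5, Q2 = 24.5, Q3 = 33.0, and an IQR of 16.5.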
Figure 3.6 Example box plot showing the median, first and third quartiles, as well as the
whiskers.
Figure 3.7 Comparison of the box plot to the normal distribution of a sample population.
By showing the median, as well as the position of the first and third quartiles, box plots
give information about the degree of dispersion, as well as the skewness of the data. Box
plots often also have lines (the whiskers) extending from the box to represent the
variability of the data outside of the upper and lower quartiles. The whiskers usually mark
the minimum and maximum values for the dataset. However, if the dataset contains
outliers, the whiskers will extend only up to a certain point, defined as Q1 − 1.5 × IQR or
Q3 + 1.5 × IQR (Figure 3.7). Outliers will be depicted as points outside of the whiskers
(Figure 3.8).
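The whisker limits and outliers follow directly from the quartiles. The sketch below applies the Q1 − 1.5 × IQR and Q3 + 1.5 × IQR limits to a hypothetical dataset; the quartile values passed in are assumed to have been calculated beforehand with the median-of-halves method, and `whisker_limits` is an illustrative name only.

```python
def whisker_limits(values, q1, q3):
    """Return (lower limit, upper limit, outliers) for a box plot.

    Limits are Q1 - 1.5 * IQR and Q3 + 1.5 * IQR; any value beyond
    either limit is reported as an outlier.
    """
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_limit or v > upper_limit]
    return lower_limit, upper_limit, outliers

# Hypothetical dataset with one unusually large value.
data = [12, 15, 18, 22, 27, 31, 35, 80]
# Quartiles assumed from the median-of-halves method: Q1 = 16.5, Q3 = 33.0.
lower_limit, upper_limit, outliers = whisker_limits(data, q1=16.5, q3=33.0)

print(f"Whisker limits: {lower_limit} to {upper_limit}")
print(f"Outliers: {outliers}")
```

With these assumed values the limits are −8.25 and 57.75, so the value 80 would be drawn as a separate point beyond the upper whisker.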
Figure 3.8 Sample box plot with an outlier.
The box plots on previous pages, Figures 3.6 and 3.7, have been drawn for illustrative
purposes in a horizontal orientation, but box plots are most often shown vertically, as in Figure 3.8.
In Figure 3.8, descriptive information from two groups of data is depicted. Although the
medians for the two groups are the same, the differences in the dispersion and skew of
the data are apparent. While group B shows a normal distribution, group A shows a
“positive skew,” with a tail that extends in the positive direction. The box plot for group A
also shows the position of an outlier, whose value is beyond the range of the whiskers.
Generating box plots is straightforward in both SPSS and R and is included in this book's
tutorials. However, generating box plots in Excel and Numbers is both lengthy and
complex, and involves manipulating stacked bar charts. If you do not have access to SPSS
or R, we recommend looking for a free, online box plot generator, which is an easy and
quick solution for creating box plots of your data.
Tutorials
How to Make a Bar Chart in Excel
The following tutorial will walk you through the construction of a bar chart (also known
as column graph or bar plot) using Excel. The data involve the number of rows of snail
radula.
*Data taken from the research of Vanessa C. Morales, Robert Candelaria, and
Dr. Kathleen Weaver.
Refer to Chapter 12 for tips and tools when using Excel.
Excel offers two methods to construct a simplified bar chart with error bars. While the
first method may be more challenging at first, the lessons learned will give you greater
mastery and flexibility. Calculate the average and standard deviation of the radula from
each population prior to beginning the tutorial.
Method 1
1. Arrange data in columns on the spreadsheet.
2. Click on an empty cell. Select Insert, Column, and select the first 2-D Column
option. There are several types of bar graphs available. Use the one appropriate for the
data you want to display.
3. A blank canvas will appear.
4. Right click on the blank canvas and choose the Select Data option.
5. Under Legend Entries, select Add.
Note: Add each data point as a separate series so that the standard deviation bars can be
entered separately.
6. Select the icon corresponding to the Series name subheading.
7. Select the first series title then click on the icon to the right.
8. Select the icon corresponding to the Series values.
9. Select the first value then click the icon on the right.
10. Click OK.
11. You will be directed to the original popup. Repeat steps 5–10 to input the remaining
values.
12. After the second variable is added, you should be left with a graph that looks like the
following.
After the third variable:
13. Once all the variables have been added to the graph, click OK.
14. A very basic column graph will appear, similar to the one below.
15. By default, Excel labels the x-axis as “1.” To delete this label, select the label “1”; a box
will appear around it. Then press Delete on your keyboard.