8 Pearson/Spearman's Correlation R Tutorial
produce enough sperm to compete with other males’ sperm when females copulate with
two males in close succession. The more sperm a male produces, the greater the
likelihood his sperm will fertilize the ovum.
However, monogamous groups, like Hamadryas baboons, are usually associated with
smaller testicular volume in relation to body mass, as they do not need to regularly
participate in sperm competition for successful production of offspring. When testicular
volume is compared across primate species, it is usually considered in conjunction with body
mass so that the volumes are comparable across species of different sizes.
Before using the two measures together, it is best to run a correlation to verify whether there is a
relationship between testicular volume and body mass. Usually, age is also taken into
account to ensure males are at the age of reproduction when testing correlation
hypotheses between testicular volume and body mass.
*The data used for this tutorial were taken from the study conducted by Jolly and Phillips-Conroy (2003).
Formulate a question about the data that can be addressed by performing a
correlation.
Question: Are testicular volume and body mass of Hamadryas baboons associated
with one another?
Based on the question, formulate the null and alternative hypotheses that
address the question proposed.
Null Hypothesis (H0): There is no relationship between testicular volume and body
mass in Hamadryas baboons.
Alternative Hypothesis (H1): There is a relationship between testicular volume and
body mass in Hamadryas baboons.
Now that an appropriate testable question has been developed along with a set of testable
hypotheses, you can run the statistical analysis.
This tutorial focuses on running correlations in R.
Refer to Chapter 15 for R-specific terminology and instructions on how to invoke and
construct code.
Check all assumptions prior to running the test.
Pearson's and Spearman's Correlation R Tutorial
1. At the command prompt, type the following testicular volumes (reported in cubic
centimeters or cc) into a vector and store it to a name that describes its contents (e.g.,
tv). Press enter/return.
2. Create a second vector for body mass (in kilograms) by typing in the body mass
measurements from the Hamadryas baboons and storing them under a name such as
bodymass. Press enter/return.
3. The Hmisc package contains a function that calculates both the correlation value (r
or ρ) and the p-value. If you already have it, load the Hmisc package by using the
library() function. If you are unsure if you have installed the package, type in
library(Hmisc) and press enter/return.
If you do not have the package installed, you should see an error message similar to
the one below. To install the package, follow the procedures outlined in Chapter 15.
If the package is installed, you will see the function execute normally, as in the image
below.
4. Apply the rcorr() function to both vectors and select the type of correlation to run
with the type= argument. The type= argument allows for the practitioner to select a
pearson or spearman correlation. Type in your selection of correlation procedure
within quotation marks using the knowledge from examining your assumptions (in
this case, the data are not normally distributed; therefore, the statistical analysis
should be the nonparametric Spearman's rank-order correlation). Press enter/return.
Note: To switch to Pearson's correlation, simply type in “pearson” instead of
“spearman” in the script.
5. The output will look similar to the one below.
The off-diagonal values in the upper matrix (on either side of the diagonal of 1.00) are the
correlation coefficient ρ (0.41), which indicates the strength of the correlation (here, weak).
The lower matrix contains the p-value (0.0238).
In this case, the p-value is 0.0238 and is significant when using α = 0.05, causing us to
reject the null hypothesis and state that in this sample of baboons, testicular volume is
correlated with body mass. The correlation coefficient ρ of 0.41 is positive, indicating a
positive correlation between the two variables. Therefore, as body mass increases, so does
testicular volume.
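The steps above can be sketched in a few lines of R. The values below are hypothetical placeholders, not the measurements from Jolly and Phillips-Conroy (2003), so the output will not match the ρ and p-value reported in this tutorial. The sketch uses base R's cor.test() so that it runs without extra packages, and calls rcorr() only if the Hmisc package is installed.

```r
# Hypothetical placeholder values, NOT the data from the original study.
tv <- c(20.1, 25.4, 18.7, 30.2, 27.5, 22.3, 26.8, 19.5)        # testicular volume (cc)
bodymass <- c(16.2, 18.5, 17.0, 21.3, 19.8, 16.9, 20.4, 17.6)  # body mass (kg)

# Base R: Spearman's rank-order correlation with an approximate p-value.
res <- cor.test(tv, bodymass, method = "spearman", exact = FALSE)
rho <- unname(res$estimate)   # correlation coefficient (rho)
p_value <- res$p.value        # p-value for H0: no association

# Same analysis with rcorr() from Hmisc, run only if the package is installed.
# The result holds a correlation matrix (r), sample sizes (n), and p-values (P).
if (requireNamespace("Hmisc", quietly = TRUE)) {
  print(Hmisc::rcorr(tv, bodymass, type = "spearman"))
}

# To run Pearson's correlation instead, use method = "pearson"
# (or type = "pearson" in rcorr()).
```

Comparing p_value with α = 0.05 then leads to the same reject or fail-to-reject decision described above.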
Concluding Statement
In this sample of baboons, testicular volume (cubic centimeters) shows a weak but significant
positive correlation (ρ = 0.41, p-value = 0.0238) with body mass (kilograms).
11
Linear Regression
Learning Outcomes:
By the end of this chapter, you should be able to:
1. Evaluate the relationship between independent and dependent variables.
2. Use statistical programs to run a regression analysis, determine the
significance of F, and interpret the R² value for your analysis.
3. Use the skills acquired to analyze and evaluate your own dataset from
independent research.
11.1 Linear Regression Background
As previously explained in Chapter 10, a correlation analysis can potentially determine
whether a linear relationship exists between two variables, as well as indicate the strength
of the relationship (r, ρ). It is important to remember that the output generated from the
correlation analysis can give the researcher a quantifiable value describing the
relationship, but it is unable to determine if X (the independent variable) can predict Y
(the dependent variable). If researchers are interested in prediction, then a simple linear
regression is an appropriate test to run.
A simple linear regression is part of the general linear model (GLM), which also includes
ANOVA, analysis of covariance (ANCOVA), and t-tests, among others. Simple linear
regression examines the level of change in one variable (dependent or response) due
to another variable (independent or explanatory). In addition, a linear regression quantifies
this change and provides a measure (R², or the coefficient of determination) of the
variation in the dependent variable (Y) explained by the independent variable (X). See
Figure 11.1 for a graphical representation of a typical regression analysis. The linear model
can be used to predict Y based on the values of X. Under these circumstances,
regression is typically applied to develop a linear equation that best describes the
relationship between two variables. In other words, it addresses the questions: “Is there a
relationship between X and Y? If so, what is the linear equation that best describes the
relationship between X and Y?”
Figure 11.1 Scatter plot with regression line representing a typical regression analysis.
If you have two or more independent variables, then multiple regression can be used to
analyze the linear relationship among multiple variables. For example, an epidemiologist
may consider running a simple linear regression on how the level of alcohol consumption
determines the degree of liver dysfunction. However, because alcohol consumption is
often linked with cigarette smoking, the same epidemiologist may consider looking at an
additional independent variable, cigarette smoking, to see how alcohol affects the liver
when cigarette smoking is taken into account. Throughout this chapter, we will be
referring to a simple linear regression when referencing linear regression or regression
analysis.
The equation for a line is:
Y = mX + b
This relationship is illustrated in the following graph:
Terms and variables in the equation are:
X = Independent (explanatory) variable
Y = Dependent (response) variable
m = Slope (change in y/change in x)
b = y-intercept (the point where the line crosses the y-axis)
The analogous regression equation is
Y = b0 + b1 X1
Where
X1 = Value of the independent variable (the first, and in simple linear regression the only, predictor)
Y = Dependent variable
b0 = y-intercept
b1 = Slope (the regression coefficient of X1)
The regression equation describes the regression line that is fit through the data points. It
also allows you to predict values of Y from values of X. Similar to depicting the data
points of a correlation analysis in the form of an XY scatter plot, a dataset can also be
expressed as an XY scatter plot for linear regression, along with the best-fit line plotted
through the points. In other words, the regression line reflects the best linear model
associated with the data. It is important to be able to draw conclusions from a graph,
specifically from the orientation of the slope (whether it is positive or negative). The
orientation of the slope gives insight into the type of relationship x and y have with one
another. We will be discussing this in more detail with respect to the R² value. Figure 11.2
depicts different slope orientations while examining the relationship between x and y.
Figure 11.2 Graphs depicting the spread around the trend line. Orientation of the slope
determines the type of relationship between x and y and R² describes the strength of the
relationship.
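As an illustration of fitting the regression line and using it for prediction, the following is a minimal sketch in R. It assumes R's built-in cars dataset (speed as X, stopping distance as Y) as a stand-in, since it is unrelated to the case studies in this book; the object names fit, b0, b1, and x_new are chosen here for illustration only.

```r
# Built-in 'cars' data used as a stand-in: speed is X, dist is Y.
fit <- lm(dist ~ speed, data = cars)

# Extract the two coefficients of Y = b0 + b1 X1.
b0 <- unname(coef(fit)["(Intercept)"])  # y-intercept
b1 <- unname(coef(fit)["speed"])        # slope

# Predict Y for a new X value, by hand and with predict().
x_new <- 15
y_hand <- b0 + b1 * x_new
y_pred <- unname(predict(fit, newdata = data.frame(speed = x_new)))

# In an interactive session, draw the scatter plot with the best-fit line.
if (interactive()) {
  plot(dist ~ speed, data = cars)
  abline(fit)
}
```

The sign of b1 gives the orientation of the slope: a positive value means Y increases as X increases, and a negative value means Y decreases as X increases.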
Regression Nuts and Bolts
A regression analysis determines whether there is an existing relationship between the
independent and dependent variables, and if there is, a regression can also describe the
quantifiable change in the dependent variable due to the change in the independent
variable.
The p-value associated with the F-value in the ANOVA table indicates whether the independent
variable in the simple linear regression model has a relationship with the dependent
variable. If p ≤ 0.05, then the relationship between the independent and dependent
variable is significant and there is an association between the two variables. If p > 0.05,
then there is no significant relationship between the independent and dependent variable.
There are three statistics that are evaluated by a p-value: (1) the ANOVA F-value (as
described above), (2) the t-value for the slope, and (3) the t-value for the intercept. The
t-value in the coefficients table is from a t-test that assesses the significance of the
regression coefficients in the model (i.e., the slope and y-intercept). In simple linear
regression, the F-value equals the square of the slope t-value, so the two tests yield the
same p-value. In other types of regression,
these values will differ and interpretations will become more complex.
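The link between the ANOVA F-value and the slope t-value in simple linear regression can be checked directly in R: with a single independent variable, F equals the square of the slope t-value and both carry the same p-value. The sketch below again assumes the built-in cars dataset as a stand-in; the object names are illustrative.

```r
# Built-in 'cars' data used as a stand-in: speed is X, dist is Y.
fit <- lm(dist ~ speed, data = cars)

# Coefficients table: t-values and p-values for the intercept and slope.
coefs <- summary(fit)$coefficients
t_slope <- coefs["speed", "t value"]
p_slope <- coefs["speed", "Pr(>|t|)"]

# ANOVA table: F-value and its p-value for the independent variable.
aov_tab <- anova(fit)
f_value <- aov_tab["speed", "F value"]
p_f <- aov_tab["speed", "Pr(>F)"]

# With one independent variable, F = t^2 and the two p-values agree.
print(c(F = f_value, t_squared = t_slope^2))
print(c(p_from_F = p_f, p_from_t = p_slope))
```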
Another important factor to consider is the R² value. The R² value is defined as the square
of the correlation coefficient and is an indication of how well the linear model describes
the data. Specifically, R² is the proportion of the variation in the Y variable that is
explained by the X variable, which is related to the ability of the
independent variable to predict the dependent variable.
The closer R² is to 1, the better the fit and the closer the data points are to the regression
line. Conversely, the closer R² is to 0, the worse the fit and the farther the data
points are from the regression line. Outliers (data points that do not follow the trend)
tend to lower the R² value. For example, R² = 0.59 indicates that the data
points are not tightly clustered along the regression line. Another way to look at R² is to
turn the R² value into a percentage. If R² = 0.59, this means that 59% of the variation in
the dependent variable (y) is explained by the independent variable (x). The remaining
41% is due to factors not included in the model, including random variation. The magnitude of R² is
interpreted differently by various disciplines. Thus, check with your advisor for what is
considered weak and strong in your field.
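The definition of the coefficient of determination as the squared correlation coefficient, and as the share of variation in Y explained by the model, can be verified with a short sketch. It assumes the built-in cars dataset as a stand-in, and the object names are illustrative.

```r
# Built-in 'cars' data used as a stand-in: speed is X, dist is Y.
fit <- lm(dist ~ speed, data = cars)

# Coefficient of determination as reported by the regression summary.
r_squared <- summary(fit)$r.squared

# Route 1: square of Pearson's correlation coefficient between X and Y.
r <- cor(cars$speed, cars$dist)
r_squared_from_r <- r^2

# Route 2: 1 - (unexplained variation / total variation in Y).
ss_res <- sum(resid(fit)^2)                      # residual sum of squares
ss_tot <- sum((cars$dist - mean(cars$dist))^2)   # total sum of squares
r_squared_from_ss <- 1 - ss_res / ss_tot

# Express as a percentage of variation in Y explained by X.
percent_explained <- 100 * r_squared
```

All three values agree for a simple linear regression with an intercept; the equality with the squared correlation does not carry over unchanged to models with several independent variables.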
Generalized Hypotheses
Based on the linear model, Y = b0 + b1 X1, a simple linear regression analysis attempts to
predict the dependent variable given one independent variable. If the dependent variable
can be predicted from the independent variable, then the predicted value of Y increases or
decreases with the X-value according to the slope and intercept of the linear
model. With this in mind, the following
general hypotheses can be formulated for a regression analysis.
Association of the independent variable and the dependent variable (from
ANOVA table, assessed with an F-value):
H0: The independent variable (X) has no association with the dependent variable (Y).
H1: The independent variable (X) has an association with the dependent variable (Y).
Slope (from coefficients table, assessed with a t-value):
H0: The slope of the linear model is zero; therefore, the independent variable (X) has
no relationship with the dependent variable (Y).
H1: The slope of the linear model is not zero; therefore, there is a relationship between
the independent and dependent variable.
y-intercept (from coefficients table, assessed with a t-value):
H0: The y-intercept is equal to zero.
H1: The y-intercept is not equal to zero.
A good model rejects the null hypothesis in all three categories: (1) association of
independent and dependent variables, (2) slope, and (3) y-intercept. The two coefficients
generated by a simple linear regression (slope and intercept) can be input into the
regression equation Y = b0 + b1 X1 to predict values of Y from X and to describe the
regression line. Before a linear regression can be applied, the following assumptions must
first be satisfied.
Assumptions
1. Data type – A simple linear regression assumes that both the independent and
dependent variable are continuous.
2. Distribution – The regression analysis assumes a normal distribution of the
residuals (errors). The residuals can be tested using the skewness and kurtosis tests
described in Chapter 2. The analysis also assumes that there are no significant outliers.
3. Independent samples – Observations should be independent.
4. Homogeneous variance – In addition to meeting the normality assumption, a
regression analysis assumes a homogeneous variance. When graphed, the
residuals should not display heteroscedasticity (a sideways cone
shape). Instead, they should be homoscedastic (points spread relatively evenly
across the graph because they have approximately the same variance). This
can be confirmed by graphing the residuals in a residual plot (refer to the box on
the following page).
5. Linearity – There is a linear relationship between the independent and dependent variables.
As a reminder, running an analysis without meeting the assumption(s) may mean that
the results are not valid. In the case that one or more of these assumptions are not fully
satisfied, then either transforming one of the variables or invoking the nonparametric
version of the regression would be best.
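One possible way to screen the residual assumptions in R is sketched below. It assumes the built-in cars dataset as a stand-in, and it uses the Shapiro-Wilk test as a substitute for the skewness and kurtosis tests described in Chapter 2, so treat it as an alternative check rather than the procedure this book prescribes.

```r
# Built-in 'cars' data used as a stand-in: speed is X, dist is Y.
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)        # residuals (errors)
fitted_y <- fitted(fit)  # values of Y predicted by the model

# Normality of residuals: Shapiro-Wilk test (substitute for the Chapter 2 tests).
# A small p-value suggests the residuals depart from a normal distribution.
sw <- shapiro.test(res)
print(sw)

# Homogeneous variance and linearity: residual plot, drawn when interactive.
# Look for an even band around zero with no cone shape and no curve.
if (interactive()) {
  plot(fitted_y, res, xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)
  qqnorm(res)   # points near the line support normality
  qqline(res)
}
```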
11.2 Case Study
An undergraduate group conducted a study examining the number of streptomycin-resistant
strains of bacteria at a number of water sites found at varying distances from a
Colorado cattle farm. A total of 12 different sites were examined, starting from Greeley,
CO (the site of cattle farming) to Nederland, CO (73 km from cattle farming). Cattle farms
are known to use large amounts of antibiotics, and the concern is that an increase in
antibiotic usage may also increase the antibiotic-resistant bacteria found in the watershed
downstream from the farm. The undergraduate group wanted to know whether bacterial
resistance could be predicted based on distance from Greeley, CO.
Formulate a question about the data that can be addressed by performing a linear
regression.
Question: Does distance from Greeley, CO predict the number of resistant bacteria?