# 8 Pearson/Spearman's Correlation R Tutorial

produce enough sperm to compete with other males’ sperm when females copulate with

two males in close succession. The more sperm a male produces, the greater the

likelihood his sperm will fertilize the ovum.

However, monogamous groups, like Hamadryas baboons, are usually associated with

smaller testicular volume in relation to body mass, as they do not need to regularly

participate in sperm competition for successful production of offspring. When comparing

primate testicular volume across species, it is often done so in conjunction with body

mass in order to make the testicular volumes comparable across species of different sizes.

Before utilizing the measures together, it is best to run a correlation to verify if there is a

relationship between testicular volume and body size. Usually, age is also taken into

account to ensure males are at the age of reproduction when testing correlation

hypotheses between testicular volume and body mass.

*The data used for this tutorial were taken from the study conducted by Jolly and Phillips-Conroy (2003).

Formulate a question about the data that can be addressed by performing a

correlation.

Question: Are testicular volume and body mass of Hamadryas baboons associated

with one another?

Based on the question, formulate the null and alternative hypotheses.

Null Hypothesis (H0): There is no relationship between testicular volume and body mass in Hamadryas baboons.

Alternative Hypothesis (H1): There is a relationship between testicular volume and body mass in Hamadryas baboons.

Now that an appropriate testable question has been developed along with a set of testable

hypotheses, you can run the statistical analysis.

This tutorial focuses on running correlations in R.

Refer to Chapter 15 for R-specific terminology and instructions on how to invoke and

construct code.

Check all assumptions prior to running the test.
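One way to examine the normality assumption in R is sketched below. It uses the Shapiro-Wilk test from base R (shapiro.test), which is our assumption and may differ from the procedure the book intends. The values stored in tv are hypothetical placeholders, not the study data.

```r
# Hypothetical placeholder values (NOT the Jolly and Phillips-Conroy data)
tv <- c(18.2, 25.1, 21.4, 30.3, 27.8, 22.6, 35.0, 29.4)  # testicular volume, cc

# Shapiro-Wilk test: a small p-value suggests a departure from normality
norm_check <- shapiro.test(tv)
norm_check$statistic
norm_check$p.value

# Visual check, drawn only in an interactive session:
# points far from the reference line suggest non-normality
if (interactive()) {
  qqnorm(tv)
  qqline(tv)
}
```

If the check points to non-normal data, the nonparametric Spearman's correlation is the safer choice; otherwise Pearson's correlation can be considered.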

Pearson's and Spearman's Correlation R Tutorial

1. At the command prompt, type the following testicular volumes (reported in cubic

centimeters or cc) into a vector and store it to a name that describes its contents (e.g.,

tv). Press enter/return.

2. Create a second vector for body mass (in kilograms) by typing in the body mass

measurements from the Hamadryas baboons and storing it to a name such as

bodymass. Press enter/return.

3. The Hmisc package contains a function that calculates both the correlation value (r

or ρ) and the p-value. If you already have it, load the Hmisc package by using the

library() function. If you are unsure if you have installed the package, type in

library(Hmisc) and press enter/return.

If you do not have the package installed, you should see an error message similar to

the one below. To install the package, follow the procedures outlined in Chapter 15.

If the package is installed, you will see the function execute normally, as in the image

below.

4. Apply the rcorr() function to both vectors and select the type of correlation to run with the type= argument, which accepts "pearson" or "spearman". Type your selection within quotation marks, based on what you learned from examining your assumptions (in this case, the data are not normally distributed; therefore, the statistical analysis should be the nonparametric Spearman's rank-order correlation). Press enter/return.

Note: To switch to Pearson's correlation, simply type "pearson" instead of "spearman" in the script.

5. The output will look similar to the one below.

The off-diagonal values in the upper matrix (the entries that are not 1.0) are the correlation coefficient ρ (0.41), indicating the strength of the correlation (which is weak). The lower matrix contains the p-value (0.0238).

In this case, the p-value is 0.0238 and is significant when using α = 0.05, causing us to

reject the null hypothesis and state that in this sample of baboons, testicular volume is

correlated with body mass. The correlation coefficient ρ of 0.41 is positive, indicating a

positive correlation between the two variables. Therefore, as body mass increases, so does

testicular volume.
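The tutorial steps can be sketched in R roughly as follows. The measurements are hypothetical placeholders, not the Jolly and Phillips-Conroy data, so the output will not reproduce ρ = 0.41 or p = 0.0238. The base R cor.test() call is our addition, included so the sketch runs even when Hmisc is not installed.

```r
# Hypothetical placeholder measurements (NOT the study data)
tv       <- c(18.2, 25.1, 21.4, 30.3, 27.8, 22.6, 35.0, 29.4)  # testicular volume, cc
bodymass <- c(16.5, 19.8, 17.2, 21.0, 18.9, 20.3, 22.4, 19.1)  # body mass, kg

# Base R: Spearman's rank-order correlation and its p-value
sp <- cor.test(tv, bodymass, method = "spearman")
sp$estimate  # rho
sp$p.value

# Hmisc route described in the tutorial, run only if the package is installed
if (requireNamespace("Hmisc", quietly = TRUE)) {
  res <- Hmisc::rcorr(tv, bodymass, type = "spearman")
  print(res$r)  # correlation matrix (1.0 on the diagonal)
  print(res$P)  # matrix of p-values
}
```

Replacing "spearman" with "pearson" in either call switches to Pearson's correlation.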

Concluding Statement

In this sample of baboons, testicular volume (cubic centimeters) shows a weak but significant positive correlation with body mass (kilograms) (ρ = 0.41, p-value = 0.0238).

11 Linear Regression

Learning Outcomes:

By the end of this chapter, you should be able to:

1. Evaluate the relationship between independent and dependent variables.

2. Use statistical programs to run a regression analysis, determine the significance of F, and interpret the R² value for your analysis.

3. Use the skills acquired to analyze and evaluate your own dataset from

independent research.

11.1 Linear Regression Background

As previously explained in Chapter 10, a correlation analysis can potentially determine

whether a linear relationship exists between two variables, as well as indicate the strength

of the relationship (r, ρ). It is important to remember that the output generated from the

correlation analysis can give the researcher a quantifiable value describing the

relationship, but it is unable to determine if X (the independent variable) can predict Y

(the dependent variable). If researchers are interested in prediction, then a simple linear

regression is an appropriate test to run.

A simple linear regression is part of the general linear model (GLM), which also includes ANOVA, analysis of covariance (ANCOVA), and t-tests, among others. Simple linear regression examines the level of change in one variable (dependent or response) due to another variable (independent or explanatory). In addition, a linear regression quantifies this change and provides a measure (R², or the coefficient of determination) of the variation in the dependent variable (Y) explained by the independent variable (X). See Figure 11.1 for a graphical representation of a typical regression analysis. The linear model can be utilized to predict Y based on the values of X. Under these circumstances, regression is typically applied to develop a linear equation that best describes the relationship between two variables. In other words, it addresses the questions: “Is there a relationship between X and Y? If so, what is the linear equation that best describes the relationship between X and Y?”

Figure 11.1 Scatter plot with regression line representing a typical regression analysis.

If you have two or more independent variables, then multiple regression can be used to

analyze the linear relationship among multiple variables. For example, an epidemiologist

may consider running a simple linear regression on how the level of alcohol consumption

determines the degree of liver dysfunction. However, because alcohol consumption is

often linked with cigarette smoking, the same epidemiologist may consider looking at an

additional independent variable, cigarette smoking, to see how alcohol affects the liver

when cigarette smoking is taken into account. Throughout this chapter, we will be

referring to a simple linear regression when referencing linear regression or regression

analysis.

The equation for a line is:

Y = mX + b

This relationship is illustrated in the following graph:

Terms and variables in the equation are:

X = Independent (explanatory) variable

Y = Dependent (response) variable

m = Slope (change in y/change in x)

b = y-intercept (the point where the line crosses the y-axis)

The analogous regression equation is:

Y = b0 + b1 X1

Where

X1 = Independent (explanatory) variable

Y = Dependent (response) variable

b0 = y-intercept

b1 = Slope (regression coefficient for X1)

The regression equation describes the regression line that is fit through the data points. It

also allows you to predict values of Y from values of X. Similar to depicting the data

points of a correlation analysis in the form of an XY scatter plot, a dataset can also be

expressed as an XY scatter plot for linear regression, along with the best-fit line plotted

through the points. In other words, the regression line reflects the best linear model

associated with the data. It is important to be able to draw conclusions from a graph,

specifically from the orientation of the slope (whether it is positive or negative). The

orientation of the slope gives insight into the type of relationship x and y have with one

another. We will be discussing this in more detail with respect to the R² value. Figure 11.2

depicts different slope orientations while examining the relationship between x and y.

Figure 11.2 Graphs depicting the spread around the trend line. Orientation of the slope

determines the type of relationship between x and y and R² describes the strength of the

relationship.
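As a minimal sketch of how the regression equation Y = b0 + b1 X1 is obtained and used for prediction, the code below fits a line with lm() from base R. The x and y values are hypothetical and chosen only for illustration.

```r
# Hypothetical illustrative values (not from the book)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(3.1, 2.4, 5.8, 4.9, 7.7, 6.2, 9.5, 8.1, 11.9, 9.8)

# Fit the simple linear regression Y = b0 + b1 * X1
fit <- lm(y ~ x)
b0 <- unname(coef(fit)["(Intercept)"])  # y-intercept
b1 <- unname(coef(fit)["x"])            # slope

# Predict Y for a new X value inside the observed range
new_x <- 5.5
y_hat <- predict(fit, newdata = data.frame(x = new_x))
y_manual <- b0 + b1 * new_x             # same prediction from the equation

# Scatter plot with the best-fit line, drawn only in an interactive session
if (interactive()) {
  plot(x, y)
  abline(fit)
}
```

A positive b1 corresponds to an upward-sloping line and a negative b1 to a downward-sloping one.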

Regression Nuts and Bolts

A regression analysis determines whether there is an existing relationship between the

independent and dependent variables, and if there is, a regression can also describe the

quantifiable change in the dependent variable due to the change in the independent

variable.

The p-value associated with the F-value in the ANOVA table reflects if the independent

variable in the simple linear regression model has a relationship with the dependent

variable. If p ≤ 0.05, then the relationship between the independent and dependent

variable is significant and there is an association between the two variables. If p > 0.05,

then there is no significant relationship between the independent and dependent variable.

Three statistics are evaluated with a p-value: (1) the ANOVA F-value (as described above), (2) the t-value for the slope, and (3) the t-value for the intercept. The t-value in the coefficients table is from a t-test that assesses the significance of the regression coefficients in the model (i.e., the slope and y-intercept). In simple linear regression, the F-value equals the square of the slope t-value, so the two tests yield the same p-value. In other types of regression, these values will differ and interpretations will become more complex.
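The link between the ANOVA F-value and the slope t-value can be checked directly in R. In the sketch below, which uses hypothetical x and y values, the F-value of a simple linear regression is the square of the slope t-value and the two p-values coincide.

```r
# Hypothetical illustrative values (not from the book)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(3.1, 2.4, 5.8, 4.9, 7.7, 6.2, 9.5, 8.1, 11.9, 9.8)

fit <- lm(y ~ x)
s <- summary(fit)

s$coefficients  # coefficients table: estimates, t-values, p-values
anova(fit)      # ANOVA table with the F-value

t_slope <- s$coefficients["x", "t value"]
p_slope <- s$coefficients["x", "Pr(>|t|)"]

f_value <- unname(s$fstatistic["value"])
p_f <- unname(pf(s$fstatistic["value"], s$fstatistic["numdf"],
                 s$fstatistic["dendf"], lower.tail = FALSE))

t_slope^2  # matches f_value
f_value
```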

Another important factor to consider is the R² value. In simple linear regression, R² is the square of the correlation coefficient and indicates how well the linear model describes the data. Specifically, R² describes how much of the variation in the Y variable can be explained by the variation in the X variable, which is related to the ability of the independent variable to predict the dependent variable.

The closer R² is to 1, the better the fit and the closer the data points are to the regression line. Alternatively, the closer R² is to 0, the worse the fit and the farther the data points are from the regression line. Outliers (data points that do not follow the trend) can lower the R² value. For example, R² = 0.59 indicates that the data points are not tightly clustered along the regression line. Another way to look at R² is to turn it into a percentage. If R² = 0.59, then 59% of the variation in the dependent variable (y) is explained by the independent variable (x). The remaining 41% is not explained by the model and reflects factors that were not considered, as well as random variation. The magnitude of R² is interpreted differently by various disciplines. Thus, check with your advisor for what is considered weak and strong in your field.
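To make the definition concrete, the sketch below computes R² three ways for a simple linear regression: from the sums of squares, as the squared Pearson correlation, and as reported by summary(). The x and y values are hypothetical.

```r
# Hypothetical illustrative values (not from the book)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(3.1, 2.4, 5.8, 4.9, 7.7, 6.2, 9.5, 8.1, 11.9, 9.8)

fit <- lm(y ~ x)

ss_res <- sum(residuals(fit)^2)   # variation left unexplained by the model
ss_tot <- sum((y - mean(y))^2)    # total variation in Y

r2_manual <- 1 - ss_res / ss_tot  # proportion of variation explained
r2_cor    <- cor(x, y)^2          # squared correlation coefficient
r2_lm     <- summary(fit)$r.squared

c(manual = r2_manual, cor_squared = r2_cor, lm = r2_lm)
round(100 * r2_lm, 1)             # R-squared expressed as a percentage
```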

Generalized Hypotheses

Based on the linear model, Y = b0 + b1 X1, a simple linear regression analysis attempts to predict the dependent variable given one independent variable. If the dependent variable can be predicted from the independent variable, then the predicted value of Y increases or decreases with the X-value, at a rate given by the slope. With this in mind, the following general hypotheses can be formulated for a regression analysis.

Association of the independent variable and the dependent variable (from

ANOVA table, assessed with an F-value):

H0: The independent variable (X) has no association with the dependent variable (Y).

H1: The independent variable (X) has an association with the dependent variable (Y).

Slope (from coefficients table, assessed with a t-value):

H0: The slope of the linear model is zero; therefore, the independent variable (X) has

no relationship with the dependent variable (Y).

H1: The slope of the linear model is not zero; therefore, there is a relationship between

the independent and dependent variable.

y-intercept (from coefficients table, assessed with a t-value):

H0: The y-intercept is equal to zero.

H1: The y-intercept is not equal to zero.

A good model rejects the null hypothesis in all three categories: (1) association of

independent and dependent variables, (2) slope, and (3) y-intercept. The two coefficients

generated by a simple linear regression (slope and intercept) can be input into the

regression equation Y = b0 + b1 X1 to predict values of Y from X and to describe the

regression line. Before a linear regression can be applied, the following assumptions must

first be satisfied.

Assumptions

1. Data type – A simple linear regression assumes that both the independent and

dependent variable are continuous.

2. Distribution – The regression analysis assumes a normal distribution of the

residuals (errors). The residuals can be tested using the skewness and kurtosis tests

described in Chapter 2. Also, there are no significant outliers.

3. Independent samples – Observations should be independent.

4. Homogeneous variance – In addition to meeting the normality assumption, a regression analysis assumes a homogeneous variance. When graphed, the residuals should not display heteroscedasticity (a sideways cone shape). Instead, they should be homoscedastic (points spread relatively evenly across the graph because they have approximately the same variance). This can be confirmed by graphing the residuals in a residual plot (refer to the box on the following page).

5. Linearity – There is a linear relationship between the independent and dependent variables.

As a reminder, running an analysis without meeting the assumption(s) may mean that

the results are not valid. In the case that one or more of these assumptions are not fully

satisfied, then either transforming one of the variables or invoking the nonparametric

version of the regression would be best.
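The residual-based assumptions can be examined roughly as sketched below. The data are hypothetical, and the Shapiro-Wilk test is our substitution for the skewness and kurtosis tests of Chapter 2, which are not shown here.

```r
# Hypothetical illustrative values (not from the book)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(3.1, 2.4, 5.8, 4.9, 7.7, 6.2, 9.5, 8.1, 11.9, 9.8)

fit <- lm(y ~ x)
res      <- residuals(fit)  # observed minus fitted values
fitted_y <- fitted(fit)

# Normality of the residuals (assumption 2)
res_normality <- shapiro.test(res)
res_normality$p.value

# Residual plot (assumptions 4 and 5), drawn only in an interactive session:
# an even band around zero suggests homoscedasticity, a cone shape suggests
# heteroscedasticity, and a curved pattern suggests a nonlinear relationship
if (interactive()) {
  plot(fitted_y, res, xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)
}
```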

11.2 Case Study

An undergraduate group conducted a study examining the number of streptomycin-resistant strains of bacteria at water sites located at varying distances from a Colorado cattle farm. A total of 12 different sites were examined, starting from Greeley, CO (the site of cattle farming) to Nederland, CO (73 km from cattle farming). Cattle farms are known to use large amounts of antibiotics, and the concern is that an increase in antibiotic usage may also increase antibiotic-resistant bacteria found in the watershed downstream from the farm. The undergraduate group was interested to know if bacterial resistance could be predicted based on distance from Greeley, CO.

Formulate a question about the data that can be addressed by performing a linear

regression.

Question: Does distance from Greeley, CO predict the number of resistant bacteria?