Chapter 3. A Short R Tutorial
of the first item displayed in the row is 1. In each of these cases, there is also only
one element in the vector.
You can construct longer vectors using the c(...) function. (c stands for “combine.”) For example:
> c(0, 1, 1, 2, 3, 5, 8)
[1] 0 1 1 2 3 5 8
is a vector that contains the first seven elements of the Fibonacci sequence. As an
example of a vector that spans multiple lines, let’s use the sequence operator to
produce a vector with every integer between 1 and 50:
> 1:50
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
[23] 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
[45] 45 46 47 48 49 50
Notice the numbers in the brackets on the left-hand side of the results. These indicate
the index of the first element shown in each row.
When you perform an operation on two vectors, R will match the elements of the
two vectors pairwise and return a vector. For example:
> c(1, 2, 3, 4) + c(10, 20, 30, 40)
[1] 11 22 33 44
> c(1, 2, 3, 4) * c(10, 20, 30, 40)
[1]  10  40  90 160
> c(1, 2, 3, 4) - c(1, 1, 1, 1)
[1] 0 1 2 3
If the two vectors aren’t the same size, R will repeat the smaller sequence multiple
times:
> c(1, 2, 3, 4) + 1
[1] 2 3 4 5
> 1 / c(1, 2, 3, 4, 5)
[1] 1.0000000 0.5000000 0.3333333 0.2500000 0.2000000
> c(1, 2, 3, 4) + c(10, 100)
[1] 11 102 13 104
> c(1, 2, 3, 4, 5) + c(10, 100)
[1] 11 102 13 104 15
Warning message:
In c(1, 2, 3, 4, 5) + c(10, 100) :
longer object length is not a multiple of shorter object length
Note the warning when the length of the longer vector isn’t a multiple of the length of the shorter one.
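One way to convince yourself of how the shorter vector is repeated is to build the repeated vector by hand and compare. This is only a sketch: it assumes the base function rep with its length.out argument, and the variable names are made up for illustration.

```r
# Illustrative names; only base R is used.
long_vec  <- c(1, 2, 3, 4)
short_vec <- c(10, 100)

# Repeat the shorter vector until it is as long as the longer one.
recycled <- rep(short_vec, length.out = length(long_vec))
recycled

# Adding the hand-built vector gives the same result as letting R
# repeat the shorter vector on its own.
identical(long_vec + short_vec, long_vec + recycled)
```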
In R, you can also enter expressions with characters:
> "Hello world."
[1] "Hello world."
This is called a character vector in R. This example is actually a character vector of
length 1. Here is an example of a character vector of length 2:
> c("Hello world", "Hello R interpreter")
[1] "Hello world"         "Hello R interpreter"
(In other languages, like C, “character” refers to a single character, and an ordered
set of characters is called a string. A string in C is equivalent to a character value in R.)
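The difference between the length of a character vector and the length of the strings inside it is easy to check. The sketch below assumes the base functions length and nchar; the variable name greetings is just for illustration.

```r
greetings <- c("Hello world", "Hello R interpreter")

# length() counts the elements of the vector, not the characters.
length(greetings)

# nchar() counts the characters in each element.
nchar(greetings)
```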
You can add comments to R code. Anything after a pound sign (“#”) on a line is
ignored:
> # Here is an example of a comment at the beginning of a line
> 1 + 2 + # and here is an example in the middle
+ 3
[1] 6
Functions
In R, the operations that do all the work are called functions. We’ve already used a
few functions above (you can’t do anything interesting in R without them). Functions are just like what you remember from math class. Most functions are in the
following form:
f(argument1, argument2, ...)
where f is the name of the function, and argument1, argument2, ... are the arguments
to the function. Here are a few more examples:
> exp(1)
[1] 2.718282
> cos(3.141593)
[1] -1
> log2(1)
[1] 0
In each of these examples, the functions took only one argument. Many functions
require more than one argument. You can specify the arguments by name:
> log(x=64, base=4)
[1] 3
Or, if you give the arguments in the default order, you can omit the names:
> log(64,4)
[1] 3
Not all functions are of the form f(...). Some of them are in the form of operators.1 For example, we used the addition operator (“+”) above. Here are a few examples of operators:
> 17 + 2
[1] 19
> 2 ^ 10
[1] 1024
> 3 == 4
[1] FALSE
We’ve seen the first one already: it’s just addition. The second operator is the exponentiation operator, which is interesting because it’s not a commutative operator.
The third operator is the equality operator. (Notice that the result returned is
FALSE; R has a Boolean data type.)
1. When you enter a binary or unary operator into R, the R interpreter will actually translate the
operator into a function; there is a function equivalent for each operator. We’ll talk about this
more in Chapter 5.
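To make the points about named arguments and operators concrete, here is a short sketch. It relies on the fact that an operator can be called as an ordinary function by quoting its name with backticks; the variable names are only illustrative.

```r
# The addition operator called in the usual way and as a function.
operator_result <- 17 + 2
function_result <- `+`(17, 2)
operator_result == function_result

# Named arguments can be given in any order.
named_log <- log(base = 4, x = 64)
named_log

# Exponentiation is not commutative: swapping the operands changes the result.
2^10
10^2
```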
Variables
Like most other languages, R lets you assign values to variables and refer to them by
name. In R, the assignment operator is <-. Usually, this is pronounced as “gets.” For
example, the statement:
x <- 1
is usually read as “x gets 1.” (If you’ve ever done any work with theoretical computer
science, you’ll probably like this notation: it looks just like algorithm pseudocode.)
After you assign a value to a variable, the R interpreter will substitute that value in
place of the variable name when it evaluates an expression. Here’s a simple example:
> x <- 1
> y <- 2
> z <- c(x,y)
> # evaluate z to see what's stored as z
> z
[1] 1 2
Notice that the substitution is done at the time that the value is assigned to z, not
at the time that z is evaluated. Suppose that you were to type in the preceding three
expressions and then change the value of y. The value of z would not change:
> y <- 4
> z
[1] 1 2
I’ll talk more about the subtleties of variables and how they’re evaluated in Chapter 8.
R provides several different ways to refer to a member (or set of members) of a
vector. You can refer to elements by location in a vector:
> b <- c(1,2,3,4,5,6,7,8,9,10,11,12)
> b
[1] 1 2 3 4 5 6 7 8 9 10 11 12
> # let's fetch the 7th item in vector b
> b[7]
[1] 7
> # fetch items 1 through 6
> b[1:6]
[1] 1 2 3 4 5 6
> # fetch only members of b that are congruent to zero (mod 3)
> # (in non-math speak, members that are multiples of 3)
> b[b %% 3 == 0]
[1] 3 6 9 12
You can fetch multiple items in a vector by specifying the indices of each item as an
integer vector:
> # fetch items 1 through 6
> b[1:6]
[1] 1 2 3 4 5 6
> # fetch 1, 6, 11
> b[c(1,6,11)]
[1] 1 6 11
You can fetch items out of order. Items are returned in the order they are
referenced:
> b[c(8,4,9)]
[1] 8 4 9
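Because the index is itself just a vector, you can build it with any expression that returns numbers. The sketch below assumes the base functions seq and rev; the name positions is made up for illustration.

```r
b <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)

# Build the index vector 1, 6, 11 with seq() instead of typing it out.
positions <- seq(1, 11, by = 5)
b[positions]

# Reversing the index vector reverses the order of the items returned.
b[rev(positions)]

# The same position may be requested more than once.
b[c(2, 2, 2)]
```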
You can also specify which items to fetch through a logical vector. As an example,
let’s fetch only multiples of 3 (by selecting items that are congruent to 0 mod 3):
> b %% 3 == 0
 [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
[12]  TRUE
> b[b %% 3 == 0]
[1]  3  6  9 12
In R, there are two additional operators that can be used for assigning values to
symbols. First, you can use a single equals sign (“=”) for assignment.2 This operator
assigns the symbol on the left to the object on the right. In many other languages,
all assignment statements use equals signs. If you are more comfortable with this
notation, you are free to use it. However, I will be using only the <- assignment
operator in this book because I think it is easier to read. Whichever notation you
prefer, be careful because the = operator does not mean “equals.” For that, you need
to use the == operator:
> one <- 1
> two <- 2
> # This means: assign the value of "two" to the variable "one"
> one = two
> one
[1] 2
> two
[1] 2
> # let's start again
> one <- 1
> two <- 2
> # This means: does the value of "one" equal the value of "two"
> one == two
[1] FALSE
In R, you can also assign an object on the left to a symbol on the right:
> 3 -> three
> three
[1] 3
In some programming contexts, this notation might help you write clearer code. (It
may also be convenient if you type in a long expression and then realize that you
have forgotten to assign the result to a symbol.)
2. Note that you cannot use the <- operator when passing arguments to a function; you need to
map values to argument names using the “=” symbol. Using the <- operator in a function will
assign the value to the variable in the current environment and then pass the value returned
to the function. This might be what you want, but it probably isn’t.
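The caution in footnote 2 about using <- inside a function call is easier to see with a small experiment. This is a sketch, not a recommendation; the variable name readings is hypothetical, and mean is used only because it is a familiar base function whose first argument is named x.

```r
# "=" inside a call only matches the value to the argument named x.
mean(x = c(2, 4, 6))

# "<-" inside a call first creates the variable readings in the calling
# environment, and then passes its value as the first argument.
mean(readings <- c(2, 4, 6))

# The side effect is visible afterward: readings now exists.
readings
```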
A function in R is just another object that is assigned to a symbol. You can define
your own functions in R, assign them a name, and then call them just like the built-in functions:
> f <- function(x,y) {c(x+1, y+1)}
> f(1,2)
[1] 2 3
This leads to a very useful trick. You can often type the name of a function to see
the code for it. Here’s an example:
> f
function(x,y) {c(x+1, y+1)}
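Since a function is just an object bound to a symbol, you can bind the same function to a second name or hand it to another function. The sketch below uses hypothetical names (increment_both, same_function) and the base function sapply.

```r
increment_both <- function(x, y) { c(x + 1, y + 1) }

# Binding the function object to a second symbol.
same_function <- increment_both
same_function(1, 2)

# The object's class confirms that it is a function.
class(increment_both)

# A function can also be passed as an argument without ever being named.
squares <- sapply(c(1, 2, 3), function(n) n^2)
squares
```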
Introduction to Data Structures
In R, you can construct more complicated data structures than just vectors. An
array is a multidimensional vector. Vectors and arrays are stored the same way internally, but an array may be displayed differently and accessed differently. An array
object is just a vector that’s associated with a dimension attribute. Here’s a simple
example.
First, let’s define an array explicitly:
> a <- array(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12), dim=c(3, 4))
Here is what the array looks like:
> a
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
And here is how you reference one cell:
> a[2,2]
[1] 5
Now, let’s define a vector with the same contents:
> v <- c(1,2,3,4,5,6,7,8,9,10,11,12)
> v
[1] 1 2 3 4 5 6 7 8 9 10 11 12
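You can check the claim that an array is just a vector with a dimension attribute by attaching one yourself. This is a minimal sketch that assumes the base functions dim and identical; the name reshaped is made up for illustration.

```r
a <- array(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12), dim = c(3, 4))
v <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)

# A plain vector has no dimension attribute.
dim(v)

# Attach a dimension attribute to a copy of the vector.
reshaped <- v
dim(reshaped) <- c(3, 4)

# The copy now has the same dimensions and contents as the array.
dim(reshaped)
all(reshaped == a)
```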
A matrix is just a two-dimensional array:
> m <- matrix(data=c(1,2,3,4,5,6,7,8,9,10,11,12),nrow=3,ncol=4)
> m
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
Arrays can have more than two dimensions. For example:
> w <- array(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18),dim=c(3,3,2))
> w
, , 1

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

, , 2

     [,1] [,2] [,3]
[1,]   10   13   16
[2,]   11   14   17
[3,]   12   15   18

> w[1,1,1]
[1] 1
R uses very clean syntax for referring to part of an array. You specify separate indices
for each dimension, separated by commas:
> a[1,2]
[1] 4
> a[1:2,1:2]
     [,1] [,2]
[1,]    1    4
[2,]    2    5
To get all rows (or columns) from a dimension, simply omit the indices:
> # first row only
> a[1,]
[1]  1  4  7 10
> # first column only
> a[,1]
[1] 1 2 3
> # you can also refer to a range of rows
> a[1:2,]
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
> # you can even refer to a noncontiguous set of rows
> a[c(1,3),]
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    3    6    9   12
In all the examples above, we’ve just looked at data structures based on a single
underlying data type. In R, it’s possible to construct more complicated structures
with multiple data types. R has a built-in data type for mixing objects of different
types, called lists. Lists in R are subtly different from lists in many other languages.
Lists in R may contain a heterogeneous selection of objects. You can name each
component in a list. Items in a list may be referred to by either location or name.
Here is an example of a list with two named components:
> # a list containing two strings
> e <- list(thing="hat", size="8.25")
> e
$thing
[1] "hat"
$size
[1] "8.25"
You may access an item in the list in multiple ways:
> e$thing
[1] "hat"
> e[1]
$thing
[1] "hat"
> e[[1]]
[1] "hat"
A list can even contain other lists:
> g <- list("this list references another list", e)
> g
[[1]]
[1] "this list references another list"
[[2]]
[[2]]$thing
[1] "hat"
[[2]]$size
[1] "8.25"
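The three access forms shown above do not all return the same kind of object, which is easy to overlook. The sketch below rebuilds the list e from the example and inspects each result with the base function class.

```r
e <- list(thing = "hat", size = "8.25")

# Single brackets return a shorter list that still wraps the item.
class(e[1])

# Double brackets and $ return the item itself.
class(e[[1]])
class(e$thing)

# Double brackets also accept a component name.
e[["size"]]

# names() lists the component names.
names(e)
```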
A data frame is a list that contains multiple named vectors that are the same length.
A data frame is a lot like a spreadsheet or a database table. Data frames are particularly good for representing experimental data. As an example, I’m going to use
some baseball data. Let’s construct a data frame with the win/loss results in the
National League (NL) East in 2008:
> teams <- c("PHI","NYM","FLA","ATL","WSN")
> w <- c(92, 89, 94, 72, 59)
> l <- c(70, 73, 77, 90, 102)
> nleast <- data.frame(teams,w,l)
> nleast
  teams  w   l
1   PHI 92  70
2   NYM 89  73
3   FLA 94  77
4   ATL 72  90
5   WSN 59 102
You can refer to the components of a data frame (or items in a list) by name using
the $ operator:
> nleast$w
[1] 92 89 94 72 59
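Because each component of a data frame is an ordinary vector, you can combine columns with the same element-by-element arithmetic used for vectors. Here is a small sketch that rebuilds the data frame from the example and computes a winning percentage; the name win_pct is made up for illustration.

```r
teams <- c("PHI", "NYM", "FLA", "ATL", "WSN")
w <- c(92, 89, 94, 72, 59)
l <- c(70, 73, 77, 90, 102)
nleast <- data.frame(teams, w, l)

# The component names and the number of rows.
names(nleast)
nrow(nleast)

# Columns are vectors, so arithmetic works element by element.
win_pct <- nleast$w / (nleast$w + nleast$l)
round(win_pct, 3)
```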
Here’s one way to find a specific value in a data frame. Suppose that you wanted to
find the number of losses by the Florida Marlins (FLA). One way to select members
of a vector is to index it with a vector of Boolean values that marks which items to
return. You can calculate an appropriate vector like this:
> nleast$teams=="FLA"
[1] FALSE FALSE TRUE FALSE FALSE
Then you can use this vector to refer to the right element in the losses vector:
> nleast$l[nleast$teams=="FLA"]
[1] 77
You can import data into R from another file or from a database. See Chapter 11 for
more information on how to do this.
In addition to lists, R has other types of data structures for holding a heterogeneous
collection of objects, including formal class definitions through S4 objects.
Objects and Classes
R is an object-oriented language. Every object in R has a type. Additionally, every
object in R is a member of a class. We have already encountered several different
classes: character vectors, numeric vectors, data frames, lists, and arrays.
You can use the class function to determine the class of an object. For example:
> class(teams)
[1] "character"
> class(w)
[1] "numeric"
> class(nleast)
[1] "data.frame"
> class(class)
[1] "function"
Notice the last example: a function is an object in R with the class function.
Some functions are associated with a specific class. These are called methods. (Not
all functions are tied closely to a particular class; the class system in R is much less
formal than that in a language like Java.)
In R, methods for different classes can share the same name. These are called generic
functions. Generic functions serve two purposes. First, they make it easy to guess
the right function name for an unfamiliar class. Second, generic functions make it
possible to use the same code for objects of different types.
For example, + is a generic function for adding objects. You can add numbers together with the + operator:
> 17 + 6
[1] 23
You might guess that the addition operator would work similarly with other types
of objects. For example, you can also use the + operator with a date object and a
number:
> as.Date("2009-09-08") + 7
[1] "2009-09-15"
By the way, the R interpreter calls the generic function print on any object returned
on the R console. Suppose that you define x as:
> x <- 1 + 2 + 3 + 4
When you type:
> x
[1] 10
the interpreter actually calls the function print(x) to print the results. This means
that if you define a new class, you can define a print method to specify how objects
from that new class are printed on the console. Some functions take advantage of
this functionality to do other things when you enter an expression on the console.3
I’ll talk about objects in more depth in Chapter 7 and classes in Chapter 10.
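To illustrate how a print method controls what the console shows, here is a minimal sketch using R's informal (S3) class system. The class name season_record and its components wins and losses are hypothetical, chosen only for this example.

```r
# Create an object and tag it with a made-up class name.
record <- structure(list(wins = 92, losses = 70), class = "season_record")

# Define a print method for that class. The generic function print will
# dispatch to this method for any object of class season_record.
print.season_record <- function(x, ...) {
  cat("Record:", x$wins, "wins,", x$losses, "losses\n")
  invisible(x)
}

# Typing the object's name at the console calls print() on it;
# calling print() explicitly does the same thing.
record
print(record)
```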
Models and Formulas
To statisticians, a model is a concise way to describe a set of data, usually with a
mathematical formula. Sometimes, the goal is to build a predictive model with
training data to predict values based on other data. Other times, the goal is to build
a descriptive model that helps you understand the data better.
R has a special notation for describing relationships between variables. Suppose that
you are assuming a linear model for a variable y, predicted from the variables x1,
x2, ..., xn. (Statisticians usually refer to y as the dependent variable, and x1, x2, ...,
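xn as the independent variables.) In equation form, this implies a relationship like:
$y = c_0 + c_1 x_1 + c_2 x_2 + \cdots + c_n x_n + \varepsilon$
where c0, c1, ..., cn are the coefficients to be estimated and ε is an error term.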
In R, you would write the relationship as y ~ x1 + x2 + ... + xn, which is a formula
object.
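A formula is an ordinary R object that you can store and inspect before fitting anything. The sketch below assumes the base function all.vars; the variable names y, x1, and x2 do not need to exist, because building a formula does not evaluate them.

```r
# Build a formula object; y, x1, and x2 are only symbols at this point.
model_formula <- y ~ x1 + x2

# The object's class is "formula".
class(model_formula)

# all.vars() lists the variable names that appear in the formula.
all.vars(model_formula)
```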
3. A very important example of this is lattice graphics. Plotting functions in the lattice package
return lattice objects but don’t plot results. If you call a lattice function on the R console, the
console will print the object, thus plotting the results. However, if you call a lattice function
within another function, or in a script, R will not plot the results unless you explicitly print
the lattice object.
As an example, let’s use the cars data set (which is included in the datasets package that ships with R).
This data set was created during the 1920s and shows the speed and stopping distance for a set of different cars. We’ll look at the relationship between speed and
stopping distance. We’ll assume that the stopping distance is a linear function of
speed. So let’s try to use a linear regression to estimate the relationship. The formula
is dist~speed. We’ll use the lm function to estimate the parameters of a linear model.
The lm function returns an object of class lm, which we will assign to a variable called
cars.lm:
> cars.lm <- lm(formula=dist~speed,data=cars)
Now, let’s take a quick look at the results returned:
> cars.lm

Call:
lm(formula = dist ~ speed, data = cars)

Coefficients:
(Intercept)        speed
    -17.579        3.932

As you can see, printing an lm object shows you the original function call (and thus
the data set and formula) and the estimated coefficients. For some more information,
we can use the summary function:
> summary(cars.lm)

Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max
-29.069  -9.525  -2.272   9.215  43.201

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791     6.7584  -2.601   0.0123 *
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511,     Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.490e-12

As you can see, the summary function shows you the function call, the distribution
of the residuals from the fit, the coefficients, and information about the fit. By the
way, it is possible to simply call the lm function or to call summary(lm(...)) and not
assign a name to the model object:
> lm(dist~speed,data=cars)

Call:
lm(formula = dist ~ speed, data = cars)

Coefficients:
(Intercept)        speed
    -17.579        3.932

> summary(lm(dist~speed,data=cars))

Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max
-29.069  -9.525  -2.272   9.215  43.201

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791     6.7584  -2.601   0.0123 *
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511,     Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.490e-12

In some cases, this can be more convenient. However, you often want to perform
additional analyses, such as plotting residuals, calculating additional statistics, or
updating a model to add or subtract variables. By assigning a name to the model,
you can make your code easier to understand and modify. Additionally, refitting a
model can be very time consuming for complex models and large data sets. By assigning the model to a variable name, you can avoid these problems.
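Here is a sketch of the kind of follow-up work that a named model object makes easy. It assumes the standard extractor functions coef, residuals, predict, and update from the stats package; the names predicted_dist and cars.lm2 and the speed value of 20 are chosen only for illustration.

```r
# Fit the model once and keep it under a name.
cars.lm <- lm(formula = dist ~ speed, data = cars)

# Pull out the estimated coefficients and the first few residuals.
coef(cars.lm)
head(residuals(cars.lm))

# Predict the stopping distance for a speed of 20 without refitting.
predicted_dist <- predict(cars.lm, newdata = data.frame(speed = 20))
predicted_dist

# Update the stored model to add a squared-speed term.
cars.lm2 <- update(cars.lm, . ~ . + I(speed^2))
formula(cars.lm2)
```

With the coefficients shown earlier, the prediction works out to roughly -17.579 + 3.932 * 20, or about 61.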
Charts and Graphics
R includes several packages for visualizing data: graphics, grid, and lattice. Usually, you’ll find that functions within the graphics and lattice packages are the most
useful.4 If you’re familiar with Microsoft Excel, you’ll find that R can generate all of
the charts that you’re familiar with: column charts, bar charts, line plots, pie charts,
and scatter plots. Even if that’s all you need, R makes it much easier than Excel to
automate the creation of charts and to customize them. However, there are many,
many more types of charts available in R, many of them quite intuitive and elegant.
To make this a little more interesting, let’s work with some real data. We’re going
to look at all field goal attempts in the National Football League (NFL) in 2005.5
For those of you who aren’t familiar with American football, here’s a quick explanation. A team can attempt to kick a football between a set of goalposts to receive 3
4. Other packages are available for visualizing data. For example, the rggobi package (an R
interface to GGobi) provides tools for creating interactive graphics.
5. The data was provided by Aaron Schatz of Pro Football Prospectus. For more information, see
the Football Outsiders website at http://www.footballoutsiders.com/, or you can find its annual
books at most bookstores—both online and “brick and mortar.”