Chapter 3. A Short R Tutorial


of the first item displayed in the row is 1. In each of these cases, there is also only

one element in the vector.

You can construct longer vectors using the c(...) function. (c stands for “combine.”) For example:

> c(0, 1, 1, 2, 3, 5, 8)

[1] 0 1 1 2 3 5 8

is a vector that contains the first seven elements of the Fibonacci sequence. As an

example of a vector that spans multiple lines, let’s use the sequence operator to

produce a vector with every integer between 1 and 50:

> 1:50

[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22

[23] 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44

[45] 45 46 47 48 49 50

Notice the numbers in the brackets on the left-hand side of the results. These indicate

the index of the first element shown in each row.

When you perform an operation on two vectors, R will match the elements of the two vectors pairwise and return a vector. For example:

> c(1, 2, 3, 4) + c(10, 20, 30, 40)
[1] 11 22 33 44
> c(1, 2, 3, 4) * c(10, 20, 30, 40)
[1] 10 40 90 160
> c(1, 2, 3, 4) - c(1, 1, 1, 1)
[1] 0 1 2 3

If the two vectors aren’t the same size, R will repeat the smaller sequence multiple

times:

> c(1, 2, 3, 4) + 1

[1] 2 3 4 5

> 1 / c(1, 2, 3, 4, 5)

[1] 1.0000000 0.5000000 0.3333333 0.2500000 0.2000000

> c(1, 2, 3, 4) + c(10, 100)

[1] 11 102 13 104

> c(1, 2, 3, 4, 5) + c(10, 100)

[1] 11 102 13 104 15

Warning message:

In c(1, 2, 3, 4, 5) + c(10, 100) :

longer object length is not a multiple of shorter object length

Note the warning when the length of the longer vector isn’t a multiple of the length of the shorter one.
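As a rough illustration of the recycling behavior described above, the sketch below writes the repetition out by hand with rep() and checks that it matches what R does automatically. The variable names (long, short, recycled, explicit) are made up for this example.

```r
# Recycling: the shorter vector is repeated until it matches the longer one.
long  <- c(1, 2, 3, 4)
short <- c(10, 100)

# What R does automatically
recycled <- long + short

# The same computation with the repetition spelled out via rep()
explicit <- long + rep(short, length.out = length(long))

print(recycled)
print(identical(recycled, explicit))
```

Spelling the repetition out this way is only for explanation; in ordinary code you would rely on recycling directly.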

In R, you can also enter expressions with characters:

> "Hello world."

[1] "Hello world."

This is called a character vector in R. This example is actually a character vector of

length 1. Here is an example of a character vector of length 2:

> c("Hello world", "Hello R interpreter")

[1] "Hello world" "Hello R interpreter"


(In other languages, like C, “character” refers to a single character, and an ordered

set of characters is called a string. A string in C is equivalent to a character value in R.)
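To make the distinction concrete, here is a small sketch that contrasts the number of elements in a character vector with the number of characters in each element. The variable name greetings is an assumption introduced for illustration.

```r
greetings <- c("Hello world", "Hello R interpreter")

# length() counts the elements of the vector
n_elements <- length(greetings)

# nchar() counts the characters inside each element
n_chars <- nchar(greetings)

print(n_elements)
print(n_chars)
```

In C terms, each element here plays the role of a string, and the vector is a collection of them.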

You can add comments to R code. Anything after a # character on a line is ignored:

> # Here is an example of a comment at the beginning of a line

> 1 + 2 + # and here is an example in the middle

+ 3

[1] 6

Functions

In R, the operations that do all the work are called functions. We’ve already used a few functions above (you can’t do anything interesting in R without them). Functions are just like what you remember from math class. Most functions are in the following form:

f(argument1, argument2, ...)

where f is the name of the function, and argument1, argument2, ... are the arguments to the function. Here are a few more examples:

> exp(1)
[1] 2.718282
> cos(3.141593)
[1] -1
> log2(1)
[1] 0

In each of these examples, the functions took only one argument. Many functions require more than one argument. You can specify the arguments by name:

> log(x=64, base=4)
[1] 3

Or, if you give the arguments in the default order, you can omit the names:

> log(64,4)
[1] 3

Not all functions are of the form f(...). Some of them are in the form of operators.1 For example, we used the addition operator (“+”) above. Here are a few examples of operators:

> 17 + 2
[1] 19
> 2 ^ 10
[1] 1024
> 3 == 4
[1] FALSE

We’ve seen the first one already: it’s just addition. The second operator is the exponentiation operator, which is interesting because it’s not a commutative operator. The third operator is the equality operator. (Notice that the result returned is FALSE; R has a Boolean data type.)

1. When you enter a binary or unary operator into R, the R interpreter will actually translate the operator into a function call; there is more on this in Chapter 5.
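The two calling conventions above, and the idea that operators are functions in disguise, can be sketched as follows. Calling an operator through backquotes is standard R syntax; the result variable names are made up for this example.

```r
# Named arguments may be supplied in any order
log_named <- log(base = 4, x = 64)

# Positional arguments must follow the order in the function definition
log_positional <- log(64, 4)

# An operator can be called like any other function by quoting its name
sum_call <- `+`(17, 2)
pow_call <- `^`(2, 10)

print(c(log_named, log_positional, sum_call, pow_call))
```

Because exponentiation is not commutative, swapping the arguments of `^` changes the answer, while swapping the arguments of `+` does not.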

Variables

Like most other languages, R lets you assign values to variables and refer to them by

name. In R, the assignment operator is <-. Usually, this is pronounced as “gets.” For

example, the statement:

x <- 1

is usually read as “x gets 1.” (If you’ve ever done any work with theoretical computer

science, you’ll probably like this notation: it looks just like algorithm pseudocode.)

After you assign a value to a variable, the R interpreter will substitute that value in place of the variable name when it evaluates an expression. Here’s a simple example:

> x <- 1
> y <- 2
> z <- c(x,y)
> # evaluate z to see what's stored as z
> z
[1] 1 2

Notice that the substitution is done at the time that the value is assigned to z, not

at the time that z is evaluated. Suppose that you were to type in the preceding three

expressions and then change the value of y. The value of z would not change:

> y <- 4

> z

[1] 1 2

I’ll talk more about the subtleties of variables and how they’re evaluated in Chapter 8.
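A minimal sketch of the behavior just described: z keeps the values that were current when it was assigned, and only re-running the assignment picks up a later change to y. The names z_before and z_after are introduced here purely for illustration.

```r
x <- 1
y <- 2
z <- c(x, y)     # z stores the values 1 and 2, not a link to x and y

y <- 4           # changing y afterwards leaves z untouched
z_before <- z

z <- c(x, y)     # evaluating the expression again uses the new value of y
z_after <- z

print(z_before)
print(z_after)
```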

R provides several different ways to refer to a member (or set of members) of a

vector. You can refer to elements by location in a vector:

> b <- c(1,2,3,4,5,6,7,8,9,10,11,12)

> b

[1] 1 2 3 4 5 6 7 8 9 10 11 12

> # let's fetch the 7th item in vector b

> b[7]

[1] 7

> # fetch items 1 through 6

> b[1:6]

[1] 1 2 3 4 5 6

> # fetch only members of b that are congruent to zero (mod 3)

> # (in non-math speak, members that are multiples of 3)

> b[b %% 3 == 0]

[1] 3 6 9 12

You can fetch multiple items in a vector by specifying the indices of each item as an

integer vector:


> # fetch items 1 through 6

> b[1:6]

[1] 1 2 3 4 5 6

> # fetch 1, 6, 11

> b[c(1,6,11)]

[1] 1 6 11

You can fetch items out of order. Items are returned in the order they are

referenced:

> b[c(8,4,9)]

[1] 8 4 9
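As a small sketch of indexing by an integer vector, the example below stores the positions in a variable first. The name idx and the extra variations (reversing the positions, repeating one) are assumptions added for illustration.

```r
b <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
idx <- c(8, 4, 9)

in_order  <- b[idx]            # items come back in the order requested
reversed  <- b[rev(idx)]       # reversing the positions reverses the result
repeated  <- b[c(1, 1, 12)]    # a position may be listed more than once

print(in_order)
print(reversed)
print(repeated)
```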

You can also specify which items to fetch through a logical vector. As an example,

let’s fetch only multiples of 3 (by selecting items that are congruent to 0 mod 3):

> b %% 3 == 0
 [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
[12]  TRUE
> b[b %% 3 == 0]
[1]  3  6  9 12

In R, there are two additional operators that can be used for assigning values to symbols. First, you can use a single equals sign (“=”) for assignment.2 This operator assigns the symbol on the left to the object on the right. In many other languages, all assignment statements use equals signs. If you are more comfortable with this notation, you are free to use it. However, I will be using only the <- assignment operator in this book because I think it is easier to read. Whichever notation you prefer, be careful because the = operator does not mean “equals.” For that, you need to use the == operator:

> one <- 1
> two <- 2
> # This means: assign the value of "two" to the variable "one"
> one = two
> one
[1] 2
> two
[1] 2
> # let's start again
> one <- 1
> two <- 2
> # This means: does the value of "one" equal the value of "two"
> one == two
[1] FALSE

2. Note that you cannot use the <- operator when passing arguments to a function; you need to map values to argument names using the “=” symbol. Using the <- operator in a function will assign the value to the variable in the current environment and then pass the value returned to the function. This might be what you want, but it probably isn’t.

In R, you can also assign an object on the left to a symbol on the right:

> 3 -> three
> three
[1] 3

In some programming contexts, this notation might help you write clearer code. (It

may also be convenient if you type in a long expression and then realize that you

have forgotten to assign the result to a symbol.)
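The assignment forms discussed above can be compared side by side. The sketch below also illustrates the footnote's warning about using <- inside a function call; the variable names (three_a, three_b, three_c, values, m1, m2) are made up for this example.

```r
# Three ways to bind the value 3 to a name
three_a <- 3
three_b = 3
3 -> three_c

# Inside a call, "=" matches a value to an argument name ...
m1 <- mean(x = c(1, 2, 3))

# ... while "<-" first creates a variable in the calling environment
# and then passes its value as the first argument
m2 <- mean(values <- c(1, 2, 3))

print(c(three_a, three_b, three_c))
print(c(m1, m2))
print(values)    # the side effect of using <- inside the call
```

Both calls return the same mean, but only the second leaves a new variable behind, which is rarely what you intend.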

A function in R is just another object that is assigned to a symbol. You can define your own functions in R, assign them a name, and then call them just like the built-in functions:

> f <- function(x,y) {c(x+1, y+1)}

> f(1,2)

[1] 2 3

This leads to a very useful trick. You can often type the name of a function to see

the code for it. Here’s an example:

> f

function(x,y) {c(x+1, y+1)}
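Here is a slightly larger sketch of a user-defined function, assuming a hypothetical helper called rescale that is not part of the original text. It shows default argument values and that a function can be inspected like any other object.

```r
# rescale() is a made-up helper: it maps a numeric vector onto [0, 1]
rescale <- function(v, lower = min(v), upper = max(v)) {
  (v - lower) / (upper - lower)
}

default_scaled <- rescale(c(2, 4, 6))              # uses both defaults
custom_scaled  <- rescale(c(2, 4, 6), upper = 10)  # overrides one default by name

print(default_scaled)
print(custom_scaled)

# A function is an object: you can ask for its class and its argument names
print(class(rescale))
print(names(formals(rescale)))
```

Typing rescale on its own at the console would display the definition, just as with f above.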

Introduction to Data Structures

In R, you can construct more complicated data structures than just vectors. An

array is a multidimensional vector. Vectors and arrays are stored the same way internally, but an array may be displayed differently and accessed differently. An array

object is just a vector that’s associated with a dimension attribute. Here’s a simple

example.

First, let’s define an array explicitly:

> a <- array(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12), dim=c(3, 4))

Here is what the array looks like:

> a

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

And here is how you reference one cell:

> a[2,2]

[1] 5

Now, let’s define a vector with the same contents:

> v <- c(1,2,3,4,5,6,7,8,9,10,11,12)

> v

[1] 1 2 3 4 5 6 7 8 9 10 11 12

A matrix is just a two-dimensional array:

> m <- matrix(data=c(1,2,3,4,5,6,7,8,9,10,11,12),nrow=3,ncol=4)

> m

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
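The statement that an array is just a vector with a dimension attribute can be checked directly. The following is a minimal sketch; the names v2, same_as_array, same_values, and back_to_vector are introduced only for this example.

```r
v <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
a <- array(v, dim = c(3, 4))
m <- matrix(data = v, nrow = 3, ncol = 4)

print(dim(v))    # a plain vector has no dimension attribute
print(dim(a))

# Attaching a dim attribute to a copy of the vector turns it into the array
v2 <- v
dim(v2) <- c(3, 4)
same_as_array <- identical(v2, a)

# The matrix holds the same values in the same layout
same_values <- all(m == a)

# Dropping the attribute recovers the plain vector
back_to_vector <- as.vector(a)

print(same_as_array)
print(same_values)
```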


Arrays can have more than two dimensions. For example:

> w <- array(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18),dim=c(3,3,2))

> w

, , 1

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

, , 2

     [,1] [,2] [,3]
[1,]   10   13   16
[2,]   11   14   17
[3,]   12   15   18

> w[1,1,1]

[1] 1

R uses very clean syntax for referring to part of an array. You specify separate indices

for each dimension, separated by commas:

> a[1,2]

[1] 4

> a[1:2,1:2]

     [,1] [,2]
[1,]    1    4
[2,]    2    5

To get all rows (or columns) from a dimension, simply omit the indices:

> # first row only

> a[1,]

[1] 1 4 7 10

> # first column only

> a[,1]

[1] 1 2 3

> # you can also refer to a range of rows

> a[1:2,]

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11

> # you can even refer to a noncontiguous set of rows

> a[c(1,3),]

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    3    6    9   12
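The same comma-separated syntax extends to arrays with more than two dimensions. As a sketch, the example below rebuilds a three-dimensional array like the w shown earlier (using 1:18, which produces the same values as integers) and takes a few slices; the result names are assumptions for illustration.

```r
w <- array(1:18, dim = c(3, 3, 2))

second_slice <- w[, , 2]   # omit the first two indices: the whole second 3x3 slice
row2 <- w[2, , ]           # row 2 of every slice, returned as a 3x2 matrix
corner <- w[1, 1, ]        # the top-left cell of each slice

print(second_slice)
print(dim(row2))
print(corner)
```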

In all the examples above, we’ve just looked at data structures based on a single

underlying data type. In R, it’s possible to construct more complicated structures

with multiple data types. R has a built-in data type for mixing objects of different

types, called lists. Lists in R are subtly different from lists in many other languages.

Lists in R may contain a heterogeneous selection of objects. You can name each

component in a list. Items in a list may be referred to by either location or name.


Here is an example of a list with two named components:

> # a list containing two strings

> e <- list(thing="hat", size="8.25")

> e

$thing
[1] "hat"

$size
[1] "8.25"

You may access an item in the list in multiple ways:

> e$thing
[1] "hat"
> e[1]
$thing
[1] "hat"

> e[[1]]
[1] "hat"

A list can even contain other lists:

> g <- list("this list references another list", e)
> g
[[1]]
[1] "this list references another list"

[[2]]
[[2]]$thing
[1] "hat"

[[2]]$size
[1] "8.25"
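The difference between the access forms shown above is easiest to see by looking at what each one returns. This is a small sketch using the same lists e and g; the result variable names are made up.

```r
e <- list(thing = "hat", size = "8.25")
g <- list("this list references another list", e)

sub_list <- e[1]         # single brackets return a list containing the item
element  <- e[[1]]       # double brackets return the item itself
by_name  <- e[["size"]]  # double brackets also accept a component name
nested   <- g[[2]]$thing # accessors can be chained to reach into a nested list

print(class(sub_list))
print(class(element))
print(by_name)
print(nested)
```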

A data frame is a list that contains multiple named vectors that are the same length. A data frame is a lot like a spreadsheet or a database table. Data frames are particularly good for representing experimental data. As an example, I’m going to use some baseball data. Let’s construct a data frame with the win/loss results in the National League (NL) East in 2008:

> teams <- c("PHI","NYM","FLA","ATL","WSN")
> w <- c(92, 89, 94, 72, 59)
> l <- c(70, 73, 77, 90, 102)
> nleast <- data.frame(teams,w,l)
> nleast
  teams  w   l
1   PHI 92  70
2   NYM 89  73
3   FLA 94  77
4   ATL 72  90
5   WSN 59 102
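The description of a data frame as a list of equal-length named vectors can be verified with a few base R calls. This is only a sketch that reuses the data above; the names column_names and column_lengths are introduced for illustration.

```r
teams <- c("PHI", "NYM", "FLA", "ATL", "WSN")
w <- c(92, 89, 94, 72, 59)
l <- c(70, 73, 77, 90, 102)
nleast <- data.frame(teams, w, l)

print(is.list(nleast))                     # a data frame is built on a list

column_names <- names(nleast)              # one named component per column
column_lengths <- sapply(nleast, length)   # every column has the same length

print(column_names)
print(column_lengths)
print(dim(nleast))                         # rows, then columns
```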


You can refer to the components of a data frame (or items in a list) by name using the $ operator:

> nleast$w
[1] 92 89 94 72 59

Here’s one way to find a specific value in a data frame. Suppose that you wanted to find the number of losses by the Florida Marlins (FLA). One way to select a member of a vector is to use a vector of Boolean values that specifies which items to return. You can calculate an appropriate vector like this:

> nleast$teams=="FLA"
[1] FALSE FALSE TRUE FALSE FALSE

Then you can use this vector to refer to the right element in the losses vector:

> nleast$l[nleast$teams=="FLA"]
[1] 77

You can import data into R from another file or from a database; see Chapter 11.

In addition to lists, R has other types of data structures for holding a heterogeneous collection of objects, including formal class definitions through S4 objects.

Objects and Classes

R is an object-oriented language. Every object in R has a type. Additionally, every object in R is a member of a class. We have already encountered several different classes: character vectors, numeric vectors, data frames, lists, and arrays.

You can use the class function to determine the class of an object. For example:

> class(teams)
[1] "character"
> class(w)
[1] "numeric"
> class(nleast)
[1] "data.frame"
> class(class)
[1] "function"

Notice the last example: a function is an object in R with the class function.

Some functions are associated with a specific class. These are called methods. (Not all functions are tied closely to a particular class; the class system in R is much less formal than that in a language like Java.)

In R, methods for different classes can share the same name. These are called generic functions. Generic functions serve two purposes. First, they make it easy to guess the right function name for an unfamiliar class. Second, generic functions make it possible to use the same code for objects of different types.

For example, + is a generic function for adding objects. You can add numbers together with the + operator:

> 17 + 6

[1] 23

You might guess that the addition operator would work similarly with other types

of objects. For example, you can also use the + operator with a date object and a

number:

> as.Date("2009-09-08") + 7

[1] "2009-09-15"
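To see that the same operator is doing class-specific work, the sketch below checks the class of each result. The variable names d and later are assumptions introduced for this example.

```r
d <- as.Date("2009-09-08")

print(class(17 + 6))   # adding two numbers gives a numeric result
print(class(d))        # the date has its own class

later <- d + 7         # the same operator, applied to a date and a number
print(class(later))    # the result is still a date
print(format(later))

print(later - d)       # subtracting two dates gives a time difference
```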

By the way, the R interpreter calls the generic function print on any object returned

on the R console. Suppose that you define x as:

> x <- 1 + 2 + 3 + 4

When you type:

> x

[1] 10

the interpreter actually calls the function print(x) to print the results. This means

that if you define a new class, you can define a print method to specify how objects

from that new class are printed on the console. Some functions take advantage of

this functionality to do other things when you enter an expression on the console.3

I’ll talk about objects in more depth in Chapter 7 and classes in Chapter 10.
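As a minimal sketch of defining a print method for a new class, the example below uses a made-up class called "temperature" (an assumption for illustration, not something from the text). Naming the function print.temperature is what ties it to the generic print.

```r
# "temperature" is a hypothetical class name used only for this example
temp <- structure(list(degrees = 21.5), class = "temperature")

# A print method for the class: the generic print() dispatches to it
print.temperature <- function(x, ...) {
  cat("Temperature:", x$degrees, "degrees C\n")
  invisible(x)
}

temp          # entering the object at top level calls print(temp)
print(temp)   # the explicit call produces the same output
```

Returning the object invisibly follows the usual convention for print methods, so that the value is not printed a second time.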

Models and Formulas

To statisticians, a model is a concise way to describe a set of data, usually with a

mathematical formula. Sometimes, the goal is to build a predictive model with

training data to predict values based on other data. Other times, the goal is to build

a descriptive model that helps you understand the data better.

R has a special notation for describing relationships between variables. Suppose that you are assuming a linear model for a variable y, predicted from the variables x1, x2, ..., xn. (Statisticians usually refer to y as the dependent variable, and x1, x2, ..., xn as the independent variables.) In equation form, this implies a relationship like:

y = c0 + c1*x1 + c2*x2 + ... + cn*xn + e

(Here c0, ..., cn stand for the coefficients to be estimated and e for an error term.) In R, you would write the relationship as y ~ x1 + x2 + ... + xn, which is a formula object.
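A formula can be created and inspected without fitting anything. The sketch below assumes placeholder variable names (y, x1, x2, x3) and uses the reformulate function from the stats package to build a formula from a character vector of predictor names.

```r
# A formula is an ordinary object; the variables need not exist yet
f <- y ~ x1 + x2

print(class(f))
print(all.vars(f))     # the variable names that appear in the formula

# Building a formula programmatically from predictor names
predictors <- c("x1", "x2", "x3")
f2 <- reformulate(predictors, response = "y")

print(f2)
print(all.vars(f2))
```

Building formulas this way can be handy when the list of predictors is only known at run time.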

3. A very important example of this is lattice graphics. Plotting functions in the lattice library

return lattice objects but don’t plot results. If you call a lattice function on the R console, the

console will print the object, thus plotting the results. However, if you call a lattice function

within another function, or in a script, R will not plot the results unless you explicitly print

the lattice object.


As an example, let’s use the cars data set (which is included in the datasets package that ships with R). This data set was created during the 1920s and shows the speed and stopping distance for a set of different cars. We’ll look at the relationship between speed and stopping distance. We’ll assume that the stopping distance is a linear function of speed. So let’s try to use a linear regression to estimate the relationship. The formula is dist~speed. We’ll use the lm function to estimate the parameters of a linear model. The lm function returns an object of class lm, which we will assign to a variable called cars.lm:

> cars.lm <- lm(formula=dist~speed,data=cars)

Now, let’s take a quick look at the results returned:

> cars.lm

Call:
lm(formula = dist ~ speed, data = cars)

Coefficients:
(Intercept)        speed
    -17.579        3.932

As you can see, printing an lm object shows you the original function call (and thus

the data set and formula) and the estimated coefficients. For some more information,

we can use the summary function:

> summary(cars.lm)

Call:

lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max
-29.069  -9.525  -2.272   9.215  43.201

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791     6.7584  -2.601   0.0123 *
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511,
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.490e-12

As you can see, the summary function shows you the function call, the distribution of the residuals from the fit, the coefficients, and information about the fit. By the way, it is possible to simply call the lm function or to call summary(lm(...)) and not assign a name to the model object:

> lm(dist~speed,data=cars)

Call:
lm(formula = dist ~ speed, data = cars)

Coefficients:
(Intercept)        speed
    -17.579        3.932

> summary(lm(dist~speed,data=cars))

Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max
-29.069  -9.525  -2.272   9.215  43.201

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791     6.7584  -2.601   0.0123 *
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511,
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.490e-12

In some cases, this can be more convenient. However, you often want to do more with a model after fitting it, such as updating it to add or subtract variables. By assigning a name to the model, you can make your code easier to understand and modify. Additionally, refitting a model can be very time consuming for complex models and large data sets. By assigning the model to a variable name, you can avoid these problems.
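Here is a brief sketch of what a named model object makes possible, using the standard extractor functions coef, predict, and update from the stats package. The new speeds (10 and 20) and the added squared term are arbitrary choices for illustration, not part of the original analysis.

```r
cars.lm <- lm(formula = dist ~ speed, data = cars)

# Pull the estimated coefficients out of the stored model
coefs <- coef(cars.lm)
print(coefs)

# Predict stopping distance for speeds chosen only as an example
new_speeds <- data.frame(speed = c(10, 20))
predicted <- predict(cars.lm, newdata = new_speeds)
print(predicted)

# Add a term to the existing model; ". ~ ." means "keep the current formula"
cars.lm2 <- update(cars.lm, . ~ . + I(speed^2))
n_terms <- length(coef(cars.lm2))
print(n_terms)
```

Each of these steps reuses cars.lm instead of repeating the original lm call.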

Charts and Graphics

R includes several packages for visualizing data: graphics, grid, and lattice. Usually, you’ll find that functions within the graphics and lattice packages are the most

useful.4 If you’re familiar with Microsoft Excel, you’ll find that R can generate all of

the charts that you’re familiar with: column charts, bar charts, line plots, pie charts,

and scatter plots. Even if that’s all you need, R makes it much easier than Excel to

automate the creation of charts and to customize them. However, there are many,

many more types of charts available in R, many of them quite intuitive and elegant.

To make this a little more interesting, let’s work with some real data. We’re going

to look at all field goal attempts in the National Football League (NFL) in 2005.5

For those of you who aren’t familiar with American football, here’s a quick explanation. A team can attempt to kick a football between a set of goalposts to receive 3

4. Other packages are available for visualizing data. For example, the rggobi package provides tools for creating interactive graphics.

5. The data was provided by Aaron Schatz of Pro Football Prospectus. For more information, see

the Football Outsiders website at http://www.footballoutsiders.com/, or you can find its annual

books at most bookstores—both online and “brick and mortar.”
