1.14 Pattern matching and replacement (character search and replace)


INTRODUCTION TO R






# select only those 'SITE' values that contain an 'A'

> grep("A", EXPERIMENT$SITE)

[1] 1 2

> EXPERIMENT$SITE[grep("A", EXPERIMENT$SITE)]

[1] "A1" "A2"



By default, the pattern can be any valid regular expression (see footnote h), which provides great pattern searching flexibility.

# convert the EXPERIMENT list into a data frame

> EXP <- as.data.frame(EXPERIMENT)

# select only those rows that correspond to a 'SITE'
# value of either an A, B or C followed by a '1'

> grep("[A-C]1", EXP$SITE)

[1] 1 3 5

> EXP[grep("[A-C]1", EXP$SITE), ]

   SITE COORDINATES TEMPERATURE SHADE
Q1   A1  16.92,8.37        36.1    no
Q3   B1  7.61,16.65        31.0    no
Q5   C1 11.77,13.12        39.9    no
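The following is a minimal sketch of some closely related ways of using grep(). The SITE vector here is a hypothetical stand-in for the 'SITE' values above, and the value and fixed arguments and the grepl() function are standard parts of base R rather than anything introduced in this section.

```r
# Hypothetical site labels (an assumption, used only for illustration)
SITE <- c("A1", "A2", "B1", "B2", "C1", "C2")

# positions of entries matching an A, B or C followed by a '1'
idx <- grep("[A-C]1", SITE)

# return the matching entries themselves rather than their positions
vals <- grep("[A-C]1", SITE, value = TRUE)

# grepl() returns a logical vector, one TRUE/FALSE per entry
hits <- grepl("[A-C]1", SITE)

# fixed = TRUE treats the pattern as literal text, so '.' is no longer
# a wildcard
literal <- grep(".", c("a.b", "ab"), fixed = TRUE)
wildcard <- grep(".", c("a.b", "ab"))
```

The logical vector returned by grepl() is often convenient for subsetting, since it has the same length as the vector being searched.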



1.14.2 regexpr - position and length of match

Rather than return the indexes of matching entries, the regexpr() function returns

the position of the match within each string as well as the length of the pattern

within each string (-1 values correspond to entries in which the pattern is not

found).

# recall the AUST character vector that lists the Australian
# capital cities
> AUST
[1] "Adelaide"  "Brisbane"  "Canberra"  "Darwin"
[5] "Hobart"    "Melbourne" "Perth"     "Sydney"
# get the position and length of string of characters containing
# an 'a' and an 'e' separated by any number of characters

> regexpr("a.*e", AUST)

[1] 5 6 2 -1 -1 -1 -1 -1

attr(,"match.length")

[1] 4 3 4 -1 -1 -1 -1 -1
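As a sketch of how the positions and lengths returned by regexpr() might be used, the matched text can be extracted with the base R regmatches() function (not covered in this section) or, equivalently, with substring(). The AUST vector is recreated from the output above.

```r
AUST <- c("Adelaide", "Brisbane", "Canberra", "Darwin",
          "Hobart", "Melbourne", "Perth", "Sydney")

m <- regexpr("a.*e", AUST)

# regmatches() returns only the text of the successful matches
matched <- regmatches(AUST, m)

# the same text recovered by hand from the start positions and lengths
found <- m > 0
starts <- as.integer(m)[found]
lens <- attr(m, "match.length")[found]
by.hand <- substring(AUST[found], starts, starts + lens - 1)
```

Note that regmatches() drops the entries without a match, so the result here has three elements rather than eight.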



h A regular expression is a formal computer language consisting of normal printing characters and special metacharacters (which represent wildcards and other features) that together provide a concise yet flexible way of matching strings.






CHAPTER 1



1.14.3 gsub - pattern replacement

The gsub() function replaces all instances (see footnote i) of an identified pattern within a character vector with an alternative set of characters.
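As a small illustration of replacing every match versus only the first, the sketch below applies gsub() and sub() to a made-up character vector (the vector x is an assumption for illustration only). It also shows a stored pattern being reused in the replacement via a backreference.

```r
# a made-up vector in which the first entry contains the pattern twice
x <- c("no shade, no cover", "full")

# gsub() replaces every match within each entry
all.matches <- gsub("no", "NO", x)

# sub() replaces only the first match within each entry
first.match <- sub("no", "NO", x)

# parts of the match stored with ( ) can be reused as \\1, \\2, ...
swapped <- gsub("(\\w+),(\\w+)", "\\2,\\1", "x,y")
```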

> gsub("no", "Not shaded", EXP$SHADE)
 [1] "Not shaded" "full"       "Not shaded" "full"
 [5] "Not shaded" "full"       "Not shaded" "full"
 [9] "Not shaded" "full"



It is also possible to extend the functionality to accommodate perl-compatible regular expressions.
# convert all the capital city names to uppercase:
# identify (and store) each word character (\\w), then
# convert the stored pattern (\\1) to uppercase (\\U)
> gsub("(\\w)", "\\U\\1", AUST, perl = TRUE)
[1] "ADELAIDE"  "BRISBANE"  "CANBERRA"  "DARWIN"
[5] "HOBART"    "MELBOURNE" "PERTH"     "SYDNEY"



1.15 Data manipulation



1.15.1 Sorting

The sort() function is used to sort vector entries in increasing (or decreasing) order. Note that the elements of the TEMPERATURE vector were earlier named (see section 1.10.2). This helps to distinguish the following functions from one another, although it does result in a slightly different output format (each element has a name above it, and the bracketed index is absent).
> sort(TEMPERATURE)
  Q6   Q9   Q7   Q8  Q10   Q2   Q3   Q1   Q4   Q5
 6.5  9.7 11.2 12.8 15.9 30.6 31.0 36.1 36.3 39.9
> sort(TEMPERATURE, decreasing = TRUE)
  Q5   Q4   Q1   Q3   Q2  Q10   Q8   Q7   Q9   Q6
39.9 36.3 36.1 31.0 30.6 15.9 12.8 11.2  9.7  6.5



i The similar sub() function replaces only the first match of a pattern within each element of a vector.

The order() function is also used to sort vector entries in increasing (or decreasing) order, but rather than return a sorted vector, it returns the positions (order) of the sorted entries in the original vector. For example:
> order(TEMPERATURE)
 [1]  6  9  7  8 10  2  3  1  4  5
indicating that the smallest entry in the TEMPERATURE vector was at position (index) 6 and so on.
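The relationship between order() and sort() can be sketched as follows. The TEMPERATURE values are reconstructed from the outputs shown in this section, and the alternating SHADE vector is likewise an assumption based on the earlier gsub() output.

```r
# values reconstructed from the outputs shown in this section
TEMPERATURE <- c(Q1 = 36.1, Q2 = 30.6, Q3 = 31.0, Q4 = 36.3, Q5 = 39.9,
                 Q6 = 6.5, Q7 = 11.2, Q8 = 12.8, Q9 = 9.7, Q10 = 15.9)
SHADE <- rep(c("no", "full"), 5)

# indexing a vector by its own order gives the sorted vector
ord <- order(TEMPERATURE)
sorted <- TEMPERATURE[ord]

# order() accepts several keys: here SHADE first (alphabetically),
# then decreasing TEMPERATURE within each SHADE level
ord2 <- order(SHADE, -TEMPERATURE)
```

Because order() returns positions rather than values, the same index vector can be used to rearrange other vectors, or the rows of a data frame, in step with the sorted variable.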

The rank() function is used to indicate the ranking of each entry in a vector:
> rank(TEMPERATURE)
 Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9 Q10
  8   6   7   9  10   1   3   4   2   5



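This indicates that the first entry in the TEMPERATURE vector was ranked eighth in increasing order. The returned vector can be reversed using the rev() function, which lists the same ranks from the last entry to the first. Note that rev() does not re-rank the entries; ranks in decreasing order can instead be obtained by ranking the negated values, rank(-TEMPERATURE).
> rev(rank(TEMPERATURE))
Q10  Q9  Q8  Q7  Q6  Q5  Q4  Q3  Q2  Q1
  5   2   4   3   1  10   9   7   6   8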
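To make the distinction between reversing a vector of ranks and ranking in decreasing order concrete, here is a short sketch. The TEMPERATURE values are reconstructed from the outputs in this section; the ties.method argument is a standard rank() argument that is not discussed in the text.

```r
# values reconstructed from the outputs shown in this section
TEMPERATURE <- c(Q1 = 36.1, Q2 = 30.6, Q3 = 31.0, Q4 = 36.3, Q5 = 39.9,
                 Q6 = 6.5, Q7 = 11.2, Q8 = 12.8, Q9 = 9.7, Q10 = 15.9)

inc <- rank(TEMPERATURE)        # 1 = smallest value
reversed <- rev(inc)            # same ranks, elements listed Q10 ... Q1
dec <- rank(-TEMPERATURE)       # 1 = largest value

# tied values share the average of their ranks by default
tied.avg <- rank(c(10, 20, 20, 30))
tied.min <- rank(c(10, 20, 20, 30), ties.method = "min")
```

When there are no ties, the decreasing ranks are simply length(x) + 1 minus the increasing ranks.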



1.15.2 Formatting data

Rounding



The ceiling() function rounds vector entries up to the nearest integer:
> ceiling(TEMPERATURE)
 Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9 Q10
 37  31  31  37  40   7  12  13  10  16



The floor() function rounds vector entries down to the nearest integer:
> floor(TEMPERATURE)
 Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9 Q10
 36  30  31  36  39   6  11  12   9  15



The trunc() function rounds vector entries to the nearest integer towards ‘0’ (zero)

> trunc(seq(-2, 2, by = 0.5))

[1] -2 -1 -1 0 0 0 1 1 2



The round() function rounds vector entries to the nearest numeric with the specified number of decimal places. Values that fall exactly halfway between two candidates (a final digit of 5) are rounded to the nearest even digit.
> round(TEMPERATURE)
 Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9 Q10
 36  31  31  36  40   6  11  13  10  16
> round(seq(-2, 2, by = 0.5))
[1] -2 -2 -1  0  0  0  1  2  2
> round(TEMPERATURE/2.2, 2)
   Q1    Q2    Q3    Q4    Q5    Q6    Q7    Q8    Q9   Q10
16.41 13.91 14.09 16.50 18.14  2.95  5.09  5.82  4.41  7.23
> round(TEMPERATURE, -1)
 Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9 Q10
 40  30  30  40  40  10  10  10  10  20
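The four rounding functions differ mainly in how they treat negative values and exact halves. The sketch below applies each of them to the same made-up vector (x is an assumption chosen so that every value is an exact half) so that the results can be compared side by side.

```r
# made-up values, each exactly halfway between two integers
x <- c(-1.5, -0.5, 0.5, 1.5, 2.5)

up <- ceiling(x)     # always towards +infinity
down <- floor(x)     # always towards -infinity
to.zero <- trunc(x)  # always towards zero
nearest <- round(x)  # exact halves go to the nearest even integer

# collect the results in a matrix for comparison
comparison <- rbind(x, up, down, to.zero, nearest)
```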



Other formatting



Occasionally (mainly for graphical displays), it is necessary to be able to adjust other aspects of the formatting of vector entries. For example, you may wish to have numbers expressed in scientific notation (2.93e-04 rather than 0.000293) or insert commas every 3 digits left of the decimal point. These procedures are supported via the formatC() function.
> seq(pi, pi * 10000, length = 5)
[1]     3.141593  7856.337828 15709.534064 23562.730300
[5] 31415.926536
# scientific notation
> formatC(seq(pi, pi * 10000, length = 5), format = "e",
+     digits = 2)
[1] "3.14e+00" "7.86e+03" "1.57e+04" "2.36e+04" "3.14e+04"
# scientific notation only if it saves space
> formatC(seq(pi, pi * 10000, length = 5), format = "g",
+     digits = 2)
[1] "3.1"     "7.9e+03" "1.6e+04" "2.4e+04" "3.1e+04"
# floating point format with 1000's indicators
> formatC(seq(pi, pi * 10000, length = 5), format = "f",
+     big.mark = ",", digits = 2)
[1] "3.14"      "7,856.34"  "15,709.53" "23,562.73"
[5] "31,415.93"
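A few further formatting variations are sketched below using single made-up values. The width and flag arguments of formatC() and the sprintf() function are standard base R features that are not otherwise covered in this section. Note that the results are character strings rather than numbers.

```r
# scientific notation, as in the example mentioned in the text
sci <- formatC(0.000293, format = "e", digits = 2)

# fixed notation with thousands separators
big <- formatC(1234567.891, format = "f", big.mark = ",", digits = 1)

# pad with leading zeros to a fixed width (useful for labels)
padded <- formatC(7, width = 3, flag = "0")

# sprintf() offers C-style formatting as an alternative
two.dp <- sprintf("%.2f", pi)
```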



1.16 Functions that perform other functions repeatedly



The replicate() function repeatedly evaluates the expression specified in the second argument the number of times indicated by the first argument. The important distinction between the replicate() function and the rep() function described in section 1.10.1 is that the former repeatedly evaluates the expression, whereas the latter evaluates it only once and then duplicates the result multiple times. Since most functions produce the same result each time they are performed, for many uses both functions produce identical results. The one group of functions that do not produce identical results each time are those involved in random number generation. Hence, the replicate() function is usually used in conjunction with random number generators (such as runif(), which will be described in greater detail in chapter 4) to produce sets of random numbers. Consider first the difference between rep() and replicate():

> rep(runif(1), 5)

[1] 0.4194366 0.4194366 0.4194366 0.4194366 0.4194366

> replicate(5, runif(1))

[1] 0.467324683 0.727337794 0.797764456 0.007025032

[5] 0.155971928



When the function being run within replicate() itself produces a vector of length > 1, the replicate() function combines each of the vectors together as separate columns in a matrix:
> replicate(5, runif(5))
          [,1]      [,2]      [,3]      [,4]        [,5]
[1,] 0.3266058 0.3313832 0.2113326 0.4744742 0.257732622
[2,] 0.5241960 0.9801652 0.6642341 0.5292882 0.799982207
[3,] 0.1894848 0.8300792 0.7178351 0.7262750 0.698298026
[4,] 0.1464055 0.6758495 0.9940731 0.3015559 0.288537242
[5,] 0.5491748 0.4052211 0.9923927 0.4074775 0.002170782
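A typical use of replicate() is a small simulation. The sketch below is only illustrative: the seed, the sample sizes and the object names are all assumptions, and set.seed() is used simply so that the random numbers are repeatable.

```r
set.seed(1)  # arbitrary seed, only to make the example repeatable

# rep() evaluates runif(1) once and copies the result
copied <- rep(runif(1), 5)

# each replicate of runif(3) becomes one column of a 3 x 4 matrix
m <- replicate(4, runif(3))

# 1000 simulated means, each from 10 uniform random numbers
sim.means <- replicate(1000, mean(runif(10)))

# simplify = FALSE keeps the results as a list rather than a matrix
as.list.result <- replicate(2, runif(3), simplify = FALSE)
```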



1.16.1 Along matrix margins

The apply() function applies a function to the margins (1=row margins and 2=column margins) of a matrix. For example, we might have a matrix that represents the abundance of three species of moth from three habitat types:
> MOTH <- cbind(SpA = c(25, 6, 3), SpB = c(12, 12,
+     3), SpC = c(7, 2, 19))
> rownames(MOTH) <- paste("Habitat", 1:3, sep = "")
> MOTH
         SpA SpB SpC
Habitat1  25  12   7
Habitat2   6  12   2
Habitat3   3   3  19



The apply() function could be used to calculate the column means (mean abundance of each species across habitat types):
> apply(MOTH, 2, mean)
      SpA       SpB       SpC
11.333333  9.000000  9.333333
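The same approach works along the row margin and with other functions. The sketch below recreates the MOTH matrix from above and also compares apply() against colMeans(), a base R shortcut for this particular calculation that is not mentioned in the text.

```r
MOTH <- cbind(SpA = c(25, 6, 3), SpB = c(12, 12, 3), SpC = c(7, 2, 19))
rownames(MOTH) <- paste("Habitat", 1:3, sep = "")

# total moth abundance in each habitat (row margin)
habitat.totals <- apply(MOTH, 1, sum)

# maximum abundance of each species (column margin)
species.max <- apply(MOTH, 2, max)

# colMeans() gives the same result as apply(MOTH, 2, mean)
same <- all.equal(apply(MOTH, 2, mean), colMeans(MOTH))
```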






1.16.2 By factorial groups

The tapply() function applies a function to the vector separately for each level of a

factor combination. This provides a convenient way to calculate group statistics (pivot

tables). For example, if we wanted to calculate the mean TEMPERATURE for each level

of the SHADE factor:

> tapply(TEMPERATURE, SHADE, mean)

no full

25.58 20.42
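For readers without the earlier objects at hand, the calculation can be reproduced as a self-contained sketch. The TEMPERATURE and SHADE values are reconstructed from outputs shown in this section (so they are assumptions about the original objects), and the second call illustrates that functions other than mean() can be supplied.

```r
# values reconstructed from the outputs shown in this section
TEMPERATURE <- c(Q1 = 36.1, Q2 = 30.6, Q3 = 31.0, Q4 = 36.3, Q5 = 39.9,
                 Q6 = 6.5, Q7 = 11.2, Q8 = 12.8, Q9 = 9.7, Q10 = 15.9)
SHADE <- factor(rep(c("no", "full"), 5), levels = c("no", "full"))

# mean temperature for each level of SHADE
group.means <- tapply(TEMPERATURE, SHADE, mean)

# number of observations in each group
group.n <- tapply(TEMPERATURE, SHADE, length)
```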



1.16.3 By objects

The lapply() and sapply() functions apply a function separately to each of the

objects in a list and return a list and vector/matrix respectively. For example, to find

out the length of each of the objects within the EXPERIMENT list:

> lapply(EXPERIMENT, length)

$SITE

[1] 10

$COORDINATES

[1] 5

$TEMPERATURE

[1] 10

$SHADE

[1] 10
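The difference in return types can be seen on a small hypothetical list (EXAMPLE and its contents are assumptions for illustration). The sketch also shows sapply() producing a matrix when each result has length greater than one, and vapply(), a stricter base R relative of sapply() in which the expected type of each result is declared.

```r
# a small hypothetical list with three components
EXAMPLE <- list(SITE = c("A1", "A2", "B1"),
                TEMPERATURE = c(36.1, 30.6, 31.0),
                SHADE = c("no", "full", "no"))

len.list <- lapply(EXAMPLE, length)  # always a list
len.vec <- sapply(EXAMPLE, length)   # simplified to a named vector

# results of length 2 are simplified to a 2-row matrix
ranges <- sapply(list(a = 1:3, b = c(10, 20)), range)

# vapply() checks that every result is a single integer
len.checked <- vapply(EXAMPLE, length, integer(1))
```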

> sapply(EXPERIMENT, length)
       SITE COORDINATES TEMPERATURE       SHADE
         10           5          10          10

1.17 Programming in R



Although the library of built-in and add-on tools available for the R environment

is extensive (and continues to grow at an incredible rate), occasionally there is the

need to perform a task for which there are no existing functions. Since R is itself

a programming language (in fact most of the available functions are written in R),

extending its functionality to accommodate additional procedures can be a relatively

simple exercise (depending of course, on the complexity of the procedure and your

level of R proficiency).






1.17.1 Grouped expressions

Multiple commands can be issued on a single line by separating each command by a semicolon (;). When doing so, commands are evaluated in order from left to right:
> A <- 1; B <- 2; C <- A + B
> C
[1] 3



When a series of commands are grouped together between braces (such as {command1;

command2;...}), the whole group of commands are evaluated as a single expression

and the value of the last evaluated command within the group is returned:

> D <- {A <- 1; 2 -> B; C <- A + B}

> D

[1] 3



Grouped expressions are useful for wrapping up sets of commands that work together

to produce a single result and since they are treated as a single expression, they too can

be further nested within braces as part of a larger grouped expression.



1.17.2 Conditional execution – if and ifelse

Conditional execution is when a sequence of tasks is determined by whether a condition

is met (TRUE) or not (FALSE), and is useful when writing code that needs to be able to

accommodate more than one set of circumstances. In R, conditional execution has the

forms:

if(condition) true.task
if(condition) true.task else false.task
ifelse(condition, true.task, false.task)



If condition returns a TRUE, the statement true.task is evaluated, otherwise the

false.task is evaluated (if provided). If condition cannot be coerced into a logical

(a yes/no answer), an error will be reported.

To illustrate the use of the if conditional execution, imagine that you were writing code to calculate means and you anticipated that you may have to accommodate two different classes of objects (vectors and matrices). I will use the vector TEMPERATURE and the matrix MOTH:
> NEW.OBJECT <- TEMPERATURE
> if (is.vector(NEW.OBJECT)) mean(NEW.OBJECT) else apply(NEW.OBJECT,
+     2, mean)
[1] 23
> NEW.OBJECT <- MOTH
> ifelse(is.vector(NEW.OBJECT), mean(NEW.OBJECT),
+     apply(NEW.OBJECT, 2, mean))
[1] 11.33333
Note that ifelse() returns a result of the same length as the condition, so only the first of the three column means (that of SpA) is returned here.
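The contrast between if and ifelse() can be sketched as follows. The MOTH matrix is recreated from section 1.16.1 and the TEMPERATURE values are reconstructed from outputs shown earlier; the labels "warm" and "cool" and the threshold of 20 are arbitrary choices for illustration.

```r
MOTH <- cbind(SpA = c(25, 6, 3), SpB = c(12, 12, 3), SpC = c(7, 2, 19))
TEMPERATURE <- c(Q1 = 36.1, Q2 = 30.6, Q3 = 31.0, Q4 = 36.3, Q5 = 39.9,
                 Q6 = 6.5, Q7 = 11.2, Q8 = 12.8, Q9 = 9.7, Q10 = 15.9)

# if ... else evaluates a single condition and returns the whole
# result of whichever branch is run (all three column means here)
via.if <- if (is.vector(MOTH)) mean(MOTH) else apply(MOTH, 2, mean)

# ifelse() returns one value per element of the condition, so a
# condition of length one yields a result of length one
via.ifelse <- ifelse(is.vector(MOTH), mean(MOTH), apply(MOTH, 2, mean))

# ifelse() is most useful with a vector of conditions
label <- ifelse(TEMPERATURE > 20, "warm", "cool")
```

As a rough guide, if ... else suits control of which code is run, whereas ifelse() suits element-by-element recoding of a vector.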



1.17.3 Repeated execution – looping

Looping enables sets of commands to be performed (executed) repeatedly.

for



A for loop iteratively loops through a vector of integers (a counter), each time

executing the set of commands, and takes on the general form of:

for (counter in sequence) task



where counter is a loop variable, whose value is incremented according to the

integer vector defined by sequence. The task is a single expression or grouped

expression (see section 1.17.1) that utilizes the incrementing variable to perform a

specific operation on a sequence of objects. For a simple example of a for loop, consider

the following snippet that counts to six:

> for (i in 1:6) print(i)

[1] 1

[1] 2

[1] 3

[1] 4

[1] 5

[1] 6



As a more applied example, let’s say we wanted to calculate the distances between each pair of sites in the XY matrix generated in section 1.11.1. The distance between any two sites (e.g. 'A' and 'B') could be determined using Pythagoras’ theorem ($a^2 + b^2 = c^2$).
> sqrt((XY["A", "X"] - XY["B", "X"])^2 + (XY["A",
+     "Y"] - XY["B", "Y"])^2)
# OR equivalently
> sqrt((XY[1, 1] - XY[2, 1])^2 + (XY[1, 2] - XY[2,
+     2])^2)
[1] 8.446638



A for loop can be used to produce a 5 × 5 matrix of pairwise distances between each of

the sites:

# Create empty object
> DISTANCES <- NULL
> for (i in 1:5) {
+     X.DIST <- (XY[i, 1] - XY[, 1])^2
+     Y.DIST <- (XY[i, 2] - XY[, 2])^2
+     DISTANCES <- cbind(DISTANCES, sqrt(X.DIST +
+         Y.DIST))
+ }
> colnames(DISTANCES) <- rownames(DISTANCES)
> DISTANCES
          A         B         C        D         E
A  0.000000  8.446638 12.459314 4.088251  7.006069
B  8.446638  0.000000 16.836116 8.571143 12.261472
C 12.459314 16.836116  0.000000 9.049691  5.455868
D  4.088251  8.571143  9.049691 0.000000  3.832075
E  7.006069 12.261472  5.455868 3.832075  0.000000
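A variation on this loop is sketched below. Because the original XY matrix is not reproduced in this section, the coordinates used here are hypothetical. The sketch creates the result matrix at its final size before the loop (rather than growing it with cbind()) and then checks the result against dist(), the base R function for distance matrices, which is not covered in the text.

```r
# hypothetical coordinates for five sites (not the original XY values)
XY <- cbind(X = c(1, 4, 0, 3, 6), Y = c(1, 5, 7, 2, 2))
rownames(XY) <- LETTERS[1:5]

# create the result matrix at its final size, then fill it by column
n <- nrow(XY)
DISTANCES <- matrix(0, nrow = n, ncol = n,
                    dimnames = list(rownames(XY), rownames(XY)))
for (i in seq_len(n)) {
    X.DIST <- (XY[i, 1] - XY[, 1])^2
    Y.DIST <- (XY[i, 2] - XY[, 2])^2
    DISTANCES[, i] <- sqrt(X.DIST + Y.DIST)
}

# dist() computes the same Euclidean distances without a loop
VIA.DIST <- as.matrix(dist(XY))
```

Using seq_len(n) rather than 1:5 lets the same loop work for any number of sites.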



while



A while loop executes a set of commands repeatedly while a condition is TRUE and

exits when the condition evaluates to FALSE, and takes the general form:

> while (condition) task



where task is a single expression or grouped expression (see section 1.17.1) that

performs a specific operation as long as condition evaluates to TRUE.

To illustrate the use of a while loop, consider the situation where a procedure needs to generate a temporary object, but you want to be sure that no existing objects are overwritten. A simple solution is to append the object name with a number. A while loop can be used to repeatedly assess whether an object name (TEMP) already exists in the current R environment (each time incrementing a suffix) and eventually generate a unique name. The first three commands in the following syntax are included purely to generate a couple of existing names and confirm their existence.
> TEMP <- NULL
> TEMP1 <- NULL
> ls()
 [1] "A"           "AUST"        "B"           "C"
 [5] "D"           "DISTANCES"   "EXP"         "EXPERIMENT"
 [9] "i"           "MOTH"        "NEW.OBJECT"  "op"
[13] "QUADRATS"    "SHADE"       "SITE"        "TEMP"
[17] "TEMP1"       "TEMPERATURE" "X"           "X.DIST"
[21] "XY"          "Y"           "Y.DIST"
# object name suffix, initially empty
> j <- NULL
# proposed temporary object
> NAME <- "TEMP"
# iteratively search for a unique name
> while (exists(Nm <- paste(NAME, j, sep = ""))) {
+     ifelse(is.null(j), j <- 1, j <- j + 1)
+ }
# assign the unique name to a numeric vector
> assign(Nm, c(1, 3, 3))
# Reexamine list of objects, note the new object, TEMP2
> ls()
 [1] "A"           "AUST"        "B"           "C"
 [5] "D"           "DISTANCES"   "EXP"         "EXPERIMENT"
 [9] "i"           "j"           "MOTH"        "NAME"
[13] "NEW.OBJECT"  "Nm"          "op"          "QUADRATS"
[17] "SHADE"       "SITE"        "TEMP"        "TEMP1"
[21] "TEMP2"       "TEMPERATURE" "X"           "X.DIST"
[25] "XY"          "Y"           "Y.DIST"



The exists() function assesses whether an object of the given name already exists, and the assign() function makes the first argument an object name and assigns it the value of the second argument.
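The same idea can be sketched with a plain if ... else style update of the suffix and a separate environment, so that the example leaves the user's workspace untouched. The helper name uniqueName and the use of new.env() are assumptions made for this illustration and are not part of the procedure described above.

```r
# hypothetical helper: return the first unused name of the form
# base, base1, base2, ... within the given environment
uniqueName <- function(base, envir = parent.frame()) {
    candidate <- base
    suffix <- 0
    while (exists(candidate, envir = envir, inherits = FALSE)) {
        suffix <- suffix + 1
        candidate <- paste(base, suffix, sep = "")
    }
    candidate
}

# a scratch environment that already holds TEMP and TEMP1
scratch <- new.env()
assign("TEMP", NULL, envir = scratch)
assign("TEMP1", NULL, envir = scratch)

# find a unique name and assign a numeric vector to it
Nm <- uniqueName("TEMP", envir = scratch)
assign(Nm, c(1, 3, 3), envir = scratch)
```

Setting inherits = FALSE restricts the search to the named environment, so objects elsewhere on the search path do not affect the result.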



1.17.4 Writing functions

For all but the most trivial cases, lines of R code should be organized into a new function

which can then be used in the same way as the built in functions. Functions are defined

using the function() function:

> name <- function(argument1, argument2, ...) expression



The new function (called name) will use the arguments (argument1, argument2,

...) to evaluate the expression (usually grouped expressions – see section 1.17.1) and

return the result of the evaluated expression. Once defined, the function is called by

issuing a statement in the form:

> name(argument1, argument2, ...)



Functions not only provide a more elegant way to interact with a procedure (as all

arguments are provided in one location, and the internal workings are hidden from

view), they form a reusable extension of the R environment. As such, there are a couple

of general programming conventions that are worth adhering to. Firstly, each function

should only perform a single task. If a series of tasks are required, consider writing a

number of functions that in turn are called from another function. Secondly, where

possible, provide default options, thereby simplifying the use of the function for most

regular occasions. Thirdly, user defined functions should be in either upper case or

camel case so as to avoid conflicting with functions built into R or one of the many

extension packages.

For example, we could extend the functionality of R by writing a function that estimates the standard error of the mean. The standard error of the mean can be estimated using the formula $sd/\sqrt{n-1}$, where $sd$ is the standard deviation of the sample and $n$ is the number of observations.
> SEM <- function(x, na.rm = FALSE) {
+     if (na.rm == TRUE)
+         VAR <- x[!is.na(x)]
+     else VAR <- x
+     SD <- sd(VAR)
+     N <- length(VAR)
+     SD/sqrt(N - 1)
+ }



The function first assesses whether missing values (values of 'NA') should be removed

(based on the value of na.rm supplied by the function user). If the function is called

with na.rm=TRUE, the is.na() function is used to deselect such values, before the

standard deviation and length are calculated using the sd() (see footnote j) and length() functions.

Finally, the standard error of the mean is calculated and returned. This function

could then be used to calculate the standard error of the mean for the TEMPERATURE

vector:

> SEM(TEMPERATURE)

[1] 4.30145
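The effect of the na.rm argument can be sketched as follows. The function below is a slightly condensed version of the SEM function above and follows the same formula as the text (many other texts divide by the square root of n rather than n - 1). The TEMPERATURE values are reconstructed from outputs shown earlier, and the vector with an appended missing value is made up for illustration.

```r
# condensed version of the SEM function defined above
SEM <- function(x, na.rm = FALSE) {
    if (na.rm) x <- x[!is.na(x)]
    sd(x)/sqrt(length(x) - 1)
}

# values reconstructed from the outputs shown in this section
TEMPERATURE <- c(Q1 = 36.1, Q2 = 30.6, Q3 = 31.0, Q4 = 36.3, Q5 = 39.9,
                 Q6 = 6.5, Q7 = 11.2, Q8 = 12.8, Q9 = 9.7, Q10 = 15.9)
WITH.NA <- c(TEMPERATURE, Q11 = NA)

complete <- SEM(TEMPERATURE)
not.removed <- SEM(WITH.NA)              # NA, since sd() returns NA
removed <- SEM(WITH.NA, na.rm = TRUE)    # same as for TEMPERATURE
```

Because na.rm defaults to FALSE, a missing value is only dropped when the user asks for it explicitly.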



1.18 An introduction to the R graphical environment



In addition to providing a highly adaptable statistical environment, R is also a graphical

environment in which figures suitable for publication can be generated. The R graphical

environment consists of one or more graphical devices along with an extensive library

of functions for manipulating objects on these devices. A graphical device is an output

stream such as a window, file or printer that is capable of receiving and interpreting

graphical/plotting instructions. The extensive range of graphical functions can be broadly broken down into three categories:
• High-level graphics (plotting) functions are used to generate a new plot on a graphical device, and, unless directed otherwise, accompanying axes, labels and the appropriate (yet basic) points/bars/boxes etc are also automatically generated. When these functions are issued, a graphical device (a window unless otherwise specified) is opened and activated. If the device is already active, the previous plot will be overwritten. Whilst these functions form the basis of all graphics in R, they are rarely used in isolation to produce graphs, as they offer only limited potential for customization.

• Low-level graphics functions are used to customize and enhance existing plots by adding

more objects and information, such as additional points, lines, words, axes, colors etc.

• Interactive graphics functions allow information to be added or extracted interactively

from existing plots using the mouse. For example, a label may be added to a plot at

the location of the mouse pointer, thereby simplifying the interaction with the graphical

device’s coordinate system.

j The sd function returns 'NA' when a vector containing missing values is encountered.


