1.14 Pattern matching and replacement (character search and replace)
INTRODUCTION TO R
# select only those 'SITE' values that contain an 'A'
> grep("A", EXPERIMENT$SITE)
[1] 1 2
> EXPERIMENT$SITE[grep("A", EXPERIMENT$SITE)]
[1] "A1" "A2"
By default, the pattern comprises any valid regular expression (see footnote h), which
provides great pattern-searching flexibility.
# convert the EXPERIMENT list into a data frame
> EXP <- as.data.frame(EXPERIMENT)
# select only those rows that correspond to a 'SITE' value of
# either an A, B or C followed by a '1'
> grep("[A-C]1", EXP$SITE)
[1] 1 3 5
> EXP[grep("[A-C]1", EXP$SITE), ]
   SITE COORDINATES TEMPERATURE SHADE
Q1   A1  16.92,8.37        36.1    no
Q3   B1  7.61,16.65        31.0    no
Q5   C1 11.77,13.12        39.9    no
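As a related sketch, grep() can return the matching strings themselves (value = TRUE), and grepl() returns a logical vector rather than indexes. The SITE vector below is a stand-in assumed for illustration, since the full EXPERIMENT data are not reproduced in this section.

```r
# Hypothetical stand-in for the SITE labels used above
SITE <- c("A1", "A2", "B1", "B2", "C1", "C2", "D1", "D2", "E1", "E2")

# indexes of entries matching an A, B or C followed by a 1
idx <- grep("[A-C]1", SITE)

# the matching strings themselves rather than their indexes
hits <- grep("[A-C]1", SITE, value = TRUE)

# a logical vector (one TRUE/FALSE per entry), convenient for subsetting
keep <- grepl("[A-C]1", SITE)

print(idx)
print(hits)
print(SITE[keep])
```

The logical form is often the more convenient one when the match is to be combined with other conditions using & or |.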
1.14.2 regexpr - position and length of match
Rather than return the indexes of matching entries, the regexpr() function returns
the position of the first match within each string as well as the length of that match
(-1 values correspond to entries in which the pattern is not found).
# recall the AUST character vector that lists the Australian
# capital cities
> AUST
[1] "Adelaide"  "Brisbane"  "Canberra"  "Darwin"
[5] "Hobart"    "Melbourne" "Perth"     "Sydney"
# get the position and length of the string of characters containing
# an 'a' and an 'e' separated by any number of characters
> regexpr("a.*e", AUST)
[1]  5  6  2 -1 -1 -1 -1 -1
attr(,"match.length")
[1]  4  3  4 -1 -1 -1 -1 -1
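The positions and lengths returned by regexpr() can be turned back into the matched text. The following sketch recreates the AUST vector from the output above and recovers the matches in two ways: with regmatches(), and by hand with substring(). The object names (m, starts, lens, matched, by.hand) are introduced here for illustration only.

```r
AUST <- c("Adelaide", "Brisbane", "Canberra", "Darwin",
          "Hobart", "Melbourne", "Perth", "Sydney")

m <- regexpr("a.*e", AUST)

# start position of each match (-1 where there is no match)
starts <- as.integer(m)
# length of each match (-1 where there is no match)
lens <- attr(m, "match.length")

# regmatches() returns the matched text for the entries that matched
matched <- regmatches(AUST, m)

# the same substrings recovered by hand from position and length
found <- starts > 0
by.hand <- substring(AUST[found], starts[found],
                     starts[found] + lens[found] - 1)

print(matched)
print(by.hand)
```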
h A regular expression is a formal computer language consisting of normal printing characters and
special metacharacters (which represent wildcards and other features) that together provide a concise
yet flexible way of matching strings.
CHAPTER 1
1.14.3 gsub - pattern replacement
The gsub() function replaces all instances (see footnote i) of an identified pattern within a
character vector with an alternative set of characters.
> gsub("no", "Not shaded", EXP$SHADE)
 [1] "Not shaded" "full"       "Not shaded" "full"
 [5] "Not shaded" "full"       "Not shaded" "full"
 [9] "Not shaded" "full"
It is also possible to extend the functionality to accommodate Perl-compatible regular
expressions.
# convert all the capital city names to uppercase:
# identify (and store) each word character (\\w), then convert
# the stored match (\\1) to uppercase (\\U)
> gsub("(\\w)", "\\U\\1", AUST, perl = TRUE)
[1] "ADELAIDE"  "BRISBANE"  "CANBERRA"  "DARWIN"
[5] "HOBART"    "MELBOURNE" "PERTH"     "SYDNEY"
1.15 Data manipulation
1.15.1 Sorting
The sort() function is used to sort vector entries in increasing (or decreasing)
order. Note that the elements of the TEMPERATURE vector were named earlier (see
section 1.10.2). The names help to distinguish the behaviour of the following functions,
although they do result in a slightly different output format (each element has its name
above it, and the bracketed index is absent).
> sort(TEMPERATURE)
  Q6   Q9   Q7   Q8  Q10   Q2   Q3   Q1   Q4   Q5
 6.5  9.7 11.2 12.8 15.9 30.6 31.0 36.1 36.3 39.9
> sort(TEMPERATURE, decreasing = T)
  Q5   Q4   Q1   Q3   Q2  Q10   Q8   Q7   Q9   Q6
39.9 36.3 36.1 31.0 30.6 15.9 12.8 11.2  9.7  6.5
The order() function is also used to sort vector entries in increasing (or decreasing)
order, but rather than return a sorted vector, it returns the positions (order) of the
sorted entries in the original vector. For example:
> order(TEMPERATURE)
 [1]  6  9  7  8 10  2  3  1  4  5
This indicates that the smallest entry in the TEMPERATURE vector is at position (index)
6, the second smallest at position 9, and so on.
i The similar sub() function replaces only the first match of a pattern within each element of a vector.
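The practical value of order() is that its result can be used as an index, either to reproduce sort() or to rearrange companion data. The sketch below rebuilds TEMPERATURE from the values printed above; the SHADE vector (alternating "no" and "full") and the DAT data frame are assumptions made for illustration.

```r
# TEMPERATURE values as printed in the outputs above
TEMPERATURE <- c(Q1 = 36.1, Q2 = 30.6, Q3 = 31.0, Q4 = 36.3, Q5 = 39.9,
                 Q6 = 6.5, Q7 = 11.2, Q8 = 12.8, Q9 = 9.7, Q10 = 15.9)

ord <- order(TEMPERATURE)

# indexing by the order gives the same values as sort()
sorted.by.index <- TEMPERATURE[ord]
same <- all(sorted.by.index == sort(TEMPERATURE))

# order() can also rearrange the rows of a data frame
# (SHADE is an assumed companion variable)
SHADE <- rep(c("no", "full"), times = 5)
DAT <- data.frame(TEMPERATURE, SHADE)
DAT.sorted <- DAT[order(DAT$TEMPERATURE, decreasing = TRUE), ]

print(same)
print(head(DAT.sorted, 3))
```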
The rank() function is used to indicate the ranking of each entry in a vector:
> rank(TEMPERATURE)
 Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9 Q10
  8   6   7   9  10   1   3   4   2   5
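This indicates that the first entry in the TEMPERATURE vector is ranked eighth in
increasing order. The returned vector can be reversed using the rev() function. Note
that this reverses the order of the entries (Q10 first, Q1 last) and leaves each rank
unchanged; ranks in decreasing order (largest value ranked first) are instead obtained
by ranking the negated values, rank(-TEMPERATURE).
> rev(rank(TEMPERATURE))
Q10  Q9  Q8  Q7  Q6  Q5  Q4  Q3  Q2  Q1
  5   2   4   3   1  10   9   7   6   8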
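To make the distinction concrete, the following sketch computes ranks in decreasing order by ranking the negated values. TEMPERATURE is rebuilt from the values printed above; dec.rank and alt are names introduced here for illustration.

```r
TEMPERATURE <- c(Q1 = 36.1, Q2 = 30.6, Q3 = 31.0, Q4 = 36.3, Q5 = 39.9,
                 Q6 = 6.5, Q7 = 11.2, Q8 = 12.8, Q9 = 9.7, Q10 = 15.9)

# ranks in decreasing order: the largest value receives rank 1
dec.rank <- rank(-TEMPERATURE)

# with no tied values this equals (n + 1) minus the increasing rank
alt <- length(TEMPERATURE) + 1 - rank(TEMPERATURE)

print(dec.rank)
print(all(dec.rank == alt))
```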
1.15.2 Formatting data
Rounding
The ceiling() function rounds vector entries up to the nearest integer:
> ceiling(TEMPERATURE)
 Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9 Q10
 37  31  31  37  40   7  12  13  10  16
The floor() function rounds vector entries down to the nearest integer:
> floor(TEMPERATURE)
 Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9 Q10
 36  30  31  36  39   6  11  12   9  15
The trunc() function rounds vector entries to the nearest integer towards ‘0’ (zero):
> trunc(seq(-2, 2, by = 0.5))
[1] -2 -1 -1  0  0  0  1  1  2
The round() function rounds vector entries to the specified number of decimal places
(zero by default). A value falling exactly halfway between two candidates is rounded to
the even digit, and a negative number of decimal places rounds to a power of ten.
> round(TEMPERATURE)
 Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9 Q10
 36  31  31  36  40   6  11  13  10  16
> round(seq(-2, 2, by = 0.5))
[1] -2 -2 -1  0  0  0  1  2  2
> round(TEMPERATURE/2.2, 2)
   Q1    Q2    Q3    Q4    Q5    Q6    Q7    Q8    Q9   Q10
16.41 13.91 14.09 16.50 18.14  2.95  5.09  5.82  4.41  7.23
> round(TEMPERATURE, -1)
 Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9 Q10
 40  30  30  40  40  10  10  10  10  20
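The differences between the rounding functions are easiest to see on a few hand-picked values. The sketch below uses made-up inputs; signif(), which rounds to a number of significant digits rather than decimal places, is a base R function not otherwise covered in this section.

```r
# halves are rounded to the even neighbour
halves <- c(0.5, 1.5, 2.5, 3.5)
rounded <- round(halves)

# ceiling(), floor() and trunc() differ for negative values
x <- c(-1.5, 1.5)
comparison <- rbind(ceiling = ceiling(x), floor = floor(x), trunc = trunc(x))

# signif() rounds to significant digits instead of decimal places
sig <- signif(12345.678, 3)

print(rounded)
print(comparison)
print(sig)
```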
Other formatting
Occasionally (mainly for graphical displays), it is necessary to adjust other aspects
of the formatting of vector entries. For example, you may wish to have numbers
expressed in scientific notation (2.93e-04 rather than 0.000293) or to insert
commas every 3 digits left of the decimal point. These procedures are supported via
the formatC() function.
> seq(pi, pi * 10000, length = 5)
[1]     3.141593  7856.337828 15709.534064 23562.730300
[5] 31415.926536
# scientific notation
> formatC(seq(pi, pi * 10000, length = 5), format = "e",
+     digits = 2)
[1] "3.14e+00" "7.86e+03" "1.57e+04" "2.36e+04" "3.14e+04"
# scientific notation only if it saves space
> formatC(seq(pi, pi * 10000, length = 5), format = "g",
+     digits = 2)
[1] "3.1"     "7.9e+03" "1.6e+04" "2.4e+04" "3.1e+04"
# floating point format with 1000's indicators
> formatC(seq(pi, pi * 10000, length = 5), format = "f",
+     big.mark = ",", digits = 2)
[1] "3.14"      "7,856.34"  "15,709.53" "23,562.73"
[5] "31,415.93"
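A few further formatC() options, sketched on made-up single values: the width and flag arguments control padding, which can be handy when building labels of equal length. Note that formatC() always returns character strings, so the results are for display rather than further arithmetic.

```r
# scientific notation for a small number
sci <- formatC(0.000293, format = "e", digits = 2)

# thousands separators with one decimal place
big <- formatC(1234567.891, format = "f", big.mark = ",", digits = 1)

# pad an integer with leading zeros to a fixed width
padded <- formatC(42L, width = 6, flag = "0")

print(c(sci, big, padded))
print(is.character(sci))
```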
1.16 Functions that perform other functions repeatedly
The replicate() function repeatedly evaluates the expression specified in the second
argument, the number of times indicated by the first argument. The important
distinction between the replicate() function and the rep() function described in
section 1.10.1 is that the former repeatedly evaluates the expression, whereas the latter
evaluates it only once and then duplicates the result multiple times. Since
most functions produce the same result each time they are performed, for many uses
both functions produce identical results. The one group of functions that do not
produce identical results each time are those involved in random number generation.
Hence, the replicate() function is usually used in conjunction with random number
generators (such as runif(), which will be described in greater detail in chapter 4)
to produce sets of random numbers. Consider first the difference between rep() and
replicate():
> rep(runif(1), 5)
[1] 0.4194366 0.4194366 0.4194366 0.4194366 0.4194366
> replicate(5, runif(1))
[1] 0.467324683 0.727337794 0.797764456 0.007025032
[5] 0.155971928
When the expression evaluated by replicate() itself produces a vector of length > 1,
the replicate() function combines the vectors as separate columns of a
matrix:
> replicate(5, runif(5))
          [,1]      [,2]      [,3]      [,4]        [,5]
[1,] 0.3266058 0.3313832 0.2113326 0.4744742 0.257732622
[2,] 0.5241960 0.9801652 0.6642341 0.5292882 0.799982207
[3,] 0.1894848 0.8300792 0.7178351 0.7262750 0.698298026
[4,] 0.1464055 0.6758495 0.9940731 0.3015559 0.288537242
[5,] 0.5491748 0.4052211 0.9923927 0.4074775 0.002170782
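A common use of replicate() is simple simulation. The sketch below is one possible illustration, assuming an arbitrary seed (set.seed(1)) so that the run is repeatable; the object names are introduced here and are not part of the original example.

```r
set.seed(1)  # arbitrary seed, chosen only to make the example repeatable

# rep() evaluates runif(1) once and copies the value
same <- rep(runif(1), 5)

# replicate() re-evaluates the expression each time
different <- replicate(5, runif(1))

# an expression returning a vector yields a matrix:
# one column per replicate, one row per element
M <- replicate(4, runif(3))

# simulate the sampling distribution of the mean of 10 uniform numbers
means <- replicate(1000, mean(runif(10)))

print(length(unique(same)))
print(dim(M))
print(mean(means))
```

The mean of the simulated means should lie close to 0.5, the expected value of a uniform random number between 0 and 1.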
1.16.1 Along matrix margins
The apply() function applies a function to the margins (1=row margins and 2=column
margins) of a matrix. For example, we might have a matrix that represents the
abundance of three species of moth from three habitat types:
> MOTH <- cbind(SpA = c(25, 6, 3), SpB = c(12, 12,
+     3), SpC = c(7, 2, 19))
> rownames(MOTH) <- paste("Habitat", 1:3, sep = "")
> MOTH
         SpA SpB SpC
Habitat1  25  12   7
Habitat2   6  12   2
Habitat3   3   3  19
The apply() function could be used to calculate the column means (mean abundance
of each species across habitat types):
> apply(MOTH, 2, mean)
      SpA       SpB       SpC
11.333333  9.000000  9.333333
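The same matrix can be summarized along its row margins, or with a user-supplied (anonymous) function. This is a short sketch using the MOTH matrix defined above; the result names are introduced here for illustration.

```r
MOTH <- cbind(SpA = c(25, 6, 3), SpB = c(12, 12, 3), SpC = c(7, 2, 19))
rownames(MOTH) <- paste("Habitat", 1:3, sep = "")

# row margins (1): total number of moths in each habitat
habitat.totals <- apply(MOTH, 1, sum)

# column margins (2) with an anonymous function:
# spread (maximum minus minimum) of each species across habitats
species.spread <- apply(MOTH, 2, function(x) max(x) - min(x))

# for means and sums, colMeans(), rowMeans(), colSums() and rowSums()
# give the same answers as the corresponding apply() calls
same.means <- all.equal(colMeans(MOTH), apply(MOTH, 2, mean))

print(habitat.totals)
print(species.spread)
print(same.means)
```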
1.16.2 By factorial groups
The tapply() function applies a function to the vector separately for each level of a
factor combination. This provides a convenient way to calculate group statistics (pivot
tables). For example, if we wanted to calculate the mean TEMPERATURE for each level
of the SHADE factor:
> tapply(TEMPERATURE, SHADE, mean)
no full
25.58 20.42
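A self-contained version of this calculation is sketched below. TEMPERATURE is rebuilt from the values printed earlier; the SHADE factor (alternating "no" and "full", with "no" as the first level) is an assumption that is consistent with the group means shown above.

```r
TEMPERATURE <- c(Q1 = 36.1, Q2 = 30.6, Q3 = 31.0, Q4 = 36.3, Q5 = 39.9,
                 Q6 = 6.5, Q7 = 11.2, Q8 = 12.8, Q9 = 9.7, Q10 = 15.9)
# assumed grouping factor; levels are set explicitly to control their order
SHADE <- factor(rep(c("no", "full"), times = 5), levels = c("no", "full"))

# mean temperature for each level of SHADE
grp.means <- tapply(TEMPERATURE, SHADE, mean)

# any function of a vector can be used, e.g. the number of
# observations or the standard deviation in each group
grp.n <- tapply(TEMPERATURE, SHADE, length)
grp.sd <- tapply(TEMPERATURE, SHADE, sd)

print(grp.means)
print(grp.n)
print(grp.sd)
```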
1.16.3 By objects
The lapply() and sapply() functions apply a function separately to each of the
objects in a list and return a list and vector/matrix respectively. For example, to find
out the length of each of the objects within the EXPERIMENT list:
> lapply(EXPERIMENT, length)
$SITE
[1] 10
$COORDINATES
[1] 5
$TEMPERATURE
[1] 10
$SHADE
[1] 10
> sapply(EXPERIMENT, length)
       SITE COORDINATES TEMPERATURE       SHADE
         10           5          10          10
1.17 Programming in R
Although the library of built-in and add-on tools available for the R environment
is extensive (and continues to grow at an incredible rate), occasionally there is the
need to perform a task for which there are no existing functions. Since R is itself
a programming language (in fact most of the available functions are written in R),
extending its functionality to accommodate additional procedures can be a relatively
simple exercise (depending of course, on the complexity of the procedure and your
level of R proficiency).
1.17.1 Grouped expressions
Multiple commands can be issued on a single line by separating each command by
a semicolon (;). When doing so, commands are evaluated in order from left to
right:
> A <- 1; B <- 2; C <- A + B
> C
[1] 3
When a series of commands is grouped together between braces (such as {command1;
command2;...}), the whole group of commands is evaluated as a single expression
and the value of the last evaluated command within the group is returned:
> D <- {A <- 1; 2 -> B; C <- A + B}
> D
[1] 3
Grouped expressions are useful for wrapping up sets of commands that work together
to produce a single result and since they are treated as a single expression, they too can
be further nested within braces as part of a larger grouped expression.
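As a brief sketch of nesting, the inner braces below are evaluated first and their value is then used by the outer group. The object names (RESULT, NESTED, INNER) are made up for illustration.

```r
# a grouped expression written over several lines;
# the value of the last command (A + B) is returned
RESULT <- {
  A <- 1
  B <- 2
  A + B
}

# grouped expressions can be nested: the inner group returns 6,
# which the outer group then uses
NESTED <- {
  INNER <- {
    2 * 3
  }
  INNER + 1
}

print(RESULT)
print(NESTED)
```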
1.17.2 Conditional execution – if and ifelse
Conditional execution is when the sequence of tasks performed is determined by whether a
condition is met (TRUE) or not (FALSE), and is useful when writing code that needs to be
able to accommodate more than one set of circumstances. In R, conditional execution has the
forms:
if(condition) true.task
if(condition) true.task else false.task
ifelse(condition, true.task, false.task)
If condition returns a TRUE, the statement true.task is evaluated, otherwise the
false.task is evaluated (if provided). If condition cannot be coerced into a logical
(a yes/no answer), an error will be reported. Unlike if, the ifelse() function is
vectorized: it returns one value for each element of condition.
To illustrate the use of the if conditional execution, imagine that you were writing
code to calculate means and you anticipated that you may have to accommodate two
different classes of objects (vectors and matrices). I will use the vector TEMPERATURE
and the matrix MOTH:
> NEW.OBJECT <- TEMPERATURE
> if (is.vector(NEW.OBJECT)) mean(NEW.OBJECT) else apply(NEW.OBJECT,
+     2, mean)
[1] 23
> NEW.OBJECT <- MOTH
> ifelse(is.vector(NEW.OBJECT), mean(NEW.OBJECT),
+     apply(NEW.OBJECT, 2, mean))
[1] 11.33333
Note that because the condition here has length one, ifelse() returns only a single
value (the first of the three column means).
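The difference between the two forms can be sketched as follows: if ... else returns whatever the chosen task returns (here, all three column means), whereas ifelse() is best suited to vector conditions. The TEMPS vector and the "warm"/"cool" labels are made up for illustration.

```r
MOTH <- cbind(SpA = c(25, 6, 3), SpB = c(12, 12, 3), SpC = c(7, 2, 19))

# if ... else returns the full result of the chosen task
col.means <- if (is.vector(MOTH)) mean(MOTH) else apply(MOTH, 2, mean)

# ifelse() returns one value per element of the condition
first.only <- ifelse(is.vector(MOTH), mean(MOTH), apply(MOTH, 2, mean))

# the typical use of ifelse(): recoding a vector element by element
TEMPS <- c(36.1, 30.6, 6.5, 11.2)   # hypothetical values
CLASS <- ifelse(TEMPS > 20, "warm", "cool")

print(col.means)
print(first.only)
print(CLASS)
```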
1.17.3 Repeated execution – looping
Looping enables sets of commands to be performed (executed) repeatedly.
for
A for loop iteratively loops through a vector of integers (a counter), each time
executing the set of commands, and takes on the general form of:
for (counter in sequence) task
where counter is a loop variable, whose value is incremented according to the
integer vector defined by sequence. The task is a single expression or grouped
expression (see section 1.17.1) that utilizes the incrementing variable to perform a
specific operation on a sequence of objects. For a simple example of a for loop, consider
the following snippet that counts to six:
> for (i in 1:6) print(i)
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
As a more applied example, let’s say we wanted to calculate the distances between
each pair of sites in the XY matrix generated in section 1.11.1. The distance between any
two sites (e.g. 'A' and 'B') could be determined using Pythagoras’ theorem
($a^2 + b^2 = c^2$).
> sqrt((XY["A", "X"] - XY["B", "X"])^2 + (XY["A",
+     "Y"] - XY["B", "Y"])^2)
# OR equivalently
> sqrt((XY[1, 1] - XY[2, 1])^2 + (XY[1, 2] - XY[2,
+     2])^2)
[1] 8.446638
A for loop can be used to produce a 5 × 5 matrix of pairwise distances between each of
the sites:
# Create empty object
> DISTANCES <- NULL
> for (i in 1:5) {
+     X.DIST <- (XY[i, 1] - XY[, 1])^2
+     Y.DIST <- (XY[i, 2] - XY[, 2])^2
+     DISTANCES <- cbind(DISTANCES, sqrt(X.DIST +
+         Y.DIST))
+ }
> colnames(DISTANCES) <- rownames(DISTANCES)
> DISTANCES
          A         B         C        D         E
A  0.000000  8.446638 12.459314 4.088251  7.006069
B  8.446638  0.000000 16.836116 8.571143 12.261472
C 12.459314 16.836116  0.000000 9.049691  5.455868
D  4.088251  8.571143  9.049691 0.000000  3.832075
E  7.006069 12.261472  5.455868 3.832075  0.000000
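As a cross-check on a loop like this, the result can be compared with the built-in dist() function, which computes Euclidean distances by default. The sketch below fills a pre-allocated matrix row by row (a variation on the loop above, not the original code) and uses made-up coordinates, since the XY matrix from section 1.11.1 is not reproduced here.

```r
# Hypothetical site coordinates (not the values from section 1.11.1)
XY <- cbind(X = c(1, 4, 0, 6, 3), Y = c(1, 5, 8, 2, 3))
rownames(XY) <- LETTERS[1:5]

# pre-allocate the result and fill it one row at a time
n <- nrow(XY)
DISTANCES <- matrix(0, nrow = n, ncol = n,
                    dimnames = list(rownames(XY), rownames(XY)))
for (i in seq_len(n)) {
  X.DIST <- (XY[i, 1] - XY[, 1])^2
  Y.DIST <- (XY[i, 2] - XY[, 2])^2
  DISTANCES[i, ] <- sqrt(X.DIST + Y.DIST)
}

# the built-in equivalent: dist() returns the lower triangle,
# as.matrix() expands it to the full square matrix
BUILT.IN <- as.matrix(dist(XY))

print(DISTANCES["A", "B"])
print(all.equal(DISTANCES, BUILT.IN, check.attributes = FALSE))
```

Pre-allocating the matrix avoids growing an object inside the loop, which matters little for five sites but becomes noticeable for large numbers of rows.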
while
A while loop executes a set of commands repeatedly while a condition is TRUE and
exits when the condition evaluates to FALSE, and takes the general form:
while (condition) task
where task is a single expression or grouped expression (see section 1.17.1) that
performs a specific operation as long as condition evaluates to TRUE.
To illustrate the use of a while loop, consider the situation where a procedure needs
to generate a temporary object, but you want to be sure that no existing objects are
overwritten. A simple solution is to append a number to the object name. A while
loop can be used to repeatedly assess whether an object name (TEMP) already exists in
the current R environment (each time incrementing a suffix) and eventually generate
a unique name. The first three commands in the following syntax are included purely
to generate a couple of existing names and confirm their existence.
> TEMP <- NULL
> TEMP1 <- NULL
> ls()
 [1] "A"           "AUST"        "B"           "C"
 [5] "D"           "DISTANCES"   "EXP"         "EXPERIMENT"
 [9] "i"           "MOTH"        "NEW.OBJECT"  "op"
[13] "QUADRATS"    "SHADE"       "SITE"        "TEMP"
[17] "TEMP1"       "TEMPERATURE" "X"           "X.DIST"
[21] "XY"          "Y"           "Y.DIST"
# object name suffix, initially empty
> j <- NULL
# proposed temporary object
> NAME <- "TEMP"
# iteratively search for a unique name
> while (exists(Nm <- paste(NAME, j, sep = ""))) {
+     ifelse(is.null(j), j <- 1, j <- j + 1)
+ }
# assign the unique name to a numeric vector
> assign(Nm, c(1, 3, 3))
# Reexamine list of objects, note the new object, TEMP2
> ls()
 [1] "A"           "AUST"        "B"           "C"
 [5] "D"           "DISTANCES"   "EXP"         "EXPERIMENT"
 [9] "i"           "j"           "MOTH"        "NAME"
[13] "NEW.OBJECT"  "Nm"          "op"          "QUADRATS"
[17] "SHADE"       "SITE"        "TEMP"        "TEMP1"
[21] "TEMP2"       "TEMPERATURE" "X"           "X.DIST"
[25] "XY"          "Y"           "Y.DIST"
The exists() function assesses whether an object of the given name already exists, and
the assign() function creates an object with the name given by the first argument and
assigns it the value of the second argument.
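The same idea can be sketched in a self-contained way by working inside a separate environment, so that the example does not depend on (or clutter) the objects in the current workspace. This is a variation on the loop above rather than the original code: the environment e is an assumption of this sketch, and the suffix starts at zero with a plain counter in place of the ifelse() call.

```r
# a scratch environment standing in for the workspace (assumed for this sketch)
e <- new.env()
assign("TEMP", NULL, envir = e)
assign("TEMP1", NULL, envir = e)

NAME <- "TEMP"   # proposed temporary object name
j <- 0           # numeric suffix, incremented until the name is unused
Nm <- NAME

# keep trying new suffixes while the candidate name is already taken
while (exists(Nm, envir = e, inherits = FALSE)) {
  j <- j + 1
  Nm <- paste(NAME, j, sep = "")
}

# create the object under the unique name
assign(Nm, c(1, 3, 3), envir = e)

print(Nm)
print(ls(e))
print(get(Nm, envir = e))
```

Setting inherits = FALSE restricts the search to the chosen environment, so a name that exists elsewhere on the search path is not mistaken for a clash.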
1.17.4 Writing functions
For all but the most trivial cases, lines of R code should be organized into a new function
which can then be used in the same way as the built-in functions. Functions are defined
using the function() function:
> name <- function(argument1, argument2, ...) expression
The new function (called name) will use the arguments (argument1, argument2,
...) to evaluate the expression (usually grouped expressions – see section 1.17.1) and
return the result of the evaluated expression. Once defined, the function is called by
issuing a statement in the form:
> name(argument1, argument2, ...)
Functions not only provide a more elegant way to interact with a procedure (as all
arguments are provided in one location, and the internal workings are hidden from
view), they also form a reusable extension of the R environment. As such, there are a few
general programming conventions that are worth adhering to. Firstly, each function
should perform only a single task. If a series of tasks is required, consider writing a
number of functions that are in turn called from another function. Secondly, where
possible, provide default options, thereby simplifying the use of the function for most
regular occasions. Thirdly, the names of user-defined functions should be in either upper
case or camel case so as to avoid conflicting with functions built into R or one of the many
extension packages.
For example, we could extend the functionality of R by writing a function that
estimates the standard error of the mean. The standard error of the mean can be
estimated using the formula $sd/\sqrt{n-1}$, where $sd$ is the standard deviation of the
sample and $n$ is the number of observations.
> SEM <- function(x, na.rm = FALSE) {
+     if (na.rm == TRUE)
+         VAR <- x[!is.na(x)]
+     else VAR <- x
+     SD <- sd(VAR)
+     N <- length(VAR)
+     SD/sqrt(N - 1)
+ }
The function first assesses whether missing values (values of 'NA') should be removed
(based on the value of na.rm supplied by the function user). If the function is called
with na.rm=TRUE, the is.na() function is used to exclude such values, before the
standard deviation and length are calculated using the sd() (see footnote j) and length()
functions. Finally, the standard error of the mean is calculated and returned. This function
could then be used to calculate the standard error of the mean for the TEMPERATURE
vector:
> SEM(TEMPERATURE)
[1] 4.30145
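The effect of the na.rm argument can be sketched by adding a missing value to the data. The function below is a condensed restatement of SEM() as defined above; the WITH.NA vector (with an extra Q11 entry) is made up for illustration. For comparison, the last line computes the more widely used estimate that divides by $\sqrt{n}$ rather than $\sqrt{n-1}$, which gives a slightly smaller value.

```r
# condensed restatement of the SEM() function defined above
SEM <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  sd(x)/sqrt(length(x) - 1)
}

TEMPERATURE <- c(Q1 = 36.1, Q2 = 30.6, Q3 = 31.0, Q4 = 36.3, Q5 = 39.9,
                 Q6 = 6.5, Q7 = 11.2, Q8 = 12.8, Q9 = 9.7, Q10 = 15.9)

# hypothetical vector with one missing observation appended
WITH.NA <- c(TEMPERATURE, Q11 = NA)

sem.full <- SEM(TEMPERATURE)
sem.na <- SEM(WITH.NA)                   # NA: sd() cannot handle the missing value
sem.na.rm <- SEM(WITH.NA, na.rm = TRUE)  # missing value removed first

# the more common definition of the standard error, for comparison
sem.sqrt.n <- sd(TEMPERATURE)/sqrt(length(TEMPERATURE))

print(c(sem.full, sem.na, sem.na.rm, sem.sqrt.n))
```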
1.18 An introduction to the R graphical environment
In addition to providing a highly adaptable statistical environment, R is also a graphical
environment in which figures suitable for publication can be generated. The R graphical
environment consists of one or more graphical devices along with an extensive library
of functions for manipulating objects on these devices. A graphical device is an output
stream such as a window, file or printer that is capable of receiving and interpreting
graphical/plotting instructions. The extensive range of graphical functions can be
broadly broken down into three categories:
• High-level graphics (plotting) functions are used to generate a new plot on a graphical
device, and, unless directed otherwise, accompanying axes, labels and the appropriate (yet
basic) points/bars/boxes etc. are also automatically generated. When these functions are
issued, a graphical device (a window unless otherwise specified) is opened and activated.
If the device is already active, the previous plot will be overwritten. Whilst these functions
form the basis of all graphics in R, they are rarely used in isolation to produce graphs, as
they offer only limited potential for customization.
• Low-level graphics functions are used to customize and enhance existing plots by adding
more objects and information, such as additional points, lines, words, axes, colors etc.
• Interactive graphics functions allow information to be added or extracted interactively
from existing plots using the mouse. For example, a label may be added to a plot at
the location of the mouse pointer, thereby simplifying the interaction with the graphical
device’s coordinate system.
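The first two categories can be sketched in a few lines. In this illustration (with made-up data), a file device is opened explicitly with pdf(), a high-level function (plot()) creates the plot, and low-level functions (abline(), points(), text()) add to it. A temporary file is used only so that the example leaves nothing behind; interactive functions are omitted because they require a mouse.

```r
# hypothetical data
x <- 1:10
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 20.2)

# open a graphical device (here a PDF file rather than a window)
f <- tempfile(fileext = ".pdf")
pdf(f)

# high-level function: starts a new plot with axes and labels
plot(x, y, xlab = "X", ylab = "Y", main = "High-level plot")

# low-level functions: add to the existing plot
abline(lm(y ~ x), lty = 2)                   # fitted line
points(x[5], y[5], pch = 16, col = "red")    # highlight one point
text(x[5], y[5], labels = "obs 5", pos = 4)  # label it

# close the device so that the file is written
invisible(dev.off())

print(file.exists(f))
```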
j The sd() function returns NA when a vector containing missing values is encountered.