Part II. The Data Analysis Workflow

CHAPTER 12

Getting Data

Data can come from many sources. R comes with many datasets built in, and there is
more data in many of the add-on packages. R can read data from a wide variety of sources
and in a wide variety of formats. This chapter covers importing data from text files
(including spreadsheet-like data in comma- or tab-delimited format, XML, and JSON),
binary files (Excel spreadsheets and data from other analysis software), websites, and
databases.

Chapter Goals
After reading this chapter, you should:
• Be able to access datasets provided with R packages
• Be able to import data from text files
• Be able to import data from binary files
• Be able to download data from websites
• Be able to import data from a database

Built-in Datasets
One of the packages in the base R distribution is called datasets, and it is entirely filled
with example datasets. While you’ll be lucky if any of them are suited to your particular
area of research, they are ideal for testing your code and for exploring new techniques.
Many other packages also contain datasets. You can see all the datasets that are available
in the packages that you have loaded using the data function:
data()

For a more complete list, including data from all packages that have been installed, use
this invocation:
data(package = .packages(TRUE))

To access the data in any of these datasets, call the data function, this time passing the
name of the dataset and the package where the data is found (if the package has been
loaded, then you can omit the package argument):
data("kidney", package = "survival")

Now the kidney data frame can be used just like your own variables:
head(kidney)
##   id time status age sex disease frail
## 1  1    8      1  28   1   Other   2.3
## 2  1   16      1  28   1   Other   2.3
## 3  2   23      1  48   2      GN   1.9
## 4  2   13      0  48   2      GN   1.9
## 5  3   22      1  32   1   Other   1.2
## 6  3   28      1  32   1   Other   1.2

Reading Text Files
There are many, many formats and standards of text documents for storing data. Com‐
mon formats for storing data are delimiter-separated values (CSV or tab-delimited),
eXtensible Markup Language (XML), JavaScript Object Notation (JSON), and YAML
(which recursively stands for YAML Ain’t Markup Language). Other sources of text data
are less well-structured—a book, for example, contains text data without any formal
(that is, standardized and machine parsable) structure.
The main advantage of storing data in text files is that they can be read by more or less
all other data analysis software and by humans. This makes your data more widely
reusable by others.

CSV and Tab-Delimited Files
Rectangular (spreadsheet-like) data is commonly stored in delimited-value files, par‐
ticularly comma-separated values (CSV) and tab-delimited values files. The read.table
function reads these delimited files and stores the results in a data frame. In its simplest
form, it just takes the path to a text file and imports the contents.
RedDeerEndocranialVolume.dlm is a whitespace-delimited file containing measure‐
ments of the endocranial volume of some red deer, measured using different techniques.
(For those of you with an interest in deer skulls, the methods are computer tomography;
filling the skull with glass beads; measuring the length, width, and height with calipers;
and using Finarelli’s equation. A second measurement was taken in some cases to get
an idea of the accuracy of the techniques. I’ve been assured that the deer were already
long dead before they had their skulls filled with beads!) The data file can be found
inside the extdata folder in the learningr package. The first few rows are shown in
Table 12-1.
Table 12-1. Sample data from RedDeerEndocranialVolume.dlm
SkullID   VolCT  VolBead  VolLWH  VolFinarelli  VolCT2  VolBead2  VolLWH2
DIC44       389      375    1484           337
B11         389      370    1722           377
DIC90       352      345    1495           328
DIC83       388      370    1683           377
DIC787      375      355    1458           328
DIC1573     325      320    1363           291
C120        346      335    1250           289     346       330     1264
C25         302      295    1011           250     303       295     1009
F7          379      360    1621           347     375       365     1647

The data has a header row, so we need to pass the argument header = TRUE to
read.table. Since a second measurement wasn’t always taken, not all the lines are
complete. Passing the argument fill = TRUE makes read.table substitute NA values
for the missing fields. The system.file function in the following example is used to
locate files that are inside a package (in this case, the RedDeerEndocranialVolume.dlm
file in the extdata folder of the package learningr).
library(learningr)
deer_file <- system.file(
  "extdata",
  "RedDeerEndocranialVolume.dlm",
  package = "learningr"
)
deer_data <- read.table(deer_file, header = TRUE, fill = TRUE)
str(deer_data, vec.len = 1)   #vec.len alters the amount of output
## 'data.frame':    33 obs. of  8 variables:
##  $ SkullID     : Factor w/ 33 levels "A4","B11","B12",..: 14 ...
##  $ VolCT       : int 389 389 ...
##  $ VolBead     : int 375 370 ...
##  $ VolLWH      : int 1484 1722 ...
##  $ VolFinarelli: int 337 377 ...
##  $ VolCT2      : int NA NA ...
##  $ VolBead2    : int NA NA ...
##  $ VolLWH2     : int NA NA ...

head(deer_data)
##   SkullID VolCT VolBead VolLWH VolFinarelli VolCT2 VolBead2 VolLWH2
## 1   DIC44   389     375   1484          337     NA       NA      NA
## 2     B11   389     370   1722          377     NA       NA      NA
## 3   DIC90   352     345   1495          328     NA       NA      NA
## 4   DIC83   388     370   1683          377     NA       NA      NA
## 5  DIC787   375     355   1458          328     NA       NA      NA
## 6 DIC1573   325     320   1363          291     NA       NA      NA

Notice that the class of each column has been automatically determined, and row and
column names have been automatically assigned. The column names are (by default)
forced to be valid variable names (via make.names), and if row names aren’t provided
the rows are simply numbered 1, 2, 3, and so on.
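For example, make.names replaces characters that are not allowed in variable names, such as spaces, with dots (the two column headings here are invented):
make.names(c("Skull ID", "Vol CT"))
## [1] "Skull.ID" "Vol.CT"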
There are lots of arguments to specify how the file will be read; perhaps the most
important is sep, which determines the character to use as a separator between fields.
You can also specify how many rows of data to read (via nrows) and how many lines at
the start of the file to skip (via skip). More advanced options include the ability to
override the default row names, column names, and classes, and to specify the character
encoding of the input file and how string input columns should be declared.
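To make these options concrete, here is a sketch of a call that combines several of them; the filename, separator, and column classes are invented purely for illustration:
my_data <- read.table(
  "my data file.txt",                  #hypothetical file
  sep              = ";",              #fields are separated by semicolons
  header           = TRUE,             #the first line holds the column names
  skip             = 2,                #ignore the first two lines of the file
  nrows            = 100,              #read at most 100 rows of data
  colClasses       = c("character", "numeric", "Date"),  #override the column classes
  fileEncoding     = "UTF-8",          #character encoding of the input file
  stringsAsFactors = FALSE             #keep string columns as character vectors
)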
There are several convenience wrapper functions to read.table. read.csv sets the
default separator to a comma, and assumes that the data has a header row. read.csv2
is its European cousin, using a comma for decimal places and a semicolon as a separator.
Likewise read.delim and read.delim2 import tab-delimited files with full stops1 or
commas for decimal places, respectively.
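As a sketch of how the wrappers line up with those conventions (the filenames are invented):
uk_csv  <- read.csv("sales.csv")        #comma-separated, "." for decimals
eur_csv <- read.csv2("umsatz.csv")      #semicolon-separated, "," for decimals
uk_tab  <- read.delim("sales.txt")      #tab-delimited, "." for decimals
eur_tab <- read.delim2("umsatz.txt")    #tab-delimited, "," for decimals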
Back in August 2008, scientists from the Centre for Environment, Fisheries, and Aqua‐
culture Science (CEFAS) in Lowestoft, UK, attached a tag with a pressure sensor and a
temperature sensor to a brown crab and dropped it into the North Sea. The crab then
spent just over a year doing crabby things2 before being caught by fishermen, who re‐
turned the tag to CEFAS.
The data from this tag is stored in a CSV file, along with some metadata. The first few
rows of the file look like this:

1. Or periods, if you are American.
2. Specifically, migrating from the eastern North Sea near Germany to the western North Sea near the UK.

Comment :- clock reset to download data
The following data are the ID block contents
Firmware Version No                   2
Firmware Build Level                  70
The following data are the Tag notebook contents
Mission Day                           405
Last Deployment Date                  08/08/2008 09:55:00
Deployed by Host Version              5.2.0
Downloaded by Host Version            6.0.0
Last Clock Set Date                   05/01/2010 10:34:00
The following data are the Lifetime notebook contents
Tag ID                                A03401
Pressure Range                        10
No of sensors                         2

In this case, we can’t just read everything with a single call to read.csv, since different
pieces of data have different numbers of fields, and indeed different fields. We need to
use the skip and nrows arguments of read.csv to specify which bits of the file to read:
crab_file <- system.file(
  "extdata",
  "crabtag.csv",
  package = "learningr"
)
(crab_id_block <- read.csv(
  crab_file,
  header = FALSE,
  skip   = 3,
  nrows  = 2
))
##                     V1 V2
## 1  Firmware Version No  2
## 2 Firmware Build Level 70
(crab_tag_notebook <- read.csv(
  crab_file,
  header = FALSE,
  skip   = 8,
  nrows  = 5
))
##                           V1                  V2
## 1                Mission Day                 405
## 2       Last Deployment Date 08/08/2008 09:55:00
## 3   Deployed by Host Version               5.2.0
## 4 Downloaded by Host Version               6.0.0
## 5        Last Clock Set Date 05/01/2010 10:34:00
(crab_lifetime_notebook <- read.csv(
  crab_file,
  header = FALSE,
  skip   = 15,
  nrows  = 3
))
##               V1     V2
## 1         Tag ID A03401
## 2 Pressure Range     10
## 3  No of sensors      2

The colbycol and sqldf packages contain functions that allow you to
read part of a CSV file into R. This can provide a useful speed-up if
you don’t need all the columns or all the rows.
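
The sqldf function read.csv.sql, for example, filters a CSV file with an SQL query as it is read, so only the matching rows and columns ever reach R. This is just a sketch; the big_measurements.csv file is invented for illustration:
library(sqldf)
#Within the query the file is referred to by the table name "file";
#only the two selected columns and the matching rows are imported.
tall_skulls <- read.csv.sql(
  "big_measurements.csv",   #hypothetical file
  sql = "select SkullID, VolCT from file where VolCT > 350"
)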

For really low-level control when importing this sort of file, you can use the scan
function, on which read.table is based. Ordinarily, you should never have to resort to
scan, but it can be useful for malformed or nonstandard files.
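As a small sketch of that low-level control (the filename is invented), scan simply returns a vector of fields for you to reshape yourself:
#Read whitespace-separated numbers into one long numeric vector,
#skipping a single header line; the file itself is hypothetical.
raw_values <- scan("ragged numbers.txt", what = numeric(), skip = 1, quiet = TRUE)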
If your data has been exported from another language, you may need
to pass the na.strings argument to read.table. For data exported
from SQL, use na.strings = "NULL". For data exported from SAS or
Stata, use na.strings = ".". For data exported from Excel, use
na.strings = c("", "#N/A", "#DIV/0!", "#NUM!").
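
For instance, to import a hypothetical Excel export with its error codes treated as missing values:
from_excel <- read.csv(
  "excel export.csv",   #hypothetical file
  na.strings = c("", "#N/A", "#DIV/0!", "#NUM!")
)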

The opposite task, writing files, is generally simpler than reading files, since you don’t
need to worry about oddities in the file—you usually want to create something standard.
read.table and read.csv have the obviously named counterparts write.table and
write.csv.
Both functions take a data frame and a file path to write to. They also provide a few
options to customize the output (whether or not to include row names and what the
character encoding of the output file should be, for example):
write.csv(
  crab_lifetime_notebook,
  "Data/Cleaned/crab lifetime data.csv",
  row.names    = FALSE,
  fileEncoding = "utf8"
)

Unstructured Text Files
Not all text files have a well-defined structure like delimited files do. If the structure of
the file is weak, it is often easier to read in the file as lines of text and then parse or
manipulate the contents afterward. readLines (notice the capital “L”) provides such a
facility. It accepts a path to a file (or a file connection) and, optionally, a maximum
number of lines to read. Here we import the Project Gutenberg version of Shakespeare’s
The Tempest:
text_file <- system.file(
  "extdata",
  "Shakespeare's The Tempest, from Project Gutenberg pg2235.txt",
  package = "learningr"
)
the_tempest <- readLines(text_file)
the_tempest[1926:1927]
## [1] "
Ste. Foure legges and two voyces; a most delicate"
## [2] "Monster: his forward voyce now is to speake well of"

writeLines performs the reverse operation to readLines. It takes a character vector
and a file to write to:

writeLines(
  rev(the_tempest),   #rev reverses vectors
  "Shakespeare's The Tempest, backwards.txt"
)

XML and HTML Files
XML files are widely used for storing nested data. Many standard file types and protocols
are based upon it, such as RSS (Really Simple Syndication) for news feeds, SOAP (Simple
Object Access Protocol) for passing structured data across computer networks, and the
XHTML flavor of web pages.
Base R doesn’t ship with the capability to read XML files, but the XML package is devel‐
oped by an R Core member. Install it now!
install.packages("XML")

When you import an XML file, the XML package gives you a choice of storing the result
using internal nodes (that is, objects are stored with C code, the default) or R nodes.
Usually you want to store things with internal nodes, because it allows you to query the
node tree using XPath (more on this in a moment).
There are several functions for importing XML data, such as xmlParse and some other
wrapper functions that use slightly different defaults:
library(XML)
xml_file <- system.file("extdata", "options.xml", package = "learningr")
r_options <- xmlParse(xml_file)

One of the problems with using internal nodes is that summary functions like str and
head don’t work with them. To use R-level nodes, set useInternalNodes = FALSE (or
use xmlTreeParse, which sets this as the default):
xmlParse(xml_file, useInternalNodes = FALSE)
xmlTreeParse(xml_file)   #the same

XPath is a language for interrogating XML documents, letting you find nodes that cor‐
respond to some filter. In this next example we look anywhere in the document (//) for
a node named variable where ([]) the name attribute (@) contains the string warn.
xpathSApply(r_options, "//variable[contains(@name, 'warn')]")
## [[1]]
## <variable name="nwarnings">
##   <value>
##     50
##   </value>
## </variable>
## 
## [[2]]
## <variable name="warn">
##   <value>
##     0
##   </value>
## </variable>
## 
## [[3]]
## <variable name="warning.length">
##   <value>
##     1000
##   </value>
## </variable>

This sort of querying is very useful for extracting data from web pages. The equivalent
functions for importing HTML pages are named as you might expect, htmlParse and
htmlTreeParse, and they behave in the same way.
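If you want just the contents or attributes of the matching nodes rather than the nodes themselves, you can pass a function as the third argument to xpathSApply, and it will be applied to each match. For example, with the r_options document from above (the results depend upon the options set in your session):
#xmlValue returns the text inside each matching node
xpathSApply(r_options, "//variable[contains(@name, 'warn')]", xmlValue)
#xmlGetAttr retrieves a named attribute from each matching node
xpathSApply(r_options, "//variable[contains(@name, 'warn')]", xmlGetAttr, "name")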
XML is also very useful for serializing (a.k.a. “saving”) objects in a format that can be
read by most other pieces of software. The XML package doesn’t provide serialization
functionality, but it is available via the makexml function in the Runiversal package.
The options.xml file was created with this code:
library(Runiversal)
ops <- as.list(options())
cat(makexml(ops), file = "options.xml")

JSON and YAML Files
The main problems with XML are that it is very verbose, and you need to explicitly
specify the type of the data (it can’t tell the difference between a string and a number by
default), which makes it even more verbose. When file sizes are important (such as when
you are transferring big datasets across a network), this verbosity becomes problematic.
YAML and its subset JSON were invented to solve these problems. They are much better
suited to transporting many datasets—particularly numeric data and arrays—over
networks. JSON is the de facto standard for web applications to pass data between
themselves.
There are two packages for dealing with JSON data: RJSONIO and rjson. For a long time
rjson had performance problems, so the only package that could be recommended was
RJSONIO. The performance issues have now been fixed, so it’s a much closer call. For
most cases, it now doesn’t matter which package you use. The differences occur when
you encounter malformed or nonstandard JSON.
RJSONIO is generally more forgiving than rjson when reading incorrect JSON. Whether

this is a good thing or not depends upon your use case. If you think it is better to import
JSON data with minimal fuss, then RJSONIO is best. If you want to be alerted to problems
with the JSON data (perhaps it was generated by a colleague—I’m sure you would never
generate malformed JSON), then rjson is best.
Fortunately, both packages have identically named functions for reading and writing
JSON data, so it is easy to swap between them. In the following example, the double
colons, ::, are used to distinguish which package each function should be taken from
(if you only load one of the two packages, then you don’t need the double colons):
library(RJSONIO)
library(rjson)
jamaican_city_file <- system.file(
  "extdata",
  "Jamaican Cities.json",
  package = "learningr"
)
(jamaican_cities_RJSONIO <- RJSONIO::fromJSON(jamaican_city_file))
## $Kingston
## $Kingston$population
## [1] 587798
## 
## $Kingston$coordinates
## longitude  latitude
##     17.98     76.80
## 
## 
## $`Montego Bay`
## $`Montego Bay`$population
## [1] 96488
## 
## $`Montego Bay`$coordinates
## longitude  latitude
##     18.47     77.92

(jamaican_cities_rjson <- rjson::fromJSON(file = jamaican_city_file))
## $Kingston
## $Kingston$population
## [1] 587798
## 
## $Kingston$coordinates
## $Kingston$coordinates$longitude
## [1] 17.98
## 
## $Kingston$coordinates$latitude
## [1] 76.8
## 
## 
## 
## $`Montego Bay`
## $`Montego Bay`$population
## [1] 96488
## 
## $`Montego Bay`$coordinates
## $`Montego Bay`$coordinates$longitude
## [1] 18.47
## 
## $`Montego Bay`$coordinates$latitude
## [1] 77.92

Notice that RJSONIO simplifies the coordinates for each city to be a vector. This behavior
can be turned off with simplify = FALSE, resulting in exactly the same object as the
one generated by rjson.
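That is, this one-line sketch should return the same nested-list structure as the rjson version above:
jamaican_cities_nested <- RJSONIO::fromJSON(jamaican_city_file, simplify = FALSE)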
Annoyingly, the JSON spec doesn’t allow infinite or NaN values, and it’s a little fuzzy on
what a missing number should look like. The two packages deal with these values dif‐
ferently—RJSONIO maps NaN and NA to JSON’s null value but preserves positive and
negative infinity, while rjson converts all these values to strings:
special_numbers <- c(NaN, NA, Inf, -Inf)
RJSONIO::toJSON(special_numbers)
## [1] "[ null, null,

Inf,

-Inf ]"

rjson::toJSON(special_numbers)
## [1] "[\"NaN\",\"NA\",\"Inf\",\"-Inf\"]"

Since both these methods are hacks to deal with JSON’s limited spec, if you find yourself
dealing with these special number types a lot (or want to write comments in your data
object), then you are better off using YAML. The yaml package has two functions for
importing YAML data. yaml.load accepts a string of YAML and converts it to an R
object, and yaml.load_file does the same, but treats its string input as a path to a file
containing YAML:
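Here is a minimal sketch of both functions; the inline YAML string is invented, and options.yaml is a hypothetical file path:
library(yaml)
yaml.load("lorem: 1\nipsum: [2.718, 3.142]")   #parse a YAML string directly
my_opts <- yaml.load_file("options.yaml")      #parse a file containing YAML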
