Chapter 10. Importing and Exporting Data
Validation by Pattern Matching
Using Patterns to Match Broad Content Types
Using Patterns to Match Numeric Values
Using Patterns to Match Dates or Times
Using Patterns to Match Email Addresses and URLs
Validation Using Table Metadata
Validation Using a Lookup Table
Converting Two-Digit Year Values to Four-Digit Form
Performing Validity Checking on Date or Time Subparts
Writing Date-Processing Utilities
Using Dates with Missing Components
Performing Date Conversion Using SQL
Using Temporary Tables for Data Transformation
Dealing with NULL Values
Guessing Table Structure from a Datafile
A LOAD DATA Diagnostic Utility
Exchanging Data Between MySQL and Microsoft Access
Exchanging Data Between MySQL and Microsoft Excel
Exchanging Data Between MySQL and FileMaker Pro
Exporting Query Results as XML
Importing XML into MySQL
Suppose you have a file named somedata.csv that contains 12 columns of data in comma-separated values (CSV) format. From this file you want to extract only columns 2, 11, 5, and
9, and use them to create database records in a MySQL table that contains name, birth,
height, and weight columns. You need to make sure that the height and weight are positive
integers, and convert the birth dates from MM/DD/YY format to CCYY-MM-DD format. How can
you do this?
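The book's utilities are written in Perl, but the shape of the task can be sketched in Python. The column positions, the positive-integer checks, and the century pivot below are illustrative choices, not the book's exact code:

```python
import csv
from io import StringIO

def extract_rows(csv_text):
    """Pull 1-based columns 2, 11, 5, and 9 (name, birth, height, weight)
    from CSV input, validate height/weight, and reformat the birth date."""
    rows = []
    for fields in csv.reader(StringIO(csv_text)):
        name, birth, height, weight = (fields[i - 1] for i in (2, 11, 5, 9))
        # height and weight must be positive integers
        if not (height.isdigit() and int(height) > 0):
            continue
        if not (weight.isdigit() and int(weight) > 0):
            continue
        # convert MM/DD/YY to CCYY-MM-DD; the pivot year 70 is an
        # arbitrary assumption for guessing the century
        mm, dd, yy = birth.split("/")
        cc = "19" if int(yy) >= 70 else "20"
        rows.append((name, f"{cc}{yy}-{mm}-{dd}", int(height), int(weight)))
    return rows
```

The result tuples could then be written out in tab-delimited form for LOAD DATA, or inserted directly through a database API.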
In one sense, that problem is very specialized. But in another, it's not at all atypical, because
data transfer problems with specific requirements occur frequently when transferring data into
MySQL. It would be nice if datafiles were always nicely formatted and ready to load into
MySQL with no preparation, but that is frequently not so. As a result, it's often necessary to
preprocess information to put it into a format that MySQL finds acceptable. The reverse also is
true; data exported from MySQL may need massaging to be useful for other programs.
Some data transfer operations are so difficult that they require a great deal of hand checking
and reformatting, but in most cases, you can do at least part of the job automatically. Virtually
all transfer problems involve at least some elements of a common set of conversion issues.
This chapter discusses what these issues are, how to deal with them by taking advantage of
the existing tools at your disposal, and how to write your own tools when necessary. The idea
is not to cover all possible import and export situations (an impossible task), but to show
some representative techniques and utilities. You can use them as is, or adapt them for
problems that they don't handle. (There are also commercial conversion tools that may assist
you, but my purpose here is to help you do things yourself.)
The first recipes in the chapter cover MySQL's native facilities for importing data (the LOAD
DATA statement and the mysqlimport command-line program), and for exporting data (the
SELECT ... INTO OUTFILE statement and the mysqldump program). For operations that don't
require any data validation or reformatting, these facilities often are sufficient.
For situations where MySQL's native import and export capabilities do not suffice, the chapter
moves on to cover techniques for using external supporting utilities and for writing your own.
To some extent, you can avoid writing your own tools by using existing programs. For
example, cut can extract columns from a file, and sed and tr can be used as postprocessors to
convert query output into other formats. But you'll probably eventually reach the point where
you decide to write your own programs. When you do, there are two broad sets of issues to consider:
How to manipulate the structure of datafiles. When a file is in a format that isn't
suitable for import, you'll need to convert it to a different format. This may involve
issues such as changing the column delimiters or line-ending sequences, or removing
or rearranging columns in the file.
How to manipulate the content of datafiles. If you don't know whether the values
contained in a file are legal, you may want to preprocess it to check or reformat them.
Numeric values may need to be verified as lying within a specific range, dates may
need to be converted to or from ISO format, and so forth.
Source code for the program fragments and scripts discussed in this chapter is located in the
transfer directory of the recipes distribution, with the exception that some of the utility
functions are contained in library files located in the lib directory. The code for some of the
shorter utilities is shown in full. For the longer ones, the chapter generally discusses only how
they work and how to use them, but you have access to the source if you wish to investigate
in more detail how they're written.
The problems addressed in this chapter involve a lot of text processing and pattern matching.
These are particular strengths of Perl, so the program fragments and utilities shown here are
written mainly in Perl. PHP and Python provide pattern-matching capabilities, too, so they can
of course do many of the same things. If you want to adapt the techniques described here for
Java, you'll need to get a library that provides classes for regular expression-based pattern
matching. See Appendix A for suggestions.
10.1.1 General Import and Export Issues
Incompatible datafile formats and differing rules for interpreting various kinds of values lead
to many headaches when transferring data between programs. Nevertheless, certain issues
recur frequently. By being aware of them, you'll be able to identify more easily just what you
need to do to solve particular import or export problems.
In its most basic form, an input stream is just a set of bytes with no particular meaning.
Successful import into MySQL requires being able to recognize which bytes represent
structural information, and which represent the data values framed by that structure. Because
such recognition is key to decomposing the input into appropriate units, the most fundamental
import issues are these:
What is the record separator? Knowing this allows the input stream to be partitioned into records.
What is the field delimiter? Knowing this allows each record to be partitioned into field
values. Recovering the original data values also may include stripping off quotes from
around the values or recognizing escape sequences within them.
The ability to break apart the input into records and fields is important for extracting the data
values from it. However, the values still might not be in a form that can be used directly, and
you may need to consider other issues:
Do the order and number of columns match the structure of the database table?
Mismatches require columns to be rearranged or skipped.
Do data values need to be validated or reformatted? If the values are in a format that
matches MySQL's expectations, no further processing is necessary. Otherwise, they
need to be checked and possibly rewritten.
How should NULL or empty values be handled? Are they allowed? Can NULL values
even be detected? (Some systems export NULL values as empty strings, making it
impossible to distinguish one from the other.)
For export from MySQL, the issues are somewhat the reverse. You probably can assume that
values stored in the database are valid, but they may require reformatting, and it's necessary
to add column and record delimiters to form an output stream that has a structure another
program can recognize.
The chapter deals with these issues primarily within the context of performing bulk transfers of
entire files, but many of the techniques discussed here can be applied in other situations as
well. Consider a web-based application that presents a form for a user to fill in, then processes
its contents to create a new record in the database. That is a data import situation. Web APIs
generally make form contents available as a set of already-parsed discrete values, so the
application may not need to deal with record and column delimiters. On the other hand,
validation issues remain paramount. You really have no idea what kind of values a user is
sending your script, so it's important to check them.
10.1.2 File Formats
Datafiles come in many formats, two of which are used frequently in this chapter:
Tab-delimited format. This is one of the simplest file structures; lines contain values
separated by tab characters. A short tab-delimited file might look like this, where the
whitespace between column values represents single tab characters:
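(The sample data here is illustrative; the runs of spaces stand in for single tabs.)

```
Bill    1970-01-01      72      180
Laura   1972-02-02      64      135
```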
Comma-separated values (CSV) format. Files written in CSV format vary somewhat,
because there is apparently no actual standard describing the format. However, the
general idea is that lines consist of values separated by commas, and values
containing internal commas are surrounded by quotes to prevent the commas from
being interpreted as value delimiters. It's also common for values containing spaces to
be quoted as well. Here is an example, where each line contains three values:
It's trickier to process CSV files than tab-delimited files, because characters like
quotes and commas have a dual meaning: they may represent file structure or be part
of data values.
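(An illustrative CSV file with three values per line; note the quoted value containing an internal comma.)

```
"Bill","1970-01-01",180
"Laura, Jr.","1972-02-02",135
Fred,1973-03-03,120
```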
Another important datafile characteristic is the line-ending sequence. The most common
sequences are carriage returns, linefeeds, and carriage return/linefeed pairs, sometimes
referred to here by the abbreviations CR, LF, and CRLF.
Datafiles often begin with a row of column labels. In fact, a CSV file that begins with a row of
names is what FileMaker Pro refers to as merge format. For some import operations, the row
of labels is an annoyance because you must discard it to avoid having the labels be loaded into
your table as a data record. But in other cases, the labels are quite useful:
For import into existing tables, the labels can be used to match up datafile columns
with the table columns when the two are not in the same order.
The labels can be used for column names when creating a new table automatically or
semi-automatically from a datafile. For example, Recipe 10.37 later in the chapter
discusses a utility that examines a datafile and guesses the CREATE TABLE statement
that should be used to create a table from the file. If a label row is present, the utility
uses the labels for column names. Otherwise, it's necessary to make up generic names like
c1, c2, and so forth, which isn't very descriptive.
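The label-or-generic-name decision can be sketched as follows. This is a simplified stand-in for the utility the recipe describes, not its actual code:

```python
def column_names(first_row, has_labels):
    """Return column names for a new table: the datafile's label row
    if one is present, generic c1, c2, ... names otherwise."""
    if has_labels:
        return list(first_row)
    return ["c%d" % (i + 1) for i in range(len(first_row))]
```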
Tab-Delimited, Linefeed-Terminated Format
Although datafiles may be written in many formats, it's unlikely that you'll want to
include machinery for reading several different formats within each file-processing
utility that you write. I don't want to, either, so for that reason, many of the utilities
described in this chapter assume for simplicity that their input is in tab-delimited,
linefeed-terminated format. (This is also the default format for MySQL's LOAD DATA
statement.) By making this assumption, it becomes easier to write programs that work together.
On the other hand, something has to be able to read data in other formats. To
handle that problem, we'll develop a cvt_file.pl script that can read or write several
types of files. (See Recipe 10.19.) The script is based on the Perl Text::CSV_XS
module, which despite its name can be used for more than just CSV data. cvt_file.pl
can convert between many file types, making it possible for other programs that
require tab-delimited lines to be used with files not originally written in that format.
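The cvt_file.pl script itself uses Perl's Text::CSV_XS; the core conversion it performs can be approximated with Python's csv module. This sketch handles quoted values and embedded commas, but unlike the real script it does not escape tabs or newlines that occur inside values:

```python
import csv
from io import StringIO

def csv_to_tab(csv_text):
    """Convert CSV input to tab-delimited, linefeed-terminated lines."""
    out = StringIO()
    for row in csv.reader(StringIO(csv_text)):
        out.write("\t".join(row) + "\n")
    return out.getvalue()
```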
10.1.3 Notes on Invoking Shell Commands
This chapter shows a number of programs that you invoke from the command line using a
shell like bash or tcsh under Unix or CMD.EXE ("the DOS prompt") under Windows. Many of
the example commands for these programs use quotes around option values, and sometimes
an option value is itself a quote character. Quoting conventions vary from one shell to
another, but rules that seem to work with most of them (including CMD.EXE under Windows)
are as follows:
For an argument that contains spaces, surround it with double quotes to prevent the
shell from interpreting it as multiple separate arguments. The shell will strip off the
quotes, then pass the argument to the command intact.
To include a double quote character in the argument itself, precede it with a backslash.
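For instance, to pass a single argument that contains spaces and an embedded double quote (printf stands in here for any program you might invoke):

```shell
# the shell strips the outer quotes; the backslash protects the inner one
printf '%s\n' "a \"quoted\" value"
```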
lines, with a backslash character as the line-continuation character:
% prog_name \
    argument1 argument2 ...
That works for Unix, but not for Windows, where you'll need to omit the continuation
characters and type the entire command on one line:
C:\> prog_name argument1 argument2 ...
10.2 Importing Data with LOAD DATA and mysqlimport
You want to load a datafile into a table using MySQL's built in import capabilities.
Use the LOAD DATA statement or the mysqlimport command-line program.
MySQL provides a LOAD DATA statement that acts as a bulk data loader. Here's an example
statement that reads a file mytbl.txt from your current directory and loads it into the table
mytbl in the current database:
mysql> LOAD DATA LOCAL INFILE 'mytbl.txt' INTO TABLE mytbl;
MySQL also includes a utility program named mysqlimport that acts as a wrapper around LOAD
DATA so that you can load input files directly from the command line. The mysqlimport
command that is equivalent to the preceding LOAD DATA statement looks like this, assuming
that mytbl is in the cookbook database:
% mysqlimport --local cookbook mytbl.txt
For mysqlimport, as with other MySQL programs, you may need to specify
connection parameter options such as --user or --host. If so, they should
precede the database name argument.
The following list describes LOAD DATA's general characteristics and capabilities; mysqlimport
shares most of these behaviors. There are some differences that we'll note as we go along,
but for the most part you can read "LOAD DATA" as "LOAD DATA or mysqlimport." LOAD DATA
provides options to address many of the import issues mentioned in the chapter introduction,
such as the line-ending sequence for recognizing how to break input into records, the column
value delimiter that allows records to be broken into separate values, the quoting character
that may surround column values, quoting and escaping issues within values, and NULL value handling:
By default, LOAD DATA expects the datafile to contain the same number of columns as
the table into which you're loading data, and the datafile columns must be present in
the same order as in the table. If the file doesn't contain a value for every column or
the values aren't in the proper order, you can specify which columns are present and
the order in which they appear. If the datafile contains fewer columns than the table,
MySQL assigns default values to columns for which no values are present in the file.
LOAD DATA assumes that data values are separated by tab characters and that lines
end with linefeeds (newlines). You can specify the data format explicitly if a file
doesn't conform to these conventions.
You can indicate that data values may have quotes around them that should be
stripped, and you can specify what the quote character is.
Several special escape sequences are recognized and converted during input
processing. The default escape character is backslash (\), but you can change it if you
like. The \N sequence is taken to represent a NULL value. The \b, \n, \r, \t, \\, and
\0 sequences are interpreted as backspace, linefeed, carriage return, tab, backslash,
and ASCII NUL characters. (NUL is a zero-valued byte, which is different from the SQL NULL value.)
LOAD DATA provides diagnostic information, but it's a summary that doesn't give you
specific information about which input lines may have caused problems. There is work
in progress for MySQL 4 on providing improved feedback. In the meantime, see Recipe
10.38, which describes a LOAD DATA diagnostic utility.
The next few sections describe how to import datafiles into MySQL tables using LOAD DATA or
mysqlimport. They assume your files contain legal data values that are acceptable to MySQL.
Why make this assumption? Because although LOAD DATA has several options that control
how it reads the datafile, they're concerned only with the structure of the file. LOAD DATA
won't validate or reformat data values for you. It's necessary to perform such operations
either by preprocessing the datafile before loading it, or by issuing SQL statements after
loading it. If you need to check or reformat an input file first to make sure it's legal, several
sections later in the chapter show how to do that.
10.3 Specifying the Datafile Location
You're not sure how to tell LOAD DATA where to look for your datafile, particularly if it's
located in another directory.
It's a matter of knowing the rules that determine where MySQL looks for the file.
When you issue a LOAD DATA statement, the MySQL server normally assumes the datafile is
located on the server host. However, you may not be able to load data that way:
If you access the MySQL server from a remote client host and have no means of
transferring your file to the server host (such as a login account there), you won't be
able to put the file on the server.
Even if you have a login account on the server host, your MySQL account must be
enabled with the FILE privilege, and the file to be loaded must be either world
readable or located in the data directory for the current database. Most MySQL users
do not have the FILE privilege (because it allows you to do dangerous things), and
you may not want to make the file world readable (for security reasons) or be able to
put it in the database directory.
Fortunately, if you have MySQL 3.22.15 or later, you can load local files that are located on
the client host by using LOAD DATA LOCAL rather than LOAD DATA. The only permission you
need to import a local file is the ability to read the file yourself.
As of MySQL 3.23.49, use of the LOCAL keyword may be disabled by
default. You may be able to turn it on using the --local-infile option for mysql.
If that doesn't work, your server has been configured not to allow LOAD
DATA LOCAL at all.
If the LOCAL keyword is not present, MySQL looks for the datafile on the server host using the following rules:
An absolute pathname fully specifies the location of the file, beginning from the root of
the filesystem. MySQL reads the file from the given location.
A relative pathname is interpreted two ways, depending on whether it has a single
component or multiple components. For a single-component filename like mytbl.txt,
MySQL looks for the file in the database directory for the current database. For a
multiple-component filename like xyz/mytbl.txt, MySQL looks for the file beginning in
the MySQL data directory. (It expects to find mytbl.txt in a directory named xyz.)
Database directories are located directly under the data directory, so these two statements are
equivalent if the current database is cookbook:
mysql> LOAD DATA INFILE 'mytbl.txt' INTO TABLE mytbl;
mysql> LOAD DATA INFILE 'cookbook/mytbl.txt' INTO TABLE mytbl;
If the LOCAL keyword is specified, MySQL looks for the file on the client host, and interprets
the pathname the same way your command interpreter does:
An absolute pathname fully specifies the location of the file, beginning from the root of the filesystem.
A relative pathname is interpreted relative to your current directory.
If your file is located on the client host, but you forget to indicate that it's local, you'll get an error:
mysql> LOAD DATA INFILE 'mytbl.txt' INTO TABLE mytbl;
ERROR 1045: Access denied for user: 'cbuser@localhost' (Using password: YES)
That Access denied message can be confusing, given that if you're able to connect to the
server and issue the LOAD DATA statement, it would seem that you've already gained access
to MySQL. What the error message means is that the MySQL server tried to open mytbl.txt on the
server host and could not access it.
mysqlimport uses the same rules for finding files as LOAD DATA. By default, it assumes the
datafile is located on the server host. To use a local file, specify the --local (or -L) option on
the command line.
LOAD DATA assumes the table is located in the current database unless you specify the
database name explicitly. mysqlimport always requires a database argument:
% mysqlimport --local cookbook mytbl.txt
If you want to use LOAD DATA to load a file into a database other than the current one, you
can qualify the table name with the database name. The following statement does this,
indicating that the mytbl table is located in the other_db database:
mysql> LOAD DATA LOCAL INFILE 'mytbl.txt' INTO TABLE other_db.mytbl;
LOAD DATA assumes no relationship between the name of the datafile and the name of the
table into which you're loading the file's contents. mysqlimport assumes a fixed relationship
between the datafile name and the table name. Specifically, it uses the last component of the
filename to determine the table name. For example, mysqlimport would interpret mytbl.txt,
mytbl.dat, /tmp/mytbl.txt, /u/paul/data/mytbl.csv, and D:\projects\mytbl.txt all as files
containing data for the mytbl table.
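That mapping amounts to taking the last pathname component and dropping any extension. A Python sketch (the backslash handling for Windows paths is a simplifying assumption):

```python
import os

def table_for_file(path):
    """Derive the table name the way mysqlimport does: last pathname
    component, minus any extension."""
    base = os.path.basename(path.replace("\\", "/"))  # accept Windows separators
    return base.split(".")[0]
```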
Naming Datafiles Under Windows
Windows systems use \ as the pathname separator in filenames. That's a bit of a
problem, because MySQL interprets backslash as the escape character in string
values. To specify a Windows pathname, either use doubled backslashes, or use
forward slashes instead. These two statements show two ways of referring to the
same Windows file:
mysql> LOAD DATA LOCAL INFILE 'D:\\projects\\mydata.txt' INTO TABLE mytbl;
mysql> LOAD DATA LOCAL INFILE 'D:/projects/mydata.txt' INTO TABLE mytbl;
10.4 Specifying the Datafile Format
You have a datafile that's not in LOAD DATA's default format.
Use FIELDS and LINES clauses to tell LOAD DATA how to interpret the file.
By default, LOAD DATA assumes that datafiles contain lines that are terminated by linefeeds
(newlines) and that data values within a line are separated by tabs. The following statement
does not specify anything about the format of the datafile, so MySQL assumes the default format:
mysql> LOAD DATA LOCAL INFILE 'mytbl.txt' INTO TABLE mytbl;
To specify a file format explicitly, use a FIELDS clause to describe the characteristics of fields
within a line, and a LINES clause to specify the line-ending sequence. The following LOAD
DATA statement specifies that the datafile contains values separated by colons and lines
terminated by carriage returns:
mysql> LOAD DATA LOCAL INFILE 'mytbl.txt' INTO TABLE mytbl
-> FIELDS TERMINATED BY ':'
-> LINES TERMINATED BY '\r';
Each clause follows the table name. If both are present, the FIELDS clause must precede the
LINES clause. The line and field termination indicators can contain multiple characters. For
example, \r\n indicates that lines are terminated by carriage return/linefeed pairs.
If you use mysqlimport, command-line options provide the format specifiers. mysqlimport
commands that correspond to the preceding two LOAD DATA statements look like this:
% mysqlimport --local cookbook mytbl.txt
% mysqlimport --local --fields-terminated-by=":" --lines-terminated-by="\r" \
    cookbook mytbl.txt
The order in which you specify the options doesn't matter for mysqlimport, except that they
should all precede the database name.
Specifying Binary Format Option Characters
As of MySQL 3.22.10, you can use hex notation to specify arbitrary format
characters for FIELDS and LINES clauses. Suppose a datafile has lines with Ctrl-A
between fields and Ctrl-B at the end of lines. The ASCII values for Ctrl-A and Ctrl-B
are 1 and 2, so you represent them as 0x01 and 0x02:
FIELDS TERMINATED BY 0x01 LINES TERMINATED BY 0x02
mysqlimport understands hex constants for format specifiers as of MySQL 3.23.30.
You may find this capability helpful if you don't like remembering how to type escape
sequences on the command line or when it's necessary to use quotes around them.
Tab is 0x09, linefeed is 0x0a, and carriage return is 0x0d. Here's an example that
indicates that the datafile contains tab-delimited lines terminated by CRLF pairs:
% mysqlimport --local --lines-terminated-by=0x0d0a \
--fields-terminated-by=0x09 cookbook mytbl.txt
10.5 Dealing with Quotes and Special Characters
Your datafile contains quoted values or escaped characters.
Tell LOAD DATA to be aware of them so that it doesn't load quote or escape characters into the database as part of your data values.
The FIELDS clause can specify other format options besides TERMINATED BY. By default,
LOAD DATA assumes that values are unquoted, and interprets the backslash (\) as an escape
character for special characters. To indicate the value quoting character explicitly, use
ENCLOSED BY; MySQL will strip that character from the ends of data values during input
processing. To change the default escape character, use ESCAPED BY.
The three subclauses of the FIELDS clause (ENCLOSED BY, ESCAPED BY, and TERMINATED
BY) may be present in any order if you specify more than one of them. For example, these
FIELDS clauses are equivalent: