Chapter 10. Importing and Exporting Data


Validation by Pattern Matching

Using Patterns to Match Broad Content Types

Using Patterns to Match Numeric Values

Using Patterns to Match Dates or Times

Using Patterns to Match Email Addresses and URLs

Validation Using Table Metadata

Validation Using a Lookup Table

Converting Two-Digit Year Values to Four-Digit Form

Performing Validity Checking on Date or Time Subparts

Writing Date-Processing Utilities

Using Dates with Missing Components

Performing Date Conversion Using SQL

Using Temporary Tables for Data Transformation

Dealing with NULL Values

Guessing Table Structure from a Datafile

A LOAD DATA Diagnostic Utility

Exchanging Data Between MySQL and Microsoft Access

Exchanging Data Between MySQL and Microsoft Excel

Exchanging Data Between MySQL and FileMaker Pro

Exporting Query Results as XML

Importing XML into MySQL

Epilog



10.1 Introduction

Suppose you have a file named somedata.csv that contains 12 columns of data in comma-separated values (CSV) format. From this file you want to extract only columns 2, 11, 5, and 9, and use them to create database records in a MySQL table that contains name, birth, height, and weight columns. You need to make sure that the height and weight are positive integers, and convert the birth dates from MM/DD/YY format to CCYY-MM-DD format. How can you do this?
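As a preview of the kind of preprocessing this chapter covers, here is a sketch of that job. The chapter's utilities are written in Perl; this illustration uses Python, and the column positions and validation rules are taken from the problem statement above (the sample values are made up):

```python
from datetime import datetime

def convert_row(row):
    """Pick columns 2, 11, 5, 9 (1-based) as name, birth, height, weight;
    return None if the row fails validation."""
    name, birth, height, weight = row[1], row[10], row[4], row[8]
    # height and weight must be positive integers
    if not (height.isdigit() and int(height) > 0):
        return None
    if not (weight.isdigit() and int(weight) > 0):
        return None
    try:
        # MM/DD/YY -> CCYY-MM-DD; %y guesses the century (69-99 -> 19xx)
        birth = datetime.strptime(birth, "%m/%d/%y").strftime("%Y-%m-%d")
    except ValueError:
        return None
    return [name, birth, height, weight]

row = ["x"] * 12
row[1], row[10], row[4], row[8] = "Fred", "07/04/73", "70", "160"
print(convert_row(row))  # ['Fred', '1973-07-04', '70', '160']
```

A real version would read the CSV file line by line and write the surviving rows in a form LOAD DATA can consume; the two-digit-year guess is exactly the kind of issue Recipe 10.22's material addresses.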

In one sense, that problem is very specialized. But in another, it's not at all atypical, because data transfer problems with specific requirements occur frequently when transferring data into MySQL. It would be nice if datafiles were always nicely formatted and ready to load into MySQL with no preparation, but that is frequently not so. As a result, it's often necessary to preprocess information to put it into a format that MySQL finds acceptable. The reverse also is true; data exported from MySQL may need massaging to be useful for other programs.

Some data transfer operations are so difficult that they require a great deal of hand checking and reformatting, but in most cases, you can do at least part of the job automatically. Virtually all transfer problems involve at least some elements of a common set of conversion issues. This chapter discusses what these issues are, how to deal with them by taking advantage of the existing tools at your disposal, and how to write your own tools when necessary. The idea is not to cover all possible import and export situations (an impossible task), but to show some representative techniques and utilities. You can use them as is, or adapt them for problems that they don't handle. (There are also commercial conversion tools that may assist you, but my purpose here is to help you do things yourself.)

The first recipes in the chapter cover MySQL's native facilities for importing data (the LOAD DATA statement and the mysqlimport command-line program), and for exporting data (the SELECT ... INTO OUTFILE statement and the mysqldump program). For operations that don't require any data validation or reformatting, these facilities often are sufficient.

For situations where MySQL's native import and export capabilities do not suffice, the chapter moves on to cover techniques for using external supporting utilities and for writing your own. To some extent, you can avoid writing your own tools by using existing programs. For example, cut can extract columns from a file, and sed and tr can be used as postprocessors to convert query output into other formats. But you'll probably eventually reach the point where you decide to write your own programs. When you do, there are two broad sets of issues to consider:







- How to manipulate the structure of datafiles. When a file is in a format that isn't suitable for import, you'll need to convert it to a different format. This may involve issues such as changing the column delimiters or line-ending sequences, or removing or rearranging columns in the file.

- How to manipulate the content of datafiles. If you don't know whether the values contained in a file are legal, you may want to preprocess it to check or reformat them. Numeric values may need to be verified as lying within a specific range, dates may need to be converted to or from ISO format, and so forth.
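The structural side of that list can be surprisingly small in code. Here is a minimal sketch (Python for illustration; the chapter's own tools are Perl) that changes the delimiter and line ending of one record while keeping only selected columns — the column indexes are arbitrary examples:

```python
def restructure(line, keep=(1, 10, 4, 8)):
    """Convert one unquoted CSV record (possibly CRLF-terminated) into a
    tab-delimited, linefeed-terminated line containing only chosen columns."""
    fields = line.rstrip("\r\n").split(",")
    return "\t".join(fields[i] for i in keep) + "\n"

print(restructure("a,b,c,d,e,f,g,h,i,j,k,l\r\n"))  # b, k, e, i tab-delimited
```

Note that a plain split on commas only works for unquoted CSV; quoted values need a real CSV parser, as discussed later in this introduction.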

Source code for the program fragments and scripts discussed in this chapter is located in the transfer directory of the recipes distribution, with the exception that some of the utility functions are contained in library files located in the lib directory. The code for some of the shorter utilities is shown in full. For the longer ones, the chapter generally discusses only how they work and how to use them, but you have access to the source if you wish to investigate in more detail how they're written.

The problems addressed in this chapter involve a lot of text processing and pattern matching. These are particular strengths of Perl, so the program fragments and utilities shown here are written mainly in Perl. PHP and Python provide pattern-matching capabilities, too, so they can of course do many of the same things. If you want to adapt the techniques described here for Java, you'll need to get a library that provides classes for regular expression-based pattern matching. See Appendix A for suggestions.



10.1.1 General Import and Export Issues

Incompatible datafile formats and differing rules for interpreting various kinds of values lead to many headaches when transferring data between programs. Nevertheless, certain issues recur frequently. By being aware of them, you'll be able to identify more easily just what you need to do to solve particular import or export problems.

In its most basic form, an input stream is just a set of bytes with no particular meaning. Successful import into MySQL requires being able to recognize which bytes represent structural information, and which represent the data values framed by that structure. Because such recognition is key to decomposing the input into appropriate units, the most fundamental import issues are these:







- What is the record separator? Knowing this allows the input stream to be partitioned into records.

- What is the field delimiter? Knowing this allows each record to be partitioned into field values. Recovering the original data values also may include stripping off quotes from around the values or recognizing escape sequences within them.
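Those two decisions map directly onto code. A minimal sketch (Python, assuming a simple format where the quote and delimiter characters never appear inside values):

```python
def parse_stream(data, record_sep="\n", field_sep="\t", quote=None):
    """Split raw input into records, then each record into fields,
    optionally stripping a surrounding quote character."""
    records = []
    for rec in data.split(record_sep):
        if rec == "":
            continue  # a trailing separator yields an empty final piece
        fields = rec.split(field_sep)
        if quote:
            fields = [f[1:-1] if len(f) >= 2 and f[0] == f[-1] == quote else f
                      for f in fields]
        records.append(fields)
    return records

print(parse_stream('a\t"b"\nc\td\n', quote='"'))  # [['a', 'b'], ['c', 'd']]
```

The naive splits here break down as soon as a delimiter or quote can occur inside a value — which is exactly why CSV is trickier than tab-delimited data, as the next sections discuss.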



The ability to break apart the input into records and fields is important for extracting the data values from it. However, the values still might not be in a form that can be used directly, and you may need to consider other issues:

- Do the order and number of columns match the structure of the database table? Mismatches require columns to be rearranged or skipped.

- Do data values need to be validated or reformatted? If the values are in a format that matches MySQL's expectations, no further processing is necessary. Otherwise, they need to be checked and possibly rewritten.

- How should NULL or empty values be handled? Are they allowed? Can NULL values even be detected? (Some systems export NULL values as empty strings, making it impossible to distinguish one from the other.)



For export from MySQL, the issues are somewhat the reverse. You probably can assume that values stored in the database are valid, but they may require reformatting, and it's necessary to add column and record delimiters to form an output stream that has a structure another program can recognize.

The chapter deals with these issues primarily within the context of performing bulk transfers of entire files, but many of the techniques discussed here can be applied in other situations as well. Consider a web-based application that presents a form for a user to fill in, then processes its contents to create a new record in the database. That is a data import situation. Web APIs generally make form contents available as a set of already-parsed discrete values, so the application may not need to deal with record and column delimiters. On the other hand, validation issues remain paramount. You really have no idea what kind of values a user is sending your script, so it's important to check them.



10.1.2 File Formats

Datafiles come in many formats, two of which are used frequently in this chapter:

- Tab-delimited format. This is one of the simplest file structures; lines contain values separated by tab characters. A short tab-delimited file might look like this, where the whitespace between column values represents single tab characters:

    a      b    c
    a,b,c  d e  f

- Comma-separated values (CSV) format. Files written in CSV format vary somewhat, because there is apparently no actual standard describing the format. However, the general idea is that lines consist of values separated by commas, and values containing internal commas are surrounded by quotes to prevent the commas from being interpreted as value delimiters. It's also common for values containing spaces to be quoted as well. Here is an example, where each line contains three values:

    a,b,c
    "a,b,c","d e",f

It's trickier to process CSV files than tab-delimited files, because characters like quotes and commas have a dual meaning: they may represent file structure or be part of data values.
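That dual meaning is why a naive split on commas mangles the second sample line. A CSV-aware parser (Python's csv module here, standing in for the Perl Text::CSV_XS module the chapter's utilities use) resolves it correctly:

```python
import csv
import io

# The second line's first value contains commas and is therefore quoted;
# a CSV parser strips the quotes and keeps the embedded commas as data.
data = 'a,b,c\n"a,b,c","d e",f\n'
rows = list(csv.reader(io.StringIO(data)))
print(rows)  # [['a', 'b', 'c'], ['a,b,c', 'd e', 'f']]
```

Each line still yields exactly three values; the quotes were structure, and the commas inside them were data.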



Another important datafile characteristic is the line-ending sequence. The most common sequences are carriage returns, linefeeds, and carriage return/linefeed pairs, sometimes referred to here by the abbreviations CR, LF, and CRLF.



Datafiles often begin with a row of column labels. In fact, a CSV file that begins with a row of names is what FileMaker Pro refers to as merge format. For some import operations, the row of labels is an annoyance because you must discard it to avoid having the labels be loaded into your table as a data record. But in other cases, the labels are quite useful:

- For import into existing tables, the labels can be used to match up datafile columns with the table columns if they are not necessarily in the same order.

- The labels can be used for column names when creating a new table automatically or semi-automatically from a datafile. For example, Recipe 10.37 later in the chapter discusses a utility that examines a datafile and guesses the CREATE TABLE statement that should be used to create a table from the file. If a label row is present, the utility uses the labels for column names. Otherwise, it's necessary to make up generic names like c1, c2, and so forth, which isn't very descriptive.
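The first use of a label row — matching up columns regardless of order — can be sketched in a few lines (Python for illustration; the label and column names are invented for the example):

```python
def column_map(labels, table_columns):
    """Return datafile column indexes for table_columns, using the label
    row to match them up regardless of datafile column order."""
    pos = {label: i for i, label in enumerate(labels)}
    return [pos[c] for c in table_columns]

# Datafile happens to store the columns in a different order than the table.
labels = ["weight", "name", "birth", "height"]
order = column_map(labels, ["name", "birth", "height", "weight"])
print(order)  # [1, 2, 3, 0]

row = ["160", "Fred", "1973-07-04", "70"]
print([row[i] for i in order])  # ['Fred', '1973-07-04', '70', '160']
```

A production version would also report labels that don't match any table column, rather than raising a KeyError.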



Tab-Delimited, Linefeed-Terminated Format

Although datafiles may be written in many formats, it's unlikely that you'll want to include machinery for reading several different formats within each file-processing utility that you write. I don't want to, either, so for that reason, many of the utilities described in this chapter assume for simplicity that their input is in tab-delimited, linefeed-terminated format. (This is also the default format for MySQL's LOAD DATA statement.) By making this assumption, it becomes easier to write programs that read files.

On the other hand, something has to be able to read data in other formats. To handle that problem, we'll develop a cvt_file.pl script that can read or write several types of files. (See Recipe 10.19.) The script is based on the Perl Text::CSV_XS module, which despite its name can be used for more than just CSV data. cvt_file.pl can convert between many file types, making it possible for other programs that require tab-delimited lines to be used with files not originally written in that format.



10.1.3 Notes on Invoking Shell Commands

This chapter shows a number of programs that you invoke from the command line using a

shell like bash or tcsh under Unix or CMD.EXE ("the DOS prompt") under Windows. Many of

the example commands for these programs use quotes around option values, and sometimes

an option value is itself a quote character. Quoting conventions vary from one shell to

another, but rules that seem to work with most of them (including CMD.EXE under Windows)

are as follows:







- For an argument that contains spaces, surround it with double quotes to prevent the shell from interpreting it as multiple separate arguments. The shell will strip off the quotes, then pass the argument to the command intact.

- To include a double quote character in the argument itself, precede it with a backslash.



Some shell commands are so long that they're shown as you would enter them using several lines, with a backslash character as the line-continuation character:

% prog_name \
    argument1 \
    argument2 ...

That works for Unix, but not for Windows, where you'll need to omit the continuation characters and type the entire command on one line:

C:\> prog_name argument1 argument2 ...



10.2 Importing Data with LOAD DATA and mysqlimport

10.2.1 Problem

You want to load a datafile into a table using MySQL's built-in import capabilities.



10.2.2 Solution

Use the LOAD DATA statement or the mysqlimport command-line program.



10.2.3 Discussion

MySQL provides a LOAD DATA statement that acts as a bulk data loader. Here's an example statement that reads a file mytbl.txt from your current directory and loads it into the table mytbl in the current database:

mysql> LOAD DATA LOCAL INFILE 'mytbl.txt' INTO TABLE mytbl;

MySQL also includes a utility program named mysqlimport that acts as a wrapper around LOAD DATA so that you can load input files directly from the command line. The mysqlimport command that is equivalent to the preceding LOAD DATA statement looks like this, assuming that mytbl is in the cookbook database:[1]

[1] For mysqlimport, as with other MySQL programs, you may need to specify connection parameter options such as --user or --host. If so, they should precede the database name argument.

% mysqlimport --local cookbook mytbl.txt

The following list describes LOAD DATA's general characteristics and capabilities; mysqlimport shares most of these behaviors. There are some differences that we'll note as we go along, but for the most part you can read "LOAD DATA" as "LOAD DATA or mysqlimport." LOAD DATA provides options to address many of the import issues mentioned in the chapter introduction, such as the line-ending sequence for recognizing how to break input into records, the column value delimiter that allows records to be broken into separate values, the quoting character that may surround column values, quoting and escaping issues within values, and NULL value representation:







- By default, LOAD DATA expects the datafile to contain the same number of columns as the table into which you're loading data, and the datafile columns must be present in the same order as in the table. If the file doesn't contain a value for every column or the values aren't in the proper order, you can specify which columns are present and the order in which they appear. If the datafile contains fewer columns than the table, MySQL assigns default values to columns for which no values are present in the datafile.

- LOAD DATA assumes that data values are separated by tab characters and that lines end with linefeeds (newlines). You can specify the data format explicitly if a file doesn't conform to these conventions.

- You can indicate that data values may have quotes around them that should be stripped, and you can specify what the quote character is.

- Several special escape sequences are recognized and converted during input processing. The default escape character is backslash (\), but you can change it if you like. The \N sequence is taken to represent a NULL value. The \b, \n, \r, \t, \\, and \0 sequences are interpreted as backspace, linefeed, carriage return, tab, backslash, and ASCII NUL characters. (NUL is a zero-valued byte, which is different than the SQL NULL value.)

- LOAD DATA provides diagnostic information, but it's a summary that doesn't give you specific information about which input lines may have caused problems. There is work in progress for MySQL 4 on providing improved feedback. In the meantime, see Recipe 10.38, which describes a LOAD DATA diagnostic utility.
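When you generate a datafile for LOAD DATA yourself, those escape conventions cut the other way: values containing tabs, newlines, or backslashes must be escaped, and NULLs written as \N. A sketch of an encoder for the default format (Python for illustration; only the most common escapes are handled):

```python
def encode_field(value):
    """Encode one value for LOAD DATA's default tab-delimited format:
    \\N for NULL (None), and backslash-escape the characters that would
    otherwise be taken as file structure."""
    if value is None:
        return r"\N"
    return (value.replace("\\", "\\\\")   # escape backslash first
                 .replace("\t", "\\t")    # tab would split the field
                 .replace("\n", "\\n"))   # newline would end the record

row = ["a\tb", None, "line1\nline2"]
print("\t".join(encode_field(v) for v in row))
```

Joining the encoded fields with real tabs and terminating the line with a real linefeed yields a record LOAD DATA can read back unambiguously.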



The next few sections describe how to import datafiles into MySQL tables using LOAD DATA or mysqlimport. They assume your files contain legal data values that are acceptable to MySQL. Why make this assumption? Because although LOAD DATA has several options that control how it reads the datafile, they're concerned only with the structure of the file. LOAD DATA won't validate or reformat data values for you. It's necessary to perform such operations either by preprocessing the datafile before loading it, or by issuing SQL statements after loading it. If you need to check or reformat an input file first to make sure it's legal, several sections later in the chapter show how to do that.



10.3 Specifying the Datafile Location

10.3.1 Problem

You're not sure how to tell LOAD DATA where to look for your datafile, particularly if it's located in another directory.



10.3.2 Solution

It's a matter of knowing the rules that determine where MySQL looks for the file.



10.3.3 Discussion

When you issue a LOAD DATA statement, the MySQL server normally assumes the datafile is located on the server host. However, you may not be able to load data that way:

- If you access the MySQL server from a remote client host and have no means of transferring your file to the server host (such as a login account there), you won't be able to put the file on the server.

- Even if you have a login account on the server host, your MySQL account must be enabled with the FILE privilege, and the file to be loaded must be either world readable or located in the data directory for the current database. Most MySQL users do not have the FILE privilege (because it allows you to do dangerous things), and you may not want to make the file world readable (for security reasons) or be able to put it in the database directory.



Fortunately, if you have MySQL 3.22.15 or later, you can load local files that are located on the client host by using LOAD DATA LOCAL rather than LOAD DATA. The only permission you need to import a local file is the ability to read the file yourself.[2]

[2] As of MySQL 3.23.49, use of the LOCAL keyword may be disabled by default. You may be able to turn it on using the --local-infile option for mysql. If that doesn't work, your server has been configured not to allow LOAD DATA LOCAL at all.

If the LOCAL keyword is not present, MySQL looks for the datafile on the server host using the following rules:

- An absolute pathname fully specifies the location of the file, beginning from the root of the filesystem. MySQL reads the file from the given location.

- A relative pathname is interpreted two ways, depending on whether it has a single component or multiple components. For a single-component filename like mytbl.txt, MySQL looks for the file in the database directory for the current database. For a multiple-component filename like xyz/mytbl.txt, MySQL looks for the file beginning in the MySQL data directory. (It expects to find mytbl.txt in a directory named xyz.)



Database directories are located directly under the data directory, so these two statements are equivalent if the current database is cookbook:

mysql> LOAD DATA INFILE 'mytbl.txt' INTO TABLE mytbl;
mysql> LOAD DATA INFILE 'cookbook/mytbl.txt' INTO TABLE mytbl;

If the LOCAL keyword is specified, MySQL looks for the file on the client host, and interprets the pathname the same way your command interpreter does:

- An absolute pathname fully specifies the location of the file, beginning from the root of the filesystem.

- A relative pathname is interpreted relative to your current directory.



If your file is located on the client host, but you forget to indicate that it's local, you'll get an error:

mysql> LOAD DATA INFILE 'mytbl.txt' INTO TABLE mytbl;
ERROR 1045: Access denied for user: 'cbuser@localhost' (Using password: YES)

That Access denied message can be confusing, given that if you're able to connect to the server and issue the LOAD DATA statement, it would seem that you've already gained access to MySQL. What the error message means is that the MySQL server tried to open mytbl.txt on the server host and could not access it.

mysqlimport uses the same rules for finding files as LOAD DATA. By default, it assumes the datafile is located on the server host. To use a local file, specify the --local (or -L) option on the command line.

LOAD DATA assumes the table is located in the current database unless you specify the database name explicitly. mysqlimport always requires a database argument:

% mysqlimport --local cookbook mytbl.txt

If you want to use LOAD DATA to load a file into a database other than the current one, you can qualify the table name with the database name. The following statement does this, indicating that the mytbl table is located in the other_db database:

mysql> LOAD DATA LOCAL INFILE 'mytbl.txt' INTO TABLE other_db.mytbl;



LOAD DATA assumes no relationship between the name of the datafile and the name of the table into which you're loading the file's contents. mysqlimport assumes a fixed relationship between the datafile name and the table name. Specifically, it uses the last component of the filename to determine the table name. For example, mysqlimport would interpret mytbl.txt, mytbl.dat, /tmp/mytbl.txt, /u/paul/data/mytbl.csv, and D:\projects\mytbl.txt all as files containing data for the mytbl table.
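That naming rule is easy to model — a sketch of the idea (Python for illustration, not mysqlimport's actual code):

```python
import os

def table_for(datafile):
    """mysqlimport derives the table name from the datafile name: take the
    last pathname component and strip any extension."""
    # Normalize Windows separators so basename() works the same everywhere.
    base = os.path.basename(datafile.replace("\\", "/"))
    return base.split(".")[0]

for f in ["mytbl.txt", "/tmp/mytbl.txt", "/u/paul/data/mytbl.csv",
          r"D:\projects\mytbl.txt"]:
    print(table_for(f))  # prints "mytbl" for each
```

A consequence worth remembering: to load one file into several tables, or a file whose name doesn't match any table, you must either rename (or symlink) the file or fall back to LOAD DATA.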



Naming Datafiles Under Windows

Windows systems use \ as the pathname separator in filenames. That's a bit of a problem, because MySQL interprets backslash as the escape character in string values. To specify a Windows pathname, either use doubled backslashes, or use forward slashes instead. These two statements show two ways of referring to the same Windows file:

mysql> LOAD DATA LOCAL INFILE 'D:\\projects\\mydata.txt' INTO TABLE mytbl;
mysql> LOAD DATA LOCAL INFILE 'D:/projects/mydata.txt' INTO TABLE mytbl;



10.4 Specifying the Datafile Format

10.4.1 Problem

You have a datafile that's not in LOAD DATA's default format.



10.4.2 Solution

Use FIELDS and LINES clauses to tell LOAD DATA how to interpret the file.



10.4.3 Discussion

By default, LOAD DATA assumes that datafiles contain lines that are terminated by linefeeds (newlines) and that data values within a line are separated by tabs. The following statement does not specify anything about the format of the datafile, so MySQL assumes the default format:

mysql> LOAD DATA LOCAL INFILE 'mytbl.txt' INTO TABLE mytbl;

To specify a file format explicitly, use a FIELDS clause to describe the characteristics of fields within a line, and a LINES clause to specify the line-ending sequence. The following LOAD DATA statement specifies that the datafile contains values separated by colons and lines terminated by carriage returns:

mysql> LOAD DATA LOCAL INFILE 'mytbl.txt' INTO TABLE mytbl
    -> FIELDS TERMINATED BY ':'
    -> LINES TERMINATED BY '\r';

Each clause follows the table name. If both are present, the FIELDS clause must precede the LINES clause. The line and field termination indicators can contain multiple characters. For example, \r\n indicates that lines are terminated by carriage return/linefeed pairs.

If you use mysqlimport, command-line options provide the format specifiers. mysqlimport commands that correspond to the preceding two LOAD DATA statements look like this:

% mysqlimport --local cookbook mytbl.txt

% mysqlimport --local --fields-terminated-by=":" --lines-terminated-by="\r" \
    cookbook mytbl.txt

The order in which you specify the options doesn't matter for mysqlimport, except that they should all precede the database name.



Specifying Binary Format Option Characters

As of MySQL 3.22.10, you can use hex notation to specify arbitrary format characters for FIELDS and LINES clauses. Suppose a datafile has lines with Ctrl-A between fields and Ctrl-B at the end of lines. The ASCII values for Ctrl-A and Ctrl-B are 1 and 2, so you represent them as 0x01 and 0x02:

FIELDS TERMINATED BY 0x01 LINES TERMINATED BY 0x02

mysqlimport understands hex constants for format specifiers as of MySQL 3.23.30. You may find this capability helpful if you don't like remembering how to type escape sequences on the command line or when it's necessary to use quotes around them. Tab is 0x09, linefeed is 0x0a, and carriage return is 0x0d. Here's an example that indicates that the datafile contains tab-delimited lines terminated by CRLF pairs:

% mysqlimport --local --lines-terminated-by=0x0d0a \
    --fields-terminated-by=0x09 cookbook mytbl.txt



10.5 Dealing with Quotes and Special Characters

10.5.1 Problem

Your datafile contains quoted values or escaped characters.



10.5.2 Solution

Tell LOAD DATA to be aware of them so that it doesn't load the values into the database uninterpreted.



10.5.3 Discussion

The FIELDS clause can specify other format options besides TERMINATED BY. By default, LOAD DATA assumes that values are unquoted, and interprets the backslash (\) as an escape character for special characters. To indicate the value quoting character explicitly, use ENCLOSED BY; MySQL will strip that character from the ends of data values during input processing. To change the default escape character, use ESCAPED BY.

The three subclauses of the FIELDS clause (ENCLOSED BY, ESCAPED BY, and TERMINATED BY) may be present in any order if you specify more than one of them. For example, these FIELDS clauses are equivalent:

FIELDS TERMINATED BY ',' ENCLOSED BY '"'
FIELDS ENCLOSED BY '"' TERMINATED BY ','
