2-15. Handling Complex Flat File Formats with a Row Prefix in SSIS


Chapter 2 ■ Flat File Data Sources



1. Create the two destination tables, one for each of the two record types in the source file (C:\SQL2012DIRecipes\CH02\tblMultipleSubsets.sql):

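-- One row per invoice header record
CREATE TABLE InvoiceHeader
(
ID INT
,InvoiceNumber VARCHAR(50)
,ClientID INT
,InvoiceDate DATETIME
,TotalDiscount NUMERIC(18,2)
,DeliveryCharge NUMERIC(18,2) -- type assumed to match TotalDiscount; the source listing omits it
) ;
GO

-- One row per invoice line record
CREATE TABLE InvoiceLine
(
ID INT
,InvoiceID INT
,StockID INT
,SalePrice NUMERIC(18,2)
,Timestamp BIGINT
,DateUpdated DATETIME
,LineItem INT
) ;
GO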



2. Create a new SSIS package and add an OLE DB connection manager named CarSales_Staging to connect to the CarSales_Staging database. You may reuse an existing package-level connection if you are using SQL Server 2012.



3. Add a Flat File connection manager and configure it to connect to the source file (C:\SQL2012DIRecipes\CH02\MultipleSubsets.Txt). While editing the connection manager, and assuming that the row prefix has a fixed length, set the file format to Ragged Right. In the Columns pane, set a column marker that divides the data into two columns: the first contains the row prefix and the second contains the row data. The result should look like Figure 2-16.






Figure 2-16.  Initial separation of a complex text file into two columns

4. In the Advanced pane, give the two columns more user-friendly names (ColIdentifier and ColData in this example), and set the OutputColumnWidth of the column that will hold the data to a value large enough for the longest data row. An example is shown in Figure 2-17.






Figure 2-17.  Defining column identifiers for a complex text file

5. Confirm your modifications with OK.

6. Add a Data Flow task. Double-click to edit it.

7. Add a Flat File data source. Configure it to use the Flat File connection manager that you just created.



8. Add a Conditional Split transform to the Data Flow pane. Connect the Flat File data source to it. Double-click to edit. Create two new outputs, configured like this:

| Output Name | Condition |
| --- | --- |
| InvoiceHeader | ColIdentifier == "HDR:-" |
| InvoiceLine | ColIdentifier == "LNE:-" |






9. Add a Derived Column transformation to the Data Flow pane and connect the Conditional Split to it, selecting InvoiceHeader as the output to use. Double-click to edit. Add the following derived columns (the expression code used here and in step 11 is in C:\SQL2012DIRecipes\CH02\ExpressionCode.Txt):

| Derived Column Name | Expression | Data Type | Length |
| --- | --- | --- | --- |
| ID | (DT_I4)SUBSTRING(ColData,1,FINDSTRING(ColData,",",1) - 1) | 4-byte signed integer | |
| InvoiceNumber | SUBSTRING(ColData,FINDSTRING(ColData,",",1) + 1,FINDSTRING(ColData,",",2) - FINDSTRING(ColData,",",1) - 1) | Unicode string | 25 |
| ClientID | (DT_I4)SUBSTRING(ColData,FINDSTRING(ColData,",",2) + 1,FINDSTRING(ColData,",",3) - FINDSTRING(ColData,",",2) - 1) | 4-byte signed integer | |
| InvoiceDate | (DT_DBDATE)SUBSTRING(ColData,FINDSTRING(ColData,",",3) + 1,FINDSTRING(ColData,",",4) - FINDSTRING(ColData,",",3) - 1) | Database date | |
| TotalDiscount | (DT_DECIMAL,2)SUBSTRING(ColData,FINDSTRING(ColData,",",4) + 1,FINDSTRING(ColData,",",5) - FINDSTRING(ColData,",",4) - 1) | Decimal | |
| DeliveryCharge | (DT_DECIMAL,2)RIGHT(ColData,LEN(ColData) - FINDSTRING(ColData,",",5)) | Decimal | |

10. Add an OLE DB destination and connect the Derived Column transformation that you just created to it. Name it Invoice Header. Configure the destination to use the CarSales_Staging connection manager and to point to the InvoiceHeader destination table. Map the derived columns (that is, not the two initial ColIdentifier and ColData columns) to the destination table.



11. Repeat steps 9 and 10, this time selecting InvoiceLine as the output from the Conditional Split transformation and pointing the destination at the InvoiceLine table. The derived columns to create are:






| Derived Column Name | Expression | Data Type |
| --- | --- | --- |
| ID | LEFT(ColData,FINDSTRING(ColData,",",1) - 1) | 4-byte signed integer |
| InvoiceID | SUBSTRING(ColData,FINDSTRING(ColData,",",1) + 1,FINDSTRING(ColData,",",2) - FINDSTRING(ColData,",",1) - 1) | 4-byte signed integer |
| StockID | SUBSTRING(ColData,FINDSTRING(ColData,",",2) + 1,FINDSTRING(ColData,",",3) - FINDSTRING(ColData,",",2) - 1) | 4-byte signed integer |
| SalePrice | SUBSTRING(ColData,FINDSTRING(ColData,",",3) + 1,FINDSTRING(ColData,",",4) - FINDSTRING(ColData,",",3) - 1) | Decimal |
| Timestamp | SUBSTRING(ColData,FINDSTRING(ColData,",",4) + 1,FINDSTRING(ColData,",",5) - FINDSTRING(ColData,",",4) - 1) | 8-byte signed integer |
| DateUpdated | SUBSTRING(ColData,FINDSTRING(ColData,",",5) + 1,FINDSTRING(ColData,",",6) - FINDSTRING(ColData,",",5) - 1) | Database date |
| LineItem | RIGHT(ColData,LEN(ColData) - FINDSTRING(ColData,",",6)) | 4-byte signed integer |

The package should look something like Figure 2-18.

Figure 2-18.  SSIS data flow to handle a complex flat file

12. Run the package.



Your complex source file will be imported into the two destination tables.
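The first half of the data flow (steps 3 to 8) can be pictured outside SSIS with a minimal Python sketch. It is only an illustration of the logic, not part of the package: the sample rows are hypothetical (the real layout of MultipleSubsets.Txt may differ), and the names PREFIX_WIDTH, split_prefix, and conditional_split are invented for this example.

```python
# Hypothetical sample rows; the real MultipleSubsets.Txt content may differ.
SAMPLE_ROWS = [
    "HDR:-1,3A9271EA,1,2012-01-01,500.00,750.00",
    "LNE:-1,1,1,5000.00,1,2012-01-01,1",
    "LNE:-2,1,2,12500.00,2,2012-01-01,2",
    "XXX:-this row has an unknown prefix",
]

PREFIX_WIDTH = 5  # assumed fixed width of the row prefix, e.g. "HDR:-"


def split_prefix(row):
    """Mimic the Ragged Right layout: (ColIdentifier, ColData)."""
    return row[:PREFIX_WIDTH], row[PREFIX_WIDTH:]


def conditional_split(rows):
    """Route each row's data to an output chosen by its prefix."""
    routes = {"HDR:-": "InvoiceHeader", "LNE:-": "InvoiceLine"}
    outputs = {"InvoiceHeader": [], "InvoiceLine": []}
    for row in rows:
        col_identifier, col_data = split_prefix(row.rstrip("\r\n"))
        target = routes.get(col_identifier)
        if target is not None:  # rows with any other prefix are dropped
            outputs[target].append(col_data)
    return outputs


outputs = conditional_split(SAMPLE_ROWS)
print(outputs["InvoiceHeader"])
print(outputs["InvoiceLine"])
```

Each list then plays the role of one Conditional Split output, ready to be parsed by its own set of derived columns.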



How It Works

You may meet text files that contain two or more very different types of record. In many cases, you can regard them as a way of sending “normalized” data in a single file, as opposed to the single-table-per-file approach of a standard CSV file. I have seen files like these produced by mainframes and by ERP systems, for instance. In these cases, I am presuming that each row of these “multiformat” text files has a row prefix that allows you to identify the type of data contained in the row.

This more convoluted approach is necessary because just about all the techniques described so far in this chapter work fine with standard text files, but not with more complex formats. By “standard” I mean files that have an identical row structure from top to bottom and that can consequently be loaded painlessly into a tabular structure, because each text row contains the same number of column separators. The reality is that all too often a minor variation in the format of a text file, or a deliberate design decision, can make a text file difficult to load. At the very least, further processing steps will be required to handle the structural features of the source data file, as is the case here.
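As a rough illustration of what “standard” means here, the following Python sketch counts the column separators in each row and reports whether they all agree. The function names and the sample rows are hypothetical, and the check deliberately ignores quoted separators, so treat it as a quick diagnostic rather than a CSV validator.

```python
def separator_counts(lines, separator=","):
    """Return the distinct separator counts found across non-blank rows."""
    return {line.count(separator) for line in lines if line.strip()}


def is_standard(lines, separator=","):
    """True when every non-blank row has the same number of separators."""
    return len(separator_counts(lines, separator)) <= 1


# Hypothetical rows: a uniform file and a two-record-type file.
standard_rows = ["1,Aston Martin,DB9", "2,Jaguar,XK"]
multiformat_rows = [
    "HDR:-1,3A9271EA,1,2012-01-01,500.00,750.00",
    "LNE:-1,1,1,5000.00,1,2012-01-01,1",
]

print(is_standard(standard_rows))
print(is_standard(multiformat_rows))
```

A file that fails this check is a candidate for the prefix-based approach described in this recipe.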



■ Note  This example shows only two record types. It can easily be extended to accommodate more.

Here we are using a two-phase approach.

First: Split each row into two columns, one containing the row prefix, the other the actual data.

Second: Parse each record according to the type of data it contains.






Parsing data using a Derived Column transformation looks more complex than it really is. The last column uses the RIGHT function in SSIS to return everything after the final separator. All other columns use SUBSTRING to isolate the data and FINDSTRING to locate the separator characters. This is where SSIS has a wonderful advantage over T-SQL: FINDSTRING takes an occurrence argument, so it can return the position of the nth occurrence of the character that it is looking for. Also, in SSIS 2012 the 4000-character limit no longer applies, so you can handle much longer source records. Note that it is important to set the data types in the Derived Column transformation so that the data ingestion process runs efficiently and without error.

If you are using SQL Server 2005 or 2008, then the first column must be isolated using SUBSTRING(ColData,1,FINDSTRING(ColData,",",1) - 1), because the LEFT function is not available in these versions of SSIS. Essentially, SUBSTRING is used to imitate LEFT.
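The position arithmetic behind these expressions can be checked with a small Python sketch. The helpers findstring, substring, right, and field are hypothetical stand-ins written for this illustration; they imitate the 1-based behavior of the SSIS functions but are not SSIS code, and the sample record is invented.

```python
from datetime import date
from decimal import Decimal


def findstring(text, search, occurrence):
    """1-based position of the nth occurrence of search, or 0 if absent."""
    position = -1
    for _ in range(occurrence):
        position = text.find(search, position + 1)
        if position == -1:
            return 0
    return position + 1


def substring(text, start, length):
    """1-based substring, in the style of SUBSTRING."""
    return text[start - 1:start - 1 + length]


def right(text, length):
    """Rightmost characters, in the style of RIGHT."""
    return text[max(0, len(text) - length):]


def field(col_data, n, total):
    """Return the nth of `total` comma-separated fields (1-based)."""
    if n == 1:
        return substring(col_data, 1, findstring(col_data, ",", 1) - 1)
    if n == total:
        return right(col_data, len(col_data) - findstring(col_data, ",", n - 1))
    start = findstring(col_data, ",", n - 1) + 1
    length = findstring(col_data, ",", n) - findstring(col_data, ",", n - 1) - 1
    return substring(col_data, start, length)


col_data = "1,3A9271EA,1,2012-01-01,500.00,750.00"  # hypothetical header record
header = {
    "ID": int(field(col_data, 1, 6)),
    "InvoiceNumber": field(col_data, 2, 6),
    "ClientID": int(field(col_data, 3, 6)),
    "InvoiceDate": date.fromisoformat(field(col_data, 4, 6)),
    "TotalDiscount": Decimal(field(col_data, 5, 6)),
    "DeliveryCharge": Decimal(field(col_data, 6, 6)),
}
print(header)
```

The three branches of field correspond to the three expression shapes in the recipe: the first column, the middle columns, and the last column. The type conversions mirror the casts applied in the Derived Column transformation.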



Hints, Tips, and Traps





• If the row prefix is not a fixed length, you can set the file format to Delimited and specify as the delimiter whatever character ends the row prefix (a hyphen in this example). Then, in the Advanced pane of the Flat File connection manager, ensure that there are only two columns (this normally means deleting any other columns) and set the column delimiter for the second column to the row delimiter (most often {CR}{LF}). This is described in more detail in Recipe 2-16.

• It is nearly always preferable to remove any referential integrity constraints on the destination tables before running the import process, as you cannot be sure that a header record will be imported before its line records. You can always reapply them after the import process.

• In SSIS 2005 and 2008, this approach is limited to records with a maximum length of 4000 characters. Unfortunately, you cannot get around this limit by parsing a text stream, because SUBSTRING does not work with this data type.

• Any records in the source file that do not begin with one of the expected row prefixes are ignored.
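The variable-length prefix case from the first hint can be sketched in Python as follows. This is an illustration under assumptions, not SSIS behavior reproduced exactly: the function name and sample rows are hypothetical, and the point is simply that only the first hyphen acts as a column delimiter, so hyphens inside the data (in dates, for example) are left alone.

```python
def split_on_prefix_delimiter(row, delimiter="-"):
    """Split a row at the first delimiter into (ColIdentifier, ColData).

    Returns None when the row contains no delimiter at all.
    """
    col_identifier, found, col_data = row.partition(delimiter)
    if not found:
        return None
    return col_identifier, col_data


# Hypothetical rows with prefixes of different lengths.
print(split_on_prefix_delimiter("HDR:-1,3A9271EA,1,2012-01-01,500.00,750.00"))
print(split_on_prefix_delimiter("HEADER:-1,3A9271EA,1,2012-01-01,500.00,750.00"))
print(split_on_prefix_delimiter("a row without any prefix"))
```

Note that the delimiter itself is consumed, so in this sketch the identifier to test for is "HDR:" rather than "HDR:-".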



2-16. Pre-Parsing and Staging File Subsets in SSIS

Problem

You want to load a complex text file consisting of two or more different record types. You want to separate the constituent elements first so that you can load the constituent tables in a specific order.



Solution

Use SSIS to parse the source file and stage it on disk as subfiles as part of the process. Then load the two tables while respecting referential integrity constraints.

This recipe handles the problem of a flat “mainframe-style” text file that contains multiple record types, but in a different way from the method used in Recipe 2-15. It uses the same source file, however, so I suggest that you refer to Recipe 2-15 before proceeding with the process that follows, as this could help you understand the solution. The following steps explain how to subset the source file into separate staging files before carrying out the final data load.
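Before walking through the SSIS steps, the staging idea can be sketched in Python. This is only an analogy for the approach, with hypothetical sample data and an invented stage_subsets helper: each record type is written to its own subfile, after which the header file can be loaded before the line file regardless of the order of rows in the source.

```python
import tempfile
from contextlib import ExitStack
from pathlib import Path

# Prefix (without the hyphen delimiter) mapped to its staging file name.
STAGING_FILES = {"HDR:": "InvoiceHeader.Txt", "LNE:": "InvoiceLine.Txt"}


def stage_subsets(source_path, staging_dir):
    """Write each record type to its own staging file; return the file paths."""
    staging_dir = Path(staging_dir)
    with ExitStack() as stack:
        source = stack.enter_context(open(source_path, encoding="utf-8"))
        targets = {
            prefix: stack.enter_context(
                open(staging_dir / name, "w", encoding="utf-8")
            )
            for prefix, name in STAGING_FILES.items()
        }
        for row in source:
            col_identifier, found, col_data = row.rstrip("\r\n").partition("-")
            if found and col_identifier in targets:
                targets[col_identifier].write(col_data + "\n")
    return [staging_dir / name for name in STAGING_FILES.values()]


with tempfile.TemporaryDirectory() as work_dir:
    # Hypothetical source content in which a line precedes its header.
    source_file = Path(work_dir) / "MultiFormatSource.Txt"
    source_file.write_text(
        "LNE:-1,1,1,5000.00,1,2012-01-01,1\n"
        "HDR:-1,3A9271EA,1,2012-01-01,500.00,750.00\n",
        encoding="utf-8",
    )
    header_file, line_file = stage_subsets(source_file, work_dir)
    staged_headers = header_file.read_text(encoding="utf-8").splitlines()
    staged_lines = line_file.read_text(encoding="utf-8").splitlines()

print(staged_headers)
print(staged_lines)
```

Because the two subsets end up in separate files, the load order is under your control, which is what allows the constraints on the destination tables to stay in place.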

1. Create a new SSIS package, and add a Flat File connection manager named MainFrame, which you configure to connect to the source file (C:\SQL2012DIRecipes\CH02\MultiFormatSource.Txt in this example).






2. In the Flat File Connection Manager Editor, set the format to Delimited.



3. Click the Advanced tab. Remove all but two columns (or add columns until you have two columns). Name the first one ColIdentifier and the second one ColData. Set the DataType for the first column (ColIdentifier) to String, and the ColumnDelimiter to hyphen (-). (This is what we are using in this example; you have to use whatever separator is used in your source file.) Set the DataType for the second column (ColData) to Unicode text stream. The dialog box should look like Figure 2-17 in Recipe 2-15.

4. Confirm your configuration changes.



5. Create two new Flat File connection managers, named InvoiceHeader and InvoiceLine, respectively. Configure each to point to a text file (I have imaginatively named mine InvoiceHeader.Txt and InvoiceLine.Txt). For each, click the Advanced pane and add a new column, which you set to the Unicode text stream data type (or the Text stream data type). Name the column Coldata. The Advanced pane will look like Figure 2-19.



Figure 2-19. Advanced pane of the Flat File connection manager


