13-4. Ordering and Filtering File Loads


Chapter 13 ■ Organising And Optimizing Data Loads



Solution

Use ADO.NET datasets and SSIS script tasks.

In this example, I will presume that you have a directory of flat files in CSV format that need loading into SQL Server, and that it is important to load the oldest files first. The following steps explain how to do this.

1. Open SSIS and create a new package. I suggest naming it OrderAndFilterFileLoad, as that is what it will do.



2. Add the following variables at package level:



Variable Name              Type    Value  Comments
ADOFilteredAndSortedTable  Object         The SSIS object that holds the resulting list of sorted files.
FileName                   String         The variable used by the Foreach Loop container that loads the files (in the correct, sorted order).
FileFilter                 String         The file-name pattern used to select the source files whose file system attributes are read; these attributes can be used to order the final list.
SortColumn                 String         The column that acts as a sort key.
FileSource                 String         The source directory.
ADOTable                   Object         The SSIS object that holds the initial unsorted list of files.

3. Add a Script task to the Control Flow pane. Name it PopulateRecordset and double-click to edit. Add the following read/write variables:

ADOFilteredAndSortedTable
ADOTable

Because the script also reads FileSource, FileFilter, and SortColumn, list those three as read-only variables.



4. Click the Edit Script button.

5. In the SSIS Script window, add the following directive in the Imports region:

    Imports System.IO

6. Replace the Main method with the following code. This script is explained at the end of the recipe (C:\SQL2012DIRecipes\CH13\OrderedFilteredFileLoad.Vb):

Public Sub Main()

    ' Declare all variables

    Dim FileSource As String = Dts.Variables("FileSource").Value.ToString
    Dim FileFilter As String = Dts.Variables("FileFilter").Value.ToString
    Dim SortColumn As String = Dts.Variables("SortColumn").Value.ToString

    Dim MainDS As New System.Data.DataSet
    Dim MainTable As New System.Data.DataTable
    Dim MainRow As System.Data.DataRow
    Dim MainCol As New System.Data.DataColumn

    Dim dirInfo As New System.IO.DirectoryInfo(FileSource)
    Dim fileSystemInfo As System.IO.FileSystemInfo
    Dim FileCounter As Int16 = 0
    Dim FileName As String
    Dim FileFullName As String
    Dim FileSize As Long
    Dim FileExtension As String
    Dim CreationTime As Date
    Dim DirectoryName As String
    Dim LastWriteTime As Date

    ' Define table structure (the trailing comment is the zero-based column index)

    MainDS.Tables.Add(MainTable)
    MainTable.Columns.Add("FileName", System.Type.GetType("System.String"))        ' 0
    MainTable.Columns.Add("DateAdded", System.Type.GetType("System.DateTime"))     ' 1
    MainTable.Columns.Add("DateLoaded", System.Type.GetType("System.DateTime"))    ' 2
    ' Int64 rather than Int32, so that the column matches FileInfo.Length (a Long)
    MainTable.Columns.Add("FileSize", System.Type.GetType("System.Int64"))         ' 3
    MainTable.Columns.Add("CreationTime", System.Type.GetType("System.DateTime"))  ' 4
    MainTable.Columns.Add("FileExtension", System.Type.GetType("System.String"))   ' 5
    MainTable.Columns.Add("DirectoryName", System.Type.GetType("System.String"))   ' 6
    MainTable.Columns.Add("LastWriteTime", System.Type.GetType("System.DateTime")) ' 7

    ' Loop through directory, and add records to the ADO table

    For Each fileSystemInfo In dirInfo.GetFileSystemInfos(FileFilter)

        FileName = fileSystemInfo.Name
        FileFullName = fileSystemInfo.FullName

        Dim fileDetail As New FileInfo(FileFullName)

        FileSize = fileDetail.Length
        CreationTime = fileDetail.CreationTime
        FileExtension = fileDetail.Extension
        DirectoryName = fileDetail.DirectoryName
        LastWriteTime = fileDetail.LastWriteTime

        MainRow = MainTable.NewRow()

        MainRow(0) = FileName
        MainRow(1) = Now()
        MainRow(2) = CDate("01-01-1900")
        MainRow(3) = FileSize
        MainRow(4) = CreationTime
        MainRow(5) = FileExtension
        MainRow(6) = DirectoryName
        MainRow(7) = LastWriteTime

        MainTable.Rows.Add(MainRow)

        FileCounter = CShort(FileCounter + 1)

    Next

    ' Create ADOLoopTable - used for the actual batch loop

    Dim SortedFilteredDS As System.Data.DataSet = MainDS.Clone
    ' Select takes the filter and the sort expression as two separate string arguments
    Dim SortedFilteredRows As DataRow() = _
        MainDS.Tables(0).[Select]("FileSize > 1", SortColumn & " ASC")
    Dim SortedFilteredTable As DataTable = SortedFilteredDS.Tables(0)

    For Each ClonedFilteredRow As DataRow In SortedFilteredRows
        SortedFilteredTable.ImportRow(ClonedFilteredRow)
    Next

    ' Convert the tables into SSIS objects:

    Dts.Variables("ADOFilteredAndSortedTable").Value = CType(SortedFilteredDS, Object)

    Dts.TaskResult = ScriptResults.Success

End Sub


7. Close the SSIS Script window and click OK to close the Script task editor.

8. Add a Foreach Loop container to the Control Flow pane and name it Load Files. Connect the PopulateRecordset Script task to this and double-click to edit the Foreach Loop container.

9. Select Collection on the left, and choose Foreach ADO Enumerator as the enumerator type. Select User::ADOFilteredAndSortedTable as the ADO object source variable. The resulting dialog box should look like Figure 13-12.






Figure 13-12.  Looping over an ADO.NET object

10. Select Variable Mappings on the left. Select User::FileName as the variable to use.

11. Click OK.

12. Add a Data Flow task into the Foreach Loop container, and configure it just as you did for Recipe 12-1 (not forgetting to set the Flat File connection manager connection string as an expression). The final package should look like Figure 13-13.






Figure 13-13.  The completed package to filter and sort file loads

You can now run the package. Before you do, remember to set the values for the directory to be processed, the file filter, and the sort column in the Variables window. In this case, they would be as follows:

FileSource:  C:\SQL2012DIRecipes\CH13\MultipleFlatFiles
FileFilter:  Stock*.Csv
SortColumn:  CreationTime

Alternatively, if you are running the package from the command line, a SQL Server Agent job, or the SSIS catalog, remember to supply values for these variables as appropriate.



How It Works

There will be many occasions where you are required to load data from files in a directory. These files could be flat files (comma-separated files, for instance), XML files, or even image files.

Loading all the files in a directory is a standard technique, and was explained previously in Recipe 12-1. However, as many commentators have noted, the standard Foreach Loop container only allows you to specify file name and extension patterns. There are occasions when you might need to specify the load order or filter the files in a directory based on the properties that they expose to the file system, such as creation date or file size.

Although there are several ways of ordering and filtering the source files (whatever type of data they contain), I prefer to use SSIS and the Script task for this. Such an approach relies on ADO.NET datasets to perform sorting and filtering. This method is both resilient and easy to maintain. It requires creating and using two .NET datasets. One holds the names and attributes of the files in the directory that stores the files to be loaded. A clone of this dataset is then created, and it is the clone that holds the filtered and sorted rows. This cloned dataset is used to iterate through the files to load.

To make this process a little more flexible, I suggest passing in the following as variables:

• The name of the directory to be processed.
• The file filter.
• The sort column.

Passing these items is, of course, not strictly necessary, but it makes the package more easily reusable. You also need a string variable to hold the name of the file that is currently being loaded and an object variable to hold the sorted datatable.

The script code works as follows. To make the script more dynamic, three variables are declared and populated from SSIS variables: FileSource, FileFilter, and SortColumn pass in the directory to be parsed, the file filter to be applied, and the column to sort on. Further variables are declared to hold all the required attributes of the files. Of course, if you are only sorting and/or filtering on one or two attributes, then you need only keep the corresponding variables, which makes the code easier to maintain.

Initially a Main dataset is created. This in turn holds the Main table, which will be populated with the list of files from the selected directory. The Main table is defined with a column for each file attribute that you are likely to use. The next part of the script loops through the directory, gets the name and attributes of each file that matches the file filter, and passes each attribute to the corresponding script variable. A new row is then added to the table, and each column is populated from its script variable.

Because the Select method returns an array of rows rather than a sorted table, a cloned dataset (SortedFilteredDS) is created to receive the sorted and/or filtered rows, which are selected using the pseudo-code:

[Select]("Selection criteria as text", "Sort Column as text ASC/DESC")

Each row that Select returns from the Main table is copied into the cloned SortedFilteredTable table. The cloned dataset is passed out as an SSIS variable that will be used in the Foreach Loop container. I have added a file counter that you can pass out of the Script task, if you wish, for auditing and logging.
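
The select-and-copy step described above can be tried outside SSIS. The following is a minimal sketch, not the recipe's script: it assumes a small in-memory table with invented file details instead of a real directory, and the names SelectSketch, FilterAndSort, and BuildSampleTable are hypothetical. It relies on the same DataTable.Select, Clone, and ImportRow calls that the recipe uses.

```vbnet
Imports System
Imports System.Data

Module SelectSketch

    ' Returns a filtered, sorted copy of the source table (hypothetical helper).
    Function FilterAndSort(ByVal source As DataTable, ByVal filter As String, ByVal sort As String) As DataTable
        ' Clone copies the schema (the columns) but none of the rows.
        Dim result As DataTable = source.Clone()
        ' Select returns the matching rows as an array, already in sort order.
        For Each row As DataRow In source.Select(filter, sort)
            result.ImportRow(row)
        Next
        Return result
    End Function

    ' Builds a small table of invented file details; no disk access is needed.
    Function BuildSampleTable() As DataTable
        Dim files As New DataTable("Files")
        files.Columns.Add("FileName", GetType(String))
        files.Columns.Add("FileSize", GetType(Long))
        files.Columns.Add("CreationTime", GetType(Date))
        files.Rows.Add("Stock03.Csv", 2048L, New Date(2012, 3, 1))
        files.Rows.Add("Stock01.Csv", 4096L, New Date(2012, 1, 1))
        files.Rows.Add("Empty.Csv", 0L, New Date(2012, 2, 1))
        files.Rows.Add("Stock02.Csv", 1024L, New Date(2012, 2, 15))
        Return files
    End Function

    Sub Main()
        ' Same filter and sort shape as the recipe: drop tiny files, oldest first.
        Dim sorted As DataTable = FilterAndSort(BuildSampleTable(), "FileSize > 1", "CreationTime ASC")
        For Each row As DataRow In sorted.Rows
            Console.WriteLine("{0}  {1:yyyy-MM-dd}", row("FileName"), row("CreationTime"))
        Next
    End Sub

End Module
```

Run as a console application, this should list Stock01.Csv, Stock02.Csv, and Stock03.Csv in creation-date order, with the zero-byte Empty.Csv filtered out.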



Hints, Tips, and Traps





• Remember to use a UNC path for server and directory names, to ensure portability.

• You do not need to refer to the columns in the datatable by their zero-based index (as I did in the code for this recipe). You can use the column name, for example:
  MainRow("FileName") = FileName

• The variable names that you choose do not have to be identical to the column names. I merely find this simpler to both code and debug.

• When (or indeed if) sorting, as this is not compulsory, remember to sort on a column name that you defined as part of the ADO.NET table.

• Passing in the sort column as a variable means that you have to be careful about the quoted strings. If you are hard-coding the sort column, then merely entering "FileSize ASC" (to sort by the FileSize column in the ADO.NET datatable) will suffice.

• The filter criteria can be more complex, and can include multiple columns and standard comparison operators. A more complex example could be:
  "FileSize > 10000 AND LastWriteTime <= '2010-12-25'"

• Note that all text strings used in the filter criteria must be in single quotes, and that there are several possible date formats. I prefer to use YYYY-MM-DD.

• If you want to sort without filtering, then you need to apply some cunning. The .NET Select method has no overload that takes only a sort argument, so you must supply a filter argument as well. A good way to cheat is to use code such as the following:
  [Select]("FileSize >= 1", "CreationTime ASC")
  This filters on all files of one byte or more (which means all files of any use in most circumstances), but it sorts by file creation date.

• If you are using SSIS 2005 or 2008, you need to add a reference to System.Xml in the Imports directive.
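
The hints on column names, compound filters, and sorting can be combined in one short sketch. This is an illustration under stated assumptions rather than part of the recipe: the sample rows are invented, and HintsSketch, BuildSampleTable, and NamesOf are hypothetical names. The last call passes an empty filter string, which Select treats as "no filter", as an alternative to the FileSize >= 1 trick; this alternative does not come from the recipe, so test it in your own environment before relying on it.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Data

Module HintsSketch

    ' Builds a small table of invented file details (an assumption for illustration).
    Function BuildSampleTable() As DataTable
        Dim files As New DataTable("Files")
        files.Columns.Add("FileName", GetType(String))
        files.Columns.Add("FileSize", GetType(Long))
        files.Columns.Add("LastWriteTime", GetType(Date))

        ' Populate a row by column name rather than by zero-based index.
        Dim row As DataRow = files.NewRow()
        row("FileName") = "Stock01.Csv"
        row("FileSize") = 20000L
        row("LastWriteTime") = New Date(2010, 12, 1)
        files.Rows.Add(row)

        files.Rows.Add("Stock02.Csv", 500L, New Date(2010, 11, 1))
        files.Rows.Add("Stock03.Csv", 30000L, New Date(2011, 1, 15))
        Return files
    End Function

    ' Joins the FileName values of the given rows, in order, with commas.
    Function NamesOf(ByVal rows As DataRow()) As String
        Dim names As New List(Of String)
        For Each r As DataRow In rows
            names.Add(CStr(r("FileName")))
        Next
        Return String.Join(",", names.ToArray())
    End Function

    Sub Main()
        Dim files As DataTable = BuildSampleTable()

        ' Compound filter: two columns, with the date literal in single quotes.
        Dim filtered As DataRow() = files.Select( _
            "FileSize > 10000 AND LastWriteTime <= '2010-12-25'", "FileSize ASC")
        Console.WriteLine(NamesOf(filtered))

        ' Sort only: an empty filter string keeps every row.
        Dim sortedOnly As DataRow() = files.Select("", "LastWriteTime ASC")
        Console.WriteLine(NamesOf(sortedOnly))
    End Sub

End Module
```

With these sample rows, the first line printed should be Stock01.Csv, and the second should be Stock02.Csv,Stock01.Csv,Stock03.Csv.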



13-5. Loading Multiple Flat Files in Parallel

Problem

You want to load data from multiple identically structured flat files in a single directory, faster than you could by loading the files in strict sequence.



Solution

Load the data in parallel, as follows.

1. Create a table in SQL Server to hold the file names using the following DDL (C:\SQL2012DIRecipes\CH13\tblSimpleParallelLoad.Sql):

CREATE TABLE CarSales_Staging.dbo.SimpleParallelLoad
(
    ID INT IDENTITY(1,1) NOT NULL,
    FileName VARCHAR(250) NULL,
    ProcessNumber AS (ID % 4)
);
GO

2. Create a table in SQL Server to hold the data, once loaded, using the following DDL (C:\SQL2012DIRecipes\CH13\tblParallelStock.Sql):

CREATE TABLE CarSales_Staging.dbo.ParallelStock
(
    ID BIGINT IDENTITY(1,1) NOT NULL,
    Make VARCHAR(50) NULL,
    Marque NVARCHAR(50) NULL,
    Model VARCHAR(50) NULL,
    Colour TINYINT NULL,
    Product_Type VARCHAR(50) NULL,
    Vehicle_Type VARCHAR(20) NULL,
    Cost_Price NUMERIC(18, 2) NULL
);
GO






3. Enter the list of source files into the table (SimpleParallelLoad) that you created in step 1. The code for this, in the current recipe, is (C:\SQL2012DIRecipes\CH13\PrepSimpleParallelLoad.Sql):

USE CarSales_Staging
GO

SET IDENTITY_INSERT dbo.SimpleParallelLoad ON
GO

INSERT dbo.SimpleParallelLoad (ID, FileName)
VALUES (1, N'C:\SQL2012DIRecipes\CH13\MultipleFlatFiles\Stock01.Csv')
GO

INSERT dbo.SimpleParallelLoad (ID, FileName)
VALUES (2, N'C:\SQL2012DIRecipes\CH13\MultipleFlatFiles\Stock02.Csv')
GO

INSERT dbo.SimpleParallelLoad (ID, FileName)
VALUES (3, N'C:\SQL2012DIRecipes\CH13\MultipleFlatFiles\Stock03.Csv')
GO

INSERT dbo.SimpleParallelLoad (ID, FileName)
VALUES (4, N'C:\SQL2012DIRecipes\CH13\MultipleFlatFiles\Stock04.Csv')
GO

SET IDENTITY_INSERT dbo.SimpleParallelLoad OFF
GO



4. Create a new SSIS package and name it SimpleParallelProcessing. Add two connection managers (one OLE DB, the other ADO.NET) that connect to the database you will use to load the data and metadata (CarSales_Staging in this example). I will name them CarSales_Staging_OLEDB and CarSales_Staging_ADONET, respectively.



5. Add the following variables at the task level, as well as the initial values given:

Variable Name  Type     Value                     Comments
CreateList     Boolean  String                    A flag indicating that the list is to be deleted and re-created.
FileFilter     String   *.CSV                     Allows you to specify the file extension to use.
FileSource     String   C:\SQL2012DIRecipes\CH13  Allows you to specify the file directory to use.

Of course, you should use your own file filter and source directory if you are not following this example exactly.

6. Add a Sequence container onto the Control Flow pane, and name it Create table of files to process.


