13-12. Loading Files in Controlled Batches

Chapter 13 ■ Organising And Optimizing Data Loads



4.	Create the following variables:

Variable Name       Scope     Type      Value                      Comments
ADOTable            Package   Object    n/a                        The object variable that will contain the list of files to process.
BatchQuantity       Package   Int32     50                         The quantity of files to process per batch.
CreateList          Package   Boolean   True                       A flag to indicate whether the list of files is to be dropped and re-created or not.
FileFilter          Package   String    *.CSV                      The file extension for all the files to be processed.
FileSource          Package   String    C:\SQL2012DIRecipes\CH13   The source directory for the source files.
IsFinished          Package   Boolean   False                      The flag used to indicate that the process has finished.
ListConn            Package   String    CarSales_Staging_ADONET    The connection manager name.
MaxFilesToProcess   Package   Int64     5000                       The upper threshold for the maximum number of files to process in total.
MaxProcessDuration  Package   Int32     7200                       The upper threshold for the maximum number of seconds to run the batch before ceasing processing.
ProcessFile         Package   String    n/a                        The file currently being processed.
SortElement         Package   String    FileSize                   The indicator of how the list is sorted.
TotalFilesLoaded    Package   Int64     0                          The process counter for the number of files processed in the batch.






5.	Add a For Loop container to the Control Flow pane. Name it Create table of files to process.

6.	Inside, add an Execute SQL task named Prepare Table.

7.	Then add two Script tasks named Loop Through Files and Write to table and Reset Variable.

8.	Next, connect the components in the order shown in Figure 13-26.

Figure 13-26.  Defining the list of files to process for a controlled file load

9.	Set the EvalExpression for the “Create table of files to process” container to:

@CreateList == True



10.	Configure the Prepare Table task so that:

Connection Type:	ADO.NET
Connection:	CarSales_Staging_ADONET
SQL Statement:	TRUNCATE TABLE dbo.BatchFileLoad

11.	Configure the “Loop Through Files and Write to table” Script task, as follows:

ReadOnly Variables:	User::FileFilter,User::FileSource,User::ListConn

12.	Set the language of the Script task to Microsoft Visual Basic 2010 and click Edit Script.
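These steps assume that the dbo.BatchFileLoad control table already exists in the staging database; its definition is not shown in this recipe. A plausible sketch, with column names inferred from the INSERT and UPDATE statements used later and data types that are assumptions, would be:

```sql
-- Hypothetical DDL for the control table. The column names come from the
-- INSERT in step 14 and the UPDATE in step 27; the data types are assumed.
CREATE TABLE dbo.BatchFileLoad
(
    ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    FileName NVARCHAR(250) NULL,
    IsToload BIT NULL,
    IsLoaded BIT NULL,
    FileSize BIGINT NULL,
    CreationTime DATETIME NULL,
    FileExtension NVARCHAR(50) NULL,
    DirectoryName NVARCHAR(250) NULL,
    LastWriteTime DATETIME NULL,
    DateLoaded DATETIME NULL
);
```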






13. Add the following to the Imports region:

Imports System.Data.SqlClient
Imports System.IO

14. Use the following to replace the Main method (C:\SQL2012DIRecipes\CH13\ControlBatchLoad.Vb):

Public Sub Main()

    Dim sqlConn As SqlConnection
    Dim sqlCommand As SqlCommand

    sqlConn = DirectCast(Dts.Connections(Dts.Variables("ListConn").Value.ToString). _
              AcquireConnection(Dts.Transaction), SqlConnection)

    Dim FileSource As String = Dts.Variables("FileSource").Value.ToString
    Dim FileFilter As String = Dts.Variables("FileFilter").Value.ToString

    Dim dirInfo As New System.IO.DirectoryInfo(FileSource)
    Dim fileSystemInfo As System.IO.FileSystemInfo
    Dim FileName As String
    Dim FileFullName As String
    Dim FileSize As Long
    Dim FileExtension As String
    Dim CreationTime As Date
    Dim DirectoryName As String
    Dim LastWriteTime As Date

    Dim sqlText As String

    For Each fileSystemInfo In dirInfo.GetFileSystemInfos(FileFilter)

        FileName = fileSystemInfo.Name
        FileFullName = fileSystemInfo.FullName

        Dim fileDetail As New FileInfo(FileFullName)

        FileSize = fileDetail.Length
        CreationTime = fileDetail.CreationTime
        FileExtension = fileDetail.Extension
        DirectoryName = fileDetail.DirectoryName
        LastWriteTime = fileDetail.LastWriteTime

        sqlText = "INSERT INTO dbo.BatchFileLoad " _
            & "(FileName, IsToload, IsLoaded, FileSize, CreationTime, " _
            & "FileExtension, DirectoryName, LastWriteTime) " _
            & "VALUES('" & FileName & "', 1, 0, " & FileSize & ", '" _
            & String.Format("{0:s}", CreationTime) & "', '" & FileExtension & "', '" _
            & DirectoryName & "', '" & String.Format("{0:s}", LastWriteTime) & "')"

        sqlCommand = New SqlCommand(sqlText, sqlConn)
        sqlCommand.CommandType = CommandType.Text
        sqlCommand.ExecuteNonQuery()

    Next

    Dts.TaskResult = ScriptResults.Success

End Sub

15. Close the Script window and confirm with OK.

16. Configure the Reset Variable task with the following variable:

ReadWrite Variables:	CreateList

17. Add the following script to the Script task once you have set the language to Microsoft Visual Basic 2010:

Public Sub Main()

    Dts.Variables("CreateList").Value = False

    Dts.TaskResult = ScriptResults.Success

End Sub

18. Close the Script window and confirm with OK.

This completes the first part of the package: the process that iterates through all the files to process and stores their data in the BatchFileLoad table, while also allowing the process to be restarted and the list of files to be regenerated.

19. Add a Script task to the Control Flow pane. Name it Initialize file counter and set the following variable:

ReadWrite Variables:	User::TotalFilesLoaded

20. Add the following script to the Script task once you have set the language to Microsoft Visual Basic 2010:

Public Sub Main()

    Dts.Variables("TotalFilesLoaded").Value = 0

    Dts.TaskResult = ScriptResults.Success

End Sub

21. Close the Script window and confirm with OK.

22. Add a For Loop container to the Control Flow pane. Name it Batch Process. Set its EvalExpression to:

@IsFinished == False && @TotalFilesLoaded < @MaxFilesToProcess &&
DATEADD("ss", @MaxProcessDuration, @[System::ContainerStartTime]) > GETDATE()



23. Inside this For Loop container, add the following, connected in this order:

•	An Execute SQL task named Get Batch.
•	A Foreach Loop container named Process Files while there are files to Process.
•	An Execute SQL task named Count remaining files to process.

24. Inside the Foreach Loop container, named Load Batch, add the following, connected in this order:

•	A Data Flow task named Data Load.
•	An Execute SQL task named Log file is loaded.
•	A Script task named Increment file counter.

This part of the package should look like Figure 13-27.






Figure 13-27.  Processing a controlled bulk load

25. Configure the “Get Batch” Execute SQL task so that its attributes are as follows:

Connection Pane:
Connection Type:	ADO.NET
Connection:	CarSales_Staging_ADONET
SQL Statement:

SELECT TOP (@BatchQuantity) [FileName]
FROM dbo.BatchFileLoad
WHERE IsToload = 1
AND IsLoaded = 0
ORDER BY FileSize

ResultSet:	Full Result Set

Result Set Pane:
Result Name:	0
Variable Name:	User::ADOTable

Parameter Mapping Pane:
Variable Name:	User::BatchQuantity
Direction:	Input
Data Type:	Int32
Parameter Name:	@BatchQuantity



26. Configure the Data Load task so that the data flow source is the Flat File connection named Data Source File, and the data flow destination is an OLEDB destination using the CarSales_Staging_OLEDB connection. Ensure that you have mapped the columns.

27. Configure the “Log file is loaded” task with the following attributes:

Connection Pane:
Connection Type:	ADO.NET
Connection:	CarSales_Staging_ADONET
SQL Statement:

UPDATE BatchFileLoad
SET IsLoaded = 1
,DateLoaded = GETDATE()
WHERE FileName = @ProcessFile

Parameter Mapping Pane:
Variable Name:	User::ProcessFile
Direction:	Input
Data Type:	String
Parameter Name:	@ProcessFile

28. Configure the Script task named Increment File counter, adding the following variable:

ReadWrite Variables:	User::TotalFilesLoaded






29. Add the following script to the Script task once you have set the language to Microsoft Visual Basic 2010:

Public Sub Main()

    Dts.Variables("TotalFilesLoaded").Value = Dts.Variables("TotalFilesLoaded").Value + 1

    Dts.TaskResult = ScriptResults.Success

End Sub

30. Close the Script window and confirm with OK.

31. Configure the “Count remaining files to process” Execute SQL task so that it is as follows:

Connection Pane:
Connection Type:	ADO.NET
Connection:	CarSales_Staging_ADONET
SQL Statement:

DECLARE @FileCountToProcess INT
SELECT @FileCountToProcess = COUNT(*) FROM BatchFileLoad
WHERE IsLoaded = 0
IF @FileCountToProcess = 0
BEGIN
    SET @IsFinished = 1
END
ELSE
BEGIN
    SET @IsFinished = 0
END

Parameter Mapping Pane:
Variable Name:	User::IsFinished
Direction:	Output
Data Type:	String
Parameter Name:	@IsFinished

You can now run the process.






How It Works

Another frequent requirement in my experience is to load data from multiple files, and to load them in batches. There are several possible reasons for wanting to do this:

•	You cannot guarantee that the data load will finish in a required (or reasonable) time, and you need to be able to stop the load at any time, and continue the load later.
•	You want to load files for a specified period, and have the process cease after the last complete file load once the specified period has elapsed.
•	You need to perform intermediate processing once batches of data have been loaded.
•	Your ETL process is more efficient if the data is loaded in optimized batches (shredding XML files or due to indexing constraints, for instance).

Of course, a batch process that can handle these requirements must also be able to sort source files, and log which file is loaded. Inevitably, it must be sufficiently resilient to allow graceful recovery and restart from the point of failure.

So here you have an SSIS package that fulfils these requirements. To make it more flexible, it allows you to pass in the parameters in Table 13-1 as SSIS variables.

Table 13-1.  The SSIS Parameters Used in a Parallel Load

Parameter                   Description
Batch size                  The number of source files processed per batch.
Total process size          The number of source files processed before the SSIS package stops.
Maximum process duration    The number of seconds the process will run, plus the time needed to finish the current file load.
The directory to process    The directory where source files are stored.
The file filter             This is the file extension most of the time.



This package is in two parts. First, a loop container is processed that gathers the data for all the files to process, and writes this data to an SQL Server table. A table is used to ensure that all metadata is persisted to a data store that can be guaranteed reliable. Second, another loop container processes the files for as long as files remain to process, time remains, and the maximum number of files has not been reached. Inside this “logic” loop container is another that provides the batch loading.

The process needs to know if it is a completely new load, or if it is continuing an existing load. To this end, the @CreateList variable is passed in as False if the process is to continue where it left off. The default is True, assuming (optimistically) that the process will always finish in time, without error, and that there will never be too many files to process. As there are three blocking thresholds, these must be passed in when running the process, or the defaults accepted:

•	Maximum time (detected by adding the permitted number of seconds to the container start time and comparing the result to the system date).
•	Maximum number of files (incremented each time a file is loaded).
•	All files loaded (flagged when the final batch contains no files).






The initial part of the process (the “Create table of files to process” For Loop container) defines the list of files to be loaded in the BatchFileLoad table, and flags them initially as not yet loaded. The counter variable that tracks the actual number of files loaded is set to 0. The outer (control) For Loop container checks that:

•	Not all the files have been loaded.
•	The allotted time has not been exceeded.
•	The total number of files to process has not been exceeded.

Assuming that these conditions hold, control passes to the load process. First, the specified number of files to load is selected from the BatchFileLoad table (or as many as remain, if this is a lesser number), and passed to the ADOTable ADO.NET object variable. Then, the inner (load) Foreach Loop container processes the files in ADOTable. This consists of:

•	Loading the file.
•	Logging the successful load to the BatchFileLoad control table.
•	Incrementing the total files loaded counter.



The “Loop Through Files and Write to table” task loops through all the files in the specified directory, and writes the file names and relevant attributes to the control table. It uses an existing ADO.NET connection manager. In this example, the SQL is sent as text, but you could use a stored procedure and send in the values as parameters.

The SQL in “Count remaining files to process” counts the number of files in the “control” table that remain to be processed. It then sets the IsFinished “stop” variable to True if none remain, which is picked up by the For Loop container as the indicator to pass control on to the next task.
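If you prefer the stored procedure approach mentioned above, the dynamic INSERT built in the Script task could be replaced by something along the following lines. This is only a sketch; the procedure name dbo.pr_AddBatchFile and the parameter types are assumptions, not part of the original recipe:

```sql
-- Hypothetical stored procedure equivalent of the concatenated INSERT.
-- Passing the values as parameters avoids quoting problems (file names
-- containing apostrophes, for instance) and SQL injection.
CREATE PROCEDURE dbo.pr_AddBatchFile
    @FileName NVARCHAR(250),
    @FileSize BIGINT,
    @CreationTime DATETIME,
    @FileExtension NVARCHAR(50),
    @DirectoryName NVARCHAR(250),
    @LastWriteTime DATETIME
AS
BEGIN
    INSERT INTO dbo.BatchFileLoad
        (FileName, IsToload, IsLoaded, FileSize, CreationTime,
         FileExtension, DirectoryName, LastWriteTime)
    VALUES
        (@FileName, 1, 0, @FileSize, @CreationTime,
         @FileExtension, @DirectoryName, @LastWriteTime);
END
```

In the Script task, you would then set sqlCommand.CommandType = CommandType.StoredProcedure and add the values with sqlCommand.Parameters.AddWithValue, rather than concatenating them into the SQL string.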



■ Note  This process, as it has been described, has no error handling, to simplify the explanations and code. However, in a production environment, you should definitely add error trapping and handling; at a minimum, you should detect file load failure and log a failed file load. You should also define the MaximumErrorCount at the level of the Load Batch container to indicate the number of errors allowed before failing the package.



Hints, Tips, and Traps

•	In a real-world environment, you may prefer to store the metadata (the BatchFileLoad table) in a different database. It is entirely up to you.
•	Of course, you must set the variable values and connection managers to meet your requirements.
•	You can extend this process to handle files from multiple folders and file filters using the techniques described in Recipe 13-3.






13-13. Executing SQL Statements and Procedures in Parallel Using SSIS

Problem

You want to accelerate an SSIS ETL process that contains several T-SQL stored procedures.

Solution

Execute the SQL using multiple concurrent tasks. This is as easy as setting up two or more Execute SQL tasks in parallel.

1.	Add an Execute SQL task on to the Control Flow pane, under the preceding task (if there is one), for each process to be executed in parallel.

2.	Connect the preceding task (if there is one) to all the new Execute SQL tasks.

3.	Connect all the new Execute SQL tasks to the following task (if there is one).

4.	Define the SQL (as a stored procedure call or as T-SQL code) for each of the tasks. Create or use any connection managers you require.

A purely theoretical package could look like Figure 13-28.

Figure 13-28.  Parallel SQL tasks


