13-5. Loading Multiple Flat Files in Parallel


Chapter 13 ■ Organising And Optimizing Data Loads



3. Enter the list of source files into the table (SimpleParallelLoad) that you created in step 1. The code for this, in the current recipe, is (C:\SQL2012DIRecipes\CH13\PrepSimpleParallelLoad.Sql):

USE CarSales_Staging
GO

SET IDENTITY_INSERT dbo.SimpleParallelLoad ON
GO

INSERT dbo.SimpleParallelLoad (ID, FileName)
VALUES (1, N'C:\SQL2012DIRecipes\CH13\MultipleFlatFiles\Stock01.Csv')
GO

INSERT dbo.SimpleParallelLoad (ID, FileName)
VALUES (2, N'C:\SQL2012DIRecipes\CH13\MultipleFlatFiles\Stock02.Csv')
GO

INSERT dbo.SimpleParallelLoad (ID, FileName)
VALUES (3, N'C:\SQL2012DIRecipes\CH13\MultipleFlatFiles\Stock03.Csv')
GO

INSERT dbo.SimpleParallelLoad (ID, FileName)
VALUES (4, N'C:\SQL2012DIRecipes\CH13\MultipleFlatFiles\Stock04.Csv')
GO

SET IDENTITY_INSERT dbo.SimpleParallelLoad OFF
GO



4. Create a new SSIS package and name it SimpleParallelProcessing. Add two connection managers, one OLEDB and the other ADO.NET, that connect to the database you will use to load the data and metadata (CarSales_Staging in this example). I will name them CarSales_Staging_OLEDB and CarSales_Staging_ADONET, respectively.



5. Add the following variables at the task level, with the initial values given:

Variable Name   Type      Value                      Comments
CreateList      Boolean   True                       A flag indicating that the list is to be deleted and re-created.
FileFilter      String    *.CSV                      Allows you to specify the file extension to use.
FileSource      String    C:\SQL2012DIRecipes\CH13   Allows you to specify the file directory to use.

Of course, you should use your own file filter and source directory if you are not following this example exactly.

6. Add a Sequence container onto the Control Flow pane, and name it Create table of files to process.






7. Add an Execute SQL task into the Sequence container, and name it Prepare Table. Double-click to edit. Set the following elements:

Connection Type:   OLEDB
Connection:        CarSales_Staging_OLEDB
SQL Statement:     TRUNCATE TABLE dbo.SimpleParallelLoad

The Execute SQL Task Editor dialog box should look like it does in Figure 13-14.

Figure 13-14. Execute SQL task to truncate tables

8. Click OK to confirm.

9. Add a Script task into the Sequence container under the Execute SQL task that you just created. Name it Loop Through Files and Write to table and connect the Execute SQL task Prepare Table to it.






10. Double-click to edit, set the Script Language to Microsoft Visual Basic 2010, and add the following read-only variables:

User::FileFilter
User::FileSource

11. Click Edit Script.

12. Replace the Main method with the following (C:\SQL2012DIRecipes\CH13\SimpleParallelLoad.Vb):

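Public Sub Main()

    ' Requires "Imports System.Data.SqlClient" at the top of the script file
    Dim sqlConn As SqlConnection
    Dim sqlCommand As SqlCommand

    sqlConn = DirectCast(Dts.Connections("CarSales_Staging_ADONET").AcquireConnection(Dts.Transaction), SqlConnection)

    Dim FileSource As String = Dts.Variables("FileSource").Value.ToString
    Dim FileFilter As String = Dts.Variables("FileFilter").Value.ToString
    Dim dirInfo As New System.IO.DirectoryInfo(FileSource)
    Dim fileSystemInfo As System.IO.FileSystemInfo
    Dim FileName As String

    ' Parameterized statement, so that file names containing quotes cannot break the SQL
    Dim sqlText As String = "INSERT INTO dbo.SimpleParallelLoad (FileName) VALUES (@FileName)"

    For Each fileSystemInfo In dirInfo.GetFileSystemInfos(FileFilter)

        ' Store the full path: the Flat File connection managers use this value as their ConnectionString
        FileName = fileSystemInfo.FullName

        sqlCommand = New SqlCommand(sqlText, sqlConn)
        sqlCommand.CommandType = CommandType.Text
        sqlCommand.Parameters.AddWithValue("@FileName", FileName)
        sqlCommand.ExecuteNonQuery()

    Next

    ' Return the connection to the connection manager
    Dts.Connections("CarSales_Staging_ADONET").ReleaseConnection(sqlConn)

    Dts.TaskResult = ScriptResults.Success

End Sub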

13. Close the SSIS Script task window and click OK to confirm your modifications to the Script task.

14. Add a Script task into the Sequence container under the Script task that you just created. Name it Reset variable and connect the “Loop Through Files and Write to table” Script task to it. Double-click to edit this second Script task, and add User::CreateList as a read/write variable.






15. Click Edit Script and replace the Main method with the following:

Public Sub Main()

    Dts.Variables("CreateList").Value = False

    Dts.TaskResult = ScriptResults.Success

End Sub

16. Close the SSIS Script Task window and click OK to confirm your modifications. The first part of the package is now complete, and this will (re-)create the list of files to process, as required. The SSIS package should look like Figure 13-15.

Figure 13-15. The initial section of the parallel load package

17. Now to move on to the actual parallel processing. Add the following variables at package level:

Variable Name      Type     Value   Comments
Batch_0            Object           The first of the four batches.
Batch_1            Object           The second of the four batches.
Batch_2            Object           The third of the four batches.
Batch_3            Object           The fourth of the four batches.
FileName_Batch_0   String           The name of the file that will be loaded by batch one.
FileName_Batch_1   String           The name of the file that will be loaded by batch two.
FileName_Batch_2   String           The name of the file that will be loaded by batch three.
FileName_Batch_3   String           The name of the file that will be loaded by batch four.



18. Add an Execute SQL task under the Sequence container, connect the latter to the new task, and rename it Prepare destination table. Double-click to edit. Set the following elements:

Connection Type:   OLEDB
Connection:        CarSales_Staging_OLEDB
SQL Statement:     TRUNCATE TABLE dbo.ParallelStock

19. Click OK to confirm.

20. Add a new Flat File connection manager, name it Process0, and configure it to connect to any of the source files. In the properties for this connection manager, set the expression for its ConnectionString to be @[User::FileName_Batch_0] (as explained in Recipe 13-1, steps 14 to 20).

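21. Add an Execute SQL task under the “Prepare destination table” task, connect the latter to the new task, and rename it Get Batch 0. Double-click to edit. Set the following elements:

Connection Type:   ADO.NET
Connection:        CarSales_Staging_ADONET
SQL Statement:     SELECT  FileName
                   FROM    dbo.SimpleParallelLoad WITH (NOLOCK)
                   WHERE   ProcessNumber = 0

Also set ResultSet to Full result set and, on the Result Set pane, map result name 0 to the User::Batch_0 variable, so that the loop container added in step 23 has a list of files to iterate over.

22. Click OK to confirm.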

23. Add a Foreach Loop container under the “Get Batch 0” task, connect the latter to the new container, and rename it Load all files in Batch 0. Double-click to edit. Set the following elements:

Enumerator:                   Foreach ADO Enumerator
ADO Object Source Variable:   Batch_0
Variable:                     FileName_Batch_0






24. Click OK to confirm.

25. Add a Data Flow task inside the Foreach Loop container, and configure it to connect the Flat File connection manager Process0 to the destination table (ParallelStock) using the CarSales_Staging_OLEDB connection manager. Ensure that the Table lock (Tablock) option is checked.

26. Repeat steps 20 through 25 three times (once for each additional parallel load), using the corresponding Batch_n and FileName_Batch_n variables and filtering on ProcessNumber values 1, 2, and 3. Be sure to name the Flat File connection managers Process1, Process2, and Process3. Name the Foreach Loop containers Load all files in Batch 1, Load all files in Batch 2, and Load all files in Batch 3. The final package should look like Figure 13-16.



Figure 13-16.  The complete parallel load package

You can now run the package.
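
The three additional branches presumably differ from the first only in the batch that they select. Assuming that each extra Execute SQL task keeps the shape of the Get Batch 0 query and changes only the ProcessNumber filter, the statements might look like the following sketch. The task names in the comments are assumed by analogy with Get Batch 0 and are not given in the recipe.

```sql
-- Get Batch 1 (assumed task name): feeds the Batch_1 variable
SELECT FileName
FROM   dbo.SimpleParallelLoad WITH (NOLOCK)
WHERE  ProcessNumber = 1;

-- Get Batch 2 (assumed task name): feeds the Batch_2 variable
SELECT FileName
FROM   dbo.SimpleParallelLoad WITH (NOLOCK)
WHERE  ProcessNumber = 2;

-- Get Batch 3 (assumed task name): feeds the Batch_3 variable
SELECT FileName
FROM   dbo.SimpleParallelLoad WITH (NOLOCK)
WHERE  ProcessNumber = 3;
```

Each statement belongs in its own Execute SQL task, so that the four branches can run side by side.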



How It Works

This recipe is based on the scenario where you receive many flat files, all having the same format, and all need to be loaded into the same table. Clearly you can process them sequentially, as was shown in previous recipes, but parallel processing should prove to be faster in most cases. The speed gain is generally worth the extra effort required to create a more complex SSIS package.

In this recipe, we perform the load operation in two distinct parts:

• First, cycle through the files in a directory and store the file paths in an SQL Server table.
• Second, load the files using this table as the source of the file names, where each separate file load task iterates through the files it has been allocated independently of the other files that will be loaded.






It is important to separate out the two parts, and first to attribute process numbers to the source files so that each branch of the process flow only loads its own set of files, as each process is separate from every other process. You can think of it as giving “flags” to each file to tell it which path to follow.

You can use an ADO.NET recordset as part of this technique, as was described previously in Recipe 13-4, but I prefer to introduce another method, which is to use a persisted database table. This seems preferable because using a table allows you to store the list on disk, and can provide a basis for logging and batch processing, as will be described in Recipe 13-12.

We have to be clear about exactly what this package will achieve. It will simply divide all the source files into batches (here I will use four, each one of which will be a separate parallel process) and process them all. It will not attempt to perform any load balancing or process ordering. This approach therefore works well when all the files are much the same size, and you are not looking for optimal sequencing. It is, however, alluring in its simplicity, both conceptually and practically.

The technique used to define which file is attributed to which process is to use an SQL Server Identity column, and then apply “% 4” (the modulo operator) to produce the remainders 0 to 3, which are held in a separate ProcessNumber column for each file in turn. In this example, the process thread attribution is defined as a calculated column, to avoid extra steps. The table of file names and batch numbers is then used as the source list for the parallel load. Of course, you do not need to use four parallel processes; you can use more or fewer. Just how many processes are required for an optimal data load could require considerable testing on your part in a configuration identical to your production environment.
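
The table DDL from step 1 is not reproduced in this section, so the following is only a minimal sketch of how an Identity column and a “% 4” calculated column could work together. The table name dbo.SimpleParallelLoad_Sketch and the NVARCHAR(250) column size are assumptions made for illustration, chosen so that the real control table is left untouched.

```sql
USE CarSales_Staging;
GO

-- Hypothetical stand-in for the control table created in step 1
CREATE TABLE dbo.SimpleParallelLoad_Sketch
(
    ID            INT IDENTITY(1,1) NOT NULL,
    FileName      NVARCHAR(250) NULL,        -- column size is an assumption
    ProcessNumber AS (ID % 4)                -- calculated column: remainder 0, 1, 2 or 3
);
GO

-- Add five illustrative file names; the Identity column numbers them 1 to 5
INSERT INTO dbo.SimpleParallelLoad_Sketch (FileName)
VALUES (N'Stock01.Csv'), (N'Stock02.Csv'), (N'Stock03.Csv'),
       (N'Stock04.Csv'), (N'Stock05.Csv');

-- IDs 1 to 5 give ProcessNumber values 1, 2, 3, 0 and 1
SELECT ID, FileName, ProcessNumber
FROM   dbo.SimpleParallelLoad_Sketch
ORDER BY ID;

-- Tidy up the sketch table
DROP TABLE dbo.SimpleParallelLoad_Sketch;
```

To use a different number of parallel processes, the divisor in the calculated column would change accordingly, along with the number of branches in the package.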

By way of extending the functionality, I will add a tweak to reuse (or truncate) the list of files to process, which will allow reprocessing a defined list of files, if the source data changes. This is what the CreateList variable is used for. When the package is loaded, this variable is set to True, so the initial list is truncated in the SimpleParallelLoad SQL Server table, then repopulated. After the table is populated, the variable is set to False, so the table is persisted, unless you need to re-create it. This allows you to pass in the variable from the command line or from an SQL Server Agent job, and control re-processing.

If this process seems rather complex, then I suggest that you take a good look at the screen capture of the whole job that you will find in Figure 13-16. This will hopefully make things somewhat clearer.



■ Note It is essential to check the Tablock check box in the OLE DB destination, as this will allow parallel bulk loads to take place. Failing to do this will slow down the load considerably. Generally, the destination database must not be in the FULL recovery model for efficient parallel loads. Also, there should not be any nonclustered indexes in place. This is discussed in greater depth in Recipe 14-10.
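
Before running the package, it may be worth checking the two conditions mentioned in this note. The following queries are a minimal sketch using the standard catalog views sys.databases and sys.indexes. The database and table names are those used in this recipe, so adjust them to your own environment.

```sql
USE CarSales_Staging;
GO

-- Recovery model of the destination database (FULL, BULK_LOGGED or SIMPLE)
SELECT name, recovery_model_desc
FROM   sys.databases
WHERE  name = N'CarSales_Staging';

-- Any nonclustered indexes currently defined on the destination table
SELECT i.name AS IndexName, i.type_desc
FROM   sys.indexes AS i
WHERE  i.object_id = OBJECT_ID(N'dbo.ParallelStock')
  AND  i.type_desc = N'NONCLUSTERED';
```

If the second query returns no rows, no nonclustered indexes are defined on the destination table.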



Hints, Tips, and Traps





• Always test the time taken to perform parallel processing, and never presume that it will automatically be faster. Test, test, and re-test!

• The Sequence container that contains the tasks to isolate the list of files to process is not strictly necessary, but it helps to visually isolate this initial part of the process.

• You can use a stored procedure rather than SQL text in the Script task. Although this is slightly more complex, it is considered better coding practice. In this case, use sqlCommand.CommandType = CommandType.StoredProcedure, set the name of the stored procedure as the command text of the SqlCommand, and add any parameters to its Parameters collection.

• For simplicity’s sake, in this recipe I have placed the SimpleParallelLoad control table in the same database as the one in which the data is loaded. In a production environment, you might want to place it in a separate logging and monitoring database.

• What is the optimal number of parallel load tasks? This is a good question, and one that necessitates judicious use of the hallowed answer, “It depends.” Generally, “no more than there are free processor cores, and not so many that you are choking the I/O subsystem,” is a more accurate answer. So once again, it is best to test and measure on a test system that is identical to your production environment before attempting a production run.
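
To illustrate the stored procedure hint above, here is a minimal sketch of a procedure that the Script task could call in place of the inline INSERT. The procedure name dbo.pr_AddFileToLoad and the parameter size are hypothetical; the recipe does not supply them.

```sql
USE CarSales_Staging;
GO

-- Hypothetical procedure: adds one file path to the control table
CREATE PROCEDURE dbo.pr_AddFileToLoad
    @FileName NVARCHAR(250)          -- size is an assumption
AS
BEGIN
    SET NOCOUNT ON;

    INSERT INTO dbo.SimpleParallelLoad (FileName)
    VALUES (@FileName);
END;
GO

-- Example call, using one of the sample files from this recipe
EXEC dbo.pr_AddFileToLoad
     @FileName = N'C:\SQL2012DIRecipes\CH13\MultipleFlatFiles\Stock01.Csv';
```

In the Script task, the SqlCommand would then take the procedure name as its command text and receive @FileName through its Parameters collection.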



13-6. Loading Source Files with Load Balancing

Problem

You want to load multiple identically structured files in parallel using the available processors to optimum effect.

Solution
Load the files in parallel using load balancing. Now I’ll show you how to carry out a multiple parallel file load while optimizing time by balancing the load across the available processor cores:

1. Create a table to log the files once they have been loaded. The code to create this is (C:\SQL2012DIRecipes\CH13\tblBulkFilesLoaded.Sql):

CREATE TABLE CarSales_Staging.dbo.BulkFilesLoaded
(
    ID INT IDENTITY(1,1) NOT NULL,
    DateAdded DATETIME NULL DEFAULT GETDATE(),
    FileName NVARCHAR(250) NULL,
    Source TINYINT NULL
) ;
GO
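
As a rough illustration of how this log table might be used once a file has been loaded, the following sketch inserts one row and then lists the most recent entries. The file name is one of the sample files from Recipe 13-5. Treating Source as the number of the process that loaded the file is an assumption, since this section does not define that column.

```sql
USE CarSales_Staging;
GO

-- Log one loaded file; DateAdded is filled in by its GETDATE() default.
-- Assumption: Source holds the number of the parallel process that loaded the file.
INSERT INTO dbo.BulkFilesLoaded (FileName, Source)
VALUES (N'C:\SQL2012DIRecipes\CH13\MultipleFlatFiles\Stock01.Csv', 0);

-- Review the ten most recent log entries
SELECT TOP (10) ID, DateAdded, FileName, Source
FROM   dbo.BulkFilesLoaded
ORDER BY ID DESC;
```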



2. Create a new SSIS package. Add an OLEDB connection manager to the destination database (CarSales_Staging in this example) named CarSales_Staging_OLEDB. Create an ADO.NET connection manager to the CarSales_Staging database named CarSales_Staging_ADONET. This will be used to write to the BulkFilesLoaded table.



3. Add the following variables:

Variable Name   Type     Value                                        Comments
FilePath        String   C:\SQL2012DIRecipes\CH13\MultipleFlatFiles   The path to the source files.
FileType        String   .CSV                                         The source file extension.
MaxFiles        Int32    1000                                         The maximum possible number of files to process.
RecordSet       Object                                                The ADO recordset that holds the list of files to process.
Counter0        Int32                                                 The lowest counter value corresponding to the processor affinity of available processors.
FileName0       String                                                The file name variable mapping to the lowest processor affinity.
...
Counter’n’                                                            The highest counter value corresponding to the highest processor affinity.
FileName’n’                                                           The file name variable mapping to the highest processor affinity.

4. Add an Execute SQL task named Prepare Log Table, and double-click to edit. Configure as follows:

Connection Type:   ADO.NET
Connection:        CarSales_Staging_ADONET
SQL Statement:     TRUNCATE TABLE dbo.BulkFilesLoaded



5. Add a Script task and connect the previous task to it. Set the following variables as read/write: User::FilePath, User::FileType, User::MaxFiles, User::RecordSet.

6. Set the script language to Microsoft Visual C# 2010, and click Edit Script.

7. Add the following reference to the namespaces region:

using System.IO;

8. Replace the Main method with the following code (C:\SQL2012DIRecipes\CH13\LoadBalancing1.cs):



public void Main()
{
    // Create dataset to hold file names
    DataSet ds = new DataSet("ds");
    DataTable dt = ds.Tables.Add("FileList");
    DataColumn IndexID = new DataColumn("IndexID", typeof(System.Int32));
    dt.Columns.Add(IndexID);
    DataColumn FileName = new DataColumn("FileName", typeof(string));
    dt.Columns.Add(FileName);
    DataColumn IsProcessed = new DataColumn("IsProcessed", typeof(Boolean));
    dt.Columns.Add(IsProcessed);

    // Create primary key on IndexID field
    IndexID.Unique = true;
    DataColumn[] pK = new DataColumn[1];


