13-6. Loading Source Files with Load Balancing


Chapter 13 ■ Organising and Optimizing Data Loads



Variable Name    Type      Value    Comments

FileName0        String             The file name variable mapping to the lowest processor affinity.

...

Counter’n’                          The highest counter value corresponding to the highest processor affinity.

FileName’n’                         The file name variable mapping to the highest processor affinity.

4. Add an Execute SQL task named Prepare Log Table, and double-click to edit. Configure as follows:

Connection Type:    ADO.NET

Connection:         CarSales_Staging_ADONET

SQL Statement:      TRUNCATE TABLE dbo.BulkFilesLoaded



5. Add a Script task and connect the previous task to it. Set the following variables as read/write: User::FilePath, User::FileType, User::MaxFiles, User::RecordSet.

6. Set the script language to Microsoft Visual C# 2010, and click Edit Script.

7. Add the following using directive to the namespaces region:

using System.IO;

8. Replace the Main method with the following code (C:\SQL2012DIRecipes\CH13\LoadBalancing1.cs):



public void Main()
{
    // Create dataset to hold file names
    DataSet ds = new DataSet("ds");
    DataTable dt = ds.Tables.Add("FileList");
    DataColumn IndexID = new DataColumn("IndexID", typeof(System.Int32));
    dt.Columns.Add(IndexID);
    DataColumn FileName = new DataColumn("FileName", typeof(string));
    dt.Columns.Add(FileName);
    DataColumn IsProcessed = new DataColumn("IsProcessed", typeof(Boolean));
    dt.Columns.Add(IsProcessed);

    // Create primary key on IndexID field
    IndexID.Unique = true;
    DataColumn[] pK = new DataColumn[1];
    pK[0] = IndexID;
    dt.PrimaryKey = pK;

    // List the files in the source folder that match the file-type pattern
    DirectoryInfo di = new DirectoryInfo(Dts.Variables["FilePath"].Value.ToString());
    FileInfo[] filesToLoad = di.GetFiles(Dts.Variables["FileType"].Value.ToString());

    DataRow rw = null;
    Int32 MaxFiles = 0;

    // Add one row per file, numbered from 1 and flagged as not yet processed
    foreach (FileInfo fi in filesToLoad)
    {
        rw = dt.NewRow();
        rw["IndexID"] = MaxFiles + 1;
        rw["FileName"] = fi.Name;
        rw["IsProcessed"] = false;
        dt.Rows.Add(rw);
        MaxFiles += 1;
    }

    // Pass the file count and the file list back to the package variables
    Dts.Variables["User::MaxFiles"].Value = MaxFiles;
    Dts.Variables["User::RecordSet"].Value = dt;

    Dts.TaskResult = (int)ScriptResults.Success;
}

9. Close the Script window. Confirm your changes to the Script task with OK.

10. Add a For Loop container to the Control Flow pane and connect the Script task to it. Name it Container 0. Double-click to edit and set the For Loop properties as follows:

InitExpression:      @Counter0 = 0

EvalExpression:      @Counter0 <= @MaxFiles

AssignExpression:    @Counter0 = @Counter0 + 1



11. Add a Script task inside the For Loop container, name it Get next available file name, and set the following variables as read/write: User::FileName0, User::Counter0.

12. Set the Script language as Microsoft Visual C# 2010, and click Edit Script.

13. Add the following using directives to the namespaces region:

using System.Data.OleDb;
using System.Xml;
using System.Threading;






14. Set the Main method as follows:

public void Main()
{
    DataTable dT = new DataTable();
    DataRow[] matchingRows;
    String fileName = "";
    Variables vars = null;
    Variables varsMax = null;

PollLabel:

    try
    {
        // Take an exclusive lock on the shared file list
        Dts.VariableDispenser.LockOneForWrite("User::RecordSet", ref vars);

        dT = (DataTable)vars[0].Value;

        // Unprocessed files, lowest IndexID first
        matchingRows = dT.Select("IsProcessed = 0", "IndexID ASC");

        int numberOfRows = matchingRows.GetLength(0);

        if (numberOfRows != 0)
        {
            // Claim the first available file and flag it as taken
            fileName = matchingRows[0]["FileName"].ToString();
            matchingRows[0]["IsProcessed"] = true;

            Dts.Variables["FileName0"].Value = fileName;
            vars[0].Value = dT;
        }
        else
        {
            // No files left: push the counter to MaxFiles so that the loop ends
            Dts.VariableDispenser.LockOneForRead("User::MaxFiles", ref varsMax);
            Dts.Variables["Counter0"].Value = varsMax[0].Value;
            varsMax.Unlock();
        }

        vars.Unlock();
    }
    catch
    {
        // The list is probably locked by another load path: release any lock
        // still held here, wait a random interval, and try again
        if (vars != null && vars.Locked)
        {
            vars.Unlock();
        }
        System.Random RandomNumber = new System.Random();
        Thread.Sleep(RandomNumber.Next(200, 800));
        goto PollLabel;
    }

    Dts.TaskResult = (int)ScriptResults.Success;
}






15. Add a Data Flow task in the For Loop container, and connect the Script task to it. Double-click the precedence constraint (the green arrow), and set the constraint options as follows:

Evaluation Operation:    Expression and Constraint

Value:                   Success

Expression:              @Counter0 != @MaxFiles

Logical And:             All constraints must evaluate to True



16. Confirm your modifications with OK, and configure the actual file load using the Data Flow task. I will not describe this here, as it has been covered exhaustively in other recipes, particularly in Recipe 13-1.

17. Add an Execute SQL task in the For Loop container, and connect the Data Flow task to it. Configure as follows:

Connection Type:    ADO.NET

Connection:         CarSales_Staging_ADONET

SQL Statement:

INSERT INTO dbo.BulkFilesLoaded
    (Filename, Source)
VALUES (@FileName0, 0)



18. Click Parameter Mapping on the left, and add the following parameter:

Variable Name      Direction    Data Type    Parameter Name

User::FileName0    Input        String       @FileName0



19. Confirm your modifications with OK.

20. Repeat steps 10 through 19 for each parallel load that you wish to add. You will have to alter every reference to FileName0 to become FileName’n’ (where ’n’ is the number of the process). Do the same for Counter0. Remember that this applies not just to the script code, but also to the variable names, to the variables used in the For Loop properties in step 10, and to the precedence constraint in step 15. Your package should look like that in Figure 13-17.






Figure 13-17.  The process flow for the load balancing package



How It Works

The previous recipe loaded multiple sets of files in parallel, but it made no attempt to balance the load process. It presumed that all the files were of roughly similar size, and that all the processing cores to which they were assigned could work to their full potential without having to switch to other processes, which would slow down the file load.

While such an approach can solve many load requirements, there will inevitably be cases where you will

need to implement an approach that can balance the load. This means that each processing core will be able to

take the next file to process when it becomes available, without pre-assigning files to a specific SSIS path—and

consequently pre-assigning them to a processor or core.

Balancing a load across the available processors (or processing cores) is somewhat more complex, but it has the advantage of processing all source files as fast as the hardware will allow. Indeed, the only tricky part of this process is to isolate the list of files to load, and then to ensure that (a) there are no conflicts when file names are taken from the list, (b) no file is loaded twice, and (c) there are no deadlocks when file names in the list are updated as having been loaded. My preferred solution to this problem is to use an ADO.NET DataTable to hold the list of files, and to let SSIS handle this as an object variable. Because the list is held in memory, and each load path locks it only briefly while taking a file name, this is as near a guarantee as you can get that there will not be simultaneous read/write access to the list of files (and consequently locking issues), as can be the case if you store it in an SQL Server table.

As with most approaches to parallel processing, I advise you to set up a load path only when there is an available processor core. In this recipe, therefore, if your server has eight cores that SSIS can use for the process, you should define eight Counter variables and eight FileName variables, numbered 0 through 7.

Oh, and just for once, I have scripted this one in C#. I am conscious of the fact that many SQL Server

developers seem to tend toward using VB.NET, but do not want to exclude those who prefer C#. After all, it has

been around for scripting in SSIS since SQL Server 2008 appeared.






The code works like this:

The first Script task, “Create object variable of files to process,” does two things:

• First, it creates a dataset and datatable to hold the list of files to process.

• Second, it populates this datatable with the names of the files to process.

The first half of the code creates a dataset, and then creates a datatable containing the required columns (IndexID, FileName, IsProcessed). Then, using the file path and the file-type pattern, it loops through the names of all the requisite files in the source directory and adds them to the FileName column of the datatable. Finally, the datatable is passed out to the RecordSet SSIS variable, and the number of files found is passed out to the MaxFiles variable.

The Script task inside the For Loop container (which is duplicated for every parallel load) reads the datatable from the RecordSet SSIS object variable. To do this it first locks the variable, very briefly, and filters the datatable to return only files that have not been processed. It then takes the first available file name, sets the IsProcessed flag to True, and unlocks the variable. There is also a simple conflict-detection process that causes the task to sleep for a random interval (roughly 200 to 800 milliseconds) should the variable be locked by another of the load processes. It is worth noting that the SSIS object variable has to be cast to a DataTable when it is read into the script, but that no cast is needed when the datatable is assigned back to the SSIS object variable.
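The claim-and-flag logic described above can be tried outside SSIS. What follows is only a minimal sketch, not the recipe's code: it assumes a plain C# console program, and it swaps the SSIS variable lock for an ordinary `lock` statement on a hypothetical `Gate` object. The class and method names (`FileQueueSketch`, `BuildFileList`, `ClaimNextFile`, `RunWorkers`) are invented for the illustration; only the three column names come from the recipe.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Threading;

public static class FileQueueSketch
{
    // Hypothetical stand-in for the SSIS variable lock (assumption, not SSIS API).
    private static readonly object Gate = new object();

    // Builds a table with the same three columns that the recipe uses.
    public static DataTable BuildFileList(IEnumerable<string> fileNames)
    {
        var dt = new DataTable("FileList");
        dt.Columns.Add("IndexID", typeof(int));
        dt.Columns.Add("FileName", typeof(string));
        dt.Columns.Add("IsProcessed", typeof(bool));
        int index = 0;
        foreach (string name in fileNames)
        {
            index++;
            dt.Rows.Add(index, name, false);
        }
        return dt;
    }

    // Returns the next unprocessed file name and flags it, or null when none remain.
    public static string ClaimNextFile(DataTable dt)
    {
        lock (Gate)
        {
            DataRow[] pending = dt.Select("IsProcessed = false", "IndexID ASC");
            if (pending.Length == 0) return null;
            pending[0]["IsProcessed"] = true;
            return (string)pending[0]["FileName"];
        }
    }

    // Runs several workers against one list and returns every name that was claimed.
    public static List<string> RunWorkers(DataTable dt, int workerCount)
    {
        var claimed = new List<string>();
        var workers = new List<Thread>();
        for (int i = 0; i < workerCount; i++)
        {
            var worker = new Thread(() =>
            {
                string name;
                while ((name = ClaimNextFile(dt)) != null)
                {
                    lock (claimed) { claimed.Add(name); }
                }
            });
            workers.Add(worker);
            worker.Start();
        }
        foreach (Thread worker in workers) worker.Join();
        return claimed;
    }

    public static void Main()
    {
        DataTable dt = BuildFileList(new[] { "a.csv", "b.csv", "c.csv", "d.csv", "e.csv" });
        List<string> claimed = RunWorkers(dt, 3);
        Console.WriteLine("Files claimed: " + claimed.Count);
    }
}
```

Because the filter and the flag update happen inside one lock, each file name is handed to exactly one worker, whichever worker happens to ask first. That is the property the recipe relies on to balance the load.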



Hints, Tips, and Traps





• It is possible to use an SQL Server table to store the list of files to load, and to update each one once it has been loaded, but handling read/write conflicts, and ensuring that no file is loaded twice, can be a little tricky, in my experience. Using an in-memory object allows for such read and write speeds that conflicts can be avoided. At least this has been the case in the systems in which I have used this particular approach.

• It is important to use the LockOneForWrite approach for the variables that the parallel scripts share. If you instead lock variables that require frequent, virtually simultaneous access at the Script task level (by listing them as read/write variables), you will inevitably cause contention at some point during package execution, and this will cause the whole package to fail.

• The MaxFiles variable is used as a safeguard to prevent infinite loops when processing files from the in-memory dataset. If there are no available files in the datatable, the script sets the loop counter to MaxFiles, so the For Loop container ends after that iteration.
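To see how the MaxFiles safeguard ends a load path, the loop can be simulated in a few lines. This is a hedged, single-threaded sketch rather than SSIS code: the `for` statement mirrors the three For Loop expressions from step 10, the `if` mirrors the precedence constraint from step 15, and the `Queue<string>` is an assumed stand-in for the shared datatable. The names `LoopTerminationSketch` and `RunContainer` are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class LoopTerminationSketch
{
    // Simulates one For Loop container and returns the files it "loaded".
    public static List<string> RunContainer(Queue<string> pending, int maxFiles)
    {
        var loaded = new List<string>();

        // InitExpression, EvalExpression and AssignExpression from step 10.
        for (int counter = 0; counter <= maxFiles; counter = counter + 1)
        {
            // Script task: take the next file, or push the counter to MaxFiles.
            string fileName = null;
            if (pending.Count > 0)
            {
                fileName = pending.Dequeue();
            }
            else
            {
                counter = maxFiles;
            }

            // Precedence constraint from step 15: @Counter0 != @MaxFiles.
            if (counter != maxFiles)
            {
                loaded.Add(fileName);
            }
        }
        return loaded;
    }

    public static void Main()
    {
        var pending = new Queue<string>(new[] { "a.csv", "b.csv", "c.csv" });
        List<string> loaded = RunContainer(pending, 3);
        Console.WriteLine("Loaded " + loaded.Count + " files");
    }
}
```

When the list is empty the counter jumps to MaxFiles, the data flow is skipped, and the next increment takes the counter past MaxFiles, so the container stops even if other load paths took most of the files.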



13-7. Loading Data to Parallel Destinations

Problem

You want to quickly load data from a single source table on a multiprocessor server.



Solution

Read the data from the source table then split the source into two separate data flows. Use parallel destination

loading to accelerate the load part of the process. To load an SQL Server table using parallel loading:

1. Create a new SSIS package, and name it SingleSourceParallelProcessing. Add two connection managers, both OLE DB. The first will connect to the source database from which you will load the data (CarSales in this example). The second will connect to the destination database (CarSales_Staging, here). I will name them CarSales_OLEDB and CarSales_Staging_OLEDB, respectively.






2. Add a Data Flow task onto the Control Flow pane. Name it Parallel Table Import. Double-click to edit.

3. Add an OLE DB source that you name Stock. Double-click to edit and set the OLE DB connection manager to CarSales_OLEDB. Set the data access mode as SQL Command, and enter or build the following query (C:\SQL2012DIRecipes\CH13\StockToParallel.Sql):

SELECT
     ID
    ,Make
    ,Marque
    ,Model
    ,Registration_Date
    ,Mileage
    ,ID % 2 AS ProcessNumber
FROM
    dbo.Stock



4. Click OK to confirm.

5. Add a Conditional Split transform onto the Data Flow pane. Name it Separate out according to process number. Connect the data source to this new transform. Double-click to edit.

6. Add two outputs named Process0 and Process1. Set the conditions as follows:

Process0:    ProcessNumber == 0

Process1:    ProcessNumber == 1

This way, the contents of the ProcessNumber column that you created as part of the source SQL will be used to direct the data to an appropriate destination. The dialog box should look like Figure 13-18.
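The effect of the ID % 2 expression and the two split conditions can be checked with a small stand-alone sketch. It is an illustration only: the class and method names (`ModuloSplitSketch`, `Split`) are hypothetical, the IDs are made-up sample values, and it assumes that IDs are non-negative, since a negative ID would give a negative remainder that matches neither output.

```csharp
using System;
using System.Collections.Generic;

public static class ModuloSplitSketch
{
    // Groups IDs by ID % processCount, as the ProcessNumber column does for two outputs.
    // Assumes non-negative IDs (a negative ID yields a negative remainder).
    public static Dictionary<int, List<int>> Split(IEnumerable<int> ids, int processCount)
    {
        var outputs = new Dictionary<int, List<int>>();
        for (int p = 0; p < processCount; p++)
        {
            outputs[p] = new List<int>();
        }
        foreach (int id in ids)
        {
            outputs[id % processCount].Add(id);
        }
        return outputs;
    }

    public static void Main()
    {
        var ids = new List<int>();
        for (int id = 1; id <= 10; id++) ids.Add(id);

        Dictionary<int, List<int>> outputs = Split(ids, 2);
        Console.WriteLine("Process0: " + string.Join(",", outputs[0]));
        Console.WriteLine("Process1: " + string.Join(",", outputs[1]));
    }
}
```

With consecutive IDs the two outputs receive almost equal row counts. If the IDs have gaps or are skewed toward odd or even values, the split will be correspondingly uneven.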


