13-10. Inserting Records in Parallel and in Bulk


Chapter 13 ■ Organising And Optimizing Data Loads



2. Add an OLEDB connection manager named CarSales_Staging_OLEDB, configured to connect to the destination database (CarSales_Staging).



3. Add a Flat File connection manager named BulkLoad and configure it to read the source file (C:\SQL2012DIRecipes\CH13\BulkStock.Csv).



4. Create the CarSales_Staging.dbo.BulkStock destination table using the following DDL (C:\SQL2012DIRecipes\CH13\tblBulkStock.Sql):

CREATE TABLE CarSales_Staging.dbo.BulkStock
(
    ID BIGINT NULL,
    Make VARCHAR(50) NULL,
    Marque NVARCHAR(50) NULL,
    Model VARCHAR(50) NULL
) ;
GO



5. Assuming that you are loading data into a staging table, add an Execute SQL task to the Control Flow pane, name it Prepare destination table, and configure it as follows:

    Connection Type:  OLEDB
    Connection:       CarSales_Staging_OLEDB
    SQL Statement:    TRUNCATE TABLE dbo.BulkStock



6. Add a Sequence container to the Control Flow pane. Connect the previous task (“Prepare destination table”) to it.



7. Add as many Bulk Insert tasks inside the Sequence container as you have available processor cores in your server. Configure each one (using the Connection pane) to read data using the BulkLoad source connection manager and to write to the appropriate destination table, as shown in Figure 13-23.






Figure 13-23.  The Bulk Insert task

8. Double-click the first of the Bulk Insert tasks, and click Expressions. Expand Expressions in the right-hand pane, and click the Ellipsis button. Select the LastRow property and set it to the User::RecordRange variable. Confirm the expression with OK, and then confirm all your modifications for the Bulk Insert task with OK.



9. Double-click the second of the Bulk Insert tasks, and click Expressions. Expand Expressions in the right-hand pane, and click the Ellipsis button. Set the following two expressions:

    FirstRow:  @[User::RecordRange] + 1
    LastRow:   (@[User::RecordRange] * 2)

10. Double-click the third of the Bulk Insert tasks, and click Expressions. Expand Expressions in the right-hand pane, and click the Ellipsis button. Set the following two expressions:

    FirstRow:  (@[User::RecordRange] * 2) + 1
    LastRow:   (@[User::RecordRange] * 3)






11. Do the same thing for all the Bulk Insert tasks except the last one, ensuring that you increment the multipliers (the *2 and *3 in the preceding steps) by one for each task.

12. Double-click the last of the Bulk Insert tasks, and click Expressions. Expand Expressions in the right-hand pane, and click the Ellipsis button. Select the FirstRow property and set it to (@[User::RecordRange] * n) + 1, where n is the multiplier. Do not set the LastRow property, as this will default to the last row in the source file. Confirm the expression with OK, and then confirm all your modifications for the Bulk Insert task with OK.

The final package should look like Figure 13-24.



Figure 13-24.  The parallel bulk load package

Now, when you run the package, the source file will be loaded in parallel, using as many of the available processor cores as there are Bulk Insert tasks. Each Bulk Insert task will load a separate set of records.
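The FirstRow and LastRow expressions set in the steps above follow a simple pattern, which can be sketched in Python as a way of checking the arithmetic before typing the expressions into each task. This is only an illustration: the function name bulk_insert_ranges is hypothetical, the first task's FirstRow (left at its default in the recipe) is represented here as 1, and None stands for a LastRow that is deliberately left unset.

```python
from typing import List, Optional, Tuple


def bulk_insert_ranges(record_range: int, task_count: int) -> List[Tuple[int, Optional[int]]]:
    """Return one (FirstRow, LastRow) pair per Bulk Insert task.

    Mirrors the recipe's expressions: task k (counting from 0) starts at
    record_range * k + 1 and ends at record_range * (k + 1).
    The last task gets LastRow = None, meaning "read to the end of the file".
    """
    if record_range < 1 or task_count < 1:
        raise ValueError("record_range and task_count must both be at least 1")

    ranges: List[Tuple[int, Optional[int]]] = []
    for k in range(task_count):
        first_row = record_range * k + 1
        is_last_task = k == task_count - 1
        last_row = None if is_last_task else record_range * (k + 1)
        ranges.append((first_row, last_row))
    return ranges


if __name__ == "__main__":
    # Four tasks with a RecordRange of 1,000,000 rows each.
    for first_row, last_row in bulk_insert_ranges(1_000_000, 4):
        print(first_row, last_row)
```

With a RecordRange of 1,000,000 and four tasks, the pairs are (1, 1000000), (1000001, 2000000), (2000001, 3000000), and (3000001, None), so no row is claimed by two tasks.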



How It Works

There are occasions when you will have a single, extremely large source file and a very restricted time window in which to load the data. This is when you need to be able to load batches of data from the same file in parallel. Fortunately, the Bulk Insert task can do exactly this, as it allows you to specify both the first and the last record to load from a source file. So if you know (approximately) how many records there are likely to be in a source file, as well as how many parallel processes you are running, then you can use SSIS expressions to define these parameters for each parallel load, and perform a parallel bulk insert from a single source file.
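To see why an approximate record count is usually good enough, the following Python sketch derives a record range from an estimated row count and then works out how many rows each task would load if the file turned out to hold a different number of rows. It is arithmetic only, under stated assumptions: both function names are hypothetical, rows are numbered from 1 with no header row, and the behavior of the Bulk Insert task itself when FirstRow lies beyond the end of the file is not modeled.

```python
from typing import List


def record_range_for(estimated_rows: int, task_count: int) -> int:
    """Rows per task, rounded down, so the open-ended last task absorbs the remainder."""
    if estimated_rows < 1 or task_count < 1:
        raise ValueError("estimated_rows and task_count must both be at least 1")
    return max(1, estimated_rows // task_count)


def rows_loaded_per_task(actual_rows: int, record_range: int, task_count: int) -> List[int]:
    """Rows each task would cover if the file really contains actual_rows rows."""
    counts: List[int] = []
    for k in range(task_count):
        first_row = record_range * k + 1
        if k < task_count - 1:
            # Bounded task: stops at its LastRow or at the end of the file.
            last_row = min(record_range * (k + 1), actual_rows)
        else:
            # Last task: LastRow is unset, so it runs to the end of the file.
            last_row = actual_rows
        counts.append(max(0, last_row - first_row + 1))
    return counts


if __name__ == "__main__":
    record_range = record_range_for(4_000_000, 4)
    print(rows_loaded_per_task(4_300_000, record_range, 4))
    print(rows_loaded_per_task(2_500_000, record_range, 4))
```

When the estimate is too low, every row is still covered, but the last task carries the surplus. When it is too high, the later tasks have little or nothing to do. Either way the load becomes unbalanced rather than incorrect, which is the cost of working from an estimate.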

You can count the records in the source file to get an exact figure to use when defining the subsets of records to load, but the time taken to get an exact figure is likely to render the operation pointless in most cases, so an approximate figure should suffice. Should the file sizes vary considerably, you may have no choice but to get the real row count. To do this, create a Data Flow task that uses a Flat File source. Select only one of the available columns, and connect the source to a Row Count transform, configured with the FileCounter variable as the row count variable. As you test your package, you will be able to get an approximate record count from the source file, and place this figure in the FileCounter variable.
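Outside SSIS, a rough row count can also be obtained by streaming the file and counting line terminators. The sketch below is one way to do that in Python; it is not part of the recipe. The function name count_rows is hypothetical, and it assumes one record per line, with no header row and no line breaks embedded inside quoted fields.

```python
def count_rows(path: str, chunk_size: int = 1024 * 1024) -> int:
    """Count records in a text file by streaming it in binary chunks.

    Only newline bytes are counted, so the file is never held in memory
    in full. A final record without a trailing newline is counted too.
    """
    rows = 0
    last_byte = b""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            rows += chunk.count(b"\n")
            last_byte = chunk[-1:]
    # The file does not end with a newline: the last line is still a record.
    if last_byte and last_byte != b"\n":
        rows += 1
    return rows
```

Reading in chunks keeps memory use flat however large the file is, but the whole file is still read once, which is exactly the cost that makes an exact count unattractive for very large files.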



Note: When I refer to “available” processor cores, I mean those that can be used for the ETL process that you are building. If you have the luxury of a server dedicated to SSIS, and the process that you are designing will have exclusive access to the server resources, then defining the cores that you can use is easy: it is all of them. If the server will be used for other processes at the same time as your load process is running, then you will have to decide just how many cores you want SSIS to use.
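As a rough illustration of that decision, the Python sketch below subtracts the cores set aside for other work from the total and never returns fewer than one task. The function name bulk_insert_task_count and the idea of "reserving" cores are assumptions made for the example. Note also that os.cpu_count() reports the machine running the script, which may not be the SSIS server, so the total can be passed in explicitly.

```python
import os
from typing import Optional


def bulk_insert_task_count(reserved_cores: int = 0, total_cores: Optional[int] = None) -> int:
    """Suggest how many parallel Bulk Insert tasks to create.

    total_cores defaults to the core count of the current machine;
    reserved_cores is the number kept free for other processes.
    """
    if reserved_cores < 0:
        raise ValueError("reserved_cores cannot be negative")
    if total_cores is None:
        # os.cpu_count() can return None, so fall back to a single core.
        total_cores = os.cpu_count() or 1
    return max(1, total_cores - reserved_cores)
```

For example, an 8-core server that must leave 2 cores for other work would give 6 tasks.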



Hints, Tips, and Traps





• Creating more Bulk Insert tasks than there are available processor cores can be counterproductive, because the tasks then compete for cores, and the resulting switching and waits can slow the process down considerably.

• You can adjust the various configuration options available for Bulk Inserts as you see fit. They are explained in Recipe 5-3.

• Unfortunately, the Bulk Insert task offers few logging, counter, or error trapping options.



13-11. Creating Self-Optimizing Parallel Bulk Inserts

Problem

You want load balancing for parallel bulk inserts. You want the process to adjust to the number of records to load based on the last load job.



Solution

Extend the package created in Recipe 13-10 to count and store the record count for a data load. This then becomes the basis for the record counter used to calculate a balanced load in subsequent loads. The package from Recipe 13-10 is available at C:\SQL2012DIRecipes\CH13\13_10.Dtsx.

Taking the package as a basis, perform the following steps.

1. Create the SSISVariables SQL Server table, used to store SSIS variables, using the following DDL (C:\SQL2012DIRecipes\CH13\tblSSISVariables.Sql):

CREATE TABLE CarSales_Staging.dbo.SSISVariables
(
    ID INT IDENTITY(1,1) NOT NULL,
    SSISPackageName NVARCHAR (50) NULL,
    SSISVarName NVARCHAR (50) NULL,
    SSISVarValue NVARCHAR (50) NULL,
    LastUpdated DATETIME NULL,
    CONSTRAINT PK_SSISVariables PRIMARY KEY CLUSTERED
    (
        ID ASC
    )
) ;
GO

2. Add one record to the SSISVariables table, using T-SQL like the following:

INSERT INTO dbo.SSISVariables (SSISPackageName, SSISVarName, SSISVarValue)
VALUES ('ParallelBulkInsertFile.dtsx', 'RecordRange', '1000000')



3. Create an ADO.NET connection manager named CarSales_Staging_ADONET, configured to connect to the database where the dbo.SSISVariables table is located (CarSales_Staging).



4. Add a new Execute SQL task named Get Range of Records. Configure it as follows:

    Connection Type:  ADO.NET
    Connection:       CarSales_Staging_ADONET
    SQL Statement:    SELECT @RecordRange = CAST(SSISVarValue AS INT)
                      FROM   dbo.SSISVariables
                      WHERE  SSISVarName = 'RecordRange'

5. Click Parameter Mapping and add a parameter, as follows:

    Variable Name:   User::RecordRange
    Direction:       Output
    Type:            Int64
    Parameter Name:  @RecordRange

Confirm all your modifications.

6. Connect the new task to the existing task, “Prepare destination table”.



7. Add a new Execute SQL task named Redefine Data Ranges. Connect the Sequence container to it. Configure it as follows:

    Connection Type:  ADO.NET
    Connection:       CarSales_Staging_ADONET
    SQL Statement:    UPDATE dbo.SSISVariables
                      SET    SSISVarValue = FLOOR(@RecordRange / @FileCounter)
                      WHERE  SSISVarName = 'RecordRange'
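The store-and-reread pattern used in the steps above can be tried out away from SSIS with a small Python sketch. Everything here is a stand-in: an in-memory SQLite database replaces SQL Server, the column types are simplified, and the helper names are hypothetical. In particular, the new range written back at the end (counted rows divided by the number of tasks) is an assumption made for illustration, not the recipe's own calculation.

```python
import sqlite3


def create_variables_table(conn: sqlite3.Connection) -> None:
    """Create a simplified SQLite version of the SSISVariables table."""
    conn.execute(
        """
        CREATE TABLE SSISVariables (
            ID INTEGER PRIMARY KEY AUTOINCREMENT,
            SSISPackageName TEXT,
            SSISVarName TEXT,
            SSISVarValue TEXT,
            LastUpdated TEXT
        )
        """
    )


def read_record_range(conn: sqlite3.Connection) -> int:
    """Read RecordRange, casting the stored text to an integer."""
    row = conn.execute(
        "SELECT CAST(SSISVarValue AS INTEGER) FROM SSISVariables WHERE SSISVarName = ?",
        ("RecordRange",),
    ).fetchone()
    if row is None:
        raise LookupError("no RecordRange row found in SSISVariables")
    return row[0]


def store_record_range(conn: sqlite3.Connection, new_range: int) -> None:
    """Write a new RecordRange back as text for the next load to pick up."""
    conn.execute(
        "UPDATE SSISVariables SET SSISVarValue = ?, LastUpdated = CURRENT_TIMESTAMP "
        "WHERE SSISVarName = ?",
        (str(new_range), "RecordRange"),
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_variables_table(conn)
    conn.execute(
        "INSERT INTO SSISVariables (SSISPackageName, SSISVarName, SSISVarValue) VALUES (?, ?, ?)",
        ("ParallelBulkInsertFile.dtsx", "RecordRange", "1000000"),
    )
    print(read_record_range(conn))  # value used by the current load

    # Assumption for illustration: 4,300,000 rows were counted across 4 tasks.
    store_record_range(conn, 4_300_000 // 4)
    print(read_record_range(conn))  # value the next load would start from
```

Storing the value as text, as the SSISVariables table does, lets one table hold variables of any type, at the price of a cast each time a value is read.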


