13-7. Loading Data to Parallel Destinations


Chapter 13 ■ Organising And Optimizing Data Loads


Add a Data Flow task onto the Control Flow pane. Name it Parallel Table Import.

Double-click to edit.


Add an OLEDB source task that you name Stock. Double-click to edit and set the OLEDB connection manager to CarSales_OLEDB. Set the data access mode as SQL Command, and enter or build the following query:









,ID % 2 AS ProcessNumber




Click OK to confirm.


Add a Conditional Split task onto the Data Flow pane. Name it Separate out according to process number. Connect the data source task to this new task. Double-click to edit.


Add two outputs named Process0 and Process1. Set the conditions as follows:

ProcessNumber == 0

ProcessNumber == 1

This way, the contents of the ProcessNumber column that you created as part of the

source SQL will be used to direct the data to an appropriate destination. The dialog

box should look like Figure 13-18.




Figure 13-18.  Defining outputs from a Conditional Split task


Add two OLEDB destinations to the Data Flow pane. Name them Process0 and



Connect the Conditional Split task to the Process0 destination. Select Process0 as the

output from the Input Output Selection dialog box (see Figure 13-19).




Figure 13-19.  Selecting an output from a Conditional Split task


Double-click the Process0 OLEDB destination task, and configure it as follows:

OLEDB Connection Manager:


Name of Table or View:


Keep Identity:


Table Lock:



Click OK to confirm.


Repeat steps 7 through 9 for each destination task.


You can now run the package, which should look like Figure 13-20.

Figure 13-20.  The final package for parallel destination loads




How It Works

This recipe takes a simple look at loading data in parallel from a single data source. This source can be any

database table that can be read using an OLEDB or an ADO.NET data source. As there are a few minor variations

on this idea, I show the core method first, and then explain a couple of extensions to this technique in the next

couple of recipes.

In all these examples I am only creating two parallel loads. Of course, you can extend this to the number best suited to your requirements and to your system. As ever, there are no hard and fast rules as to how many parallel loads to run for optimal performance. You will have to test and tweak for the best results.

Let us start with a source data table. It can be a SQL Server source, or indeed any data source that SSIS can connect to. The classic sources other than SQL Server are the main relational databases on the market. I presume that you are faced with a single source table and a single destination table. As part of the initial SELECT clause, a "process path" identifier is generated. In this case, it is derived from the ID using the SQL Server modulo operator (%). Provided that the ID values are non-negative, this guarantees that each source record is assigned either a 0 or a 1 as a flag to be used by the Conditional Split task (in SQL Server, a negative odd ID returns -1). This allows the load to be split into two, even for a single destination table. However, you must be using the Bulk Insert API (Table or view - fast load), and you have to select Table Lock to be sure that the parallel inserts work.
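The routing logic can be pictured outside SSIS with a minimal Python sketch. It is an illustration only and not part of the package: the row dictionaries, the tsql_mod helper, and the Default output name are assumptions made for this example. The helper mimics the way the SQL Server % operator keeps the sign of the dividend, so a negative odd ID produces -1 and matches neither condition.

```python
from collections import defaultdict


def tsql_mod(a: int, b: int) -> int:
    """Mimic the T-SQL % operator: the result takes the sign of the dividend."""
    r = abs(a) % abs(b)
    return -r if a < 0 else r


def conditional_split(rows, paths=2):
    """Route each row to an output named after its ProcessNumber.

    Rows whose ProcessNumber matches no condition go to a hypothetical
    'Default' output, much as unmatched rows do in a Conditional Split.
    """
    outputs = defaultdict(list)
    for row in rows:
        process_number = tsql_mod(row["ID"], paths)
        if 0 <= process_number < paths:
            outputs[f"Process{process_number}"].append(row)
        else:
            outputs["Default"].append(row)
    return dict(outputs)


# Seven sample rows with IDs 1 to 7 (made-up data)
rows = [{"ID": i} for i in range(1, 8)]
result = conditional_split(rows)
```

Here the even IDs end up in Process0 and the odd IDs in Process1. Raising the paths argument gives a rough picture of what more than two parallel destinations would receive.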

Hints, Tips, and Traps

When using SQL to generate the ProcessNumber, it can be a good idea to preview the

results before running a vast import.

Remember that you can always perform a simple load of the ProcessNumber column only, and then count the number of rows for each value to check that you are getting a reasonably balanced distribution.
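As a rough illustration of that balance check, the following Python sketch counts how many IDs fall on each process number. The function name and the two ID ranges are made up for the example, and non-negative IDs are assumed. The second range imitates an ID column that increases in steps of 2, which is one way a modulo split can end up badly skewed.

```python
from collections import Counter


def rows_per_process(ids, paths=2):
    """Count how many IDs land on each process number (non-negative IDs assumed)."""
    return Counter(i % paths for i in ids)


# Consecutive IDs 1..1000 split evenly between the two paths
even_spread = rows_per_process(range(1, 1001))

# IDs 1, 3, 5, ... 999 are all odd, so every row lands on path 1
skewed = rows_per_process(range(1, 1001, 2))
```

If the counts look like the second case, a different column or a hash of several columns may give a better spread.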

There may be times when you need to calculate the ProcessNumber field without having

a nice, simple unique ID field as a starting point, as was the case in this example. So, try

using code like the following as part of the SELECT statement:

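,ABS(CAST(HashBytes('SHA1', CAST(Make AS VARCHAR(20)) + Marque) AS INT) % 2) AS ProcessNumber

This code creates a hash from one or more fields, which is then used to derive the ProcessNumber. The modulo is applied before ABS because ABS raises an arithmetic overflow error if the hash happens to convert to the smallest possible INT value.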
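To see how a hash of two columns becomes a process number, here is a minimal Python sketch that approximates the T-SQL expression. It assumes that casting the 20-byte SHA1 result to INT keeps the rightmost four bytes as a signed big-endian integer, and that the text is encoded in a single-byte code page; the function name and the sample values are hypothetical. Python integers do not overflow, so the order of abs and modulo does not matter here.

```python
import hashlib


def process_number(make: str, marque: str, paths: int = 2) -> int:
    """Approximate the hash-based ProcessNumber calculation."""
    # CAST(Make AS VARCHAR(20)) truncates the value to 20 characters
    text = make[:20] + marque
    digest = hashlib.sha1(text.encode("latin-1")).digest()
    # Keep the rightmost 4 bytes, read as a signed big-endian integer
    as_int = int.from_bytes(digest[-4:], byteorder="big", signed=True)
    return abs(as_int) % paths


# Made-up sample values for the Make and Marque columns
sample = [("Aston Martin", "DB9"), ("Jaguar", "XK"), ("Rolls Royce", "Ghost")]
numbers = [process_number(make, marque) for make, marque in sample]
```

The same input always maps to the same process number, so rows with identical Make and Marque values always travel down the same path.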


There are similar ways for generating the process number from the pass-through SQL,

which you can use if you are connecting to a database other than SQL Server. This will

depend on the flavor of the SQL used by the database, however, and so I can only refer

you to the documentation for the particular database that you are using.

13-8. Using a Single Data File As a Multiple Data Source for Parallel Destination Loads


You want to load data quickly from a single flat file and have multiple processors available.





Enable the source file to be read by multiple source tasks, and consequently allow efficient parallel loading into

the destination table, as follows.


Create a new SSIS package, and add an OLEDB connection manager to the

destination database (CarSales_Staging in this example), named



Right-click in the Connection Managers tab and select New Flat File Connection.

Name the connection manager Stock.


Click the Browse . . . button, and navigate to the directory containing the file to load

(C:\SQL2012DIRecipes\CH13\MultipleFlatFiles). Select Stock01.Csv (in this

example). Ensure that all the relevant configuration information is entered (SSIS

should guess all of this correctly).


Add a new Data Flow task and double-click to edit.


Add a Flat File Source task and configure it to use the Flat File Connection named Stock that you created in step 2.


Add a Script component to the Data Flow pane, and set its type to Transformation. Connect the Flat File source task Stock to this Script component, which I suggest naming Generate Hash.


Edit the Script component, select input Columns on the left pane, and select:



Or the columns that you wish to use to generate a hash key. The dialog box should

look like Figure 13-21.


