13-8. Using a Single Data File As a Multiple Data Source for Parallel Destination Loads


Chapter 13 ■ Organising and Optimizing Data Loads



Solution

Enable the source file to be read by multiple source tasks, and consequently allow efficient parallel loading into the destination table, as follows.

1. Create a new SSIS package, and add an OLEDB connection manager to the destination database (CarSales_Staging in this example), named CarSales_Staging_OLEDB.



2. Right-click in the Connection Managers tab and select New Flat File Connection. Name the connection manager Stock.



3. Click the Browse... button, and navigate to the directory containing the file to load (C:\SQL2012DIRecipes\CH13\MultipleFlatFiles). Select Stock01.Csv (in this example). Ensure that all the relevant configuration information is entered (SSIS should guess all of this correctly).



4. Add a new Data Flow task and double-click to edit.



5. Add a Flat File source and configure it to use the Flat File connection manager named Stock that you created in step 2.



6. Add a Script component to the Data Flow pane, and set its type to Transformation. Connect the Flat File source to this Script component, which I suggest naming Generate Hash.



7. Edit the Script component, select Input Columns in the left pane, and select:

    • Make
    • Marque

   Or the columns that you wish to use to generate a hash key. The dialog box should look like Figure 13-21.






Figure 13-21.  Selecting input columns for a Script task

8. Select Inputs and Outputs in the left pane, and add an output column (expand Output 0, select Output Columns, and click the Add Column button). Name the output column ProcessNumber. It should preferably be of data type 4-byte signed or unsigned integer. The dialog box will look like Figure 13-22.






Figure 13-22.  Creating an output column for a Script task

9. Select Script in the left pane. Set the script language to Microsoft Visual Basic 2010 and click Edit Script. Start by adding the following two directives to the Imports region of the script file:

    Imports System.Security.Cryptography
    Imports System.Text



10. Replace the method Input0_ProcessInputRow with the following code:

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        ' Hash the concatenated columns, then take the remainder to get 0 or 1
        Row.ProcessNumber = CInt(GetHashValue(Row.Marque & Row.Make) Mod 2)
    End Sub



11. Add the following function in the ScriptMain class:

    Private Function GetHashValue(ByVal SourceData As String) As Int64
        ' Encode the concatenated column values as UTF-16 bytes
        Dim dataToHash As Byte() = New UnicodeEncoding().GetBytes(SourceData)
        ' Compute a SHA-256 hash of those bytes
        Dim hashedData As Byte()
        Using SHA As New SHA256Managed()
            hashedData = SHA.ComputeHash(dataToHash)
        End Using
        ' Interpret the first eight bytes of the hash as a 64-bit integer
        Dim hashedDataInt As Int64 = BitConverter.ToInt64(hashedData, 0)
        ' Clear the sign bit so that the result is never negative
        Return hashedDataInt And Int64.MaxValue
    End Function

12. Close the SSIS Script window and click OK.

13. Add an OLEDB destination and connect the Script component to it.

14. Double-click to edit the destination, configure it to use the CarSales_Staging_OLEDB connection manager and the Stock destination table. Click Mappings on the left to map the columns.

15. Click OK to finish.

You can now run the package and load the data.



How It Works

This recipe attempts to answer the question, “How do I derive a process thread column from a flat file to enable multiple data destinations?” The answer is fairly straightforward and consists of using a Script component to generate the hash and the corresponding ProcessNumber field. To avoid reiterating everything that was described in the previous recipe, I will merely explain the differences, and use the previous example as a basis for extension.

There are a few things to note about the script code. First, the procedure Input0_ProcessInputRow fires for every row that this SSIS package processes, so what you are doing is adding a new column to each row and then writing the process number to it. Note that you have to create the new output column before you write the script (at least if you want to avoid annoying alerts and errors). Second, the GetHashValue function takes the concatenated input columns and creates an SHA-256 hash, of which it then converts the first eight bytes to an integer. This integer is then used to derive the process number using the Mod operator (not unlike the T-SQL % operator used in the previous recipe).
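The hash-then-modulo logic can be tried outside SSIS. What follows is a minimal Python sketch, not the recipe's own code. It assumes UTF-16 little-endian encoding (to mirror .NET's UnicodeEncoding), SHA-256, and the first eight bytes read as a little-endian signed 64-bit integer with the sign bit cleared. The function names hash_value and process_number, and the sample values, are invented for illustration.

```python
import hashlib
import struct


def hash_value(source_data: str) -> int:
    """Return a non-negative 64-bit integer derived from a SHA-256 hash."""
    # UTF-16 little-endian, no byte-order mark (assumed to match .NET's UnicodeEncoding)
    data = source_data.encode("utf-16-le")
    digest = hashlib.sha256(data).digest()
    # Read the first eight bytes as a little-endian signed 64-bit integer
    (as_int64,) = struct.unpack("<q", digest[:8])
    # Clear the sign bit so that the value is never negative
    return as_int64 & 0x7FFFFFFFFFFFFFFF


def process_number(marque: str, make: str, flows: int = 2) -> int:
    """Map a row to one of `flows` process numbers (0 .. flows - 1)."""
    return hash_value(marque + make) % flows


if __name__ == "__main__":
    # Invented sample values, used only to show the call
    for marque, make in [("Jaguar", "XK"), ("Aston Martin", "DB9"), ("Rolls Royce", "Ghost")]:
        print(marque, make, process_number(marque, make))
```

Because the hash depends only on the column values, the same row always receives the same process number, whichever run or machine computes it.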



Hints, Tips, and Traps





    • The hash value does not need to be based on all the fields in the source table. Uniqueness is not required, because the hash is not used for comparison purposes but merely to balance the data batches approximately. So use any combination of fields that you feel gives an equitable distribution.

    • If you have an integer field in the source data, and do not need to generate a hash to deduce a flow ID, then you can simply use the following single line of code to define the process flow number (which here yields 1 or 2 rather than 0 or 1): Row.ProcessNumber = CInt(Row.ID Mod 2) + 1
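One way to judge whether a combination of fields gives an equitable distribution is to count how many rows each process number would receive before building the package. The Python sketch below does this under stated assumptions: it hashes with SHA-256 over UTF-16 little-endian text and uses the first eight bytes with the sign bit cleared. The names bucket and distribution, and the sample rows, are hypothetical.

```python
import hashlib


def bucket(key: str, flows: int) -> int:
    """Hash a key with SHA-256 and reduce it to a process number."""
    digest = hashlib.sha256(key.encode("utf-16-le")).digest()
    # First eight bytes as a signed 64-bit integer, with the sign bit cleared
    value = int.from_bytes(digest[:8], "little", signed=True) & 0x7FFFFFFFFFFFFFFF
    return value % flows


def distribution(rows, key_columns, flows=2):
    """Count how many rows each process number (0 .. flows - 1) would receive."""
    counts = {n: 0 for n in range(flows)}
    for row in rows:
        # Concatenate the chosen columns, as the Script component does
        key = "".join(str(row[column]) for column in key_columns)
        counts[bucket(key, flows)] += 1
    return counts


if __name__ == "__main__":
    # Invented sample rows: three distinct Make values, 1,000 distinct Marque values
    rows = [{"Make": f"Make{i % 3}", "Marque": f"Marque{i}"} for i in range(1000)]
    print("Make only:      ", distribution(rows, ["Make"]))
    print("Make and Marque:", distribution(rows, ["Make", "Marque"]))
```

A column with very few distinct values can only produce as many distinct hashes as it has values, so the split may be badly skewed. Adding a column with more distinct values usually evens it out.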



13-9. Reading and Writing Data from a Database Source in Parallel

Problem

You want to load data efficiently from a source database into SQL Server using all available processor cores.






Solution

Create an SSIS package with multiple data flow tasks that will read and write the data in parallel, as follows.

1. Create a new SSIS package. Add two OLEDB connection managers: one to the source database CarSales (CarSales_OLEDB) and one to the destination database CarSales_Staging (CarSales_Staging_OLEDB).



2. Create a destination table in the destination database (CarSales_Staging) using the following DDL (C:\SQL2012DIRecipes\CH13\tblExportStock.Sql):

    CREATE TABLE CarSales_Staging.dbo.ExportStock
    (
        ID BIGINT NOT NULL,
        Make VARCHAR(50) NULL,
        Marque NVARCHAR(50) NULL,
        Model NVARCHAR(50) NULL
    );
    GO



3. Add a Data Flow task and double-click to edit.



4. Add an OLEDB source and configure as follows:

    OLEDB Connection Manager:  CarSales_OLEDB
    Data Access Mode:          SQL Command
    SQL Text:                  SELECT ID, Make, Marque, Model
                               FROM   dbo.Stock WITH (NOLOCK)
                               WHERE  (AccountID % 4) = 0
                               OPTION (MAXDOP 1)



5. Add an OLEDB destination, connect the OLEDB source to it, and configure as follows:

    OLEDB Connection Manager:   CarSales_Staging_OLEDB
    Data Access Mode:           Table or View – Fast Load
    Name of the Table or View:  dbo.ExportStock
    Table Lock:                 Checked



6. Click Mappings and map the columns (all have the same names in both source and destination, so SSIS should do this for you).



7. Repeat steps 4 through 6 for each available processor core in your server, taking care to increment the remainder in the WHERE clause by one each time. For the second source-to-destination pair it should read WHERE (AccountID % 4) = 1; for the third, WHERE (AccountID % 4) = 2; and so forth. The divisor (4 in this example) must equal the total number of source-to-destination pairs, so that every row is selected exactly once.
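Writing one filtered query per flow by hand is error-prone, so it can help to generate the query text and check which keys the filters would select. The Python sketch below is illustrative only: the helper names are hypothetical, and the column names are copied from the query in step 4. It also mimics the T-SQL % operator, whose result takes the sign of the dividend, to show that negative keys not divisible by the divisor would match none of the filters.

```python
def tsql_remainder(dividend: int, divisor: int) -> int:
    """Mimic T-SQL's % operator, whose result takes the sign of the dividend."""
    remainder = abs(dividend) % abs(divisor)
    return -remainder if dividend < 0 else remainder


def source_queries(flows: int, key_column: str = "AccountID"):
    """Build one source query per parallel flow, using remainders 0 .. flows - 1."""
    template = (
        "SELECT ID, Make, Marque, Model\n"
        "FROM dbo.Stock WITH (NOLOCK)\n"
        "WHERE ({key} % {flows}) = {remainder}\n"
        "OPTION (MAXDOP 1)"
    )
    return [
        template.format(key=key_column, flows=flows, remainder=r)
        for r in range(flows)
    ]


def unmatched_keys(keys, flows: int):
    """Return the keys that none of the filters (remainder 0 .. flows - 1) would select."""
    return [k for k in keys if tsql_remainder(k, flows) not in range(flows)]


if __name__ == "__main__":
    for query in source_queries(4):
        print(query, end="\n\n")
    # Non-negative keys are always covered; negative keys may not be
    print(unmatched_keys(range(-3, 8), 4))
```

If the key column can hold negative values, one option is to filter on ABS(key) % n instead. That is a suggestion of this sketch, not something the recipe states.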


