11-4. Detecting and Loading Delta Data Only


1.	Create the four ETL metadata tables that will provisionally hold delta data information. For updates in the source database, use the following DDL:

CREATE TABLE CarSales.dbo.TMP_Updates (ID INT);



2.	For inserts in the source database, use the following DDL:

CREATE TABLE CarSales.dbo.TMP_Inserts (ID INT);



3.	For updated data in the destination database, use the following DDL (drop this table first if you have created it for another recipe). The code is in (C:\SQL2012DIRecipes\CH11\tbl11-4_Invoice_Lines_Updates.Sql):

CREATE TABLE CarSales_Staging.dbo.Invoice_Lines_Updates
(
	ID INT NOT NULL,
	InvoiceID INT NULL,
	StockID INT NULL,
	SalePrice NUMERIC(18, 2) NULL,
	DateUpdated DATETIME,
	LineItem SMALLINT,
	VersionStamp VARBINARY(8) NULL,
	HashData VARBINARY(256) NULL
);
GO



4.	For deletes in the destination database, use the following DDL:

CREATE TABLE CarSales_Staging.dbo.TMP_Deletes (ID INT);
GO



5.	As a destination table for this recipe, create the following table. The code is in (C:\SQL2012DIRecipes\CH11\tbl11-4_Invoice_Lines.Sql):

CREATE TABLE CarSales_Staging.dbo.Invoice_Lines
(
	ID INT NOT NULL,
	InvoiceID INT NULL,
	StockID INT NULL,
	SalePrice NUMERIC(18, 2) NULL,
	DateUpdated DATETIME,
	LineItem SMALLINT,
	VersionStamp VARBINARY(8) NULL,
	HashData VARBINARY(256) NULL
);
GO






6.	Add three package-scoped variables. The references to the database tables will be replaced later with references to temporary tables. Configure these variables—for the moment—as follows:

	Name: DeleteTable	DataType: String	Value: Tmp_Deletes
	Name: UpdateTable	DataType: String	Value: Tmp_Updates
	Name: InsertTable	DataType: String	Value: Tmp_Inserts



7.	Create a new SSIS package and add two OLEDB connection managers: one correctly configured for the source server (CarSales) and one correctly configured for the destination server (CarSales_Staging), named CarSales_OLEDB and CarSales_Staging_OLEDB, respectively. Set the RetainSameConnection property to True for both the source and destination connection managers.



8.	Add a new Execute SQL task onto the Control Flow pane. Name it Create Temp tables on Source. This task creates the session-scoped temporary tables that will hold the IDs of all records to insert and update from the source dataset (in production if not in development). Configure as follows:

	Name:	Create Temp tables on Source
	Connection:	CarSales_OLEDB
	SQL Statement:

	CREATE TABLE ##TMP_INSERTS (ID INT);
	CREATE TABLE ##TMP_UPDATES (ID INT);



9.	Confirm with OK to return to the Control Flow pane.

10.	Add a new Execute SQL task onto the Control Flow pane. Connect the previous Execute SQL task to it. Name it Create Temp tables on Destination. This task creates the session-scoped temporary table that will hold all the IDs in the source data (which will be used to isolate records to delete). Configure as follows:

	Name:	Create Temp tables on Destination
	Connection:	CarSales_Staging_OLEDB
	SQL Statement:

	CREATE TABLE ##TMP_DELETES (ID INT);



11.	Confirm with OK to return to the Control Flow pane.



12.	Add a Data Flow task to the Control Flow pane. Name it Delta Detection. Connect the Execute SQL task "Create Temp tables on Destination" to it, so that both temporary tables exist before the data flow runs. Double-click to enter the Data Flow pane.






13.	Add an OLEDB source adapter. Configure it to select the ID and delta-detection columns from the source table (Invoice_Lines) using an SQL command, like this:

	Connection:	CarSales_OLEDB
	SQL Statement:

	SELECT ID, HashData AS HashData_Source
	FROM dbo.Invoice_Lines WITH (NOLOCK)



14.	Add a Multicast transform to the Data Flow pane. Connect the data source adapter to it.



15.	Add a Lookup transform to which you connect the Multicast transform. Name it Detect Hash Deltas. On the General pane of the Lookup transform, configure it as follows:

	Cache Mode:	No Cache
	Connection Type:	OLEDB Connection Manager
	Specify how to handle rows with no matching entries:	Send rows with no matching entries to the No Match output



16.	Click Connection on the left. Set the connection manager to CarSales_Staging_OLEDB, because you will be comparing the source data with the destination data that you are looking up with this Lookup transform.



17.	Set the Lookup to "Use results of an SQL query" and enter the following SQL:

	SELECT ID, HashData
	FROM dbo.Invoice_Lines WITH (NOLOCK)



18.	Click Columns on the left. The two tables appear: source on the left and destination on the right. Drag the ID column from the Available Lookup Columns (or destination) table on the right to the ID column of the Available Input Columns (or source) table on the left. This maps the unique IDs of the two data sources to each other.



19.	Select the HashData column of the Available Lookup Columns (or destination) table on the right and provide an output alias—I suggest HashData_Destination. This allows you to compare the hashes for source and destination for each record.



20.	Click OK to confirm your modifications. Return to the Data Flow pane.



21.	Add an OLEDB destination adapter to the Data Flow pane and set its ValidateExternalMetadata property to False. Connect the Lookup transform to this destination, ensuring that you select the Lookup No Match output. Configure as follows:

	OLEDB Connection Manager:	CarSales_OLEDB
	Data Access Mode:	Table name or view name variable
	Variable Name:	InsertTable






22.	Click Mappings and ensure that the ID columns are connected. Click OK to confirm your modifications. You will be creating a temporary table of all new IDs in the data source, which you will use later to transfer only these records (the inserts) to the destination server.



23.	Add an OLEDB destination adapter to the Data Flow pane. Connect the Multicast transform to it. Name it Records to Delete, and set its ValidateExternalMetadata property to False. Double-click to edit.



24.	Set the connection manager to CarSales_Staging_OLEDB and the data access mode to Table name or view name variable. Select DeleteTable as the table variable name.



25.	Click Mappings. Ensure that the ID columns are connected. Then click OK to confirm. This way, you create a temporary table of all IDs in the source dataset that you can compare with current IDs in the destination dataset to deduce deleted records.



26.	Now it is time to move to the Update part of the process. Add a Conditional Split transform onto the Data Flow pane. Name it Detect Hash Deltas. Connect the Lookup transform to it. The Lookup Match output should be applied automatically. This sends the Lookup Match output to the Conditional Split.



27.	Double-click to edit the Conditional Split. Add a single output condition named HashData Differences, where the condition is:

	(DT_UI8) HashData_Source != (DT_UI8) HashData_Destination



28.	Click OK to confirm. Close the dialog box.



29.	Add an OLEDB destination adapter to the data flow and connect the Conditional Split transform to it. The Input Output dialog box will appear, where you should select the HashData Differences output (which is, in fact, the source for the destination adapter). Set the following:

	OLEDB Connection Manager:	CarSales_OLEDB
	Data Access Mode:	Table name or view name variable
	Variable Name:	UpdateTable

30.	Click Mappings and ensure that the ID columns are mapped correctly, and then click OK to confirm your modifications. This sends the subset of all IDs with different hashes to the session-scoped temporary table. The Data Flow pane should look like Figure 11-9.






Figure 11-9.  Data flow for hash-based delta detection

31.	Click the Control Flow tab to return to the Control Flow pane.



32.	Now we can move on to the second part of the process—using the IDs that identify the Inserts, Deletes, and Updates to perform the data modifications, fetching (in the case of inserts and updates) only the records that are required. Let's begin with Deletes. Click the Control Flow tab and add an Execute SQL task. Name it Delete Data. Connect this to the Delta Detection Data Flow task. Configure the Execute SQL task as follows (C:\SQL2012DIRecipes\CH11\11-4_DeleteData.Sql):

	Connection:	CarSales_Staging_OLEDB
	SQL Statement:

	DELETE FROM dbo.Invoice_Lines
	WHERE ID IN (
	        SELECT DST.ID
	        FROM dbo.Invoice_Lines DST
	        LEFT OUTER JOIN ##TMP_Deletes TMP
	               ON TMP.ID = DST.ID
	        WHERE TMP.ID IS NULL
	)

33.	Now on to Inserts. Add a Data Flow task to the Control Flow pane. Connect the previous task (Delete Data) to it. Name the task Insert Data. Set its DelayValidation property to True.






34.	Add a new package-scoped variable named InsertSQL. This must be a string containing the SELECT statement corresponding to the source data, joined to the TMP_Inserts temp table. In this example, the following value will be used; it will be tweaked to use the temp table once all is working (C:\SQL2012DIRecipes\CH11\11-4_InsertSQL.Sql):

	SELECT SRC.ID, SRC.InvoiceID, SRC.StockID, SRC.SalePrice, SRC.HashData
	FROM dbo.Invoice_Lines SRC
	INNER JOIN TMP_Inserts TMP
	       ON SRC.ID = TMP.ID



35.	Edit the "Insert Data" Data Flow task.



36.	Add an OLEDB source to the Data Flow pane. Rename it CarSales_OLEDB, and configure it as follows:

	OLEDB Connection Manager:	CarSales_OLEDB
	Data Access Mode:	SQL Command from variable
	Variable Name:	InsertSQL

37.	Add an OLEDB destination connector to the Data Flow pane. Connect the source to it. Configure as follows:

	OLEDB Connection Manager:	CarSales_Staging_OLEDB
	Data Access Mode:	Table or view - Fast load
	Name of the Table or View:	dbo.Invoice_Lines



38.	Click Columns and ensure that all the required columns are selected, and then click OK to confirm your changes. This makes a second trip to the source server and adds any new records not already in the destination dataset.



39.	Return to the Control Flow pane.



40.	Finally, we're on to Updates. First, add two package-scoped string variables, as follows (C:\SQL2012DIRecipes\CH11\11-4_UpdateDataTable.Sql):

	Name:	UpdateDataTable
	Value:	Invoice_Lines_Updates

	Name:	UpdateSQL
	Value:

	SELECT SRC.ID, SRC.InvoiceID, SRC.StockID, SRC.SalePrice,
	       SRC.LineItem, SRC.DateUpdated, SRC.HashData
	FROM dbo.Invoice_Lines SRC
	INNER JOIN TMP_Updates TMP
	       ON SRC.ID = TMP.ID






41.	The UpdateSQL variable selects only the records that need to be updated from the source server. UpdateDataTable refers to the temporary table of data containing all the records that require updating; the update itself is performed on the destination server by the Carry out updates Execute SQL task added in step 45. Now add a new Data Flow task to the Control Flow pane. Name it Update Data. Connect the previous Data Flow task (Insert Data) to it. Set its DelayValidation property to True.



42.	Double-click to edit. Add an OLEDB source to the Data Flow pane. Rename it CarSales_OLEDB. Set its ValidateExternalMetadata property to False. Configure it as follows:

	OLEDB Connection Manager:	CarSales_OLEDB
	Data Access Mode:	SQL Command from variable
	Variable Name:	UpdateSQL

43.	Add an OLEDB destination connector to the Data Flow pane. Connect the source to it. Configure as follows:

	OLEDB Connection Manager:	CarSales_Staging_OLEDB
	Data Access Mode:	Table name or view name variable
	Variable Name:	UpdateDataTable



44.	Click Mappings and ensure that all the required columns are mapped, and then click OK to confirm your changes. This makes a second trip to the source server and copies the records whose data has changed into the Invoice_Lines_Updates scratch table, which will soon become a temporary table, too.



45.	Return to the Control Flow pane. Add an Execute SQL task, which you rename Carry out updates. Connect the "Update Data" Data Flow task to this. Set its DelayValidation property to True.



46.	Edit the Carry out updates Execute SQL task and set the following (C:\SQL2012DIRecipes\CH11\11-4_CarryOutUpdates.Sql):

	Connection:	CarSales_Staging_OLEDB
	SQL Statement:

	UPDATE  DST
	SET     DST.InvoiceID = UPD.InvoiceID
	       ,DST.SalePrice = UPD.SalePrice
	       ,DST.LineItem = UPD.LineItem
	       ,DST.DateUpdated = UPD.DateUpdated
	       ,DST.StockID = UPD.StockID
	       ,DST.HashData = UPD.HashData
	FROM    dbo.Invoice_Lines DST
	INNER JOIN ##Invoice_Lines_Updates UPD
	        ON DST.ID = UPD.ID






The Control Flow pane should look like Figure 11-10.



Figure 11-10.  Process flow for detecting and loading only delta data

That, finally, is it. You have an SSIS package that can detect data differences and only return the full data for any records that are new or need updating. Once the process is debugged and functioning, you can modify the variable values to use temporary tables, as follows (C:\SQL2012DIRecipes\CH11\11-4_VariablesForTempTables.Sql):

	DeleteTable:	TempDB..##TMP_Deletes
	UpdateTable:	TempDB..##TMP_Updates
	InsertTable:	TempDB..##TMP_Inserts
	UpdateDataTable:	TempDB..##Invoice_Lines_Updates

	InsertSQL:

	SELECT SRC.ID, SRC.InvoiceID, SRC.StockID, SRC.SalePrice, SRC.HashData
	FROM dbo.Invoice_Lines SRC
	INNER JOIN TempDB..##TMP_Inserts TMP
	       ON SRC.ID = TMP.ID






	UpdateSQL:

	SELECT SRC.ID, SRC.InvoiceID, SRC.StockID, SRC.SalePrice,
	       SRC.LineItem, SRC.DateUpdated, SRC.HashData
	FROM dbo.Invoice_Lines SRC
	INNER JOIN TempDB..##TMP_Updates TMP
	       ON SRC.ID = TMP.ID

Finally, you can delete the provisional tables, TMP_Updates and TMP_Inserts, in the source database, and TMP_Deletes and Invoice_Lines_Updates in the destination database. The process will now use the temporary tables instead of persisted tables.
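
If you prefer a script for this cleanup, it amounts to the following (a minimal sketch; run each batch against the server that owns the tables):

	-- On the source server:
	USE CarSales;
	DROP TABLE dbo.TMP_Updates;
	DROP TABLE dbo.TMP_Inserts;

	-- On the destination server:
	USE CarSales_Staging;
	DROP TABLE dbo.TMP_Deletes;
	DROP TABLE dbo.Invoice_Lines_Updates;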



How It Works

The downside of Recipes 11-1 and 11-3 is that the entire dataset is transferred from the source server every time that the process runs. An improvement on this approach is to detect the records that have changed and send across only those, which is what this recipe does.
So, assuming that you have extremely limited rights on the source server, how can this be done? The solution is to use a two-phase process:

•	First, bring back only the Key column(s) and the Delta Flag column to the destination server.

•	Then, detect the deltas (insert, update, and delete) and request only the appropriate records for each from the source server. A set-based sketch of this classification follows the list.

In this example, I am using a hash code to detect deltas—and consequently presume that your source data has a hash code generated for each insert and update. This approach works equally well with TIMESTAMP or ROWVERSION columns or date fields.
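
The recipe does not show how the source HashData column is maintained. One possible approach (an assumption on my part, not taken from the recipe) is a persisted computed column over the business columns, shown here as if the column did not already exist:

	-- HASHBYTES('SHA2_256', ...) returns 32 bytes, which fits comfortably
	-- in the VARBINARY(256) column used throughout this recipe.
	ALTER TABLE CarSales.dbo.Invoice_Lines
	ADD HashData AS CONVERT(VARBINARY(256), HASHBYTES('SHA2_256',
	          ISNULL(CAST(InvoiceID AS VARCHAR(20)), '')
	    + '|' + ISNULL(CAST(StockID AS VARCHAR(20)), '')
	    + '|' + ISNULL(CAST(SalePrice AS VARCHAR(30)), '')
	    + '|' + ISNULL(CAST(LineItem AS VARCHAR(10)), ''))) PERSISTED;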

It is worth noting that this approach is only productive if you have a delta-detection column, and will probably not prove much use if you are comparing multiple columns in the source and destination data sets, because you will almost inevitably end up bringing most of the data back from the source server twice. In my tests, using this approach can show real speed improvements with wide tables, but it can be slower with very narrow tables. In a 15-column table of mixed data types, a few of them medium-length VARCHARs and a 2 percent data differential, I got a 10 percent reduction in processing time. In a 150-column table containing several tens of very wide columns and the same 2 percent data differential, the process ran nearly four times faster. Of course, your results depend on your source data, network, and destination server configuration.
Figure 11-11 gives perhaps a clearer, illustrated view of the process.



Figure 11-11.  Process flow to detect data modifications at source






On the technical level, this technique will use session-scoped temporary tables to hold the delta IDs of records that will be updated and deleted. To achieve this in practice, you need to set the RetainSameConnection property of the connection manager where the temporary tables will be created (source or destination) to True. This ensures that the same connection is used throughout the duration of the SSIS package, and therefore that any session-scoped temporary tables used are available for all steps that need them. You must also ensure that any temporary tables you create are session-scoped—that is, their names begin with double hashes (##).
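
As a quick illustration of why the double hashes matter (a sketch for experimentation, not part of the recipe): a global temporary table remains visible to other connections for as long as the creating connection stays open, whereas a single-hash table is private to its own session.

	-- Session 1 (the connection that SSIS keeps open via RetainSameConnection):
	CREATE TABLE ##TMP_DELETES (ID INT);  -- global: visible to other sessions
	CREATE TABLE #LocalOnly (ID INT);     -- local: this session only

	-- Session 2 (any other connection, while session 1 remains open):
	SELECT COUNT(*) FROM ##TMP_DELETES;   -- succeeds
	-- SELECT COUNT(*) FROM #LocalOnly;   -- would fail: another session's local
	--                                       temporary table is not visible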

This recipe’s approach is best used in the following circumstances:





When you are importing wide tables that have a reliable delta-detection column.







When the modified data is only a small percentage of the total.







When you do not want to persist tables to disk for temporary data sets in the destination

database.



Please note that there is a little "scaffolding" to set up before creating the package, which you will need to remove afterward. This comment may need explaining, so here goes. In order to use session-scoped temporary tables easily in SSIS, it helps enormously to first create normal tables in the source and development environments that you will use to set up the package—pretty much like you did in Recipe 11-1. When creating the package, you will use SSIS variables to refer to the scratch tables in the source and destination databases. When all is working, alter the SSIS variables to point to the session-scoped temporary tables, and delete the corresponding tables in their respective databases. I call these tables "provisional."
The four ETL metadata tables are used as described in Table 11-2.

Table 11-2.  ETL Metadata Tables Used in This Recipe

ETL Metadata Table	Description

TMP_Inserts (source database table for IDs of new records)	Once deltas have been detected on the destination database, the IDs of new records are sent back to the source database so that the corresponding records can be sent for insertion at the destination.

TMP_Updates (source database table for IDs of updated records)	Once data changes have been detected on the destination database, the IDs of the changed records are sent back to the source database so that the corresponding records can be sent to update the destination.

Invoice_Lines_Updates (destination database table to hold modified data)	Stages the full rows of the updated records on the destination database until the final UPDATE is applied.

TMP_Deletes (destination database table used to deduce deletions)	Holds all the IDs present in the source dataset; destination records whose IDs are absent from this table are deleted.



In this recipe, the VERSIONSTAMP column used in Recipe 11-1 is replaced by a VARBINARY column named HashData, which holds the hash value used for delta detection.



Hints, Tips, and Traps





•	As you are using session-scoped temporary tables, you could have data spill out of memory and into TempDB, so it is well worth ensuring that your TempDB is configured correctly (the correct number of files for the number of processors, etc.; see Books Online (BOL) for details). A quick way to check the current layout is shown after this list.










•	As for the scaffolding (the permanent table(s) that you use to hold delta data), you can test using the materialized database tables until you are happy that everything is working, and only then point the variable references at the temporary tables at the end of your first phase of testing.

•	There are two main tweaks to remember:

	•	ValidateExternalMetadata must be False for all OLEDB source and destination connectors that use a temporary table.

	•	DelayValidation must be True for all Data Flow tasks that use a temporary table.
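
As the check of the TempDB file layout mentioned in the first tip (a one-statement sketch using the standard catalog view, not code from the recipe):

	SELECT name, physical_name, size * 8 / 1024 AS size_mb
	FROM tempdb.sys.database_files;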



11-5. Performing Delta Data Upserts with Other SQL Databases

Problem

You are using another SQL database as a data source and you want to transfer only modified or new records to the destination, as well as the IDs of deleted records.



Solution

Use a delta-detection column and temporary tables, as shown in the previous recipe. The following steps detail a simplified approach for an Oracle data source.

1.	Create an Oracle global temporary table using the following DDL (C:\SQL2012DIRecipes\CH11\tblOracleDelta.Sql):

	create global temporary table SCOTT.DELTA_DATA
	(
		"EMPNO" NUMBER(4,0)
	)
	on commit preserve rows;



2.	Create a destination SQL Server table using the following DDL (C:\SQL2012DIRecipes\CH11\tblOracle_EMP.Sql):

	CREATE TABLE dbo.Oracle_EMP
	(
		EMPNO NUMERIC(4, 0) NULL,
		LASTUPDATED DATETIME NULL,
		ENAME VARCHAR(10) NULL,
		JOB VARCHAR(9) NULL,
		MGR NUMERIC(4, 0) NULL,
		HIREDATE DATETIME NULL,
		SAL NUMERIC(7, 2) NULL,
		COMM NUMERIC(7, 2) NULL,
		DEPTNO NUMERIC(2, 0) NULL
	);


