11-9. Detecting, Logging, and Loading Delta Data

CREATE TABLE CarSales.dbo.DeltaTracking
(
DeltaID BIGINT IDENTITY(1,1) NOT NULL,
ObjectName NVARCHAR(128) NULL,
RecordID BIGINT NULL,
DeltaOperation CHAR(1) NULL,
DateAdded DATETIME NULL DEFAULT (GETDATE()),
CONSTRAINT PK_DeltaTracking PRIMARY KEY CLUSTERED
(
DeltaID ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
);
GO

3.	Again, for the sake of simplicity, here is a (fairly) generic tracking trigger that can be added to any table (in the source database) that has a single primary key column of an INT data type (C:\SQL2012DIRecipes\CH11\tr_DeltaTracking.Sql):

CREATE TRIGGER dbo.tr_DeltaTracking
ON dbo.Invoice_Lines FOR INSERT, UPDATE, DELETE
AS

DECLARE @InsertedCount BIGINT
DECLARE @DeletedCount BIGINT
DECLARE @ObjectName NVARCHAR(128)

SELECT @InsertedCount = COUNT(*) FROM INSERTED
SELECT @DeletedCount = COUNT(*) FROM DELETED

-- Resolve the name of the table that fired this trigger, so the same
-- trigger body can be reused on any table with a single INT primary key
SELECT @ObjectName = OBJECT_NAME(parent_id)
FROM sys.triggers
WHERE parent_class_desc = 'OBJECT_OR_COLUMN'
AND object_id = @PROCID

-- Inserts
IF @InsertedCount > 0 AND @DeletedCount = 0
BEGIN
INSERT INTO dbo.DeltaTracking (RecordID, ObjectName, DeltaOperation)
SELECT
ID
,@ObjectName AS ObjectName
,'I' AS DeltaOperation
FROM INSERTED
END

-- Deletes
IF @InsertedCount = 0 AND @DeletedCount > 0
BEGIN
INSERT INTO dbo.DeltaTracking (RecordID, ObjectName, DeltaOperation)
SELECT
ID
,@ObjectName AS ObjectName
,'D' AS DeltaOperation
FROM DELETED
END

-- Updates
IF @InsertedCount > 0 AND @DeletedCount > 0
BEGIN
INSERT INTO dbo.DeltaTracking (RecordID, ObjectName, DeltaOperation)
SELECT
ID
,@ObjectName AS ObjectName
,'U' AS DeltaOperation
FROM INSERTED
END
GO
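To verify that the trigger is logging correctly, you can run a quick smoke test. This is a minimal sketch that assumes dbo.Invoice_Lines exists with the columns used in this recipe (ID as an IDENTITY primary key, plus InvoiceID, StockID, and SalePrice):

-- Fire the trigger once per operation type, then inspect the log
INSERT INTO dbo.Invoice_Lines (InvoiceID, StockID, SalePrice) VALUES (1, 1, 500.00);
UPDATE dbo.Invoice_Lines SET SalePrice = 550.00 WHERE ID = SCOPE_IDENTITY();
SELECT * FROM dbo.DeltaTracking; -- expect one 'I' row and one 'U' row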

4.	Synchronize the data between the source and destination tables before allowing any data modifications to the source table. This is as simple as (a) preventing updates to the source data table, (b) loading the source data into a clean destination table, and (c) allowing DML operations on the source data again.
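As an illustration, here is a minimal T-SQL sketch of this synchronization, assuming that both databases are on the same server, that dbo.Invoice_Lines has only the four columns used in this recipe, and that the destination ID column is not an IDENTITY column:

BEGIN TRANSACTION;
-- (a) Take an exclusive table lock to block DML on the source for the duration
SELECT TOP (0) * FROM CarSales.dbo.Invoice_Lines WITH (TABLOCKX, HOLDLOCK);
-- (b) Load the source data into a clean destination table
TRUNCATE TABLE CarSales_Staging.dbo.Invoice_Lines;
INSERT INTO CarSales_Staging.dbo.Invoice_Lines (ID, InvoiceID, StockID, SalePrice)
SELECT ID, InvoiceID, StockID, SalePrice FROM CarSales.dbo.Invoice_Lines;
-- Clear any tracking rows logged before this synchronization point
DELETE FROM CarSales.dbo.DeltaTracking WHERE ObjectName = 'Invoice_Lines';
-- (c) Committing releases the lock and re-allows DML on the source
COMMIT TRANSACTION;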



5.	Now that the infrastructure is in place, you can proceed to the creation of the actual ETL package itself. Create a new SSIS package and name it Prepare Scratch tables. Add two OLEDB connection managers named CarSales_Staging_OLEDB and CarSales_OLEDB.



6.	Add an Execute SQL task to the Control Flow pane. Configure as follows:

Connection Type:	OLEDB
Connection:	CarSales_Staging_OLEDB
SQL Statement:

TRUNCATE TABLE dbo.Invoice_Lines_Deletes
TRUNCATE TABLE dbo.Invoice_Lines_Updates
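This recipe assumes that the two scratch tables already exist in CarSales_Staging; their DDL is not shown here. A minimal sketch, with data types inferred (and therefore assumed) from the column mappings used in the Deletes and Updates data flows below:

CREATE TABLE dbo.Invoice_Lines_Deletes
(
ID BIGINT NOT NULL
);

CREATE TABLE dbo.Invoice_Lines_Updates
(
ID BIGINT NOT NULL,
InvoiceID INT NULL,
StockID INT NULL,
SalePrice NUMERIC(18,2) NULL,
DateUpdated DATETIME NULL
);
GO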



7.	Add a Sequence Container to the Control Flow pane. Name it Upsert and Delete Deltas. Inside this container, add three Data Flow tasks named Inserts, Deletes, and Updates.






8.	Add an Execute SQL task to the Control Flow pane. Name it Delete Data. Configure as follows:

Connection Type:	OLEDB
Connection:	CarSales_Staging_OLEDB
SQL Statement:

DELETE DST
FROM dbo.Invoice_Lines DST
INNER JOIN dbo.Invoice_Lines_Deletes DL
ON DST.ID = DL.ID



9.	Add an Execute SQL task to the Control Flow pane. Name it Update Data. Configure as follows:

Connection Type:	OLEDB
Connection:	CarSales_Staging_OLEDB
SQL Statement:

UPDATE DST
SET
DST.InvoiceID = UPD.InvoiceID
,DST.StockID = UPD.StockID
,DST.SalePrice = UPD.SalePrice
FROM dbo.Invoice_Lines DST
INNER JOIN dbo.Invoice_Lines_Updates UPD
ON DST.ID = UPD.ID



10.	Add an Execute SQL task to the Control Flow pane. Name it Delete Tracking Records. Configure as follows:

Connection Type:	OLEDB
Connection:	CarSales_OLEDB
SQL Statement:

DELETE FROM dbo.DeltaTracking
WHERE ObjectName = 'Invoice_Lines'



The SSIS package should look like Figure 11-21.






Figure 11-21.  Process flow for trigger-based delta data upserts



11.	Double-click the “Inserts” Data Flow task and add an OLEDB source to the Data Flow pane. Configure it as follows:

OLEDB Connection Manager:	CarSales_OLEDB
Data Access Mode:	SQL Command
SQL Command Text:

SELECT SRC.ID
,SRC.InvoiceID
,SRC.StockID
,SRC.SalePrice
FROM dbo.Invoice_Lines SRC
INNER JOIN dbo.DeltaTracking TRK
ON SRC.ID = TRK.RecordID
WHERE TRK.DeltaOperation = 'I'
AND TRK.ObjectName = 'Invoice_Lines'






12.	Add an OLEDB destination to the Data Flow pane. Connect the OLEDB source to it. Configure it as follows:

OLEDB Connection Manager:	CarSales_Staging_OLEDB
Data Access Mode:	Table or View - Fast Load
Table or View:	dbo.Invoice_Lines



13.	Click Mappings and ensure that all the columns are mapped between the source and destination.



14.	Double-click the “Deletes” Data Flow task and add an OLEDB source to the Data Flow pane. Configure it as follows:

OLEDB Connection Manager:	CarSales_OLEDB
Data Access Mode:	SQL Command
SQL Command Text:

SELECT TRK.RecordID
FROM dbo.DeltaTracking TRK
WHERE TRK.DeltaOperation = 'D'
AND TRK.ObjectName = 'Invoice_Lines'



15.	Add an OLEDB destination to the Data Flow pane. Connect the OLEDB source to it. Configure it as follows:

OLEDB Connection Manager:	CarSales_Staging_OLEDB
Data Access Mode:	Table or View - Fast Load
Table or View:	dbo.Invoice_Lines_Deletes



16.	Click Mappings and ensure that the RecordID and ID columns are mapped between the source and destination.



17.	Double-click the “Updates” Data Flow task and add an OLEDB source to the Data Flow pane. Configure it as follows:

OLEDB Connection Manager:	CarSales_OLEDB
Data Access Mode:	SQL Command
SQL Command Text:

SELECT SRC.ID
,SRC.InvoiceID
,SRC.StockID
,SRC.SalePrice
,SRC.DateUpdated
FROM dbo.Invoice_Lines SRC
INNER JOIN dbo.DeltaTracking TRK
ON SRC.ID = TRK.RecordID
WHERE TRK.DeltaOperation = 'U'
AND TRK.ObjectName = 'Invoice_Lines'



18.	Add an OLEDB destination to the Data Flow pane. Connect the OLEDB source to it. Configure it as follows:

OLEDB Connection Manager:	CarSales_Staging_OLEDB
Data Access Mode:	Table or View - Fast Load
Table or View:	dbo.Invoice_Lines_Updates

19.	Click Mappings and ensure that all the columns are mapped between the source and destination.



You can now run the package and load any delta data.
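After a run, a quick sanity check is to compare rowcounts between the two tables. A minimal sketch, assuming both databases are on the same server:

-- Rowcounts should match once the deltas have been applied
SELECT
(SELECT COUNT(*) FROM CarSales.dbo.Invoice_Lines) AS SourceRows
,(SELECT COUNT(*) FROM CarSales_Staging.dbo.Invoice_Lines) AS DestinationRows;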



How It Works
This approach is essentially in two parts:

•	First, detect the changes to the source table using triggers, and then log the changes.

•	Second, carry out periodic data loads using the logged changes to indicate the records to load and modify.



A long-standing and reliable way of tracking delta data is to use triggers to flag inserts, updates, and deletes. There are several ways to store the flags indicating data changes; the two classic approaches are either to use extra columns in the source tables or to create and use separate “delta” tables. I will explain the delta table method, which traditionally means recording the unique ID of each affected record in a tracking table (most probably on the source data server), either in the source database or in another database. These delta IDs can then be used to isolate the data sets needed to perform upserts and deletes in the destination data.
The trigger-based process to track changes looks like Figure 11-22.



Figure 11-22.  A trigger-based process to track changes






Not all DBAs allow their source databases to be encumbered (the word is theirs) with extraneous tables and triggers, so this solution is not always applicable in practice. However, when it is possible, it has many advantages, as well as a few disadvantages.
The advantages are:

•	A simple log of data changes and the change type.

•	The log (the delta table) is narrow and easy to use.

The major inconveniences and potential problems are:

•	The underlying logic that keeps data in sync can be very complex.

•	There are performance implications, because the triggers fire for every change.

•	If there are existing triggers, then trigger sequencing and interaction can be problematic.

•	Long-running transactions can cause problems (unless transactions are correctly handled).

•	Data inconsistencies can occur if the process is not analyzed thoroughly.



A trigger-based approach is best used when you can count on the agreement and involvement of the source data DBA. It also means getting it accepted that triggers will slow down DML operations on the source database, but that the ease and speed of the data transfer to a destination server are sufficient compensation. You also need to be sure that there are no unforeseen side effects and that the process has been thoroughly tested.
In this example, I used standard tables in a destination database to hold the data used for updates and deletes, simply because they are easier to set up and can persist data for investigation and test purposes. You may prefer to use session-scoped temporary tables (as described in previous recipes).
Once you have set up a delta-tracking process using triggers, it is equally easy to use pure T-SQL to perform the inserts, updates, and deletes in a destination database. You can create three separate triggers (one each for insert, update, and delete operations), and you can hard-code the table name and/or object ID if you prefer. I find that a generic trigger makes maintenance of delta tracking much easier.
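For reference, here is a minimal sketch of that pure T-SQL alternative, assuming that both databases are on the same server and reusing the table and column names from the SSIS package above:

-- Inserts
INSERT INTO CarSales_Staging.dbo.Invoice_Lines (ID, InvoiceID, StockID, SalePrice)
SELECT SRC.ID, SRC.InvoiceID, SRC.StockID, SRC.SalePrice
FROM CarSales.dbo.Invoice_Lines SRC
INNER JOIN CarSales.dbo.DeltaTracking TRK ON SRC.ID = TRK.RecordID
WHERE TRK.DeltaOperation = 'I' AND TRK.ObjectName = 'Invoice_Lines';

-- Updates
UPDATE DST
SET DST.InvoiceID = SRC.InvoiceID
,DST.StockID = SRC.StockID
,DST.SalePrice = SRC.SalePrice
FROM CarSales_Staging.dbo.Invoice_Lines DST
INNER JOIN CarSales.dbo.Invoice_Lines SRC ON DST.ID = SRC.ID
INNER JOIN CarSales.dbo.DeltaTracking TRK ON SRC.ID = TRK.RecordID
WHERE TRK.DeltaOperation = 'U' AND TRK.ObjectName = 'Invoice_Lines';

-- Deletes
DELETE DST
FROM CarSales_Staging.dbo.Invoice_Lines DST
INNER JOIN CarSales.dbo.DeltaTracking TRK ON DST.ID = TRK.RecordID
WHERE TRK.DeltaOperation = 'D' AND TRK.ObjectName = 'Invoice_Lines';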

This approach lets you use the same tracking table for multiple source data tables, assuming that they all have a single-column primary key of the same data type. When things are not this simple, you have several choices (the NVARCHAR option is sketched after this list):

•	Separate tables for multicolumn source tables.

•	Separate tables for data types.

•	Widen your tracking table to hold several possible key columns.

•	Add multiple columns for different key data types.

•	Use an NVARCHAR column to hold all possible primary keys, and accept the speed hit when reading and writing.

•	Have one complex trigger to handle multiple primary key columns.

•	Have a trigger per type of primary key and/or PK data type.

Given the multiple possibilities, I can only suggest that you try out some of these approaches to find the one best suited to your environment.
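A hypothetical sketch of the NVARCHAR option, in which any single- or composite-column key is serialized into one string column (the composite key columns in the commented trigger fragment are assumptions):

ALTER TABLE CarSales.dbo.DeltaTracking ADD RecordKey NVARCHAR(450) NULL;
GO
-- Inside a trigger on a table with a composite key (InvoiceID + StockID assumed):
-- INSERT INTO dbo.DeltaTracking (RecordKey, ObjectName, DeltaOperation)
-- SELECT CAST(InvoiceID AS NVARCHAR(20)) + '|' + CAST(StockID AS NVARCHAR(20))
--        ,@ObjectName, 'I'
-- FROM INSERTED;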






A similar approach is possible with many other SQL databases, provided that they allow triggers. As the requirements (and the syntax for each database) differ, I can only encourage you to consult the relevant product documentation and to use this recipe as an overall model.



Hints, Tips, and Traps

•	This process presumes that the source and destination data will be, and will be kept, in sync. Should the two get out of sync, you need to prevent updates to the source data table while it is loaded into a clean destination table, and delete or truncate all records relative to this table from the delta table before allowing DML operations on the source data again.

•	The Deletes and Updates tables may be in another database, of course, or even on another (linked) server, if that suits your infrastructure and operational requirements.

•	You can simply load all the data in a delta table to a staging table, and then insert/update/delete. This is likely to be slower, but it allows you to keep a trace of all operations.

•	It is a very good idea to wrap the whole SSIS package in a transaction, as this will ensure that you do not:

	•	Lose delta records.

	•	Attempt duplicate inserts.

	•	Carry out old updates.

•	It is possible to create the scratch tables at the start of the package and drop them at the end, if you prefer. This way, any failure will leave the scratch tables intact, which can help with debugging. Of course, you must then remember to drop them manually before re-running the package.

•	The particular implementation in this recipe presumes that you have a clear cutoff point in the source database, which allows delta data to be transferred to the destination with no modification of the source during the delta data load. Should the source database be continually active, you need to use the DateAdded field in the tracking table to define a cutoff point, and then delete (not truncate) the tracking records up to this date and time once their changes have been duplicated across the databases by a successful delta load. A sketch of this follows.
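A minimal sketch of that cutoff logic; in a real package, the cutoff value would be held in an SSIS package variable and passed to the Execute SQL tasks, rather than declared in one batch as here:

-- Before the load: fix the cutoff point
DECLARE @CutOff DATETIME = GETDATE();
-- ...the data flows then select tracking rows WHERE DateAdded <= @CutOff...
-- After a successful load: remove only the tracking rows that were processed
DELETE FROM dbo.DeltaTracking
WHERE ObjectName = 'Invoice_Lines'
AND DateAdded <= @CutOff;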



11-10. Detecting Differences in Rowcounts, Metadata, and Column Data

Problem
You want to detect differences in rowcounts, metadata, and column data without using SSIS or T-SQL.

Solution
Use the TableDiff utility from a Command window to detect whether any changes have been made to a source table before running an ETL delta data load.
Once again, I will use a few mini-recipes to describe how to use TableDiff, because its various applications are too different to be rolled into one simple example.






Metadata and Rowcounts
The following code, run in a Command window, returns the rowcounts for the two tables and indicates whether the metadata is identical for the two tables:

"C:\Program Files\Microsoft SQL Server\110\COM\Tablediff.exe"
-sourceuser Adam
-sourcepassword Me4B0ss
-sourceserver ADAM02
-sourcedatabase CarSales
-sourceschema dbo
-sourcetable Sales
-destinationuser Adam
-destinationpassword Me4B0ss
-destinationserver ADAM02
-destinationdatabase CarSales_Staging
-destinationschema dbo
-destinationtable Sales
-q

It is the -q (quick) parameter that tells TableDiff to compare only record counts and metadata. Should the two tables have differing columns, TableDiff will return: “have different schemas and cannot be compared.”



Row Differences
To get details on any records that differ between the two tables, remove the -q parameter and add the following (preferably sending the output to a SQL Server table rather than the Command window):

-dt -et DataList

DataList is the name of the table (in the destination database) that will hold the list of anomalies.



Column Differences
To see all data differences at column level, add the -c parameter.
The resulting table has the columns listed in Table 11-3.

Table 11-3.  TableDiff Output

Column Name                   Possible Values   Comments
ID                                              Contains the PK, GUID, or unique ID.
MSdifftool_ErrorCode          0, 1, or 2        0 for a data mismatch.
                                                1 for a record in the destination table, but not the source.
                                                2 for a record in the source table, but not the destination.
MSdifftool_OffendingColumns                     The name of the column where the data differs between the two tables.
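Once the output is in a table, you can query it like any other. A sketch, assuming the column names shown in Table 11-3 and that DataList was created in CarSales_Staging:

-- List only the rows where column data differs between the two tables
SELECT ID, MSdifftool_ErrorCode, MSdifftool_OffendingColumns
FROM CarSales_Staging.dbo.DataList
WHERE MSdifftool_ErrorCode = 0;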


