9-26. Handling Type 2 Slowly Changing Dimensions with SSIS


Chapter 9 ■ Data Transformation



CREATE TABLE CarSales_Staging.dbo.Tmp_Client_SCDSSIS2
(
    SurrogateID INT NOT NULL
);
GO

3.  Create a new SSIS package named SCD_Type2. Add two OLEDB connection managers:
    one named CarSales_OLEDB, connecting to the CarSales database, the other named
    CarSales_Staging_OLEDB, connecting to the CarSales_Staging database. Set the
    RetainSameConnection property for the latter to True.



4.  Add the following two variables:

    Name       Scope      DataType  Value           Comments
    TempTable  SCD_Type2  String    Tmp_Client_SCD  Used to switch from the database-based
                                                    table to the temporary table.
    ValidFrom  SCD_Type2  Int32                     Automatically gets the current date.

5.  Set the EvaluateAsExpression property to True for the ValidFrom variable. Set the
    expression (this will get the current date as an integer) as:

    YEAR( GETDATE()) * 100000 + MONTH( GETDATE()) * 1000 + DAY( GETDATE())



6.  In the Control Flow tab, add an Execute SQL task, and configure as follows:

    Name:           Create Temp Table for Session
    Connection:     CarSales_Staging_OLEDB
    SQL Statement:  CREATE TABLE ##Tmp_Client_SCDSSIS2
                    (SurrogateID INT NOT NULL)

7.  Even if you will not be using this temporary table until the package is debugged and
    functioning, it is just as well to create it now. Click OK to confirm your changes.

8.  Add a Data Flow task and name it Main SCD Type 2 Process. Connect the previous
    Execute SQL task (Create Temp Table for Session) to it. Double-click to edit.






9.  Add an OLEDB Source connection to the Data Flow pane, name it Source Data, and
    configure as follows:

    OLEDB Connection Manager:  CarSales_OLEDB
    Data Access Mode:          SQL Command
    SQL Command Text:

    SELECT    ID, ClientName, Country, Town, County, Address1,
              Address2, ClientType, ClientSize
    FROM      dbo.Client
    ORDER BY  ID

10. Confirm with OK.

11. Right-click the Source Data OLEDB Source connection and select Show Advanced
    Editor. Select the Input and Output Properties tab, click the OLEDB Source Output,
    and set the IsSorted property to True on the right of the dialog box.

12. Expand the OLEDB Source Output and expand the Output Columns. Click the column
    ID and set the SortKeyPosition property to 1.



13. Add a Lookup transformation and connect the source you just created to it. Double-click
    to edit. In the General tab, set it to use an OLEDB connection manager, the Cache mode
    to Full Cache, and to redirect rows to NoMatch output. In the Connection pane, set the
    OLEDB connection manager to CarSales_Staging_OLEDB and to use the results of the
    following SQL query:

    SELECT    SurrogateID, BusinessKey, ClientName, Country, Town, County, Address1,
              Address2, ClientType, ClientSize, ValidFrom, ValidTo, IsCurrent
    FROM      dbo.Client_SCDSSIS2 WITH (NOLOCK)
    ORDER BY  BusinessKey

14. In the Columns pane, map the ID column (available input columns) to the
    BusinessKey column (available lookup columns), and then select all the other
    columns from the available lookup columns, except ValidFrom, ValidTo, and
    IsCurrent. Alias all the lookup columns by prefixing them with DIM_. The dialog box
    should look like Figure 9-20.






Figure 9-20.  Lookup columns for an SSIS Type 2 SCD

15. Click OK to confirm.

16. Add a Conditional Split transform and connect the Lookup transform to it using the
    Lookup Match Output. Double-click to edit. Create an output named DataDifference,
    with a condition like this:

    (ClientSize != DIM_ClientSize) || (ClientName != DIM_ClientName)

17. Once the comparison for each field has been entered (only two are shown here), click
    OK to finish your modifications.

18. Add a Multicast transform, and connect the Conditional Split transform to it using the
    DataDifference output.

19. Add a Merge transform and connect the Lookup transform to it. Be sure to select the
    Lookup NoMatch Output and to map this to the Merge Input 1. Then connect
    the Multicast transform to the Merge transform. Double-click to edit. Ensure
    that the columns are correctly mapped, as shown in Figure 9-21.






Figure 9-21.  Merge column mapping in SSIS

20. Add a Derived Column transform, connect the Merge transform to it, and add two
    derived columns, as in Figure 9-22.



Figure 9-22.  Derived Column transform

21. Confirm with OK.

22. Add an OLEDB destination and connect the Derived Column transform to it.
    Configure it as follows:

    OLEDB Connection Manager:  CarSales_Staging_OLEDB
    Data Access Mode:          Table or View – Fast Load
    Name of Table or View:     dbo.Client_SCDSSIS2

23. Once you have ensured that the columns are mapped (including mapping the source
    data ID column to the BusinessKey column of the SCD Type 2 table), confirm with OK.



24. Add an OLEDB Destination and connect the Multicast transform to it. Configure it as
    follows:

    OLEDB Connection Manager:  CarSales_Staging_OLEDB
    Data Access Mode:          Table name or view name variable
    Variable Name:             User::TempTable

25. Click Mappings and ensure that the DIM_SurrogateID column (the surrogate key returned
    by the Lookup transform) is mapped to the SurrogateID column of the destination table,
    since the later UPDATE joins on the surrogate key and not on the source ID. Once you
    have ensured that the column is mapped, confirm with OK. The data flow is finished and
    looks like Figure 9-24.






26. Return to the Control Flow tab and add an Execute SQL task. Name it Update SCD
    Type 2 table. Connect the previous Data Flow task to it and configure it as follows:

    Connection:     CarSales_Staging_OLEDB
    SQL Statement:

    UPDATE     SCDSSIS2
    SET        SCDSSIS2.IsCurrent = 0
              ,SCDSSIS2.ValidTo = YEAR(DATEADD(d,-1,GETDATE())) * 100000
                                  + MONTH(DATEADD(d,-1,GETDATE())) * 1000
                                  + DAY(DATEADD(d,-1,GETDATE()))
    FROM       dbo.Client_SCDSSIS2 SCDSSIS2
               INNER JOIN Tmp_Client_SCDSSIS2 TMP
               ON SCDSSIS2.SurrogateID = TMP.SurrogateID
    WHERE      SCDSSIS2.IsCurrent = 1
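
The date arithmetic behind ValidFrom (step 5) and ValidTo (step 26) can be checked in isolation before it is wired into the package. The following is a minimal T-SQL sketch and not part of the recipe: it substitutes a fixed, hypothetical run date for GETDATE() so that the results are repeatable, and it adds the conventional yyyymmdd form for comparison.

```sql
-- Sketch only: @Today is a hypothetical fixed date standing in for GETDATE().
DECLARE @Today     DATE = '20240517';
DECLARE @Yesterday DATE = DATEADD(d, -1, @Today);

SELECT
    -- Encoding used by the ValidFrom variable (step 5)
    YEAR(@Today) * 100000 + MONTH(@Today) * 1000 + DAY(@Today)             AS ValidFromKey,  -- 202405017
    -- Encoding used for ValidTo in the UPDATE (step 26)
    YEAR(@Yesterday) * 100000 + MONTH(@Yesterday) * 1000 + DAY(@Yesterday) AS ValidToKey,    -- 202405016
    -- Conventional eight-digit yyyymmdd key, shown for comparison only
    CONVERT(INT, CONVERT(CHAR(8), @Today, 112))                            AS YyyymmddKey;   -- 20240517
```

With the multipliers as printed, the keys are nine digits long. They still sort chronologically, but they are not in the eight-digit yyyymmdd format that recipe 9-27 produces. If your dimension already holds yyyymmdd keys, multipliers of 10000 and 100 would be the matching choice; treat that as an assumption to verify against your own table.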



How It Works

Unfortunately, processing slowly changing dimensions efficiently in SSIS is neither intuitive
nor as simple as it could be. Yes, there is (and has been since SSIS appeared) a Slowly Changing
Dimension transform, but it is more a learning tool than a real-world piece of kit, and I will
not recommend using it here.

So you are left with the moderately unpalatable solution of writing your own SSIS package to
handle slowly changing dimensions. Luckily, a Type 1 SCD is nothing more than an in-place update
(or an insert for new data). When it comes to Type 2 SCDs, however, the package is fairly
complex, so here is an overview of what it does:

•   Links to the current dimension table.

•   Links to the source data table or view.

•   Detects any new dimension records (by mapping the business key between the two data
    sets and detecting where there are no matches) and loads them.

•   Analyzes all existing records in the two data sources, and compares attribute fields. Where
    any of these differ, two things happen:

    •   A new record is added for the latest version of the dimension data (suitably flagged
        as valid).

    •   The previously valid version is updated, with its validity set to false and the date it
        ceased being valid recorded.

To get this to work, the updates to the existing records use a temporary table to identify the
records to update. In this way, a set of record IDs for later update can be stored as part of
the core data flow process, and the update command can be run once this "core" process has
finished. Initially this table is persisted on disk (dbo.Tmp_Client_SCDSSIS2). It is replaced by
the session-scoped temporary table ##Tmp_Client_SCDSSIS2 once all debugging is completed. The
main advantage of this approach is that it avoids using an OLEDB Command transform as part of
the data flow, which would fire once for every record yet apply to the entire dimension table,
causing massive unnecessary work for the server.

This means that the overall high-level process looks like Figure 9-23.






Figure 9-23. SSIS SCD Type 2 process

The "core" process (the Data Flow task from Figure 9-23) looks like Figure 9-24.



Figure 9-24. Detail of SSIS SCD Type 2 process
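
For comparison, the same sequence (record the changed rows, insert new rows and new versions, then expire the old versions) can be written as set-based T-SQL. This is a rough sketch under several assumptions that the recipe does not confirm: SurrogateID is an IDENTITY column, ValidTo is nullable, each business key has at most one current row, and only ClientName and ClientSize are compared, without NULL handling. The #Changed table is a hypothetical stand-in for Tmp_Client_SCDSSIS2.

```sql
USE CarSales_Staging;
GO

-- Same integer date encoding as the package uses for ValidFrom and ValidTo
DECLARE @Yesterday DATETIME = DATEADD(d, -1, GETDATE());
DECLARE @ValidFrom INT = YEAR(GETDATE()) * 100000 + MONTH(GETDATE()) * 1000 + DAY(GETDATE());
DECLARE @ValidTo   INT = YEAR(@Yesterday) * 100000 + MONTH(@Yesterday) * 1000 + DAY(@Yesterday);

-- 1. Record the surrogate keys of current rows whose attributes differ from the source
--    (the role played by the Conditional Split, the Multicast, and the temp table destination)
CREATE TABLE #Changed (SurrogateID INT NOT NULL);

INSERT INTO #Changed (SurrogateID)
SELECT  DIM.SurrogateID
FROM    dbo.Client_SCDSSIS2 AS DIM
        INNER JOIN CarSales.dbo.Client AS SRC
        ON SRC.ID = DIM.BusinessKey
WHERE   DIM.IsCurrent = 1
        AND (SRC.ClientName <> DIM.ClientName OR SRC.ClientSize <> DIM.ClientSize);

-- 2. Insert brand-new clients and new versions of changed clients
--    (the role played by the Merge, the Derived Column, and the fast-load destination)
INSERT INTO dbo.Client_SCDSSIS2
        (BusinessKey, ClientName, Country, Town, County, Address1, Address2,
         ClientType, ClientSize, ValidFrom, IsCurrent)
SELECT  SRC.ID, SRC.ClientName, SRC.Country, SRC.Town, SRC.County, SRC.Address1, SRC.Address2,
        SRC.ClientType, SRC.ClientSize, @ValidFrom, 1
FROM    CarSales.dbo.Client AS SRC
        LEFT JOIN dbo.Client_SCDSSIS2 AS DIM
        ON DIM.BusinessKey = SRC.ID AND DIM.IsCurrent = 1
        LEFT JOIN #Changed AS CHG
        ON CHG.SurrogateID = DIM.SurrogateID
WHERE   DIM.SurrogateID IS NULL         -- no current row: a new client
        OR CHG.SurrogateID IS NOT NULL; -- current row differs: a new version

-- 3. Expire the previous versions in one statement (the final Execute SQL task)
UPDATE  DIM
SET     DIM.IsCurrent = 0
       ,DIM.ValidTo = @ValidTo
FROM    dbo.Client_SCDSSIS2 AS DIM
        INNER JOIN #Changed AS CHG
        ON DIM.SurrogateID = CHG.SurrogateID
WHERE   DIM.IsCurrent = 1;

DROP TABLE #Changed;
```

The point the sketch shares with the package is the ordering: the keys to expire are captured before the new versions are inserted, so the final UPDATE touches only the old rows and runs once, not once per record.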






The persisted table dbo.Tmp_Client_SCDSSIS2 can be deleted, and the session-scoped table used
in its place, once everything is working. It is, however, vital to use the persisted table
during the development phase.

So, once the package has run, and you have debugged it, you can do the following:

1.  Change the value of the TempTable variable to ##Tmp_Client_SCDSSIS2.

2.  Alter the reference in the task Update SCD Type 2 table so that the temp table used is
    the session-scoped temporary table. The code needs to be tweaked to use:

    INNER JOIN ##Tmp_Client_SCDSSIS2 TMP

3.  Delete the dbo.Tmp_Client_SCDSSIS2 table in the CarSales_Staging database.

Your SSIS package will now use the session-scoped temporary table to update the destination
table records and avoid the need for an extraneous table to persist in the database.
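
Because the Create Temp Table for Session task runs a bare CREATE TABLE, a rerun while the table still exists would fail. A more defensive statement for that task might look like the sketch below; it is a suggested variation and not part of the original recipe.

```sql
-- Sketch only: drop any leftover copy before creating the table for this run.
IF OBJECT_ID('tempdb..##Tmp_Client_SCDSSIS2') IS NOT NULL
    DROP TABLE ##Tmp_Client_SCDSSIS2;

CREATE TABLE ##Tmp_Client_SCDSSIS2
(
    SurrogateID INT NOT NULL
);
```

In SQL Server's own terminology a ## table is a global temporary table: it is visible to every session, and it is dropped when the session that created it ends and no other session is still referencing it. That lifetime is why RetainSameConnection is set to True on CarSales_Staging_OLEDB. Because the table does not exist at design time, you may also need to set DelayValidation to True on the Data Flow task; verify this in your own environment.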



Hints, Tips, and Traps





•   The reason for aliasing the dimension columns in the Lookup transform is that it (a)
    makes it much easier to track the use of columns with identical names in the package,
    and (b) prevents SSIS from applying two- or three-part dotted-notation names that can
    rapidly become extremely hard to manipulate.

•   It is important to sort both data sets (source and lookup) and to tell SSIS
    that they are sorted. Otherwise, you will get some very strange results from the Merge
    transform.

•   Be sure to implement NULL handling or the conditional split will fail.

•   If you are dealing with large dimensions, then you should place the cache file on the
    fastest drive array possible, or even a solid-state disk, if you can. Any speed gains here
    will have a marked effect on the whole package.
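
The NULL-handling hint applies equally to T-SQL comparisons such as those in the MERGE statement of recipe 9-27: a plain <> comparison involving NULL evaluates to UNKNOWN, so the change is silently missed. The small sketch below contrasts a plain comparison with a NULL-safe one built on EXCEPT, which treats two NULLs as equal. The @Compare table variable and its values are made up for illustration.

```sql
-- Sketch only: hypothetical values showing how NULLs affect change detection.
DECLARE @Compare TABLE (ID INT, SrcCountry VARCHAR(50), DimCountry VARCHAR(50));

INSERT INTO @Compare (ID, SrcCountry, DimCountry)
VALUES (1, 'France', 'France'),   -- unchanged
       (2, 'France', NULL),       -- changed from NULL
       (3, NULL,     NULL),       -- unchanged, both NULL
       (4, 'Spain',  'France');   -- changed

SELECT  ID
        -- Plain comparison: returns 0 whenever either side is NULL
       ,CASE WHEN SrcCountry <> DimCountry THEN 1 ELSE 0 END AS PlainDiff
        -- NULL-safe comparison: EXCEPT returns a row only if the two values differ
       ,CASE WHEN EXISTS (SELECT SrcCountry EXCEPT SELECT DimCountry)
             THEN 1 ELSE 0 END AS NullSafeDiff
FROM    @Compare
ORDER BY ID;

-- ID  PlainDiff  NullSafeDiff
-- 1   0          0
-- 2   0          1
-- 3   0          0
-- 4   1          1
```

Row 2 is the case to watch: the plain comparison misses it. Within the Conditional Split itself, the same effect is usually achieved by wrapping each operand with the SSIS ISNULL or REPLACENULL functions.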



9-27. Handling Type 3 Slowly Changing Dimensions Using T-SQL

Problem

You need to track the last version of one or more columns in a denormalized single record in the destination table

using T-SQL.



Solution

Use MERGE in T-SQL to carry out a Type 3 SCD. The following describes how to do it.

1.  Create a destination table containing the business key and an attribute, as well as
    previous version columns for the attribute and the date that it is valid to. The DDL for
    such a destination table is (C:\SQL2012DIRecipes\CH09\tblClient_SCD3.sql):

    CREATE TABLE CarSales_Staging.dbo.Client_SCD3
    (
    ClientID INT IDENTITY(1,1) NOT NULL,
    BusinessKey INT NOT NULL,
    ClientName VARCHAR(150) NULL,
    Country VARCHAR(50) NULL,
    Country_Prev1 VARCHAR(50) NULL,
    Country_Prev1_ValidTo INT NULL,
    Country_Prev2 VARCHAR(50) NULL,
    Country_Prev2_ValidTo INT NULL
    ) ;
    GO

2.  Run the following code snippet (C:\SQL2012DIRecipes\CH09\SCD3.sql):

    USE CarSales_Staging;
    GO

    DECLARE @Yesterday INT = CAST(CAST(YEAR(DATEADD(dd,-1,GETDATE())) AS CHAR(4))
    + RIGHT('0' + CAST(MONTH(DATEADD(dd,-1,GETDATE())) AS VARCHAR(2)),2) + RIGHT('0'
    + CAST(DAY(DATEADD(dd,-1,GETDATE())) AS VARCHAR(2)),2) AS INT)

    MERGE        CarSales_Staging.dbo.Client_SCD3     AS DST
    USING        CarSales.dbo.Client                  AS SRC
    ON           (SRC.ID = DST.BusinessKey)

    WHEN NOT MATCHED THEN

    INSERT (BusinessKey, ClientName, Country)
    VALUES (SRC.ID, SRC.ClientName, SRC.Country)

    WHEN MATCHED
    AND          (DST.Country <> SRC.Country
                  OR DST.ClientName <> SRC.ClientName)

    THEN UPDATE

    SET          DST.Country = SRC.Country
                ,DST.ClientName = SRC.ClientName
                ,DST.Country_Prev1 = DST.Country
                ,DST.Country_Prev1_ValidTo = @Yesterday
                ,DST.Country_Prev2 = DST.Country_Prev1
                ,DST.Country_Prev2_ValidTo = DST.Country_Prev1_ValidTo
    ;
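
The SET clause in the MERGE relies on the fact that, in a single T-SQL UPDATE, every column reference on the right-hand side sees the value the row had before the statement ran, so the order of the assignments does not matter. The sketch below shows the same shift on a throwaway table variable; @Demo and its values are hypothetical.

```sql
-- Sketch only: demonstrates that SET expressions read pre-update column values.
DECLARE @Demo TABLE
(
    Country        VARCHAR(50) NULL,
    Country_Prev1  VARCHAR(50) NULL,
    Country_Prev2  VARCHAR(50) NULL
);

INSERT INTO @Demo (Country, Country_Prev1, Country_Prev2)
VALUES ('Spain', 'France', 'Italy');

UPDATE  @Demo
SET     Country       = 'Germany'        -- the new current value
       ,Country_Prev1 = Country          -- reads the old Country ('Spain')
       ,Country_Prev2 = Country_Prev1;   -- reads the old Country_Prev1 ('France')

SELECT  Country, Country_Prev1, Country_Prev2
FROM    @Demo;
-- Germany   Spain   France   (the oldest value, 'Italy', is lost)
```

Note also that the recipe's WHEN MATCHED condition fires when only ClientName differs; in that case the unchanged Country is still copied into Country_Prev1. If that is not wanted, one option is to wrap the history assignments in CASE expressions that test DST.Country <> SRC.Country.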



How It Works

Let us be clear: a Type 3 SCD applies denormalization to a table so that multiple column sets
are used to provide versioning. I find that it helps to understand Type 3 SCDs as denormalized
data, because this allows you to concentrate on appreciating the limitations and the drawbacks
of this approach. This is another way of saying only use it if you have no other alternative!
A Type 3 SCD is a single table with a set of duplicate columns for each value whose evolution
you wish to track, as well as the date that the data evolved. This is ungainly, necessitates
extremely wide tables, and, at some point, involves losing historical data, since it is
impossible to have previous column versions for all time: SQL Server will run out of columns.
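
To see the denormalization concretely, the version columns can be turned back into rows. The query below is an illustrative sketch against the Client_SCD3 table defined above; the VersionNo column and the V alias are hypothetical names introduced here.

```sql
-- Sketch only: list the current and previous Country values as one row per version.
SELECT  C.BusinessKey
       ,V.VersionNo        -- 0 = current value, 1 and 2 = previous versions
       ,V.Country
       ,V.ValidTo          -- NULL for the current value
FROM    CarSales_Staging.dbo.Client_SCD3 AS C
        CROSS APPLY (VALUES
            (0, C.Country,       CAST(NULL AS INT)),
            (1, C.Country_Prev1, C.Country_Prev1_ValidTo),
            (2, C.Country_Prev2, C.Country_Prev2_ValidTo)
        ) AS V (VersionNo, Country, ValidTo)
WHERE   V.VersionNo = 0
        OR V.Country IS NOT NULL   -- skip history slots that have never been filled
ORDER BY C.BusinessKey, V.VersionNo;
```

Each client can return at most three rows, however often its country has changed. Keeping more history means adding another pair of columns to the table and another line to every query of this kind.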


