9-25. Handling Type 2 Slowly Changing Dimensions in T-SQL


Chapter 9 ■ Data Transformation



DECLARE @Today INT = CAST(CAST(YEAR(GETDATE()) AS CHAR(4))
                   + RIGHT('0' + CAST(MONTH(GETDATE()) AS VARCHAR(2)),2)
                   + RIGHT('0' + CAST(DAY(GETDATE()) AS VARCHAR(2)),2) AS INT);

-- Insert statement for the latest, newest update to an existing dimension record
INSERT INTO dbo.Client_SCD2 (BusinessKey, ClientName, Country, Town, County, Address1,
                             Address2, ClientType, ClientSize, ValidFrom, IsCurrent)

SELECT ID, ClientName, Country, Town, County, Address1, Address2, ClientType, ClientSize,
       @Today, 1
FROM
(
    -- Merge statement
    MERGE CarSales_Staging.dbo.Client_SCD2 AS DST
    USING CarSales.dbo.Client AS SRC
    ON (SRC.ID = DST.BusinessKey)

    -- New business key: add the first version of the record
    WHEN NOT MATCHED THEN
        INSERT (BusinessKey, ClientName, Country, Town, County, Address1, Address2,
                ClientType, ClientSize, ValidFrom, IsCurrent)
        VALUES (SRC.ID, SRC.ClientName, SRC.Country, SRC.Town, SRC.County, SRC.Address1,
                SRC.Address2, SRC.ClientType, SRC.ClientSize, @Today, 1)

    -- Only the current version is compared, so that historical rows are left untouched
    WHEN MATCHED
    AND DST.IsCurrent = 1
    AND (
           ISNULL(DST.ClientName,'') <> ISNULL(SRC.ClientName,'')
        OR ISNULL(DST.Country,'') <> ISNULL(SRC.Country,'')
        OR ISNULL(DST.Town,'') <> ISNULL(SRC.Town,'')
        OR ISNULL(DST.Address1,'') <> ISNULL(SRC.Address1,'')
        OR ISNULL(DST.Address2,'') <> ISNULL(SRC.Address2,'')
        OR ISNULL(DST.ClientType,'') <> ISNULL(SRC.ClientType,'')
        OR ISNULL(DST.ClientSize,'') <> ISNULL(SRC.ClientSize,'')
    )

    -- Update statement for a changed dimension record, to flag as no longer active
    THEN UPDATE
        SET DST.IsCurrent = 0, DST.ValidTo = @Yesterday

    -- Pass the source values of every affected row to the outer INSERT
    OUTPUT SRC.ID, SRC.ClientName, SRC.Country, SRC.Town, SRC.County, SRC.Address1, SRC.Address2,
           SRC.ClientType, SRC.ClientSize, $Action AS MergeAction
) AS MRG

WHERE MRG.MergeAction = 'UPDATE'
;

The destination table will have new records added, and changed records added and flagged as being the

latest version.






How It Works

This technique is powerful when data has to be updated and a history of the changes maintained in SQL Server tables. As we are dealing with SCDs, I will presume that we will not be using WHEN NOT MATCHED BY SOURCE to delete dimension data, and will only look at UPSERTing data (that is, inserting and updating dimension data). A Type 2 SCD tracks changes by keeping historical records in a table, while indicating the valid date range of each record and flagging the current data. The approach also has many uses outside a data warehousing environment.

To illustrate all of this, I take the dbo.Client table in the CarSales database as the source data, and export this data, suitably transformed, into the CarSales_Staging database. This example presumes that the business key is a unique primary key.

Handling a Type 2 SCD is only slightly more complex than a Type 1. What this process does is track changes over time to the dimension attributes by

• Adding a new record every time a dimension attribute (the value in any column containing the descriptive data) changes.

• Logging the date when the change occurs, thus ensuring that the start date for each new record is kept, as well as the date the previous record ceased being valid.

• Flagging the current record for every business key, which can improve reporting performance.



This supposes a destination table containing all the values required, a business key (the client ID from the source data), and a surrogate key that will be used for data warehousing. As well as these “core” fields, the following fields are required to track the evolution of the dimension over time:

ValidFrom: Logs the date from which this dimension record was valid.

ValidTo: Logs the date until which this dimension record was, or is, valid.

IsCurrent: Indicates that this is the current record (the most recent data) for a dimension element.
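
To make the role of these three fields more concrete, here is a minimal sketch of how they might be queried. The #ClientDim temp table, its reduced column list, and the sample rows are hypothetical and exist only for this illustration. I am also assuming that ValidTo stays NULL while a record is current, which is what the INSERT statements in this recipe produce.

```sql
-- Minimal sketch: a cut-down Type 2 dimension held in a temp table.
-- Table name, columns, and sample rows are hypothetical.
CREATE TABLE #ClientDim
(
    SurrogateID INT IDENTITY(1,1) NOT NULL,
    BusinessKey INT NOT NULL,
    Town        VARCHAR(50) NULL,
    ValidFrom   INT NOT NULL,   -- YYYYMMDD
    ValidTo     INT NULL,       -- YYYYMMDD; NULL while the row is current
    IsCurrent   BIT NOT NULL
);

INSERT INTO #ClientDim (BusinessKey, Town, ValidFrom, ValidTo, IsCurrent)
VALUES (1, 'Stoke',  20120101, 20120614, 0),
       (1, 'London', 20120615, NULL,     1),
       (2, 'Paris',  20120301, NULL,     1);

-- Current version of every client: one row per business key.
SELECT BusinessKey, Town
FROM   #ClientDim
WHERE  IsCurrent = 1;

-- Version of each client that was valid on a given day.
DECLARE @AsOf INT = 20120401;

SELECT BusinessKey, Town
FROM   #ClientDim
WHERE  ValidFrom <= @AsOf
  AND  (ValidTo IS NULL OR ValidTo >= @AsOf);

DROP TABLE #ClientDim;
```

With these sample rows, the first query returns London for client 1 and Paris for client 2, whereas the second returns Stoke for client 1, because that was the version in force on April 1, 2012.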

The preceding SQL snippet maps the two tables on the business key and inserts a new record into the

destination table if the record referenced by the business key is not already present (WHEN NOT MATCHED), as well

as adding an auto-incremented surrogate key and setting today’s date as the ValidFrom date.

If any differences between the source and destination tables are detected (using the WHEN MATCHED AND ... clause), then several events are triggered. First, the existing record for this business key is flagged as no longer current, and yesterday’s date is set as its ValidTo date. Then a new record containing the latest source data is inserted; it receives a new auto-incremented surrogate key and today’s date as its ValidFrom date, and is flagged as the current record.

All of this may seem complex, but it is made extremely simple using the OUTPUT clause of the MERGE

statement. What the code does is carry out the INSERT and UPDATE as before for new and existing records, and

then it selects the UPDATEd records (funneled via the OUTPUT clause) as a separate INSERT. This way, the latest

modifications to the source data become a new record—handled as a separate INSERT—and any required

tracking data, such as the ValidFrom date and the IsCurrent flag, is added at this stage.
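
The following is a cut-down sketch of the same INSERT-over-MERGE pattern that can be run in isolation. The #Src and #Dim temp tables, the single Town attribute, and the sample data are hypothetical, and the way @Today and @Yesterday are calculated here is only one possible approach. The sketch also tests DST.IsCurrent = 1 in the WHEN MATCHED clause, on the assumption that only the current version of a record should ever be compared and closed.

```sql
-- Self-contained sketch of the INSERT-over-MERGE pattern on temp tables.
-- #Src, #Dim, and the sample data are hypothetical.
CREATE TABLE #Src (ID INT NOT NULL PRIMARY KEY, Town VARCHAR(50) NULL);

CREATE TABLE #Dim
(
    SurrogateID INT IDENTITY(1,1) NOT NULL,
    BusinessKey INT NOT NULL,
    Town        VARCHAR(50) NULL,
    ValidFrom   INT NULL,
    ValidTo     INT NULL,
    IsCurrent   BIT NULL
);

DECLARE @Today     INT = CONVERT(INT, CONVERT(CHAR(8), GETDATE(), 112));
DECLARE @Yesterday INT = CONVERT(INT, CONVERT(CHAR(8), DATEADD(DAY, -1, GETDATE()), 112));

-- Existing dimension row, loaded on an earlier (arbitrary) date.
INSERT INTO #Dim (BusinessKey, Town, ValidFrom, IsCurrent)
VALUES (1, 'Stoke', 20120101, 1);

-- The source now holds a changed client (1) and a brand-new client (2).
INSERT INTO #Src (ID, Town) VALUES (1, 'London'), (2, 'Paris');

-- Outer INSERT: adds the new version of every record the MERGE closed.
INSERT INTO #Dim (BusinessKey, Town, ValidFrom, IsCurrent)
SELECT ID, Town, @Today, 1
FROM
(
    MERGE #Dim AS DST
    USING #Src AS SRC
    ON (SRC.ID = DST.BusinessKey)
    -- New business key: insert the first version directly.
    WHEN NOT MATCHED THEN
        INSERT (BusinessKey, Town, ValidFrom, IsCurrent)
        VALUES (SRC.ID, SRC.Town, @Today, 1)
    -- Changed attribute on the current version: close that version.
    WHEN MATCHED AND DST.IsCurrent = 1
                 AND ISNULL(DST.Town, '') <> ISNULL(SRC.Town, '') THEN
        UPDATE SET DST.IsCurrent = 0, DST.ValidTo = @Yesterday
    OUTPUT SRC.ID, SRC.Town, $action AS MergeAction
) AS MRG
WHERE MRG.MergeAction = 'UPDATE';

-- Inspect the resulting history.
SELECT SurrogateID, BusinessKey, Town, ValidFrom, ValidTo, IsCurrent
FROM   #Dim
ORDER BY BusinessKey, ValidFrom;

DROP TABLE #Src, #Dim;
```

After the script runs, client 1 should have two rows (the closed Stoke version and the current London version) and client 2 a single current row. Running the load a second time against unchanged source data should add nothing, since no current record differs from its source.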



Hints, Tips, and Traps





• The ValidFrom and ValidTo dates are added as INT data types, rather than DATE or DATETIME, purely with a view to future loading into Analysis Services, where the choice of data type can not only make dates easier to manipulate, but can also enhance processing times. Feel free to use any of the date data types if you need to.










• I am presuming a 24-hour cycle on dimension date validity, for the sake of simplicity. If your requirements are more complex, then the validity range of the dimension record can be any valid date range.







• You may not need to check for data differences on all columns. If certain columns contain data that is not considered an essential attribute, then do not use them in the WHEN MATCHED ... AND clause.
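
As a sketch of how the integer dates in the first two hints might be handled, the snippet below converts between DATE values and YYYYMMDD integers using CONVERT style 112. The variable names are illustrative only, and this is just one way of producing values such as @Today and @Yesterday.

```sql
-- Sketch: moving between DATE values and YYYYMMDD integer keys.
-- Variable names are illustrative only.
DECLARE @Now DATE = GETDATE();

-- DATE -> INT, using style 112 (yyyymmdd)
DECLARE @Today     INT = CONVERT(INT, CONVERT(CHAR(8), @Now, 112));
DECLARE @Yesterday INT = CONVERT(INT, CONVERT(CHAR(8), DATEADD(DAY, -1, @Now), 112));

-- INT -> DATE, for when a real date is needed for date arithmetic
DECLARE @TodayAsDate DATE = CONVERT(DATE, CONVERT(CHAR(8), @Today), 112);

SELECT @Today AS Today, @Yesterday AS Yesterday, @TodayAsDate AS TodayAsDate;
```

Note that yesterday is calculated with DATEADD on the date before converting to an integer. Subtracting 1 from the integer itself would give an invalid key on the first day of a month (20120701 - 1 is 20120700).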



9-26. Handling Type 2 Slowly Changing Dimensions with SSIS

Problem

You need to track the changes over time as source data is added to a destination table as part of an SSIS data flow

process.



Solution

Use an SSIS package that uses a Conditional Split and an Execute SQL task to treat this as a Type 2 SCD. The following shows you how.

1. In the destination database (CarSales_Staging), create a table for the Client_SCDSSIS2 Type 2 dimension. The DDL for this table is (C:\SQL2012DIRecipes\CH09\tblClientSCDSSIS2.sql):

USE CarSales_Staging;
GO

CREATE TABLE CarSales_Staging.dbo.Client_SCDSSIS2
(
SurrogateID INT IDENTITY(1,1) NOT NULL,
BusinessKey INT NOT NULL,
ClientName VARCHAR(150) NULL,
Country VARCHAR(50) NULL,
Town VARCHAR(50) NULL,
County VARCHAR(50) NULL,
Address1 VARCHAR(50) NULL,
Address2 VARCHAR(50) NULL,
ClientType VARCHAR(20) NULL,
ClientSize VARCHAR(10) NULL,
ValidFrom INT NULL,
ValidTo INT NULL,
IsCurrent BIT NULL
);
GO



2. Create a temporary table (on disk, in the staging database, while creating and testing the package) that will be used to update the dimension table. The DDL for this table is:




CREATE TABLE CarSales_Staging.dbo.Tmp_Client_SCDSSIS2
(
SurrogateID INT NOT NULL
);
GO

3. Create a new SSIS package named SCD_Type2. Add two OLEDB connection managers: one named CarSales_OLEDB, connecting to the CarSales database, the other named CarSales_Staging_OLEDB, connecting to the CarSales_Staging database. Set the RetainSameConnection property for the latter to True.



4. Add the following two variables:

Name       Scope      DataType  Value           Comments
TempTable  SCD_Type2  String    Tmp_Client_SCD  Used to switch from the database-based table to the temporary table.
ValidFrom  SCD_Type2  Int32                     Automatically gets the current date.

5. Set the EvaluateAsExpression property to True for the ValidFrom variable. Set the expression (this will get the current date as an integer in YYYYMMDD form) as:

YEAR(GETDATE()) * 10000 + MONTH(GETDATE()) * 100 + DAY(GETDATE())



6. In the Control Flow tab, add an Execute SQL task, and configure it as follows:

Name:           Create Temp Table for Session
Connection:     CarSales_Staging_OLEDB
SQL Statement:  CREATE TABLE ##Tmp_Client_SCDSSIS2
                (SurrogateID INT NOT NULL)



7. Even if you will not be using this temporary table until the package is debugged and functioning, it is just as well to create it now. Click OK to confirm your changes.



8. Add a Data Flow task and name it Main SCD Type 2 Process. Connect the previous Execute SQL task (Create Temp Table for Session) to it. Double-click to edit.


