10-17. Controlling Data Flow via Profile Metadata

5.  Add a Script component that you set as a transform and connect the Flat File source to it. Double-click to edit. Add the IsSafeToProceed variable as a read-write variable. Click InputColumns and select the column to profile (InvoiceNumber in this example). Set the ScriptLanguage to Microsoft Visual Basic 2010 and click Edit Script.



6.  Replace ScriptMain with the following code:

Public Class ScriptMain
    Inherits UserComponent

    ' Running counters, accumulated row by row in Input0_ProcessInputRow
    Dim NullCounter As Integer
    Dim RowCounter As Integer

    Public Overrides Sub PreExecute()
        MyBase.PreExecute()
    End Sub

    Public Overrides Sub PostExecute()
        ' The IsSafeToProceed variable is assumed to default to True; it is only
        ' switched to False if 25 percent or more of the rows have a NULL value.
        ' The RowCounter guard avoids a meaningless ratio when no rows arrive.
        If RowCounter > 0 AndAlso (NullCounter / RowCounter) >= 0.25 Then
            Me.Variables.IsSafeToProceed = False
        End If
    End Sub

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        If Row.InvoiceNumber_IsNull Then
            NullCounter = NullCounter + 1
        End If
        RowCounter = RowCounter + 1
    End Sub

End Class



7.  Close the Script editor and confirm your modifications with OK.



8.  Add a Raw File destination. Name it Pause Output and connect the Script component to it. Double-click to edit.



9.  Configure the Raw File destination as follows:

    Access Mode:     FileName
    File Name:       C:\SQL2012DIRecipes\CH10\Invoice.Raw
    Write Option:    Create always




10.  Ensure that all the columns are selected, and then confirm your modifications with OK.



11.  Return to the Control Flow pane and add a second Data Flow task. Name it Final Load and connect the first Data Flow task to it. Double-click the connector and set the Precedence Constraint Editor as follows. The dialog box should then look like Figure 10-22.

    Evaluation Operation:    Expression and Constraint
    Value:                   Success
    Expression:              @IsSafeToProceed

Figure 10-22.  Precedence Constraint Editor in SSIS

12.  Confirm your changes.



13.  Double-click the second Data Flow task to edit it.



14.  Add a Raw File source named Continue Load, which you configure to use the same file that you defined in step 9 (C:\SQL2012DIRecipes\CH10\Invoice.Raw).



15.  Add an OLEDB destination named Final Load, to which you connect the Raw File source. Configure it to load the data into the CarSales_Staging database, creating a table from the OLEDB destination. The final package should look like Figure 10-23.




Figure 10-23.  Controlled data flow in SSIS



How It Works

Profiling data as it is loaded allows you to capture metrics that you specify, which can then be used to halt

processing if necessary. Inevitably this means pausing the data flow while counters are finalized and analyzed.

So the trick is to find a way to stage the data efficiently while awaiting the results of the profiling. My preferred

solution is to use a RAW file destination to hold the data temporarily, and then continue the final load into the

destination table(s) if the profiling has found no anomalies. This will slow down the process to some extent, but

as it is nearly always the final load that is the slowest part of an SSIS package, it avoids a wasted load and reload if

there is a problem. Outputting data into a raw file is extremely fast—and in virtually all cases should prove faster

than loading into a staging table. Ideally, place the RAW file on the server hosting SQL Server itself, on a fast disk array if you can.

The code used here is a very simple example to illustrate the principle. During the data load, the number of
NULLs in a column is counted; if it reaches 25 percent or more, the process halts. This test is made in the

PostExecute method. If the number of NULLs is below this threshold, then the package will continue into its final

phase—loading the data into the destination tables.

There are other solutions, of course. You can profile the data once it has been loaded, for instance. This

is probably the easiest solution, as you can use any combination of the T-SQL techniques described in this

chapter to produce a data profile that should alert you to any potential issues with the data load. Also any stored

procedures or SQL code can be run as Execute SQL tasks at the end of an SSIS package and can output alerts to

your logging infrastructure, as described in Chapter 12.
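As a rough illustration of such a post-load check run from an Execute SQL task, the following sketch fails the task (and therefore lets the package halt or divert processing) when the proportion of NULLs reaches the 25 percent threshold used in this recipe. The staging table and column names (dbo.Invoice_Staging and InvoiceNumber) are placeholders, not objects created in this chapter.

DECLARE @TotalRows INT,
        @NullRows  INT;

SELECT  @TotalRows = COUNT(*),
        @NullRows  = SUM(CASE WHEN InvoiceNumber IS NULL THEN 1 ELSE 0 END)
FROM    dbo.Invoice_Staging WITH (NOLOCK);

-- Severity 16 makes the Execute SQL task fail, which the control flow
-- can then use to stop or redirect the rest of the package.
IF @TotalRows > 0 AND (1.0 * @NullRows / @TotalRows) >= 0.25
    RAISERROR('Profile check failed: too many NULL InvoiceNumber values.', 16, 1);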






Alternatively, you can choose to profile data while it is being loaded. There are two possibilities here:

•  Profile any staged data using T-SQL from an Execute SQL task.

•  Insert a Multicast task after the data source and profile the data using some of the various SSIS scripts shown earlier in the chapter. Using a separate data path will avoid slowing down the processing if asynchronous transforms are used.



Hints, Tips, and Traps

•  This recipe's script is ridiculously simple, but it is there to give you an idea of how to profile data during a load, and use the results of the profiling to control the package flow.

•  You can extend the profiling using any of the SSIS techniques described earlier in this chapter.



Summary

This chapter showed many ways of profiling source data using SQL Server. The question is which method to

choose and when. Obviously, the answer will depend on your circumstances and requirements. You can profile

data not yet in SQL Server if you can connect to it. This essentially means using an OLEDB provider (over ODBC if
necessary) so that you can run T-SQL commands against the source. However, in practice, this can be extremely slow. If you are

using SSIS, then in any real-world setting the data has to be in SQL Server, or the workarounds to get the SSIS

Data Profiling task to read the source can make it unbearably slow.

If your data has already been staged into SQL Server, then the horizons that open to you (at least as far as

profiling data is concerned) are considerably wider. You can use SSIS or T-SQL to profile the data, and for a quick

data profile, the SSIS Data Profiling task can be set to run virtually instantly. You can also store the profile output

as XML or shred it into an SQL Server table.

Once you have profiled your data, you can then use the results to make decisions as to whether or not to

continue the data load. This presumes that the source data either is a RAW file on disk or has been loaded into a

staging table. In either case, your source data can then proceed to the next stage of the ETL process.

As a schematic overview, Table 10-8 is my take on the various methods and their advantages and

disadvantages.

Table 10-8.  Advantages and Disadvantages of the Techniques Described in This Chapter

Technique                    Advantages                              Disadvantages
T-SQL Profiling              Relatively simple.                      Requires multiple specific code snippets.
SSIS Profile Task            Easy to use.                            Limited. Requires XML viewer or
                                                                     cumbersome output workaround.
Custom SSIS Profiling        Easy to integrate into an existing      Complex to set up.
                             SSIS package.
T-SQL Profiling Script       Copy and paste.                         Can provide too much information.
SSIS Script Task Profiling   Highly adaptable and extensible.        Complex.
SSIS Pattern Profiling       Easy to set up.
T-SQL Pattern Profiling      Easy to set up.                         Very slow.



Chapter 11



Delta Data Management

Fortunately, much of data integration is a relatively simple case of transferring all the data from a source

(whatever it may be) to a destination (hopefully SQL Server). The destination is cleared out before the load

process, and the objective is to get the data from A to B accurately, completely, and in the shortest possible time.

Yet this basic scenario will inevitably not match everybody’s requirements all of the time. When developing ETL

solutions you may frequently be called upon to ensure that only changes in data are applied from the source to

the destination. Detecting data changes, and only (re)loading a subset of changed data, will be the subject of this

chapter. In an attempt to give a simple title to a subject that can prove quite complex, I propose calling this delta

data management.



Preamble: Why Bother with Delta Data?
There are several reasons why spending the time to set up delta data handling can pay big dividends:

Time saved: If the source data contains only a small percentage of inserts, updates, or deletes, then applying only these changes instead of reloading a huge amount of data can prove considerably faster.

Reduced network load: If you are able to isolate modified data at the source, then the quantities of data sent from the source to the destination can be reduced considerably.

Less stress on the server(s): This means reduced processor utilization and less disk activity.

Less blocking: While blocking is hopefully not a major problem in a reporting or analysis environment (especially during overnight processes), it can be an issue in certain circumstances, and so is best kept to a minimum.

In any case, you might find it easier to consider the arguments—both those for managing delta data and

those against it; they are provided in Table 11-1.






Table 11-1.  A Brief Overview of the Advantages and Disadvantages of Loading Delta Data

              Advantages                Disadvantages
Full loads    Simplicity.               Can be much slower.
                                        Resource consumption.
                                        Blocking.
Delta data    Can be much faster.       Complexity.
              Less network load.        Time to implement.
              Less server load.         Harder to debug and maintain.



It follows that when faced with an ETL project, you must always ask yourself the question “is it worth the

effort required to implement delta data loading techniques?” As is so often the case when faced with these

sorts of question, it is largely impossible to say immediately where the thresholds for efficiency gains lie. So be

prepared to apply some basic testing to get your answer, unless there is another compelling reason to choose the

“truncate and load” technique over the development of a—possibly quite complex—delta data load process.



Delta Data Approaches
At the risk of oversimplification, let's say that there are two main ways of detecting data changes:

At source: A method to flag changed records, including an indication of the change type (insert, delete, or update).

During the ETL load: A method to compare source and destination records and isolate those that differ.

This simple overview quickly requires a little more in-depth analysis.



Detecting Changes at Source
Flagging the data change at source probably comes closest to an ideal solution from the ETL developer's point of view, for the following reasons (a sketch of one source-side mechanism follows this list):

•  Only changed records need to be moved between systems.

•  No full comparison between two data sets is required because the source system maintains its own history of changes.

•  Delta information may even be held in separate tables, avoiding expensive scans of large source tables to isolate delta subsets.

•  The work is most often done by the source system DBA.
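One low-overhead way of flagging changes at source is SQL Server change tracking, which is the subject of Chapter 12. The following is only a minimal sketch, assuming the source is the CarSales sample database and that you are tracking the Invoice_Lines table; the retention settings are purely illustrative.

-- Enable change tracking at the database level (retention values are illustrative)
ALTER DATABASE CarSales
SET CHANGE_TRACKING = ON
    (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);
GO

-- Track changes (and which columns changed) on the source table
ALTER TABLE dbo.Invoice_Lines
ENABLE CHANGE_TRACKING
WITH (TRACK_COLUMNS_UPDATED = ON);
GO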



Detecting Changes During the ETL Load

Unfortunately, many source systems are destined to remain closed-off “black boxes” to the ETL developer. This

is often because the source system DBAs will not countenance anything perceived as adding system overhead

to their databases. Such recalcitrance on the part of DBAs can even extend to low-overhead solutions such as

change tracking and change data capture, which are the subject of Chapter 12. Another possible reason for being






unable to tweak a source system is that it is a third-party development where any modifications are practically

or legally impossible. So, if you are not allowed to touch the source system in any way, you have to compare data

sets and deduce changes during the load process.

However, performing data comparisons during (or, as we shall see in some cases, before) a load process can

nonetheless allow for faster data loads with less network and server stress. So what we need to consider here is

how to compare data.

Put simply, there are two main record comparison approaches:

•  Compare columns between source and destination data sets for each important field. This does not necessarily need to be all columns, and can be a subset that reflects the requirements of how the destination data will be used.

•  Use an indicator field that is stored in both source and destination data sets (a sketch of a hash-based indicator follows this list). This can be the following:

   •  A date added and/or date updated field.

   •  A hash field (a checksum).

   •  A ROWVERSION field (formerly called a TIMESTAMP field, even though it has nothing to do with the time or the date). Indeed, any other counter field that increments when the data changes will do.
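As a rough sketch of the hash-field approach, the following adds and populates a hash column on the Invoice_Lines sample table. The column list, the delimiter, and the SHA1 algorithm are illustrative choices, and remember that HASHBYTES in SQL Server 2012 accepts no more than 8,000 bytes of input.

-- Add a hash column to act as a change indicator (illustrative only)
ALTER TABLE dbo.Invoice_Lines
ADD RowHash VARBINARY(20) NULL;
GO

-- Populate it from the columns whose changes matter to the ETL process
UPDATE  dbo.Invoice_Lines
SET     RowHash = HASHBYTES('SHA1',
                    ISNULL(CAST(InvoiceID AS VARCHAR(20)), '')
            + '|' + ISNULL(CAST(StockID AS VARCHAR(20)), '')
            + '|' + ISNULL(CAST(SalePrice AS VARCHAR(30)), ''));
GO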



In either case, you will look for three kinds of differences (a T-SQL sketch of such a comparison follows this list):

•  Data present in the source and absent from the destination (inserts).

•  Data present in the destination and missing in the source (deletes).

•  Data present in both the source and the destination, but where the comparison indicates a change (updates).
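To make the three cases concrete, here is a minimal T-SQL sketch that classifies differences by key. It assumes, as elsewhere in this chapter, that the CarSales source and the CarSales_Staging destination sit on the same instance and that both hold an Invoice_Lines table with an ID key and a VersionStamp indicator column.

SELECT  COALESCE(SRC.ID, DST.ID) AS ID,
        CASE
            WHEN DST.ID IS NULL THEN 'INSERT'   -- in the source, not in the destination
            WHEN SRC.ID IS NULL THEN 'DELETE'   -- in the destination, not in the source
            ELSE 'UPDATE'                       -- in both, but the indicator differs
        END AS DeltaType
FROM    CarSales.dbo.Invoice_Lines AS SRC
        FULL OUTER JOIN CarSales_Staging.dbo.Invoice_Lines AS DST
            ON SRC.ID = DST.ID
WHERE   SRC.ID IS NULL
        OR DST.ID IS NULL
        OR SRC.VersionStamp <> DST.VersionStamp;

In practice, you would feed the three groups into separate INSERT, DELETE, and UPDATE (or MERGE) operations, which is exactly what the recipes in this chapter build up to.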



So, let’s start looking at ways of applying this knowledge to data loads in practice. I will begin with the more

usual case of data comparison during load, and then look at ways of applying delta flags in source systems.

Remember that the recipes in this chapter exist to solve different ETL challenges and circumstances. Each has its

advantages and drawbacks. In the real world, you may well find yourself mixing and matching techniques from

different recipes in order to solve an ETL challenge.

As is customary in this book, you will have to download the sample data from the book’s companion website

if you wish to follow the examples. This means creating the sample databases CarSales and CarSales_Staging,

which are used in virtually all the examples in this chapter. I will use the CarSales database as the source of data,

and the CarSales_Staging database as the destination database in the examples in this chapter.

The sample data you can download will always have a key column, and if you alter the examples for your own
data, then you should also have a key column. As you may already have discovered in your career with SQL Server,
loading delta data without key columns varies from extremely difficult to impossible. Remember also that an
SQL Server ROWVERSION is a unique binary number that is incremented each time the row is modified, and it has
nothing to do with date and time.

Before leaping into the recipes in this chapter, you need to be forewarned that some of them are longer than

those seen elsewhere in this book. Indeed, at first sight they may seem complex to implement. The best advice

that I can give you is to read them through thoroughly a couple of times before applying them to your own ETL

challenges. I particularly recommend that you take a good look at the Control Flow and Data Flow figures, where

given, to get a clearer understanding of the process in each case. Also when you are creating your own ETL

solutions based on these ideas, do not hesitate to look ahead to what is to come, and to skip to the “How It Works”

section to ensure that you have understood the whys and wherefores of a process.

Finally, as delta data management has to handle a set of challenges for which there is only a finite set of

solutions, some of the recipes in this chapter do have similar phases. Rather than reiterate the same information

over and over again, I do occasionally require you to refer to elements in other recipes in the chapter in order to

complete a specific step in a process.






11-1. Loading Delta Data as Part of a Structured ETL Process

Problem

You want to load only new data, update any changed records, and remove any deleted records during regular

data loads.



Solution

Use SSIS and detect the way in which the source data has changed. Then process the destination data accordingly

to apply inserts, updates, and deletes to the destination table. I’ll explain how this is done.

1.  Create three tables in the destination database (CarSales_Staging) using the following DDL (dropping them first, if you have created them for use with another recipe, of course) (C:\SQL2012DIRecipes\CH11\DeltaInvoiceTables.Sql):

CREATE TABLE dbo.Invoice_Lines
(
    ID INT NOT NULL,
    InvoiceID INT NULL,
    StockID INT NULL,
    SalePrice NUMERIC(18, 2) NULL,
    VersionStamp VARBINARY(8) NULL  -- This field is to hold ROWVERSION data
);
-- The destination table with no IDENTITY column or referential integrity
GO

CREATE TABLE dbo.Invoice_Lines_Updates
(
    ID INT NOT NULL,
    InvoiceID INT NULL,
    StockID INT NULL,
    SalePrice NUMERIC(18, 2) NULL,
    VersionStamp VARBINARY(8) NULL  -- This field is to hold ROWVERSION data
);
-- The "scratch" table for updated records
GO

CREATE TABLE dbo.Invoice_Lines_Deletes
(
    ID INT NULL
);
-- The "scratch" table for you to deduce deleted records
GO



2.  Create an SSIS package (I will name it SSISDeltaLoad.dtsx) and add two OLEDB connection managers: one to the source database (name it CarSales_OLEDB) and one to the destination database (name it CarSales_Staging_OLEDB).



3.  Add a Data Flow task to the Control Flow pane. Name it Inserts and Updates. Double-click this to enter the Data Flow pane.






4.  Add an OLEDB source adapter. Configure it to use the CarSales_OLEDB connection manager and to select all required rows from the Invoice_Lines source table using the following SQL snippet:

SELECT  ID, InvoiceID, StockID, SalePrice,
        VersionStamp AS VersionStamp_Source
FROM    dbo.Invoice_Lines WITH (NOLOCK)



5.  Add a Multicast transform to the Data Flow pane. Name it Split Data. Connect the Inserts and Updates data source adapter to it.

6.  Add a Lookup transform to which you connect the Multicast transform. Name it Lookup RowVersions. On the General pane of the Lookup transform, configure it as follows:

    Cache Mode:          No Cache
    Connection Type:     OLEDB Connection Manager
    Specify how to handle rows with
    no match entries:    Send rows with no matching entries to the No Match output



7.  Click Connection on the left. Set the connection manager to CarSales_Staging_OLEDB, because you will be comparing the source data with the destination data that you are looking up with this Lookup transform.

8.  Set the Lookup to “Use results of an SQL Query” and enter the following SQL:

SELECT  ID, VersionStamp
FROM    dbo.Invoice_Lines WITH (NOLOCK)



9.  Click Columns on the left. The two tables, source (on the left) and destination (on the right), will appear. Drag the ID column from the Available Lookup Columns (or Destination) table on the right to the ID column of the Available Input Columns (Source) table on the left. This maps the unique IDs of the two data sources to each other.



10.  Select the VersionStamp column of the Available Lookup Columns (or destination) table on the right and provide an output alias (I suggest VersionStamp_Destination). This allows you to compare the VersionStamps for source and destination for each record. The dialog box should look like Figure 11-1.


