10-10. Storing SSIS Profile Data in a Database


Chapter 10 ■ Data profiling



CREATE TABLE dbo.DataProfiling_ColumnLength
(
    ProfileRequestID NVARCHAR(255) NULL,
    ColumnLengthDistributionProfile_ID BIGINT NULL,
    MinLength TINYINT NULL,
    MaxLength TINYINT NULL,
    ID INT IDENTITY(1,1) NOT NULL,
    DateAdded DATETIME NOT NULL DEFAULT GETDATE()
) ;
GO

CREATE TABLE dbo.DataProfiling_ValueDistributionItem
(
    Value NVARCHAR(255) NULL,
    Count INT NULL,
    ValueDistribution_Id BIGINT NOT NULL
) ;
GO

CREATE TABLE dbo.DataProfiling_LengthDistributionItem
(
    Length TINYINT NOT NULL,
    Count BIGINT NULL,
    LengthDistribution_Id BIGINT NOT NULL
) ;
GO

CREATE TABLE dbo.DataProfiling_Join_ValueDistribution
(
    ValueDistribution_Id BIGINT NOT NULL,
    ColumnValueDistributionProfile_Id BIGINT NOT NULL
) ;
GO

CREATE TABLE dbo.DataProfiling_Join_LengthDistribution
(
    LengthDistribution_Id BIGINT NOT NULL,
    ColumnLengthDistributionProfile_Id BIGINT NOT NULL
) ;
GO

2. Create a new SSIS package. Add two ADO.NET connection managers: one named CarSales_ADONET, pointing to the CarSales database, and one named CarSales_Staging_ADONET, pointing to the CarSales_Staging database.



3. Add an Execute SQL task. Configure it to use the connection manager for the profile data. Name it Prepare Tables and set the SQL Statement to (C:\SQL2012DIRecipes\CH10\TruncateSSISProfileTables.Sql):

TRUNCATE TABLE dbo.DataProfiling_ColumnLength;
TRUNCATE TABLE dbo.DataProfiling_ColumnNulls;
TRUNCATE TABLE dbo.DataProfiling_ColumnStatistics;
TRUNCATE TABLE dbo.DataProfiling_ColumnValueDistribution;
TRUNCATE TABLE dbo.DataProfiling_Join_LengthDistribution;
TRUNCATE TABLE dbo.DataProfiling_Join_ValueDistribution;
TRUNCATE TABLE dbo.DataProfiling_LengthDistributionItem;
TRUNCATE TABLE dbo.DataProfiling_ValueDistributionItem;



4. Add a Data Profiling task and double-click it to edit. Define a new File Destination named C:\SQL2012DIRecipes\CH10\MyProfile.Xml, as described in Recipe 10-6, step 3.



5. Create four profile requests, as follows, all using the CarSales_ADONET connection manager:

Profile Type                          Name                                     TableOrView    Column       Options
Column Null Ratio Profile             ColumnNull_Client_Town                   Client         Town
Column Length Distribution Profile    ColumnLength_Client_Name                 Client         ClientName   IgnoreLeadingSpaces = True; IgnoreTrailingSpaces = True
Column Statistics Profile             ColumnStatistics_InvoiceLines_SalePrice  InvoiceLines   SalePrice
Column Value Distribution Profile     ColumnValue_Client_Country               Client         Country      ValueDistributionOption = AllValues



6. Run the package to create an initial output XML file (by clicking Debug ➤ Start Debugging, for instance).

7. Add a Data Flow task, join the Profile task to it, and double-click to open the Data Flow pane.



8. Add an XML source and configure it as follows:

Data Access Mode:    XML File Location
XML Location:        The C:\SQL2012DIRecipes\CH10\MyProfile.Xml file defined in step 4 and created in step 6
XSD Location:        Generate an XSD from MyProfile.Xml, named C:\SQL2012DIRecipes\CH10\MyProfile.Xsd

9. Confirm your modifications with OK.

10. Create eight OLE DB destination tasks, all using the connection manager for the profile data, and configure them to load the following outputs from the XML source into the destination tables. (You should map all the columns that are available in the destination tables that correspond to source columns, as shown.)






XML Data Source Output             Destination Table
ColumnStatisticsProfile            dbo.DataProfiling_ColumnStatistics
ColumnNullRatioProfile             dbo.DataProfiling_ColumnNulls
ColumnLengthDistributionProfile    dbo.DataProfiling_ColumnLength
LengthDistribution                 dbo.DataProfiling_Join_LengthDistribution
LengthDistributionItem             dbo.DataProfiling_LengthDistributionItem
ColumnValueDistributionProfile     dbo.DataProfiling_ColumnValueDistribution
ValueDistribution                  dbo.DataProfiling_Join_ValueDistribution
ValueDistributionItem              dbo.DataProfiling_ValueDistributionItem



11. The Data Flow pane should look something like Figure 10-5.



Figure 10-5.  Data Flow for a custom profile output

You can now run the package. Once it has run, the profile data will be available in the destination tables for analysis.



How It Works

Profiling data and then analyzing it with the Data Profile Viewer is great when you are first getting to know your data, but it can prove limiting when you want to store this information for future use, either once the data loading process has finished, or as part of a business process that tests the profile results before allowing a load to continue. So you are likely to want to store the profile data in order to compare it with acceptable thresholds and use some logic to control the data flow. This requires a two-step approach:

First, define the Profiling task, and output the results to an XML file (or to a variable, as described in the hints below).

Second, load a small subset of the XML into SQL Server tables to capture the essential data.
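As a minimal sketch of the kind of threshold test mentioned above, the following T-SQL could be run from an Execute SQL task once the data flow has loaded the profile tables. It uses only the dbo.DataProfiling_ColumnLength table defined at the start of this recipe. The threshold of 50 and the LoadAllowed result column are assumptions chosen purely for illustration; they are not part of the original recipe.

```sql
-- Assumption: 50 is a hypothetical maximum acceptable length for the profiled column.
DECLARE @MaxAllowedLength TINYINT = 50;

-- Returns 1 when every stored length profile is within the threshold, otherwise 0.
SELECT CASE
           WHEN EXISTS (SELECT 1
                        FROM   dbo.DataProfiling_ColumnLength
                        WHERE  MaxLength > @MaxAllowedLength)
           THEN 0
           ELSE 1
       END AS LoadAllowed;
```

The single-row result could then be mapped to an SSIS variable and tested in a precedence constraint expression to decide whether the load continues.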






This technique does imply the following prerequisites:

• A Data Source that you must configure to connect to a suitable data source.

• A Data Flow task that is configured to use the data source that you have defined.



As you can see, some profile types, such as the Column Null Ratio profile and the Column Statistics profile, can be shredded directly into a single output table. Other profile types, such as the Column Length Distribution profile and the Column Value Distribution profile, require only a single table if all you want is the count of the number of distinct elements, but will require two other tables (a table for all the individual items and a many-to-many table to join the items to the aggregate table) if you want the details of every value and the number of times it appears.
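To make the three-table arrangement concrete, here is a hedged sketch of a query that reads the length distribution detail back out of the tables created in this recipe. It assumes that the _Id columns generated by the XML source and loaded into each table line up as join keys in the way the table names suggest; the column alias NumberOfValues is mine.

```sql
-- Length distribution detail for each stored Column Length Distribution profile.
-- Assumption: the generated _Id columns act as join keys between the three tables.
SELECT   CL.ProfileRequestID,
         CL.MinLength,
         CL.MaxLength,
         LDI.[Length],
         LDI.[Count] AS NumberOfValues
FROM     dbo.DataProfiling_ColumnLength AS CL
         INNER JOIN dbo.DataProfiling_Join_LengthDistribution AS JLD
               ON CL.ColumnLengthDistributionProfile_ID = JLD.ColumnLengthDistributionProfile_Id
         INNER JOIN dbo.DataProfiling_LengthDistributionItem AS LDI
               ON JLD.LengthDistribution_Id = LDI.LengthDistribution_Id
ORDER BY CL.ProfileRequestID, LDI.[Length];
```

If only the minimum and maximum lengths are of interest, the first table alone is enough and the two joins can be dropped.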

As it would be far too detailed to show how to obtain every possible piece of profile information, this recipe is an overview of four of the profile types that I have found useful when applying business rule analysis to an SSIS data load. I will not explain every last detail of how to use an XML data source, as many ways of doing this were covered in Chapter 3. However, you should be able to take this recipe as an example on which you can build to create your own profile output tables. The relationship between the profile type that you are analyzing and the output table(s) is given in Table 10-7.

Table 10-7.  Tables Used for Profile Types

Profile Type                          Table(s) Used
Column Null Ratio Profile             dbo.DataProfiling_ColumnNulls
Column Statistics Profile             dbo.DataProfiling_ColumnStatistics
Column Length Distribution Profile    dbo.DataProfiling_ColumnLength <=
                                      dbo.DataProfiling_Join_LengthDistribution =>
                                      dbo.DataProfiling_LengthDistributionItem
Column Value Distribution Profile     dbo.DataProfiling_ColumnValueDistribution <=
                                      dbo.DataProfiling_Join_ValueDistribution =>
                                      dbo.DataProfiling_ValueDistributionItem



Hints, Tips, and Traps





• There are many, many outputs from the XML file produced by the Data Profiling task. Should you need a better understanding of the file, then I suggest that you open the MyProfile.Xsd file (created in step 8) in Visual Studio, or even read the XML to see how the data is structured.

• If you do not wish to use a file for the XML data, then once the package is tested and debugged, you can define a string variable (at package level) and use this both as the destination for the Profiling task and as the source data for the XML data source. However, any further work on the package will necessitate resetting these as file-based data while the package is modified and debugged. Then the string variable can be reapplied.

• Of course, you can extend this package to handle other profile types, such as Functional Dependency or Value Inclusion.






10-11. Tailoring Specific Source Data Profiles in SSIS

Problem

You want to profile your data in a way that is tailored to your source data and profiling requirements.



Solution

Use SSIS to create a custom profiling package using standard SSIS tasks and go beyond the standard options

available in the SSIS Data Profiling task.

1. Create an SSIS package. Name it Profiling.Dtsx.

2. Create the CarSales_Staging.dbo.DataProfiling table, whose DDL was given in Recipe 10-5, to store profile data (unless you have already created it, of course).

3. Create an ADO.NET connection for this task that connects to the destination (CarSales_Staging) database. I am naming it CarSales_Staging_ADONET.

4. Add a Flat File connection manager and connect to the data source file, C:\SQL2012DIRecipes\CH10\Stock.Txt in this example. Name it StockFile. Make sure that you set the data type for the Mileage column to four-byte signed integer [DT_I4] in the Advanced pane.

5. Add the following variables in your package:



Variable Name    Type     Value
Mileage_MAX      Int32    0
Mileage_MIN      Int32    0
RowCount         Int32    0



6. Switch to the Data Flow pane and click the “Click here . . .” prompt to add a Data Flow task.

7. Add a Flat File source, and configure it to use the StockFile connection manager.

8. Add a Row Count task from the toolbox onto the Data Flow pane and connect the Flat File source to it.

9. Double-click the Row Count task and select the RowCount variable from the pop-up list of available variables (Figure 10-6). Confirm with OK.



Figure 10-6.  Adding a variable to an SSIS data flow






10. Add an Aggregate task onto the Data Flow pane. Name it Attribute Analysis and connect the Row Count task to it.

11. Double-click the Aggregate task to edit it.

12. Select the column(s) that you wish to analyze in the upper part of the Aggregations tab, or drag the column down to the lower part of the pane. I will use the Mileage column in this example, once for the maximum and once for the minimum.

13. Select the type of analysis that you wish to apply from the pop-up in the Operation column. In this example I am using Maximum first and Minimum second.

14. Rename the output aliases appropriately. I suggest Mileage_MAX and Mileage_MIN, respectively. The Aggregate Transformation dialog box should look something like Figure 10-7.



Figure 10-7.  The Aggregate Transformation dialog box in SSIS

15. Click OK to confirm your modifications. Return to the Data Flow tab.

16. Add an Unpivot transform task onto the Data Flow pane. Connect the Aggregate task to it. Double-click to edit the Unpivot task.
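For readers who think more naturally in T-SQL, the Row Count and Aggregate steps above (steps 8 through 14) compute roughly what the following query would return if the contents of Stock.Txt had first been loaded into a table. This is only a sketch for comparison: dbo.Stock_Staging is a hypothetical table name that does not appear in this recipe, and the recipe itself performs the calculation inside the SSIS data flow.

```sql
-- Assumption: dbo.Stock_Staging is a hypothetical table holding the rows of Stock.Txt,
-- with Mileage stored as an integer column.
SELECT COUNT(*)     AS [RowCount],     -- equivalent of the Row Count task
       MAX(Mileage) AS Mileage_MAX,    -- Aggregate: Maximum of Mileage
       MIN(Mileage) AS Mileage_MIN     -- Aggregate: Minimum of Mileage
FROM   dbo.Stock_Staging;
```

The SSIS approach has the advantage of profiling the flat file as it is read, without requiring the data to be landed in a table first.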


