10-6. Profiling Data Quickly in SSIS


Chapter 10 ■ Data profiling



Figure 10-1. Configuring the Output Destination in an SSIS Data Profiling task

4. Click Quick Profile and select the CarSales_ADONET connection that you created in step 1 from the ADO.NET Connection pop-up list.

5. Select the table to analyze, in this case dbo.Client.

6. Check all the profile types that you wish to analyze. In this example, I am selecting all possible profile types. The dialog box should look like Figure 10-2.






Figure 10-2.  Selecting the types of profile to run in an SSIS Data Profiling task

7. Confirm your modifications with OK. You will see the selected profile requests, as shown in Figure 10-3.






Figure 10-3.  Selected profile requests in the SSIS Data Profiling Task Editor

8. Click OK in the Data Profiling Task Editor to confirm your profile selection.



You can now run the SSIS package containing the Data Profiling task to profile the source data. This generates the XML output file that you specified in step 3 (C:\SQL2012DIRecipes\CH10\SSISProfile.xml). While it is possible to read this XML file directly in a web browser, you will probably find it easier to read in the Data Profile Viewer, as described in Recipe 10-9.
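If you do open the XML file outside the Data Profile Viewer, a short script can at least show which kinds of elements it contains. The following is a minimal Python sketch and not part of the recipe itself. It deliberately assumes nothing about the element names in the profile output and simply counts whatever tags it finds. The function names count_tags and count_tags_in_file are hypothetical helpers invented for this illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip a leading '{namespace}' part from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def count_tags(xml_data):
    """Count how often each element name occurs in an XML document.

    xml_data may be a str or the raw bytes of a file.
    """
    root = ET.fromstring(xml_data)
    return Counter(local_name(el.tag) for el in root.iter())


def count_tags_in_file(path):
    """Read a file in binary mode (so the parser handles the encoding) and count its tags."""
    with open(path, "rb") as f:
        return count_tags(f.read())
```

To try it on the output of this recipe, you could call count_tags_in_file(r"C:\SQL2012DIRecipes\CH10\SSISProfile.xml") and print the result of most_common() on the returned counter. This gives only a rough inventory of the file; the Data Profile Viewer remains the more readable option.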



How It Works

Since SQL Server 2008, SSIS has provided a task to profile source data. This task is slightly set apart from other SSIS tasks in that (at least in my experience) it tends to be run “interactively” when initially looking at data, rather than as part of a structured ETL process. That said, you could also use it as part of a more comprehensive process if you need to. Once the task has run against a data source, you can examine the output (most easily with the Data Profile Viewer described in Recipe 10-9) and decide if the source data matches your expectations. This recipe shows you how to obtain a profile using the Quick Profile option of the SSIS Data Profiling task. Once a quick profile has been created, a series of profile requests is generated. These can be tweaked, or deleted, from the Profile Requests pane. Then the Data Profiling task can be run, and the output viewed in the Data Profile Viewer, as described in Recipe 10-9.






While ensuring a good overall analysis of data sources, the task does have the following limitations:

• Data sources are limited to tables and views from SQL Server 2000 and up over an ADO.NET connection.

• The account running the task must have read/write permissions, including CREATE TABLE permissions, on the TempDB database.

• The task can only produce an XML output file. Moreover, this file must either be viewed using the supplied Data Profile Viewer or be loaded using a custom task, as described in Recipe 10-10.



The breadth and depth of the analysis provided by this task are considerable, and cover the areas described in Table 10-4. As you can see, the Data Profiling task, even when you resort to the Quick Profile option, can provide you with a remarkably complete set of initial profile data.

Table 10-4.  SSIS Data Profiling Task Options



Element | Description | Potential Use

Candidate Key | Indicates whether a column or set of columns is a key, or could be a key, for the selected source table. | Determines the potential of a column (or set of columns) to be a unique key.

Column Length Distribution | Returns all the distinct lengths of string values in the selected column and the percentage of rows in the table for each column length. | Determines the range of lengths of data in a column.

Column Null Ratio | Returns the percentage of null values in the selected column. | Determines the percentage of NULLs in the column.

Column Pattern | Returns a set of regular expressions that cover the specified percentage of values in a string column. | Determines the text patterns in a column. This covers the way the text looks and is formatted.

Column Statistics | Indicates statistics such as minimum, maximum, average, and standard deviation for numeric columns, and minimum and maximum for DATETIME columns. | Determines the ranges of values and dates of data in a column.

Column Value Distribution | Indicates all the distinct values in the selected column and the percentage of rows in the table that each value represents. | Determines exactly how many distinct values exist in a column in order to define the distribution of values.

Functional Dependency | Indicates the extent to which the values in one column (the dependent column) depend on the values in another column or set of columns (the determinant column). | Determines if value sets correspond between columns.

Value Inclusion | Indicates the overlap in the values between two columns or sets of columns. | Determines the potential of a column (or set of columns) to be a foreign key.
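To make the descriptions in Table 10-4 more concrete, here is a rough Python sketch of three of the calculations, run against a small in-memory list of rows. It illustrates the ideas only and is not how the Data Profiling task is implemented. The function names, the sample data, and the exact formulas (for example, key strength taken as distinct values divided by row count) are assumptions made for this sketch.

```python
from collections import Counter, defaultdict


def null_ratio(rows, column):
    """Percentage of rows in which the column is None (NULL)."""
    if not rows:
        return 0.0
    nulls = sum(1 for r in rows if r[column] is None)
    return 100.0 * nulls / len(rows)


def key_strength(rows, columns):
    """Percentage of distinct value combinations; 100 suggests a candidate key."""
    if not rows:
        return 0.0
    distinct = {tuple(r[c] for c in columns) for r in rows}
    return 100.0 * len(distinct) / len(rows)


def dependency_strength(rows, determinant, dependent):
    """Percentage of rows whose dependent value is the most common one
    for their determinant value (assumed definition for this sketch)."""
    if not rows:
        return 0.0
    groups = defaultdict(Counter)
    for r in rows:
        groups[r[determinant]][r[dependent]] += 1
    supporting = sum(c.most_common(1)[0][1] for c in groups.values())
    return 100.0 * supporting / len(rows)


# Hypothetical sample rows, loosely modeled on a client table.
clients = [
    {"ID": 1, "Town": "Stoke", "County": "Staffs", "Email": None},
    {"ID": 2, "Town": "Stoke", "County": "Staffs", "Email": "a@example.com"},
    {"ID": 3, "Town": "Leeds", "County": "Yorks", "Email": None},
    {"ID": 4, "Town": "Leeds", "County": "Lancs", "Email": "b@example.com"},
]

print(null_ratio(clients, "Email"))
print(key_strength(clients, ["ID"]))
print(key_strength(clients, ["Town"]))
print(dependency_strength(clients, "Town", "County"))
```

With this sample data the four calls print 50.0, 100.0, 50.0, and 75.0: half the e-mail addresses are missing, ID is unique, Town is not, and three of the four rows agree with the most common county for their town.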






Hints, Tips, and Traps





• You can create as many quick profiles as you wish, each from a separate table or view. They will all be added to the profile requests.



10-7. Creating Custom Data Profiles with SSIS

Problem

You want to use SSIS to profile data while adjusting the profile specifications to suit your precise requirements.



Solution

Rather than running a quick profile, use the Data Profiling task to configure the available options to return the information that suits you. Once again I will use the CarSales.dbo.Stock table as the data to be profiled.

1. Create a new SSIS package. Add an ADO.NET connection manager to the CarSales source database. Name it CarSales_ADONET.

2. Right-click in the Connection Managers tab and select New File Connection. Select the Usage Type Create File and browse to the chosen folder (C:\SQL2012DIRecipes\CH10\CustomProfile.xml in this example). Click Open to create the file and OK to close the New File Connection dialog box.

3. Add a Data Profiling task to the Control Flow pane. Double-click to edit. In the General tab, select File Connection as the destination type.

4. Select the CustomProfile.xml connection that you created in step 2.

5. Set Overwrite Destination to True.

6. Click Profile Requests on the left to activate the Requests pane.

7. Scroll down to the bottom of the available profile requests (if any exist), click in a blank record, and select a profile request from those available in the pop-up list. In this example, I suggest starting with a Column Length Distribution Profile Request.

8. Press Tab or Enter to confirm the request creation. SSIS will give this request a name (probably LengthDistReq).

9. In the lower part of the pane (the Request Properties section) you will now configure the request, starting with the connection to the source database. For the property named Connection Manager, select the CarSales_ADONET connection manager that you created in step 1.

10. Select dbo.Stock as the TableOrView property.

11. Select Make as the Column property.

12. Click OK.



You can now run the package and analyze the resulting XML profile data.
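To give a feel for what a Column Length Distribution request reports, the following minimal Python sketch computes a length distribution over a handful of values. It is an illustration only, not the task's implementation. The function name length_distribution and the sample values standing in for the Make column are hypothetical, and skipping NULLs is a choice made for this sketch rather than documented behavior of the task.

```python
from collections import Counter


def length_distribution(values):
    """Map each distinct string length to the percentage of non-NULL values
    that have that length."""
    lengths = Counter(len(v) for v in values if v is not None)
    total = sum(lengths.values())
    return {n: 100.0 * count / total for n, count in sorted(lengths.items())}


# Hypothetical values standing in for the Make column of dbo.Stock.
makes = ["Jaguar", "Rolls Royce", "Aston Martin", "Jaguar", None]
print(length_distribution(makes))
```

For these sample values, lengths 6, 11, and 12 account for 50, 25, and 25 percent of the non-NULL entries. A result like this helps you judge whether the declared column width is a good fit for the data it actually holds.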






How It Works

If you prefer to create and fine-tune profiles directly, you can do so by configuring the SSIS Data Profiling task manually rather than using the Quick Profile approach. The profiling options are fairly wide-ranging, as seen in Table 10-6. With a little practice, however, they become fairly easy to use.

Defining a profile will always include the elements from Table 10-5. This is essentially the core definition of which database, table, and columns you are profiling.

Table 10-5.  Compulsory Elements in an SSIS Data Profiling Task



Element | Comments

Connection Manager | An existing or newly created ADO.NET connection manager.

Table or View | A table or view available through the selected connection manager.

Column | A single column (or all columns selected by using the asterisk “*”) from the table that you wish to profile.

RequestID | A user-defined name for this profile element. Although SSIS will name each profile request, you can overwrite these names with your own.
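One way to keep the four compulsory elements of Table 10-5 in mind is to picture each profile request as a small record. The Python sketch below is purely a mental model: the class ProfileRequest, its field names, and the missing method are invented for this illustration and do not correspond to the SSIS object model. The sample values are the ones used in this recipe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfileRequest:
    """Illustrative model of the compulsory elements; not the SSIS object model."""
    connection_manager: str
    table_or_view: str
    column: str  # a single column name, or "*" for all columns
    request_id: str

    def missing(self):
        """Return the names of any compulsory elements left blank."""
        return [name for name, value in vars(self).items() if not value.strip()]


request = ProfileRequest(
    connection_manager="CarSales_ADONET",
    table_or_view="dbo.Stock",
    column="Make",
    request_id="LengthDistReq",
)
print(request.missing())
```

Because every element is filled in, the call prints an empty list. A request with a blank connection manager would report that field name instead.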



Once you have specified the database, table, and column, you have to indicate which type of profile you wish to run. You can also specify certain parameters specific to the type of profile. Each of these configurations forms a separate profile request. The various options that are specific to each type of profile request are explained in Table 10-6.

Table 10-6.  SSIS Profile Task Options



Element | Description | Potential Use

Candidate Key | Indicates whether a column or set of columns is a key, or an approximate key, for the selected table. | Determine the potential of a column (or set of columns) as a unique key.

Column Length Distribution | Indicates all the distinct lengths of string values in the selected column and the percentage of rows in the table that each length represents. | Determine the range of column lengths.

Column Null Ratio | Indicates the percentage of null values in the selected column. | Determine the percentage of NULLs.

Column Pattern | Returns a set of regular expressions that cover the specified percentage of values in a string column. | Determine the text patterns in a column.

Column Statistics | Indicates statistics such as minimum, maximum, average, and standard deviation for numeric columns, and minimum and maximum for DATETIME columns. | Determine ranges of values and dates.

Column Value Distribution | Indicates all the distinct values in the selected column and the percentage of rows in the table that each value represents. | Determine how many distinct values exist in a column.

(continued)


