9-30. Cleansing Data As Part of an ETL Process



4.	Add an OLEDB source and configure as follows:

	Name:                    Car Sales
	Connection Manager:      CarSales_Staging_OLEDB
	Data Access Mode:        Table or view
	Name of Table or View:   CarColoursForDQSInSSIS



5.	Add a DQS Cleansing task, name it DQS Cleansing, and connect the data source that
	you just created to it. Double-click to edit.



6.	Click New to create a Data Quality connection manager. Select the DQS server name
	from the pop-up list of available DQS servers, and then click OK.



7.	Select the DQS Knowledge Base containing the domain that you wish to use. The
	dialog box should look like Figure 9-28.



Figure 9-28.  Configuring the DQS connection manager






8.	Click the Mapping tab to switch to the Mapping pane. Select the Color check box
	in the upper section of the dialog box. This indicates that the source column is to
	be cleansed.



9.	In the lower section of the dialog box, select Colors as the DQS cleansing domain
	to be used for the Color column in the source data. The dialog box should look
	something like Figure 9-29.



Figure 9-29.  Configuring the DQS Cleansing domain

10.	Confirm your modifications to the DQS Cleansing task.






11.	Add a Conditional Split task to the Data Flow pane and connect the DQS Cleansing
	task to it. Double-click to edit. Add the following outputs:

	Output Name:  New
	Condition:    [Record Status] == "New"
	Comments:     Creates an output for records where DQS can neither validate nor
	              correct the data being cleansed.

	Output Name:  Correct
	Condition:    [Record Status] == "Correct" || [Record Status] == "Corrected"
	Comments:     Creates an output for records where DQS accepts the data as valid.



12. Click OK to confirm your changes.

13.	Add an OLEDB destination to the Data Flow pane. Connect this destination to the
	Correct output from the Conditional Split task, and then configure it as follows.
	Afterward, map the source to the destination columns. You will not need to map the
	DQS status column(s).

	Name:                      Correct Records
	OLEDB Connection Manager:  CarSales_OLEDB
	Data Access Mode:          Table or view – fast load
	Name of Table or View:     dbo.Stock



14.	Add an OLEDB destination to the Data Flow pane. Connect this destination to the
	New output from the Conditional Split task, and then configure it as follows.
	Afterward, map the source to the destination columns. You will not need to map the
	DQS status column(s).

	Name:                      New Records
	OLEDB Connection Manager:  CarSales_Staging_OLEDB
	Data Access Mode:          Table or view – fast load
	Name of Table or View:     dbo.Stock_FailedCleansing (click New in SSIS to create the table)



The final package should look like that in Figure 9-30.






Figure 9-30.  DQS cleansing package overview

You can now run the package. Assuming that all goes well, it will load validated data directly into the

destination table and send any data that requires further intervention to the staging table.



How It Works

As I mentioned in the introduction to this chapter, there is one aspect of data cleansing that I will touch on, and

that is SQL Server Data Quality Services. Unfortunately, I cannot give a complete introduction to SQL Server Data

Quality Services, as that would require a chapter to itself. However, I can explain how to use it to cleanse data as it

flows through an SSIS process using the SSIS DQS Cleansing task.

Should you not yet be familiar with Data Quality Services, then all you need to know for the purposes of this
recipe is that it allows you to compare data being loaded into an SQL Server table with a set of reference data
(contained in a Knowledge Base). Each Knowledge Base can contain many Domains (think of them as a kind of
advanced and cleverly designed lookup table). You can then use a domain to validate and even correct the source
data as it flows into the destination table. When DQS analyzes source data and compares it with a domain, it
flags each record as New (the value is unknown in the Knowledge Base), Correct (the source value is exactly as
it is in the Knowledge Base), or Corrected (the Knowledge Base contains a mapping that allows DQS to replace
the source value with the one that should be used).

It is worth noting that data cleansing using the DQS Cleansing task can get much more complex than the
simple example in this recipe suggests. Each column that is cleansed using a domain in a Knowledge Base will
add a _Source and an _Output column, as well as a _Status column. This allows you to add fine-grained logic to
your data flows. The corollary is that the decision logic in a data flow can quickly become extremely intricate.

This recipe relies on the following elements:

•	At least one DQS Knowledge Base, created and populated with a functioning domain. In this
	example, I have created a Colours domain to cleanse the color references of new cars being
	added to the Stock table in the sample database.

•	Source data to be validated. In this example, I am using a source table of stock data in the
	CarSales_Staging database, called CarColoursForDQSInSSIS.

•	A second destination table, Stock_FailedCleansing, in the staging database. This table holds
	records that failed cleansing for manual correction, and can probably be used to update the
	DQS Knowledge Base.
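Once the package has run, the simplest way to decide how to update the DQS Knowledge Base is to see which
color values failed cleansing most often. The following snippet is a minimal sketch that assumes the staging table
was created as described in step 14 and contains the Color column from the data flow; adjust the names if your
table differs.

-- A minimal sketch: summarize the color values that failed cleansing so that the most
-- frequent unknown values can be reviewed and, where appropriate, added to the DQS domain.
-- Table and column names follow this recipe (dbo.Stock_FailedCleansing, Color).
SELECT
         Color
        ,COUNT(*) AS RecordCount
FROM     CarSales_Staging.dbo.Stock_FailedCleansing
GROUP BY Color
ORDER BY RecordCount DESC;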






Hints, Tips, and Traps





•	The data source can be any source that SSIS can import. It does not have to be a
	database table.

•	This example only shows one domain being used to cleanse one source column. You can
	use the DQS Cleansing task to cleanse multiple source columns at once if you so choose.

•	This example assumes that “correct” and “corrected” data are identical. You may prefer
	to separate them into two data paths (to add counters, or to output corrected data to a
	separate staging table for analysis purposes using a Multicast task, for instance). The two
	paths can then be sorted and merged into a single destination.

•	The decision as to how to handle new domain data is potentially a very big question. Do
	you keep it out of the destination table and reprocess it manually once the ETL process
	is finished? Or do you allow the data to be loaded into the destination table, flag any
	anomalies, and update this data in place (not shown here)? The decision will depend on
	the subtleties of each particular process.



Summary

This chapter has taken you on a (fairly whirlwind) tour of some of the many available data transformation

techniques that you could be called upon to apply in your career as an ETL developer. Hopefully, you have

seen that most of the “classic” problems facing the ETL developer (Data Type transformation, pivoting and

normalizing data, subsetting columns, and concatenating columns to name but a few) can be resolved either as

part of an SSIS pipeline or once data has been imported into staging tables in SQL Server.

I have given a certain weight in this chapter to slowly changing dimensions, as they seem to be becoming

more and more a part of the ETL universe. This is possibly due to the increasing importance of business

intelligence (BI) in the enterprise. In any case, handling data loads where the destination data can change over

time is now a fundamental part of many ETL processes, and so I wanted to ensure that the core techniques for

handling such datasets were explained.

I am fully aware that there are many challenges for which I have not been able to describe solutions, given space

constraints. I am also aware that for each of the techniques described in this chapter, there are many alternative

solutions and variations on each theme. Nonetheless, I hope that the recipes provided in this chapter will help

you resolve some of the more “classic” ETL problems that you are likely to encounter, and that you can take

this information as a starting point that you can use to build your own solid and robust data transformation

processes, with both SSIS and T-SQL.






Chapter 10



Data Profiling

Every person whose work involves data ingestion and consolidation wants to know exactly what constitutes the

source data that they are using. While this does not mean knowing every morsel of data in every table, it can and

should mean having a high-level view of what is (and equally important—what is not) in a column of data. This

knowledge can often be a valuable first step in deciding on the validity of a data source, and even in choosing

whether or not to proceed with an ETL process. Indeed, since the introduction of the Data Profiling task in SSIS

2008, the importance of data profiling seems to have been recognized by Microsoft. Self-evidently, then, it seems

worth taking an in-depth look at the art and science of data profiling with SQL Server. Consequently, the aim of

this chapter is to help you understand what data profiling is and what it can do to help you when working with

databases. Indeed, I suspect that many, if not most, SQL Server developers and DBAs have been using some

kind of data profiling techniques as part of their job already, even if they were not actually using the term “data

profiling” to describe what they were doing.

Data profiling with SQL Server is not limited to the SSIS Data Profiling task. In this chapter, I will show

many different approaches to data profiling, in which you “roll your own” profiling techniques using T-SQL and

SSIS—and even CLR (Common Language Runtime). This is to show you that data profiling is a varied art form,

and that it can be used in many different ways to help out with differing problems. As ever, we will start with the

simplest techniques and then progress to more complex ones.

As the terms used in this area can cause confusion, let’s start by defining what we mean by data profiling.

Data profiling is running a process to return metrics from a data set.

The first—and frequently the main—type of data profiling that you can perform is attribute analysis. It
consists of looking at the data in an individual column and abstracting out a high-level view of the following:

•	Null counts and null ratios.

•	Domain analysis (the counts and ratios of each element in a column).

•	Field length maximum and minimum, and frequently the length distribution (dominant
	length and percentage dominant length).

•	Numerical maximum and minimum.

•	Value distribution (domain analysis, median, unique/distinct values, and outliers).

•	Pattern profiles (the format of texts and numbers) and the percentage of pattern compliance.

•	Data type profiling.



The second type of data profiling that we look at briefly in this chapter, in the context of the SSIS Data
Profiling task, is relational analysis—how columns and records relate to one another (if at all). This includes:

•	Orphaned records—the number and the percentage of orphans.

•	Childless records—the number and the percentage of childless records.

•	Key (join) profile—cardinality (how many map to a join).
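To make relational analysis more concrete, here is a minimal T-SQL sketch of an orphan count. The parent and
child table names (dbo.Invoice with an ID key, dbo.Invoice_Lines with an InvoiceID foreign key) are assumptions
used purely for illustration; substitute the tables and join columns from your own schema.

-- A sketch of orphan detection: count child records whose parent key has no match.
-- The table and column names are illustrative assumptions - replace them with your own.
SELECT   COUNT(*) AS OrphanCount
FROM     CarSales.dbo.Invoice_Lines CHD
         LEFT OUTER JOIN CarSales.dbo.Invoice PRT
         ON CHD.InvoiceID = PRT.ID
WHERE    PRT.ID IS NULL;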






Some data profiling can be applied to any data type; some are type-specific. By this, I mean that you may

want to look for NULLs in text or numeric fields, whereas you will only look for numerical distribution in numeric

fields, and field lengths in text fields.

Clearly, the first question to ask is: why profile data? In reply, I would suggest that data profiling could—and
frequently should—be used in two cases:

•	To analyze source data before writing a complex ETL process.

•	As part of an ETL process.

To understand the need for profiling better, consider the all-too-frequent ETL scenario that faces a DBA or
developer. A file, or a link to some data, arrives in your e-mail and you have to “integrate it.” There may be some
(limited) documentation, and both the documentation and the data might—or just might not—be trustworthy.
Profiling the data, both to perform attribute and relational analysis, can save you a lot of time, in that it enables
you to:

•	Discover anomalies and contradictions in the data.

•	Ask more accurate and searching questions of the people who deliver the data.

•	Explain to the people in your company or department (who are convinced that they
	have just given you a magic wand to solve all their data needs) how the data can,
	realistically, be used, and how it can (or cannot) integrate into your destination databases
	and data warehouses.

Data profiling as part of an ETL process can be critical for any of the following reasons:

•	To maintain a trace of data quality metrics.

•	To alert you to breaches of defined thresholds in data quality.

•	To allow you to develop more subtle analysis of data quality, where graduations of quality
	(as opposed to simply pass or fail) can evolve. These analyses can then be sent as alerts
	during or after a load routine.



Once you have these analyses, then the next question should be “how and why is this important?” Well, the

reply will depend on your specific data set. If the column in question contains data that is fundamental to later

processing (a customer ID, say), then you might wish to set a threshold for NULLs (either as an absolute value or as

a percentage) above which processing is halted. Alternatively, if a NULL in this column is going to cause a process

to fail further into the ETL routine, then a single NULL might be enough to invalidate the data flow. The point is

that getting the figures now may take a few seconds. Discovering the error after five or six hours of processing is a

real waste of time.
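To give a feel for what such a gate might look like in practice, here is a minimal T-SQL sketch that halts
processing when the percentage of NULLs in a column exceeds a threshold. The column (Marque in
CarSales.dbo.Stock), the 1 percent threshold, and the use of RAISERROR to fail the calling step are illustrative
assumptions rather than prescriptions; adapt them to your own process.

-- A sketch of a NULL-threshold gate. The column, the threshold, and the error-handling
-- approach are assumptions for illustration - tailor them to your own ETL process.
DECLARE @NullPercent NUMERIC(18,3);

SELECT  @NullPercent =
        (SELECT CAST(COUNT(*) AS NUMERIC(18,3))
         FROM CarSales.dbo.Stock
         WHERE Marque IS NULL)
        / (SELECT COUNT(*) FROM CarSales.dbo.Stock) * 100;

IF @NullPercent > 1.0    -- illustrative threshold: 1 percent
BEGIN
    DECLARE @Message NVARCHAR(200) =
            N'NULL threshold exceeded for Marque: '
            + CAST(@NullPercent AS NVARCHAR(20)) + N' percent.';

    -- Severity 16 fails the calling step (an Execute SQL task or an Agent job step, for example).
    RAISERROR(@Message, 16, 1);
END;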

If you agree with the preceding premise, you may still be wondering: why should you want to go to the
trouble of developing your own profiling routines when a perfectly good SSIS task exists? Well, there are several
possible reasons why you might prefer a custom approach:

•	The Data Profiling task will only work with SQL Server sources (2000 and up), whereas
	custom profiling can be made to work with many different data sources.

•	You may want to perform profiling without necessarily using SSIS.

•	The Data Profiling task is only available from SQL Server 2008 onward. You may still be
	using SQL Server 2005 in some cases.

•	To save time, you could want to perform data profiling on an existing data flow during
	data ingestion, rather than run a task that runs independently of a data flow.

•	You want to go beyond the “out of the box” profiling request types that are available
	in SSIS.

•	You want to add further statistical analysis to profiling results.

•	The XML output can require some clunky workarounds to read as part of an ETL process.

•	You want to test data for sufficient probable accuracy before running a time- and
	resource-consuming ETL process. This is not easy when using the Profiling task.

•	You want records that your profiling captures as statistically questionable to be removed
	from a data flow, and/or output to a separate data destination for analysis.

•	You want to perform targeted subsetting—that is, you have experienced the types of data
	that tend to cause problems, and you want to profile data that has certain characteristics
	in order to reduce profiling time.



The techniques described in this chapter will not attempt to automate data cleansing because it is a subject

I will not look at in any detail. However, I hope to show that effective use of data profiling techniques can be an

extremely useful first step (some might say an essential step) in data cleansing.

The techniques outlined in this chapter are all specific, rather than generic. By this I mean that all the pieces

of each profiling or analysis technique described in this chapter will have to be assembled into a custom process

that is adapted to a specific data source. These methods do not attempt to fit any data source or to auto-adapt to

data sources.

To test the examples in this chapter, you have to download the sample files from the book’s companion web

site and install them in the C:\SQL2012DIRecipes\CH10 folder. You will also need the two sample databases,

CarSales and CarSales_Staging, which are also on the web site.



10-1. Profiling Data Attributes

Problem

You want to obtain attribute information (counts, NULL records, field lengths, and other basic details for a field or

fields) from an SQL Server data source.



Solution

Use T-SQL functions to profile and return the attributes of the source data. Here is one example:

SELECT COUNT(*) FROM CarSales.dbo.Stock WHERE Marque IS NULL;



How It Works

We began by looking at data profiling using T-SQL. This snippet returns the number of records that have NULLs for

a specific column.

To explain this concept further, I will assume that you are analyzing data that has already been loaded

into SQL Server. In my experience, this is a frequent scenario when you are first loading data into a staging

database from which it will eventually be transferred into an ODS (operational data store) or Data Warehouse.

In Recipe 10-3, you see how to use them with data that is not yet in SQL Server. Fortunately, this approach only

requires you to apply a series of built-in functions that you probably already know. They are shown in Table 10-1.






Table 10-1.  Attribute Profiling Functions in T-SQL

Function   Use in profiling
NULL       Detect NULL values.
MAX        Get the maximum numeric value—or, combined with LEN, the maximum string length.
MIN        Get the minimum numeric value—or, combined with LEN, the minimum string length.
COUNT      Count the number of records.
LEN        Get the length of a character field.



Along with judicious use of the GROUP BY and DISTINCT keywords, these functions will probably cover most

of your basic data attribute profiling requirements. Indeed, given these functions’ simplicity and ease of use, it is

probably faster to look at all of them at once—and see the core attribute profiling types at the same time—rather

than laboriously explain each one individually. I am aware that all this may seem way too simple for many

readers. Nevertheless, in the interest of completeness, Table 10-2 gives the code snippets that you might find

yourself using. All the examples presume that you are using the CarSales.dbo.Stock table.

Table 10-2.  Attribute Profiling Examples in T-SQL

NULL Profiling (how many records have NULLs in a specific column?):

    SELECT COUNT(*)
    FROM dbo.Stock
    WHERE Marque IS NULL;

NULL Percentage (the percentage of the total that this represents):

    SELECT
    (SELECT CAST(COUNT(*) AS NUMERIC(18,3))
     FROM dbo.Stock
     WHERE Marque IS NULL)
    / (SELECT COUNT(*)
       FROM dbo.Stock);

Maximum (the maximum value in a numeric column):

    SELECT MAX(ID) FROM dbo.Stock;

Minimum (the minimum value in a numeric column):

    SELECT MIN(ID) FROM dbo.Stock;

Maximum Count (the number of records that hold the maximum value in a numeric column):

    SELECT COUNT(*)
    FROM dbo.Stock
    WHERE ID = (SELECT MAX(ID)
                FROM dbo.Stock);

Minimum Count (the number of records that hold the minimum value in a numeric column):

    SELECT COUNT(*)
    FROM dbo.Stock
    WHERE ID = (SELECT MIN(ID)
                FROM dbo.Stock);

Maximum Length (the maximum length of a character (string) column):

    SELECT MAX(LEN(Marque))
    FROM dbo.Stock;

Minimum Length (the minimum length of a character (string) column):

    SELECT MIN(LEN(Marque))
    FROM dbo.Stock;

Zero-Length String Count (the number of zero-length strings):

    SELECT COUNT(*)
    FROM dbo.Stock
    WHERE LEN(Marque) = 0;

Count (the number of records):

    SELECT COUNT(*)
    FROM dbo.Stock;
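If you want several of these metrics in a single pass over the table rather than one query per metric, they can
be combined. The following is a sketch only, using the same CarSales.dbo.Stock table and Marque column as the
examples above; extend or trim the column list to suit the fields that you are profiling.

-- A sketch gathering several attribute metrics for the Marque column in a single pass.
SELECT
     COUNT(*)                                              AS TotalRecords
    ,SUM(CASE WHEN Marque IS NULL THEN 1 ELSE 0 END)       AS NullCount
    ,CAST(SUM(CASE WHEN Marque IS NULL THEN 1 ELSE 0 END) AS NUMERIC(18,3))
         / COUNT(*) * 100                                  AS NullPercent
    ,MAX(LEN(Marque))                                      AS MaxLength
    ,MIN(LEN(Marque))                                      AS MinLength
    ,SUM(CASE WHEN LEN(Marque) = 0 THEN 1 ELSE 0 END)      AS ZeroLengthCount
FROM CarSales.dbo.Stock;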



Hints, Tips, and Traps





•	If you are looking for a fast, but not 100 percent accurate record count where a tiny
	margin of error is acceptable, then you can return row counts using the following code
	snippet. The accuracy of the result will depend on how recently the table metadata was
	updated, but any large insert/delete will have caused the statistics to be recalculated:

	SELECT   OBJECT_NAME(object_id) AS TableName, row_count
	FROM     sys.dm_db_partition_stats
	WHERE    object_id = OBJECT_ID('Stock');

•	As the basic data profiling shown in this recipe is standard T-SQL, it will work for other
	databases on the same server as well as linked servers. All you need to do is use the
	correct three- or four-part notation to point to the table that you are profiling.
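As an illustration of the second point, the same NULL count can be run against a linked server simply by
extending the object name to four parts. The linked server name below (MYLINKEDSERVER) is a placeholder
assumption; substitute the name of your own linked server.

-- Four-part notation against a linked server; MYLINKEDSERVER is a placeholder name.
SELECT COUNT(*)
FROM   MYLINKEDSERVER.CarSales.dbo.Stock
WHERE  Marque IS NULL;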



10-2. Profiling Domain and Value Distribution

Problem

You want to obtain domain and value distribution data from an SQL Server source table.



Solution

Use T-SQL functions to return domain and value distribution information. Essentially, this means discovering the

number of records that contain each value and the percentage of each value in the data set.

The code snippets in Table 10-3 give you a snapshot of the domain distribution in a data table. Both of the

code examples profile the CarSales.dbo.Stock table.






Table 10-3.  Domain and Value Distribution in an SQL Server Data Source

Domain Distribution (this snippet returns the domain values, the number of records that contain each value,
and the percentage of each value in the data set):

    SELECT TOP (100) PERCENT
             Marque,
             COUNT(Marque) AS NumberOfMarques,
             COUNT(Marque)
             / (SELECT CAST(COUNT(*) AS NUMERIC(18,8))
                FROM CarSales.dbo.Stock)
             * 100 AS DomainPercent
    FROM     CarSales.dbo.Stock
    GROUP BY Marque
    ORDER BY Marque DESC;

Numeric Value Distribution (this snippet returns the numeric values and the number of records containing
each value):

    SELECT   ID, COUNT(ID) AS NumberOfValues
    FROM     CarSales.dbo.Stock
    GROUP BY ID
    ORDER BY NumberOfValues DESC;



How It Works

Value distribution simply means getting the metrics for:

•	The number of records containing each of the values (text or numeric) in a specified field.

•	The percentages of records containing each of the values (text or numeric) in a
	specified field.

Applying value distribution analysis to a set containing a vast number of different values might not be
particularly useful in itself. I am suggesting that this approach is best suited to the analysis of fields containing a
small variety of values—most often lookup or reference values—where the distribution of data is an indicator of
data reliability.
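If you also want the percentage of records for each numeric value, the second snippet in Table 10-3 can be
extended using the same pattern as the domain distribution query. The following is a sketch only, again profiling
the ID column of CarSales.dbo.Stock:

-- A sketch extending the numeric value distribution with the percentage of records
-- holding each value, following the pattern of the domain distribution snippet.
SELECT   ID,
         COUNT(ID) AS NumberOfValues,
         COUNT(ID)
         / (SELECT CAST(COUNT(*) AS NUMERIC(18,8))
            FROM CarSales.dbo.Stock)
         * 100 AS ValuePercent
FROM     CarSales.dbo.Stock
GROUP BY ID
ORDER BY NumberOfValues DESC;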



Hints, Tips, and Traps





•	The solution in this recipe can also be applied to non-SQL Server data sources or linked
	servers.

•	Profiling linked server sources is likely to be extremely slow, and while it can be useful for
	an initial data analysis, it might not be practical as part of a regular ETL process.

•	You can see how to store the output from the profile requests in Recipe 10-5.


