10-2. Profiling Domain and Value Distribution


Chapter 10 ■ Data Profiling



Table 10-3. Domain and Value Distribution in an SQL Server Data Source

Distribution: Domain Distribution
Code:
    SELECT TOP (100) PERCENT Marque,
           COUNT(Marque) AS NumberOfMarques,
           COUNT(Marque) /
               (SELECT CAST(COUNT(*) AS NUMERIC(18,8))
                FROM CarSales.dbo.Stock)
               * 100 AS DomainPercent
    FROM CarSales.dbo.Stock
    GROUP BY Marque
    ORDER BY Marque DESC;
Comments: This code snippet combines analysis of:
    The domain values;
    The number of records that contain each value;
    The percentage of each value in the data set.

Distribution: Numeric Value Distribution
Code:
    SELECT ID, COUNT(ID) AS NumberOfValues
    FROM CarSales.dbo.Stock
    GROUP BY ID
    ORDER BY NumberOfValues DESC;
Comments: This code snippet combines analysis of:
    The numeric values;
    The number of records that contain each value.



How It Works

Value distribution simply means getting the following metrics:

• The number of records containing each of the values (text or numeric) in a specified field.

• The percentage of records containing each of the values (text or numeric) in a specified field.

Applying value distribution analysis to a field containing a vast number of different values might not be particularly useful in itself. This approach is best suited to fields containing a small variety of values, most often lookup or reference values, where the distribution of data is an indicator of data reliability.
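As an illustration of the two metrics just described, the following sketch returns the record count and the percentage for each value in a single query. The table variable @Stock and its sample rows are assumptions made purely so that the snippet runs on its own; they stand in for CarSales.dbo.Stock. The sketch also replaces the scalar subquery used in Table 10-3 with a windowed SUM(...) OVER (), which is a variation on the recipe's code rather than the recipe itself.

```sql
-- Hypothetical sample data standing in for CarSales.dbo.Stock
DECLARE @Stock TABLE (ID INT IDENTITY(1,1), Marque VARCHAR(20) NULL);

INSERT INTO @Stock (Marque)
VALUES ('Jaguar'), ('Jaguar'), ('Rolls Royce'), ('Aston Martin'),
       ('Aston Martin'), ('Aston Martin'), ('Bentley'), (NULL);

-- One row per domain value, with its record count and its share of all records.
-- SUM(COUNT(*)) OVER () is the grand total of the per-group counts.
SELECT   Marque
        ,COUNT(*) AS NumberOfRecords
        ,CAST(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () AS NUMERIC(5,2)) AS DomainPercent
FROM     @Stock
GROUP BY Marque
ORDER BY NumberOfRecords DESC;
```

Because this version uses COUNT(*), records whose Marque is NULL are reported as a group of their own. COUNT(Marque) would report 0 for that group, so choose whichever suits the field you are profiling.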



Hints, Tips, and Traps





• The solution in this recipe can also be applied to non-SQL Server data sources or linked servers.

• Profiling linked server sources is likely to be extremely slow; while it can be useful for an initial data analysis, it might not be practical as part of a regular ETL process.

• You can see how to store the output from the profile requests in Recipe 10-5.






10-3. Profiling External Data

Problem

You want to profile data not (yet) in SQL Server.



Solution

Use T-SQL and OPENROWSET to access the source data over OLEDB.

If you are looking at data in a text file, or any other OLEDB data source, you can use OPENROWSET to return the NULL count as follows (using, in this example, the sample file C:\SQL2012DIRecipes\CH10\Stock.Txt):

SELECT COUNT(*)
FROM   OPENROWSET('MSDASQL',
       'Driver={Microsoft Access Text Driver (*.txt, *.csv)};DefaultDir= C:\SQL2012DIRecipes\CH10;',
       'select * from Stock.Txt')
WHERE  Model IS NULL;

Similarly, the proportion of NULLs can be calculated using the following T-SQL (multiply the result by 100 to express it as a percentage):

SELECT
    (SELECT CAST(COUNT(*) AS NUMERIC (10,3))
     FROM   OPENROWSET('MSDASQL',
            'Driver={Microsoft Access Text Driver (*.txt, *.csv)};DefaultDir= C:\SQL2012DIRecipes\CH10;',
            'SELECT * FROM Stock.txt WHERE Model IS NULL'))
    /
    (SELECT COUNT(*)
     FROM   OPENROWSET('MSDASQL',
            'Driver={Microsoft Access Text Driver (*.txt, *.csv)};DefaultDir= C:\SQL2012DIRecipes\CH10;',
            'SELECT * FROM Stock.Txt'));

The code snippets are in the file C:\SQL2012DIRecipes\CH10\ProfileExternalData.Sql.



How It Works

All the techniques described in Recipes 10-1 and 10-2 to carry out domain and data analysis can be used with OPENROWSET. The art here is to connect to the “external” (by which I mean non-SQL Server) data source. Once this is done using the requisite driver, you use the appropriate SQL snippet in the pass-through query. The functions that are used to profile the data are identical to those used in the previous two recipes.

OPENROWSET can access much more than text files. However, as the techniques for using this command to read data from Microsoft Access, Excel, and various RDBMSs are explained in Chapters 1 and 4, respectively, I refer you to these chapters for more information on the actual external connection. The profiling code will still be the same as that shown here.
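To make this concrete, here is a sketch that applies the domain distribution query from Recipe 10-2 to the text file through OPENROWSET. It assumes that Stock.Txt has a header row that names a Marque column, that the text driver is installed, and that the server option 'Ad Hoc Distributed Queries' is enabled; none of these can be confirmed from the recipe text alone, so treat the snippet as an outline to adapt.

```sql
-- Domain distribution of a text-file column, read over OPENROWSET.
-- Assumption: Stock.Txt contains a header row with a column named Marque.
SELECT   Marque
        ,COUNT(*) AS NumberOfRecords
FROM     OPENROWSET('MSDASQL',
         'Driver={Microsoft Access Text Driver (*.txt, *.csv)};DefaultDir= C:\SQL2012DIRecipes\CH10;',
         'SELECT Marque FROM Stock.Txt')
GROUP BY Marque
ORDER BY NumberOfRecords DESC;
```

Only the FROM clause differs from the query you would run against a SQL Server table. Selecting just the column being profiled in the pass-through query keeps the returned dataset narrow.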



■ Note In this recipe’s examples, I am using the ACE driver to read text files because this allows the code to run in both 32-bit and 64-bit environments. You will have to install the ACE driver as described in Recipe 1-1 for this to work. If you are in a 32-bit environment, then you can use the Microsoft Text Driver, and replace “Microsoft Access Text Driver (*.txt, *.csv)” with “Microsoft Text Driver (*.txt; *.csv)” in the code for this recipe.






10-4. Profiling External Data Faster

Problem

You want to profile data from a source other than SQL Server in the shortest possible time.



Solution

Use T-SQL and OPENROWSET while minimizing the number of times the dataset is read. One way to do so is to use a temporary table, as in the following example (which reads C:\SQL2012DIRecipes\CH10\Stock.Txt):

-- Variables that hold the profiling results
DECLARE @Cost_Price INT
DECLARE @Registration_Year INT
DECLARE @ROWCOUNT INT
DECLARE @Mileage_MAX INT
DECLARE @Mileage_MIN INT
DECLARE @Registration_Year_NULL INT
DECLARE @Cost_Price_NULL INT

-- First pass: flag NULLs in the two columns and store the flags in a temporary table
SELECT
 CASE WHEN Registration_Year IS NULL THEN 1 ELSE 0 END AS Registration_Year
,CASE WHEN Cost_Price IS NULL THEN 1 ELSE 0 END AS Cost_Price
INTO #NullSourceRecords
FROM OPENROWSET('MSDASQL',
     'Driver={Microsoft Access Text Driver (*.txt, *.csv)};DefaultDir= C:\SQL2012DIRecipes\CH10;',
     'select Registration_Year, Cost_Price from Stock.txt')
WHERE Registration_Year IS NULL OR Cost_Price IS NULL

-- Second pass: record count plus maximum and minimum mileage
-- (the pass-through query must return Mileage for MAX and MIN to work)
SELECT
 @ROWCOUNT = COUNT(*)
,@Mileage_MAX = MAX(Mileage)
,@Mileage_MIN = MIN(Mileage)
FROM OPENROWSET('MSDASQL',
     'Driver={Microsoft Access Text Driver (*.txt, *.csv)};DefaultDir= C:\SQL2012DIRecipes\CH10;',
     'select Mileage from Stock.txt')

-- Total the NULL flags gathered in the first pass
SELECT
 @Registration_Year_NULL = SUM(Registration_Year)
,@Cost_Price_NULL = SUM(Cost_Price)
FROM #NullSourceRecords

PRINT @ROWCOUNT
PRINT @Mileage_MAX
PRINT @Mileage_MIN
PRINT @Registration_Year_NULL
PRINT @Cost_Price_NULL
PRINT CAST(@Registration_Year_NULL AS NUMERIC (12,6))
      / CAST(@ROWCOUNT AS NUMERIC (12,6))






How It Works

You have probably guessed after reading Recipe 10-3 (and will soon find out if you are using your own large text file as the data source) that re-reading the entire file every time you wish to profile one column is a very slow way to go about profiling your data. So if you are profiling several columns, I suggest minimizing the number of times that the data is read by grouping the profile requests, so that as many profile elements as possible are gathered each time the source file is read.

The code snippet in the current recipe reads the source text file twice, but twice only. The first pass through the file looks for NULL values in two columns (Registration_Year and Cost_Price). It creates a temporary table that isolates a narrow dataset containing only 1 or 0, indicating whether the value in each column is NULL. These columns are then summed to give the total number of NULL values for the columns in question. The second pass through the source file does not apply a WHERE clause; it returns the record count as well as the maximum and minimum values for the required column (Mileage, in this example). Any percentage calculations can then be carried out.
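If you want to push the same idea further, conditional aggregation can gather all of these metrics in one read. The sketch below is an alternative to the recipe's two-pass code, not a restatement of it. The table variable @Stock and its four rows are hypothetical and stand in for the external source so that the snippet runs on its own; against a real file you would replace FROM @Stock with the OPENROWSET call. Whether a single pass is actually faster for a given file and driver is something to measure rather than assume.

```sql
-- Hypothetical rows standing in for the external source
DECLARE @Stock TABLE (Registration_Year INT NULL, Cost_Price INT NULL, Mileage INT NULL);

INSERT INTO @Stock (Registration_Year, Cost_Price, Mileage)
VALUES (2009, 25000, 42000), (NULL, 18000, 61000),
       (2011, NULL, 15000), (NULL, NULL, 97000);

-- Every metric is computed in a single read of the source:
-- each CASE expression yields 1 for a NULL and 0 otherwise, so SUM counts the NULLs
SELECT   COUNT(*) AS RecordCount
        ,SUM(CASE WHEN Registration_Year IS NULL THEN 1 ELSE 0 END) AS Registration_Year_NULL
        ,SUM(CASE WHEN Cost_Price IS NULL THEN 1 ELSE 0 END) AS Cost_Price_NULL
        ,MAX(Mileage) AS Mileage_MAX
        ,MIN(Mileage) AS Mileage_MIN
        ,CAST(SUM(CASE WHEN Registration_Year IS NULL THEN 1 ELSE 0 END) AS NUMERIC(12,6))
           / COUNT(*) * 100 AS Registration_Year_NULL_Percent
FROM     @Stock;
```

With the sample rows shown, two of the four records have a NULL Registration_Year, so the last column returns 50 percent. Note that an empty source would make COUNT(*) zero and the division fail, so guard against that case if empty files are possible.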



Hints, Tips, and Traps





• See Recipe 2-5 for a more detailed discussion of OPENROWSET when used with text files. Specifically, remember that OPENROWSET is built for ad hoc, occasional connections, and that a linked server is the recommended solution for more regular connections.

• For a linked server, you can simplify things somewhat by using SQL similar to the following (where MyOracleDatabase is the linked server name). Remember to use the four-part notation to reference the table correctly:

    SELECT COUNT(*)
    FROM MyOracleDatabase..HR.EMPLOYEES
    WHERE LAST_NAME IS NULL

• It can be more laborious to profile external data if your source data file does not contain column names. In this situation, you require a Schema.Ini file, as described in Recipe 2-6. Here is the Schema.Ini file (C:\SQL2012DIRecipes\CH10\Schema.Ini) for a data source file (C:\SQL2012DIRecipes\CH10\StockNoHeaders.Txt) that does not contain headers:

    [StockNoHeaders.txt]
    Format=CSVDelimited
    ColNameHeader=False
    MaxScanRows=0
    Col1=MAKE Long
    Col2=MARQUE long Width 20
    Col3=MODEL Text Width 50
    Col4=PRODUCT_TYPE Text Width 15
    Col5=REGISTRATION_YEAR Text Width 4
    Col6=MILEAGE Long
    Col7=COST_PRICE Long
    CharacterSet=ANSI

  Running the code in this recipe with the pass-through query 'SELECT * FROM StockNoHeaders.txt' will profile the source data from a text file that does not contain headers.


