10-15. Pattern Profiling Using T-SQL


Chapter 10 ■ Data Profiling



using System.Data.SqlTypes;
using System.Text.RegularExpressions;
using Microsoft.SqlServer.Server;   // required for the SqlFunction attribute

namespace Adama
{
    public class RegexProfiling
    {
        [SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static string PatternProfiler(string charInput)
        {
            // A NULL input has no pattern
            if (charInput == null)
            {
                return null;
            }
            Regex regexTxt = new Regex("[A-Z]", RegexOptions.IgnoreCase);
            Regex regexNum = new Regex("[0-9]");
            // Replace letters first, then digits, so that the "N" markers
            // are not themselves replaced by "X"
            string patternOutput = regexTxt.Replace(charInput, "X");
            patternOutput = regexNum.Replace(patternOutput, "N");
            return patternOutput;
        }
    }
}

2.  Compile the DLL from your code. In Visual Studio, or SSDT, this is as simple as pressing F5. Otherwise, you will need to enter the following as a single command line:

csc /target:library /out:C:\SQL2012DIRecipes\CH10\CLRRegex.dll C:\SQL2012DIRecipes\CH10\CLRRegex.cs

Of course, remember to use the path to the .cs file that you have created, and the output path to the DLL that you want to use.



3.  Now add the DLL as an assembly to SQL Server. This presumes that your server is enabled for CLR. Enter the following code in a Management Studio query window:

CREATE ASSEMBLY RegexPatternProfiling FROM 'C:\SQL2012DIRecipes\CH10\CLRRegex.dll';



4.  Then create a function from the assembly by using the following code in a Management Studio query window:

CREATE FUNCTION PatternProfiler
(
    @charInput NVARCHAR(4000)
)
RETURNS NVARCHAR(4000)
AS EXTERNAL NAME [RegexPatternProfiling].[Adama.RegexProfiling].[PatternProfiler];






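5.  You can now use the PatternProfiler function just like any other T-SQL scalar function to return the pattern from a column. Remember that a scalar function must be called with its schema name (dbo here, assuming that the function was created in the default dbo schema):

SELECT Col1, dbo.PatternProfiler(Col1) AS Pattern_Profile
FROM DataSource;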



How It Works

If you really want to stick to T-SQL for all your data profiling, it can be done, but, it must be admitted, only slowly and laboriously as far as pattern profiling is concerned. Effective pattern profiling requires regular expressions, and T-SQL can only reach these through the CLR. This means developing a CLR function and then loading the .NET assembly that contains it into the database, which is somewhat finicky. Once that is done, your analysis can call the CLR-based function. This CLR function is essentially the same code described in Recipe 10-20.
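The masking logic itself is small enough to sketch outside SQL Server. The following Python fragment is only an illustration of the technique, not the recipe's code: the function name pattern_profile and its two marker parameters are hypothetical, and Python's re module stands in for the .NET Regex class.

```python
import re

# Hypothetical stand-ins for the two regular expressions used by the CLR function
_LETTERS = re.compile(r"[A-Za-z]")
_DIGITS = re.compile(r"[0-9]")


def pattern_profile(value, text_marker="X", digit_marker="N"):
    """Return the pattern of a string: letters and digits replaced by markers."""
    if value is None:
        return None
    # Letters are masked first; masking digits first would turn the "N"
    # markers into "X" on the second pass.
    masked = _LETTERS.sub(text_marker, value)
    return _DIGITS.sub(digit_marker, masked)


if __name__ == "__main__":
    for sample in ("AB12 3CD", "SW1A-2aa"):
        print(sample, "->", pattern_profile(sample))
```

The order of the two substitutions is the one design point worth noting: because the digit marker is itself a letter, the letter pass has to run before the digit pass.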



■ Note  The topic of creating and using CLR-based functions is too large to be treated in this chapter, or even in this book. For a complete description of creating and managing SQL Server CLR-based functions, please refer to Pro SQL Server 2005 Assemblies by Robin Dewson (Apress, 2006).

It is also possible to generate pattern profiles in pure T-SQL. The code can be very simple (as you can see in the following code), but, as any T-SQL programmer knows, it will be abysmally slow when used to process large datasets, because it is a scalar function and a “stringwalking” function at that. So, in the interests of completeness but with a strong caveat about performance, here is a simple T-SQL function that you can use to profile patterns. It also lets you choose the pattern characters as input parameters. For the sake of simplicity, I am creating this function in the database where it will be used.

CREATE FUNCTION dbo.fn_PatternProfile
(
 @STRING VARCHAR(4000)
,@NUMERICPATTERN CHAR(1)
,@TEXTPATTERN CHAR(1)
)
RETURNS VARCHAR(4000)
AS
BEGIN
    DECLARE @STRINGPOS INT = 1;
    DECLARE @TESTCHAR CHAR(1) = '';
    DECLARE @PATTERNCHAR CHAR(1) = '';
    DECLARE @PATTERNOUT VARCHAR(4000) = '';

    -- Walk the string one character at a time
    WHILE @STRINGPOS <= LEN(@STRING)
    BEGIN
        SET @TESTCHAR = SUBSTRING(@STRING, @STRINGPOS, 1);

        -- UPPER() makes the letter test work under case-sensitive collations too
        IF UPPER(@TESTCHAR)
            IN('A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W',
               'X','Y','Z')
            SET @PATTERNCHAR = @TEXTPATTERN;
        ELSE IF @TESTCHAR IN('0','1','2','3','4','5','6','7','8','9')
            SET @PATTERNCHAR = @NUMERICPATTERN;
        ELSE
            SET @PATTERNCHAR = @TESTCHAR;   -- keep any other character as it is

        SET @PATTERNOUT = @PATTERNOUT + @PATTERNCHAR;
        SET @STRINGPOS = @STRINGPOS + 1;
    END;

    RETURN(@PATTERNOUT);
END;

You can then apply this function in the following way:

SELECT dbo.fn_PatternProfile(CountryName_EN,'N','A'), CountryName_EN

FROM dbo.Countries

This code is a “stringwalking” function: it simply iterates through every character in a field, replacing alphabetical characters with the character passed in the @TEXTPATTERN parameter and numerals with the character passed in the @NUMERICPATTERN parameter. It is extremely easy to extend this function to detect other characters and replace them with yet other pattern markers.
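To make the extension idea concrete, here is a minimal Python sketch of the same character-by-character walk with an optional list of extra rules. It is an illustration only: the function name walk_pattern, the extra_rules parameter, and the markers "P" and "S" are assumptions made for this example and do not appear in the T-SQL function.

```python
import string


def walk_pattern(value, numeric_marker="N", text_marker="A", extra_rules=()):
    """Walk a string and replace each character by its pattern marker.

    extra_rules is a hypothetical extension point: a sequence of
    (characters, marker) pairs that are tested after letters and digits.
    """
    out = []
    for ch in value:
        if ch in string.ascii_letters:
            out.append(text_marker)
        elif ch in string.digits:
            out.append(numeric_marker)
        else:
            for chars, marker in extra_rules:
                if ch in chars:
                    out.append(marker)
                    break
            else:
                # No rule matched, so the character is kept unchanged
                out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(walk_pattern("AB-12 c"))
    print(walk_pattern("AB-12 c", extra_rules=[("-/", "P"), (" ", "S")]))
```

Keeping the extra rules in a list means that a new marker (for punctuation or spaces, say) is added by passing data rather than by adding another branch to the loop.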



10-16. Profiling Data Types

Problem

You want to profile the data types in your source data and indicate the smallest data type that can contain the data in each column.

Solution

Use an SSIS Script component to return profile information about the source data types. The following steps describe how.

1.  Create a new SSIS package. Add a new connection manager corresponding to the data source that you are using. For all but flat files, you can accept the data types suggested by SSIS. For Flat File connection managers, click the Advanced pane and set the OutputColumnWidth to a relatively large figure for string columns. Set any columns that you have doubts about to a large string width too (I suggest 4000). Confirm the modifications to the connection manager. In this example, I will use a Flat File connection manager that connects to the C:\SQL2012DIRecipes\CH10\Stock.Txt file.



2.  Add a Data Flow task and double-click to edit.

3.  In the Data Flow pane, add a source component of the appropriate type and configure it to use the connection manager that you just created.

4.  Add a Script component (defined as a Transformation when prompted) and connect the source component to it. Double-click to edit.






5.  Click the Input Columns pane and make sure that all the columns you wish to analyze are selected.

6.  Click the Script pane and select Microsoft Visual Basic 2010 as the script language. Click Edit Script to open the Script editor.

7.  Add the following lines to the Imports region:

Imports System.IO
Imports System.Text



8.  Replace the ScriptMain class with the following:

Public Class ScriptMain
    Inherits UserComponent

    ' Set the upper bound of each array to the number of columns handled (base 0).
    ' Five columns are processed below, so the upper bound is 4.
    Dim ColDataType(4) As Integer
    Dim ColDataLength(4) As Integer
    Dim ColDecimalPrecision(4) As Integer
    Dim ColDecimalScale(4) As Integer
    Dim ProposedDataType(17) As String

    Public Overrides Sub PreExecute()
        ProposedDataType(0) = "unknown"
        ProposedDataType(1) = "bit"
        ProposedDataType(2) = "tinyint"
        ProposedDataType(3) = "smallint"
        ProposedDataType(4) = "int"
        ProposedDataType(5) = "bigint"
        ProposedDataType(6) = "smallmoney"
        ProposedDataType(7) = "money"
        ProposedDataType(8) = "decimal"
        ProposedDataType(9) = "single"    ' real in SQL Server
        ProposedDataType(10) = "double"   ' float in SQL Server
        ProposedDataType(11) = "time"
        ProposedDataType(12) = "date"
        ProposedDataType(13) = "smalldatetime"
        ProposedDataType(14) = "datetime"
        ProposedDataType(15) = "datetime2"
        ProposedDataType(16) = "char"
        ProposedDataType(17) = "varchar"
    End Sub

    Public Overrides Sub PostExecute()
        MyBase.PostExecute()
        Dim outFile As String = "C:\SQL2012DIRecipes\CH10\Output.txt"
        Dim sb As New StringBuilder
        ' One output line per column: label, data type, length, precision, scale
        For i As Integer = 0 To ColDataType.Length - 1
            sb.AppendLine("Col" & i & "," & ProposedDataType(ColDataType(i)) & "," & _
                ColDataLength(i) & "," & ColDecimalPrecision(i) & "," & ColDecimalScale(i))
        Next
        Using OutWrite As New StreamWriter(outFile)
            OutWrite.Write(sb.ToString)
        End Using
    End Sub

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        ' The second argument is the position of the column in the arrays
        GetDataType(Row.ID, 0)
        GetDataType(Row.InvoiceNumber, 1)
        GetDataType(Row.DeliveryCharge, 2)
        GetDataType(Row.ClientID, 3)
        GetDataType(Row.TotalDiscount, 4)
    End Sub

    Public Function GetDataType(ByVal InputCol As String, ByVal ColIndex As Integer) As Integer
        Dim IsFound As Boolean = False
        Dim SuggestedDataType As Integer
        ' Range limits of the SQL Server money types (the D suffix keeps full decimal precision)
        Dim MaxMoney As Decimal = 922337203685477.5807D
        Dim MinMoney As Decimal = -922337203685477.5808D
        Dim MaxSmallMoney As Decimal = 214748.3647D
        Dim MinSmallMoney As Decimal = -214748.3648D
        ' Range limits of the SQL Server date types, expressed as years
        Dim MinDateTime As Integer = 1753
        Dim MaxDateTime As Integer = 9999
        Dim MinSmallDateTime As Integer = 1900
        Dim MaxSmallDateTime As Integer = 2079
        Dim MinDateTime2 As Integer = 1
        Dim DecimalPrecision As Integer = 0
        Dim DecimalScale As Integer = 0
        Dim StringLength As Integer = 0
        ' No detection of decimal separator or date/time formats
        If Boolean.TryParse(InputCol, False) Then
            IsFound = True
            SuggestedDataType = 1
        End If






If Not IsFound Then

If IsNumeric(InputCol) Then ' Initial test for numeric values
' Integer first - straight mapping:
If Not IsFound Then
If Byte.TryParse(InputCol, 0) Then ' 0 to 255

IsFound = True

SuggestedDataType = 2

End If

End If

If Not IsFound Then

If Int16.TryParse(InputCol, 0) Then ' - 32768 to 32767

SuggestedDataType = 3

IsFound = True

End If

End If

If Not IsFound Then

If Int32.TryParse(InputCol, 0) Then ' -2147483648 to 2147483647

IsFound = True

SuggestedDataType = 4

End If

End If

If Not IsFound Then

If Int64.TryParse(InputCol, 0) Then ' -9223372036854775808 to

' 9223372036854775807

IsFound = True

SuggestedDataType = 5

End If

End If

' If not an integer, try the decimal data types

' Money first

If Not IsFound Then

If Decimal.TryParse(InputCol, 0) Then

If InputCol.Length - InputCol.IndexOf(".", 0) <= 5 Then

If InputCol >= MinSmallMoney And InputCol <= MaxSmallMoney Then

IsFound = True

SuggestedDataType = 6

End If






If Not IsFound Then

If InputCol >= MinMoney And InputCol <= MaxMoney Then

IsFound = True

SuggestedDataType = 7

End If

End If

End If

If InputCol.Length - InputCol.IndexOf(".", 0) > 5 Then

If Not IsFound Then

IsFound = True

SuggestedDataType = 8

DecimalPrecision = InputCol.Length - 1

DecimalScale = InputCol.Length - InputCol.IndexOf(".", 0) - 1

End If

End If

End If

End If

' If not one of the other numeric types - it has to be single or double!

If Not IsFound Then

If Single.TryParse(InputCol, 0) Then

IsFound = True

SuggestedDataType = 9

End If

End If

If Not IsFound Then

If Double.TryParse(InputCol, 0) Then

IsFound = True

SuggestedDataType = 10

End If

End If

End If

End If

        ' Date types
        If Not IsFound Then
            Dim DV As DateTime
            If DateTime.TryParse(InputCol, DV) Then
                If IsDate(InputCol) Then
                    If Year(InputCol) = 1 And Month(InputCol) = 1 Then
                        ' Has no year/month aspect - so is a time
                        IsFound = True
                        SuggestedDataType = 11
                    ElseIf Hour(InputCol) = 0 And Minute(InputCol) = 0 _
                        And Second(InputCol) = 0 Then
                        ' Has no time aspect, so is a date
                        If Year(InputCol) >= MinDateTime2 And Year(InputCol) <= MaxDateTime Then
                            IsFound = True
                            SuggestedDataType = 12
                        End If
                    End If
                    If Not IsFound Then
                        If Year(InputCol) >= MinSmallDateTime _
                            And Year(InputCol) <= MaxSmallDateTime Then
                            IsFound = True
                            SuggestedDataType = 13
                        End If
                    End If
                    If Not IsFound Then
                        If Year(InputCol) >= MinDateTime And Year(InputCol) <= MaxDateTime Then
                            IsFound = True
                            SuggestedDataType = 14
                        End If
                    End If
                    If Not IsFound Then
                        If Year(InputCol) >= MinDateTime2 And Year(InputCol) <= MaxDateTime Then
                            IsFound = True
                            SuggestedDataType = 15
                        End If
                    End If
                End If
            End If
        End If

'Then Text

If Not IsFound Then

SuggestedDataType = 16

StringLength = InputCol.Length

End If






        ' Apply the most inclusive data type
        ' First, if there is a date so far, and it becomes a number (or vice versa),
        ' the column automatically becomes text
        If ((ColDataType(ColIndex) >= 1 And ColDataType(ColIndex) <= 10) _
            And (SuggestedDataType >= 11 And SuggestedDataType <= 15)) _
            Or ((ColDataType(ColIndex) >= 11 And ColDataType(ColIndex) <= 15) _
            And (SuggestedDataType >= 1 And SuggestedDataType <= 10)) Then
            SuggestedDataType = 16
        End If
        ' Remember the chosen data type
        If ColDataType(ColIndex) < SuggestedDataType Then
            ColDataType(ColIndex) = SuggestedDataType
        End If
        ' For text types, set the col length
        If SuggestedDataType = 16 Then
            If InputCol.Length > ColDataLength(ColIndex) Then
                ColDataLength(ColIndex) = InputCol.Length
            End If
        End If
        ' For decimal types (type 8) get precision and scale
        If SuggestedDataType = 8 Then
            If DecimalPrecision > ColDecimalPrecision(ColIndex) Then
                ColDecimalPrecision(ColIndex) = DecimalPrecision
            End If
            If DecimalScale > ColDecimalScale(ColIndex) Then
                ColDecimalScale(ColIndex) = DecimalScale
            End If
        End If
        Return SuggestedDataType
    End Function

End Class






9.  Tweak the code to use the column names in your data source. Resize the four arrays (ColDataType, ColDataLength, ColDecimalPrecision, ColDecimalScale) to correspond to the number of columns. Define an output file and directory suited to your environment. In this example, it is C:\SQL2012DIRecipes\CH10\Output.Txt.

10.  Close the Script editor. Click OK to close the Script Transformation Editor. The final package should look like Figure 10-21.



Figure 10-21.  Profiling data types

You can now run the SSIS package.



How It Works

Data profiling does not only mean looking at the values contained in a data source. It can, and sometimes should, involve looking at the data types too. You can then compare the results with any metadata that you have for the source to highlight discrepancies and, eventually, to change the data types of the destination tables. For instance, if a column contains only integers between 10 and 200 and its type is BIGINT, then you may want to consider changing the type of the relevant column in any tables in your destination database, as well as making suggestions to the DBA of the data source.
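The integer example can be checked with a few lines of code. The Python sketch below is an illustration rather than part of the recipe: the table INTEGER_TYPES and the function smallest_integer_type are hypothetical names, and the ranges are those of the four SQL Server integer types.

```python
# (name, minimum, maximum) for the SQL Server integer types, narrowest first
INTEGER_TYPES = [
    ("tinyint", 0, 255),
    ("smallint", -32768, 32767),
    ("int", -2147483648, 2147483647),
    ("bigint", -9223372036854775808, 9223372036854775807),
]


def smallest_integer_type(values):
    """Return the narrowest integer type that holds every value, or None."""
    values = list(values)
    if not values:
        return None
    lowest, highest = min(values), max(values)
    for name, type_min, type_max in INTEGER_TYPES:
        if type_min <= lowest and highest <= type_max:
            return name
    # The values do not fit any integer type
    return None


if __name__ == "__main__":
    print(smallest_integer_type([10, 200]))
```

For a column whose values run from 10 to 200, the sketch returns tinyint, which is the kind of discrepancy with a BIGINT declaration that the profile is meant to surface.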

Let’s be clear about this: what we are doing here is saying what the data type should, or could, be, given the data currently in a specified column. We are not saying what the data type is according to SSIS, OLEDB providers, .NET providers, or source metadata. Once again, we are running the source data through a process to analyze it and say what SQL Server data type it should be in an ideal world. Just remember that there may be very good reasons for an apparently inappropriate data type, as the DBA who is in charge of the source may know things about the future directions of the data that you do not. And conversely, he or she may not . . .

This SSIS script therefore attempts to deduce the most suitable data type for each column in a source file. It is not perfect, but hopefully it is an acceptable trade-off between efficiency, reliability, and complexity. Running the package will create a text file (called C:\SQL2012DIRecipes\CH10\Output.txt in this example) that contains one line per source column (labeled Col0, Col1, and so on) with four values:

•  The suggested data type
•  The maximum length for a character field
•  The decimal precision for decimal data types
•  The decimal scale for decimal data types






The script defines five arrays: four to hold each column's data type, data length, decimal precision, and decimal scale, and one to hold the names of the data types it will be attributing. The ProposedDataType array is initialized with the 17 possible data types (plus "unknown") as part of the script's pre-execute phase. The script's post-execute phase writes the text file containing the final analysis. Each column that is to be analyzed is passed to the GetDataType function for each row in the source data.

The core of the script is the GetDataType function. This takes the source data for each column and attempts to deduce its data type and length. First it tests for a Boolean (bit) value, then various integer types, then other numeric types, then dates, and then, if all else fails, a character type. The data types are in an order of preference, so that a type with a more encompassing definition will always be retained over a narrower data type, to ensure that the data can load successfully.
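The "most encompassing type wins" rule is easier to follow in a reduced form. The Python sketch below is a simplified illustration under stated assumptions, not a port of the VB script: it handles only a subset of the types, accepts dates only in the YYYY-MM-DD format, and the names suggest, widen, and profile_column are hypothetical. The rank numbers reuse the positions of the types in the ProposedDataType array.

```python
import re
from datetime import datetime

# Rank of each type: a higher rank is a more encompassing type
RANK = {"unknown": 0, "tinyint": 2, "smallint": 3, "int": 4, "bigint": 5,
        "decimal": 8, "date": 12, "char": 16}
NUMERIC = {"tinyint", "smallint", "int", "bigint", "decimal"}
INT_RANGES = [("tinyint", 0, 255), ("smallint", -2**15, 2**15 - 1),
              ("int", -2**31, 2**31 - 1), ("bigint", -2**63, 2**63 - 1)]
INT_RE = re.compile(r"[+-]?[0-9]+")
DEC_RE = re.compile(r"[+-]?[0-9]*\.[0-9]+")


def suggest(value):
    """Suggest the narrowest type for a single string value."""
    if INT_RE.fullmatch(value):
        number = int(value)
        for name, low, high in INT_RANGES:
            if low <= number <= high:
                return name
        return "decimal"  # too large for bigint
    if DEC_RE.fullmatch(value):
        return "decimal"
    try:
        datetime.strptime(value, "%Y-%m-%d")  # assumed date format
        return "date"
    except ValueError:
        return "char"


def widen(current, suggested):
    """Combine the type retained so far with the type of the new value."""
    mixed = (current in NUMERIC and suggested == "date") or \
            (current == "date" and suggested in NUMERIC)
    if mixed:
        return "char"  # only text can hold both numbers and dates
    return max(current, suggested, key=RANK.get)


def profile_column(values):
    """Return the type retained for a whole column of string values."""
    current = "unknown"
    for value in values:
        current = widen(current, suggest(value))
    return current


if __name__ == "__main__":
    print(profile_column(["10", "300", "-5"]))
    print(profile_column(["2012-01-31", "42"]))
```

Because the retained type can only move up the ranking, a single wide value anywhere in the column is enough to fix the result, whatever the order of the rows.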



Hints, Tips, and Traps





•  The initial length of the columns is a matter of personal preference. The larger it is, the less chance there is of a load failure, but the slower the process will be.

•  This script will evaluate all the rows in the data source. If this is too much, then merely insert a Row Sampling transformation between the data source and the Script component, edit it to set a relevant number of rows to sample, and connect it (using the Sampling Selected Output) to the Script component.

•  This script could be taken further to use column names instead of numbers, to automate sizing of the arrays, and so forth, but I will leave that as a challenge for the reader!

•  Data types are explained in Appendix B.



10-17. Controlling Data Flow via Profile Metadata

Problem

You want to use profiling data to add control flow logic to an ETL process.



Solution

Use an SSIS Script task to profile the data and output source data to a RAW file that will be used for future

processing if the profiling tests allow the process to continue. The following steps explain how to do it.

1.  Create a new SSIS package. Add a connection manager named Invoice_Source to the source data (C:\SQL2012DIRecipes\CH10\Invoice.Txt in this example).

2.  At package level, add the following variable:

Variable Name      Scope      DataType   Value
IsSafeToProceed    Package    Boolean    True

3.  Add a Data Flow task, name it Profile Load, and double-click to edit.

4.  Add a Flat File source that you configure to use the Invoice_Source connection manager that you created in step 1.


