15-22. Auditing an ETL Process


Solution

Audit the log data and isolate the key metrics that you define.

1.  In all your T-SQL stored procedures, remember to add the following code snippet to every data flow that performs an INSERT (both SELECT...INTO and INSERT INTO):

    ...
    ,GETDATE() AS DATE_PROCESSED
    ...

2.  For all T-SQL updates, simply remember to add:

    ...
    ,DATE_PROCESSED = GETDATE()
    ...

To add the processing date and time inside an SSIS data flow, you will need to add a Derived Column task to the Data Flow pane between the Data Source and the Destination task.

3.  Join the Derived Column task to the preceding and following tasks.

4.  Double-click the Derived Column task to edit it, add a name (how about Last_Processed), and set the expression to GETDATE(). SSIS will set the data type to DT_DBTIMESTAMP automatically.

5.  Map the derived column to the LAST_PROCESSED column of the destination table in the Destination task.

6.  You will need a table to store the most recent audit data. Suggested DDL is (C:\SQL2012DIRecipes\CH15\tblTableAuditData.Sql):

    CREATE TABLE dbo.TableAuditData
    (
    ID INT IDENTITY(1,1) NOT NULL,
    QualifiedTableName VARCHAR(500) NULL,
    LastUpdatedDate DATETIME NULL,
    LastRunID INT,
    LastRecordCount BIGINT
    )

7.  The DDL for the stored procedure that captures the audit data is as follows (C:\SQL2012DIRecipes\CH15\pr_AuditETL.Sql); a sketch of the supporting table list and a test run follows the listing:

    CREATE PROCEDURE pr_AuditETL
    AS

    DECLARE @SQL AS VARCHAR(MAX)
    DECLARE @TableToAudit AS VARCHAR(150)
    DECLARE @SchemaToAudit AS VARCHAR(150)
    DECLARE @DatabaseToAudit AS VARCHAR(150)

    DECLARE Tables_CUR CURSOR
    FOR
    SELECT SchemaName, TableName, DatabaseName FROM dbo.RefTables

    OPEN Tables_CUR

    FETCH NEXT FROM Tables_CUR INTO @SchemaToAudit, @TableToAudit, @DatabaseToAudit

    WHILE @@FETCH_STATUS = 0
    BEGIN

        SET @SQL = 'INSERT INTO dbo.TableAuditData (LastUpdatedDate, LastRunID, QualifiedTableName, LastRecordCount)'
                 + ' SELECT MAX(LastRunDate), MAX(RunID), ''' + @DatabaseToAudit + '.' + @SchemaToAudit + '.' + @TableToAudit + ''', COUNT(RunID)'
                 + ' FROM ' + @DatabaseToAudit + '.' + @SchemaToAudit + '.' + @TableToAudit

        EXECUTE (@SQL)

        FETCH NEXT FROM Tables_CUR INTO @SchemaToAudit, @TableToAudit, @DatabaseToAudit
    END

    CLOSE Tables_CUR
    DEALLOCATE Tables_CUR
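The procedure loops over dbo.RefTables, whose DDL is not given in this recipe, and its dynamic SQL assumes that each audited table carries the LastRunDate and RunID columns used by the logging recipes earlier in the chapter. The following is a minimal sketch of the supporting list table and a test run; the column sizes and the registered table name are assumptions, not the book's sample data.

-- Minimal sketch: list of tables for pr_AuditETL to loop over (column sizes are assumptions).
CREATE TABLE dbo.RefTables
(
    SchemaName   VARCHAR(150) NOT NULL,
    TableName    VARCHAR(150) NOT NULL,
    DatabaseName VARCHAR(150) NOT NULL
);

-- Register a hypothetical table to audit.
INSERT INTO dbo.RefTables (SchemaName, TableName, DatabaseName)
VALUES ('dbo', 'Clients', 'MyETLDatabase');

-- Capture the audit metrics and review the latest results.
EXECUTE dbo.pr_AuditETL;

SELECT QualifiedTableName, LastUpdatedDate, LastRunID, LastRecordCount
FROM dbo.TableAuditData
ORDER BY ID DESC;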



How It Works

Once you have logged all that has happened while an ETL process was running, you might wish to carry out some cross-checks at the end of the process to ensure that certain essential elements are in place. This boils down to a few simple checks on key tables, or even all the tables, that you have created or updated. These checks can include verifying:

•   The last processed date(s) for table data
•   The cube last processed date
•   Row counters (total rows in a table and rows last updated)

This is not difficult and rarely takes long. However, it can be well worth the effort as a cross-check on the counters logged by your processes. For this technique, I use the tables used in previous recipes for logging.

One field is fundamental to auditing staging and data tables: DATE_PROCESSED. This must be a DATETIME field. I suggest not setting it as a default of GETDATE(), as it is too easy to forget that a default will only fire when a row is created, not updated. I equally suggest avoiding triggers to set the last processed date and time in ETL processes, since they can slow down the process significantly.

Once all important tables have a DATE_PROCESSED (or LAST_PROCESSED) date field, you can set up a short stored procedure to count the number of rows per table and return the last processed date. For certain tables, you may want to return the number of rows inserted or updated, as well as the total number of rows, to get an idea of the percentage of rows modified. A sketch of such a per-table summary follows.
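As a concrete illustration, here is a minimal sketch that returns the total row count, the number of rows touched today, and the most recent processing date for one table. The table name (dbo.Clients) is a hypothetical example; the DATE_PROCESSED column follows the convention in the Solution above.

-- Per-table audit summary: total rows, rows processed today, and last processed date.
SELECT
    COUNT_BIG(*)                                           AS TotalRows,
    SUM(CASE WHEN DATE_PROCESSED >= CAST(GETDATE() AS DATE)
             THEN 1 ELSE 0 END)                            AS RowsProcessedToday,
    MAX(DATE_PROCESSED)                                    AS LastProcessedDate
FROM dbo.Clients;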






Hints, Tips, and Traps





•   Once you have the selected data, you can store it in an audit table (with the process ID, for instance) to track process metrics over time.

•   If you are looking at a large collection of tables, you may want to store the list of tables in a table and convert the preceding code to dynamic SQL to collect the counters for a varying group of ETL tables.

•   Counting the rows in large tables can take an extremely long time, so you may prefer to read approximate record counts from the old sys.sysindexes table, as sketched below. However, cross-database use of this system view is clumsy, to say the least, and may prove unusable in practice.
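A minimal sketch of the sys.sysindexes approach, run in the database that holds the ETL tables:

-- Approximate row counts without scanning the tables.
-- indid 0 = heap, indid 1 = clustered index; one row per table either way.
SELECT
    OBJECT_SCHEMA_NAME(id) AS SchemaName,
    OBJECT_NAME(id)        AS TableName,
    rowcnt                 AS ApproximateRows
FROM sys.sysindexes
WHERE indid IN (0, 1)
  AND OBJECTPROPERTY(id, 'IsUserTable') = 1
ORDER BY SchemaName, TableName;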



15-23. Logging Audit Data

Problem

You want to be able to carry out in-depth verification of audit data to track key events such as inserts, updates, and deletes.



Solution

Add metadata columns to ETL tables and extend your ETL process to update these columns during the ETL job. The following are ways of tracking each of these events.



Auditing Inserts

Probably the simplest way of logging inserts is to set the IsInserted column to 0 before running a process, and then ensure that you set this column to 1 (True) as part of your INSERT statement. If you are using SSIS, add an IsInserted derived column and enter 1 as the expression. A minimal T-SQL sketch of the first approach follows.
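In this sketch, dbo.Clients and dbo.Clients_Staging are hypothetical table names used only for illustration.

-- Reset the flag before the load starts.
UPDATE dbo.Clients SET IsInserted = 0;

-- Flag the rows added by this run as part of the INSERT itself.
INSERT INTO dbo.Clients (ClientName, IsInserted, DATE_PROCESSED)
SELECT ClientName, 1, GETDATE()
FROM dbo.Clients_Staging;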



Auditing Updates

As for inserts, ensure that the IsUpdated column is set to 1 (True) as part of any UPDATE. If you are using SSIS and have a separate update path, use a separate or temporary table as described in Chapter 11, then add an IsUpdated column to the table holding the updated records and enter 1 as the value. A T-SQL sketch follows.
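A minimal sketch of the equivalent for updates; dbo.Clients and dbo.Clients_Staging are hypothetical names.

-- Flag updated rows and stamp the processing date in the same statement.
UPDATE tgt
SET    tgt.ClientName     = src.ClientName,
       tgt.IsUpdated      = 1,
       tgt.DATE_PROCESSED = GETDATE()
FROM   dbo.Clients AS tgt
INNER JOIN dbo.Clients_Staging AS src
        ON src.ID = tgt.ID;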



Auditing Deletes

By definition, deleted rows will no longer be present in a table, so this leaves you with two choices:

•   Logical deletions (flag records as deleted, and exclude them from further processing).

•   Store deleted records in a _Deleted table.

You can even combine the two: initially flag records as deleted, and then output those records into a _Deleted table after the primary (and presumably time-sensitive) processing has completed (a sketch of this combined approach follows).

For the sake of completeness, logical deletion is as easy as setting an IsDeleted flag to 1 (True).
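A minimal sketch of the combined approach, under the assumption that deletions are detected by comparing against a hypothetical dbo.Clients_Staging load table; all table names here are illustrative.

-- Step 1: logical deletion during the time-sensitive part of the process.
UPDATE dbo.Clients
SET    IsDeleted = 1,
       DATE_PROCESSED = GETDATE()
WHERE  ID NOT IN (SELECT ID FROM dbo.Clients_Staging);

-- Step 2: once the main load has finished, move the flagged rows to a _Deleted table.
INSERT INTO dbo.Clients_Deleted (ID, ClientName, DeleteDate)
SELECT ID, ClientName, GETDATE()
FROM dbo.Clients
WHERE IsDeleted = 1;

DELETE FROM dbo.Clients WHERE IsDeleted = 1;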






Storing deleted records in a _Deleted table during processing can be handled with a T-SQL trigger. You will need the following (a sketch of the companion table appears after the trigger):

•   A copy of the table (imaginatively called Clients_Deleted, for instance, if it is to be based on the Clients table).

•   One extra column in the _Deleted table to hold the deletion date and time (the trigger below calls it DeleteDate, a DATETIME).

•   A delete trigger, something like:

CREATE TRIGGER Trg_Del_Clients ON dbo.Clients
AFTER DELETE
AS
IF @@ROWCOUNT = 0 RETURN
INSERT INTO dbo.Clients_Deleted (ID, ClientName, DeleteDate)
SELECT ID, ClientName, GETDATE() FROM DELETED
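For reference, a minimal sketch of the companion table that the trigger above writes to; the ID and ClientName columns, and their types, are assumptions that mirror a hypothetical dbo.Clients source table.

-- Companion table that receives the deleted rows (column types are assumptions).
CREATE TABLE dbo.Clients_Deleted
(
    ID         INT          NOT NULL,
    ClientName VARCHAR(150) NULL,
    DeleteDate DATETIME     NOT NULL
);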



How It Works

Verifying the number of records added or updated, or checking that staging tables were processed on the date of a batch job, is not enough for in-depth auditing. You need to be able to check the resulting data tables for the following:

•   Inserts
•   Updates
•   Deletes
•   Number of times updated

While the techniques for these are well known, it is probably worth recapitulating the best ways of carrying out these requirements. I am not going to give a crash course in SQL Server's built-in auditing capabilities, as these have been documented superbly elsewhere (www.bradmcgehee.com/2010/03/an-introduction-to-sqlserver-2008-audit springs to mind). Moreover, in an ETL context, I feel that full-fledged auditing is overkill. So here we take just a cursory glance at basic auditing of ETL tables. You are free, of course, to extend these techniques to better meet any requirements that you have.

There are a couple of reasons for implementing key event auditing:

•   As a sanity check, to ensure that reasonable percentages of the rows of a data table are being inserted, updated, or deleted.

•   Running counts of the number of records affected by each type of operation allow you to track these percentages over time and compare each run to a baseline.

For each table that you will be logging, you need to ensure that the following columns are present (a sketch for adding them to an existing table follows the list):

IsInserted BIT NULL
IsUpdated BIT NULL
IsDeleted BIT NULL
DATE_PROCESSED DATETIME
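If the tables already exist, these columns can simply be bolted on. A minimal sketch for one table; dbo.Clients is a hypothetical name.

-- Add the audit columns to an existing ETL table.
ALTER TABLE dbo.Clients
ADD IsInserted     BIT      NULL,
    IsUpdated      BIT      NULL,
    IsDeleted      BIT      NULL,
    DATE_PROCESSED DATETIME NULL;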






Hints, Tips, and Traps





•   You may prefer to use a single Change_Type column (probably a CHAR(1)) and store I, U, or D, instead of using separate columns for each type of operation, as sketched below.

•   Remember to save the state of all the important columns of the source table! If you are copying a large percentage of the columns and rows, consider copying the whole table with a simple INSERT INTO, if time, table size, and disk space permit.
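A minimal sketch of the single-column alternative; the table and staging names are hypothetical.

-- One Change_Type column instead of three flags.
ALTER TABLE dbo.Clients
ADD Change_Type CHAR(1) NULL;

-- Example: mark rows touched by an update.
UPDATE dbo.Clients
SET    Change_Type    = 'U',
       DATE_PROCESSED = GETDATE()
WHERE  ID IN (SELECT ID FROM dbo.Clients_Staging);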



Summary

In this chapter, you saw the various ways in which you can log both the step events in an ETL process and the metrics associated with those steps. You also saw how to write them both to SQL Server tables and to files on disk.

I hope that you have come to appreciate the power of the built-in logging available in SSIS, and especially the wealth of information available in the SSIS 2012 catalog. However, as you have also seen, there will be times when you need to extend these capabilities. This can range from minor tweaks to the built-in objects that allow you to log events from SSIS, to the creation of a completely customized process flow control framework.

The main thing to appreciate, however, is the enormous range and subtlety of the logging and monitoring capabilities of SQL Server. You can choose the approach and techniques that best suit your specific requirements. I hope that this chapter has assisted you in making your choice.






Appendix A



Data Types

Data types are simultaneously the lifeblood and the bane of data migration. They can make or break your ETL process. The first, and indeed fundamental, aspect of a successful data load process is data type mapping. Quite simply, if any source data is of a type that SQL Server cannot understand, then an import is likely to fail.

Therefore (and as a quick refresher course), this appendix presents the data types that you need to understand, first in SQL Server, and then in SSIS.



SQL Server Data Types

Table A-1 is a quick overview of SQL Server data types. I realize that data types are not exactly the most exciting thing on the planet, but they are fundamental to data ingestion and data type validation. So even if you never learn this stuff by heart, at least you have it here as an easily available reference.

Table A-1. SQL Server Data Type Ranges and Storage

Data Type | Description | Range | Storage
----------|-------------|-------|--------
Bigint | Large integer (exact numeric) | -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 | 8 bytes
Int | Integer | -2,147,483,648 to 2,147,483,647 | 4 bytes
Smallint | Small integer (exact numeric) | -32,768 to 32,767 | 2 bytes
Tinyint | Tiny integer (exact numeric) | 0 to 255 | 1 byte
Bit | Binary digit | 0 or 1 | 1 byte
Decimal | (p[, s]) precision and scale | | Precision 1-9: 5 bytes; 10-19: 9 bytes; 20-28: 13 bytes; 29-38: 17 bytes
Numeric | Same as decimal | |
Money | Accurate to 4 decimal places | -922,337,203,685,477.5808 to 922,337,203,685,477.5807 | 8 bytes
Smallmoney | Accurate to 4 decimal places | -214,748.3648 to 214,748.3647 | 4 bytes
Float | Approximate number type for floating-point numeric data | -1.79E+308 to -2.23E-308, 0, and 2.23E-308 to 1.79E+308 | Depends on the value being stored
Real | Approximate number type for floating-point numeric data | -3.40E+38 to -1.18E-38, 0, and 1.18E-38 to 3.40E+38 | 4 bytes
Datetime | | January 1, 1753 to December 31, 9999 |
Datetime2 | | 01/01/0001 to 31/12/9999 |
DateTimeOffset | | |
Smalldatetime | | January 1, 1900 to June 6, 2079 |
Char (n) | Fixed-length, single-byte character | 8000 characters | n bytes
Varchar (n) | Variable-length, single-byte character | 8000 characters | n bytes
Varchar(MAX) | Variable-length, single-byte character | (2^31)-1 characters | Actual data + 2 bytes
nChar (n) | Fixed-length, double-byte character | 4000 characters | n bytes times 2
nVarchar (n) | Variable-length, double-byte character | 4000 characters | 2 times the number of characters entered + 2 bytes
nVarchar(MAX) | Variable-length, double-byte character | (2^31)-1 characters | 2 times the number of characters entered + 2 bytes
Binary (n) | Fixed-length binary data | 8000 binary characters | n bytes
Varbinary (n) | Variable-length binary data | 8000 binary characters | n bytes
Varbinary(MAX) | | (2^31)-1 binary characters | Actual data + 2 bytes
Uniqueidentifier | Uniqueidentifier data type | | 16 bytes
Timestamp | Automatically generated, unique binary number | | 8 bytes
XML | XML data | | 2 GB
HierarchyID | A variable-length, system data type | | Actual data
Geography | Geographic and geodesic data; CLR data type | |
Geometry | CLR data type | |
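When mapping source columns to the types in Table A-1, it can help to inventory what a source SQL Server database actually contains. A minimal sketch using the standard INFORMATION_SCHEMA.COLUMNS view:

-- List every column, its declared type, and its length/precision for mapping review.
SELECT
    TABLE_SCHEMA,
    TABLE_NAME,
    COLUMN_NAME,
    DATA_TYPE,
    CHARACTER_MAXIMUM_LENGTH,
    NUMERIC_PRECISION,
    NUMERIC_SCALE
FROM INFORMATION_SCHEMA.COLUMNS
ORDER BY TABLE_SCHEMA, TABLE_NAME, ORDINAL_POSITION;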






SSIS Data Types

Table A-2 provides a quick refresher course on the available SSIS data types.

Table A-2. SSIS Data Types

Data Type | Description
----------|------------
DT_BOOL | A Boolean value.
DT_BYTES | A binary variable-length data value. Its maximum length is 8000 bytes.
DT_CY | A currency value. This is an eight-byte signed integer with a scale of 4 digits and a maximum precision of 19 digits.
DT_DATE | A date type consisting of year, month, day, hour, minute, seconds, and fractional seconds. The fractional seconds have a fixed scale of 7 digits. It is implemented using an 8-byte floating-point number.
DT_DBDATE | A date type consisting of year, month, and day.
DT_DBTIME | A time type consisting of hour, minute, and second.
DT_DBTIME2 | A time type consisting of hour, minute, second, and fractional seconds. Fractional seconds have a maximum scale of 7 digits.
DT_DBTIMESTAMP | A timestamp structure that consists of year, month, day, hour, minute, second, and fractional seconds. The fractional seconds have a fixed scale of 3 digits.
DT_DBTIMESTAMP2 | A timestamp structure that consists of year, month, day, hour, minute, second, and fractional seconds. Fractional seconds have a maximum scale of 7 digits.
DT_DBTIMESTAMPOFFSET | A timestamp structure that consists of year, month, day, hour, minute, second, and fractional seconds. The fractional seconds have a maximum scale of 7 digits.
DT_DECIMAL | An exact numeric value having a fixed precision and a fixed scale. A 12-byte unsigned integer data type with a separate sign, a scale of 0 to 28, and a maximum precision of 29.
DT_FILETIME | A 64-bit value that represents the number of 100-nanosecond intervals since January 1, 1601. The fractional seconds have a maximum scale of 3 digits.
DT_GUID | A GUID.
DT_I1 | A one-byte, signed integer.
DT_I2 | A two-byte, signed integer.
DT_I4 | A four-byte, signed integer.
DT_I8 | An eight-byte, signed integer.
DT_NUMERIC | An exact numeric value with a fixed precision and scale. This data type is a 16-byte unsigned integer with a separate sign, a scale of 0 to 38, and a maximum precision of 38.
DT_R4 | A single-precision floating-point value.
DT_R8 | A double-precision floating-point value.
DT_STR | A null-terminated ANSI/MBCS character string with a maximum length of 8000 characters.
DT_UI1 | A one-byte, unsigned integer.
DT_UI2 | A two-byte, unsigned integer.
DT_UI4 | A four-byte, unsigned integer.
DT_UI8 | An eight-byte, unsigned integer.
DT_WSTR | A null-terminated Unicode character string with a maximum length of 4000 characters.
DT_IMAGE | A binary value with a maximum size of 2^31 - 1 (2,147,483,647) bytes.
DT_NTEXT | A Unicode character string with a maximum length of 2^30 - 1 characters.
DT_TEXT | An ANSI/MBCS character string with a maximum length of 2^31 - 1 characters.



Default Data Mapping in the Import/Export Wizard

The Import/Export Wizard bases data conversion on a set of XML files that you can find in the following directories: C:\Program Files (x86)\Microsoft SQL Server\110\DTS\MappingFiles and/or C:\Program Files\Microsoft SQL Server\110\DTS\MappingFiles.

There are three good reasons for knowing that these files exist:

•   They provide a baseline reference for data type mapping which, although perhaps not definitive, can be a valuable guide.

•   The files can be modified to suit the data type mappings that you prefer to use.

•   You can write your own data type mapping files for use with the Import/Export Wizard (although I do not show you how to do this).

Tables A-3 to A-35 provide a tabular view of the mapping data from some of these files, so you can see what the Import/Export Wizard suggests as basic data type mapping, and then possibly use it as a basis for your own conversion processes.






MSSQL9 to MSSQL8

Table A-3. MSSQL9 to MSSQL8 Data Mapping

Source Data Type | Destination Data Type
-----------------|----------------------
smallint | smallint
int | int
real | real
float | FLOAT
smallmoney | smallmoney
money | money
bit | bit
tinyint | tinyint
bigint | bigint
uniqueidentifier | uniqueidentifier
varbinary | varbinary
varbinarymax | image
timestamp | timestamp
binary | binary
image | image
text | text
char | CHAR
varchar | VARCHAR
varcharmax | TEXT
nchar | NCHAR
nvarchar | nvarchar
nvarcharmax | ntext
XML | ntext
ntext | ntext
decimal | decimal
numeric | numeric
datetime | datetime
datetime2 | datetime
datetimeoffset | datetime
time | datetime
date | datetime
smalldatetime | smalldatetime
sql_variant | sql_variant



MSSQL to DB2

Table A-4. MSSQL to DB2 Data Mapping

Source Data Type | Destination Data Type | Length | Precision | Scale
-----------------|-----------------------|--------|-----------|------
smallint | SMALLINT | | |
int | INTEGER | | |
real | REAL | | |
float | DOUBLE | | |
smallmoney | DECIMAL | | 10 | 4
money | DECIMAL | | 19 | 4
bit | SMALLINT | | |
tinyint | SMALLINT | | |
bigint | BIGINT | | |
uniqueidentifier | CHAR | 38 | |
varbinary | VARCHAR(8000) FOR BIT DATA | | |
timestamp | CHAR(8) FOR BIT DATA | | |
binary | CHAR(8000) FOR BIT DATA | | |
xml | LONG VARGRAPHIC | | |
image | VARCHAR(32672) FOR BIT DATA | | |
sql_Variant | VARCHAR(32672) FOR BIT DATA | | |
text | LONG VARCHAR | | |
char | CHAR | | |
varchar | VARCHAR | | |
nchar | GRAPHIC | | |
nvarchar | VARGRAPHIC | | |
ntext | LONG VARGRAPHIC | | |
(continued)


