14-9. Managing Foreign Key Constraints

       ,RefCOL.name AS REFERENCED_COLUMN_NAME
       ,CAST(NULL AS NVARCHAR(300)) AS ColumnList

FROM   sys.foreign_keys FRK
       INNER JOIN sys.objects OBJ
           ON FRK.parent_object_id = OBJ.object_id
           AND FRK.schema_id = OBJ.schema_id
       INNER JOIN sys.schemas SCH
           ON OBJ.schema_id = SCH.schema_id
       INNER JOIN sys.foreign_key_columns FKC
           ON FRK.object_id = FKC.constraint_object_id
           AND FRK.parent_object_id = FKC.parent_object_id
       INNER JOIN sys.columns COL
           ON FKC.parent_column_id = COL.column_id
           AND FKC.parent_object_id = COL.object_id
       INNER JOIN sys.columns AS RefCOL
           ON FKC.referenced_object_id = RefCOL.object_id
           AND FKC.referenced_column_id = RefCOL.column_id
       INNER JOIN sys.objects AS RefTBL
           ON FKC.referenced_object_id = RefTBL.object_id
       INNER JOIN sys.schemas AS RefSCH
           ON RefTBL.schema_id = RefSCH.schema_id

WHERE  FRK.is_ms_shipped = 0
       AND FRK.is_not_trusted = 0;

-- Define (concatenate) list of key columns



IF OBJECT_ID('TempDB..#Tmp_IndexedFields') IS NOT NULL
    DROP TABLE TempDB..#Tmp_IndexedFields;

WITH Core_CTE ( FOREIGN_KEY_NAME, Rank, COLUMN_NAME )
AS ( SELECT FOREIGN_KEY_NAME,
            ROW_NUMBER() OVER( PARTITION BY FOREIGN_KEY_NAME ORDER BY FOREIGN_KEY_NAME),
            CAST( COLUMN_NAME AS VARCHAR(MAX) )
     FROM   Tmp_Metadata_ForeignKeys),

Root_CTE ( FOREIGN_KEY_NAME, Rank, COLUMN_NAME )
AS ( SELECT FOREIGN_KEY_NAME, Rank, COLUMN_NAME
     FROM   Core_CTE
     WHERE  Rank = 1 ),

Recursion_CTE ( FOREIGN_KEY_NAME, Rank, COLUMN_NAME )
AS ( SELECT FOREIGN_KEY_NAME, Rank, COLUMN_NAME
     FROM   Root_CTE

     UNION ALL

     SELECT Core_CTE.FOREIGN_KEY_NAME, Core_CTE.Rank,
            Recursion_CTE.COLUMN_NAME + ', ' + Core_CTE.COLUMN_NAME
     FROM   Core_CTE
            INNER JOIN Recursion_CTE
            ON Core_CTE.FOREIGN_KEY_NAME = Recursion_CTE.FOREIGN_KEY_NAME
            AND Core_CTE.Rank = Recursion_CTE.Rank + 1
   )

SELECT     FOREIGN_KEY_NAME, MAX( COLUMN_NAME ) AS ColumnList
INTO       #Tmp_IndexedFields
FROM       Recursion_CTE
GROUP BY   FOREIGN_KEY_NAME;



UPDATE     T
SET        T.ColumnList = Tmp.ColumnList
FROM       Tmp_Metadata_ForeignKeys T
           INNER JOIN #Tmp_IndexedFields Tmp
           ON T.FOREIGN_KEY_NAME = Tmp.FOREIGN_KEY_NAME;

3. DROP the foreign keys in the destination database by running the following piece of T-SQL (C:\SQL2012DIRecipes\CH14\DropForeignKeys.Sql):

-- Drop table to hold script elements
IF OBJECT_ID('tempdb..#ScriptElements') IS NOT NULL
    DROP TABLE tempdb..#ScriptElements;

CREATE TABLE #ScriptElements (ID INT IDENTITY(1,1), ScriptElement NVARCHAR(MAX));

INSERT INTO #ScriptElements
SELECT DISTINCT
       'ALTER TABLE '
       + SCHEMA_NAME
       + '.'
       + TABLE_NAME
       + ' DROP CONSTRAINT '
       + FOREIGN_KEY_NAME
FROM   Tmp_Metadata_ForeignKeys;

-- Execute DROP scripts

DECLARE @DropFK NVARCHAR(MAX)

DECLARE DropIndex_CUR CURSOR
FOR
SELECT ScriptElement FROM #ScriptElements ORDER BY ID

OPEN DropIndex_CUR

FETCH NEXT FROM DropIndex_CUR INTO @DropFK

WHILE @@FETCH_STATUS <> -1
BEGIN

    EXEC (@DropFK)

    FETCH NEXT FROM DropIndex_CUR INTO @DropFK

END;

CLOSE DropIndex_CUR;
DEALLOCATE DropIndex_CUR;

4. Carry out the data load process to update the destination database.

5. Using code like the following, re-create the foreign keys from the persisted metadata (C:\SQL2012DIRecipes\CH14\CreateForeignKeys.Sql):

-- Drop table to hold script elements
IF OBJECT_ID('tempdb..#ScriptElements') IS NOT NULL
    DROP TABLE tempdb..#ScriptElements;

CREATE TABLE #ScriptElements (ID INT IDENTITY(1,1), ScriptElement NVARCHAR(MAX));

INSERT INTO #ScriptElements
SELECT DISTINCT
       'ALTER TABLE '
       + SCHEMA_NAME
       + '.'
       + TABLE_NAME
       + ' ADD CONSTRAINT '
       + FOREIGN_KEY_NAME
       + ' FOREIGN KEY ('
       + ColumnList
       + ') REFERENCES '
       + REFERENCED_TABLE_SCHEMA_NAME
       + '.'
       + REFERENCED_TABLE_NAME
       + ' ('
       + ColumnList
       + ')'
FROM   Tmp_Metadata_ForeignKeys;

-- Execute CREATE scripts

DECLARE @CreateFK NVARCHAR(MAX)

DECLARE DropIndex_CUR CURSOR
FOR
SELECT ScriptElement FROM #ScriptElements ORDER BY ID

OPEN DropIndex_CUR

FETCH NEXT FROM DropIndex_CUR INTO @CreateFK

WHILE @@FETCH_STATUS <> -1
BEGIN

    EXEC (@CreateFK)

    FETCH NEXT FROM DropIndex_CUR INTO @CreateFK

END;

CLOSE DropIndex_CUR;
DEALLOCATE DropIndex_CUR;



How It Works

The sys.objects, sys.foreign_keys, sys.schemas, sys.foreign_key_columns, and sys.columns catalog views are used to get the metadata. It is a little more complex than the metadata required for check constraints, but it works! Then the DROP and CREATE scripts are written based on this metadata. All you have to do is copy and execute the scripts that are produced. It does the following (very much like Recipe 14-8):





First, it stores the metadata required to DROP and ADD foreign keys in the Tmp_Metadata_ForeignKeys table. Second, it uses this information to generate and execute DROP scripts. Finally, it uses this information to generate and execute ADD scripts.

These scripts can be placed at appropriate stages in an ETL process, as demonstrated in Recipe 14-2.
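As an aside, the recursive CTE used above to build the ColumnList value is not the only way to concatenate the key columns. A minimal sketch of an alternative using the FOR XML PATH pattern (a swapped-in technique, not the recipe's original code, assuming the same Tmp_Metadata_ForeignKeys table and ColumnList column; note that column names containing XML-reserved characters would be entitized):

UPDATE  FK
SET     FK.ColumnList = STUFF(( SELECT ', ' + FKC.COLUMN_NAME
                                FROM   Tmp_Metadata_ForeignKeys AS FKC
                                WHERE  FKC.FOREIGN_KEY_NAME = FK.FOREIGN_KEY_NAME
                                FOR XML PATH('') ), 1, 2, '')   -- strip the leading ', '
FROM    Tmp_Metadata_ForeignKeys AS FK;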






Hints, Tips, and Traps





Always script out your staging database and make a backup copy before dropping constraints.

When adding foreign key constraints at various stages in an ETL process, you can use the constraints creation script and filter it using relevant table names to create specific foreign key constraints, as required.

This script assumes no need for quoted identifiers in the metadata. If this is not the case with your database naming convention, then you need to handle quoted identifiers.

You may prefer to test that the foreign keys will work (using a piece of T-SQL to isolate nonconforming records) before re-creating them; a short sketch follows this list.
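For example, a minimal sketch of both ideas: filtering the generated CREATE statements to a single table (quoting the identifiers with QUOTENAME as a precaution), and isolating child rows that would violate a single-column key before re-creating it. The Invoice, Client, and ClientID names are illustrative assumptions, not taken from the recipe:

-- Generate CREATE scripts for one table only, quoting identifiers
SELECT 'ALTER TABLE ' + QUOTENAME(SCHEMA_NAME) + '.' + QUOTENAME(TABLE_NAME)
       + ' ADD CONSTRAINT ' + QUOTENAME(FOREIGN_KEY_NAME)
       + ' FOREIGN KEY (' + ColumnList + ') REFERENCES '
       + QUOTENAME(REFERENCED_TABLE_SCHEMA_NAME) + '.' + QUOTENAME(REFERENCED_TABLE_NAME)
       + ' (' + ColumnList + ')'
FROM   Tmp_Metadata_ForeignKeys
WHERE  TABLE_NAME = 'Invoice';          -- hypothetical table name

-- Isolate child rows that would break a single-column key before re-creating it
SELECT chd.*
FROM   dbo.Invoice AS chd               -- hypothetical child table
WHERE  chd.ClientID IS NOT NULL
       AND NOT EXISTS ( SELECT 1
                        FROM   dbo.Client AS par   -- hypothetical parent table
                        WHERE  par.ID = chd.ClientID );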



14-10. Optimizing Bulk Loads

Problem

You want to make a bulk load run as fast as possible.



Solution

Ensure that the BulkLoad API is used to perform the load. Then verify that only minimal logging is used during

the data load, and that any indexes and constraints are handled appropriately. The following is an example

using SSIS.

1. Create a destination table using the following DDL (C:\SQL2012DIRecipes\CH14\tblFastLoadClients.Sql):

CREATE TABLE CarSales_Staging.dbo.FastLoadClients
(
    ID INT,
    ClientName NVARCHAR(150),
    Address1 VARCHAR(50),
    Address2 VARCHAR(50),
    Town VARCHAR(50),
    County VARCHAR(50),
    PostCode VARCHAR(10),
    ClientType VARCHAR(20),
    ClientSize VARCHAR(10),
    ClientSince DATETIME,
    IsCreditWorthy BIT,
    DealerGroup BINARY(892)
);
GO



2. Create a new SSIS package.

3. Add two OLEDB connection managers, the first named CarSales_OLEDB, which connects to the CarSales database, and the second named CarSales_Staging_OLEDB, which connects to the CarSales_Staging database.



4. Create the following metadata tables: CarSales_Staging.dbo.IndexList (Recipe 14-3, step 1), CarSales_Staging.dbo.Metadata_XMLIndexes (Recipe 14-6, step 1), CarSales_Staging.dbo.MetaData_CheckConstraints (Recipe 14-8, step 1), and CarSales_Staging.dbo.Tmp_Metadata_ForeignKeys (Recipe 14-9, step 1).



5. Add an Execute SQL task named GatherIndexMetadata. Configure it to use the CarSales_Staging_OLEDB connection manager. Set the SQL source type as Direct Input, and the SQL Statement as the code given in Recipe 14-3, step 2.

6. Add an Execute SQL task named GatherXMLIndexMetadata. Connect it to the task created previously (GatherIndexMetadata). Configure it to use the CarSales_Staging_OLEDB connection manager. Set the SQL source type as Direct Input, and the SQL Statement as the code given in Recipe 14-6, step 2.

7. Add an Execute SQL task named GatherCheckConstraintMetadata. Connect it to the task created previously (GatherXMLIndexMetadata). Configure it to use the CarSales_Staging_OLEDB connection manager. Set the SQL source type as Direct Input, and the SQL Statement as the code given in Recipe 14-8, step 2.



8. Add an Execute SQL task named GatherForeignKeyMetadata. Configure it to use the CarSales_Staging_OLEDB connection manager. Connect it to the task created previously (GatherCheckConstraintMetadata). Set the SQL source type as Direct Input, and the SQL Statement as the code given in Recipe 14-9, step 2.

9. Add an Execute SQL task named DropForeignKeys. Configure it to use the CarSales_Staging_OLEDB connection manager. Connect it to the task created previously (GatherForeignKeyMetadata). Set the SQL source type as Direct Input, and the SQL Statement as the code given in Recipe 14-9, step 3.

10. Add an Execute SQL task named DropCheckConstraints. Configure it to use the CarSales_Staging_OLEDB connection manager. Connect it to the task created previously (DropForeignKeys). Set the SQL source type as Direct Input, and the SQL Statement as the code given in Recipe 14-8, step 3.

11. Add an Execute SQL task named DropXMLIndexes. Configure it to use the CarSales_Staging_OLEDB connection manager. Connect it to the task created previously (DropCheckConstraints). Set the SQL source type as Direct Input, and the SQL Statement as the code given in Recipe 14-6, step 3.

12. Add an Execute SQL task named DropIndexes. Configure it to use the CarSales_Staging_OLEDB connection manager. Connect it to the task created previously (DropXMLIndexes). Set the SQL source type as Direct Input, and the SQL Statement as the code given in Recipe 14-4.



13. Add an Execute SQL task named Set BulkLogged Recovery. Configure it to use the CarSales_Staging_OLEDB connection manager. Connect it to the task created previously (DropIndexes). Set the SQL source type as Direct Input, and the SQL Statement as the code shown here:

ALTER DATABASE CarSales_Staging
SET RECOVERY BULK_LOGGED;



14. Add a Data flow task, which you connect to the Execute SQL task named Set BulkLogged Recovery. Double-click to edit.



15. Add an OLEDB source, which you configure to use the CarSales_OLEDB connection manager and the source query:

SELECT  ID, ClientName, Address1, Address2, Town, County, PostCode, ClientType
        ,ClientSize, ClientSince, IsCreditWorthy, DealerGroup
FROM    Client



16. Add an OLEDB destination, which you connect to the OLEDB source. Double-click to edit.

17. Select the CarSales_Staging_OLEDB connection manager and the dbo.FastLoadClients table using the Table or View—Fast Load data access mode.

18. Ensure that Table Lock is checked.

19. Click OK to finish your configuration.

20. Return to the Data Flow pane.



21. Add an Execute SQL task named CreateIndexes. Configure it to use the CarSales_Staging_OLEDB connection manager. Connect it to the Data Flow task created previously. Set the SQL source type as Direct Input, and the SQL Statement as the code given in Recipe 14-5.

22. Add an Execute SQL task named CreateXMLIndexes. Configure it to use the CarSales_Staging_OLEDB connection manager. Connect it to the task created previously (CreateIndexes). Set the SQL source type as Direct Input, and the SQL Statement as the code given in Recipe 14-6, step 5.

23. Add an Execute SQL task named CreateCheckConstraints. Configure it to use the CarSales_Staging_OLEDB connection manager. Connect it to the task created previously (CreateXMLIndexes). Set the SQL source type as Direct Input, and the SQL Statement as the code given in Recipe 14-8, step 5.

24. Add an Execute SQL task named CreateForeignKeys. Configure it to use the CarSales_Staging_OLEDB connection manager. Connect it to the task created previously (CreateCheckConstraints). Set the SQL source type as Direct Input, and the SQL Statement as the code given in Recipe 14-9, step 5.

25. Add an Execute SQL task named Set Full Recovery. Configure it to use the CarSales_Staging_OLEDB connection manager. Connect it to the task created previously (CreateForeignKeys). Set the SQL source type as Direct Input, and the SQL Statement as the following code:

ALTER DATABASE CarSales_Staging
SET RECOVERY FULL;



You can now run the load process. Once it has finished, you should immediately back up the database.






How It Works

In this recipe, we are aiming for two somewhat interdependent objectives: bulk loading and minimal logging.

The reasons for trying to attain these objectives are simple:





Using the SQL Server BulkLoad API during a data load will write data in batches, and not row by row. This will inevitably result in considerably shorter process duration, because using the bulkload API will always be faster than a "normal" row-by-row write to a table. It always requires table locking.







Minimally logged operations are nearly always faster than fully logged operations because of the reduced overhead they imply. This is because minimally logged operations only track extent allocations and metadata changes. This means lower memory requirements, less disk I/O, and reduced processor load. Minimally logged operations also reduce one potential area of risk, which is log space. Even if your log is optimized to perfection, a large data load can fill the log—or disk—and thus stop the load process. If the log has to be resized during a load, this will cause the load to slow down considerably while more log space is added.



However, minimal logging requires the database to be in the SIMPLE or BULK_LOGGED recovery model. An operation can be a bulk load operation without being minimally logged, but ensuring that the two are combined should always be the preferred outcome if load speed is the main aim.

Attaining the nirvana of ultimate speed in data loads is not, however, just as simple as setting a recovery model and ensuring that the TABLOCK hint is set. There are two other fundamental aspects to consider: replication and indexes.

The first of these is simple: if replication is enabled on a table, then minimal logging is—by definition—impossible, because the transaction log is used by the replication process.

Indexing is much more complicated because the outcome can depend on the type of index (clustered or nonclustered) and whether the table is empty or not.

Let’s look at these possibilities in more detail.





Heap table (no indexes): This is the simplest case, where data pages are minimally logged.







Heap table (with nonclustered indexes): Data pages are minimally logged. Index pages,

however, are minimally logged when the table is empty, but logged if there is data in the

table. There is an exception to this rule—the first batch that is loaded into an empty table.

For this first batch, both data and indexes are minimally logged. For the second and all

further batches, only the data is minimally logged—index pages are fully logged.







Clustered index: If the table is empty, data and index pages are minimally logged. If the

table contains data, both data and index pages are fully logged.



You will note that in this recipe, I switched to the Bulk-Logged recovery model for the duration of the load.

I also backed up the database once the load finished. This is an admittedly simplistic case where the destination

database is clearly a staging area and there are no transactions running during the load process. I am aware that

this might not be the case in many other situations, which could render this approach impracticable.
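If you want to confirm these prerequisites before running a load, the catalog views can tell you both the current recovery model and whether the destination table is replicated. A minimal sketch using the names from this recipe:

-- Current recovery model of the destination database
SELECT name, recovery_model_desc
FROM   sys.databases
WHERE  name = 'CarSales_Staging';

-- Is the destination table replicated? (replication rules out minimal logging)
SELECT name, is_replicated
FROM   CarSales_Staging.sys.tables
WHERE  name = 'FastLoadClients';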



■■Note Efficient bulk loading can get more complex than it seems since there are many ways in which the core

requirements can mesh. All this is described exhaustively in the Microsoft whitepaper The Data Loading Performance

Guide (http://msdn.microsoft.com/en-us/library/dd425070(v=sql.100).aspx). This whitepaper may have

been written for SQL Server 2008, but its insights apply equally well to SQL Server 2012. It can take many months of experience, however, to appreciate all the subtleties and interactions of high-performance data loading using SQL Server.




One aspect of bulk inserts that was not covered in this recipe (it was covered exhaustively in many of the recipes in Chapter 13) was parallel loading of data. In my experience, parallel loads will always be faster than single loads. However, parallel loading is only possible if the following conditions are met:





The destination table is not indexed—in this case, parallel data importing is supported by all three bulk import commands: BCP, BULK INSERT, and INSERT... SELECT * FROM OPENROWSET(BULK...). This is because when indexes exist on a table, it is impossible to carry out a parallel load operation simply by applying the TABLOCK hint. Consequently, before a bulk-import operation, you must drop indexes.







The TABLOCK hint must be specified. This is because concurrent threads block each

other if TABLOCK is not specified.







Parallelization works best for nonoverlapping data sources—files where any “key”

fields are completely separated in each source file.



If you are using BULK INSERT, BCP, or INSERT ... SELECT * FROM OPENROWSET(BULK...), and the table is

empty and has a clustered index (and the data in the data file is ordered in the same way as the clustered index

key columns), you must also carry out the following:





Bulk import the data with the clustered index already in place.







Specify the ORDER hint, as well as the TABLOCK hint.



If the destination table is empty, this should be faster than dropping the clustered index, loading the data,

and then regenerating the clustered index, since a sort step is not required.
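As an illustration, a minimal BULK INSERT sketch along these lines. The file path, the file format options, and the assumption of a clustered index on the destination's ID column are hypothetical, not part of this recipe:

BULK INSERT CarSales_Staging.dbo.FastLoadClients
FROM 'C:\SQL2012DIRecipes\CH14\Clients.csv'   -- hypothetical presorted source file
WITH
(
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    TABLOCK,                  -- required for minimal logging and parallel loads
    ORDER (ID ASC),           -- declares that the file is already sorted on the clustered key
    BATCHSIZE = 100000
);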



Hints, Tips, and Traps





Creating indexes can take considerable time and (especially creating clustered indexes) can be very demanding on system resources. So the obvious approach is to try to run indexing operations in parallel. This is handled easily from SSIS by using a sequence container with a series of Execute SQL tasks, each of which handles one or more indexing operations. My only advice is not to go overboard on this, and certainly do not set up too many parallel tasks, since the disk subsystem will probably saturate long before you see any further reduction in processing time. Try a couple of parallel tasks, add more one at a time, and then time the results. Indeed, when rebuilding or reorganizing indexes in parallel, I also advise you to handle all the indexes for a single table inside the same process, and not to reprocess indexes from the same table using multiple threads.







If you drop secondary indexes, you can try re-creating them in parallel by running the

creation of each secondary index from a separate client. This can be tricky to set up,

however.







If you are importing a small amount of new data relative to the amount of existing data,

dropping and rebuilding the indexes may be counterproductive. The time required to

rebuild the indexes could be longer than the time saved during the bulk operation.







You can avoid Lock Escalation on the destination table by using, for example (see the sketch after this list):

ALTER TABLE dbo.FastLoadClients SET (LOCK_ESCALATION = DISABLE);







If trace flag 610 is enabled, inserts into a table with indexes are generally a minimally

logged operation. Tables with clustered indexes support multiple bulk load streams

inserting data concurrently, provided that these streams contain nonoverlapping data. If

data overlap is detected, data streams will block—but not deadlock.










BATCHSIZE = 0 should not be used for tables that have indexes—set an explicit batch size, as bulk import operations tend to perform much better with large batch sizes. The only caveats are that if there are multiple indexes on a destination, this could increase memory pressure due to the sort requirement. Also, in parallel loads (without the TABLOCK hint being used), a large batch size can lead to blocking.
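A minimal sketch that pulls the trace flag and lock escalation hints together. The table name is the one used in this recipe, and the settings are reverted after the load; trace flag 610 is an instance-wide setting, so enable it only after weighing the consequences:

-- Before the load: allow minimally logged inserts into indexed tables (trace flag 610)
DBCC TRACEON (610, -1);

-- Prevent lock escalation on the destination table for the duration of the load
ALTER TABLE CarSales_Staging.dbo.FastLoadClients SET (LOCK_ESCALATION = DISABLE);

-- ... run the bulk load here ...

-- After the load: restore the defaults
ALTER TABLE CarSales_Staging.dbo.FastLoadClients SET (LOCK_ESCALATION = TABLE);
DBCC TRACEOFF (610, -1);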



Summary

It is my hope that this chapter has given you insight into some of the many techniques used to reduce the time

for an ETL process to load data. The search for efficiency centered on using the SQL Server Bulk Insert API

efficiently, and consequently learning to manage destination table indexes and constraints in an optimal fashion

to ensure that minimal logging is applied.

However, there are as many optimization techniques as there are different challenges to overcome and

issues to resolve. I can only advise you, when faced with a tight SLA to attain, to take a step back and assess the problem in its entirety. Always aim for bulk loads, try to ensure that minimal logging is in place, and use the

most efficient drivers/providers that you can find. Narrow the source data down to the essential, and avoid any

superfluous columns. You can even narrow data types if you are sure that you will not be truncating source data.

Generally, it is more efficient to sort data in source databases, and ask for source files to be presorted

(then indicate this to SSIS, BCP, or BULK INSERT).

Above all, take nothing as a given—and test, test, and re-test.

When tailoring an ETL process, it is always worth thinking through your indexing strategy. This chapter has

shown you the following approaches:





Inhibit (disable) all but the clustered indexes. Then rebuild all the indexes on the destination and/or staging tables (a short sketch of this approach follows these lists).







Drop all indexes (clustered, nonclustered, and any other). Then load the data and finally re-create all indexes—in the correct order to avoid wasting time and server resources.



Other potential strategies can include:





Disable all but the clustered indexes. Then drop any clustered indexes and load the data.

Finally, create any clustered indexes and rebuild other indexes.







Disable all but the clustered indexes. Then drop any clustered indexes and load the data. Next, create any clustered indexes. Potentially much later, rebuild any other indexes that are essential for the ETL package to function optimally, leaving user indexes to be rebuilt offline.
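As a sketch of the first approach listed above (the index and table names are purely illustrative, not taken from this chapter's examples):

-- Disable a nonclustered index before the load (repeat for each nonclustered index)
ALTER INDEX IX_Client_ClientName ON dbo.Client DISABLE;   -- hypothetical index and table

-- ... load the data ...

-- Rebuild every index on the table; rebuilding re-enables the disabled indexes
ALTER INDEX ALL ON dbo.Client REBUILD;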






Chapter 15



Logging and Auditing

Designing, developing, and implementing sophisticated data integration systems is always a test of your

knowledge and ingenuity. The challenge does not stop, however, when the system seems ready for production. As

no ETL system can ever be perfect, any industrial-strength process needs to be able to handle errors gracefully.

Not only that, but it must also be able to tell you that an error has occurred and be able to decide whether it can

continue with a data load or not. It also has to be able to track the progress of a process as it runs, and return all

the information and metrics that you need proactively to monitor the process as it runs and (more likely) after it

has run. This means gathering data on





What has run, and when.

What has succeeded, and when.

What has failed, including when and why, which most of the time includes error source, error messages, error codes, and the severity of the error.



Equally important is the requirement to keep metrics (counters)—at every important stage of the

process—for both error rows and rows successfully processed.

On top of this, you will need another fundamental metric—time elapsed—for all important parts of the

process. There are three main reasons for wanting this data:





To find and correct errors as fast as possible if a process should fail.

To get an easily accessible, high-level overview of the process once it has run, and to keep an eye on any potential warning signs.

To build comparisons over time of the essential metrics gathered during a process. These comparisons will allow you to see—in the cold hard light of accumulated data—which parts of the process take more time or resources. Consequently, you can be proactive about resolving any potential issues.



Yet merely running the process from end to end is not enough in many cases. You then need to verify the

results, and for the finalized datasets, produce counters of (among other things) the following:





New records

Updated records

NULLS in essential final data sets

Counters for core data
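To make these requirements concrete, here is a minimal sketch of a logging table that could hold this kind of information. The table and column names are purely illustrative and are not the structures developed in the recipes that follow:

CREATE TABLE dbo.ProcessLog                 -- hypothetical, for illustration only
(
    LogID          INT IDENTITY(1,1) PRIMARY KEY,
    ProcessName    NVARCHAR(128),
    StepName       NVARCHAR(128),
    StartTime      DATETIME,
    EndTime        DATETIME,
    RowsProcessed  INT,
    RowsInError    INT,
    ErrorSource    NVARCHAR(128),
    ErrorCode      INT,
    ErrorSeverity  INT,
    ErrorMessage   NVARCHAR(MAX)
);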


