13-15. Executing SQL Statements and Procedures in Parallel Using SQL Server Agent

Chapter 13 ■ Organising And Optimizing Data Loads



Table 13-2. (continued)

Technique | Advantages | Disadvantages
Ordering and filtering file loads | Impossible with standard or MULTIFLATFILE connection managers. | Trickier to set up.
Parallel file loads | Relatively easy to set up. Considerable increase in throughput. | Files must be identical in format. Only efficient on multiprocessor servers.
Parallel file loads with load balancing | Considerable increase in throughput. | Hard to set up.
Parallel loading from a single data source | Relatively easy to set up. Some increase in throughput. | Only efficient on multiprocessor servers.
Parallel reads and parallel writes for SQL databases | Relatively easy to set up. Considerable increase in throughput. | Only efficient on multiprocessor servers.
Controlled batch file loads | Allows timed loads and controlled numbers of files to be loaded. | Hard to set up.
Parallel SQL statements | Requires SSIS or xp_cmdshell. | No dependencies between the procedures.




Chapter 14



ETL Process Acceleration

Sometimes, getting source data to load fast into SQL Server requires a good look at the whole process. Certainly, if you can take full advantage of all available processor cores and attempt to parallelize the load, you could shorten load times, as we saw in Chapter 13. Yet there are other aspects of ETL loads that also need to be taken into consideration if you are trying to optimize a load and ensure that the entire time taken by a job (and not just the data load) is reduced to fit into an acceptable timeframe.

In this chapter, we take a look at some of these other aspects of ETL loads, and how they can be tweaked to ensure optimal load times. They include:

• Using the SSIS Lookup Cache efficiently.
• Index management for destination tables.
• Ensuring minimal logging.
• Ensuring bulk loads rather than row-by-row inserts.



As ever, any sample files are available on the book’s companion web site. Once downloaded and installed, they are found in the C:\SQL2012DIRecipes\CH14 folder. I advise you to drop and re-create any tables or other objects for each recipe.

As a precursor to examining some of these more “advanced” ideas, there are a handful of basic techniques that bear repetition. So, before embarking on complex solutions to challenging load times, just remember to look at the following as a first step:

• When defining source data in a Lookup task, use a SELECT T-SQL query and not a table source. This is because any unused source columns will take up SSIS pipeline bandwidth unnecessarily.
• Using NOLOCK (or any equivalent for non-SQL Server databases) when reading external data sources can improve data reads, provided that reading uncommitted data is acceptable.
• If you are using flat file sources, only select the columns (in the Columns pane) that you need to send through the SSIS pipeline. As is the case with database sources, this narrows the row width, and thus allows for a greater number of rows in the SSIS pipeline.
• For flat files you can also select the FastParse option if the date and time fields are not locale-sensitive.
• If you need to sort source data from a database source, it is frequently better to use ORDER BY in the source database.
• Performing datatype conversions in the source database can improve SSIS throughput.
• Transferring flat files in a compressed format and then uncompressing them on a local server disk (especially a fast one) can be more efficient than reading flat files across a network.








• Inevitably, you are loading data into SQL Server, so an optimized server can make a tremendous difference. It is always worth ensuring that TempDB is configured optimally (as many files as processors, files on a separate disk array, etc.). Of course, separating data files, log files, and indexes is equally fundamental.
• Check the memory allocated to SQL Server on your servers. Numerous times, I have seen an artificially low threshold applied for no reason. Nothing will slow down your queries like lack of memory.
• SQL Server and SSIS love memory, so can you get any more added? This can be the simplest remedy for slow ETL processes.
• Is the default network packet size suited to your environment? This is a complex subject, but increasing the packet size (using the network packet size configuration option) from the default of 4096 bytes to 32767 bytes, or to 16 KB in Secure Socket Layer (SSL) and Transport Layer Security (TLS) environments, can allow for increased throughput. In SSIS this is the Packet Size option in the All pane of an OLEDB connection manager.
• If you can avoid asynchronous transformations (Sort and Aggregate transforms, for instance), then do so. This can mean sorting data in the source database or convincing flat file providers to deliver pre-sorted datasets.
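
The server-side checks in the list above can be sketched in T-SQL. This is a minimal illustration under stated assumptions, not a recommended configuration: the option names are the standard sp_configure names, the value 32767 is used only because it is the upper figure quoted above, and any change should be tested in your own environment before being applied to a production server.

```sql
-- Inspect the current memory and packet-size settings (read-only).
SELECT name, value, value_in_use
FROM   sys.configurations
WHERE  name IN ('min server memory (MB)',
                'max server memory (MB)',
                'network packet size (B)');
GO

-- 'network packet size (B)' is an advanced option, so advanced options
-- must be made visible before it can be changed.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
GO

-- Assumption: 32767 is used here only because it is the figure quoted above.
EXEC sp_configure 'network packet size (B)', 32767;
RECONFIGURE;
GO
```

Running the first query on its own is a safe way to see whether an artificially low memory ceiling has been set before changing anything.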



In any case, ensuring that the basic optimization techniques are respected will never harm your load

process. So with the foundations in place, it could be time to turn to some of the more advanced possibilities.

As ETL optimization can never be attained by applying a single technique, there is, perhaps inevitably,

a certain overlap between the recipes in this chapter. This is specifically the case for the final recipe, which

amalgamates several of the techniques that are seen in many of the other recipes in this chapter. This is in an

attempt to give a final “holistic” overview to end with.



14-1. Accelerating SSIS Lookups

Problem

You want to build SSIS packages that use the Lookup task as efficiently as possible.



Solution

“Warm” the Lookup transform data cache so that a lookup requires little or no disk access.

1. Create a destination table using the following DDL:

    CREATE TABLE CarSales_Staging.dbo.CachedCountry
    (
    ID INT,
    ClientName NVARCHAR(150),
    CountryName_EN NVARCHAR(50)
    );



2. Create a new SSIS package.

3. Add two OLEDB connection managers: the first, named CarSales_OLEDB, connects to the CarSales database; the second, named CarSales_Staging_OLEDB, connects to the CarSales_Staging database.






4. Add a Data Flow task and name it Prepare Cache. Double-click to edit.

5. Add an OLEDB source connection and configure as follows:

    OLEDB Connection Manager: CarSales_OLEDB
    Data Access Mode: SQL Command
    SQL Command Text:
        SELECT CountryID, CountryName_EN
        FROM dbo.Countries WITH (NOLOCK)

6. Confirm with OK.



7. Add a Cache transform and connect the OLEDB source to it. Double-click to edit.

8. Click New to create a new Cache connection manager. Name it ClientDimensionCache, check Use File Cache, and enter the path to where the cache file will be stored (C:\SQL2012DIRecipes\CH14\CachePreLoad.caw in this example). The dialog box should look roughly like Figure 14-1.



Figure 14-1. The Cache connection manager Editor

9. Click the Columns tab and set the Index Position to 1 for the ID column (CountryID in this example). The dialog box should look like Figure 14-2.






Figure 14-2.  Specifying an index column in the Cache connection manager

10. Click OK to confirm the Cache connection manager specification. Return to the OLEDB source editor.

11. Click Columns on the left and ensure that the columns are mapped between the source and destination.

12. Click OK to finish modifications of the OLEDB source editor.

13. Return to the Control Flow pane by clicking Control Flow.

14. Add a new Data Flow task named Data Load, and connect the “Prepare Cache” Data Flow task to it. Double-click to edit.



15. Add an OLEDB source task. Configure it as follows:

    OLEDB Connection Manager: CarSales_OLEDB
    Data Access Mode: SQL Command
    SQL Command Text:
        SELECT ID, ClientName, Country
        FROM Client






16. Add a Lookup task. Connect the OLEDB source task to it. Configure the Lookup task as follows:

Pane | Option | Setting
General | Cache Mode | Full Cache
General | Connection Type | Cache Connection Manager
General | No matching Entries | Fail Component
Connection | Cache Connection Manager | ClientDimensionCache
Columns | | Map the Country and CountryID columns. Select the CountryName_EN column from the Available Lookup Columns.



The Lookup Transformation Editor dialog box will look like Figure 14-3.



Figure 14-3.  Mapping columns in the Cache connection manager






17. Click OK to confirm your changes.

18. Add an OLEDB destination task. Connect it to the Lookup transform and use the Lookup Match Output. Double-click to edit.

19. Configure the OLEDB destination like this:

    OLEDB Connection Manager: CarSales_Staging_OLEDB
    Data Access Mode: Table or view - fast load
    Name of Table or View: CachedCountry

20. Click Mappings on the left and verify that the columns are mapped.

21. Click OK to confirm your changes.



You can now run the package and use the preloaded cache in the Lookup task.
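
Because the Lookup task in step 16 is set to Fail Component when a row has no match, it can help to check the source data before running the package. The following T-SQL is a minimal sketch of such a check and is not part of the original recipe. It assumes that the Client and Countries tables both live in the dbo schema of the CarSales database, and that Client.Country holds the same values and data type as Countries.CountryID.

```sql
USE CarSales;
GO

-- Clients whose Country value has no matching CountryID.
-- Any row returned here would be expected to make the Lookup fail at run time.
SELECT  CL.ID, CL.ClientName, CL.Country
FROM    dbo.Client AS CL
WHERE   NOT EXISTS
        (
        SELECT 1
        FROM   dbo.Countries AS CO
        WHERE  CO.CountryID = CL.Country
        );
GO

-- Set-based equivalent of the columns that the Lookup adds to each row,
-- useful for comparing against the contents of CachedCountry after the load.
SELECT  CL.ID, CL.ClientName, CO.CountryName_EN
FROM    dbo.Client AS CL
        INNER JOIN dbo.Countries AS CO
        ON CO.CountryID = CL.Country;
GO
```

If the first query returns no rows, the row count of the second query should match the number of rows loaded into CachedCountry.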



How It Works

The Lookup transform has been part of SSIS since the product was released, and complaints about its

performance began at about that time. An efficient solution to the speed-hit a Lookup transform can cause

arrived with SQL Server 2008 and the Lookup transform data cache. Essentially, it is a way of pre-loading and/or

reusing the cache data in an SSIS project. So, assuming that your ETL job allows for it, this recipe shows how to

preload the Lookup cache.

The advantage of a file cache is that it can be used several times in a process, and from multiple packages. If you are warming a cache for use in a single package (the same one in which the cache is prepared), then you are probably better off only preloading the cache into memory.

A few SSIS Lookup Cache options need further explanation, as given in Table 14-1.

Table 14-1. SSIS Cache Options

Option | Definition | Comments
Full cache | The results of the query (or table) are loaded entirely. The query to fill the cache is executed nearly at the start of the SSIS package execution. | This is extremely resource-intensive because sufficient memory must be available to load the required data. Also, a full load can take some considerable time if a large table is queried, as well as adding strain to the I/O subsystem.
Partial cache | Matching rows are loaded, and the least recently used rows are removed from the cache. | The query is only executed when the lookup is requested.
No cache | No records are loaded into the cache. | Requires no memory, but is usually the slowest lookup option. Can be I/O intensive because each lookup is a separate query.
Enable cache for rows with no matching entries | Stores rows that do not match to avoid costly further lookups. | Ensures that pointless I/O does not occur. If a record is not in the reference dataset, then the SSIS cache will “remember” this and not look for it a second time.
Cache size | You can specify the cache size to allocate. | Different for 32-bit and 64-bit environments. It can take some practice to get this optimal.






Note: For the most efficient cache warming, you should have an index on the columns used in the lookup (CountryID and CountryName_EN in this example). This could be a covering index using INCLUDE (CountryName_EN).
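
As an illustration of the note above, a covering index for the cache query might be created as follows. This is a sketch only: the index name is hypothetical, and it assumes that the Countries table is in the dbo schema of the CarSales database. If CountryID is already the clustered key of the table, an extra index of this kind may add little.

```sql
USE CarSales;
GO

-- Hypothetical index name. CountryID is the lookup key, and
-- CountryName_EN is stored at the leaf level so that the cache query
-- (SELECT CountryID, CountryName_EN FROM dbo.Countries) can be
-- answered from this index alone.
CREATE NONCLUSTERED INDEX IX_Countries_CountryID_Covering
ON dbo.Countries (CountryID)
INCLUDE (CountryName_EN);
GO
```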



Hints, Tips, and Traps





• Preloading a data cache can nonetheless take a long time, and is not a magic bullet that will solve all speed issues when using the Lookup transform. However, it can allow for useful parallel processing if you are able to preload the cache early in a process ready for later use. Equally, if the cache is persisted to disk it can be reused by other packages.
• Columns must have matching data types for them to be mapped correctly.
• The Lookup cache does not use disk spooling in the case of memory overflow. There must be enough memory for a full cache load or the memory that you specify for a partial cache.
• If storing cache data on disk, the faster the disk array, the better the performance will be.
• You can only preload the cache data if it will not change during the ETL process.



14-2. Disabling and Rebuilding Nonclustered Indexes in a Destination Table

Problem

You want to disable indexes on a table (or tables) to speed up data loading.



Solution

Store a list of all the current nonclustered indexes on the destination table. Then disable all the nonclustered

indexes before performing the data load. Finally, rebuild all the indexes in the table. The following steps explain

how to do this in an SSIS package.
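
As a rough orientation before the detailed steps, the disable, load, and rebuild pattern for a single index might look like the following in plain T-SQL. This is a minimal sketch, not the recipe's own code: the index name IX_Clients_ClientName is hypothetical, and the Clients table is assumed to be in the dbo schema of CarSales_Staging.

```sql
USE CarSales_Staging;
GO

-- 1. Disable the nonclustered index (hypothetical name). Its definition
--    is kept in the catalog, but it is not maintained during the load.
ALTER INDEX IX_Clients_ClientName ON dbo.Clients DISABLE;
GO

-- 2. Perform the data load here (for example, the SSIS data flow).

-- 3. Rebuild the index once the load has finished. Rebuilding a
--    disabled index also re-enables it.
ALTER INDEX IX_Clients_ClientName ON dbo.Clients REBUILD;
GO
```

The steps that follow generalize this pattern by storing the index list in a table, so that it does not have to be hard-coded for each destination table.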

1. Create the following table to hold the persisted list of indexes to disable (C:\SQL2012DIRecipes\CH14\tblIndexList.Sql):

    CREATE TABLE CarSales_Staging.dbo.IndexList
    (
    TableName VARCHAR(128)
    ,SchemaName VARCHAR(128)
    ,IndexScript VARCHAR(4000)
    );
    GO



2. Run the following script to create a stored procedure that collates and stores the list of indexes to disable (C:\SQL2012DIRecipes\CH14\pr_IndexesToDisable.Sql):

    USE CarSales_Staging;
    GO

    CREATE PROCEDURE dbo.pr_IndexesToDisable
    AS

    -- Start from an empty list each time the procedure runs
    TRUNCATE TABLE dbo.IndexList;

    -- The SELECT column order matches the INSERT column list:
    -- table name, schema name, then the generated DISABLE statement
    INSERT INTO dbo.IndexList (TableName, SchemaName, IndexScript)
    SELECT
     SOB.name
    ,SSC.name
    ,'ALTER INDEX ' + SIX.name + ' ON ' + SSC.name + '.' + SOB.name + ' DISABLE'
    FROM
    sys.indexes SIX
    INNER JOIN sys.objects SOB
    ON SIX.object_id = SOB.object_id
    INNER JOIN sys.schemas AS SSC
    ON SOB.schema_id = SSC.schema_id
    WHERE
    SOB.is_ms_shipped = 0
    AND SIX.type_desc = 'NONCLUSTERED'
    AND SOB.name = 'Clients' -- Enter the table name here
    ORDER BY
    SIX.type_desc, SOB.name, SIX.name;
    GO

3. Execute the following code to create a stored procedure that will disable the indexes in the selected database (C:\SQL2012DIRecipes\CH14\pr_DisableIndexes.Sql):

    USE CarSales_Staging;
    GO

    CREATE PROCEDURE dbo.pr_DisableIndexes
    AS

    DECLARE @TableName NVARCHAR(128), @SchemaName NVARCHAR(128), @DisableIndex NVARCHAR(4000)

    DECLARE DisableIndex_CUR CURSOR
    FOR
    SELECT DISTINCT TableName, SchemaName FROM dbo.IndexList

    OPEN DisableIndex_CUR

    FETCH NEXT FROM DisableIndex_CUR INTO @TableName, @SchemaName

    WHILE @@FETCH_STATUS <> -1
    BEGIN

    -- Spaces are needed around the object name to produce a valid statement.
    -- Note that ALL covers every index on the table, including a clustered index if one exists.
    SET @DisableIndex = 'ALTER INDEX ALL ON ' + @SchemaName + '.' + @TableName + ' DISABLE'




