9 EXAMPLE: THE APPLICATION OF DOE IN MULTIVARIATE CALIBRATION
DK4712_C008.fm Page 322 Saturday, March 4, 2006 1:59 PM
322
Practical Guide to Chemometrics
to encounter applications that have many different components. These can include
active components, inactive additives, dyes, etc., and the concentration levels of
these components can vary widely. In such cases, the method of experimental design
can be used to determine the optimum set of standards required to prepare calibration
models. One of the difficulties in developing a new calibration model is that the
mixtures must be within an appropriate range for the instrumental measurement
method. Using mixtures that are too concentrated gives rise to a nonlinear response
(e.g., in UV spectroscopy, for peaks that no longer obey Beer’s Law) and, as such,
the samples must be diluted. Conversely, using mixtures that are too dilute will cause
a loss in the signal-to-noise ratio, and hence introduce unnecessary noise into the
model. Given these constraints, it is possible to use the method of experimental
design with some reference spectra to build up a simulated calibration set with
proposed standards that are in the correct range for the chosen analytical measurement method, without the need to perform any preliminary experiments. These
simulated calibration data can then be used to build calibration models (e.g., with PLS)
and to test whether the calibration set contains sufficient variability to be useful for modeling
purposes. The net result of this approach is that the user can very quickly develop
a calibration model and test it prior to performing any actual experiments, which
thus maximizes productivity and reduces waste. A further advantage of the approach
described here is that one can perform some screening to discover inactive or
nonabsorbing components in the mixtures. The combination of screening, DOE, and
simulation of calibration mixture spectra proposed here can significantly reduce the
resources required for performing the actual measurements.
To illustrate the use of experimental designs in an analytical chemistry application, we will examine a problem taken from the agrochemical industry. The problem
under investigation was to develop a robust calibration model for several commercial
products based on UV spectral measurements. By the term “robust model,” we
assume that the model will be able to give acceptable predictions even if there is
some moderate variation in the controlled and uncontrolled variables.
The successful construction of any calibration model depends to a great extent on
the set of calibration points. Considering the components of the products as independent factors, we can construct the respective factor space in the terms discussed earlier
in this chapter (see Section 8.2.2). The calibration set will consist of points distributed
within this factor space, and the best distribution of these points will be achieved by
employing the experimental-design approach. Provided that the number of significant
factors and that the type of the required regression model are known, we should be
able to construct a successful experimental design. To implement the calibration design
and perform the necessary measurements, we also need to know the boundaries of the
factor space. The following discussion is directed toward these points.
8.9.1.1 Identifying the Number of Significant Factors
In this example, 12 products, P1 to P12, were to be considered, each consisting of
one to nine components, coded here as C1 to C9. The list of products, their ingredients (components), and the amount of each are listed in Table 8.13. For reasons
of commercial confidentiality, we omitted the actual names of the products and
© 2006 by Taylor & Francis Group, LLC
Response-Surface Modeling and Experimental Design
TABLE 8.13
List of Products under Consideration and the Respective Quantities of the Components Included in Each Product
P1
C6
C9
C7
C1
C8
C2
C3
C5
P2
100
50
2
24
80
0.4
2
10
C6
C7
C1
C8
C2
C3
C5
—
P5
C6
C9
C7
C1
C8
C2
C3
C5
C6
C7
C1
C8
C2
C3
C5
—
P6
120
80
0.5
35
117
0.25
2.5
10
C6
—
—
—
—
—
—
—
P9
C6
C9
C7
C4
200
0.5
35
116.7
0.25
2.5
5
—
P3
200
—
—
—
—
—
—
—
C9
C2
C4
—
200
0.6
12.5
—
C6
C7
C1
C8
C2
C3
C5
—
P7
C6
C9
C7
C1
C2
C3
C4
—
P10
8.77
8.77
0.07
4.6
200
0.5
35
117
0.25
2.5
1
—
P4
120
80
0.5
10
0.5
2.5
40
—
P8
C6
C7
C1
C2
C3
C5
C4
—
P11
C9
C2
C4
—
200
0.6
150
—
200
0.5
35
116.7
0.25
5
10
—
100
0.5
10
0.5
1
10
40
—
P12
C6
C9
C7
C4
7.93
7.93
0.128
4.2
ingredients and slightly changed the amount of the ingredients in each product.
These changes will not affect the generality of the approach described here.
Figure 8.25 shows the pure-component UV spectra of all nine components. The
respective concentrations of the solutions used to measure the pure-component
spectra are shown in Table 8.14. It is reasonable to assume that components having
large absorption values in the wavelength range of interest will have considerable
influence on the calibration model. Conversely, inactive components that have very
weak absorption values in the wavelength range of interest will have a weak influence
on the model while introducing some additional noise.
The level of UV absorption was used as a screening tool to divide the components
into two sets, active (Figure 8.26) and inactive (Figure 8.27), with the active components having strong UV absorption signals in the range from 250 to 360 nm and
inactive components having weak or insignificant UV absorption signals in this range.
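The screening step can be sketched as a simple threshold on the peak absorbance inside the wavelength window of interest. The following is a minimal illustration only: the function name, the threshold value of 0.1, and the toy spectra are assumptions made for the example and are not taken from the study.

```python
def split_active_inactive(spectra, wavelengths, lo=250.0, hi=360.0, threshold=0.1):
    """Split components into UV-active and UV-inactive sets.

    spectra: dict mapping component name -> list of absorbances
    wavelengths: list of wavelengths (nm), same length as each spectrum
    threshold: hypothetical cutoff on the peak absorbance inside [lo, hi]
    """
    active, inactive = [], []
    for name, absorbances in spectra.items():
        # Keep only the absorbances measured inside the window of interest.
        in_window = [a for w, a in zip(wavelengths, absorbances) if lo <= w <= hi]
        peak = max(in_window)
        (active if peak >= threshold else inactive).append(name)
    return active, inactive


# Toy data (made up): one strongly and one weakly absorbing component.
wavelengths = [200, 250, 300, 350]
spectra = {"strong": [0.9, 0.6, 0.4, 0.2], "weak": [0.5, 0.02, 0.01, 0.0]}
active, inactive = split_active_inactive(spectra, wavelengths)
print(active, inactive)
```

In practice the cutoff would be chosen by inspecting the pure-component spectra, as was done for Figure 8.26 and Figure 8.27.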
Removing the UV-inactive components reduced the number of the components
to be considered from nine to five. The products and their corresponding UV-active
FIGURE 8.25 UV spectra of all nine pure components, C1 to C9 (absorbance versus wavelength, 200 to 350 nm).
components are listed in Table 8.15. The number of components present in each
formulation is also shown.
Examining Table 8.15, we observe that products P6, P10, and P11 have only
one UV-active component; P9 and P12 have three UV-active components; P3, P4,
P5, P7, and P8 have four UV-active components; and P1 and P2 have five UV-active
components. Thus four types of experimental designs are needed for 1, 3, 4, and 5
independent (process) variables.
TABLE 8.14
Concentration of Components C1 to C9 Used to Measure Pure-Component Spectra

Component    Concentration, ppm
C1           10
C2           10
C3           10
C4           10
C5           100
C6           100
C7           10
C8           5
C9           1000
FIGURE 8.26 Spectra of the UV-active components, C1, C5, C6, C7, and C9 (absorbance versus wavelength, 200 to 350 nm).
8.9.1.2 Identifying the Type of the Regression Model
As the goal is to build a calibration model based on spectral data, we assume that
Beer-Lambert’s law is valid,
$$A_w = \sum_{i=1}^{p} \varepsilon_i c_i l, \qquad w = 1, \ldots \tag{8.79}$$
FIGURE 8.27 Spectra of the UV-inactive components, C2, C3, C4, and C8 (absorbance versus wavelength, 200 to 350 nm).
TABLE 8.15
The “Reduced” Product Formulations after the Removal of the UV-Inactive Components Shown in Figure 8.27
P1
P2
P3
P4
1
2
3
4
5
C1
C5
C6
C7
C9
P5
1
2
3
4
5
C1
C2
C5
C6
C7
P6
1
2
3
4
C1
C5
C6
C7
—
P7
1
2
3
4
—
C1
C5
C6
C7
—
P8
1
2
3
4
C1
C5
C6
C7
P9
1
C9
—
—
—
P10
1
2
3
4
C1
C6
C7
C9
P11
1
2
3
4
C1
C5
C6
C7
P12
1
2
3
C6
C7
C9
1
C9
1
C9
—
—
—
—
1
2
3
C6
C7
C9
where l is the sample cell path length, εi is the extinction coefficient of the ith component at the wth wavelength, and ci is the
concentration of the ith component. Noting the theoretical
linear nature of the response described in Equation 8.79, we assume that a linear
polynomial model for p independent variables will be adequate; thus, the model
selected for our experimental designs has the following structure
$$\hat{y}_j = b_0 + \sum_{i=1}^{p_j} b_i x_i, \qquad j = 1, \ldots, 12 \tag{8.80}$$
where pj, j = 1, …, 12, is the number of UV-active components in the jth product, e.g., p1 = 5, p3 = 4, etc. As
previously noted, the products fall into one of four categories depending on the number
of UV-active components. Thus, four different types of models are needed, one each
for one, three, four, and five process variables. Respectively, the number of regression
coefficients will be k1 = 2, k2 = 4, k3 = 5, and k4 = 6. The minimum number of points
in the designs for each of these models will be determined by the corresponding number
of regression coefficients.
Having the number and the type of the variables (components) and the type of
the regression model required, we can begin the task of constructing the appropriate
experimental designs. In this project it was decided to use exact D-optimal designs
having Ni = ki + 5 points. The number of the points selected provides sufficient
degrees of freedom to calculate the regression coefficients. The resulting D-optimal
designs are shown in Table 8.16.
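To make the link between the model and the design concrete, the sketch below builds the extended design matrix for the first-order model of Equation 8.80 from the three-factor design ξ2(3,9) of Table 8.16 and evaluates the determinant of the information matrix, the quantity that a D-optimal design maximizes. It uses NumPy, and the variable names are illustrative; it evaluates a given design and is not the algorithm that was used to generate the designs.

```python
import numpy as np

# Design xi_2(3,9) from Table 8.16: three coded factors, nine runs.
design = np.array([
    [ 1, -1, -1],
    [-1, -1,  1],
    [-1, -1, -1],
    [-1,  1, -1],
    [-1,  1,  1],
    [ 1,  1, -1],
    [ 1,  1,  1],
    [ 1, -1,  1],
    [ 1, -1,  1],
], dtype=float)

# Extended design matrix F for y = b0 + sum(b_i * x_i):
# a column of ones for b0 followed by the coded factor columns.
F = np.column_stack([np.ones(len(design)), design])

k = F.shape[1]                      # number of regression coefficients, p + 1
info = F.T @ F                      # information matrix F^T F
d_criterion = np.linalg.det(info)   # D-optimal designs maximize this determinant
dof = F.shape[0] - k                # residual degrees of freedom, N - k

print(k, dof, round(d_criterion))
```

With N = k + 5 runs, five degrees of freedom remain after the k coefficients have been estimated.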
TABLE 8.16
Catalog of Four Exact D-Optimal Experimental Designs for the Spectroscopic Calibration Problem

ξ1(1,8)    ξ2(3,9)            ξ3(4,10)                ξ4(5,11)
x1c        x1c  x2c  x3c      x1c  x2c  x3c  x4c      x1c  x2c  x3c  x4c  x5c
−1          1   −1   −1       −1   −1   −1   −1        1    1   −1   −1   −1
−1         −1   −1    1       −1    1   −1    1        1   −1    1   −1    1
−1         −1   −1   −1       −1   −1   −1   −1        1    1    1    1   −1
−1         −1    1   −1       −1    1    1    1       −1    1    1    1    1
 1         −1    1    1       −1   −1    1    1        1   −1   −1   −1    1
 1          1    1   −1        1    1    1   −1       −1   −1   −1    1   −1
 1          1    1    1        1   −1   −1    1       −1    1    1   −1    1
 1          1   −1    1        1    1    1   −1        1   −1    1    1   −1
 —          1   −1    1        1    1   −1    1        1    1   −1    1    1
 —          —                  1   −1    1    1       −1    1   −1   −1   −1
 —          —                  —                      −1   −1    1   −1   −1

Note: The numbers in parentheses at the top of the table represent,
respectively, the number of variables and the number of measurements.
8.9.1.3 Defining the Bounds of the Factor Space
The coded values for the two levels of the controlled factors in these designs are +1 and
–1, which represent the upper and lower boundaries for each variable. To implement the
designs, we transform these two levels into the real values. By finding the lower and
upper boundaries of the variables for each product, the four generic designs (in coded
values) will be transformed to 12 calibration sets (in real values), one for each product.
To define the boundaries, we assume that the models should be valid over a
working range of up to ±10% of each component’s target value in the formulated
products. Considering each of the product formulations individually, we calculate
the bounds using Equation 8.81,
$$x_{i\min} = 0.90\,p_i^*, \qquad x_{i\max} = 1.10\,p_i^* \tag{8.81}$$
where $p_i^*$ designates the target value of the ith component, and $x_{i\min}$ and $x_{i\max}$ are
the lower and upper bounds, respectively. For example, if the target value of the
ith factor is $p_i^* = 200$, the respective boundaries will be $x_{i\min} = 0.9 \times 200 = 180$
and $x_{i\max} = 1.1 \times 200 = 220$. The general formula for the transformation from
natural (real) variables to coded variables, and vice versa, is
$$x_{ic} = \frac{x_i - p_i^*}{x_{i\max} - p_i^*}$$
where $x_{ic}$ is the coded value and $x_i$ is the real value of the ith factor.
TABLE 8.17
List of Components Included in Product P1, with UV-Inactive Components Shaded

Component    Quantity
C6           100
C9           50
C7           2
C1           24
C8           80
C2           0.4
C3           2
C5           10

Note: The UV-inactive components in this list are C8, C2, and C3.
For the example given here, the formula becomes:
$$-1 = \frac{x_i - 200}{220 - 200} = \frac{180 - 200}{220 - 200} = \frac{-20}{20}$$
The reverse transformation is obvious.
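Both directions of the transformation can be written as a pair of small functions. This is a minimal sketch assuming symmetric bounds at a fixed fractional tolerance around the target value, as in Equation 8.81; the function and parameter names are illustrative.

```python
def bounds(target, tolerance=0.10):
    """Lower and upper bounds at +/- tolerance around the target (Equation 8.81)."""
    return (1.0 - tolerance) * target, (1.0 + tolerance) * target


def to_coded(x_real, target, tolerance=0.10):
    """Map a real value onto the coded scale, where the bounds become -1 and +1."""
    _, x_max = bounds(target, tolerance)
    return (x_real - target) / (x_max - target)


def to_real(x_coded, target, tolerance=0.10):
    """Reverse transformation: coded value back to real units."""
    _, x_max = bounds(target, tolerance)
    return target + x_coded * (x_max - target)


print(to_coded(180.0, 200.0))  # lower bound of the worked example
print(to_real(1.0, 200.0))     # upper bound of the worked example
```

Because the bounds are symmetric about the target, the target value itself always maps to a coded value of zero.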
The process of translating the coded values to real values is illustrated in detail
for the construction of the calibration set for product P1. The target values for each
component in product P1 are shown in Table 8.17, with UV-inactive components
shown as shaded rows.
By taking the entries of Table 8.17 as the target values, we calculate the respective
upper and lower bounds for each component. The results are shown in Table 8.18.
Now, using the correspondence between the real and coded upper and lower
bounds shown in Table 8.18, we can choose the appropriate design from Table 8.16
and replace the coded entries with the real ones. The set of the calibration points in
coded and real values, constructed using design ξ4 (5,11) (5 variables, 11 measurements), is shown in Table 8.19.
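The translation of a whole design can be sketched in a few lines. The coded levels below are those of design ξ4(5,11), and the target values are the quantities of the UV-active components of product P1; the variable names and the assignment of design columns to components in the order C6, C9, C7, C1, C5 are assumptions made for the illustration.

```python
# Coded design xi_4(5,11): rows are runs, columns are x1c..x5c.
coded_design = [
    [ 1,  1, -1, -1, -1],
    [ 1, -1,  1, -1,  1],
    [ 1,  1,  1,  1, -1],
    [-1,  1,  1,  1,  1],
    [ 1, -1, -1, -1,  1],
    [-1, -1, -1,  1, -1],
    [-1,  1,  1, -1,  1],
    [ 1, -1,  1,  1, -1],
    [ 1,  1, -1,  1,  1],
    [-1,  1, -1, -1, -1],
    [-1, -1,  1, -1, -1],
]

# Target quantities of the UV-active components of P1, in design-column order.
components = ["C6", "C9", "C7", "C1", "C5"]
targets = {"C6": 100.0, "C9": 50.0, "C7": 2.0, "C1": 24.0, "C5": 10.0}
tolerance = 0.10  # working range of +/-10% around each target

# A coded level of +1 maps to 1.10 * target and -1 maps to 0.90 * target.
real_design = [
    [round(targets[c] * (1.0 + tolerance * level), 6)
     for c, level in zip(components, run)]
    for run in coded_design
]

for run in real_design:
    print(run)
```

Each printed row is one calibration mixture, expressed in the same units as the target quantities.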
TABLE 8.18
Translation of Coded Factor Levels to Real Experimental Levels for Product P1

             Lower Bound        Upper Bound
Component    Real    Coded      Real    Coded
C6           90      −1         110     1
C9           45      −1         55      1
C7           1.8     −1         2.2     1
C1           21.6    −1         26.4    1
C5           9       −1         11      1
TABLE 8.19
Translated Experimental Design for Product P1

Coded values                  Real values
x1c  x2c  x3c  x4c  x5c       C6     C9    C7     C1      C5
 1    1   −1   −1   −1        110    55    1.8    21.6    9
 1   −1    1   −1    1        110    45    2.2    21.6    11
 1    1    1    1   −1        110    55    2.2    26.4    9
−1    1    1    1    1        90     55    2.2    26.4    11
 1   −1   −1   −1    1        110    45    1.8    21.6    11
−1   −1   −1    1   −1        90     45    1.8    26.4    9
−1    1    1   −1    1        90     55    2.2    21.6    11
 1   −1    1    1   −1        110    45    2.2    26.4    9
 1    1   −1    1    1        110    55    1.8    26.4    11
−1    1   −1   −1   −1        90     55    1.8    21.6    9
−1   −1    1   −1   −1        90     45    2.2    21.6    9
8.9.1.4 Estimating Extinction Coefficients
Using the formulations of the calibration set listed in Table 8.19 and the pure-component spectra measured earlier, we can generate a set of simulated calibration
spectra without performing any experimental work and investigate some important
properties of the calibration set. The first step is to estimate the matrix of extinction
coefficients, E, using the pure-component spectra. Assuming the path length, l = 1,
the ith pure-component spectrum can be represented by
$$\mathbf{A}_i = \boldsymbol{\varepsilon}_i c_i^*, \qquad i = 1, \ldots, m_t \tag{8.82}$$
where Ai is the vector of measured absorbances for the ith component at concentration ci*, and εi is the respective vector of extinction coefficients. Solving for the
vector of extinction coefficients, εi, gives Equation 8.83.
$$\boldsymbol{\varepsilon}_i = \frac{\mathbf{A}_i}{c_i^*}, \qquad i = 1, \ldots, m_t \tag{8.83}$$
The matrix of extinction coefficients for the components of product P1 can be
assembled by arranging the vectors of extinction coefficients into the rows of
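$\mathbf{E}_{P1} = [\boldsymbol{\varepsilon}_1, \boldsymbol{\varepsilon}_5, \boldsymbol{\varepsilon}_6, \boldsymbol{\varepsilon}_7, \boldsymbol{\varepsilon}_9]$. The matrix of concentrations, C, for the subset of active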
species in product P1 is given in the right-hand side of Table 8.19, or in matrix form,
C = [C1, C5, C6, C7, C9]. According to the Beer-Lambert law in Equation 8.79, the
product of these two matrices gives the matrix of simulated mixture spectra, A, for
the calibration set, where the path length, l, is assumed equal to 1.
$$\mathbf{A} = \mathbf{C}\mathbf{E} \tag{8.84}$$
Figure 8.28 shows the predicted spectra of the calibration mixtures listed in Table 8.19 for product P1.
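The calculation in Equation 8.82 to Equation 8.84 amounts to a row-wise division followed by a matrix product. The sketch below uses NumPy with made-up Gaussian bands standing in for measured pure-component spectra; the band positions, the three-component mixture, and all variable names are assumptions for illustration and do not reproduce the data of this example.

```python
import numpy as np

wavelengths = np.linspace(200.0, 350.0, 151)  # nm


def gaussian_band(center, width, height):
    """Hypothetical absorption band used as a stand-in for a measured spectrum."""
    return height * np.exp(-0.5 * ((wavelengths - center) / width) ** 2)


# Stand-in pure-component spectra A_i (made up) and the concentrations c_i*
# at which they are assumed to have been measured (ppm).
pure_spectra = np.vstack([
    gaussian_band(220.0, 10.0, 0.8),
    gaussian_band(260.0, 15.0, 0.6),
    gaussian_band(300.0, 12.0, 0.4),
])
c_star = np.array([10.0, 100.0, 10.0])

# Equation 8.83: extinction coefficients, one row per component (path length 1).
E = pure_spectra / c_star[:, np.newaxis]

# Concentration matrix C: one row per calibration mixture, one column per
# component, with the columns in the same order as the rows of E.
C = np.array([
    [26.4, 110.0, 2.2],
    [21.6,  90.0, 1.8],
])

# Equation 8.84: simulated mixture spectra, one row per mixture.
A = C @ E
print(A.shape, A.max())
```

The only bookkeeping requirement is that the columns of C and the rows of E refer to the components in the same order.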
FIGURE 8.28 Predicted calibration spectra for product P1 (absorbance, 0 to 10, versus wavelength, 200 to 350 nm).
For calibration work in the UV range, spectra should have a maximum absorbance less than 1 for Beer’s law to be obeyed and to obtain good linear response.
Clearly, since this condition does not hold for the simulated calibration spectra shown
in Figure 8.28, a simple dilution of the calibration samples should be performed
before measuring their UV spectra. This is applicable to any sample type, as the
dilution only affects the analysis method and not the final result.
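Under Beer's law a d-fold dilution divides every absorbance by d, so the required dilution can be estimated directly from the simulated spectra. The following is a minimal sketch; the function name, the absorbance limit of 1, and the example peak value are assumptions for illustration.

```python
def dilution_factor(max_absorbance, limit=1.0):
    """Smallest dilution factor that brings the peak absorbance down to the limit.

    Assumes absorbance scales linearly with concentration (Beer's law), so a
    d-fold dilution divides every absorbance by d. Samples already below the
    limit are left undiluted (factor 1).
    """
    return max(1.0, max_absorbance / limit)


peak = 8.0                 # hypothetical peak absorbance of a simulated mixture
d = dilution_factor(peak)
print(d, peak / d)
```

In practice a somewhat larger, convenient factor would be chosen so that the peak absorbance falls safely below the limit.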
8.9.2 IMPROVING QUALITY FROM HISTORICAL DATA
As was mentioned previously, in process analytical applications we are usually
limited in how we do experiments and collect data. Sometimes we are not able to
adjust the controlled factors of a process according to the principles of experimental design because it would cause production of product that fails to meet
quality standards. In such cases, the only option is to measure the process and
deal with the data as received. Experiments performed in this manner are called
passive experiments. The values of the measured variables change according to
normal variation in the production process. This can cause correlation in the
measurements, which in turn can affect the numerical stability of fitting regression
models. In cases where it is desirable to achieve on-line or at-line control with a
regression model derived from measurements of the process, a procedure is needed
to avoid making unnecessary measurements and improve the accuracy of the
resulting models.
As a practical example, we consider data provided by BP Amoco from their
naphtha processing plant in Saltend, Hull, U.K. Briefly, naphtha is a mixture of
hydrocarbons and aromatics. The most important components in the feedstock are
naphthalenes and aromatics. Periodically, samples are collected. The near-infrared
(NIR) spectra of these samples are measured, and the amount of naphthalenes and
aromatics is measured by gas chromatography (GC). A calibration model is
constructed using PCR or PLS (see Chapter 6), and the predicted values of naphthalene and aromatic content are used to control the process. Here the goal is to
replace costly, time-consuming off-line GC measurements with rapid, on-line NIR
measurements. It is possible to collect hundreds or even thousands of NIR spectra
at relatively low cost, while it would be cost prohibitive to perform GC analysis on
each one. By analysis of the design matrix, X, which can be cheaply and quickly
measured, we can select a small subset of samples for GC analysis that will give an
optimal design, thus minimizing the time and expense of performing GC analysis
while maximizing the information we gather as well as the performance of the
regression model that we will build from these measurements.
Once the initial model is developed and placed on-line, a large historical database
of measurements and predictions can be accumulated. If the process or the measurement instrument drifts over time, the usual practice is to recalibrate the NIR model
periodically by collecting new plant samples and performing NIR and GC measurements. To avoid performing costly GC analysis on a large set of samples during
normal process operation, some method of using the inexpensive NIR data is needed
to select the most informative samples for off-line GC analysis. The resulting historical
data can be used in this way to augment the original experimental design with
maximum information and minimum effort. As a side effect, better performance of
the calibration model could be expected.
Following commonly accepted terminology, X represents an N × m data matrix
of NIR spectra with N rows (samples) measured at m variables. The predicted value
ŷi, i = 1, …, N, of the response (naphthalene content or aromatic content) yi, i = 1, …, N,
can be estimated using some appropriate form of a regression model,
$$\hat{y}_i = \sum_{j=1}^{k} b_j f_j(\mathbf{x}), \qquad i = 1, \ldots, N \tag{8.85}$$
By applying regression analysis, a k × 1 vector of the regression coefficients, b,
is calculated using the formula in Equation 8.86.
$$\mathbf{b} = (\mathbf{F}^{T}\mathbf{F})^{-1}\mathbf{F}^{T}\mathbf{y} \tag{8.86}$$
Using the notation of experimental design, F represents the extended design matrix,
where the elements of its k × 1 row-vectors, f, are known functions of x. The matrix
$\mathbf{F}^{T}\mathbf{F}$ is the Fisher information matrix and its inverse, $(\mathbf{F}^{T}\mathbf{F})^{-1}$, is the dispersion
matrix of the regression coefficients.
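Equation 8.86 can be evaluated in a few lines with NumPy. The sketch below uses synthetic data and a first-order model, both of which are assumptions for illustration; it also compares the normal-equations result with a general least-squares solver, which is usually preferred numerically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example (made up): N = 20 samples, two x-variables.
N = 20
x = rng.uniform(-1.0, 1.0, size=(N, 2))
b_true = np.array([1.0, 2.0, -0.5])

# Extended design matrix F: each row holds f(x) = [1, x1, x2].
F = np.column_stack([np.ones(N), x])
y = F @ b_true + rng.normal(scale=0.01, size=N)

# Equation 8.86 via the normal equations; solve() avoids forming the inverse.
info = F.T @ F                      # Fisher information matrix
b = np.linalg.solve(info, F.T @ y)

# Dispersion matrix (inverse of the information matrix).
dispersion = np.linalg.inv(info)

# A least-squares solver gives the same coefficients.
b_lstsq, *_ = np.linalg.lstsq(F, y, rcond=None)
print(b, b_lstsq)
```

The diagonal of the dispersion matrix, scaled by the noise variance, gives the variances of the estimated coefficients.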
As previously noted, in a typical process analytical application, the measured
data set might consist of spectral data recorded at a number of wavelengths much
higher than the number of samples. The rank, R, of the measured matrix of spectra
will be equal to or smaller than the number of the samples N. This causes rank
deficiency in X, and the direct calculation of a regression or calibration model by
use of the matrix inverse using Equation 8.85 and Equation 8.86 is problematic.
This problem can be solved using the multivariate calibration approach of
principal component regression (PCR) or partial least squares (PLS), described in
Chapter 6. In PCR, the matrix of spectra is decomposed into a matrix of principal
component scores, S, consisting of the vectors [s1, s2, …, sR], and loadings, P,
consisting of the eigenvectors [p1, p2,…, pR] of X [32]. During the process of
principal component analysis, we retain an appropriate number of principal components (latent variables), i.e., those that describe statistically significant variation of
the data. By deleting eigenvectors and scores associated with undesirable noise, a
new matrix, X′, is calculated
$$\mathbf{X}' = \mathbf{s}_1\mathbf{p}_1^{T} + \mathbf{s}_2\mathbf{p}_2^{T} + \cdots + \mathbf{s}_{pc}\mathbf{p}_{pc}^{T}, \qquad pc \le R \tag{8.87}$$
so that the rank deficiency problem is resolved.
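The truncation in Equation 8.87 can be sketched with a singular value decomposition, from which scores and loadings follow directly. The data below are synthetic, the number of retained components is assumed to be known, and mean-centering is omitted for brevity; all of these are simplifying assumptions for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "spectra" (made up): 10 samples, 50 wavelengths, 3 underlying
# components plus a small amount of noise, so N < m as in the text.
scores_true = rng.uniform(0.0, 1.0, size=(10, 3))
loadings_true = rng.uniform(0.0, 1.0, size=(3, 50))
X = scores_true @ loadings_true + rng.normal(scale=1e-3, size=(10, 50))

# Singular value decomposition: X = U diag(sv) Vt.
U, sv, Vt = np.linalg.svd(X, full_matrices=False)

pc = 3                              # number of components retained (assumed known)
S = U[:, :pc] * sv[:pc]             # scores, one column per component
P = Vt[:pc].T                       # loadings, one column per component

# Equation 8.87: sum of the first pc outer products s_i p_i^T.
X_prime = S @ P.T
print(np.linalg.matrix_rank(X_prime), np.abs(X - X_prime).max())
```

The reconstructed matrix has rank pc, and the discarded part is of the order of the noise.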
At the core of this approach is the improvement of the condition number of the
data matrix, X. The condition number of a matrix, cond(X), is the ratio of the largest
to the smallest eigenvalue of X. It takes on values from 1 to +infinity, and can be used
as a measure of the numerical stability with which the inverse of X can be computed.
Values in the range from 1 to 1000 usually indicate that the matrix inverse calculation
will be very stable. In the limit, as the smallest eigenvalue of X goes to zero, cond(X)
tends toward infinity, indicating that matrix X is singular, i.e., it has a determinant
equal to zero, in which case the corresponding regression problem is rank deficient
and the inverse of X does not exist. When the condition number of X is extremely
large, the matrix X is close to being singular, which means computation of its inverse
will be numerically unstable. In PCA and PCR, the rank-deficiency problem is solved
by transforming the original variable space into PCA space and deleting principal
components corresponding to the smallest, closest to zero, eigenvalues. The result
is that the condition number of the new matrix X′ is better (lower) than the condition
number of the original matrix, X.
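$$\operatorname{cond}(\mathbf{X}) = \frac{\lambda_1}{\lambda_R} > \operatorname{cond}(\mathbf{X}') = \frac{\lambda_1}{\lambda_{pc}}; \qquad \lambda_R < \lambda_{pc} \tag{8.88}$$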
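The effect described by Equation 8.88 is easy to check numerically. The sketch below uses synthetic data and works with singular values, which play the role of the eigenvalues in the text for a non-square matrix; the data, the choice pc = 3, and the variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data (made up): three real components plus weak noise.
X = (rng.uniform(size=(10, 3)) @ rng.uniform(size=(3, 50))
     + rng.normal(scale=1e-4, size=(10, 50)))

sv = np.linalg.svd(X, compute_uv=False)   # singular values, largest first

pc = 3
cond_full = sv[0] / sv[-1]          # all R components retained
cond_reduced = sv[0] / sv[pc - 1]   # only the first pc components retained
print(cond_full, cond_reduced)
```

Discarding the components associated with the smallest singular values lowers the condition number, typically by several orders of magnitude when those components contain only noise.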
Finally, we turn to the problem of selecting the best experimental design, i.e.,
a subset of samples for passive experiments, as was outlined in the naphtha example.
To construct an optimal design that is robust against ill conditioning of the design
matrix, X, we use the E-optimality criterion. A design is E-optimal if it minimizes
the maximum eigenvalue of the dispersion matrix, $\mathbf{M}^{-1} = (\mathbf{F}^{T}\mathbf{F})^{-1}$. The name of the
criterion originates from the first letter of the word “eigenvalue.”
$$\max_{i} \delta_i\left[(\mathbf{M}^*)^{-1}\right] = \min_{x} \max_{i} \delta_i\left[(\mathbf{M})^{-1}\right], \qquad i = 1, \ldots, R, \tag{8.89}$$
where δi[X] represents the eigenvalues of X, and R designates the rank of the
dispersion matrix.
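For a small pool of candidate samples, the E-criterion can be evaluated for every possible subset. The sketch below does this by brute force on synthetic candidate scores; the exhaustive search, the first-order model in two score variables, and all names are assumptions for illustration, since the search algorithm is not specified here and exhaustive enumeration is feasible only for small problems.

```python
import itertools
import numpy as np


def e_criterion(F):
    """Largest eigenvalue of the dispersion matrix (F^T F)^-1 (smaller is better)."""
    eigvals = np.linalg.eigvalsh(F.T @ F)      # ascending order
    smallest = eigvals[0]
    # The largest eigenvalue of the inverse is 1 / (smallest eigenvalue of F^T F).
    return np.inf if smallest <= 1e-12 else 1.0 / smallest


def select_e_optimal(F_candidates, n_select):
    """Exhaustively search all subsets of n_select rows for the best E-criterion."""
    best_rows, best_value = None, np.inf
    for rows in itertools.combinations(range(len(F_candidates)), n_select):
        value = e_criterion(F_candidates[list(rows)])
        if value < best_value:
            best_rows, best_value = rows, value
    return best_rows, best_value


rng = np.random.default_rng(3)
scores = rng.normal(size=(12, 2))                       # made-up candidate scores
F_candidates = np.column_stack([np.ones(12), scores])   # f(x) = [1, s1, s2]

rows, value = select_e_optimal(F_candidates, n_select=5)
print(rows, value)
```

The selected rows identify the samples that would be sent for reference analysis; for realistic numbers of candidates an exchange-type algorithm would replace the exhaustive loop.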