11.6 EXTENSION OF SELF-MODELING CURVE RESOLUTION TO MULTIWAY DATA: MCR-ALS SIMULTANEOUS ANALYSIS OF MULTIPLE CORRELATED DATA MATRICES
DK4712_C011.fm Page 441 Thursday, March 16, 2006 3:40 PM
Multivariate Curve Resolution
related to the variation along the columns of the data matrix. This is also the reason
why a data matrix is called a two-way data set. A data matrix is not the most complex
data set that can be found in chemistry. Let us consider a chemical process monitored
fluorimetrically. At each reaction time, a series of emission spectra recorded at
different excitation wavelengths are obtained. This means that we collect a data
matrix at each stage of the reaction. Because the goal is to obtain a picture of the
global process, the matrices should be considered together. The information about
the whole chemical process could be organized into a cube of data (tensor) with
three informative directions, i.e., in a three-way data set. Another common example is
coupling data matrices from different HPLC-DAD runs that share all or some of
their compounds. In this case, the third direction of the data set accounts for the
quantitative differences among runs. Specialized methods have been developed by
chemometricians to treat these kinds of problems, and these are covered in greater
detail in Chapter 12. An overview of some of the methods useful for multivariate
curve resolution is provided here.
Though there is a clear gain in the quality and quantity of information when
going from two- to three-way data sets, the mathematical complexity associated with
the treatment of three-way data sets can seem, at first sight, a drawback. To avoid
this problem, most of the three-way data analysis methods transform the original
cube of data into a stack of matrices, where simpler mathematical methods can be
applied. This process is often known as unfolding (see Figure 11.10).
A cube of data sized (m × n × p) can be unfolded in three different directions:
along the row space, along the column space, and along the third direction of the
cube, also called the tube space. The three unfolding procedures give a row-wise
[Figure 11.10: panels show row-wise (horizontal), column-wise (vertical), and tube-wise (depth-wise) unfolding of a cube indexed by i = 1, ..., I; j = 1, ..., J; k = 1, ..., K.]
FIGURE 11.10 Three-way data array (cube) unfolding or matricization. Two-way data matrix
augmentation.
© 2006 by Taylor & Francis Group, LLC
Practical Guide to Chemometrics
augmented matrix Dr (m × np), a columnwise augmented matrix Dc (n × mp), or a
tubewise augmented matrix Dt (p × mn), respectively (see Figure 11.10). When rank
analysis of the three augmented matrices is carried out, the number of components
obtained for the three different directions (modes) of the data set may or may not be
the same. When Dr, Dc, and Dt have the same rank, the three-way data set is said to
be trilinear, and when their ranks are different from each other, the data set is
nontrilinear. (Please note that this definition holds for the vast majority of chemical
data sets, except those for which phenomena of rank deficiency or rank overlap are
present [76].) The resolution of a three-way data set into the matrices X, Y, and Z,
which contain the pure profiles related to each of the directions of the three-way
data set, changes for trilinear and nontrilinear systems. For trilinear systems, X, Y,
and Z have the same number of profiles (nc), and the three-way core, C, is an identity
cube (nc × nc × nc) whose unity elements are found on the superdiagonal. In this
case, the three-way core is often omitted because it does not modify numerically
the reproduction of the original tensor. Each element in the original three-way data
set can be reproduced as follows:
\[
d_{ijk} = \sum_{f=1}^{\mathit{nc}} x_{if}\, y_{jf}\, z_{kf} \tag{11.17}
\]
Equation 11.17 is the fundamental expression of the PARAFAC (parallel factor
analysis) model [77], which is used to describe the decomposition of trilinear data
sets. For nontrilinear systems, the core C is no longer a regular cube (ncr × ncc ×
nct), and the non-null elements are spread out in different manners, depending on
each particular data set. The variables ncr, ncc, and nct represent the rank in the
row-wise, columnwise, and tubewise augmented data matrices, respectively. Each
element in the original data set can now be obtained as shown in Equation 11.18:
\[
d_{ijk} = \sum_{f=1}^{\mathit{ncr}} \sum_{g=1}^{\mathit{ncc}} \sum_{h=1}^{\mathit{nct}} x_{if}\, y_{jg}\, z_{kh}\, c_{fgh} \tag{11.18}
\]
Equation 11.18 defines the decomposition of nontrilinear data sets and is the underlying expression of the Tucker3 model [78]. Detailed descriptions of the PARAFAC
and Tucker3 models are given in Chapter 12, Section 12.4.
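The unfolding operation and the rank-based distinction between trilinear and nontrilinear data can be sketched numerically. The following is a minimal illustration in Python with NumPy, not a reference implementation: the array sizes, the random profiles, and the helper names `unfold` and `ranks` are assumptions introduced here. A cube built from Equation 11.17 would be expected to give the same rank for Dr, Dc, and Dt, whereas a cube built from Equation 11.18 with a (3 × 2 × 2) core would generally give different ranks in the different modes.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 6, 7, 5  # rows, columns, tubes (illustrative sizes)

def unfold(cube):
    """Return the row-wise, columnwise and tubewise augmented matrices."""
    m, n, p = cube.shape
    Dr = cube.reshape(m, n * p)                     # (m x np)
    Dc = np.moveaxis(cube, 1, 0).reshape(n, m * p)  # (n x mp)
    Dt = np.moveaxis(cube, 2, 0).reshape(p, m * n)  # (p x mn)
    return Dr, Dc, Dt

def ranks(cube):
    """Numerical rank of each of the three augmented matrices."""
    return tuple(int(np.linalg.matrix_rank(M)) for M in unfold(cube))

# Trilinear cube following Equation 11.17 with nc = 2 (random profiles)
nc = 2
X, Y, Z = rng.random((m, nc)), rng.random((n, nc)), rng.random((p, nc))
D_tri = np.einsum("if,jf,kf->ijk", X, Y, Z)

# Nontrilinear cube following Equation 11.18 with a (3 x 2 x 2) core
ncr, ncc, nct = 3, 2, 2
Xr, Yc, Zt = rng.random((m, ncr)), rng.random((n, ncc)), rng.random((p, nct))
core = rng.random((ncr, ncc, nct))
D_tucker = np.einsum("if,jg,kh,fgh->ijk", Xr, Yc, Zt, core)

print("ranks (Dr, Dc, Dt), trilinear cube:   ", ranks(D_tri))
print("ranks (Dr, Dc, Dt), nontrilinear cube:", ranks(D_tucker))
```

With experimental data the ranks are obscured by noise, so the comparison would have to rely on an inspection of singular values rather than on an exact numerical rank.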
Decompositions of three-way arrays into these two different models require
different data analysis methods; therefore, finding out if the internal structure of a
three-way data set is trilinear or nontrilinear is essential to ensure the selection of
a suitable chemometric method. In the previous paragraphs, the concept of trilinearity
was tackled as an exclusively mathematical problem. However, the chemical information is often enough to determine whether a three-way data set presents this
feature. How to link chemical knowledge with the mathematical structure of a three-way data set can be easily illustrated with a real example.
Let us consider a three-way data set formed by several HPLC-DAD runs. If the
data set is trilinear, X, Y, and Z will have as many profiles as chemical compounds
in the original data set, and this number will be equal to the rank of the data set.
For each chemical compound, there will be only one profile in X, in Y, and in Z
common to all of the appended matrices in the original data set. In the case of
different HPLC-DAD data runs analyzed simultaneously, the decomposition of the
three-way array gives an X matrix with chromatographic proﬁles, a Y matrix with
pure spectra, and a Z matrix with the quantitative information about the amount of
each compound in the different chromatographic runs. In this case, a trilinear structure would imply that the shape of the pure spectrum and the pure chromatogram
of a compound remain invariant in the different chromatographic runs. If the experimental conditions in the different runs are similar enough, the UV spectrum of a
pure compound should not change; however, small run-to-run differences in peak
shape and position are commonly found in practice. Assuming that the elution
process of the same compound in different runs always yields an identically shaped
chromatographic profile does not make sense from a chemical point of view and,
therefore, the data set should be considered nontrilinear. In the example related to
the fluorimetric monitoring of a kinetic process, the decomposition of the original
data set gives a matrix X with pure excitation spectra, a matrix Y with pure emission
spectra, and a matrix Z with the kinetic profiles of the process. A trilinear structure
would indicate that the shapes of the excitation and emission spectra of a compound
do not change at the different reaction times of the kinetic process. This invariability
of the spectra is a reasonable assumption if the experimental conditions during the
process are not modified. Therefore, this data set can be considered trilinear.
In practice, however, most systems are nontrilinear, due either to the
underlying chemical process (e.g., UV-reaction-monitoring coupling experiments
with different reagent ratios) or to the instrumental lack of reproducibility in the
response profiles (e.g., chromatographic profiles in different HPLC-DAD runs).
Therefore, multivariate curve-resolution methods are mainly focused on the study
of real examples lacking the trilinear structure. Despite the higher abundance of
nontrilinear data sets, many of the algorithms proposed to study three-way arrays
rely on the assumption of trilinear structure. This is the case with the generalized
rank annihilation method (GRAM) [79], designed to work with two matrices, or its
natural extension, direct trilinear decomposition (DTD) [80], which can handle larger
data sets with more appended matrices. Both GRAM and DTD are noniterative
methods and use latent variables to resolve the profiles in X, Y, and Z. When these
methods are applied to nontrilinear data sets, the profiles obtained often contain
imaginary numbers (see Chapter 12, Section 12.6). Iterative methods are also used
in three-way data, and the scheme followed in their application is the same as for
a single data matrix, i.e., determination of number of components, use of initial
estimates, application of constraints, and iterative optimization until convergence.
Most of the iterative algorithms are based on least-squares calculations. As for two-way data sets, three-way iterative methods are more flexible in how constraints can
be applied and can deal with data sets that are more diverse.
We have noted that three-way resolution methods generally work with the unfolded
matrices. Depending on the algorithm used, all three types of unfolded matrices may
be used, or only some of them. In the PARAFAC decomposition of a trilinear data set,
all three types of unfolded data matrices are used, whereas in the resolution of a
nontrilinear data set by the MCR-ALS method, only one type of unfolded matrix is used.
Three-way data sets have been presented as cubes of data formed by appending
several matrices together. This means implicitly that all of the data matrices in a
tensor should be equally sized; otherwise, the cube cannot be constructed. Additionally, the information in the rows, in the columns, and in the third direction of the
array must be synchronized for each of the layers of the cube. For example, in the
HPLC-DAD example, if the columns are wavelengths, all of the runs should span
the same wavelength range, and if the rows are retention times, the elution time
range should also coincide. In experimental measurements, it may not be easy or
convenient to fulfill these two requirements. Synchronization can be difficult when
the parameter that changes in one of the directions of the array cannot be controlled
in a simple manner. Obtaining equally sized matrices may also be inconvenient if
synchronization forces the inclusion of irrelevant information in some of the two-way appended arrays. An example of difficult synchronization is the combination
of experiments in which a pH-dependent process is monitored by UV spectroscopy.
In this case, the pH variations may not be easily reproducible among the experiments.
The inconvenience of appending equally sized two-way arrays is also evident when
several HPLC-DAD runs of a mixture and matrices of its single standards are treated
together. If the standard runs cover the same elution time range as the mixture runs,
then most of the information in the standard matrices will be formed by baseline
spectra that are not relevant for the resolution of the mixture.
When building a typical three-way data set is not possible, there is no need to give
up the simultaneous analysis of a group of matrices that have something in common.
Some methods, such as MCR-ALS, are designed to work with only one of the three
possible unfolded matrices. This operating procedure greatly relaxes the demands in
how the two-way arrays are combined. Indeed, MCR-ALS requires only one common
direction in all of the matrices to be analyzed. In both of the previous examples, the
common direction is the wavelength range of the spectra collected (see Figure 11.11).
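As an illustration of this relaxed requirement, the sketch below appends three simulated matrices that share only their wavelength axis and have different numbers of rows. It is a minimal example under stated assumptions: the sizes, the random "spectra" and "concentration profiles", and the variable names are hypothetical and serve only to show how the columnwise augmented matrix is assembled.

```python
import numpy as np

rng = np.random.default_rng(1)
n_wavelengths = 40                    # common direction (columns)
S_T = rng.random((2, n_wavelengths))  # two hypothetical pure spectra

# Three experiments with different numbers of rows
# (e.g., elution times or pH values that are not synchronized)
row_counts = [30, 12, 18]
C_blocks = [rng.random((rows, 2)) for rows in row_counts]
D_blocks = [C @ S_T for C in C_blocks]

# Columnwise augmentation: matrices are placed one on top of the other
D_aug = np.vstack(D_blocks)
C_aug = np.vstack(C_blocks)

# Row offsets needed later to recover the block of each experiment
offsets = np.cumsum([0] + row_counts)
print(D_aug.shape, offsets)
```

The same bilinear model holds for the augmented pair, D_aug = C_aug S_T, even though no regular cube could be built from these three matrices.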
[Figure 11.11: unfolded PCA, Dk = Uk VT, and unfolded MCR, Dk = Ck ST, with Dk sized (I × J), Uk and Ck sized (I, n), and VT and ST sized (n, J).]
FIGURE 11.11 Bilinear models for three-way data, unfolded PCA and unfolded MCR.
The MCR-ALS decomposition method applied to three-way data can also
deal with nontrilinear systems [81]. Whereas the spectrum of each compound of
the columnwise augmented matrix is considered to be invariant for all of the
matrices, the unfolded C matrix allows the profile of each compound in the
concentration direction to be different for each appended data matrix. This freedom in the shape of the C profiles is appropriate for many problems with a
nontrilinear structure. The least-squares problems solved by MCR-ALS, when
applied to a three-way data set, are the same as those in Equation 11.11 and
Equation 11.12; the only difference is that D and C are now augmented matrices.
The operating procedure of the MCR-ALS method has already been described
in Section 11.5.4, but some particulars regarding the treatment of three-way data
sets deserve further comment.
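The alternating scheme on an augmented matrix can be sketched as follows. This is a simplified illustration rather than the published algorithm: Equation 11.11 and Equation 11.12 are assumed here to be the usual least-squares updates of C and ST, nonnegativity is imposed crudely by clipping negative values to zero (a proper implementation would use a nonnegative least-squares solver), and the function name `mcr_als`, the simulated data, and the tolerance are hypothetical.

```python
import numpy as np

def mcr_als(D_aug, S_T0, max_iter=100, tol=1e-12):
    """Alternate least-squares updates of C and S^T on an augmented matrix."""
    S_T = np.array(S_T0, dtype=float)
    previous = np.inf
    for _ in range(max_iter):
        # Update the augmented C for fixed S^T, then clip negatives
        C = np.clip(D_aug @ np.linalg.pinv(S_T), 0.0, None)
        # Update the common S^T for fixed C, then clip negatives
        S_T = np.clip(np.linalg.pinv(C) @ D_aug, 0.0, None)
        residual = np.linalg.norm(D_aug - C @ S_T)
        if abs(previous - residual) < tol:
            break
        previous = residual
    return C, S_T, residual

# Noise-free simulated data: three blocks of different size, common spectra
rng = np.random.default_rng(2)
S_true = 0.2 + rng.random((2, 30))
C_blocks = [0.2 + rng.random((rows, 2)) for rows in (25, 10, 15)]
D_aug = np.vstack([C @ S_true for C in C_blocks])

# Initial estimate: slightly perturbed spectra
S_T0 = S_true + 0.01 * rng.random(S_true.shape)
C_hat, S_hat, residual = mcr_als(D_aug, S_T0)
relative_residual = residual / np.linalg.norm(D_aug)
print(C_hat.shape, S_hat.shape, relative_residual)
```

The only change with respect to a single-matrix analysis is that D and C are augmented, so each compound receives one spectrum but a separate concentration profile in every block.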
In the resolution of a columnwise augmented data matrix, the initial estimates
can be either a single ST matrix or a columnwise augmented C matrix. The
columnwise concentration matrix is built by placing the initial C-type estimates
obtained for each data matrix in the three-way data set one on top of the other.
The appended initial estimates must be sorted into the same order as the initial
data matrices in D, and they must keep a correct correspondence of species, i.e.,
each column in the augmented C matrix must be formed by appended concentration
profiles related to the same chemical compounds. When no prior information about
the identity of the compounds in the different data matrices is available, the correct
correspondence of species can be estimated from the two-way resolution results
of each single matrix.
The same constraints used in the resolution of a two-way data matrix can be
applied to three-way data sets [21, 42]. Selectivity and nonnegativity affect the
spectrum and the augmented concentration profile of each species, whereas unimodality is applied separately to each of the profiles appended to form the augmented concentration profile. The closure constraint operates by applying the
corresponding closure constant to each of the single matrices in the columnwise
concentration matrix. Another constraint specific to three-way data sets is the so-called correspondence among species. In each single matrix of a three-way data
set, the concentration profiles of absent compounds are set equal to zero after each
iterative cycle.
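A minimal sketch of the correspondence-among-species constraint is given below. It assumes that the presence or absence of each compound in each appended matrix is known in advance and coded in a Boolean table; the function name `apply_correspondence` and the argument layout are hypothetical choices made for this illustration.

```python
import numpy as np

def apply_correspondence(C_aug, row_counts, present):
    """Zero the profiles of absent species in each appended block.

    C_aug      : augmented concentration matrix (sum(row_counts) x species)
    row_counts : number of rows of each appended matrix, in order
    present    : present[k][s] is True if species s exists in block k
    """
    C_new = np.array(C_aug, dtype=float)
    start = 0
    for k, rows in enumerate(row_counts):
        stop = start + rows
        absent = ~np.asarray(present[k], dtype=bool)
        C_new[start:stop, absent] = 0.0
        start = stop
    return C_new

# Two blocks (2 and 3 rows); the second species is absent from block 2
C_aug = np.ones((5, 2))
present = [[True, True], [True, False]]
C_constrained = apply_correspondence(C_aug, [2, 3], present)
print(C_constrained)
```

In an iterative resolution, such a function would be called on the augmented C matrix after each least-squares update.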
Although MCR-ALS is especially well suited to nontrilinear data sets formed
by matrices of varying sizes, it can also work with trilinear data sets. Because of the
inherent freedom in the modeling of the profiles of the augmented C matrix, trilinear
structure can be included in the MCR-ALS method as an optional constraint [81, 82].
The application of this constraint is performed separately on the concentration profile
of each species. To implement this type of constraint, the profiles of a given species are
placed side by side to form a new augmented concentration profile matrix,
and PCA is performed on it. If the system is trilinear, the score vector related to the
first PC will show the real shape of the concentration profile, and the remaining PCs should
be related to noise contributions. The loadings related to the first PC are scaling factors
accounting for the species concentration level in the different appended matrices. Therefore, the new single profiles can be calculated as the product of the score vector by
their corresponding scaling factors. The constrained single profiles are finally appended
[Figure 11.12: the augmented matrix D (D1, D2, D3) is decomposed into C and ST; panel labels: selection of profile, folding of profile, PCA (loadings, 1st score), unfolding of profile, updating of profile in C.]
FIGURE 11.12 Implementation of a trilinearity constraint in MCR bilinear models for three-way data.
to form the new augmented concentration profiles. This process is shown graphically
in Figure 11.12. In contrast to other three-way resolution methods specially designed
to work with trilinear systems, the implementation of this constraint in MCR-ALS need
not necessarily be all inclusive, i.e., some or all of the compounds can be forced to
have common profiles in the C matrix. This flexibility allows a more representative
modeling of some real situations, such as systems that combine (a) trilinear profiles related
to the evolution of chemical compounds common to each experiment and (b) freely
modeled profiles related to important background contributions that differ in each
experiment.
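The folding, PCA, and unfolding steps of the trilinearity constraint can be sketched for a single species as follows. This is an illustrative sketch under several assumptions: all appended matrices have the same number of rows (otherwise the profiles cannot be placed side by side), PCA is computed by singular value decomposition without mean centering, and the function name `trilinearity_constraint` is hypothetical.

```python
import numpy as np

def trilinearity_constraint(c_aug, n_blocks):
    """Force the appended profiles of one species to share a common shape.

    c_aug    : augmented concentration profile (length n_blocks * rows)
    n_blocks : number of appended matrices, all with the same row count
    """
    rows = c_aug.size // n_blocks
    # Folding: one column per appended matrix
    folded = c_aug.reshape(n_blocks, rows).T
    U, s, Vt = np.linalg.svd(folded, full_matrices=False)
    score = U[:, 0] * s[0]   # common profile shape (first score)
    loadings = Vt[0]         # one scaling factor per appended matrix
    rebuilt = np.outer(score, loadings)
    # Unfolding: append the constrained single profiles again
    return rebuilt.T.reshape(-1), score, loadings

# Three noisy copies of one Gaussian profile at different concentration levels
rng = np.random.default_rng(3)
t = np.arange(20)
shape = np.exp(-0.5 * ((t - 10) / 2.0) ** 2)
exact = np.concatenate([1.0 * shape, 2.0 * shape, 0.5 * shape])
noisy = exact + 0.01 * rng.standard_normal(exact.size)

constrained, score, loadings = trilinearity_constraint(noisy, n_blocks=3)
refolded = constrained.reshape(3, 20).T
print(int(np.linalg.matrix_rank(refolded)))
```

To leave a background contribution freely modeled, the function would simply not be applied to the corresponding column of the augmented C matrix.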
The information in the third direction of the array, i.e., the Z matrix, is directly
extracted from the augmented matrix C in MCR-ALS. This dimension of the data
set is usually the smallest in size and represents scaling differences among the
appended matrices. Because the ST profile of each compound is common to all of
the appended data matrices, the area of the concentration profile of each compound
is scaled according to the concentration level of the species in each single data
matrix. Thus, the profile of a compound in the Z matrix accounts for the relative
concentration of a particular compound in each of the appended matrices and can
be obtained from the ratio between (a) the area of its concentration profile in a given
matrix and (b) the area related to the concentration profile of the same compound
in a matrix taken as a reference.
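The area-ratio calculation can be sketched as below. The example assumes equally spaced rows, so that the area of a profile is approximated by the sum of its values, and a reference matrix in which every compound is present (a zero reference area would make the ratio undefined). The function name `relative_amounts` and the small test profiles are hypothetical.

```python
import numpy as np

def relative_amounts(C_aug, row_counts, reference=0):
    """Estimate Z from the areas of the resolved concentration profiles.

    Returns an array (blocks x species) of areas relative to the
    reference block.
    """
    bounds = np.cumsum([0] + list(row_counts))
    areas = np.array([C_aug[a:b].sum(axis=0)
                      for a, b in zip(bounds[:-1], bounds[1:])])
    return areas / areas[reference]

# Two appended matrices, two species sharing the same profile shape
shape = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
block_1 = np.column_stack([shape, shape])
block_2 = np.column_stack([2.0 * shape, 0.5 * shape])
C_aug = np.vstack([block_1, block_2])

Z = relative_amounts(C_aug, row_counts=[5, 5], reference=0)
print(Z)
```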
11.7 UNCERTAINTY IN RESOLUTION RESULTS, RANGE OF FEASIBLE SOLUTIONS, AND ERROR IN RESOLUTION
The main sources of uncertainty associated with the resolution results are the ambiguity of the recovered profiles and the experimental noise of the data. Providing
methodologies to quantify this uncertainty is not only a topic of interest in the current
literature, but a necessary requirement to enable the use of resolution methods in
standard analytical procedures.
The possible existence of ambiguity in resolution results has been known since
the earliest research in this area [1, 2, 23]. After years of experience, it has been
possible to set resolution theorems that indicate clearly the conditions needed to
recover uniquely the pure concentration and signal profiles of a compound in a data
set. These conditions depend mainly on the degree of selectivity in the column mode
or row mode of the measurements. The degree of selectivity, in turn, depends on (a)
the amount of overlap in the region of occurrence for the compound of interest with
the rest of the constituents and (b) the general distribution of the different compound
windows in the data set [21, 22]. Therefore, in the same system, some profiles can
be recovered uniquely, and some others will necessarily be affected by a certain
ambiguity. When ambiguity exists, a compound is represented by a band of feasible
solutions instead of a unique profile. Calculating the boundaries of these bands is
not straightforward, and the first attempts proposed were valid only for systems with
two or three components [1–4]. More recent approaches extended their applicability
to systems with no limit in the number of contributions [83]. The most recent
methods use optimization strategies to find the minimum and maximum solution
boundaries by minimizing and maximizing objective functions subject to selected
constraints. The objective functions represent the ratio between the signal contributed
by the compound of interest and the total signal from all compounds in the data set
[84, 85]. These strategies are more powerful than previous ones and allow for an
accurate study of the effect of the different constraints on the magnitude of the bands
of feasible solutions (see Figure 11.13).
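The kind of objective function described above can be illustrated with a small numerical sketch. It assumes that the signal contribution of compound k is measured as the ratio of Frobenius norms between its own bilinear contribution and the total reproduced signal; the function name `signal_contribution`, the small matrices, and the transformation matrix T are hypothetical. The example only evaluates the ratio for two equivalent solutions and does not carry out the optimization itself.

```python
import numpy as np

def signal_contribution(C, S_T, k):
    """Ratio of the signal of compound k to the total signal (Frobenius norms)."""
    part = np.outer(C[:, k], S_T[k])
    return np.linalg.norm(part) / np.linalg.norm(C @ S_T)

# A small two-component solution
C = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
S_T = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])

# Any invertible T gives another solution with an identical fit
T = np.array([[1.0, 0.2], [0.0, 1.0]])
C_rot = C @ T
S_rot = np.linalg.inv(T) @ S_T

same_fit = np.allclose(C @ S_T, C_rot @ S_rot)
f_original = signal_contribution(C, S_T, 0)
f_rotated = signal_contribution(C_rot, S_rot, 0)
print(same_fit, f_original, f_rotated)

# This particular T yields a negative value in S_rot, so a nonnegativity
# constraint would exclude it from the set of feasible solutions.
print(bool((S_rot < 0).any()))
```

Searching over T for the smallest and largest value of this ratio, while rejecting transformations that violate the selected constraints, would give the Tmin and Tmax boundaries of the band for compound k.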
[Figure 11.13: two panels showing resolved profiles together with their Tmax and Tmin feasible bands.]
FIGURE 11.13 Effect of constraints on feasible bands. Tmax and Tmin define maximum and
minimum feasible ranges/bands (dashed and dotted lines, respectively) around the resolved
solutions (solid lines).