11.6 EXTENSION OF SELF-MODELING CURVE RESOLUTION TO MULTIWAY DATA: MCR-ALS SIMULTANEOUS ANALYSIS OF MULTIPLE CORRELATED DATA MATRICES


DK4712_C011.fm Page 441 Thursday, March 16, 2006 3:40 PM



Multivariate Curve Resolution






related to the variation along the columns of the data matrix. This is also the reason

why a data matrix is called a two-way data set. A data matrix is not the most complex

data set that can be found in chemistry. Let us consider a chemical process monitored

fluorimetrically. At each reaction time, a series of emission spectra recorded at

different excitation wavelengths are obtained. This means that we collect a data

matrix at each stage of the reaction. Because the goal is to obtain a picture of the

global process, the matrices should be considered altogether. The information about

the whole chemical process could be organized into a cube of data (tensor) with

three informative directions, i.e., in a three-way data set. Another common example is

coupling data matrices from different HPLC-DAD runs that share all or some of

their compounds. In this case, the third direction of the data set accounts for the

quantitative differences among runs. Specialized methods have been developed by

chemometricians to treat these kinds of problems, and these are covered in greater

detail in Chapter 12. An overview of some of the methods useful for multivariate

curve resolution is provided here.

Though there is a clear gain in the quality and quantity of information when

going from two- to three-way data sets, the mathematical complexity associated with

the treatment of three-way data sets can seem, at first sight, a drawback. To avoid

this problem, most of the three-way data analysis methods transform the original

cube of data into a stack of matrices, where simpler mathematical methods can be

applied. This process is often known as unfolding (see Figure 11.10).
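As a minimal sketch of unfolding, assuming the cube is held in a NumPy array indexed as `cube[i, j, k]`, the three unfolded matrices can be obtained by moving one direction to the front and flattening the other two. The names `cube` and `unfold` and the sizes are hypothetical, and the order in which the flattened columns appear is a convention of this sketch rather than something fixed by the text.

```python
import numpy as np

# Hypothetical cube: m rows, n columns, p slices (tubes), stored as cube[i, j, k].
m, n, p = 4, 5, 3
cube = np.arange(m * n * p, dtype=float).reshape(m, n, p)

def unfold(cube, mode):
    """Move the chosen direction to the front and flatten the other two."""
    return np.moveaxis(cube, mode, 0).reshape(cube.shape[mode], -1)

D_r = unfold(cube, 0)  # row-wise unfolding:    shape (m, n*p)
D_c = unfold(cube, 1)  # column-wise unfolding: shape (n, m*p)
D_t = unfold(cube, 2)  # tube-wise unfolding:   shape (p, m*n)

print(D_r.shape, D_c.shape, D_t.shape)
```

Every element of the cube appears exactly once in each unfolded matrix, so no information is lost; only the arrangement changes.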

FIGURE 11.10 Three-way data array (cube) unfolding or matricization: row-wise (horizontal) unfolding, column-wise (vertical) unfolding, and tube-wise (depth-wise) unfolding, with indices i = 1, ..., I; j = 1, ..., J; k = 1, ..., K. Two-way data matrix augmentation.

© 2006 by Taylor & Francis Group, LLC

Practical Guide to Chemometrics

A cube of data sized (m × n × p) can be unfolded in three different directions: along the row space, along the column space, and along the third direction of the cube, also called the tube space. The three unfolding procedures give a row-wise augmented matrix Dr (m × np), a columnwise augmented matrix Dc (n × mp), or a tubewise augmented matrix Dt (p × mn), respectively (see Figure 11.10). When rank

analysis of the three augmented matrices is carried out, the number of components

obtained for the three different directions (modes) of the data set may be the same

or not. When Dr, Dc, and Dt have the same rank, the three-way data set is said to

be trilinear, and when their ranks are different from each other, the data set is

nontrilinear. (Note that this definition holds for the great majority of chemical

data sets, except those for which phenomena of rank deficiency or rank overlap are

present [76].) The resolution of a three-way data set into the matrices X, Y, and Z,

which contain the pure profiles related to each of the directions of the three-way

data set, changes for trilinear and nontrilinear systems. For trilinear systems, X, Y,

and Z have the same number of profiles (nc), and the three-way core, C, is an identity

cube (nc × nc × nc) whose unity elements are found on the superdiagonal. In this

case, the three-way core is often omitted because it does not modify numerically

the reproduction of the original tensor. Each element in the original three-way data

set can be reproduced as follows:

\[
d_{ijk} = \sum_{f=1}^{\mathit{nc}} x_{if}\, y_{jf}\, z_{kf} \tag{11.17}
\]
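The trilinear model of Equation 11.17 can be illustrated with a short NumPy sketch. The sizes, the random profiles, and the variable names below are hypothetical; the point is only that a cube built from `nc` triads gives unfolded matrices with the same rank in all three directions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p, nc = 6, 7, 5, 2          # hypothetical sizes and number of components

X = rng.random((m, nc))           # profiles in the row direction
Y = rng.random((n, nc))           # profiles in the column direction
Z = rng.random((p, nc))           # profiles in the third direction

# d_ijk = sum over f of x_if * y_jf * z_kf
D = np.einsum("if,jf,kf->ijk", X, Y, Z)

# Rank of the row-wise, column-wise, and tube-wise unfolded matrices
ranks = [
    np.linalg.matrix_rank(np.moveaxis(D, mode, 0).reshape(D.shape[mode], -1))
    for mode in range(3)
]
print(ranks)
```

For simulated, noise-free data such as these, the three ranks coincide with `nc`; with experimental data the rank has to be estimated in the presence of noise.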



Equation 11.17 is the fundamental expression of the PARAFAC (parallel factor

analysis) model [77], which is used to describe the decomposition of trilinear data

sets. For nontrilinear systems, the core C is no longer a regular cube (ncr × ncc ×

nct), and the non-null elements are spread out in different manners, depending on

each particular data set. The variables ncr, ncc, and nct represent the rank in the

row-wise, columnwise, and tubewise augmented data matrices, respectively. Each

element in the original data set can now be obtained as shown in Equation 11.18:

\[
d_{ijk} = \sum_{f=1}^{\mathit{ncr}} \sum_{g=1}^{\mathit{ncc}} \sum_{h=1}^{\mathit{nct}} x_{if}\, y_{jg}\, z_{kh}\, c_{fgh} \tag{11.18}
\]



Equation 11.18 defines the decomposition of nontrilinear data sets and is the underlying expression of the Tucker3 model [78]. Detailed descriptions of the PARAFAC

and Tucker3 models are given in Chapter 12, Section 12.4.
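A comparable sketch for Equation 11.18, again with hypothetical sizes and random profiles, shows how a core array with different dimensions in each direction produces different ranks for the three unfolded matrices, and how a superdiagonal identity core brings the model back to Equation 11.17.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, p = 6, 7, 5
ncr, ncc, nct = 3, 2, 2                  # hypothetical ranks in each direction

X = rng.random((m, ncr))
Y = rng.random((n, ncc))
Z = rng.random((p, nct))
core = rng.random((ncr, ncc, nct))       # core array with elements c_fgh

# d_ijk = sum over f, g, h of x_if * y_jg * z_kh * c_fgh
D = np.einsum("if,jg,kh,fgh->ijk", X, Y, Z, core)

ranks = [
    np.linalg.matrix_rank(np.moveaxis(D, mode, 0).reshape(D.shape[mode], -1))
    for mode in range(3)
]
print(ranks)                             # ranks differ between directions

# Special case: a superdiagonal identity core reproduces the trilinear model
nc = 2
identity_core = np.zeros((nc, nc, nc))
identity_core[np.arange(nc), np.arange(nc), np.arange(nc)] = 1.0
X2, Y2, Z2 = rng.random((m, nc)), rng.random((n, nc)), rng.random((p, nc))
tucker = np.einsum("if,jg,kh,fgh->ijk", X2, Y2, Z2, identity_core)
trilinear = np.einsum("if,jf,kf->ijk", X2, Y2, Z2)
print(np.allclose(tucker, trilinear))
```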

Decompositions of three-way arrays into these two different models require

different data analysis methods; therefore, finding out if the internal structure of a

three-way data set is trilinear or nontrilinear is essential to ensure the selection of

a suitable chemometric method. In the previous paragraphs, the concept of trilinearity

was tackled as an exclusively mathematical problem. However, the chemical information is often enough to determine whether a three-way data set presents this

feature. How to link chemical knowledge with the mathematical structure of a three-way data set can be easily illustrated with a real example.

Let us consider a three-way data set formed by several HPLC-DAD runs. If the

data set is trilinear, X, Y, and Z will have as many profiles as chemical compounds

in the original data set, and this number will be equal to the rank of the data set.






For each chemical compound, there will be only one profile in X, in Y, and in Z

common to all of the appended matrices in the original data set. In the case of

different HPLC-DAD data runs analyzed simultaneously, the decomposition of the

three-way array gives an X matrix with chromatographic profiles, a Y matrix with

pure spectra, and a Z matrix with the quantitative information about the amount of

each compound in the different chromatographic runs. In this case, a trilinear structure would imply that the shape of the pure spectrum and the pure chromatogram

of a compound remain invariant in the different chromatographic runs. If the experimental conditions in the different runs are similar enough, the UV spectrum of a

pure compound should not change; however, small run-to-run differences in peak

shape and position are commonly found in practice. Assuming that the elution

process of the same compound in different runs always yields an identically shaped

chromatographic profile does not make sense from a chemical point of view and,

therefore, the data set should be considered nontrilinear. In the example related to

the fluorimetric monitoring of a kinetic process, the decomposition of the original

data set gives a matrix X with pure excitation spectra, a matrix Y with pure emission

spectra, and a matrix Z with the kinetic profiles of the process. A trilinear structure

would indicate that the shapes of the excitation and emission spectra of a compound

do not change at the different reaction times of the kinetic process. This invariability

of the spectra is an acceptable statement if the experimental conditions during the

process are not modified. Therefore, this data set can be considered trilinear.

In practice, however, most of the systems are nontrilinear, due to either the

underlying chemical process (e.g., UV-reaction-monitoring coupling experiments

with different reagent ratios) or to the instrumental lack of reproducibility in the

response profiles (e.g., chromatographic profiles in different HPLC-DAD runs).

Therefore, multivariate curve-resolution methods are mainly focused on the study

of real examples lacking the trilinear structure. Despite the higher abundance of

nontrilinear data sets, many of the algorithms proposed to study three-way arrays

rely on the assumption of trilinear structure. This is the case with the generalized

rank annihilation method (GRAM) [79], designed to work with two matrices, or its

natural extension, direct trilinear decomposition (DTD) [80], which can handle larger

data sets with more appended matrices. Both GRAM and DTD are noniterative

methods and use latent variables to resolve the profiles in X, Y, and Z. When these

methods are applied to nontrilinear data sets, the profiles obtained often contain

imaginary numbers (see Chapter 12, Section 12.6). Iterative methods are also used

in three-way data, and the scheme followed in their application is the same as for

a single data matrix, i.e., determination of number of components, use of initial

estimates, application of constraints, and iterative optimization until convergence.

Most of the iterative algorithms are based on least-squares calculations. As for two-way data sets, three-way iterative methods are more flexible in how constraints can

be applied and can deal with data sets that are more diverse.

We have noted that three-way resolution methods generally work with the unfolded

matrices. Depending on the algorithm used, all three types of unfolded matrices may

be used, or only some of them. In the PARAFAC decomposition of a trilinear data set,

all three types of unfolded data matrices are used, whereas in the resolution of a

nontrilinear data set by the MCR-ALS method, only one type of unfolded matrix is used.






Three-way data sets have been presented as cubes of data formed by appending

several matrices together. This means implicitly that all of the data matrices in a

tensor should be equally sized; otherwise, the cube cannot be constructed. Additionally, the information in the rows, in the columns, and in the third direction of the

array must be synchronized for each of the layers of the cube. For example, in the

HPLC-DAD example, if the columns are wavelengths, all of the runs should span

the same wavelength range, and if the rows are retention times, the elution time

range should also coincide. In experimental measurements, it may not be easy or

convenient to fulfill these two requirements. Synchronization can be difficult when

the parameter that changes in one of the directions of the array cannot be controlled

in a simple manner. Obtaining equally sized matrices may also be inconvenient if

synchronization forces the inclusion of irrelevant information in some of the two-way appended arrays. An example of difficult synchronization is the combination

of experiments in which a pH-dependent process is monitored by UV spectroscopy.

In this case, the pH variations may not be easily reproducible among the experiments.

The inconvenience of appending equally sized two-way arrays is also evident when

several HPLC-DAD runs of a mixture and matrices of its single standards are treated

together. If the standard runs cover the same elution time range as the mixture runs,

then most of the information in the standard matrices will be formed by baseline

spectra that are not relevant for the resolution of the mixture.

When building a typical three-way data set is not possible, there is no need to give

up the simultaneous analysis of a group of matrices that have something in common.

Some methods, such as MCR-ALS, are designed to work with only one of the three

possible unfolded matrices. This operating procedure greatly relaxes the demands in

how the two-way arrays are combined. Indeed, MCR-ALS requires only one common

direction in all of the matrices to be analyzed. In both of the previous examples, the

common direction is the wavelength range of the spectra collected (see Figure 11.11).
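The relaxed requirement can be sketched as follows, assuming that the common direction is the column (wavelength) direction. The matrices are filled with random numbers purely as placeholders, and the names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_wavelengths = 50                       # common direction shared by every matrix

# Hypothetical experiments with different numbers of rows
D1 = rng.random((30, n_wavelengths))     # e.g., a mixture run
D2 = rng.random((12, n_wavelengths))     # e.g., a short run of a standard
D3 = rng.random((20, n_wavelengths))     # e.g., another experiment

# Column-wise augmentation: the matrices are placed on top of one another
D_aug = np.vstack([D1, D2, D3])

# Row offsets that recover each original matrix from the augmented one
offsets = np.cumsum([0, D1.shape[0], D2.shape[0], D3.shape[0]])
blocks = [D_aug[a:b] for a, b in zip(offsets[:-1], offsets[1:])]

print(D_aug.shape)
```

Only the number of columns has to match; the matrices neither need the same number of rows nor synchronized row variables.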

FIGURE 11.11 Bilinear models for three-way data, unfolded PCA and unfolded MCR. Panel labels: PCA, Dk = Uk VT; MCR, Dk = Ck ST; with Dk sized (I × J), Uk and Ck sized (I, n), and VT and ST sized (n, J).






The MCR-ALS decomposition method applied to three-way data can also

deal with nontrilinear systems [81]. Whereas the spectrum of each compound of

the columnwise augmented matrix is considered to be invariant for all of the

matrices, the unfolded C matrix allows the profile of each compound in the

concentration direction to be different for each appended data matrix. This freedom in the shape of the C profiles is appropriate for many problems with a

nontrilinear structure. The least-squares problems solved by MCR-ALS, when

applied to a three-way data set, are the same as those in Equation 11.11 and

Equation 11.12; the only difference is that D and C are now augmented matrices.

The operating procedure of the MCR-ALS method has already been described

in Section 11.5.4, but some particulars regarding the treatment of three-way data

sets deserve further comment.
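The alternating scheme on an augmented matrix can be sketched as below. This is not the published algorithm: it assumes that the two least-squares steps are the usual pseudoinverse updates of C and ST, it imposes nonnegativity by simple clipping, and it runs a fixed number of iterations instead of testing convergence. The function name `mcr_als_augmented`, the helper `gaussian`, and the simulated data are all hypothetical.

```python
import numpy as np

def mcr_als_augmented(D_list, St0, n_iter=50):
    """Alternating least squares on a column-wise augmented matrix.

    D_list : list of 2-D arrays that share the same number of columns
    St0    : initial estimate of the common spectra, shape (nc, n_columns)
    Nonnegativity is imposed by clipping (a simplification of this sketch).
    """
    D = np.vstack(D_list)
    St = St0.copy()
    for _ in range(n_iter):
        C = np.clip(D @ np.linalg.pinv(St), 0.0, None)   # update augmented C
        St = np.clip(np.linalg.pinv(C) @ D, 0.0, None)   # update common S^T
    sizes = [Dk.shape[0] for Dk in D_list]
    C_blocks = np.split(C, np.cumsum(sizes)[:-1], axis=0)
    return C_blocks, St

def gaussian(x, center, width):
    return np.exp(-0.5 * ((x - center) / width) ** 2)

# Two simulated runs of different length; peak shape and position differ per run
t1, t2, wl = np.arange(40), np.arange(25), np.arange(60)
St_true = np.vstack([gaussian(wl, 20, 6), gaussian(wl, 35, 8)])
C1 = np.column_stack([gaussian(t1, 15, 4), gaussian(t1, 22, 5)])
C2 = 0.5 * np.column_stack([gaussian(t2, 8, 3), gaussian(t2, 14, 4)])
D_list = [C1 @ St_true, C2 @ St_true]

C_blocks, St = mcr_als_augmented(D_list, St_true + 0.05)  # perturbed initial spectra
D = np.vstack(D_list)
lof = 100 * np.linalg.norm(D - np.vstack(C_blocks) @ St) / np.linalg.norm(D)
print(f"lack of fit: {lof:.3f} %")
```

Because each run keeps its own block of C, the concentration profiles are free to differ in shape from run to run, while a single ST is shared by all of them.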

In the resolution of a columnwise augmented data matrix, the initial estimates

can be either a single ST matrix or a columnwise augmented C matrix. The

columnwise concentration matrix is built by placing the initial C-type estimates

obtained for each data matrix in the three-way data set one on top of another.

The appended initial estimates must be sorted into the same order as the initial

data matrices in D, and they must keep a correct correspondence of species, i.e.,

each column in the augmented C matrix must be formed by appended concentration

profiles related to the same chemical compounds. When no prior information about

the identity of the compounds in the different data matrices is available, the correct

correspondence of species can be estimated from the two-way resolution results

of each single matrix.

The same constraints used in the resolution of a two-way data matrix can be

applied to three-way data sets [21, 42]. Selectivity and nonnegativity affect the

spectrum and the augmented concentration profile of each species, whereas unimodality is applied separately to each of the profiles appended to form the augmented concentration profile. The closure constraint operates by applying the

corresponding closure constant to each of the single matrices in the columnwise

concentration matrix. Another constraint specific to three-way data sets is the so-called correspondence among species. In each single matrix of a three-way data

set, the concentration profiles of absent compounds are set equal to zero after each

iterative cycle.
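Two of these constraints can be sketched on a list of per-matrix concentration blocks. The function names, the boolean `present` table, and the example values are hypothetical, and the closure step shown here is a plain row rescaling, which is a simplification of how closure may be implemented in practice.

```python
import numpy as np

def apply_correspondence(C_blocks, present):
    """Zero the profiles of species known to be absent from each matrix.

    C_blocks : list of concentration matrices, one per appended data matrix
    present  : boolean array (n_matrices, n_species); True if the species occurs
    """
    return [Ck * present[k] for k, Ck in enumerate(C_blocks)]

def apply_closure(C_blocks, totals):
    """Rescale each row so that it sums to the closure constant of its matrix."""
    closed = []
    for Ck, total in zip(C_blocks, totals):
        row_sums = Ck.sum(axis=1, keepdims=True)
        row_sums[row_sums == 0] = 1.0        # leave all-zero rows untouched
        closed.append(Ck * total / row_sums)
    return closed

C_mix = np.array([[0.2, 0.6], [0.5, 0.5], [0.9, 0.3]])
C_std = np.array([[0.4, 0.1], [0.8, 0.05]])  # standard containing only species 1
present = np.array([[True, True], [True, False]])

C_blocks = apply_correspondence([C_mix, C_std], present)
C_blocks = apply_closure(C_blocks, totals=[1.0, 2.0])
print(C_blocks[1])
```

Each matrix receives its own closure constant, and the second species is forced to zero only in the matrix where it is known to be absent.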

Although MCR-ALS is especially well suited to nontrilinear data sets formed

by matrices of varying sizes, it can also work with trilinear data sets. Because of the

inherent freedom in the modeling of the profiles of the augmented C matrix, trilinear

structure can be included in the MCR-ALS method as an optional constraint [81, 82].

The application of this constraint is performed separately on the concentration profile

of each species. To implement this type of constraint, profiles of a certain species are

placed side by side to form a new augmented concentration profile matrix,

and PCA is performed on it. If the system is trilinear, the score vector related to the

first PC will show the real shape of the concentration profile, and the remaining PCs must

be related to noise contributions. The loadings related to the first PC are scaling factors

accounting for the species concentration level in the different appended matrices. Therefore, the new single profiles can be calculated as the product of the score vector by

their corresponding scaling factors. The constrained single profiles are finally appended to form the new augmented concentration profiles. This process is shown graphically in Figure 11.12. In contrast to other three-way resolution methods specially designed to work with trilinear systems, the implementation of this constraint in MCR-ALS need not be all inclusive, i.e., some or all of the compounds can be forced to have common profiles in the C matrix. This flexibility allows a more representative modeling of some real situations, such as systems that combine (a) trilinear profiles related to the evolution of chemical compounds common to each experiment and (b) freely modeled profiles related to important background contributions that differ in each experiment.

FIGURE 11.12 Implementation of a trilinearity constraint in MCR bilinear models for three-way data. Panel labels: selection of profile, folding profile, PCA, 1st score, loadings, unfolding profile, updating profile.
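The folding, PCA, and unfolding steps of the trilinearity constraint can be sketched for the profile of one species. The sketch assumes that all appended matrices have the same number of rows and uses a singular value decomposition without mean centering in place of PCA; the function name and the test profile are hypothetical.

```python
import numpy as np

def trilinearity_constraint(c_aug, n_matrices):
    """Force the augmented profile of one species to share a single shape.

    c_aug : 1-D augmented concentration profile of one species; every appended
            matrix is assumed to contribute the same number of rows.
    """
    folded = c_aug.reshape(n_matrices, -1).T            # one column per matrix
    U, s, Vt = np.linalg.svd(folded, full_matrices=False)
    score = U[:, 0] * s[0]                              # common profile shape
    loadings = Vt[0]                                    # scaling factor per matrix
    constrained = np.outer(score, loadings)             # rank-one reproduction
    return constrained.T.reshape(-1), score, loadings   # unfold into one profile

# A profile that is already trilinear: the same shape at three concentration levels
shape = np.exp(-0.5 * ((np.arange(20) - 8) / 3.0) ** 2)
c_aug = np.concatenate([1.0 * shape, 0.5 * shape, 2.0 * shape])

c_new, score, loadings = trilinearity_constraint(c_aug, n_matrices=3)
print(np.allclose(c_new, c_aug))
print(loadings / loadings[0])
```

A profile that already obeys the trilinear model passes through unchanged, and the ratios of the loadings recover the relative concentration levels. Applying the function only to selected columns of C leaves the remaining species free, as described above.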

The information in the third direction of the array, i.e., the Z matrix, is directly

extracted from the augmented matrix C in MCR-ALS. This dimension of the data

set is usually the smallest in size and represents scaling differences among the

appended matrices. Because the ST profile of each compound is common to all of

the appended data matrices, the area of the concentration profile of each compound

is scaled according to the concentration level of the species in each single data

matrix. Thus, the profile of a compound in the Z matrix accounts for the relative

concentration of a particular compound in each of the appended matrices and can

be obtained as the ratio between (a) the area of its concentration profile in a given

matrix and (b) the area of the concentration profile of the same compound

in a matrix taken as a reference.
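The area ratio can be sketched as follows, taking the sum of a profile as a stand-in for its area, which assumes evenly spaced rows with the same spacing in every matrix. The function name and the simulated profiles are hypothetical, and a species that is absent from the reference matrix would need separate handling.

```python
import numpy as np

def relative_amounts(C_blocks, reference=0):
    """Area of each species' profile in every matrix, relative to a reference matrix."""
    areas = np.array([Ck.sum(axis=0) for Ck in C_blocks])   # (n_matrices, n_species)
    return areas / areas[reference]

t = np.arange(30)
p1 = np.exp(-0.5 * ((t - 10) / 3.0) ** 2)
p2 = np.exp(-0.5 * ((t - 18) / 4.0) ** 2)

C_ref = np.column_stack([p1, p2])
C_run = np.column_stack([0.5 * p1, 2.0 * p2])   # half of species 1, twice species 2

Z = relative_amounts([C_ref, C_run])
print(Z)
```

Each row of `Z` corresponds to one appended matrix and each column to one species, with the reference matrix giving a row of ones.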



11.7 UNCERTAINTY IN RESOLUTION RESULTS, RANGE OF FEASIBLE SOLUTIONS, AND ERROR IN RESOLUTION

The main sources of uncertainty associated with the resolution results are the ambiguity of the recovered profiles and the experimental noise of the data. Providing

methodologies to quantify this uncertainty is not only a topic of interest in the current






literature, but a necessary requirement to enable the use of resolution methods in

standard analytical procedures.

The possible existence of ambiguity in resolution results has been known since

the earliest research in this area [1, 2, 23]. After years of experience, it has been

possible to set resolution theorems that indicate clearly the conditions needed to

recover uniquely the pure concentration and signal profiles of a compound in a data

set. These conditions depend mainly on the degree of selectivity in the column mode

or row mode of the measurements. The degree of selectivity, in turn, depends on (a)

the amount of overlap in the region of occurrence for the compound of interest with

the rest of the constituents and (b) the general distribution of the different compound

windows in the data set [21, 22]. Therefore, in the same system, some profiles can

be recovered uniquely, and some others will necessarily be affected by a certain

ambiguity. When ambiguity exists, a compound is represented by a band of feasible

solutions instead of a unique profile. Calculating the boundaries of these bands is

not straightforward, and the first attempts proposed were valid only for systems with

two or three components [1–4]. More recent approaches extended their applicability

to systems with no limit in the number of contributions [83]. The most recent

methods use optimization strategies to find the minimum and maximum solution

boundaries by minimizing and maximizing objective functions subject to selected

constraints. The objective functions represent the ratio between the signal contributed

by the compound of interest and the total signal from all compounds in the data set

[84, 85]. These strategies are more powerful than previous ones and allow for an

accurate study of the effect of the different constraints in the magnitude of the bands

of feasible solutions (see Figure 11.13).
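The kind of objective function described above can be sketched for a small simulated system. Measuring the signals with Frobenius norms is an assumption of this sketch, as are the function name, the profiles, and the transformation matrix T; the constrained minimization and maximization themselves are not shown.

```python
import numpy as np

def relative_signal(C, St, i):
    """Ratio between the signal of species i and the total signal (Frobenius norms)."""
    return np.linalg.norm(np.outer(C[:, i], St[i])) / np.linalg.norm(C @ St)

# Hypothetical two-component system
x = np.linspace(0, 1, 30)
C = np.column_stack([np.exp(-4 * x), 1 - np.exp(-4 * x)])
St = np.array([[1.0, 0.6, 0.2, 0.1],
               [0.1, 0.3, 0.7, 1.0]])

# An invertible T gives another pair of profiles with the same reproduction of D
T = np.array([[1.0, 0.1],
              [0.0, 1.0]])
C_rot, St_rot = C @ T, np.linalg.inv(T) @ St

print(np.allclose(C @ St, C_rot @ St_rot))                          # identical fit
print(relative_signal(C, St, 0), relative_signal(C_rot, St_rot, 0)) # different ratios
```

Both sets of profiles reproduce the data equally well and remain nonnegative, yet they assign a different share of the signal to the first species. Searching over T for the smallest and largest value of this ratio, subject to the chosen constraints, is what delimits the band of feasible solutions.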



FIGURE 11.13 Effect of constraints in feasible bands. Tmax and Tmin define maximum and minimum feasible ranges/bands (dashed and dotted lines, respectively) around the resolved solutions (solid lines).


