7 UNCERTAINTY IN RESOLUTION RESULTS, RANGE OF FEASIBLE SOLUTIONS, AND ERROR IN RESOLUTION


DK4712_C011.fm Page 447 Thursday, March 16, 2006 3:40 PM



Multivariate Curve Resolution






literature, but a necessary requirement to enable the use of resolution methods in

standard analytical procedures.

The possible existence of ambiguity in resolution results has been known since

the earliest research in this area [1, 2, 23]. After years of experience, it has been

possible to formulate resolution theorems that state clearly the conditions needed to

recover uniquely the pure concentration and signal profiles of a compound in a data

set. These conditions depend mainly on the degree of selectivity in the column mode

or row mode of the measurements. The degree of selectivity, in turn, depends on (a)

the amount of overlap in the region of occurrence for the compound of interest with

the rest of the constituents and (b) the general distribution of the different compound

windows in the data set [21, 22]. Therefore, in the same system, some profiles can

be recovered uniquely, and some others will necessarily be affected by a certain

ambiguity. When ambiguity exists, a compound is represented by a band of feasible

solutions instead of a unique profile. Calculating the boundaries of these bands is

not straightforward, and the first attempts proposed were valid only for systems with

two or three components [1–4]. More recent approaches extended their applicability

to systems with no limit in the number of contributions [83]. The most recent

methods use optimization strategies to find the minimum and maximum solution

boundaries by minimizing and maximizing objective functions subject to selected

constraints. The objective functions represent the ratio between the signal contributed

by the compound of interest and the total signal from all compounds in the data set

[84, 85]. These strategies are more powerful than previous ones and allow for an

accurate study of the effect of the different constraints on the width of the bands

of feasible solutions (see Figure 11.13).
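The objective function described above can be sketched numerically. The Python fragment below is only an illustration under stated assumptions, not the implementation of [84, 85]: it assumes the bilinear model D = C S^T, uses Frobenius norms as the measure of "signal", and introduces a hypothetical helper name, signal_fraction. The constrained minimization and maximization over the transformation matrix T, which would yield Tmin and Tmax, is omitted.

```python
import numpy as np

def signal_fraction(C, St, T, n):
    """Ratio of the signal of component n to the total signal after
    transforming the profiles with an invertible matrix T (assumed form)."""
    C_new = C @ T                        # transformed concentration profiles
    St_new = np.linalg.inv(T) @ St       # transformed spectra; the product is unchanged
    part = np.outer(C_new[:, n], St_new[n, :])    # contribution of component n
    return np.linalg.norm(part) / np.linalg.norm(C_new @ St_new)

# Toy two-component system with made-up numbers.
C = np.array([[1.0, 0.0], [0.6, 0.4], [0.2, 0.8], [0.0, 1.0]])
St = np.array([[1.0, 0.5, 0.1], [0.1, 0.6, 1.0]])
T = np.array([[1.0, 0.2], [0.0, 1.0]])   # one candidate transformation

f_identity = signal_fraction(C, St, np.eye(2), 0)
f_rotated = signal_fraction(C, St, T, 0)
```

Both sets of profiles reproduce the same data matrix, yet the signal fraction of the first component differs. This particular T also drives one element of the transformed spectrum below zero, so a nonnegativity constraint on the spectra would exclude it from the feasible band.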



[Figure 11.13: plots not reproduced. Resolved profiles are shown together with the Tmax and Tmin boundaries of the feasible bands.]

FIGURE 11.13 Effect of constraints in feasible bands. Tmax and Tmin define maximum and minimum feasible ranges/bands (dashed and dotted lines, respectively) around the resolved solutions (solid lines).



© 2006 by Taylor & Francis Group, LLC






Practical Guide to Chemometrics



Even in the absence of ambiguity, the experimental error contained in real data

propagates into the resolution results. This source of uncertainty affects the results

of all kinds of data analysis methods and, in simpler approaches, like multivariate or

univariate calibration, is easily quantified with the use of well-established and generally accepted figures of merit. Although some figures of merit have been proposed

for higher-order calibration methods [86], finding analytical expressions to assess the

error associated with resolution results is an extremely complex problem because of

the huge number of nonlinear parameters that are calculated, as many as the number

of elements in all of the pure profiles resolved. To overcome this problem and still

give a reliable approximate estimation of the error propagation in resolution, other

strategies known under the general name of resampling methods are used [87, 88].

In these strategies, an estimate of the dispersion in the resolution results is obtained

by the resolution of a huge number of replicates. To simulate these replicates, the

complete data set can be resolved multiple times after adding a certain amount of

noise on top of the experimental measurements (noise-added method), or the replicates of the data set can be constructed by addition of a certain amount of noise to

a noise-free simulated or reproduced data set (Monte Carlo simulations). Finally, a

data matrix can be resolved repeatedly after removing different rows or columns, or

in the case of a three-way data set, by removing complete data matrices (jackknife).

These strategies provide an enormous number of results from the different resolution

runs to allow for an estimation of the uncertainty due to the noise in the resolved

profiles. The estimate of uncertainty in resolved profiles in turn allows for the computation of the accuracy of parameters estimated from the resolved profiles, such as

rate constants or equilibrium constants [64]. Resampling and Monte Carlo simulation

methods have been recently proposed for estimating uncertainty of multivariate curve-resolution profiles and of the parameters derived from them [89].
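The noise-added strategy can be illustrated with a short Python sketch. Several things here are assumptions made for the example: the data are simulated, the noise is Gaussian with an assumed standard deviation, and the resolution run is replaced by a stand-in (a least-squares estimate of C for known spectra) so that the example stays short. In practice each replicate would be submitted to the full resolution method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated system: 30 samples, 2 components, 40 channels.
t = np.linspace(0.0, 1.0, 30)
C_true = np.column_stack([1.0 - t, t])
x = np.linspace(0.0, 1.0, 40)
St = np.vstack([np.exp(-((x - 0.3) / 0.1) ** 2),
                np.exp(-((x - 0.7) / 0.1) ** 2)])
D = C_true @ St + rng.normal(scale=0.01, size=(30, 40))   # "experimental" data

def resolve(D, St):
    """Stand-in for a resolution run: least-squares C for known spectra."""
    return D @ np.linalg.pinv(St)

n_replicates = 200
noise_sd = 0.01                          # assumed noise level
runs = np.empty((n_replicates, 30, 2))
for r in range(n_replicates):
    D_rep = D + rng.normal(scale=noise_sd, size=D.shape)  # noise-added replicate
    runs[r] = resolve(D_rep, St)

C_mean = runs.mean(axis=0)               # average resolved profiles
C_sd = runs.std(axis=0, ddof=1)          # element-wise dispersion estimate
```

A Monte Carlo variant would add the noise to a noise-free reproduced matrix instead of D, and a jackknife variant would delete rows or columns before each run.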

Although ambiguity and noise are two distinct sources of uncertainty in resolution, their effect on the resolution results cannot be considered independently. For

example, the boundaries of the compound windows can be clearly blurred due to

the effects of noise, and this can give rise to ambiguities that would be absent in

noise-free data sets. A definite advance would be the development of approaches

that can consider this combined effect in the estimation of resolution uncertainty.



11.8 APPLICATIONS

Within the variety of multicomponent systems, “processes” and “mixtures” can be

placed at the two extremes. The term “process” holds for reaction data, where the

compositional changes respond to a known physicochemical model, or for any

evolving chemical system (e.g., a chromatographic elution) whose sequential compositional variation is caused by physical or chemical changes and whose underlying

physicochemical model, if any, is too complex or simply unknown. “Mixtures”

would have a completely random variation along the compositional direction of the

data set. An example could be a series of spectra collected from independent multicomponent samples. Other data sets lie between these two extremes because they

lack the global continuous compositional evolution of a process, although they can

present it locally. For example, spectroscopic images can have a smooth compositional






variation in neighboring pixels. In this respect, environmental data sets are similar,

where close geographical sampling points can be compositionally related.

The examples that are given in the following subsections show the power of

multivariate curve resolution to resolve very diverse chemical problems. Different

strategies adapted to the chemical and mathematical features of the data sets are

chosen, and resolution of two-way or three-way data sets is carried out according

to the information that has to be recovered. Because MCR-ALS has proved to be a

very versatile resolution method, able to deal with two-way and three-way data sets,

this is the method used in all of the following examples.



11.8.1 BIOCHEMICAL PROCESSES

Biochemical processes are among the most challenging and interesting reaction

systems. Due to the nature of the constituents involved, macromolecules such as

nucleic acids or proteins, the processes to be analyzed do not follow a simple

physicochemical model, and their mechanism cannot be easily predicted. For example, well-known reactions for simple molecules, e.g., protonation equilibria, increase

in complexity for macromolecules due to the presence of polyelectrolytic effects or

conformational transitions. Because the data analysis cannot be based on a

model-fitting procedure (hard-modeling methods), the analysis of these processes

requires soft-modeling methods that can unravel the contributions of the process

without the assumption of an a priori model.

Examples of biochemical processes successfully studied by spectroscopic monitoring and multivariate resolution techniques include protonation and complexation

of nucleic acids and other events linked to these biomolecules, such as drug intercalation processes and salt, solvent, or temperature-induced conformational transitions [90–97]. In general, any change (thermodynamic or structural) that these

biomolecules undergo is manifested through a distinct variation in an instrumental

signal (usually spectroscopic) and can be potentially analyzed by multivariate resolution techniques.

A relevant example in this field is the study of protein folding processes [94].

They have an intrinsic biochemical interest linked to the relationship between protein

structure and biological activity. Protein structure is organized into four hierarchical

levels, including primary structure (the sequence of amino acids in the polypeptide

chains), secondary structure (the regular spatial arrangements of the backbone of

the polypeptide chain stabilized by hydrogen bonds that give rise to helical or flat

sheet arrangements), tertiary structure (spatial arrangement of the secondary structure motifs within a polypeptide chain, responsible for the globular or fibrillar nature

of proteins), and quaternary structure (union of several polypeptide chains by weak

forces or disulfide bridges). All proteins must adopt specific three-dimensional folded

structures to acquire the so-called native conformation and be active, i.e., their

secondary and tertiary structure should be fully organized.

Protein folding can take place as a one-step process, where only the folded or

native (N) and the unfolded (U) states are detected, or as a multistep process, where

intermediate conformations occur. The so-called molten globular state has often been

reported as a well-characterized intermediate conformation that shows organized






secondary structure motifs and unordered tertiary structure [94–96]. Indeed, protein

folding events most often follow one of two mechanisms: (a) one-step

process: Native (N) ↔ Unfolded (U); or (b) two-step process: Native (N) ↔ Intermediate (“molten globule”) (I) ↔ Unfolded (U). The detection and characterization

of intermediate conformations is not easy because either the lifetime of these transient intermediates is frequently too short to be detected, or it is not possible to

separate and isolate them from other protein conformations simultaneously present.

The mechanism and the identity of the protein conformations involved in a

folding process can be studied by monitoring spectrometrically the changes in the

protein tertiary and secondary structures. Far-UV circular dichroism has long been

used to elucidate the secondary structure of proteins and to follow specifically

changes in this structural level. Near-UV circular dichroism is known to be sensitive

to changes in the protein tertiary structure. These two techniques were used to

monitor the thermal unfolding of bovine α-apolactalbumin, a globular protein present

in milk with major alpha helical content in the secondary structure (see Figure 11.14).

Protein folding can be studied by recording a complete spectrum at each stage

in the monitored process. The spectra recorded in a thermal-dependent protein

folding process are organized in a data matrix D, where rows represent spectra

recorded at each temperature and columns represent the melting curves (absorbance

vs. temperature profiles) at each wavelength. Recalling the general expression of

resolution methods (Equation 11.1), $\mathbf{D} = \mathbf{C}\mathbf{S}^{T} + \mathbf{E}$, the columns in matrix $\mathbf{C}$ represent
the thermal-dependent concentration profiles of the detected protein conformations,
and the rows in matrix $\mathbf{S}^{T}$ represent their corresponding pure spectra. Matrix $\mathbf{E}$
describes the experimental error.

FIGURE 11.14 Structure of α-apolactalbumin.

The results presented below were obtained by multivariate curve resolution–alternating least squares (MCR-ALS). MCR-ALS was selected because of its flexibility in the application of constraints and its ability to handle either one data matrix (two-way data sets) or several data matrices together (three-way data sets). MCR-ALS has been applied to the folding process monitored using only one spectroscopic technique and to a row-wise augmented matrix, obtained by appending spectroscopic measurements from several different techniques.
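Row-wise augmentation can be pictured with a small NumPy sketch. The matrix sizes and the random placeholder values are assumptions chosen for illustration. The point is only that two data matrices recorded at the same temperatures share one matrix C, while their spectra are placed side by side.

```python
import numpy as np

rng = np.random.default_rng(1)
n_temps, n_species = 25, 3
C = rng.random((n_temps, n_species))        # shared concentration profiles
S1t = rng.normal(size=(n_species, 36))      # pure spectra, technique 1 (placeholder)
S2t = rng.normal(size=(n_species, 61))      # pure spectra, technique 2 (placeholder)

D1 = C @ S1t                                # technique 1, same temperatures
D2 = C @ S2t                                # technique 2, same temperatures

D_aug = np.hstack([D1, D2])                 # row-wise augmented matrix [D1 D2]
St_aug = np.hstack([S1t, S2t])              # augmented spectra [S1^T S2^T]
```

The augmented matrix still obeys a single bilinear model, D_aug = C St_aug, which is why one resolution run can describe all techniques with a common set of concentration profiles.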

In general, MCR-ALS resolution analyses of protein folding have been performed using initial estimates obtained by evolving-factor analysis (EFA), the

method most suitable to describe the evolution of the contributions present in

processes. The concentration profiles in C were constrained to be nonnegative

and unimodal with closure. Unimodality was used because the evolution of each

protein conformation can be appropriately represented by an emergence-decay

profile having a single peak maximum, as long as the temperature increases or

decreases monotonically during the experiment. The condition of closure is

appropriate in these systems because the total concentration of protein remains

constant during the unfolding/folding process. Selectivity constraints were also

applied at the lowest temperatures, where only the native conformation of the

protein is supposed to be present. Circular dichroism (CD) spectra in $\mathbf{S}^{T}$ are not

forced to be nonnegative because negative ellipticities can naturally occur in CD

spectra.
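The constraints listed above can be sketched in a minimal alternating least-squares loop. This is a simplified illustration, not the MCR-ALS software used by the authors. It makes several assumptions: the unimodality step is a simple non-least-squares correction, the selectivity constraint and the EFA initial estimate are omitted, the data are simulated, and the function names are hypothetical. Convergence to the true profiles is not guaranteed.

```python
import numpy as np

def unimodal(c):
    """Force a single maximum: values may not rise again when moving
    away from the highest point (simple, non-least-squares correction)."""
    c = c.copy()
    k = int(np.argmax(c))
    for i in range(k - 1, -1, -1):
        c[i] = min(c[i], c[i + 1])
    for i in range(k + 1, c.size):
        c[i] = min(c[i], c[i - 1])
    return c

def mcr_als(D, C0, total=1.0, n_iter=50):
    """Minimal ALS loop: spectra unconstrained, concentrations constrained."""
    C = C0.copy()
    for _ in range(n_iter):
        St = np.linalg.pinv(C) @ D                   # least-squares spectra, sign free
        C = D @ np.linalg.pinv(St)                   # least-squares concentrations
        C = np.clip(C, 0.0, None)                    # nonnegativity
        C = np.column_stack([unimodal(col) for col in C.T])   # unimodality
        C = total * C / C.sum(axis=1, keepdims=True)          # closure
    St = np.linalg.pinv(C) @ D                       # spectra for the final C
    return C, St

# Simulated two-state thermal unfolding with negative (CD-like) spectra.
temps = np.arange(20.0, 91.0, 2.0)
native = 1.0 / (1.0 + np.exp((temps - 60.0) / 4.0))
C_true = np.column_stack([native, 1.0 - native])
wl = np.linspace(200.0, 250.0, 51)
St_true = np.vstack([-800.0 * np.exp(-((wl - 220.0) / 12.0) ** 2),
                     -300.0 * np.exp(-((wl - 202.0) / 6.0) ** 2)])
D = C_true @ St_true

guess = 1.0 / (1.0 + np.exp((temps - 55.0) / 6.0))   # rough initial estimate
C, St = mcr_als(D, np.column_stack([guess, 1.0 - guess]))
```

Because closure is applied last in each cycle, every row of C sums to the assumed total concentration, while the spectra remain free to take the negative values expected for ellipticities.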

Different arrangements of the data from these experiments have allowed the

study of several aspects linked to protein folding, namely: (a) changes in the protein

secondary structure, (b) changes in the protein tertiary structure, and (c) global

mechanistic and structural description of the protein-folding process. The results

obtained are briefly presented in the following subsections.

11.8.1.1 Study of Changes in the Protein Secondary Structure

Changes in protein secondary structure were studied by the analysis of the

far-UV (190 to 250 nm) circular dichroism spectra from a single data matrix.

Resolved concentration and spectra profiles are shown in Figure 11.15a. The

resolved CD spectrum associated with the native conformation shows the typical

spectral features associated with a major contribution of the α-helix motif in its

secondary structure, i.e., an intense negative band with two shoulders located

around 220 and 210 nm [95, 96]. The resolved spectrum for the unfolded conformation shows the typical spectral features associated with a random coil motif

(a sharper negative band at short wavelengths and weaker features around 220 nm).

The resolved concentration profiles show the thermal evolution of the concentration of the different protein conformations in the process. The crossing point in the plot of these concentration profiles gives the temperature at which 50% of the protein retains its native secondary structure (Tsec), which is about 60°C.
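Reading a crossing temperature off two resolved concentration profiles can be sketched as follows. The function name and the synthetic sigmoidal profiles are assumptions made for illustration; the midpoint of 60°C was chosen only to echo the value quoted in the text. The crossing is located by linear interpolation between the two grid points where the difference of the profiles changes sign.

```python
import numpy as np

def crossing_temperature(temps, c_a, c_b):
    """Temperature at which two concentration profiles cross,
    found by linear interpolation at the first sign change."""
    diff = np.asarray(c_a) - np.asarray(c_b)
    idx = np.where(np.sign(diff[:-1]) * np.sign(diff[1:]) < 0)[0]
    if idx.size == 0:
        raise ValueError("profiles do not cross between grid points")
    i = idx[0]
    frac = diff[i] / (diff[i] - diff[i + 1])
    return temps[i] + frac * (temps[i + 1] - temps[i])

# Synthetic two-state profiles (made-up midpoint and width).
temps = np.arange(21.0, 90.0, 2.0)
native = 1.0 / (1.0 + np.exp((temps - 60.0) / 4.0))
unfolded = 1.0 - native

t_cross = crossing_temperature(temps, native, unfolded)
```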



[Figure 11.15: plots not reproduced. Panels (a) and (b) show concentration vs. temperature (°C) and ellipticity vs. wavelength (nm). Panel (c) shows the augmented arrangement [D1 D2] = C [S1^T S2^T], with concentration (mg/ml) vs. temperature (°C) and ellipticity vs. wavelength (nm). Temperature labels appearing in the plots: 59.7°C, 42.73°C, 44°C, and 58.8°C.]

FIGURE 11.15 Resolution of the protein folding of α-apolactalbumin. (a) Detection of changes in protein secondary structure (far-UV circular dichroism measurements). (b) Detection of changes in protein tertiary structure (near-UV circular dichroism measurements). (c) Complete description of protein folding. Resolution of the row-wise data set formed by near-UV (D1) and far-UV (D2) circular dichroism measurements. Solid line: native conformation, dash-dotted line: intermediate conformation, dotted line: unfolded conformation.



11.8.1.2 Study of Changes in the Tertiary Structure

Changes in tertiary structure during protein folding processes have been studied using near-UV (250 to 330 nm) circular dichroism. Figure 11.15b shows the MCR-ALS resolved concentration profiles and spectra related to the different tertiary structures present during the folding of α-apolactalbumin. The CD spectrum of the folded protein conformation shows a very intense negative band [95], whereas the denatured/unfolded species shows a nearly flat signal in the CD spectrum. In this experiment, the crossing point in the plot of concentration profiles represents the temperature at which 50% of the protein retains its native tertiary structure (Ttert), which happens at about 40°C for α-apolactalbumin.

11.8.1.3 Global Description of the Protein Folding Process

To check for the presence of an intermediate in a protein folding process, the

temperatures at which the secondary structure (Tsec) and the tertiary structure (Ttert)

of the folded conformation are half-formed can be compared. If both coincide,

the protein loses the tertiary and the secondary structures simultaneously, and only

a native conformation with secondary and tertiary structures ordered or an unfolded

conformation with both structural levels unordered describe the process. If significant differences are observed in the crossing temperatures of concentration profiles, a new, intermediate third species with the secondary structure ordered and

the tertiary unordered may be needed to explain the shift in the appearance of the

tertiary and secondary structures. The difference of almost 20°C found between

Tsec and Ttert in the above two experiments seems to guarantee the presence of an

intermediate conformation in the folding of α-apolactalbumin, but only the multivariate resolution analysis of the suitable measurements (far-UV and near-UV

CD spectra) together can confirm this hypothesis and model the appearance of the

intermediate conformation.

Figure 11.15c shows the resolved concentration profiles and spectra coming from the row-wise appended matrix containing the near-UV and far-UV CD data described previously. An additional intermediate conformation proved necessary to explain the protein folding process of α-apolactalbumin. Additionally, the thermal range of occurrence and the evolution of this intermediate can now be known. The resolved spectrum obtained for the α-apolactalbumin intermediate shows that it has an ordered secondary structure similar to the native folded protein and an unordered tertiary structure similar to the unfolded protein at high temperatures. These spectral features match the spectral description attributed to the molten globular state and provide additional evidence to confirm the presence of this species as a real intermediate conformation.

As has been shown, complex protein folding processes involving the presence

of intermediate conformations can be successfully described combining multispectroscopic monitoring and multivariate curve resolution. The detection and modeling

of intermediate species that cannot be isolated either by physical or chemical means

is fully achieved. The fate of the intermediate during the process, i.e., when it is

present and in what amount, is unraveled from the original raw measurements.






The spectral information obtained by using appropriate deconvolution approaches,

particularly the resolved pure CD spectrum of the intermediate, is the essential

starting point, unobtainable by other methods, for deducing the secondary structure

of the intermediate.



11.8.2 ENVIRONMENTAL DATA

Principal component analysis and multivariate curve resolution are powerful tools

for the investigation and modeling of large multivariate environmental data arrays

measured over long periods of time in environmental monitoring programs [98].

The goals of these studies are the computation, resolution, modeling, screening, and

graphical display of patterns in large environmental data sets, looking for possible

data groupings and sources of environmental pollution, as well as for their temporal

and geographical distribution. The fundamental assumption in these studies is that

variance in the measured concentrations of contaminants (or properties) can be

attributed to a small number of contamination sources of different origin (industrial,

agricultural, etc.) and that they can be modeled by profiles describing their chemical

composition, their geographical distribution, and their temporal distribution. Large

environmental analytical data arrays containing concentration information of multiple chemical compounds collected at different sampling sites and at different

sampling periods are arranged in large data tables or matrices, or in more complex

data arrays according to different dimensions, ways, modes, orders, or directions of

experimental measurement. In the chemometrics literature, these complex ordered

data arrays are commonly called multiway data arrays or higher-order tensor data.

As discussed in Chapter 4, principal component analysis (PCA) [99] is one of

the multivariate data analysis methods most frequently used in exploratory analysis

and modeling of two-way data arrays (data tables or data matrices). PCA allows the

transformation and visualization of complex data sets into a new perspective in

which the most relevant information is made more obvious. In environmental studies,

by use of PCA, main contamination sources can be identified, and their geographical

and temporal distributions can be interpreted and further investigated. Multivariate

curve resolution using alternating least squares (MCR-ALS) has also been proposed

to achieve similar goals [15, 16, 98, 100, 101]. Although this method is traditionally

used for curve-resolution purposes, i.e., to resolve spectra and concentration profiles

having a smooth curved shape, there is no fundamental reason why this method

cannot also be used for resolution and modeling of noncurved and nonsmooth types

of profiles, including those describing geographical and temporal pollution patterns

observed in environmental data studies.

Both PCA and MCR-ALS can be easily extended to complex data arrays ordered

in more than two ways or modes, giving three-way data arrays (data cubes or

parallelepipeds) or multiway data arrays. In PCA and MCR-ALS, the multiway data

set is unfolded prior to data analysis to give an augmented two-way data matrix.

After analysis is complete, the resolved two-way profiles can be regrouped to recover

the profiles in the three modes. The current state of the art in multiway data analysis

includes, however, other methods where the structure of the multiway data array is

explicitly built into the model and fixed during the resolution process. Among these






methods, those based on parallel factor analysis (PARAFAC) and Tucker multiway

models have been developed in recent years [77, 78] and have been used for the

analysis of environmental data sets [102, 103]. A complete description of these

methods is given in Chapter 12.

In the example described here, multiway principal component analysis and

multivariate curve resolution are compared in the analysis of a data set obtained in

an exhaustive and systematic monitoring program in Portugal [16]. The study

included measurements of 19 priority semivolatile organic compounds (SVOC) in

a total of 644 surface-water samples distributed among 46 different geographical

sites during a period of 14 months. These data arrays were organized into 14

data matrices, each one corresponding to one month of the sampling campaign

(Figure 11.16). The resulting data set was arranged as one single columnwise

augmented large concentration data matrix of dimensions 644 × 19. In the columnwise

augmented data matrix, the individual concentration data matrices from different

months were stacked consecutively one on top of the other. Only one of the three

modes is unambiguous in the columns, the composition mode (19 SVOCs). The

other two modes (geographical and temporal) are mixed in the rows of the columnwise

data matrix.
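The column-wise arrangement can be sketched in a few lines of NumPy. The concentration values are random placeholders and the helper row_index is a hypothetical name. Only the dimensions (46 sites, 14 months, 19 SVOCs) come from the text.

```python
import numpy as np

rng = np.random.default_rng(2)
n_sites, n_months, n_svoc = 46, 14, 19

# One (sites x SVOC) concentration matrix per month; random placeholders here.
monthly = [rng.random((n_sites, n_svoc)) for _ in range(n_months)]

D_aug = np.vstack(monthly)             # months stacked one on top of the other

def row_index(month, site):
    """Row of D_aug holding a given (month, site) pair, both zero-based."""
    return month * n_sites + site
```

The helper makes explicit how the geographical and temporal modes are interleaved in the rows: all sites of month 1 come first, then all sites of month 2, and so on.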



[Figure 11.16: diagram not reproduced. Fourteen monthly matrices (Month 1 to Month 14), each 46 sampling sites × 19 SVOCs, are stacked into $\mathbf{D}^{\mathrm{aug}}$ (644 × 19) and decomposed into scores $\mathbf{U}^{\mathrm{aug}}$ (644 × N) and loadings $\mathbf{V}^{T}$ (N × 19). Each score vector $\mathbf{u}^{\mathrm{aug}}$ (644 × 1) is refolded into a matrix $\mathbf{U}$ (46 × 14); its column means give $\mathbf{U}_{\mathrm{temp}}$ (N × 14) and its row means give $\mathbf{U}_{\mathrm{geo}}$ (46 × N).]

FIGURE 11.16 Data matrix augmentation arrangement and bilinear model for PCA or MCR-ALS decompositions. Resolved loadings, $\mathbf{V}^{T}$, provide the identification of the main sources of data variance (contamination sources). Resolved scores, $\mathbf{U}^{\mathrm{aug}}$, provide the identification of the temporal and geographical distribution of these sources after appropriate refolding: $\mathbf{U}_{\mathrm{geo}}$ geographical distribution scores and $\mathbf{U}_{\mathrm{temp}}$ temporal distribution scores. For each component, a scores matrix is obtained by refolding its resolved long augmented score vector. Taking averages of the rows or columns of this score matrix, temporal and geographical distributions are obtained for this component.



SVOC concentration values below the limit of detection and nondetected values

were set to half the detection limit value [104]. Missing values were estimated by PCA

using the PLS Toolbox “missing” function (Eigenvector Research, Inc., Manson, WA)

for MATLAB. Different data pretreatment methods were tested and compared. They

included column mean centering, column autoscaling, and log transformation and they

were performed on the columns of the columnwise augmented data matrix. Column

mean centering removed constant background contributions, which usually were of

no interest for data variance interpretation. However, in this particular case, mean

centering caused little change in the results, since most of the values of the different

variables (SVOC) were so low that their averages were also low and close to zero.

Column scaling to unit variance increased the weight of variables that initially had

lower variances. In some cases, this effect may distort significantly the results, making

interpretation more difficult, especially for those noisy variables having only very few

values higher than the detection limit. Log transformation of experimental data is

another procedure that has been frequently recommended in the literature for skewed

data sets, like those in environmental studies where the majority of the values are low

values with a minor contribution of high values. With log data pretreatment, a more

symmetrical distribution of experimental data is expected. To avoid the large negative

logarithms of values close to zero, a constant equal to 1 was added to all of the variable

entries before the transformation. In this way, log values were always nonnegative [105].
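The pretreatments compared above can be written compactly. The sketch below is an illustration under assumptions: nondetected values are taken to be stored as zeros, the logarithm is taken in base 10 (the text does not state the base), the detection limits and concentrations are made-up numbers, and the function names are hypothetical.

```python
import numpy as np

def replace_below_lod(X, lod):
    """Set values below the detection limit (given per column) to half that limit."""
    lod = np.broadcast_to(lod, X.shape)
    return np.where(X < lod, lod / 2.0, X)

def mean_center(X):
    """Subtract the column means."""
    return X - X.mean(axis=0)

def autoscale(X):
    """Mean center and scale each column to unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def log_transform(X):
    """Logarithm after adding 1, so that the result is never negative."""
    return np.log10(X + 1.0)

# Made-up concentrations (rows: samples, columns: two variables).
X = np.array([[0.0, 2.0], [0.5, 0.01], [3.0, 0.4], [0.02, 1.2]])
lod = np.array([0.1, 0.05])

X_lod = replace_below_lod(X, lod)
X_centered = mean_center(X_lod)
X_auto = autoscale(X_lod)
X_log = log_transform(X_lod)
```

Autoscaling gives every column the same variance, which is why variables dominated by values near the detection limit can gain undue weight, as noted in the text.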

Principal component analysis (PCA) and multivariate curve resolution–alternating

least squares (MCR-ALS) were applied to the augmented columnwise data matrix

Daug, as shown in Figure 11.16. In both cases, a linear mixture model was assumed

to explain the observed data variance using a reduced number of contamination

sources. The bilinear data matrix decomposition used in both cases can be written

by Equation 11.19:

$$d^{\mathrm{aug}}_{ij} = \sum_{n=1}^{N} u^{\mathrm{aug}}_{in}\, v_{jn} + e_{ij} \qquad (11.19)$$



In this equation, $d^{\mathrm{aug}}_{ij}$ is the concentration of SVOC j in sample i in the augmented experimental data matrix $\mathbf{D}^{\mathrm{aug}}$. The variable $u^{\mathrm{aug}}_{in}$ (the score of component n on row i) is the contribution of contamination source n in sample i. The variable $v_{jn}$ (the loading of variable j on component n) is the contribution of SVOC j in contamination source n. The residual, $e_{ij}$, is the variance in sample i and variable j of $d^{\mathrm{aug}}_{ij}$ not modeled by the N environmental contamination sources. The same equation can be written in matrix form as:

$$\mathbf{D}^{\mathrm{aug}} = \mathbf{U}^{\mathrm{aug}}\mathbf{V}^{T} + \mathbf{E}^{\mathrm{aug}} \qquad (11.20)$$

where $\mathbf{D}^{\mathrm{aug}}$ is the whole data array arranged in an augmented data matrix of dimensions 46 × 14 = 644 rows (46 sampling sites, 14 months) and 19 columns (SVOC), as shown in Figure 11.16.

Equation 11.20 describes the factorization of the experimental data matrix into two factor matrices, the loadings matrix $\mathbf{V}^{T}$ and the augmented scores matrix $\mathbf{U}^{\mathrm{aug}}$. The loadings matrix $\mathbf{V}^{T}$ identifies the nature and composition of the N main contamination sources defined by means of their chemical composition (SVOC concentrations) profiles. The augmented scores matrix $\mathbf{U}^{\mathrm{aug}}$ gives the geographical and temporal distribution of these contamination sources. This geographical and temporal information is intermixed in the columns of the resolved augmented scores matrix $\mathbf{U}^{\mathrm{aug}}$, and it is not directly available from it. A relatively easy way to recover geographical ($\mathbf{U}_{\mathrm{geo}}$) and temporal ($\mathbf{U}_{\mathrm{temp}}$) information is by properly rearranging columns of scores ($\mathbf{U}^{\mathrm{aug}}$) into matrices having dimensions 46 × 14 (46 geographical sites × 14 months), as shown in Figure 11.16. Analysis of the resulting matrices by taking the row and column averages or by computing the SVD gives the average contribution of the resolved source profiles as a function of geographical location (46 × N) or by month (N × 14), as shown in Figure 11.16. In this way, $\mathbf{U}_{\mathrm{geo}}$, $\mathbf{U}_{\mathrm{temp}}$, and $\mathbf{V}^{T}$ (the three mode components) were estimated and directly compared with those resolved by three-way model-based methods like PARAFAC and Tucker3 [77, 78]. Finally, $\mathbf{E}^{\mathrm{aug}}$ gives the residual part of $\mathbf{D}^{\mathrm{aug}}$ not modeled by the N contamination sources, i.e., the unexplained variance associated with noise and minor nonmodeled environmental contamination sources. The proper complexity of the PCA model or MCR-ALS model, i.e., the number of components or contamination sources included in the model, is a compromise between different goals: model simplicity (few components), maximum variance explained by the model (more components), and model interpretability.
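The refolding step depends on the order in which the months were stacked, so a short sketch may help. It assumes the month-by-month stacking described for the augmented matrix and uses a made-up, perfectly separable score vector so that the result can be checked. The variable names are hypothetical.

```python
import numpy as np

n_sites, n_months = 46, 14
rng = np.random.default_rng(3)

# Hypothetical source whose contribution is a site pattern times a month pattern.
geo = rng.random(n_sites)
temp = rng.random(n_months)
u_aug = np.outer(temp, geo).ravel()       # month-major order, length 644

# Rows of the augmented matrix run month by month, so reshape to
# (months, sites) first and then transpose to (sites, months).
U = u_aug.reshape(n_months, n_sites).T    # 46 x 14

u_geo = U.mean(axis=1)                    # row means: average per site
u_temp = U.mean(axis=0)                   # column means: average per month
```

For this idealized vector the averages recover the site and month patterns up to a scale factor. With real scores the refolded matrix is only approximately of rank one, which is where the SVD alternative mentioned in the text becomes useful.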

Whereas PCA models provide a least-squares solution of Equations 11.19 and

11.20 under orthogonal constraints and maximum variance explained by each successively extracted component, MCR-ALS models give a nonnegative least-squares

solution of the same equation without use of orthogonal constraints or maximum

explained variance. Also, whereas PCA orthogonal solutions of Equations 11.19 and

11.20 for a two-way data matrix are unique, MCR-ALS solutions of the same

equation are not unique, and they may be rotationally ambiguous [1, 2, 21–23].

However, MCR-ALS models provide solutions that hopefully are more similar to

the real sources of pollution than PCA solutions. MCR-ALS solutions can be seen

as oblique rotated PCA solutions fulfilling nonnegativity constraints. MCR-ALS

models provide a complementary insight to the problem under study by helping to

resolve and interpret real environmental sources of data variance. In other works,

MCR-ALS has been shown to be a powerful tool for resolving species profiles in

different chemical systems [7, 10–12, 17, 20, 21, 42, 45, 46, 51, 55, 62, 67, 68, 74,

81, 82, 90–92] and, more recently, it has also been applied for the resolution of

environmental contamination sources [15, 16, 98, 100, 101].
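The orthogonality and maximum-variance properties of the PCA solution can be seen in a short sketch based on the singular value decomposition. The data matrix is a random placeholder with the dimensions of the augmented matrix, and the choice of five components simply mirrors the number used in the text. None of the numbers produced correspond to the reported results.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.random((644, 19))                  # placeholder for the pretreated data
Xc = X - X.mean(axis=0)                    # column mean centering

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
n_pc = 5
scores = U[:, :n_pc] * s[:n_pc]            # augmented scores
loadings_t = Vt[:n_pc]                     # loadings, orthonormal rows

# Percentage of variance explained by the first n_pc components.
explained = 100.0 * np.sum(s[:n_pc] ** 2) / np.sum(s ** 2)
gram = scores.T @ scores                   # diagonal for PCA scores
```

The loadings obtained this way generally contain negative entries, which is one reason they are harder to read as chemical compositions than the nonnegative profiles sought by MCR-ALS.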

MCR-ALS solutions can be additionally constrained to fulfill a trilinear model [82]. When this trilinearity constraint is applied, the profiles in the three different modes ($\mathbf{U}_{\mathrm{geo}}$, $\mathbf{U}_{\mathrm{temp}}$, and $\mathbf{V}^{T}$) are directly recovered and can be compared with the profiles obtained using PARAFAC- or Tucker-based model decompositions. MCR-ALS results have already been compared with Tucker3-ALS and PARAFAC-ALS results in the resolution of different chemical systems [81].

Table 11.1 gives the results from the application of PCA on column mean-centered

data, on column autoscaled data, and on log-transformed column mean-centered data.

Using five PCs, the amount of variance explained was 84.0, 46.1, and 69.6%, respectively. The results for the column mean-centered data were nearly identical to those

obtained for the nonmean-centered raw data. The reason for this is that the means of


