# Introduction: Why Multiway Data Analysis?

Chapter 7 Multiway Methods

FIGURE 1 Example of the arrangement of the data by using two different techniques. The result is the same: a three-way array. Nevertheless, the structure of the array depends on the nature of the measurement technique.

FIGURE 2 Three matrices with their different nomenclatures.

In this chapter several equations are used to explain the main mathematical interpretation of three-way models. Scalars are indicated with lowercase italics (e.g. $x$) and vectors with bold lowercase characters (e.g. $\mathbf{y}$). Ordinary two-way arrays (matrices) are denoted with bold uppercase (e.g. $\mathbf{X}$), whereas higher-order arrays are indicated with underscored bold uppercase (e.g. $\underline{\mathbf{X}}$). The $ijk$th element of a three-way array $\underline{\mathbf{X}}$ is $x_{ijk}$, where the indices run as follows: $i = 1, \ldots, I$; $j = 1, \ldots, J$; $k = 1, \ldots, K$. Three-way arrays will often be denoted $\underline{\mathbf{X}}$ ($I \times J \times K$), where $I$, $J$ and $K$ are the dimensions of each of the modes of the array.

Three important matrix operations should also be introduced here: the Hadamard, Kronecker and Khatri-Rao products. To explain them, we will use the following matrices as examples (Figure 2).

The Hadamard product of two matrices (A∘B) produces another matrix in which each element ij is the product of the elements ij of the original matrices. The condition is that A and B must have the same dimensions. For instance, with the matrices of Figure 2 the Hadamard product can be computed between A and B, or between A and Cᵀ (where the superscript T denotes the transpose of the matrix), but not between A and C (Figure 3).

The Kronecker product of two matrices (A⊗B), with A of dimensions I × J and B of dimensions K × L, produces another matrix with dimensions IK × JL, as indicated in Figure 3. This is one of the most widespread and useful products, as the matrices do not need to have any dimension in common. One advantageous variation of the Kronecker product is the Khatri-Rao product (written here as A⊙B), also known as the column-wise Kronecker product. Khatri-Rao assumes that the partitions of the matrices are their columns. Therefore, both matrices must have the same number of columns (Figure 3).

FIGURE 3 Some examples of Hadamard, Kronecker and Khatri-Rao products of the matrices in Figure 2.
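The three products can be tried out numerically. The following is a minimal NumPy sketch; the matrices are made-up stand-ins (the actual entries of Figure 2 are not reproduced here), chosen only so that A and B share the same shape while C has the shape of Aᵀ.

```python
import numpy as np

# Made-up example matrices (assumption: not the ones in Figure 2).
A = np.array([[1, 2, 3],
              [4, 5, 6]])        # 2 x 3
B = np.array([[1, 0, 2],
              [0, 1, 1]])        # 2 x 3, same shape as A
C = np.array([[1, 2],
              [0, 1],
              [3, 0]])           # 3 x 2, same shape as A transposed

# Hadamard product: element-wise, so the shapes must match.
hadamard_AB = A * B
hadamard_ACt = A * C.T           # works because C.T is 2 x 3

# Kronecker product: no shared dimension needed; result is (2*3) x (3*2).
kronecker_AC = np.kron(A, C)

# Khatri-Rao product: Kronecker product taken column by column,
# so both matrices need the same number of columns.
khatri_rao_AB = np.column_stack(
    [np.kron(A[:, f], B[:, f]) for f in range(A.shape[1])]
)

print(hadamard_AB.shape, kronecker_AC.shape, khatri_rao_AB.shape)
# (2, 3) (6, 6) (4, 3)
```

Attempting `A * C` raises a `ValueError`, which mirrors the statement that the Hadamard product cannot be computed between A and C.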

As we will see further in this chapter, knowing the structure of the data plays a fundamental role when applying any multiway technique. To illustrate this, we will comment on the data collected from the two instrumental techniques most widely used in food science today that are able to produce multiway data: excitation–emission fluorescence spectroscopy (EEM) and hyphenated chromatographic systems (e.g. gas chromatography coupled to mass spectrometry, GC–MS). The benefits and drawbacks of both techniques in the framework of food analysis will be discussed in successive chapters. Here we will focus only on the structure of the three-way array. Figure 1 shows the final three-way structure that is obtained when several samples are analysed by both EEM and hyphenated chromatography. However, the inner structure of this tensor varies owing to the different nature of the measurement.

EEM is a function of two variables: excitation and emission. One sample measured by EEM can thus conveniently be presented as a matrix of fluorescence intensities as a function of excitation and emission wavelengths. The fluorescence landscape X(I,J) can therefore be described as a function of a concentration-dependent factor, a, and its excitation, b(F,λEm), and emission, c(F,λEx), thus establishing the following linear relationship for each element xij of X:

$$x_{ij} = \sum_{f=1}^{F} a_{ijf}\, b_{ijf}(\lambda_{Em})\, c_{ijf}(\lambda_{Ex}) \tag{1}$$

where F is the total number of fluorescent species present in the sample. Having F independent fluorophores with different concentrations, Equation (1)

can easily be extended with an additional dimension referring to the samples.
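To make the extension to several samples concrete, the sketch below simulates a small EEM data set. It assumes that each fluorophore contributes a rank-one landscape (concentration times excitation profile times emission profile), which is the trilinear structure used later in the chapter; the Gaussian profiles, wavelength axes and concentrations are invented for illustration only.

```python
import numpy as np

def gaussian_profile(axis, centre, width):
    """Hypothetical single-peak spectral profile (illustration only)."""
    return np.exp(-0.5 * ((axis - centre) / width) ** 2)

# Made-up wavelength axes in nm.
excitation = np.linspace(250, 400, 31)
emission = np.linspace(300, 550, 51)

# Two invented fluorophores (F = 2); profiles are stored as columns.
B = np.column_stack([gaussian_profile(excitation, 280, 15),
                     gaussian_profile(excitation, 340, 20)])
C = np.column_stack([gaussian_profile(emission, 350, 25),
                     gaussian_profile(emission, 450, 30)])

# Concentrations of both fluorophores in three samples (rows = samples).
A = np.array([[1.0, 0.2],
              [0.5, 0.5],
              [0.1, 1.5]])

# One landscape per sample: sum over fluorophores of a * (b outer c).
landscapes = [
    sum(A[s, f] * np.outer(B[:, f], C[:, f]) for f in range(A.shape[1]))
    for s in range(A.shape[0])
]

# Stacking the landscapes adds the sample mode and gives the three-way array.
X = np.stack(landscapes, axis=0)
print(X.shape)  # (3, 31, 51): samples x excitation x emission
```

The ordering of the modes (samples first) is a choice made for this sketch; any consistent ordering leads to the same three-way structure.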

The structure of X obtained for hyphenated chromatographic systems is similar to the one for EEM (Figure 1). In this case, the signal X(I,J) is proportional to the concentration a of each analyte, which has a specific elution time b(et) and a spectral signal c(λ) (if, for instance, the detector is a spectral detector), as indicated in Equation (2):

$$x_{ij} = \sum_{f=1}^{F} a_{ijf}\, b_{ijf}(et)\, c_{ijf}(\lambda) \tag{2}$$

One can argue that the differences between Equations (1) and (2) are merely semantic (both equations look quite similar). Nevertheless, the chemistry behind each one is totally different, which makes the choice of the proper multiway method an essential step before the analysis. For instance, EEM data can be analysed by using parallel factor analysis (PARAFAC). In most cases, however, GC–MS data cannot be handled directly with PARAFAC without some pre-processing prior to the analysis, which makes the use of PARAFAC2 necessary [8,9]. The suitability of each multiway technique depending on the kind of data will be discussed more closely in further sections of this chapter.

3 PARALLEL FACTOR ANALYSIS

PARAFAC is a decomposition method that can be conceptually compared to principal component analysis (PCA) for multiway data. It was developed independently by Harshman and by Carroll and Chang (the latter under the name CANDECOMP), and both were based on the principle of parallel proportional profiles suggested by Cattell.

We have chosen EEM to exemplify and visualize the working procedure of PARAFAC, as the two are becoming an essential pairing owing to the characteristics of the EEM signal. EEM measurements are fast and usually do not require any prior sample preparation. The large amount of information obtained for a single sample can be regarded as a fingerprint of the sample through its fluorophores. The structure of the data (two independent sets of variables, the excitation and emission profiles, and one variable equally dependent on both spectral profiles, the concentration profile) makes EEM data fulfil the requirement of trilinearity (this concept will be explained later) if no uncontrolled effects or artefacts are present in the samples. Consequently, the combination of EEM and PARAFAC is becoming a popular tool for fast analysis of intact food [4,14], and many applications have already demonstrated its suitability: analysing and authenticating different food systems, differentiating the botanical origin of honey, monitoring the ripening of cheeses, classifying wines according to variety, typicality and age [17–19], monitoring the texture of meat emulsions, characterizing ice cream formulations and following the ripening of Cabernet Franc grapes.

3.1 The General PARAFAC Model

The PARAFAC model decomposes the three-way array X (I × J × K) into three loading matrices, A(I,F), B(J,F) and C(K,F), each one corresponding to one of the modes/directions of the data cube, with elements aif, bjf and ckf, respectively. The model minimizes the sum of squares of the residuals, eijk, in Equation (3):

$$x_{ijk} = \sum_{f=1}^{F} a_{if}\, b_{jf}\, c_{kf} + e_{ijk} \tag{3}$$

where F denotes the number of factors. Figure 4 shows a graphical depiction of the decomposition of X considering two factors (F = 2).

FIGURE 4 Graphical representation of a two-factor PARAFAC model of the data array X. The top part of the figure shows the chemical interpretation, whereas the bottom part shows the mathematical representation.

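Using the Khatri-Rao product (denoted here by ⊙), the PARAFAC model can be formulated in terms of the unfolded array as in Equation (4):

$$\mathbf{X}^{(I \times JK)} = \mathbf{A}(\mathbf{C} \odot \mathbf{B})^{\mathrm{T}} + \mathbf{E}^{(I \times JK)} \tag{4}$$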
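The equivalence between the element-wise model of Equation (3) and the unfolded form of Equation (4) can be checked numerically. The sketch below uses random, noise-free loadings (an assumption made purely for illustration) and one particular unfolding convention, in which the index j runs fastest within each k; other conventions work as long as the unfolding and the Khatri-Rao product use the same ordering.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, F = 4, 3, 5, 2

# Random loadings, used only to build a noise-free example.
A = rng.random((I, F))
B = rng.random((J, F))
C = rng.random((K, F))

# Equation (3) with E = 0: x_ijk = sum_f a_if * b_jf * c_kf.
X = np.einsum("if,jf,kf->ijk", A, B, C)

# Khatri-Rao product of C and B: row k*J + j holds c_kf * b_jf.
CB = np.einsum("kf,jf->kjf", C, B).reshape(K * J, F)

# Unfold X to an I x JK matrix using the same column ordering.
X_unfolded = X.transpose(0, 2, 1).reshape(I, K * J)

# Equation (4) with E = 0.
print(np.allclose(X_unfolded, A @ CB.T))  # True
```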

3.2 PARAFAC Iterations. Convergence to the Solution. Alternating Least Squares

From Equation (4), it follows that PARAFAC aims to find the combination of A, B and C that best fits X(I × JK) for the assigned number of factors. In other words, the aim is to minimize the difference between the reconstructed data (obtained from A, B and C) and the original data or, more precisely, to minimize the squared Euclidean (Frobenius) norm of the residuals (Equation 5):

$$\min_{\mathbf{A},\mathbf{B},\mathbf{C}} \left\| \mathbf{X}^{(I \times JK)} - \mathbf{A}(\mathbf{C} \odot \mathbf{B})^{\mathrm{T}} \right\|_F^2 \tag{5}$$

Minimizing this difference, thus leaving the noise out, is a classical least

squares problem that can be handled by different algorithms. One of the most

popular ones in curve resolution is alternating least squares (ALS). The main

benefit of ALS with respect to others is the simplicity of the involved substeps:

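1. Initialize $\mathbf{B}$ and $\mathbf{C}$.
2. $\mathbf{Z} = (\mathbf{C} \odot \mathbf{B})$; $\mathbf{A} = \mathbf{X}^{(I \times JK)}\, \mathbf{Z} (\mathbf{Z}^{\mathrm{T}} \mathbf{Z})^{+}$
3. $\mathbf{Z} = (\mathbf{C} \odot \mathbf{A})$; $\mathbf{B} = \mathbf{X}^{(J \times IK)}\, \mathbf{Z} (\mathbf{Z}^{\mathrm{T}} \mathbf{Z})^{+}$
4. $\mathbf{Z} = (\mathbf{B} \odot \mathbf{A})$; $\mathbf{C} = \mathbf{X}^{(K \times IJ)}\, \mathbf{Z} (\mathbf{Z}^{\mathrm{T}} \mathbf{Z})^{+}$
5. If $\| \mathbf{X} - \mathbf{A}(\mathbf{C} \odot \mathbf{B})^{\mathrm{T}} \|_F^2 <$ critical value, stop; if not, go to step 2. (6)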

In these sub-steps the superscript + stands for the Moore–Penrose inverse. Two aspects of these sub-steps deserve attention. The first is the need for initial estimates of B and C. Good starting values can help to speed up the algorithm and to ensure that the global minimum is found. In the literature, several kinds of initialization have been proposed [6,23–25]. The second is the need to establish an end point for the iterations, that is, the point at which the reconstructed data are most similar to the original ones. In most cases, a stopping criterion of $10^{-6}$ is enough to ensure that the absolute minimum has been reached. However, if the model parameters are very difficult to estimate, a lower criterion may be chosen.
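The ALS sub-steps can be sketched in a few lines of NumPy. This is an illustrative, unconstrained implementation rather than a production algorithm: the random initialization, the helper names (`khatri_rao`, `unfold`, `parafac_als`) and the exact form of the stopping rule (loss relative to the total sum of squares, or relative change in loss between iterations) are assumptions made for this example.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product; V's row index runs fastest."""
    return np.einsum("uf,vf->uvf", U, V).reshape(-1, U.shape[1])

def unfold(X, mode):
    """Unfold a three-way array with `mode` as rows.

    The two remaining modes are flattened with the earlier mode running
    fastest, matching the row order produced by khatri_rao above.
    """
    other = [m for m in range(3) if m != mode]
    return X.transpose(mode, other[1], other[0]).reshape(X.shape[mode], -1)

def parafac_als(X, n_factors, tol=1e-6, max_iter=1000, seed=0):
    """Fit an unconstrained PARAFAC model by alternating least squares."""
    rng = np.random.default_rng(seed)
    _, J, K = X.shape
    # Step 1: initialize B and C (random values, an assumption of this sketch).
    B = rng.random((J, n_factors))
    C = rng.random((K, n_factors))
    ssx = np.sum(X ** 2)
    prev_loss = ssx
    for _ in range(max_iter):
        # Step 2: update A with B and C fixed.
        Z = khatri_rao(C, B)
        A = unfold(X, 0) @ Z @ np.linalg.pinv(Z.T @ Z)
        # Step 3: update B with A and C fixed.
        Z = khatri_rao(C, A)
        B = unfold(X, 1) @ Z @ np.linalg.pinv(Z.T @ Z)
        # Step 4: update C with A and B fixed.
        Z = khatri_rao(B, A)
        C = unfold(X, 2) @ Z @ np.linalg.pinv(Z.T @ Z)
        # Step 5: evaluate the loss of Equation (5) and test for convergence.
        residual = unfold(X, 0) - A @ khatri_rao(C, B).T
        loss = np.sum(residual ** 2)
        if loss <= tol * ssx or abs(prev_loss - loss) < tol * prev_loss:
            break
        prev_loss = loss
    return A, B, C, loss
```

A call such as `A, B, C, loss = parafac_als(X, n_factors=2)` returns the three loading matrices and the final sum of squared residuals. Because ALS can stall or end in a local minimum, it is common to repeat the fit from several starting points (here, different `seed` values) and keep the solution with the lowest loss.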

3.3 Properties of PARAFAC Model

There are several properties of PARAFAC that make it an attractive technique from an analytical point of view. The first is that, unlike PCA, there is no need to impose orthogonality in the computation of the factors to identify the model. This means that, under the proper constraints, PARAFAC loadings will resemble the real physico-chemical behaviour of the analytes involved in the variability of the signal. Further, if the data are approximately trilinear, the true underlying phenomena will be found if the right number of components is used and the signal-to-noise ratio is appropriate. That is, taking EEM data as an example, the loading matrices B and C will recover the true underlying excitation and emission spectra of the fluorophores involved, respectively, while A will contain their relative concentrations (relative abundance according to B and C). This property is especially appreciated in the interpretation of the obtained factors.

The possibility of modelling factors that are not directly related to the target and, consequently, not present in the calibration samples, is the so-called second-order advantage [6,26]. This advantage states that if the true concentrations of the analytes are known in one or more samples, the concentrations in the remaining samples can be estimated even in the presence of uncalibrated species. This property is inherent to second-order instruments (see Brooks and Kowalski for further information about the definition of the order of the instruments) and is especially relevant in food science, where seasonal and species variation may lead to new uncalibrated interferents in future samples.

FIGURE 5 Two samples with different amounts of three fluorophores measured by EEM giving two landscapes/matrices of data shown in the top middle. The data can be arranged and decomposed as a three-way array (left) or as an unfolded two-way array (right). The excitation vectors obtained by PARAFAC are shown in the bottom left corner.

Another much appreciated property of PARAFAC is the uniqueness of its solution. In most circumstances the model is uniquely identified from the structure of the data, and hence no post-processing is necessary, as the model is the best model in the least squares sense.

PARAFAC relies on the concept of trilinear data. Trilinearity can be viewed as an extension of the bilinear relationship between a dependent variable and an independent one to a scenario with two independent variables and a dependent one. In this way, trilinearity can be seen as a natural extension of the Lambert–Beer law to second-order data. As an example of second-order data, the EEM signal is characterized by a concentration that follows a linear relationship with both the excitation and the emission spectral profiles. Trilinearity assumes that the measured signal is the sum of the individual peaks of each analyte and that the profiles in each mode for the analytes are proportional in all the samples [14,28].

3.4 Model Validation. Selection of the Number of Factors

The features of uniqueness and trilinearity are closely related [10,11]. If the data are trilinear, the true underlying signal will be found if the right number of factors is estimated and the signal-to-noise ratio is appropriate [6,10]. Nevertheless, both concepts are inherently linked to one of the main issues of curve resolution methods: finding the proper number of factors.

In general, a factor must be understood as any effect that causes variations in the signal larger than the noise level expected from the device and/or the samples. This definition encompasses the signal of the analytes of interest, the signal of other analytes present in the sample and, more importantly (because they can be difficult to detect), the different artefacts that affect the signal. A typical example of the latter is baseline drift between samples in chromatography, which may have to be considered as an additional factor because its effect is usually larger than the general noise level.

Choosing the proper number of factors (i.e. the chemical rank of the data) is probably the most crucial (and complicated) step. Extracting too few factors (under-fitted model) is usually an easy problem to detect: the non-random distribution of the residuals and their magnitude give a good clue that more factors are needed. On the contrary, extracting too many factors (overfitted model) means not only that noise is being increasingly modelled, but also that the true factors are being modelled by more (correlated) factors.

There are several dedicated methods proposed to estimate the correct number of factors for PARAFAC, the following being the most common: split-half analysis, combining core consistency and percentage of explained variance, judging residuals and previous chemical knowledge of the data.

Split-half analysis [30,31] uses the intrinsic properties of PARAFAC and the samples, stating that the same B and C loadings should be found in different subsets of the data. The method is based on fitting independent models to different subsets of the samples. Owing to the uniqueness of the PARAFAC model, the same loadings will be obtained from the model of the complete (non-split) data and from the models of any suitable subset of the data, if the correct number of components is chosen. The split-half approach may also sometimes be used for verifying whether non-trilinearities are present.
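Comparing loadings from two halves requires a similarity measure that ignores the arbitrary order, scale and sign of PARAFAC factors. The sketch below uses Tucker's congruence coefficient (the cosine between two loading vectors) for that purpose; this particular measure, the helper names and the exhaustive search over permutations are choices made for illustration and are not prescribed by the text.

```python
from itertools import permutations

import numpy as np

def tucker_congruence(U, V):
    """Cosines between every column of U and every column of V."""
    Un = U / np.linalg.norm(U, axis=0)
    Vn = V / np.linalg.norm(V, axis=0)
    return Un.T @ Vn

def match_loadings(U, V):
    """Pair the columns of U and V one-to-one.

    Returns the permutation `perm` (column f of U is paired with column
    perm[f] of V) that maximizes the worst absolute congruence among the
    pairs, together with that worst-case value.
    """
    phi = np.abs(tucker_congruence(U, V))
    n_factors = U.shape[1]
    best_perm, best_score = None, -1.0
    for perm in permutations(range(n_factors)):
        score = min(phi[f, perm[f]] for f in range(n_factors))
        if score > best_score:
            best_perm, best_score = perm, score
    return best_perm, best_score
```

In a split-half check, `match_loadings` would be applied to the B (and C) loadings obtained from the two halves; a worst-case congruence close to 1 suggests that both halves recover the same profiles. The exhaustive search is only practical for the small numbers of factors typical of PARAFAC models.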

Another common method is to check the core consistency and explained variance of the model. The core consistency [6,29] estimates the appropriateness of the PARAFAC solution. It is based on the fact that the PARAFAC model can be posed as a restricted Tucker3 model (the Tucker3 model is introduced in Section 5), where the core array is fixed to be a superidentity array (superdiagonal array of ones). The core consistency diagnostic consists of first calculating the optimal unconstrained core array for a Tucker3 model where the loading matrices are the ones obtained by the PARAFAC model at hand, and then calculating the relative sum-of-squared difference between this core and the superdiagonal core of ones. The closer the core consistency is to 100%, the better the Tucker3 core agrees with the superdiagonal structure assumed by PARAFAC. If the core consistency is below zero, the PARAFAC model is inappropriate or the variation is purely random.
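The two stages of the diagnostic can be sketched as follows. The least-squares Tucker3 core for fixed loadings is obtained here by multiplying each mode of X by the pseudoinverse of the corresponding loading matrix, and the result is compared with the superidentity array; the function name and the normalization by F (the sum of squares of the superidentity array) are assumptions of this sketch.

```python
import numpy as np

def core_consistency(X, A, B, C):
    """Core consistency (in %) of a PARAFAC model with loadings A, B, C."""
    n_factors = A.shape[1]
    # Least-squares Tucker3 core for fixed loadings:
    # each mode of X is multiplied by the pseudoinverse of its loading matrix.
    G = np.einsum(
        "pi,qj,rk,ijk->pqr",
        np.linalg.pinv(A), np.linalg.pinv(B), np.linalg.pinv(C), X,
    )
    # Superidentity array: ones on the superdiagonal, zeros elsewhere.
    T = np.zeros((n_factors, n_factors, n_factors))
    idx = np.arange(n_factors)
    T[idx, idx, idx] = 1.0
    # Relative sum-of-squared difference, expressed as a percentage.
    return 100.0 * (1.0 - np.sum((G - T) ** 2) / n_factors)
```

For noise-free trilinear data and the generating loadings, the computed core equals the superidentity array and the function returns 100 (up to rounding error). Nothing prevents the value from becoming negative when the core differs strongly from the superidentity array.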


Core consistency tells you whether a model is appropriate or not. Nevertheless, it does not tell you whether the model is the correct one. Assuming, for instance, that the data follow a three-component PARAFAC model, one will find that a two-factor model also has a high value of core consistency. The core consistency will show that all these models are valid in the sense that they do not overfit. That is one of the main reasons why it is important to check other parameters simultaneously, such as the explained variance of the model. The explained variance is the amount of variance explained for the assumed number of factors. It is calculated from the sum of squares of the residuals (SSE) and the sum of squares of the elements in the original data (SSX), as follows:

$$R_X^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SSX}} = 1 - \frac{\|\mathbf{E}\|^2}{\|\mathbf{X}\|^2} \tag{7}$$

The explained variance and the core consistency are usually checked together. As a general rule, the explained variance increases and the core consistency tends to decrease with the number of factors [4,33]. That the explained variance increases with the number of factors is simply a mathematical fact: the more factors are added to the model, the more information is explained. The point is to judge which information reflects real chemical behaviour and which is just noise. The core consistency, instead, may decrease, for example, in a model with three factors, but increase again in a model with four factors. This is due to the nature of the signal and also to the signal-to-noise level, and it is an indication of possibly unstable models. The main point is to find a proper compromise between both parameters. This will also depend on other parameters such as the randomness of the residuals, the shape of the profiles and the quality of the raw data (signal-to-noise ratio).
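The explained variance is straightforward to compute once the loadings are available. The sketch below assumes the loadings A, B and C of a fitted model and returns the fraction of the total sum of squares that the model reproduces (multiply by 100 for a percentage); the function name is hypothetical.

```python
import numpy as np

def explained_variance(X, A, B, C):
    """Fraction of the sum of squares of X explained by the PARAFAC model."""
    # Reconstruct the data from the loadings.
    X_hat = np.einsum("if,jf,kf->ijk", A, B, C)
    sse = np.sum((X - X_hat) ** 2)   # sum of squares of the residuals
    ssx = np.sum(X ** 2)             # sum of squares of the original data
    return 1.0 - sse / ssx
```

Evaluating this function for models with an increasing number of factors, alongside the core consistency, gives the pair of curves that is usually inspected when choosing the number of factors.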

Other methods have been proposed in the literature for assessing the correct number of factors (see, for instance, Hoggard and Synovec, who evaluated the so-called degenerate solutions that can be observed for PARAFAC models with too many factors).

3.5 Imposing Constraints to the Model

A constraint is a chemical or mathematical property that the profiles should fulfil. The chemical structure of the data is therefore taken into consideration in the selection of the proper constraints. The most common constraints are non-negativity and unimodality. Non-negativity forces the profiles to contain only non-negative values. This is especially useful for spectral and chromatographic profiles. Unimodality constraints help to ensure that each extracted profile contains only one peak. Nevertheless, there are a number of other constraints that can be used, mainly to improve the performance of the algorithm and to obtain more meaningful results. Although the PARAFAC model is unique, it may not provide a completely satisfactory description of the data. By definition, a constrained model will fit the data more poorly than an unconstrained one, but if the constrained model is more interpretable and realistic, this may justify the decrease in fit. That is the main reason why constraints must be applied in a sensible manner, critically evaluating the model obtained afterwards.
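One way to impose non-negativity is to replace each unconstrained least-squares update of the ALS algorithm with a non-negative least-squares (NNLS) problem. The sketch below shows such an update for one loading matrix, solved row by row with SciPy's `scipy.optimize.nnls`; the function name `nonnegative_update` and the row-wise strategy are assumptions made for illustration, not the specific algorithm used in the chapter.

```python
import numpy as np
from scipy.optimize import nnls

def nonnegative_update(X_unfolded, Z):
    """Non-negative least-squares estimate of L in X_unfolded ~ L @ Z.T.

    X_unfolded : data unfolded with the mode being updated as rows.
    Z          : Khatri-Rao product of the other two loading matrices.
    Each row of L is obtained from an independent NNLS problem.
    """
    return np.vstack([nnls(Z, row)[0] for row in X_unfolded])
```

Within an ALS loop, a call such as `A = nonnegative_update(X_unfolded, Z)` would take the place of the pseudoinverse-based update of A, and likewise for B and C. When the unconstrained solution is already non-negative, both updates give the same result; otherwise the constrained update fits the data slightly worse, which is the trade-off described above.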

Both validation methods and constraints must be handled with care. They are just mathematical interpretations of the model. Therefore, the results might be biased by poor sampling, a low signal-to-noise ratio and so on. Moreover, one of the most cumbersome issues is that finding the proper number of factors, or the best constraints to apply, requires running several PARAFAC models with different numbers of factors. Once the models have been calculated, their parameters must be compared.

3.6 PARAFAC in Practice

The following example shows different aspects of how to choose the number of factors in PARAFAC models. We have chosen a data set comprising different vinegars measured by EEM. The aim of the work was to classify different qualities of vinegar according to their age. Further information can be found in Callejon et al. Sherry vinegar is produced in the Jerez-Xérès-Sherry, Manzanilla de Sanlúcar and Vinagre de Jerez Protected Designations of Origin in south-western Spain. Its main features are a high acetic degree (legally at least 7°) and a special flavour, properties that rate Sherry vinegar into three categories according to the ageing time in oak wood barrels: Vinagre de Jerez (minimum of 6 months), Reserva (at least 2 years) and Gran Reserva (at least 10 years). These three qualities of vinegar have different market prices because the longer the ageing, the better the quality and the higher the cost of production. As a result, these products are subject to frequent fraud, and fast and reliable classification methods are needed to speed up its detection. This seems a problem well suited to EEM and PARAFAC.

3.6.1 Selection of the Best Model

Figure 6 shows the EEM landscape of several samples of vinegars from different classes. Vinegar is a fermentation product of wine. Therefore, its matrix is complex and can contain a wide range of fluorescent compounds, most of which are polyphenols, but also amino acids (e.g. tryptophan) and vitamins (e.g. vitamin A). Moreover, EEM fluorescence spectroscopy has rarely been applied to vinegar; hence, there is little information about its fluorescence profile.

FIGURE 6 EEM landscape of several vinegars obtained from Ref. . The black lines correspond to the first- and second-order Rayleigh scattering.