Introduction: Why Multiway Data Analysis?
Chapter 7 Multiway Methods
FIGURE 1 Example of the arrangement of the data by using two different techniques. The result
is the same—a three-way array. Nevertheless, the structure of the array depends on the nature of
the measurement technique.
FIGURE 2 Three matrices with their different nomenclatures.
In this chapter several equations will be shown explaining the main mathematical interpretation of three-way models. Scalars will be indicated with
lowercase italics (e.g. x) and vectors with bold lowercase characters (e.g. y).
Ordinary two-way arrays (matrices) will be denoted with bold uppercase
(e.g. X), whereas higher-order arrays will be indicated with underscored bold
uppercase (e.g. X). The ijk-th element of a three-way array X will be x_ijk, where
the indices run as follows: i = 1, ..., I; j = 1, ..., J; k = 1, ..., K. Three-way
arrays will often be denoted X (I × J × K), where I, J and K are the dimensions
of each one of the modes of the array.
Three important matrix operations should also be introduced here: the
Hadamard, Kronecker and Khatri-Rao products are basic operations that must
be understood. To illustrate them, we will use the following matrices as
examples (Figure 2).
The Hadamard product of two matrices (A ∘ B) produces another matrix
in which each element ij is the product of the elements ij of the original matrices.
The condition is that A and B must have the same dimensions. For
instance, with the matrices of Figure 2, the Hadamard product can be computed
between A and B, or even between A and C^T (where the superscript T denotes the
transpose of the matrix), but not between A and C (Figure 3). The Kronecker product
of two matrices (A ⊗ B), with A of dimensions I × J and B of dimensions K × L,
produces another matrix with dimensions IK × JL, as indicated in Figure 3. This is
one of the most widespread and useful products, as the matrices do not need to have
any dimension in common. One advantageous variation of the Kronecker product is the
Khatri-Rao product, also known as the column-wise Kronecker product. The Khatri-Rao
product assumes that the partitions of the matrices are their columns. Therefore,
both matrices must have the same number of columns (Figure 3).
FIGURE 3 Some examples of Hadamard, Kronecker and Khatri-Rao products of the matrices in
Figure 2.
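The three products can be tried numerically with a short NumPy sketch. The matrices below are made up for illustration (they are not the matrices of Figure 2), and `khatri_rao` is a hypothetical helper written here, not a function taken from the chapter.

```python
import numpy as np

def khatri_rao(P, Q):
    """Column-wise Kronecker product of P (I x F) and Q (K x F) -> (I*K x F)."""
    if P.shape[1] != Q.shape[1]:
        raise ValueError("both matrices must have the same number of columns")
    # Column f of the result equals kron(P[:, f], Q[:, f]).
    return np.einsum("if,kf->ikf", P, Q).reshape(-1, P.shape[1])

# Hypothetical example matrices (assumed values, chosen only for the demo).
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])   # 3 x 2
B = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])   # 3 x 2
C = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])     # 2 x 3

hadamard_AB = A * B        # element-wise product; A and B share dimensions
hadamard_ACt = A * C.T     # also valid, because C^T is 3 x 2
kron_AC = np.kron(A, C)    # (3*2) x (2*3) = 6 x 6; no common dimension needed
kr_AB = khatri_rao(A, B)   # (3*3) x 2 = 9 x 2; same number of columns required
```

Note that `A * C` (without the transpose) raises an error in NumPy because the shapes 3 × 2 and 2 × 3 are incompatible, which mirrors the dimension condition of the Hadamard product.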
As we will see further in this chapter, knowing the structure of the data
plays a fundamental role when applying any multiway technique. To illustrate this,
we will comment on the data collected from the two most popular
instrumental techniques currently used in food sciences that are able to produce
multiway data: excitation–emission fluorescence spectroscopy (EEM) and
hyphenated chromatographic systems (e.g. gas chromatography coupled to
mass spectrometry, GC–MS). The benefits and drawbacks of both techniques
in the framework of food analysis will be discussed in successive chapters.
Here we will just focus on the structure of the three-way array. Figure 1 shows
the final three-way structure that is obtained when several samples are analysed by both EEM and hyphenated chromatography. However, the inner
structure of this tensor varies due to the different nature of the measurement.
EEM is a function of two variables: excitation and emission. One sample
measured by EEM can thus conveniently be presented as a matrix of fluorescence intensities as a function of excitation and emission wavelengths. The
fluorescence landscape X(I,J) can therefore be described as a function of a
concentration-dependent factor, a, and its excitation, b(F, λEm), and emission,
c(F, λEx), thus establishing the following linear relationship for each member x_ij of X:

$$x_{ij} = \sum_{f=1}^{F} a_{ijf}\, b_{ijf}(\lambda_{Em})\, c_{ijf}(\lambda_{Ex}) \tag{1}$$

where F is the total number of fluorescent species present in the sample. Having F independent fluorophores with different concentrations, Equation (1)
can easily be extended with an additional dimension referring to the samples.
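To make the structure of Equation (1) and its extension to several samples concrete, the following sketch simulates EEM landscapes as sums of fluorophore contributions. Everything numerical here is an assumption made for the example: the Gaussian band shapes, the wavelength ranges, the number of fluorophores and the concentrations are invented, and the array is stored samples-first purely for convenience.

```python
import numpy as np

def gaussian(axis, centre, width):
    """Hypothetical band shape used only to generate smooth spectra."""
    return np.exp(-0.5 * ((axis - centre) / width) ** 2)

n_ex, n_em, n_fluor, n_samples = 30, 40, 2, 5   # assumed sizes
ex = np.linspace(250, 450, n_ex)                # excitation axis (nm), assumed
em = np.linspace(300, 600, n_em)                # emission axis (nm), assumed

# One column per fluorophore: excitation (B) and emission (C) profiles.
B = np.column_stack([gaussian(ex, 280, 15), gaussian(ex, 350, 20)])
C = np.column_stack([gaussian(em, 340, 20), gaussian(em, 450, 30)])

# One sample: the landscape is the sum over fluorophores of
# concentration * (excitation profile) outer (emission profile).
a = np.array([1.0, 0.5])
X_single = (B * a) @ C.T                        # n_ex x n_em

# Several samples: one row of concentrations per sample gives a three-way array.
rng = np.random.default_rng(0)
A_conc = rng.uniform(0.1, 1.0, size=(n_samples, n_fluor))
X = np.einsum("kf,if,jf->kij", A_conc, B, C)    # samples x excitation x emission
```

Each slice `X[k]` is one landscape, and all slices share the same excitation and emission profiles, differing only in the concentrations.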
The structure of X obtained for hyphenated chromatographic systems is similar to the one for EEM (Figure 1) [3]. In this case, the signal X(I,J) is proportional to the concentration a of each analyte, having a specific elution time
b(et) and a spectral signal c(λ) (if, for instance, the detector is a spectral
detector) as indicated in Equation (2):

$$x_{ij} = \sum_{f=1}^{F} a_{ijf}\, b_{ijf}(et)\, c_{ijf}(\lambda) \tag{2}$$

One can argue that the differences between Equations (1) and (2) are merely
semantic (both equations look quite similar). Nevertheless, the chemistry
behind each one is totally different, making the choice of the proper multiway
method an essential step before the analysis. For instance, EEM data can be analysed
by using parallel factor analysis (PARAFAC). In most cases, however, GC–MS data
cannot be handled directly with PARAFAC without prior pre-processing, which makes
the use of PARAFAC2 necessary [8,9]. The suitability of each multiway technique
for each kind of data will be discussed more closely in further sections of this chapter.
3 PARALLEL FACTOR ANALYSIS
PARAFAC is a decomposition method that can be conceptually compared to
principal component analysis (PCA) for multiway data [10]. It was developed
independently by Harshman [11] and Carroll and Chang under the name
CANDECOMP [12]; and both were based on the principle of parallel proportional profiles suggested by Cattell [13].
We have chosen EEM to exemplify and visualize the working procedure
of PARAFAC, as the two are becoming an essential pairing owing to the characteristics
of the EEM signal. EEM measurements are fast and usually do
not require any previous step of sample preparation. The huge amount of
information obtained for one single sample can be visualized as a fingerprint
of the sample throughout its fluorophores. The structure of the data (two independent sets of variables—excitation and emission profiles—and one variable
equally dependent on both spectral profiles—concentration profiles) makes
EEM data fulfil the requirement of trilinearity (this concept will be explained
later) if no uncontrolled effects/artefacts are present in the samples. Consequently, the combination of EEM and PARAFAC is becoming a popular tool
for fast analysis of intact food [4,14], where many applications have already
demonstrated this suitability (analysing and authenticating different food systems [14], differentiating the botanical origin of honey [15], monitoring the
ripening of cheeses [16], classifying wines according to variety, typicality
and age [17–19], monitoring the texture of meat emulsions [20], characterizing ice cream formulations [21] and ripening of Cabernet Franc grapes [22]).
3.1 The General PARAFAC Model
PARAFAC decomposes the data cube into three loading matrices, A(I,F),
B(J,F) and C(K,F), each one corresponding to the modes/directions of the
data cube with elements aif, bjf and ckf, respectively. The model minimizes
the sum of squares of the residuals, eijk, in Equation (3):

$$x_{ijk} = \sum_{f=1}^{F} a_{if}\, b_{jf}\, c_{kf} + e_{ijk} \tag{3}$$

where F denotes the number of factors. Figure 4 shows a graphical depiction
of the decomposition of X considering two factors (F = 2). The decomposition is
made into triads or trilinear components. Instead of one score vector and one
loading vector, each factor consists of three loading vectors (Figure 4).
FIGURE 4 Graphical representation of a two-factor PARAFAC model of the data array X. The
top part of the figure shows the chemical interpretation, whereas the bottom part shows the mathematical representation.
Using the Khatri-Rao product the PARAFAC model can be formulated in
terms of the unfolded array as in Equation (4):

$$X^{(I \times JK)} = A\,(C \odot B)^{T} + E^{(I \times JK)} \tag{4}$$

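The equivalence between the element-wise model of Equation (3) and the unfolded Khatri-Rao form of Equation (4) can be checked numerically. The sketch below uses random, noise-free loadings (assumed values) and a hypothetical `khatri_rao` helper; the column ordering chosen for the unfolding (k varying slowest, j fastest) is the one that matches C ⊙ B.

```python
import numpy as np

def khatri_rao(P, Q):
    """Column-wise Kronecker product: (I x F), (K x F) -> (I*K x F)."""
    return np.einsum("if,kf->ikf", P, Q).reshape(-1, P.shape[1])

rng = np.random.default_rng(1)
I, J, K, F = 4, 5, 6, 2                 # assumed dimensions and number of factors
A = rng.random((I, F))
B = rng.random((J, F))
C = rng.random((K, F))

# Equation (3) without the residual term: x_ijk = sum_f a_if * b_jf * c_kf
X = np.einsum("if,jf,kf->ijk", A, B, C)

# Unfold to I x JK so that column index k*J + j holds x_ijk.
X_unfolded = X.transpose(0, 2, 1).reshape(I, K * J)

# Equation (4) without the residual term: X(I x JK) = A (C kr B)^T
X_model = A @ khatri_rao(C, B).T
same = np.allclose(X_unfolded, X_model)
```

If the unfolding and the order of the Khatri-Rao factors do not match, the two sides no longer agree, which is a common source of bugs when implementing the model by hand.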
3.2 PARAFAC Iterations. Convergence to the Solution. Alternating Least Squares
From Equation (4), it follows that PARAFAC aims to find the
combination of A, B and C that best fits X(I × JK) for the assigned number
of factors. In other words, the aim is to minimize the difference between the
reconstructed data (obtained from A, B and C) and the original data or, more
precisely, to minimize the squared Frobenius (Euclidean) norm of the residuals (Equation 5):

$$\min_{A,B,C} \left\| X^{(I \times JK)} - A\,(C \odot B)^{T} \right\|_F^2 \tag{5}$$

Minimizing this difference, thus leaving the noise out, is a classical least
squares problem that can be handled by different algorithms. One of the most
popular ones in curve resolution is alternating least squares (ALS). The main
benefit of ALS with respect to others is the simplicity of the involved substeps:
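1. Initialize B and C.
2. $Z = C \odot B$; $A = X^{(I \times JK)} Z (Z^{T} Z)^{+}$
3. $Z = C \odot A$; $B = X^{(J \times IK)} Z (Z^{T} Z)^{+}$
4. $Z = B \odot A$; $C = X^{(K \times IJ)} Z (Z^{T} Z)^{+}$
5. Check whether $\| X^{(I \times JK)} - A\,(C \odot B)^{T} \|_F^2 <$ critical value. If not, go to step 2.   (6)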
In these sub-steps the superscript + stands for the Moore-Penrose pseudoinverse. Two
points in these sub-steps deserve attention. The first is the need for initial
estimates of B and C. Good starting values can help speed up the algorithm and
help ensure that the global minimum is found. In the literature, several
possible kinds of initialization have been proposed [6,23–25]. The second
is the need to establish an end point for the iterations, that
is, the point at which the reconstructed data are most similar to the
original ones. In most cases, a stopping criterion of 10^-6 is enough to
ensure that the absolute minimum in the iterations has been reached. However,
if the model parameters are very difficult to estimate, a lower criterion may be
chosen.
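The sub-steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a production implementation: B and C are initialized with random numbers (only one of the possible initializations), the stopping rule is interpreted as the relative change of the sum of squared residuals between iterations falling below `tol`, and the names `parafac_als`, `khatri_rao`, `tol`, `max_iter` and `seed` are hypothetical.

```python
import numpy as np

def khatri_rao(P, Q):
    """Column-wise Kronecker product: (I x F), (K x F) -> (I*K x F)."""
    return np.einsum("if,kf->ikf", P, Q).reshape(-1, P.shape[1])

def parafac_als(X, n_factors, tol=1e-6, max_iter=1000, seed=0):
    """Fit a PARAFAC model to a three-way array X (I x J x K) by ALS."""
    I, J, K = X.shape
    rng = np.random.default_rng(seed)
    # Step 1: random initialization of B and C (an assumption of this sketch).
    B = rng.random((J, n_factors))
    C = rng.random((K, n_factors))
    # Unfoldings whose column order matches the Khatri-Rao products below.
    X_a = X.transpose(0, 2, 1).reshape(I, K * J)   # X(I x JK)
    X_b = X.transpose(1, 2, 0).reshape(J, K * I)   # X(J x IK)
    X_c = X.transpose(2, 1, 0).reshape(K, J * I)   # X(K x IJ)
    sse_old = np.inf
    for _ in range(max_iter):
        Z = khatri_rao(C, B)                       # step 2
        A = X_a @ Z @ np.linalg.pinv(Z.T @ Z)
        Z = khatri_rao(C, A)                       # step 3
        B = X_b @ Z @ np.linalg.pinv(Z.T @ Z)
        Z = khatri_rao(B, A)                       # step 4
        C = X_c @ Z @ np.linalg.pinv(Z.T @ Z)
        # Step 5: sum of squared residuals and (assumed) relative stopping rule.
        sse = np.sum((X_a - A @ khatri_rao(C, B).T) ** 2)
        if np.isfinite(sse_old) and sse_old - sse <= tol * sse_old:
            break
        sse_old = sse
    return A, B, C, sse

# Noise-free two-factor test array built from assumed random loadings.
rng = np.random.default_rng(42)
A_true = rng.standard_normal((6, 2))
B_true = rng.standard_normal((7, 2))
C_true = rng.standard_normal((8, 2))
X = np.einsum("if,jf,kf->ijk", A_true, B_true, C_true)

A, B, C, sse = parafac_als(X, n_factors=2)
rel_sse = sse / np.sum(X ** 2)   # fraction of the data left unexplained
```

Because each update is an exact least squares solution for one loading matrix with the other two fixed, the sum of squared residuals cannot increase from one iteration to the next, which is what makes a simple convergence check of this kind workable.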
3.3 Properties of PARAFAC Model
There are several properties of PARAFAC that make it an attractive technique
from an analytical point of view. The first one is that, unlike PCA,
there is no need to require orthogonality in the computation of the factors
to identify the model. This means that, under the proper constraints, PARAFAC loadings will resemble the real physico-chemical behaviour of the analytes involved in the variability of the signal. Further, if the data are
approximately trilinear, the true underlying phenomena will be found if the
right number of components is used and the signal-to-noise ratio is appropriate [6]. That is, taking as example the EEM data, the loading matrices B and
C will recover the true underlying excitation and emission spectra of the
involved fluorophores, respectively, while A will contain their relative concentration (relative abundance according to B and C). This property is especially appreciated in the interpretation of the obtained factors, where it
becomes easier to assign sources of variability in PARAFAC loadings than,
for example, in PCA loadings (Figure 5).
The possibility of modelling factors that are not directly related to the target and, consequently, not present in the calibration samples, is the so-called
second-order advantage [6,26]. This advantage states that if the true concentrations of the analytes are known in one or more samples, the concentrations
in the remaining samples can be estimated even in the presence of uncalibrated
species. This property is inherent to second-order instruments (see Brooks
and Kowalski [27] for further information about the definition of the order
of the instruments) and is especially relevant in food science, where seasonal
and species variation may lead to new uncalibrated interferents in future
samples.
FIGURE 5 Two samples with different amounts of three fluorophores measured by EEM giving
two landscapes/matrices of data shown in the top middle. The data can be arranged and decomposed as a three-way array (left) or as an unfolded two-way array (right). The excitation vectors
obtained by PARAFAC are shown in the bottom left corner. In the bottom right corner the
corresponding orthogonal PCA excitation loadings are shown.
Another much appreciated property of PARAFAC is its uniqueness in the
solution. In most circumstances the model is uniquely identified from the
structure, and hence no post-processing is necessary as the model is the best
model in the least squares sense.
PARAFAC relies on the concept of trilinear data [14]. Trilinearity can
be viewed as an extension of the bilinear relationship between a dependent
variable and an independent one to a scenario with two independent variables
and a dependent one. In this way, trilinearity can be seen as a natural extension
of the Lambert-Beer law to second-order data. As an example of
second-order data, the EEM signal is characterized by a concentration that follows
a linear relationship with both the excitation and emission spectral profiles.
Trilinearity assumes that the measured signal is the sum of the individual
peaks of each analyte and that the profiles in each mode for the analytes are
proportional in all the samples [14,28].
3.4 Model Validation. Selection of the Number of Factors
The features of uniqueness and trilinearity are closely related [10,11]. If the
data are trilinear, the true underlying signal will be found if the right number
of factors is estimated and the signal-to-noise ratio is appropriate [6,10].
Nevertheless, both concepts are inherently linked to one of the main issues
of curve resolution methods: finding the proper number of factors.
In general, a factor must be understood as any effect that causes variations in
the signal at a higher level than the signal-to-noise ratio expected from the
device and/or the samples. This definition encompasses the signal of the analytes of interest, the signal of other analytes present in the sample and, what is
more important (due to their difficulty of detection in some cases), the different
artefacts that affect the signal. A typical example of the latter is the possibility of
considering the baseline drift between samples in chromatography as an additional factor, as its effect is usually higher than the general signal-to-noise ratio.
Choosing the proper number of factors (i.e. the chemical rank of the data)
is, probably, the most crucial (and complicated) step. Extracting too few factors (under-fitted model) is usually an easy problem to detect. The nonrandom distribution of the residuals and their values can give a good clue that
the model needs more factors. On the contrary, extracting
too many factors (overfitted model) not only means that noise is being
increasingly modelled, but also that the true factors are being modelled by
more (correlated) factors [10].
There are several dedicated methods proposed to estimate the correct number of
factors for PARAFAC, the following being the most common: split-half analysis,
combining core consistency [29] and percentage of explained
variance, judging residuals and previous chemical knowledge of the data.
Split-half analysis [30,31] uses the intrinsic properties of PARAFAC and
the samples, stating that the same B and C loadings should be found in different
subsets of the data. The method is based on performing independent analyses
of different subsets. Owing to the uniqueness of the PARAFAC model, the
same loadings will be obtained from the non-split model and from models of any
suitable subset of the data, if the correct number of components is chosen. The
split-half approach may also sometimes be used for verifying whether non-trilinearities are present [6].
Another common method is to check the core consistency and explained
variance of the model. The core consistency [6,29] estimates the appropriateness of the PARAFAC solution. It is based on the fact that the PARAFAC
model can be posed as a restricted Tucker3 model (the Tucker3 model is
introduced in Section 5), where the core array is fixed to be a superidentity
array (superdiagonal array of ones). The core consistency diagnostic consists
of first calculating the optimal unconstrained core array for a Tucker3 model
where the loading matrices are the ones obtained by the PARAFAC model at
hand, and then calculating the relative sum-of-squared difference between this
core and the superdiagonal core of ones [4]. The closer the core consistency is
to 100%, the better the Tucker3 core fits to the assumption of the model. If the
core consistency is below zero, the PARAFAC model is inappropriate or the
variation is purely random.
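The diagnostic described above can be sketched as follows. The function name `core_consistency` is hypothetical, and the normalization used here (dividing the squared difference by the sum of squares of the superidentity core, which equals the number of factors F) is one common reading of the "relative sum-of-squared difference"; it should be checked against the definition in the cited references before being relied upon.

```python
import numpy as np

def core_consistency(X, A, B, C):
    """Core consistency (in %) of PARAFAC loadings A, B, C for the array X."""
    F = A.shape[1]
    # Least squares Tucker3 core for fixed loadings:
    # G = X multiplied in each mode by the pseudoinverse of that mode's loadings.
    G = np.einsum("ijk,pi,qj,rk->pqr", X,
                  np.linalg.pinv(A), np.linalg.pinv(B), np.linalg.pinv(C))
    # Superidentity core: ones on the superdiagonal, zeros elsewhere.
    T = np.zeros((F, F, F))
    idx = np.arange(F)
    T[idx, idx, idx] = 1.0
    return 100.0 * (1.0 - np.sum((G - T) ** 2) / F)

# Perfectly trilinear, noise-free data built from assumed random loadings.
rng = np.random.default_rng(7)
A = rng.random((6, 2))
B = rng.random((5, 2))
C = rng.random((4, 2))
X = np.einsum("if,jf,kf->ijk", A, B, C)

cc_exact = core_consistency(X, A, B, C)   # the core equals the superidentity here
```

For exactly trilinear data and the true loadings, the estimated core coincides with the superidentity array and the diagnostic reaches its maximum; noise or too many factors move the core away from that ideal and lower the value.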
Core consistency tells you whether a model is appropriate or not. Nevertheless,
it does not tell you whether the model is the correct one. Assuming, for
instance, that a three-component PARAFAC model is adequate, one will find that a
two-factor model also has a high core consistency. The core consistency will show
that all these models are valid in the sense that they do not overfit [6]. That
is one of the main reasons why it is important to check other parameters
simultaneously, like the explained variance of the model. The explained variance is the amount of variance explained for the assumed number of factors. It
is calculated by taking into account the sum of the squares of the residuals
(SSE) and the sum of the squares of the elements in the original data (SSX)
[32], as follows:

$$R_X^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SSX}} = 1 - \frac{\|E\|^2}{\|X\|^2} \tag{7}$$

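Equation (7) translates directly into code. The helper name `explained_variance` is hypothetical, and the result is returned as a fraction (multiply by 100 for a percentage); the tiny demonstration array is invented so that the arithmetic can be followed by hand.

```python
import numpy as np

def explained_variance(X, X_hat):
    """Fraction of the sum of squares of X explained by the reconstruction X_hat."""
    sse = np.sum((X - X_hat) ** 2)   # sum of squares of the residuals (SSE)
    ssx = np.sum(X ** 2)             # sum of squares of the original data (SSX)
    return 1.0 - sse / ssx

# Hand-checkable demo: 8 elements equal to 1, model reproduces half of each.
X_demo = np.ones((2, 2, 2))
r2_half = explained_variance(X_demo, 0.5 * X_demo)   # SSE = 8 * 0.25 = 2, SSX = 8
r2_perfect = explained_variance(X_demo, X_demo)      # SSE = 0
```

In the demo, SSE/SSX = 2/8, so the model that reproduces half of every element explains 75% of the sum of squares, while a perfect reconstruction explains 100%.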
The explained variance and the core consistency are usually checked
together. As a general rule, the explained variance will increase and the core
consistency tends to decrease with the number of factors [4,33]. The fact that
the explained variance increases with the number of factors is just a mathematical fact, as the more factors are added to the model, the more information
will be explained. The point is to guess which information is real chemical
behaviour and which one is just noise. The core consistency, instead, may
decrease, for example, in a model with three factors, but increase again in a
model with four factors. This is due to the nature of the signal and also to
the signal-to-noise level, and it may point to unstable models.
The main point is to find the proper agreement between both parameters. This
will depend also on other parameters like randomness of the residuals, shape
of the profiles and quality of the raw data (signal-to-noise ratio).
Other methods have been proposed in the literature for assessing the
correct number of factors (see, for instance, Hoggard and Synovec [34],
who evaluated the so-called degenerate solution that can be observed for
PARAFAC models with too many factors).
3.5 Imposing Constraints to the Model
A constraint is a chemical or mathematical property that the profiles should
fulfil [10]. Accordingly, the chemical structure of the data is taken into
consideration when selecting the proper constraints. The most common
constraints are non-negativity and unimodality. Non-negativity forces the profiles
to contain only non-negative values (zero or positive). This is especially useful
for spectral and chromatographic profiles. Unimodality helps to ensure that
each extracted profile has only one peak. Nevertheless, there are
a number of other constraints that can be used, mainly to improve the performance
of the algorithm and to obtain more meaningful results [6]. Despite the PARAFAC
model being unique, the model may not provide a completely satisfactory description
of the data. By definition, a constrained model will fit the
data no better (and usually worse) than an unconstrained one, but if the constrained model is more
interpretable and realistic, this may justify the decrease in fit. That is the main
reason why constraints must be applied in a sensible manner, by critically
evaluating the model obtained afterwards.
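The trade-off between fit and interpretability can be illustrated for a single least squares update of one loading matrix. The sketch below compares the unconstrained solution with a non-negative one obtained by projected gradient descent; this simple solver is swapped in here only for illustration (dedicated non-negative least squares algorithms are normally used), and the data, the noise level and the names `nonneg_update`, `X1` and `Z` are assumptions.

```python
import numpy as np

def nonneg_update(X1, Z, n_iter=2000):
    """Minimize ||X1 - A Z^T||^2 subject to A >= 0 by projected gradient descent."""
    G = Z.T @ Z
    XZ = X1 @ Z
    step = 1.0 / np.linalg.eigvalsh(G)[-1]   # 1 / largest eigenvalue of Z^T Z
    A = np.zeros((X1.shape[0], Z.shape[1]))
    for _ in range(n_iter):
        # Gradient step on the least squares loss, then projection onto A >= 0.
        A = np.maximum(0.0, A - step * (A @ G - XZ))
    return A

# Assumed demo data: true loadings contain zeros, so noise can push the
# unconstrained estimates below zero.
rng = np.random.default_rng(3)
Z = rng.random((30, 2))
A_true = np.array([[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]])
X1 = A_true @ Z.T + 0.05 * rng.standard_normal((3, 30))

A_ls = X1 @ Z @ np.linalg.pinv(Z.T @ Z)   # unconstrained least squares
A_nn = nonneg_update(X1, Z)               # non-negativity constrained

sse_ls = np.sum((X1 - A_ls @ Z.T) ** 2)
sse_nn = np.sum((X1 - A_nn @ Z.T) ** 2)
```

The constrained residual `sse_nn` can never be smaller than `sse_ls`, since the unconstrained solution minimizes the same loss over a larger set; whether the small loss in fit is acceptable depends on how much more realistic the non-negative profiles are.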
Both validation methods and constraints must be handled with care. They
are just mathematical interpretations of the model. Therefore, the results
might be biased by poor sampling, a low signal-to-noise ratio and so on. Moreover,
one of the most cumbersome issues is that choosing the proper number
of factors or the best constraints to apply is a task that can only be performed
after running several PARAFAC models with different numbers of factors.
Once the models have been calculated, their parameters must be compared
across the different numbers of factors.
3.6 PARAFAC in Practice
The following example will show different features of how to choose the
number of factors in PARAFAC models. We have chosen a data set comprised of different vinegars measured by EEM. The aim of the work was to
classify different qualities in vinegars according to their age. Further information can be found in Callejon et al. [33]. Sherry vinegar is produced in
Jerez-Xérès-Sherry, Manzanilla de Sanlúcar and Vinagre de Jerez Protected
Designation of Origin in south-western Spain. Its main features are a high
acetic degree (legally at least 7°) and a special flavour, properties that rate
the vinegar into three categories for Sherry vinegar according to their ageing
time in oak wood barrels: Vinagre de Jerez (minimum of 6 months), Reserva
(at least 2 years) and Gran Reserva (at least 10 years). These three qualities of
vinegars have different prices in the market due to the fact that the longer the
ageing, the better is the quality and the higher the cost of its production. This
fact results in these products being subject to frequent frauds. Therefore, fast
and reliable methods of classification are needed for speeding up the detection
of possible frauds. This makes it a problem well suited to
EEM and PARAFAC.
3.6.1 Selection of the Best Model
Figure 6 shows the EEM landscape of several samples of vinegars from different classes. Vinegar is a fermentation product from wine. Therefore, its
matrix is complex and can be formed by a wide range of fluorescent compounds, most of which are polyphenols, but also amino acids (e.g. tryptophan)
and vitamins (e.g. vitamin A). Moreover, EEM fluorescence spectroscopy has
been rarely applied in vinegar; hence, there is little information about its fluorescence profile.
FIGURE 6 EEM landscape of several vinegars obtained from Ref. [33]. The black lines correspond to the first- and second-order Rayleigh scattering.