3.1 The AL04 Study: Rain Predictions Based on Kriging
Temporal and Spatial Statistical Methods
particularly large hydrographic basin considered. To obtain a more relevant analysis,
cumulative rain measurements over the whole basin must be taken into account.
In order to capture the effects of rain on the piezometric levels, the study is conducted
at a daily scale. Rain data are collected daily from the 5 rain gauges, as they have
been for many years. However, to obtain a cumulative measure of
rain precipitation, a kriging reconstruction over the whole area is necessary.
Kriging is a statistical tool used to make predictions at unobserved points in space
and, more generally, to estimate the surface of the values taken by a variable of
interest over a region. The methodology, first proposed in the 1950s by the mining
engineer Krige, is still widely used because it can fit a prediction
surface over the area considered, whatever the geometric locations of the point
observations.
The spatial approach described in Diggle and Ribeiro (2007) is used to reconstruct the rain surface of the AL04 region. Let $R(x, y)$ denote a stationary Gaussian
process with constant mean $\mu_0$ and variance $\sigma^2$ describing the true underlying
rain precipitation at point $(x, y)$ in AL04. The region is assumed to be a two-dimensional portion of the plane, due to its small size (relative to the Earth's diameter) and
to its flatness. For the same reasons, no spatial trend specification is required. The
spatial correlation structure of the process is modeled through the Matérn correlation
function
$$\rho(d; k, \phi) = (d/\phi)^k K_k(d/\phi) \{2^{k-1} \Gamma(k)\}^{-1},$$
where $K_k(\cdot)$ denotes the modified Bessel function of the second kind of order $k$, and $k$ and $\phi$
represent the shape and scale parameters, respectively.
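As an illustration only, the Matérn correlation above can be evaluated numerically. The following Python sketch assumes NumPy and SciPy are available; the function name `matern_correlation` and the argument names `kappa` (shape $k$) and `phi` (scale) are ours and do not come from the original study.

```python
import numpy as np
from scipy.special import gamma, kv


def matern_correlation(d, kappa, phi):
    """Matérn correlation rho(d; kappa, phi) for distances d >= 0."""
    u = np.atleast_1d(np.asarray(d, dtype=float)) / phi
    rho = np.ones_like(u)                     # rho(0) = 1 by continuity
    pos = u > 0
    # (d/phi)^k * K_k(d/phi) / (2^(k-1) * Gamma(k)); kv is the modified
    # Bessel function of the second kind.
    rho[pos] = u[pos] ** kappa * kv(kappa, u[pos]) / (2.0 ** (kappa - 1.0) * gamma(kappa))
    return rho
```

With shape 0.5 this expression coincides with the exponential correlation $\exp(-d/\phi)$, which gives a convenient sanity check for the implementation.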
The rain gauges in the region provide observable variables $Z_i$, $i = 1, \dots, 5$, the
point rain levels. By applying a Box-Cox transformation, the following model is
considered for them:
$$\sqrt{Z_i} := y_i = R(x_i, y_i) + N_i,$$
where the $N_i$ are i.i.d. normal random variables $N(0, \tau^2)$ and $\tau^2$ represents the
so-called nugget effect. This implies that the random vector $[y_1, \dots, y_5]$ is multivariate Gaussian:
$$y \sim MN(\mu, \sigma^2 H(\phi) + \tau^2 I),$$
where $\mu := [\mu_0, \dots, \mu_0]$ is a constant vector, $H$ is a function of the scale parameter $\phi$
and $I$ is the identity matrix.
Unlike other work regarding rain prediction, such as Ravines et al. (2008), in
our case the small number of rain gauges available in the region does not allow
for a graphical analysis of empirical variograms in order to estimate the model
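parameters. Instead, we estimate these parameters by maximizing the log-likelihood
$$L(\mu, \sigma^2, \tau^2, \phi \mid y) = -\frac{1}{2} \left\{ n \log(2\pi) + \log\left|\sigma^2 H(\phi) + \tau^2 I\right| + (y - \mu)^T \left(\sigma^2 H(\phi) + \tau^2 I\right)^{-1} (y - \mu) \right\}.$$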
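The Gaussian log-likelihood of this model is straightforward to code. The sketch below is a minimal illustration under our own assumptions: the correlation matrix `H` is passed in already evaluated at a given scale parameter, and the function name `gaussian_loglik` is hypothetical.

```python
import numpy as np


def gaussian_loglik(y, mu0, sigma2, tau2, H):
    """Log-likelihood of y ~ MN(mu0 * 1, sigma2 * H + tau2 * I)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    V = sigma2 * np.asarray(H, dtype=float) + tau2 * np.eye(n)
    resid = y - mu0
    _, logdet = np.linalg.slogdet(V)          # log-determinant, stable for small V
    quad = resid @ np.linalg.solve(V, resid)  # (y - mu)^T V^{-1} (y - mu)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + quad)
```

In practice the negative of this function could be handed to a general-purpose optimizer; with only five gauges the surface may be flat in some directions, so several starting values are advisable.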
D. Imparato et al
Once the parameters have been estimated, the ordinary kriging method is used
to predict the rain precipitation levels on a regular grid over AL04.
Ordinary kriging predicts a rain value $\hat{R}(x, y)$ at the point $(x, y)$ as a solution to the
following constrained minimization problem:
$$\begin{cases} \min E[\{\hat{R}(x, y) - R(x, y)\}^2] \\ \text{sub } \hat{R}(x, y) = \sum_i w_i(x, y) z_i \\ \text{sub } \sum_i w_i(x, y) = 1. \end{cases}$$
As a result, with ordinary kriging the prediction is expressed as a weighted linear
combination of the point observations, chosen so as to minimize the mean squared
prediction error. Moreover, the constraint that the weights must sum
to one allows us to make spatial predictions without estimating the process mean
$\mu_0$. In our case, such an estimate would not be reliable due to the small number of
rain gauges. Finally, cumulative rain values for the whole area are obtained through
two-dimensional integration of the predicted surface over the whole domain. The
sample mean of all the predicted rain levels on the regular grid gives a
good approximation of this integral, up to a constant factor equal to the area of the domain.
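The constrained problem above leads to a small linear system in the weights and one Lagrange multiplier. The following Python sketch shows one way to solve it; it is not the authors' implementation. It assumes an exponential covariance (the Matérn family with shape 0.5), and the gauge coordinates, rain values and parameter values are invented for the example.

```python
import numpy as np


def exp_cov(d, sigma2, phi):
    """Exponential covariance sigma2 * exp(-d / phi) (assumed model for this sketch)."""
    return sigma2 * np.exp(-np.asarray(d, dtype=float) / phi)


def ordinary_kriging(coords, z, targets, sigma2, phi, tau2):
    """Return ordinary kriging predictions at `targets` and the weight matrix."""
    coords = np.asarray(coords, dtype=float)
    targets = np.asarray(targets, dtype=float)
    z = np.asarray(z, dtype=float)
    n, p = len(z), len(targets)
    d_obs = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    d_tar = np.linalg.norm(coords[:, None, :] - targets[None, :, :], axis=-1)
    # Kriging system; the last row and column enforce sum(w) = 1.
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = exp_cov(d_obs, sigma2, phi) + tau2 * np.eye(n)
    A[:n, n] = 1.0
    A[n, :n] = 1.0
    b = np.vstack([exp_cov(d_tar, sigma2, phi), np.ones((1, p))])
    weights = np.linalg.solve(A, b)[:n]        # shape (n, p), one column per target
    return weights.T @ z, weights


# Hypothetical example: 5 gauges in the unit square, predictions on a 20 x 20 grid.
gauges = np.array([[0.1, 0.2], [0.8, 0.3], [0.5, 0.5], [0.2, 0.9], [0.9, 0.8]])
rain = np.array([2.0, 3.5, 1.0, 0.5, 4.0])     # made-up transformed rain levels
gx, gy = np.meshgrid(np.linspace(0, 1, 20), np.linspace(0, 1, 20))
grid = np.column_stack([gx.ravel(), gy.ravel()])
pred, w = ordinary_kriging(gauges, rain, grid, sigma2=1.0, phi=0.3, tau2=0.1)
areal_mean = pred.mean()                       # grid average of the predicted surface
```

Because the weights sum to one at every grid point, a constant rain field is reproduced exactly, whatever the value of the unknown mean.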
3.2 Modelling the Joint Effects of Rain and Neighboring Rivers
A pre-processing plus transfer function approach, similar to the one described in
Sect. 2, is now used in order to remove the effects of external predictors for the
piezometers in AL04.
As a first example, the results concerning the piezometer near Tortona are shown in
Fig. 4, where the observed piezometric levels are compared with the reconstructed
piezometric levels, after removal of the rain effects. No nearby waterway is found
to be a relevant predictor in this case. Let $(Y_t)$ and $(W_t)$ represent, respectively,
the pre-processed series of the piezometric levels near the city of Tortona and the
series of cumulative rain precipitations, reconstructed with the methods described
in Sect. 3.1. The resulting estimated model is
$$Y_t = 0.013 W_t + 0.008 W_{t-1} + \eta_t,$$
where $\eta_t$ is the residual time series of the model. This time series represents the
reconstructed virgin model, as described in Sect. 2.
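Given the two pre-processed series, the residual series of the Tortona model is obtained by subtracting the fitted rain contribution. The Python sketch below only illustrates this arithmetic; the function name `virgin_series` is hypothetical, the default coefficients are the estimates quoted in the text, and the first time point is dropped because no lagged rain value is available for it.

```python
import numpy as np


def virgin_series(y, w, b0=0.013, b1=0.008):
    """Residual Y_t - b0 * W_t - b1 * W_{t-1} for t = 1, ..., n - 1."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    return y[1:] - b0 * w[1:] - b1 * w[:-1]
```

The same pattern extends to models with a river series by subtracting its current and lagged terms as well.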
A second model is discussed for the piezometer near the village of Lobbi. In
this case, the river Tanaro is found to be a significant predictor, to be added to the
cumulative rain amounts, providing the final estimated model
$$Y_t = 0.00082 W_t + 0.04201 X_t - 0.04186 X_{t-1} + \varepsilon_t,$$
where $X_t$ is the time series of the Tanaro, preprocessed in a similar way, and $\varepsilon_t$ is the
residual term, which is interpreted as the final virgin level.
Fig. 4 Tortona piezometer: comparison between daily trend of the piezometric levels (solid line)
and virgin trend without rain effects (dashed line)
The results are shown in Fig. 5, in which the observed piezometric levels are
plotted together with different reconstructed trends. Trends are estimated from
the reconstructed piezometric levels using LOWESS smoothing. The
dashed trend refers to the model in which only the rain effect was removed;
the dash-dotted one was obtained by removing the Tanaro effect; finally, the
dotted one is the reconstructed “virgin” trend, where both effects are removed.
From a geological point of view, an interesting similarity seems to hold between
the estimated groundwater virgin levels obtained in this way and the so-called
exhaustion curves of water springs.
4 Discussion
In this paper we describe the statistical modelling of hydrological external contributions to groundwater levels through a transfer function approach. To this end,
the neighboring rivers and rain precipitations are considered as the main predictors.
Removing these external contributions in order to restore a virgin groundwater level
distinguishes our work from the existing literature. Groundwater time series are widely
discussed in Ravines et al. (2008), where a Bayesian approach is taken, and in
Yi and Lee (2004). In the latter, a Kalman filtering approach is used on irregularly
spaced data to regularize the original time series. In both references, very different
geographical and hydrological situations are considered.
Fig. 5 Lobbi piezometer: comparison between different trend reconstructions of the piezometric
levels: original trend (solid line), without rain (dashed line), without Tanaro river (dash-dotted
line), without both rain and Tanaro (dotted line). Vertical axis: piezometric level (m); horizontal axis: time (2007 to 2008)
In Sect. 2, the monthly scale study is considered and only the river effect is found
to be of some interest. As can be seen from Fig. 2, the new estimated trend appears
smoother than the original one: fluctuation effects due to the biological rhythm of
the Po river have been successfully removed.
In order to deal instead with local rain effects, the daily scale is considered in
Sect. 3. In this case, when rain and river contributions are removed, the estimated
trend shows a significantly slow and continuous exhaustion of the groundwater
levels, similar to exhaustion curves of water springs. From the study of these curves,
the health of the groundwater can be evaluated more accurately. This analysis
depends on many factors, such as the geological conformation and the hydrological
network near the piezometer. In fact, the Tortona example shows a faster decay of
the virgin level than the Lobbi case, where a similar effect is evident only after
two years of monitoring. Moreover, the Lobbi piezometer shows a stronger seasonality
than in Tortona. This is due to the presence of the nearby river, and, probably, to a
more complex hydrological underground equilibrium.
The statistical exercise presented here is the starting point for several possible
actions the local government could take, according to EU guidelines:
• Construction of reliable nowcasting predictions: according to the geological
officers involved, alarm thresholds for such predictions may be discussed, aimed
at building semi-automatic procedures for controlling locally the health of the
groundwater, in a way similar to online process control in industrial quality
control.
• Careful modelling of the local water cycle: stochastic models could be built to
replace more rigid existing deterministic models based on partial differential
equations.
• Improved control over private irrigation: the looming threat of water depletion may
suggest that actions be taken to discourage private tapping.
Acknowledgements Work partially supported by Regione Piemonte, RSA Project CIPE 2004:
“Rilevamento di inquinanti in diverse matrici ambientali”. We thank an anonymous referee for
useful suggestions and comments.
References
Battaglia, F.: Metodi di previsione statistica. Springer Verlag Italia, Milano (2007)
Diggle, P.J., Ribeiro, P.J. Jr.: Model-based Geostatistics. Springer Verlag, New York (2007)
European Union: Directive 2006/118/EC on the protection of groundwater against pollution
and deterioration. Official Journal of the European Union, L372/19–L372/31 (2006)
Ravines, R.R., Schmidt, A.M., Migon, H.S., Rennó, C.D.: A joint model for rainfall-runoff:
the case of Rio Grande Basin. Journal of Hydrology 353, 189–200 (2008)
Yi, M., Lee, K.: Transfer function-noise modelling of irregularly observed groundwater heads
using precipitation data. Journal of Hydrology 288, 272–287 (2004)
Reduced Rank Covariances for the Analysis
of Environmental Data
Orietta Nicolis and Doug Nychka
Abstract In this work we propose a Monte Carlo estimator for non-stationary
covariances of large incomplete lattice or irregularly distributed data. In particular,
we propose a method called “reduced rank covariance” (RRC), based on the
multiresolution approach for reducing the dimensionality of the spatial covariances.
The basic idea is to estimate the covariance on a lower resolution grid starting from
a stationary model (such as the Matérn covariance) and use the multiresolution
property of wavelet bases for evaluating the covariance on the full grid. Since this
method does not need to compute the wavelet coefficients, it is very fast in estimating
covariances in large data sets. The spatial forecasting performance of the method
is assessed through a simulation study. Finally, the method is applied
to two environmental data sets: the aerosol optical thickness (AOT) satellite data
observed in Northern Italy and the ozone concentrations in the eastern United States.
1 Introduction
The analysis of many geophysical and environmental problems requires the application of interpolation techniques based on the estimation of covariance matrices. Due
to the non-stationary nature of the data and to the large size of the data sets, the usual
covariance models cannot be applied. When the spatial dimension of the sample is
very large, the covariance matrices need to be reduced in size
to make their calculation feasible. Many approaches have been proposed in the
literature, mainly based on multiresolution analysis, on tapering methods or on
approximating the likelihood function (Cressie and Johannesson 2008; Matsuo et
al. 2008; Zhang and Du 2008; Banerjee et al. 2008; Fuentes 2007; Stein 2008).
O. Nicolis
Department of Information Technology and Mathematical Methods, University of Bergamo,
Italy
e-mail: orietta.nicolis@unibg.it
D. Nychka
Institute for Mathematics Applied to Geosciences, National Center for Atmospheric Research,
Boulder, CO (USA)
e-mail: nychka@ucar.edu
A. Di Ciaccio et al. (eds.), Advanced Statistical Methods for the Analysis
of Large Data-Sets, Studies in Theoretical and Applied Statistics,
DOI 10.1007/978-3-642-21037-2 23, © Springer-Verlag Berlin Heidelberg 2012
In this work we propose a non-parametric method for computing the covariance
matrices of massive data sets based on the multiresolution approach introduced
by Nychka et al. (2003). In particular, this method is based on the wavelet
decomposition of the covariance matrix as follows.
Let $y$ be the $m$ data points of the field on a fine grid and $\Sigma$ the $(m \times m)$ covariance
matrix among grid points. By the multiresolution approach (Nychka et al. 2003),
a spatial covariance matrix $\Sigma$ can be decomposed as
$$\Sigma = W D W^T = W H H^T W^T \qquad (1)$$
where $W$ is a matrix of basis functions evaluated on the grid, $D$ is the matrix of
coefficients, $H$ is a square root of $D$, and the superscript $T$ denotes transposition. Unlike
the eigenvector/eigenvalue decomposition of a matrix, $W$ need not be orthogonal
and $D$ need not be diagonal. Since for massive data sets $\Sigma$ may be very large, some
authors (Nychka et al. 2003) suggested an alternative way of building the covariance
by specifying the basis functions and a matrix $H$. The basic idea of this work is to
estimate the matrix $H$ iteratively on a lower resolution grid starting from
a stationary model for $\Sigma$. The evaluation of the wavelet basis on a fine grid in (1)
provides a reduced rank covariance matrix.
The method can be used for the estimation of covariance structures of irregularly
distributed data points and lattice data with many missing values.
In this paper, the multiresolution method based on the reduced rank covariance is
applied to two environmental data sets: the AOT satellite data (Nicolis et al. 2008)
and daily ozone concentrations (Nychka 2005).
The next section discusses the multiresolution approach for the analysis of observational data. Section 3 describes the Reduced Rank Covariance (RRC) algorithm
for the estimation of the conditional variance in large data sets. Section 4 shows some
simulation results. Applications to satellite and ozone data are described in Sect. 5.
Section 6 presents conclusions and further developments.
2 Modelling Observational Data
In many geophysical applications, the spatial fields are observed over time and one
can exploit temporal replication to estimate sample covariances. In this section we
focus on this case, and also on gridded data, with the goal of deriving an estimator
that scales to large problems. Suppose that the point observations $y$ are samples of a
centered Gaussian random field on a fine grid and are composed of the observations
at irregularly distributed locations, $y_o$, and the missing observations, $y_m$. In other
words, we assume that the grid is fine enough in resolution so that any observation
can be registered on the grid points (as in Fig. 1a). Hence,
$$y = \begin{pmatrix} y_o \\ y_m \end{pmatrix} \sim MN(0, \Sigma) \qquad (2)$$
where $\Sigma = W D W^T$ is the covariance of the fine gridded data described in (1), $D$ is a
non-diagonal matrix and $W$ is a matrix of non-orthogonal scaling and wavelet
functions. Although the non-orthogonality of the wavelet basis produces off-diagonal
coefficients in the matrix $D$, the localized support of these functions
ensures that many covariance terms in $D$ will be close to zero, reducing the
computational complexity of surface interpolation problems. An important
class of compactly supported basis functions, used in the applications of this work,
is the Wendland family proposed by Wendland (1995). The Wendland functions
are also implemented in the R statistical language (http://www.r-project.org) in the
fields package (Nychka 2005). Figure 2 provides an example of the matrices $D$ and
$H$, obtained from (1),
$$D = W^{-1} \Sigma W^{-T}, \qquad (3)$$
where $\Sigma$ is the covariance resulting from fitting a Matérn model to a regular
grid, $W$ is a matrix whose columns contain Wendland functions, and $W^{-T}$ is the
transpose of the inverse of $W$.
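To make relation (3) concrete, the following Python sketch computes $D$ and a symmetric square root $H$ for a small one-dimensional grid. It is an illustration under our own assumptions rather than the computation behind Fig. 2: the grid has 8 points on a line, the stationary model is an exponential covariance, and the basis is the Wendland function $(1 - r)^4 (4r + 1)$ with an arbitrarily chosen support radius of 0.3.

```python
import numpy as np


def wendland(r):
    """Wendland function (1 - r)^4 (4 r + 1) for r < 1, zero otherwise."""
    r = np.asarray(r, dtype=float)
    return np.where(r < 1.0, (1.0 - r) ** 4 * (4.0 * r + 1.0), 0.0)


def coefficient_matrices(W, Sigma):
    """Return D = W^{-1} Sigma W^{-T} and a symmetric H with H H^T = D."""
    X = np.linalg.solve(W, Sigma)             # W^{-1} Sigma
    D = np.linalg.solve(W, X.T).T             # (W^{-1} Sigma) W^{-T}
    D = 0.5 * (D + D.T)                       # remove round-off asymmetry
    vals, vecs = np.linalg.eigh(D)
    H = vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T
    return D, H


# Hypothetical example: 8 grid points on [0, 1].
x = np.linspace(0.0, 1.0, 8)
dist = np.abs(x[:, None] - x[None, :])
Sigma = np.exp(-dist / 0.3)                   # assumed stationary covariance
W = wendland(dist / 0.3)                      # columns are Wendland bumps centred on the grid
D, H = coefficient_matrices(W, Sigma)
```

Linear solves are used instead of forming $W^{-1}$ explicitly, which is usually the more stable choice when $W$ is only moderately well conditioned.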
The observational model (2) can be written as $z_o = K y + \varepsilon$, where $\varepsilon$ is
multivariate normal $MN(0, \sigma^2 I)$, $z_o$ is a vector of $m$ observations, and $y$ is the
underlying spatial field on the grid. The matrix $K$ denotes an incidence matrix
of ones and zeroes with a single one in each row indicating the position of each
observation with respect to the grid. The conditional distribution of $y$ given $z_o$ is
Gaussian with mean
$$\Sigma_{o,m} (\Sigma_{o,o})^{-1} z_o \qquad (4)$$
and variance
$$\Sigma_{m,m} - \Sigma_{m,o} (\Sigma_{o,o})^{-1} \Sigma_{o,m} \qquad (5)$$
where $\Sigma_{o,m} = W_o H H^T W_m^T$ is the cross-covariance between observed and
missing data, $\Sigma_{o,o} = W_o H H^T W_o^T + \sigma^2 I$ is the covariance of the observed data and
$\Sigma_{m,m} = W_m H H^T W_m^T$ is the covariance of the missing data. The matrices $W_o$ and $W_m$
are the wavelet bases evaluated at the observed and missing data, respectively. For
a chosen multiresolution basis and a sparse matrix $H$ there are fast recursive
algorithms for computing the covariance $\Sigma$. Matsuo et al. (2008) proposed a method
that allows for sparse covariance matrices for the basis coefficients. However, the
evaluation of wavelet coefficients can be slow for large data sets.
Fig. 1 Gridded data: irregularly distributed data (squares); missing data (small points) and knots
(circles). Panels (a) and (b); both axes range from 0 to 1
Fig. 2 Example of D matrix (a) and H matrix (b) on 8 × 8 grid data using wavelet-based non-stationary covariance. Both axes range from 0 to 1
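The conditional mean and variance above can be computed directly from the basis matrices. The Python sketch below is a minimal illustration with hypothetical names; it assumes dense matrices small enough to solve directly, and it forms the cross-covariance with rows indexed by the missing points, so that the product with the observation vector has the dimension of the missing data.

```python
import numpy as np


def conditional_moments(W_o, W_m, H, sigma2, z_o):
    """Conditional mean and covariance of the missing values given z_o."""
    D = H @ H.T
    S_oo = W_o @ D @ W_o.T + sigma2 * np.eye(W_o.shape[0])   # observed covariance
    S_mo = W_m @ D @ W_o.T                                   # missing-by-observed block
    S_mm = W_m @ D @ W_m.T                                   # missing covariance
    mean = S_mo @ np.linalg.solve(S_oo, z_o)
    cov = S_mm - S_mo @ np.linalg.solve(S_oo, S_mo.T)
    return mean, cov
```

For large grids the dense solve would be replaced by sparse or iterative methods, which is where the sparsity of $H$ and the compact support of the basis become important.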
3 The Reduced Rank Covariance (RRC) Method
In this section we propose an estimation method for $\Sigma$ based on the evaluation of
reduced rank matrices. We denote by “knots” the spatial points on a lower resolution
grid $G$ of size $(g \times g)$, where $g \le m$. The idea is to estimate the matrix $H$ on the
grid of knots starting from a stationary model for $\Sigma$ and using Monte Carlo
simulation to provide an estimator for the conditional covariance. A flexible
model of stationary covariance is the Matérn covariance given by
$$C(h) = \sigma^2 \frac{1}{\Gamma(\nu) 2^{\nu - 1}} \left( 2 \sqrt{\nu}\, \frac{h}{\theta} \right)^{\nu} K_\nu\left( 2 \sqrt{\nu}\, \frac{h}{\theta} \right), \qquad \theta > 0,\ \nu > 0,$$
where $h$ is the distance, $\theta$ is the spatial range and $K_\nu(\cdot)$ is the modified Bessel function of the
second kind of order $\nu$, the smoothing parameter that controls the order of differentiability. Since $W$
is fixed for a chosen basis, the estimation procedure for the conditional covariance
reduces to the estimation of the matrix $H$ after a sequence of approximations.
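As a quick check on this parameterization (our own worked example, not part of the original derivation), take smoothness $\nu = 1/2$, the value used in the simulation study of Sect. 3.1. Writing $x = 2\sqrt{\nu}\, h/\theta = \sqrt{2}\, h/\theta$ and using the standard identities $K_{1/2}(x) = \sqrt{\pi/(2x)}\, e^{-x}$ and $\Gamma(1/2) = \sqrt{\pi}$, the Matérn covariance reduces to the exponential model:

```latex
C(h) = \sigma^2 \, \frac{1}{\sqrt{\pi}\, 2^{-1/2}} \; x^{1/2} \, \sqrt{\frac{\pi}{2x}} \; e^{-x}
     = \sigma^2 e^{-x}
     = \sigma^2 \exp\!\left( -\sqrt{2}\, \frac{h}{\theta} \right).
```

Larger values of $\nu$ give smoother fields, so $\nu = 1/2$ sits at the rough end of the family.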
Following this approach the covariance in (1) can be approximated as
$$\Sigma \approx W \tilde{H}_g \tilde{H}_g^T W^T, \qquad (6)$$
where $\tilde{H}_g$ is an estimate of the matrix $H$ on the grid $G$. The RRC estimation
algorithm can be described as a Monte Carlo EM algorithm with the following
steps.
1. Find the Kriging prediction on the grid $G$:
$$\hat{y}_g = \Sigma_{o,g} (\Sigma_{o,o})^{-1} z_o,$$
where $\tilde{H}_g = (W_g^{-1} \Sigma_{g,g} W_g^{-T})^{1/2}$ and $\Sigma_{g,g}$ is a stationary covariance model
(e.g. Matérn).
2. Generate synthetic data: $z_o^s = K y_g^s + \varepsilon$, where $y_g^s = W_g \tilde{H}_g a$ with $a \sim N(0, 1)$.
3. Compute the Kriging errors:
$$u = y_g^s - \hat{y}_g^s,$$
where $\hat{y}_g^s = \Sigma_{o,g} (\Sigma_{o,o})^{-1} z_o^s$.
4. Find the conditional field $y_m \mid z_o$:
$$\hat{y}_u = \hat{y}_g + u.$$
5. Compute the conditional covariance on $T$ replications, $\Sigma_u = \mathrm{COV}(\hat{y}_u)$, and use
the new $\tilde{H}_g$ in step 1.
Performing this several times will give an ensemble of fields and, of course, finding
the sample covariance across the ensemble provides a Monte Carlo based estimate
of the conditional covariance.
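A single pass through steps 1 to 5 can be sketched in Python as follows. This is a simplified illustration under our own assumptions, not the authors' code: the field lives on a one-dimensional grid, the observations sit exactly on grid points, a Cholesky factor of an exponential covariance stands in for the product $W_g \tilde{H}_g$, and the outer iteration that feeds the updated $\tilde{H}_g$ back into step 1 is omitted. All names and numerical values are hypothetical.

```python
import numpy as np


def conditional_ensemble(root, obs_idx, z_o, noise_var, n_rep, rng):
    """Monte Carlo draws of the field given z_o, where Sigma = root @ root.T."""
    Sigma = root @ root.T
    n_obs = len(obs_idx)
    S_oo = Sigma[np.ix_(obs_idx, obs_idx)] + noise_var * np.eye(n_obs)
    S_go = Sigma[:, obs_idx]                                  # grid-by-observed covariance

    def krige(z):
        return S_go @ np.linalg.solve(S_oo, z)

    y_hat = krige(z_o)                                        # step 1: kriging prediction
    a = rng.standard_normal((root.shape[1], n_rep))
    y_s = root @ a                                            # step 2: synthetic fields
    z_s = y_s[obs_idx] + np.sqrt(noise_var) * rng.standard_normal((n_obs, n_rep))
    u = y_s - krige(z_s)                                      # step 3: kriging errors
    draws = y_hat[:, None] + u                                # step 4: conditional fields
    return y_hat, draws.T                                     # rows are replications


# Hypothetical example: 15 grid points, 5 of them observed.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 15)
Sigma = np.exp(-np.abs(x[:, None] - x[None, :]) / 0.3)
root = np.linalg.cholesky(Sigma)                              # stands in for W_g H_g
obs_idx = np.array([0, 3, 7, 11, 14])
z_o = np.array([0.5, -0.2, 1.0, 0.3, -0.8])                   # made-up observations
y_hat, draws = conditional_ensemble(root, obs_idx, z_o, 0.1, 4000, rng)
Sigma_u = np.cov(draws, rowvar=False)                         # step 5: sample covariance
```

In this small setting the sample covariance can be compared with the exact conditional covariance, which is a useful check before moving to grids where the exact matrix is too large to form.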
3.1 Simulation Study
The purpose of this study is to investigate the forecasting ability of the proposed
RRC method in two different contexts: (a) approximation of stationary covariance
models and (b) estimation of non-stationary covariances. In order to study the approximation properties of the RRC method, we simulated $n = 20$ Gaussian random
fields on a $20 \times 20$ grid using a Matérn model with parameters $\theta = 0.1$ and $\nu = 0.5$.
In order to generate the missing data we randomly removed 50% of the simulated
data. An example of a simulated random field with missing data is shown in Fig. 3.
For each simulated random field we estimated the missing data using the RRC
method on a grid of $8 \times 8$ knots and then we computed the root mean square
errors of the predictions. The parameters of the Matérn model used in step
1 of the algorithm were chosen by cross validation. Figure 4a compares the
RMSE for each simulated random field for the Matérn model and the non-stationary