S. Cuomo et al.
Fig. 1 A CPU multi-threaded approach on a set of small matrices. Each thread performs an SVD computation on a small matrix Xi. Depending on the CPU caching capabilities, performance can be enhanced for small problem sizes.
Fig. 2 CUDA streams approach to multiple SVD problems. Each stream performs an SVD decomposition on a small matrix, and a pipeline is executed using multiple streams. In this example, 3 streams are used to compute 3 SVD decompositions. The upper side of the image shows a CPU executing 3 SVDs. The bottom side shows 3 streams working together on the same problem.
GPU Profiling of Singular Value Decomposition in OLPCA …
Listing 1 Calculating the size of the pre-allocated workspace buffer
cusolverStatus_t cuSolverStatus;
cusolverDnHandle_t cuSolverHandle;
int Lwork = 0;
cusolverDnCreate(&cuSolverHandle);
/* Query the workspace size (in floats) required by the SVD routine */
cuSolverStatus = cusolverDnSgesvd_bufferSize(cuSolverHandle, M, N,
    &Lwork);
float *Work;
cudaMalloc(&Work, Lwork*sizeof(*Work));
float *rwork;
cudaMalloc(&rwork, M*M*sizeof(*rwork));
However, care is needed in choosing a parallel approach to tackle this problem. Although good performance and speed-ups can be achieved by parallelizing the SVD computation, for instance with a custom kernel or a library of choice (cuBLAS, CULA or others), we stress that in this case all matrices Xi are very small (square matrices of order 10 to 14). Computing the SVD of such small matrices, which in the truncated case has a computational complexity of O(min(mn2, m2n)), requires only a few hundred FLOPs on a CPU, so a solution can be computed within milliseconds.
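To illustrate how cheap these small problems are on a CPU, the following sketch uses NumPy as a hypothetical stand-in for an optimized LAPACK-backed SVD (the matrix size matches the 10 to 14 range above); it decomposes a thousand small matrices and reports the total wall-clock time:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
# One thousand small matrices, in the size range quoted above (10 to 14).
mats = [rng.standard_normal((12, 12)) for _ in range(1000)]

t0 = time.perf_counter()
for x in mats:
    u, s, vt = np.linalg.svd(x)
elapsed = time.perf_counter() - t0
print(f"1000 small SVDs in {elapsed * 1e3:.1f} ms")
```

On a typical desktop CPU this finishes in a handful of milliseconds, which is the regime the argument above refers to.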
Initializing a GPU and then performing an SVD decomposition takes more time than simply performing the decomposition on a CPU. For instance, the theoretical number of threads that could be used for a 5 × 5 matrix is 25. In practice, the number of simultaneous threads per matrix is 1 or 5, meaning that many thousands of matrices must be processed simultaneously to obtain a sufficient number of threads. To justify a GPU computation, more operations should be performed, for which thousands of simultaneous threads are preferable to achieve a noticeable difference in overall performance.
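The thread-count argument can be made concrete with a back-of-the-envelope calculation; the target occupancy below is a hypothetical figure chosen for illustration, not a measured property of any specific card:

```python
matrix_dim = 5
threads_per_matrix = matrix_dim  # one thread per column, as discussed above
# Assume, hypothetically, a GPU that needs about 20,000 resident threads
# to hide memory latency and keep all multiprocessors busy.
target_threads = 20_000
matrices_needed = target_threads // threads_per_matrix
print(matrices_needed)  # 4000 matrices must be in flight simultaneously
```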
Theoretically, one CUDA stream per problem could be used as a solution for one
problem at a time, as depicted in Fig. 2. However, two issues arise:
• the number of threads per block would be very low. Typically, no fewer than 32 threads per block (the warp size, effectively the SIMD width, of all current CUDA-capable GPUs is 32 threads) are preferred to fully exploit the potential of a GPU card, and this is not practical, as previously mentioned;
• the prepare-and-launch code would be more expensive than simply performing the operation on a CPU, and the resulting overhead would be unacceptable.
NVIDIA cuBLAS has introduced a batch-mode feature, but it is only available for matrix multiplication, matrix inversion and QR decomposition routines. It is intended for a specific class of problems, allowing the user to request the solution of a group of identical problems (a SIMD-style operation in this case). The cuBLAS batch interface targets specific circumstances with small problems (matrix size < 32) and very large batch sizes (number of matrices > 1000). Our case study falls into this category, but an SVD decomposition would have to be derived and implemented on top of the batched QR decomposition, which goes beyond the scope of this case study. Moreover, using QR iterations to obtain an SVD would result in a slower algorithm than the LAPACK-optimized SVD.
Listing 2 An example of a cuBLAS SVD routine call. Profiling event registration is included.
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
cuSolverStatus = cusolverDnSgesvd(cuSolverHandle, 'A', 'A', M, N,
    devA, M, devS, devU, M, devV, N, Work, Lwork, NULL, devInfo);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&executionTime, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);
Listing 3 An example of an SVD routine call using CULA. Profiling event registration is included.
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
status = culaSgesvd('A', 'A', M, N, A, lda, S, U, ldu, V, ldvt);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&executionTime, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);
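The batch-mode idea can be sketched on the CPU side with NumPy, whose linalg.svd operates over a leading batch dimension in a single call, mirroring the "group of identical problems" interface discussed above; this illustrates the concept only, not the cuBLAS API itself:

```python
import numpy as np

rng = np.random.default_rng(0)
# A batch of 1000 small square matrices, as in the OLPCA setting.
batch = rng.standard_normal((1000, 12, 12)).astype(np.float32)

# The SVD is applied independently to each matrix in the batch in one call.
U, S, Vt = np.linalg.svd(batch)

# Reconstruct one element of the batch to verify the decomposition.
x0 = (U[0] * S[0]) @ Vt[0]
print(np.allclose(x0, batch[0], atol=1e-3))
```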
The GPU implementation is based on two libraries: NVIDIA cuBLAS and CULA. Here we report a sample code for each library. In Listings 1 and 2 we exploit the cuSOLVER API, a high-level package based on cuBLAS and cuSPARSE that combines three separate libraries. Here we use the first part of cuSOLVER, called cuSolverDN, which deals with dense matrix factorization and solve routines such as LU, QR and SVD. We use the S data type for real-valued single precision inputs.
In Listing 1, we use a helper function to calculate the size needed for the pre-allocated buffer Lwork. In Listing 2, the cusolverDnSgesvd function computes the SVD of an m × n matrix devA (the prefix dev denotes a matrix placed on the device) and the corresponding left and/or right singular vectors. The working space needed is pointed to by the Work parameter, declared as a vector of float, while the size of the working space is specified in Lwork, as calculated in Listing 1.
Finally, in Listing 3, we report another test code based on the CULA library. Here, the SVD functionality is implemented in the gesvd routine. Initialization and the actual routine call are simpler than with the cuBLAS and cuSOLVER libraries, as no preallocation is necessary. A matrix A is passed to the culaSgesvd routine and the result is stored in the S, U and V matrices. Again, the S data type denotes real-valued single precision inputs. CULA also provides a version of its routines that works directly on the device; in that case, culaDeviceSgesvd would be used instead of culaSgesvd, and all input and output matrices need to be properly moved from host to device and vice versa.
4 Performance Analysis and Proﬁling
To support our analysis, we report a test for a group of SVD decompositions. In Fig. 3 we compare the execution times of an SVD decomposition performed on a CPU (Intel Core i7 860 @ 2.80 GHz) and a GPU (NVIDIA Quadro K5000, 4 GB). In the GPU version, we perform an SVD decomposition on a random test matrix and compare the two libraries, CULA and cuBLAS. GPU performance gains become noticeable as the size of the square input matrix increases. In fact, the CPU version performs better when the problem size is smaller than 256 × 256 (Fig. 4). These results show that the GPU overhead has a significant impact on performance for very small matrices. Moreover, it naturally follows that calling an SVD routine of a chosen library several thousands of times affects the overall performance. In our tests, issuing several SVD decompositions using CULA or cuBLAS resulted in a stalled computation.
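The crossover behaviour described above can be probed on the CPU side with a simple timing harness; in the sketch below NumPy stands in for the OpenCV routine, and the sizes are chosen to straddle the 256 × 256 threshold. It only illustrates how CPU cost grows with matrix size, not the GPU side of the comparison:

```python
import time
import numpy as np

def cpu_svd_seconds(n, reps=3):
    """Average wall-clock time of a full SVD on an n-by-n random matrix."""
    x = np.random.default_rng(0).standard_normal((n, n))
    t0 = time.perf_counter()
    for _ in range(reps):
        np.linalg.svd(x)
    return (time.perf_counter() - t0) / reps

for n in (16, 64, 256, 512):
    print(f"{n:4d} x {n:<4d}: {cpu_svd_seconds(n) * 1e3:8.2f} ms")
```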
The CPU version used in this comparative study is based on the SVD implementation provided by the OpenCV library. We use the SVD::compute routine, which stores the results in user-provided matrices, similarly to the cuBLAS and CULA routines. This choice is mainly motivated by the intention of testing state-of-the-art and/or production-ready routines used in many image processing applications, and of investigating the impact that each module, CPU- or GPU-based, has on algorithms that specifically need to manipulate a considerable number of small matrices.
Fig. 3 CPU and GPU execution time comparison for square matrices.
Fig. 4 CPU and GPU execution time comparison for square matrices. A section of Fig. 3, shown in order to highlight the performance scatter for small problem sizes. As shown, CPU performance is still preferable for small problem sizes.
5 Conclusion
In many image processing methods, heavy computational kernels are implemented using suitable numerical methods and scientific computing strategies. We propose a solution based on optimized standard scientific libraries in order to improve the performance of some tasks in the denoising OLPCA algorithm. In future work we will develop more optimized parallel software for this algorithm.
Acknowledgment
This work is partially funded by the Big4H (Big data analytics for e-Health applications) project, Bando Sportello dell'Innovazione - Progetti di trasferimento tecnologico cooperativi e di prima industrializzazione per le imprese innovative ad alto potenziale - Regione Campania.
References
1. NVIDIA CUDA programming guide. Tech. Rep. [online] (2012). Available at http://www.nvidia.com/content/cudazone/download/OpenCL/NVIDIAOpenCLProgrammingGuide.pdf
2. Amato, F., Barbareschi, M., Casola, V., Mazzeo, A.: An FPGA-based smart classifier for decision support systems. In: Intelligent Distributed Computing VII, pp. 289–299. Springer (2014)
3. Amato, F., De Pietro, G., Esposito, M., Mazzocca, N.: An integrated framework for securing semi-structured health records. Knowledge-Based Systems 79, 99–117 (2015)
4. Amato, F., Moscato, F.: A model driven approach to data privacy verification in e-health systems. Transactions on Data Privacy 8(3), 273–296 (2015)
5. Andrews, H.C., Patterson, C.L.: Singular value decompositions and digital image processing. IEEE Trans. on Acoustics, Speech, and Signal Processing ASSP-24, 26–53 (1976)
6. Buades, A., Coll, B., Morel, J.: A review of image denoising algorithms, with a new one. Multiscale Modeling & Simulation 4(2), 490–530 (2005). DOI 10.1137/040616024
7. Bydder, M., Du, J.: Noise reduction in multiple-echo data sets using singular value decomposition. Magnetic Resonance Imaging 24(7), 849–856 (2006). DOI 10.1016/j.mri.2006.03.006
8. Cuomo, S., De Michele, P., Galletti, A., Marcellino, L.: A GPU parallel implementation of the local principal component analysis overcomplete method for DW image denoising. In: 2016 IEEE Symposium on Computers and Communication (ISCC), pp. 26–31 (2016). DOI 10.1109/ISCC.2016.7543709
9. Cuomo, S., De Michele, P., Piccialli, F.: A regularized MRI image reconstruction based on Hessian penalty term on CPU/GPU systems. pp. 2643–2646 (2013). DOI 10.1016/j.procs.2013.06.001
10. Cuomo, S., De Michele, P., Piccialli, F.: 3D data denoising via nonlocal means filter by using parallel GPU strategies. Comp. Math. Methods in Medicine 2014, 523862:1–523862:14 (2014). DOI 10.1155/2014/523862
11. Cuomo, S., Galletti, A., Giunta, G., Marcellino, L.: Toward a multi-level parallel framework on GPU cluster with PETSc-CUDA for PDE-based optical flow computation. pp. 170–179 (2015). DOI 10.1016/j.procs.2015.05.220
12. Cuomo, S., Galletti, A., Giunta, G., Starace, A.: Surface reconstruction from scattered point via RBF interpolation on GPU. pp. 433–440 (2013)
13. Konstantinides, K., Natarajan, B., Yovanof, G.S.: Noise estimation and filtering using block-based singular value decomposition. IEEE Transactions on Image Processing 6(3), 479–483 (1997). DOI 10.1109/83.557359
14. Manjón, J., Coupé, P., Concha, L., Buades, A., Collins, D., Robles, M.: Diffusion weighted image denoising using overcomplete local PCA. PLoS ONE 8(9) (2013). DOI 10.1371/journal.pone.0073021
15. Manjón, J.V., Coupé, P., Martí-Bonmatí, L., Collins, D.L., Robles, M.: Adaptive non-local means denoising of MR images with spatially varying noise levels. Journal of Magnetic Resonance Imaging 31(1), 192–203 (2010). DOI 10.1002/jmri.22003
16. Muresan, D.D., Parks, T.W.: Orthogonal, exactly periodic subspace decomposition. IEEE Transactions on Signal Processing 51(9), 2270–2279 (2003). DOI 10.1109/TSP.2003.815381
17. Palma, G., Piccialli, F., Michele, P.D., Cuomo, S., Comerci, M., Borrelli, P., Alfano, B.: 3D non-local means denoising via multi-GPU. In: Proceedings of the 2013 Federated Conference on Computer Science and Information Systems, Kraków, Poland, September 8–11, 2013, pp. 495–498 (2013)
18. Poon, P., Wei-Ren, N., Sridharan, V.: Image denoising with singular value decomposition and principal component analysis. http://www.u.arizona.edu/~ppoon/ImageDenoisingWithSVD.pdf (2009)
19. Yang, J.F., Lu, C.L.: Combined techniques of singular value decomposition and vector quantization for image coding. IEEE Transactions on Image Processing 4(8), 1141–1146 (1995). DOI 10.1109/83.403419
A machine learning approach for predictive
maintenance for mobile phone service providers
A. Corazza, F. Isgrò, L. Longobardo, R. Prevete
Abstract The problem of predictive maintenance is a crucial one for every technological company. This is particularly true for mobile phone service providers, as mobile phone networks require continuous monitoring. The ability to foresee malfunctions is crucial to reduce maintenance costs and the loss of customers. In this paper we describe a preliminary study on predicting failures in a mobile phone network based on the analysis of real data. A ridge regression classifier has been adopted as the machine learning engine, and interesting and promising conclusions were drawn from the experimental data.
1 Introduction
A large portion of the total operating costs of any industry or service provider is devoted to keeping machinery and instruments up to a good standard, aiming to ensure minimal disruption of the production line. It has been estimated that the cost of maintenance is in the range of 15–60% of the cost of goods produced [14]. Moreover, about one third of maintenance costs is spent on unnecessary maintenance; as an example, for U.S. industry alone this amounts to $60 billion spent each year on unnecessary work. On the other hand, ineffective maintenance can cause further losses in the production line when a failure presents itself.
Anna Corazza
DIETI, Università di Napoli Federico II, e-mail: anna.corazza@unina.it
Francesco Isgrò
DIETI, Università di Napoli Federico II, e-mail: francesco.isgro@unina.it
Luca Longobardo
DIETI, Università di Napoli Federico II, e-mail: luc.longobardo@studenti.unina.it
Roberto Prevete
DIETI, Università di Napoli Federico II, e-mail: roberto.prevete@unina.it
© Springer International Publishing AG 2017
F. Xhafa et al. (eds.), Advances on P2P, Parallel, Grid, Cloud
and Internet Computing, Lecture Notes on Data Engineering
and Communications Technologies 1, DOI 10.1007/978-3-319-49109-7_69
Predictive maintenance [14, 10] attempts to minimise the costs due to failures via regular monitoring of the condition of the machinery and instruments. The observation returns a set of features from which it is possible, in some way, to infer whether the apparatus is likely to fail in the near future. The nature of the features depends, of course, on the apparatus being inspected. How far in the future the failure can be predicted also depends on the problem, although we can state, as a general rule, that the sooner a failure can be predicted, the better it is in terms of effective maintenance.
In general the prediction is based on some empirical rule [23, 17, 19], but over the last decade there has been some work devoted to applying machine learning techniques [6, 22, 5] to the task of predicting possible failures of the apparatus. For instance, a Bayesian network was adopted in [9] for a prototype system designed for the predictive maintenance of non-critical apparatus (e.g., elevators). In [12] different kinds of analysis for dimensionality reduction and support vector machines [7] have been applied to rail networks. Time series analysis was adopted in [13] for link quality prediction in wireless networks. In a recent work [18], the use of multiple classifiers for providing different performance estimates has been proposed.
An area where disruption of service can have a huge impact on company sales and/or customer satisfaction is that of mobile phone service providers [8, 4]. The context considered in this work is the predictive maintenance of a national mobile phone network, that is, being able to foresee well in advance whether a cell of the network is going to fail. This is very important, as the failure of a cell can have a huge impact on the users' quality of experience [11], and preventing failures makes it less likely that a user decides to change service provider.
In this paper we present a preliminary analysis on the use of a machine learning paradigm for the prediction of a failure of a cell of a mobile phone network. The aim is to predict the failure sufficiently in advance that no disruption of the service occurs, say at least a few hours ahead. A failure is reported as one of a set of features that are measured every quarter of an hour. The task is then to predict the status of the feature reporting the failure within a certain amount of time. As in many other predictive maintenance problems, we are dealing with a very large number of sensors [15].
The paper is organised as follows. The next section describes the data used and reports some interesting properties of the data that were helpful in designing the machine learning engine. The proposed failure prediction model is discussed in Section 3, together with some experimental results. Section 4 is left to some final remarks.
2 Data analysis
To predict failures, the considered data is obtained by monitoring the state of base transceiver stations (also known as cells) in a telecommunication network over a one-month (31-day) time span. A cell represents the unit of a telecommunication network in the tackled case study. Cells are grouped into antennas, so that one antenna
can contain several cells. The goal for the problem is to predict a malfunctioning
(pointed out by an alarm signal originated from cells) in a cell.
Furthermore, information about the geographical location of the cell can be relevant. When the Italian peninsula is considered, the total number of cells amounts to nearly 52,000, and the total number of measurements amounts to more than 150 million tuples.
Several kinds of statistical analysis were performed to explore the data, and some interesting key points and critical issues emerged from this analysis.
First of all, more than 60% of the cells did not show any alarm signal. This is quite usual behavior, as the system works smoothly most of the time. Even when such cases are excluded, the average number of alarms per cell is only 3 per month. In order to obtain a data set meaningful enough for statistical analysis, only cells with at least 6 alarms have been kept: in this way the number of cells is further reduced to fewer than 2,000. Moreover, among the remaining cells the disproportion between alarm tuples and non-alarm tuples remains high, as the former represent barely 1% of the total. However, we considered this acceptable, as malfunctions must be considered unlikely to happen. In the end, this imbalance strongly influences the pool of useful input data and must be faced with an adequate strategy.
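One adequate strategy, sketched below as a hypothetical example (the text does not specify which strategy was ultimately adopted), is to downsample the non-alarm majority class to a fixed ratio before training:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for the cell tuples: roughly 1% alarms, as reported above.
y = (rng.random(10_000) < 0.01).astype(int)

alarm_idx = np.flatnonzero(y == 1)
normal_idx = np.flatnonzero(y == 0)

# Keep every alarm tuple and a random 5:1 sample of non-alarm tuples,
# so the classifier is not swamped by the majority class.
keep = rng.choice(normal_idx, size=5 * len(alarm_idx), replace=False)
balanced_idx = np.sort(np.concatenate([alarm_idx, keep]))
print(len(alarm_idx), len(balanced_idx))
```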
Another critical issue regards the presence of several non-numeric values spread among the tuples. There are four different undefined values, among which INF is the most frequent; indeed, INF is the second most frequent value across all fields. All in all, discarding tuples containing undefined values in some of the fields would cut out 80% of the data, leaving us with too small a dataset. We therefore had to find a different strategy to face the problem.
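One such strategy could be, for instance, to treat the undefined values as missing data and impute them rather than discard the tuples. The pandas sketch below is a hypothetical illustration only: the field names are invented, and the real dataset's imputation choice is not specified in the text.

```python
import numpy as np
import pandas as pd

# Hypothetical tuples with the kind of undefined values described above.
df = pd.DataFrame({
    "feature_1": [0.5, np.inf, 2.0, np.nan, 1.5],
    "feature_2": [1.0, 3.0, -np.inf, 4.0, 2.0],
})

# Map every undefined value to NaN, then impute with the column median,
# so no tuple has to be discarded.
df = df.replace([np.inf, -np.inf], np.nan)
df = df.fillna(df.median())
print(df.isna().sum().sum())  # 0 undefined values remain
```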
As already stated, another problematic issue regards the temporal dimension of the data. The time span is only one month, which for a time-series-related problem is not much, starting with the very basic problem of properly splitting the data into training and test sets.
Another key point regarding the data was found by looking at scatter plots between pairs of features. These diagrams highlighted two aspects: the first is that alarm occurrences seem related to the values of some specific features; the second is that two of the features are identical. Since we have no information about the meaning of the various features, we cannot tell whether this is an error in the supplied data.
In addition, other statistical analyses were performed, focusing mainly on the values assumed by the features. Average values, along with standard deviations, are summarized in Figure 1.
Inspecting Figure 1 we can see that FEATURE 7 and FEATURE 2 show a significant difference in terms of average value between alarm and non-alarm events, and can thus provide a good starting point for a machine learning approach. Moreover, we can split the features into three different groups.
Fig. 1 Average values for the features in a stable or alarm situation. The line on every bar represents the standard deviation
Fig. 2 Pearson correlation coefﬁcient between features and alarm indicator
1. The first group is composed of FEATURE 4, FEATURE 8, and FEATURE 9. These features have roughly constant values across the three conditions but also a relatively high standard deviation.
2. The second group is composed of FEATURE 6, FEATURE 3, FEATURE 5, and FEATURE 1. Also in this case, the features tend to have constant values across all three situations, but with a relatively lower standard deviation.
3. The third group is composed of FEATURE 7 and FEATURE 2. These features show a large difference in terms of both average value and standard deviation between alarm and non-alarm situations.
To better analyse the differences between alarm and non-alarm situations, Pearson correlation coefficients have been calculated between each feature and the alarm indicator. Results are shown in Figure 2, and confirm that FEATURE 2 and FEATURE 7 appear to be the most related to an alarm occurrence, although the correlation value is always lower than 0.2.
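The per-feature correlation can be computed directly; the snippet below shows the calculation on synthetic data, where the weak dependence between feature and alarm is injected artificially, since the real features are not available here:

```python
import numpy as np

rng = np.random.default_rng(0)
alarm = (rng.random(5_000) < 0.05).astype(float)  # binary alarm indicator
# A feature weakly tied to the alarm state, plus unit-variance noise.
feature = 0.3 * alarm + rng.standard_normal(5_000)

# Pearson correlation between the feature and the alarm indicator.
r = np.corrcoef(feature, alarm)[0, 1]
print(round(r, 3))
```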
Last but not least, the alarm propagation effect has been analyzed, to check whether an alarm occurring in a cell is correlated with alarms in nearby cells. The results in Figure 3 show that this is the case only for cells belonging to the same antenna. In general, when the distance increases and cells of different antennas are considered, the probability of co-occurring alarms drops close to 0. We can therefore conclude that, according to our data, there is no propagation effect.
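The propagation check can be phrased as a co-occurrence count over the 15-minute measurement slots. The toy pandas sketch below uses hypothetical cell identifiers and slots purely to show the shape of the calculation:

```python
import pandas as pd

# Hypothetical alarm log: alarming cell, its antenna, and the 15-minute slot.
alarms = pd.DataFrame({
    "antenna": ["A", "A", "B", "C"],
    "cell":    ["A1", "A2", "B1", "C1"],
    "slot":    [10, 10, 10, 42],
})

# Two alarms co-occur when they share a slot; keep each unordered pair once.
pairs = alarms.merge(alarms, on="slot")
pairs = pairs[pairs["cell_x"] < pairs["cell_y"]]

# Fraction of co-occurring alarm pairs that sit on the same antenna.
same_antenna_rate = (pairs["antenna_x"] == pairs["antenna_y"]).mean()
print(same_antenna_rate)
```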