

S. Cuomo et al.

Fig. 1 A CPU multi-threaded approach on a set of small matrices. Each thread performs an SVD computation on a matrix Xi of small size. Depending on the CPU caching capabilities, performance can be enhanced for small problem sizes.

Fig. 2 CUDA streams approach to multiple SVD problems. Each stream performs an SVD decomposition on a small matrix, and a pipeline is executed using multiple streams. In this example, 3 streams are used to compute 3 SVD decompositions. The upper side of the image shows a CPU executing 3 SVDs; the bottom side shows 3 streams working together on the same problem.

GPU Profiling of Singular Value Decomposition in OLPCA …


Listing 1 Calculate the size of the pre-allocated buffer

cusolverStatus_t cuSolverStatus;
cusolverDnHandle_t cuSolverHandle;
cusolverDnCreate(&cuSolverHandle);

int Lwork;
cuSolverStatus = cusolverDnSgesvd_bufferSize(cuSolverHandle, M, N, &Lwork);

float *Work;
cudaMalloc(&Work, Lwork*sizeof(*Work));

float *rwork;
cudaMalloc(&rwork, M*M*sizeof(*rwork));

However, it is necessary to be careful in choosing a parallel approach to tackle this problem. Although it is possible to achieve good performance and speed-ups by parallelizing the SVD computation, for instance by using a custom kernel or a library of choice (cuBLAS, CULA or others), we stress that in this case all matrices Xi are very small (square matrices of order 10 to 14). Performing an SVD decomposition, which in the truncated case has a computational complexity of O(min(mn², m²n)), on such small matrices on a CPU requires only on the order of a thousand FLOPs, allowing a solution to be computed in a matter of milliseconds.
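The complexity bound above can be made concrete with a quick back-of-the-envelope calculation (a sketch in Python for illustration only; the original code is CUDA C, and the constant factors hidden by the O-notation are ignored):

```python
def svd_flops(m, n):
    """Leading-order operation count for a truncated SVD: O(min(m*n^2, m^2*n))."""
    return min(m * n * n, m * m * n)

# The matrices X_i in this case study are square of order 10 to 14.
for d in range(10, 15):
    print(d, svd_flops(d, d))  # 10 -> 1000, ..., 14 -> 2744
```

Even for the largest case the bound stays below 3,000 operations, which a modern CPU core executes in well under a microsecond, while merely launching a GPU kernel typically costs several microseconds.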

The process of initializing a GPU and then performing an SVD decomposition requires more time than simply performing it on a CPU. For instance, the theoretical number of threads that could be used for a 5 × 5 matrix is 25. In practice, the number of simultaneous threads per matrix is 1 or 5, meaning that many thousands of simultaneous matrices are needed to obtain a sufficient number of threads. To justify a GPU computation, more operations should be performed, for which thousands of simultaneous threads are preferable in order to achieve a noticeable difference in overall performance.
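To put numbers on this (an illustrative sketch; the resident-thread target is an assumption, not a figure from the text): if a GPU needs on the order of ten thousand resident threads to hide memory latency, and each small matrix occupies only 1 or 5 threads, the number of simultaneous matrices required is easily in the thousands:

```python
def matrices_needed(target_threads, threads_per_matrix):
    # Ceiling division: how many simultaneous matrices reach the thread target.
    return -(-target_threads // threads_per_matrix)

print(matrices_needed(10_000, 5))  # 2000 matrices at 5 threads each
print(matrices_needed(10_000, 1))  # 10000 matrices at 1 thread each
```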

Theoretically, one CUDA stream per problem could be used, solving one problem at a time, as depicted in Fig. 2. However, two issues arise:

• the number of threads per block would be very low. Typically, no fewer than 32 threads per block (the warp size, effectively the SIMD width, of all current CUDA-capable GPUs is 32 threads) are preferred to fully exploit the potential of a GPU card, and this is not practical, as previously mentioned;

• the prepare-launch code would be more expensive than just performing the operation on a CPU, and the resulting overhead would be unacceptable.
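The first issue can be quantified: thread blocks are scheduled in whole warps of 32 threads, so a block sized to one small matrix wastes most of each warp (a minimal sketch, with the warp size hard-coded as on current CUDA GPUs):

```python
WARP_SIZE = 32

def warp_utilization(threads_per_block):
    """Fraction of scheduled lanes doing useful work once a block is padded to whole warps."""
    warps = -(-threads_per_block // WARP_SIZE)  # ceiling division
    return threads_per_block / (warps * WARP_SIZE)

# A 5x5 matrix mapped to 25 threads still occupies one full warp:
print(warp_utilization(25))  # 0.78125
# One thread per matrix is far worse:
print(warp_utilization(1))   # 0.03125
```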

NVIDIA cuBLAS has introduced a batch-mode feature, but this is only available for the multiply and inverse routines and the QR decomposition. It is intended for solving a specific class of problems by allowing a user to request a solution for a group of identical problems (a SIMD operation in this case). The cuBLAS batch interface is intended for specific circumstances with small problems (matrix size < 32) and very large batch sizes (number of matrices > 1000). Our case study falls into this category, but an SVD decomposition would have to be derived and implemented on top of the batch QR decomposition, which goes beyond the scope of this case study. Moreover, using QR iterations to obtain an SVD decomposition would result in an algorithm slower than the LAPACK-optimized SVD.

Listing 2 An example of a cuBLAS SVD routine call. Profiling events registration is included.

cudaEventRecord(start, 0);
cuSolverStatus = cusolverDnSgesvd(cuSolverHandle, 'A', 'A', M, N,
    devA, M, devS, devU, M, devV, N, Work, Lwork, rwork, devInfo);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

cudaEventElapsedTime(&executionTime, start, stop);

Listing 3 An example of an SVD routine call using CULA. Profiling event registration is included.

cudaEventRecord(start, 0);
status = culaSgesvd('A', 'A', M, N, A, lda, S, U, ldu, V, ldvt);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

cudaEventElapsedTime(&executionTime, start, stop);

The GPU implementation is based on two libraries: NVIDIA cuBLAS and CULA. Here we report a sample code for each library. In Listings 1 and 2, we exploit the cuSOLVER APIs. cuSOLVER is a high-level package based on cuBLAS and cuSPARSE that combines three separate libraries. Here we make use of the first part of cuSOLVER, called cuSolverDN, which deals with dense matrix factorization and solve routines such as LU, QR and SVD. We use the S data type for real-valued single-precision inputs.

In Listing 1, we use a helper function to calculate the size needed for the pre-allocated buffer, Lwork. Then, in Listing 2, the cusolverDnSgesvd function computes the SVD of an m × n matrix devA (the prefix dev indicates a matrix placed on the device) and the corresponding left and/or right singular vectors. The working space needed is pointed to by the Work parameter, declared as a vector of float, while the size of the working space is specified in Lwork, as calculated in Listing 1.

Finally, in Listing 3, we report another test code based on the CULA library. Here, the SVD functionality is implemented in the gesvd routine. Initialization and the actual routine call are simpler than with the cuBLAS and cuSOLVER libraries, as it is not necessary to perform any kind of preallocation. A matrix A is passed to the culaSgesvd routine and the result is stored in the S, U and V matrices. Again, we use the S data type for real-valued single-precision inputs. CULA also provides a version of its routines that works directly on the device: in this case, culaDeviceSgesvd would be used instead of culaSgesvd, and all input and output matrices would need to be properly moved from host to device and vice versa.

4 Performance Analysis and Profiling

To support our analysis, we report a test for a group of SVD decompositions. In Fig. 3 we compare the execution times of an SVD decomposition performed on a CPU (Intel Core i7 860 @ 2.80 GHz) and a GPU (NVIDIA Quadro K5000, 4 GB). In the GPU version, we perform an SVD decomposition on a random test matrix and compare the two libraries, CULA and cuBLAS. GPU performance gains are noticeable when the size of a square input matrix increases. In fact, the CPU version performs better when the problem size is smaller than 256 × 256 (Fig. 4). These results show how the GPU overhead has an important impact on performance for very small matrices. Moreover, it naturally follows that calling an SVD routine of a chosen library several thousands of times affects the overall performance. In our tests, issuing several SVD decompositions using CULA or cuBLAS resulted in a stalling computation.

The CPU version we use in this comparative study is based on the SVD implementation provided by the OpenCV library. We use the SVD::compute routine, which stores the results in user-provided matrices, similarly to the cuBLAS and CULA routines. This choice is mainly motivated by the intention of testing state-of-the-art and/or production-ready routines used in many image processing applications, and of investigating the impact each module, CPU- or GPU-based, has on algorithms that specifically need to manipulate a considerable number of small matrices.

Fig. 3 CPU and GPU execution time comparison for square matrices.

Fig. 4 CPU and GPU execution time comparison for square matrices. A section of Fig. 3, shown in order to highlight the performance scatter for small problem sizes. As shown, CPU performance is still preferable for small problem sizes.

5 Conclusion

In many image processing methods, heavy computational kernels are implemented using suitable methods and scientific computing strategies. We propose a solution based on optimized standard scientific libraries in order to improve the performance of some tasks in the denoising OLPCA algorithm. In future work we will develop a more optimized parallel software for this algorithm.


The work is partially funded by the Big4H (Big data analytics for e-Health applications) Project, Bando Sportello dell'Innovazione - Progetti di trasferimento tecnologico cooperativi e di prima industrializzazione per le imprese innovative ad alto potenziale - of Regione Campania.



A machine learning approach for predictive maintenance for mobile phones service providers

A. Corazza, F. Isgrò, L. Longobardo, R. Prevete

Abstract The problem of predictive maintenance is a crucial one for every technological company. This is particularly true for mobile phone service providers, as mobile phone networks require continuous monitoring. The ability to foresee malfunctions is crucial to reduce maintenance costs and the loss of customers. In this paper we describe a preliminary study on predicting failures in a mobile phone network based on the analysis of real data. A ridge regression classifier has been adopted as the machine learning engine, and interesting and promising conclusions were drawn from the experimental data.

1 Introduction

A large portion of the total operating costs of any industry or service provider is devoted to keeping machinery and instruments in good condition, aiming to ensure minimal disruption of the production line. It has been estimated that the cost of maintenance is in the range of 15-60% of the cost of the goods produced [14]. Moreover, about one third of the maintenance costs is spent on unnecessary maintenance; as an example, for the U.S. industry alone this amounts to $60 billion spent each year on unnecessary work. On the other hand, ineffective maintenance can cause further losses in the production line when a failure presents itself.

Anna Corazza
DIETI, Università di Napoli Federico II, e-mail: anna.corazza@unina.it

Francesco Isgrò
DIETI, Università di Napoli Federico II, e-mail: francesco.isgro@unina.it

Luca Longobardo
DIETI, Università di Napoli Federico II, e-mail: luc.longobardo@studenti.unina.it

Roberto Prevete
DIETI, Università di Napoli Federico II, e-mail: roberto.prevete@unina.it

© Springer International Publishing AG 2017

F. Xhafa et al. (eds.), Advances on P2P, Parallel, Grid, Cloud

and Internet Computing, Lecture Notes on Data Engineering

and Communications Technologies 1, DOI 10.1007/978-3-319-49109-7_69



A. Corazza et al.

Predictive maintenance [14, 10] attempts to minimise the costs due to failures via regular monitoring of the condition of the machinery and instruments. The observations return a set of features from which it is possible, in some way, to infer whether the apparatus is likely to fail in the near future. The nature of the features depends, of course, on the apparatus being inspected. How far in the future the failure will arise also depends on the problem, although we can state, as a general rule, that the sooner a failure can be predicted, the better it is in terms of effective maintenance.

In general the prediction is based on some empirical rule [23, 17, 19], but over the last decade there has been some work devoted to applying machine learning techniques [6, 22, 5] to the task of predicting the possible failure of the apparatus. For instance, a Bayesian network was adopted in [9] for a prototype system designed for the predictive maintenance of non-critical apparatus (e.g., elevators). In [12] different kinds of analysis for dimensionality reduction, together with support vector machines [7], were applied to rail networks. Time series analysis was adopted in [13] for link quality prediction in wireless networks. In a recent work [18], the use of multiple classifiers for providing different performance estimates was proposed.

An area where disruption of service can have a huge impact on company sales and/or customer satisfaction is that of mobile phone service providers [8, 4]. The context considered in this work is the predictive maintenance of a national mobile phone network, that is, being able to foresee well in advance whether a cell of the network is going to fail. This is very important, as the failure of a cell can have a huge impact on the users' quality of experience [11], and preventing such failures makes it less likely that users will decide to change service provider.

In this paper we present a preliminary analysis of the use of a machine learning paradigm for the prediction of a failure of a cell in a mobile phone network. The aim is to predict the failure sufficiently in advance that no disruption of the service will occur, let us say at least a few hours in advance. A failure is reported among a set of features that are measured every quarter of an hour. The task is then to predict the status of the feature reporting the failure within a certain amount of time. As in many other predictive maintenance problems, we are dealing with a very large number of sensors [15].
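The prediction task described above can be sketched as a labelling problem over the quarter-hour time series (illustrative Python only; the data layout and the one-hour lead time are assumptions for the example, not details taken from the data set):

```python
def label_samples(samples, lead_steps):
    """samples: list of (alarm, features) tuples in time order, one per 15-minute slot.
    A sample gets label 1 if an alarm occurs within the next lead_steps slots."""
    labels = []
    for i in range(len(samples)):
        window = samples[i + 1 : i + 1 + lead_steps]
        labels.append(1 if any(alarm for alarm, _ in window) else 0)
    return labels

# Predicting 1 hour ahead means 4 quarter-hour steps.
series = [(0, None)] * 6 + [(1, None)] + [(0, None)] * 3
print(label_samples(series, 4))  # [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
```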

The paper is organised as follows. The next section describes the data we used and reports some interesting properties of the data that have been helpful in designing the machine learning engine. The failure prediction model proposed is discussed in Section 3, together with some experimental results. Section 4 is left to some final remarks.

2 Data analysis

To predict failures, the data considered are obtained by monitoring the state of the base transceiver stations (also known as cells) of a telecommunication network during a one-month (31 days) time span. A cell represents the unit of a telecommunication network in the case study tackled here. Cells are grouped into antennas, so that one antenna can contain several cells. The goal is to predict a malfunction (pointed out by an alarm signal originating from the cells) in a cell.

Furthermore, information about the geographical location of the cell can be relevant. When the Italian peninsula is considered, the total number of cells amounts to nearly 52,000. Considering the total number of measurements, we get more than 150 million tuples.
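These two figures are mutually consistent: 52,000 cells sampled every quarter of an hour for 31 days give (a quick arithmetic check; the per-cell sampling rate is the measurement interval stated in the introduction):

```python
cells = 52_000
days = 31
samples_per_day = 24 * 4          # one measurement every quarter of an hour

tuples = cells * days * samples_per_day
print(tuples)                     # 154752000, i.e. more than 150 million
```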

Several kinds of statistical analysis were performed to explore the data, and some interesting key points and critical issues emerged from this analysis.

First of all, more than 60% of the cells did not show any alarm signal. This is quite usual behavior, as the system works smoothly most of the time. Even when such cases are excluded, the average number of alarms per cell is only 3 per month. In order to obtain a data set meaningful enough for a statistical analysis, only cells with at least 6 alarms have been kept: in this way the number of cells is further reduced to fewer than 2,000. Moreover, among the remaining cells the imbalance between alarm and non-alarm tuples still remains high, as the former represent barely 1% of the total. However, we considered this acceptable, as malfunctions must be considered unlikely to happen. In the end, this imbalance strongly influences the pool size of useful input data, and must be faced with an adequate strategy.

Another critical issue regards the presence of several non-numeric values spread among the tuples. There are four different undefined values, among which INF values are the most frequent. Indeed, INF is the second most frequent value across all fields. All in all, discarding tuples containing undefined values in some of the fields would cut out 80% of the data, leaving us with too small a dataset. We therefore had to find a different strategy to face the problem.
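The text does not say which strategy was finally adopted; one common alternative to discarding tuples (shown here purely as a hypothetical example, not the authors' method) is to replace the undefined markers with the per-field median of the defined values:

```python
# Hypothetical marker set: the text only says there are four undefined values.
UNDEFINED = {"INF", "NAN", "NULL", "N/A"}

def impute_column(values):
    """Replace undefined markers with the median of the defined entries."""
    defined = sorted(float(v) for v in values if v not in UNDEFINED)
    mid = len(defined) // 2
    median = (defined[mid] if len(defined) % 2
              else (defined[mid - 1] + defined[mid]) / 2)
    return [median if v in UNDEFINED else float(v) for v in values]

print(impute_column(["1.0", "INF", "3.0", "2.0"]))  # [1.0, 2.0, 3.0, 2.0]
```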

As already stated, another problematic issue regards the temporal dimension of the data. The time span is only one month, which for a time-series-related problem is not much, beginning with the very basic problem of properly splitting the data into training and test sets.

Another key point regarding the data was found by looking at scatter plots of pairs of features. These diagrams highlighted two aspects: the first is that alarm occurrences seem related to the values of some specific features; the second is that there are two identical features. Since we have no information about the meaning of the various features, we cannot tell whether this is an error in the supplied data.

In addition to these, other statistical analyses were performed, focusing mainly on the values assumed by the features. Average values are summarized, along with standard deviations, in Figure 1.

Inspecting Figure 1, we can see that FEATURE 7 and FEATURE 2 show a significant difference in terms of average value between alarm and non-alarm events, and can thus provide a good starting point for a machine learning approach. Moreover, we can split the features into three different groups.


A. Corazza et al.

Fig. 1 Average values of the features in stable and alarm situations. The line on every bar represents the standard deviation.

Fig. 2 Pearson correlation coefficient between each feature and the alarm indicator.

1. The first group is composed of FEATURE 4, FEATURE 8, and FEATURE 9. These features have constant values in all three conditions, but also a relatively high standard deviation.

2. The second group is composed of FEATURE 6, FEATURE 3, FEATURE 5, and FEATURE 1. Also in this case the features tend to have constant values in all three situations, but with a relatively lower standard deviation.

3. The third group is composed of FEATURE 7 and FEATURE 2. These features show a large difference, in terms of both average value and standard deviation, between alarm and non-alarm situations.

To better analyse the differences between alarm and non-alarm situations, the Pearson correlation coefficient between each feature and the alarm indicator has been calculated. The results, shown in Figure 2, confirm that FEATURE 2 and FEATURE 7 appear to be the most related to an alarm occurrence, although the correlation value is always lower than 0.2.
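The correlation against a binary alarm indicator is the ordinary Pearson coefficient (equivalently, the point-biserial correlation); a self-contained version, for illustration:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

feature = [0.1, 0.2, 0.15, 0.9, 0.85]
alarm   = [0,   0,   0,    1,   1]     # binary alarm indicator
print(round(pearson(feature, alarm), 3))
```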

Last but not least, the alarm propagation effect has been analyzed, to check whether an alarm occurring in a cell is correlated with alarms in nearby cells. The results in Figure 3 show that this is the case only for cells belonging to the same antenna. In general, when the distance increases and cells of different antennas are considered, the probability of co-occurring alarms drops close to 0. We can therefore conclude that, according to our data, there is no propagation effect.
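The propagation check can be sketched as a co-occurrence rate conditioned on whether two alarming cells share an antenna (illustrative code; the grouping of alarm events by time slot is an assumed representation of the data):

```python
from itertools import combinations

def cooccurrence_rate(alarms, antenna_of, same_antenna=True):
    """alarms: {time_slot: set of cells alarming in that slot}.
    Fraction of per-slot alarming-cell pairs that do (or do not) share an antenna."""
    hits = pairs = 0
    for cells in alarms.values():
        for a, b in combinations(sorted(cells), 2):
            pairs += 1
            if (antenna_of[a] == antenna_of[b]) == same_antenna:
                hits += 1
    return hits / pairs if pairs else 0.0

antenna_of = {"c1": "A", "c2": "A", "c3": "B"}
alarms = {0: {"c1", "c2"}, 1: {"c1", "c2", "c3"}, 2: {"c3"}}
print(cooccurrence_rate(alarms, antenna_of, same_antenna=True))  # 0.5
```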
