2 CUDA: Compute Unified Device Architecture



T. Turker et al.

Fig. 1. CUDA execution model.

Fig. 2. CUDA device memory model.

Many-thread GPUs and general-purpose multicore CPUs follow different design principles, as pointed out by Kirk et al. [6]. While CPUs are designed to minimize the execution latency of a single thread, GPUs are designed as throughput-oriented devices that maximize the total execution throughput of a large number of threads. GPUs therefore perform poorly on tasks that need only one or very few threads; such tasks are better suited to CPUs. It is thus important for a program to use both the CPU and the GPU for better resource utilization. To address this need, CUDA provides a heterogeneous execution model, presented in Fig. 1, which enables the execution of compute-intensive parts on the GPU and sequential parts on the CPU.

A typical CUDA-capable GPU is organized as a set of highly threaded streaming multiprocessors (SMs) which perform the actual computations. The number of SMs in CUDA GPUs varies from generation to generation. Each SM contains a number of streaming processors (SPs), also called CUDA cores. Each SM has its own control units, execution pipelines, registers and caches, which are shared by these CUDA cores. Another important element of a CUDA GPU is the GDDR DRAM, or global memory, which is separate from the system DRAM and is used as a high-bandwidth off-chip memory for computation.

In the CUDA programming model, threads are grouped into blocks, and blocks are grouped into a grid. Fig. 1 illustrates the execution of two grids, each composed of a number of thread blocks. As shown in Fig. 2, CUDA offers additional memory types that can reduce the number of requests to global memory and so prevent access congestion. Global and constant memories can be accessed by the host using CUDA API functions. These two memory types can also be accessed by all threads during execution. Global memory, as mentioned before, is typically dynamic random access memory (DRAM) and tends to have long access latencies. Constant memory, on the other hand, provides shorter-latency read-only access by the device for cases in which all threads access the same location. In addition to these two off-chip memories, registers and shared memory provide very high-speed access. Registers are used by individual threads

GPU-Accelerated Flight Route Planning for Multi-UAV Systems


for keeping frequently accessed variables and can be accessed only by the thread that owns them. Lastly, shared memory is used by thread blocks for cooperation and data sharing among their threads. In order to develop high-performance parallel applications, it is important to optimize the kernel's memory access pattern [9].

In the literature, there are a number of studies on UAV route planning using the CUDA platform. Cekmez et al. provide efficient UAV path planning approaches on the CUDA platform using both Ant Colony Optimization [1] and Genetic Algorithms [2]. Another study on flight route planning using a parallel genetic algorithm on NVIDIA GPUs is provided in [11]. The results of all these studies show that parallel CUDA implementations of these algorithms offer efficient new ways to solve the route planning problem in the context of the TSP. However, none of these studies considers a multi-UAV system in the context of the mTSP using CUDA.




Solution Representation

In this study, the multi-chromosome genetic representation technique proposed by

Király et al. [5] is used to represent solutions in the algorithm. Figure 3 illustrates

this representation for 11 waypoints and 3 UAVs. Node H in the graph shows the

Home which represents the single take-off and landing point for all UAVs in the

system. It is, therefore, important to note that numbers 0, 1, and 2 used in the solution are reserved for 3 UAVs in the system, and each of them indicates the same

take-off and landing point, Home. A solution is formed by individual routes for

each UAV as it is illustrated by merging Route-0, Route-1, and Route-2 in Fig. 3,

so each element in the solution denotes a geographical position to be visited.

Fig. 3. Solution representation for a multi-UAV system with 3 UAVs and 11 waypoints.
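The representation can be illustrated with a minimal Python sketch. It assumes, following the description above, that the values 0 to num_uavs - 1 are the Home markers opening each route and that every larger value is a waypoint id. The function name `split_routes`, the requirement that the vector starts with a marker, and the example vector are illustrative assumptions and are not taken from Fig. 3.

```python
def split_routes(solution, num_uavs):
    """Split a merged solution vector into one route per UAV.

    Assumption: values 0..num_uavs-1 are Home markers that open a route;
    all other values are waypoint ids. The vector must start with a marker.
    """
    if not solution or solution[0] >= num_uavs:
        raise ValueError("solution must start with a Home marker")
    routes = {}
    current = None
    for value in solution:
        if value < num_uavs:      # a Home marker starts a new route
            current = value
            routes[current] = []
        else:                     # a waypoint joins the route that is open
            routes[current].append(value)
    return routes


# Hypothetical vector: 3 UAVs (markers 0, 1, 2) and 11 waypoints (ids 3..13).
example = [0, 5, 9, 3, 12, 1, 7, 4, 13, 2, 6, 11, 8, 10]
routes = split_routes(example, num_uavs=3)
```

Each route implicitly starts and ends at Home, so the markers themselves are not stored inside the per-UAV lists.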




Initial Solution Construction

As mentioned in Sect. 1, in addition to finding the route with minimum total travel cost, we want the individual routes to have cost values that are as close to each other as possible. To this end, we construct the initial solution using a simple heuristic. If the take-off and landing points of the UAVs are placed at indices that are too close to each other in the solution vector, the search is unlikely to start from a solution with a good cost value. We therefore build the initial solution by placing the take-off points of the UAVs equidistantly in the solution vector, in the hope of starting from a solution of better quality.
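One possible reading of this heuristic is sketched below in Python. The marker values, the random ordering of the waypoints, and the exact index formula are assumptions made for illustration; the text only states that the take-off points are placed equidistantly.

```python
import random


def initial_solution(num_waypoints, num_uavs, seed=None):
    """Build a start vector with Home markers spaced evenly.

    Assumptions: markers are 0..num_uavs-1, waypoint ids follow them, and
    waypoints are shuffled at random (their order is not specified).
    """
    rng = random.Random(seed)
    waypoints = list(range(num_uavs, num_uavs + num_waypoints))
    rng.shuffle(waypoints)
    length = num_waypoints + num_uavs
    # Marker k is placed at index floor(k * length / num_uavs).
    marker_at = {(k * length) // num_uavs: k for k in range(num_uavs)}
    remaining = iter(waypoints)
    return [marker_at[i] if i in marker_at else next(remaining)
            for i in range(length)]


start = initial_solution(num_waypoints=11, num_uavs=3, seed=0)
```

For 11 waypoints and 3 UAVs the markers land at indices 0, 4 and 9, so the three initial routes contain 3, 4 and 4 waypoints.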


Objective Function

Searching for the best route for a single UAV typically means searching for the shortest route. For a multi-UAV system, the total travel cost is to be minimized instead. In addition, this study seeks a fair distribution of routes in terms of the individual route costs of the UAVs. That is, it does not focus only on minimizing the total travel cost of the multi-UAV system; the individual flight routes should also be as close as possible to each other in cost. To enforce this additional requirement, the method of Hou et al. [4] is used. Given a solution with total travel cost $C_T$, the difference $C_{max} - C_{min}$ between the individual flight routes with maximum and minimum cost is added to $C_T$. In this way, a greater difference between the minimum and maximum individual route costs yields a worse solution through a greater penalty, so such solutions are less preferred by the algorithm.
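The penalized objective can be sketched as follows. Euclidean distance between 2-D points is an assumption made here for illustration (the cost metric is not stated in this section), and the names `route_cost`, `fair_cost` and the coordinates are hypothetical.

```python
import math


def route_cost(route, home, coords):
    """Length of the closed tour Home -> waypoints -> Home."""
    stops = [home] + [coords[w] for w in route] + [home]
    return sum(math.dist(a, b) for a, b in zip(stops, stops[1:]))


def fair_cost(routes, home, coords):
    """Total travel cost C_T plus the penalty C_max - C_min."""
    costs = [route_cost(r, home, coords) for r in routes]
    return sum(costs) + (max(costs) - min(costs))


home = (0.0, 0.0)
coords = {3: (3.0, 0.0), 4: (0.0, 4.0), 5: (3.0, 4.0)}
balanced = fair_cost([[3], [4]], home, coords)        # costs 6 and 8
unbalanced = fair_cost([[3, 5, 4], []], home, coords)  # costs 14 and 0
```

In the second call one UAV flies everything and the other stays at Home, so the penalty equals the whole tour length and the solution scores much worse than its travel cost alone would suggest.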


GPU-Accelerated Simulated Annealing

The SA algorithm can be implemented using two different approaches, homogeneous and inhomogeneous. In this study, the homogeneous version is used because it allows control over the equilibrium state. The implementation of the parallel SA algorithm for route planning consists of two nested loops, as presented in Fig. 4. These are the Cooling Loop, which controls the temperature of the system, and the Equilibrium State Loop, which all GPU threads execute as a CUDA kernel to reach the equilibrium state.

The cooling loop, the outer loop, is controlled by the host using a static geometric cooling schedule. In each iteration, the temperature, the main control parameter of the algorithm, is decreased by multiplying it by a predefined cooling factor. In addition to the cooling schedule, this loop is also responsible for selecting the best solution among those found and returned by all CUDA threads. The best solution is cloned into an array whose size equals the number of threads in order to comply with the data-parallel computation mechanism of CUDA, since every CUDA thread should perform its computation on its own portion of the data to exploit data parallelism. This solution array, which consists of copies of the same best solution, and the decreased temperature are then sent to device global memory (GPU DRAM) by passing them as CUDA kernel parameters, as illustrated in Fig. 4. These operations are repeated in each iteration of the cooling loop until the defined stopping criterion is reached.

Fig. 4. Overview of parallel simulated annealing.

Algorithm 1. Cooling Iteration

 1: procedure Anneal(initialTemperature, initialConfiguration)    ▷ Runs on CPU
 2:     currentTemperature ← initialTemperature
 3:     currentConfig ← initialConfiguration
 4:     minConfig ← currentConfig; k ← 0;
 5:     while k < noOfOuterIteration do i ← 0;
 6:         while i < noOfThreads do
 7:             configs[i] ← currentConfig; i ← i + 1;
 8:         end while
 9:         Kernel<<<DimGrid, DimBlock>>>(configs, currentTemperature)
10:         currentConfig ← GetConfigWithMinCost(configs)
11:         if currentConfig.Cost < minConfig.Cost then
12:             minConfig ← currentConfig
13:         end if
14:         currentTemperature ← currentTemperature ∗ coolingFactor; k ← k + 1;
15:     end while
16: end procedure
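The control flow of Algorithm 1 can be mirrored in a short host-side Python sketch in which the CUDA launch is replaced by an ordinary function call. This is a serial emulation for illustration only, not the authors' implementation; the `kernel` callback, the toy cost function and the cooling factor of 0.9 are assumptions.

```python
import copy


def anneal(initial_temperature, initial_config, cost, kernel,
           outer_iterations, num_threads, cooling_factor):
    """Host-side cooling loop; `kernel` stands in for the CUDA launch."""
    temperature = initial_temperature
    current = initial_config
    best = current
    for _ in range(outer_iterations):
        # Clone the current solution once per (emulated) thread.
        configs = [copy.deepcopy(current) for _ in range(num_threads)]
        kernel(configs, temperature)          # updates configs in place
        current = min(configs, key=cost)      # GetConfigWithMinCost
        if cost(current) < cost(best):
            best = current
        temperature *= cooling_factor         # geometric cooling
    return best, temperature


# Toy usage: a "configuration" is a one-element list and its cost is |x|.
def toy_kernel(configs, temperature):
    for thread_id, conf in enumerate(configs):
        conf[0] -= thread_id % 2   # odd "threads" find a slightly better value


best, final_t = anneal(40.0, [5], cost=lambda c: abs(c[0]), kernel=toy_kernel,
                       outer_iterations=3, num_threads=2, cooling_factor=0.9)
```

With a geometric schedule the temperature after k iterations is the initial temperature multiplied by the cooling factor raised to the power k, here 40 · 0.9³.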

The Equilibrium State Loop (ESL), as a CUDA kernel, is executed by an array of threads in parallel on the GPU. This kernel is invoked at each cooling iteration, as shown in Algorithm 1 (Line 9). The ESL has two main responsibilities: neighbour generation and acceptance evaluation. As presented in Algorithm 2, the configurations and temperature parameters are provided by the cooling iteration. The solution array mentioned above appears as the parameter configurations in Algorithm 2. Each thread obtains its own solution conf from global memory into its local memory by using its unique id threadId as the index into configurations (Lines 2 and 3). Then the ESL starts generating and evaluating neighbours. At each equilibrium state iteration, a neighbour newConf is generated by swapping two randomly selected elements of the current solution conf (Line 6). The swap operation is constrained so that elements representing the Home point are never placed consecutively, because with the solution representation used here such a case would reduce the number of individual routes in the solution. Next, the difference between the costs of the two solutions is calculated for use in the Metropolis acceptance rule, which is the main algorithmic feature that lets SA escape from local optima. If the cost of the new solution is lower than that of the current solution, it is accepted directly as the current solution conf. If the newly generated neighbour is worse than the current solution conf, it can still be accepted as the current configuration, with a probability that depends on the cost difference between the two solutions and on the temperature, and that gradually decreases as the temperature falls. These operations are performed repeatedly in the ESL by all threads until a predefined number of iterations, noOfInnerIteration, is reached. At the end of the ESL, each thread has its own candidate solution (conf) in device global memory, as shown in Fig. 4, and the threads must wait for each other to synchronize before the device-to-host data transfer. After the candidate solutions generated by all threads are transferred from the GPU to the CPU, the host continues with its own operations in the cooling iteration by decreasing the temperature, selecting the best solution, and so on.

Algorithm 2. Equilibrium State Loop as a CUDA Kernel

 1: procedure Kernel(configurations, temperature)    ▷ Runs on GPU
 2:     threadId ← blockDim.x ∗ blockIdx.x + threadIdx.x
 3:     conf ← &(configurations[threadId])
 4:     i ← 0
 5:     while i < noOfInnerIteration do
 6:         newConf ← Swap(conf);
 7:         ΔE ← newConf.Cost − conf.Cost
 8:         if IsAccepted(ΔE, temperature) then
 9:             conf ← newConf
10:         end if
11:         i ← i + 1
12:     end while
13: end procedure
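The work done by a single thread in the ESL can be sketched in Python as below. The acceptance function uses the standard Metropolis criterion exp(−ΔE / T), which is the usual form of the rule the text refers to; the paper does not spell out its exact IsAccepted or Swap routines, so the function bodies, the marker convention (values below num_uavs are Home markers) and the retry-until-valid swap are assumptions.

```python
import math
import random


def is_accepted(delta_e, temperature, rng):
    """Metropolis rule: always take improvements, sometimes take worse moves."""
    if delta_e < 0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


def swap_neighbour(conf, num_uavs, rng):
    """Swap two random positions, retrying whenever the swap would put two
    Home markers next to each other (markers are the values < num_uavs)."""
    while True:
        i, j = rng.sample(range(len(conf)), 2)
        new = list(conf)
        new[i], new[j] = new[j], new[i]
        if all(not (a < num_uavs and b < num_uavs)
               for a, b in zip(new, new[1:])):
            return new


def equilibrium_loop(conf, cost, temperature, inner_iterations, num_uavs, rng):
    """Work of one emulated thread in the Equilibrium State Loop."""
    for _ in range(inner_iterations):
        new_conf = swap_neighbour(conf, num_uavs, rng)
        if is_accepted(cost(new_conf) - cost(conf), temperature, rng):
            conf = new_conf
    return conf
```

On a GPU each thread would run `equilibrium_loop` on its own copy of the solution with its own random number stream; here a `random.Random` instance plays that role.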


Fig. 5. Comparison of CPU and GPU performance for 52 and 100 waypoints using 3 UAVs.

Fig. 6. Comparison of CPU and GPU performance for 225 waypoints using 3 UAVs.


Experimental Results

The SA algorithm is implemented on both the CPU and the GPU for performance comparison. The hardware specifications are presented in Table 1. The initial temperature of the SA algorithm is set to 40 because it yields an average initial acceptance percentage of 60 %, and the number of cooling iterations is set to 1000. All GPU tests are executed with 128 threads. Additionally, the number of equilibrium state iterations that each thread executes is set to 10 for all tests.

Table 1. Hardware specifications

                   CPU            GPU
Model                             GeForce 840M
Architecture       Broadwell-U    Maxwell
Clock frequency    2.2 GHz        1029 MHz






The multi-UAV system is considered to consist of 3 UAVs. Data sets with 52, 100, and 225 waypoints are used in the experiments. As presented in Figs. 5 and 6, the GPU implementation provides better total travel cost values for a similar number of iterations. Moreover, it can be inferred from the figures that, as the number of waypoints increases, the difference between the total travel costs obtained by the CPU and the GPU at a given iteration also increases in the early iterations. The GPU implementation of the SA algorithm thus provides better solutions than the serial CPU implementation, which demonstrates its efficiency.





In conclusion, this paper presents an alternative way to compute cost-fair flight

routes for single station multi-UAV systems efficiently. Simulated Annealing

algorithm is used to deal with exponentially increasing computation time of route

planning problem due to its combinatorial nature. The algorithm is redesigned by

making small modifications for GPU acceleration and implemented using CUDA

platform in order to exploit data-parallel compute mechanism of NVIDIA GPUs.

Experimental results show that GPU-accelerated Simulated Annealing algorithm

provides significant increases in computing performance for the flight route planning problem. As future work, in order to simulate more realistic scenarios,

additional environmental constraints which should be avoided by UAVs such as

radars or missiles in a pre-known flight region are planned to be involved in this



1. Cekmez, U., Ozsiginan, M., Sahingoz, O.K.: A UAV path planning with parallel ACO algorithm on CUDA platform. In: 2014 International Conference on

Unmanned Aircraft Systems (ICUAS), pp. 347–354, May 2014

2. Cekmez, U., Ozsiginan, M., Aydin, M., Sahingoz, O.K.: UAV path planning with

parallel genetic algorithms on CUDA architecture. In: Proceedings of the World

Congress on Engineering, pp. 347–354. IAENG (2014)


3. Černý, V.: Thermodynamical approach to the traveling salesman problem: an efficient simulation algorithm. J. Optim. Theor. Appl. 45(1), 41–51 (1985).


4. Hou, M., Liu, D.: A novel method for solving the multiple traveling salesmen

problem with multiple depots. Chin. Sci. Bull. 57(15), 1886–1892 (2012)

5. Király, A., Abonyi, J.: A novel approach to solve multiple traveling salesmen problem by Genetic algorithm. In: Rudas, I.J., Fodor, J., Kacprzyk, J. (eds.) Computational Intelligence in Engineering. SCI, vol. 313, pp. 141–151. Springer, Heidelberg (2010). http://dx.doi.org/10.1007/978-3-642-15220-7_12

6. Kirk, D.B., Wen-mei, W.H.: Programming Massively Parallel Processors: A Hands-on Approach. Newnes, Oxford (2012)

7. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by simulated

annealing. Science 220(4598), 671–680 (1983). http://science.sciencemag.org/


8. Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller,

E.: Equation of state calculations by fast computing machines. J. Chem.

Phys. 21(6), 1087–1092 (1953). http://scitation.aip.org/content/aip/journal/jcp/


9. NVIDIA Corporation: CUDA C best practices guide, version 7.5, September 2015

10. NVIDIA Corporation: CUDA C programming guide, version 7.5, September 2015

11. Sancı, S., İşler, V.: A parallel algorithm for UAV flight route planning on GPU. Int. J. Parallel Program. 39(6), 809–837 (2011). http://dx.doi.org/10.1007/


Reconstruction of Battery Level Curves Based

on User Data Collected from a Smartphone

Franck Gechter1(B), Alastair R. Beresford2, and Andrew Rice2

1 University of Technologie of Belfort Montbéliard, UBFC, F-90010 Belfort Cedex, France

2 Computer Laboratory, University of Cambridge, Cambridge, UK


Abstract. We demonstrate how a multi-agent top-down approach can

be used to interpolate between battery level measurements on a phone

handset. This allows us to obtain a high fidelity trace whilst minimising

the data collection overhead. We evaluate our approach using data collected by the Device Analyzer project which collects handset events and

polled measurements from Android devices. The value of the multi-agent

approach lies in the fact that it is able to incorporate implicit information about battery level from operating system events such as network

usage. We compare our approach to interpolation using Bezier curves

and show a 50 % improvement in mean error and variance.

Keywords: Power modelling · Smartphones · Device Analyzer · Physics

inspired artificial intelligence



Researchers are increasingly interested in how consumer devices are used in

the real world. Projects such as Device Analyzer attempt to help with this by

collecting usage data from Android devices. Device Analyzer has collected data

from more than 30,000 users worldwide with some participants providing more

than two years of continuous usage data [1]. It contains operating system events,

such as screen-on and incoming-call, and also periodically polls for information,

such as battery level and network byte counters.

Polled data in Device Analyzer is only collected every 5 min. This is necessary in order to minimise the energy footprint of the data collection itself on

the participant’s phone. However, this means that the trace lacks fidelity and

so being able to accurately interpolate between readings would be a valuable


In this paper we focus on estimating the power consumption of a handset.

These estimates are the basis of research into understanding (and subsequently

reducing) the energy consumption of smartphones. The most accurate way to

determine a device’s power consumption is to measure it directly. This can be

© Springer International Publishing Switzerland 2016

C. Dichev and G. Agre (Eds.): AIMSA 2016, LNAI 9883, pp. 289–298, 2016.

DOI: 10.1007/978-3-319-44748-3_28


F. Gechter et al.

done by intercepting the current flow between the phone and its battery [2].

This overall consumption can then be used in a power model to estimate the

consumption of different components on the phone [3]. Researchers have also

been successful in estimating energy consumed by monitoring the usage of hardware at the operating system level [4] or instrumenting application binaries and

logging system calls [5]. These techniques all suffer from the limitation that they

require some sort of specialist modification to the device and so are only suitable for small-scale studies. These studies are unlikely to capture the full range

of conditions and use-cases experienced by real users.

Wide-scale deployments of power estimation have been forced to rely on

coarse-grained estimates of energy use collected by polling the state-of-charge

of the battery directly. Drawing useful inferences from these measurements is

difficult. In the Carat project for example the authors were able to identify

energy-wasting applications (‘energy hogs’) by aggregating measurements from

many thousands of devices [6]. Our intention is similar in that we seek to make

use of additional information to improve the quality of data from coarse-grained


The energy consumption of one smartphone component is not independent of

other components [3] and so we take a holistic, top-down approach. Our approach

is similar to the top-down operating system approaches described above, however

we apply a multi-agent system method and use coarse-grained events which relate

to user actions rather than system calls or hardware usage.

A multi-agent system is composed of a set of agents in interaction with each

other within an environment. The result of the three main components (agents,

interaction, environment) leads to a collective organisation.

There are two main approaches in the multi-agent community. The first, classical, approach uses agents with a high-level decision process. These are called

cognitive agents and are commonly found in the literature [9]. The second approach is to use agents with small (or no) cognitive ability and whose behaviours

stem from a stimulus-response or influence-reaction scheme. These are called

reactive agents. In this approach intelligence is not in the agents themselves but

emerges from their interaction with each other and with their environment. These

methods are now also widely used in some domains, such as cyber-physical systems control [10,14] and (distributed) problem solving [11]. The main difference

between cognitive and reactive agents is in the role of the interaction processes

that act upon the agents. For reactive agents, the environment plays an essential

role since it formalises the constraints on the system’s evolution [13]. We take a

reactive agent paradigm using forces inspired by Coulomb’s law as an interaction


The rest of the paper is structured as follows: we first characterise the data

available from Device Analyzer (Sect. 2) and then explain the physics-inspired

multi-agent model we developed (Sect. 3). In Sect. 4 we show the accuracy of our

approach and discuss its validity range.

Reconstruction of Battery Level Curves



Device Analyzer

We use data collected by the Device Analyzer [1] project. This is coarse-grained

information but with the benefit that it has been collected from a large number

of different users over an extended time period. The device data consists of events

of three broad types:

Immediate events are directly stored with their occurrence time. These correspond to events delivered by the Android operating system. Examples include

screen on and screen off events.

Polled events are collected periodically (normally) every 5 min. These are split

into two categories depending on whether they are continuous data (such as

number of bytes transmitted over a network interface) or discrete data (such

as phone notification settings).

Static events are collected at the first connection and are not expected

to change during the period of analysis. These generally correspond to device

information such as the OS version or the Hashed ID of the SIM card.

Table 1. Classification of the data collected by Device Analyzer

Immediate event

Polled event



Phone on/off

Local system time

Phone alert status

Charging time

Amount of free storage

Device Analyzer version

Take picture

Audio volume settings

Total number of photos

Screen on/off

Battery level and voltage

Roaming status

Airplane mode on/off

Screen brightness

Network connectivity

Cellular signal strength

Incoming/outgoing call Amount of 3G data received

Bluetooth/Wifi on/off

Amount of 3G data transmitted


Amount of Wifi data received

Bluetooth scan

Amount of Wifi data transmitted

We give some examples in Table 1. Note that one specific piece of data, such

as number of bytes received on the 3G network, belongs to one category (3G

data received) and to one type (polled continuous event).
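The category/type distinction can be made concrete with a small Python sketch of an event record. The class and field names are hypothetical and do not reflect the actual Device Analyzer data format; the sample value is made up.

```python
from dataclasses import dataclass
from enum import Enum


class EventType(Enum):
    IMMEDIATE = "immediate"
    POLLED_CONTINUOUS = "polled continuous"
    POLLED_DISCRETE = "polled discrete"
    STATIC = "static"


@dataclass(frozen=True)
class Event:
    category: str        # e.g. "3G data received"
    type: EventType      # one of the event types described above
    timestamp: float     # seconds since the start of the trace
    value: object = None


# A polled continuous reading taken at the 5-minute mark (made-up value).
sample = Event("3G data received", EventType.POLLED_CONTINUOUS, 300.0, 18234)
```

Every record thus carries exactly one category and one type, matching the note above.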



The Particle Model


The general idea of our model is to use the event information collected by

Device Analyzer to estimate the battery level of a device over a time period.



We consider collected events as disturbances which deform the ideal battery

curve. The challenge is thus to determine how each event influences the battery

curve. In the rest of this paper, the battery curve is made of virtual charged

particles that will be influenced by the presence of events through attraction

forces based on Coulomb’s law. This can be considered as a beam. We make use

of an existing linear power model [3] expressed as follows:

$$P = \sum_{i} (\beta_i \cdot x_i) + P_{base} + P_{noise}$$

where $P$ is the power used by the device, $\{x_i\}_i$ is the set of state variables of the system (e.g. when dealing with hardware, each element of this set corresponds to a hardware component and the associated value to its utilisation), $P_{base}$ is the base power consumption, $P_{noise}$ is a noise factor and the $\beta_i$ are the linear coefficients that must be determined in order to estimate the influence of the $x_i$ component on the overall power model.

To estimate the energy consumption we must include the whole time period during which the components are used. This leads to the following equation:

$$E = \sum_{i} \sum_{j} (\beta_i \cdot x_i^j) \cdot d_i^j + (P_{base} + P_{noise}) \cdot D$$

where $E$ is the global energy consumption, $x_i^j$ is the utilisation of the component $i$ during the window $d_i^j$, and $D$ is the duration of the experiment. Since we are focusing on user events collected by Device Analyzer, we have to determine the $\beta_i$ and the $d_i^j$ values for each type of event $x_i^j$.
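The two expressions can be evaluated directly, as in the Python sketch below. The function names, the argument layout and all numbers are illustrative assumptions; the noise term is passed as `p_noise` and defaults to zero.

```python
def power(betas, utilisation, p_base, p_noise=0.0):
    """Linear model: P = sum_i beta_i * x_i + P_base + noise."""
    return sum(b * x for b, x in zip(betas, utilisation)) + p_base + p_noise


def energy(betas, usage_windows, p_base, duration, p_noise=0.0):
    """E = sum_i sum_j beta_i * x_i^j * d_i^j + (P_base + noise) * D.

    usage_windows[i] is a list of (utilisation, window_length) pairs for
    component i.
    """
    active = sum(b * x * d
                 for b, windows in zip(betas, usage_windows)
                 for x, d in windows)
    return active + (p_base + p_noise) * duration


# Made-up numbers, chosen only to make the arithmetic easy to follow.
betas = [2.0, 0.5]                       # watts per unit utilisation
windows = [[(1.0, 10.0), (0.5, 20.0)],   # component 0: two usage windows
           [(1.0, 60.0)]]                # component 1: one usage window
instant = power(betas, [1.0, 1.0], p_base=0.1)
total = energy(betas, windows, p_base=0.1, duration=100.0)
```

Here the active part contributes 20 + 20 + 30 = 70 and the base consumption 0.1 · 100 = 10, giving 80 energy units in total.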

Let $S_E$ be a finite set of events collected from the device, defined by $S_E = \{Ev_{i,t_i}\}_{i \in 1..N}$ and ordered along a time line (if $i < j$, then $t_i < t_j$). Each event $Ev_{i,t_i}$ belongs to a category $C_k$ of events, which corresponds to those presented in Sect. 2.

Considering an ideal curve $C$ of the battery and assuming that we can find, for each category $C_k$, a representative Event Agent $EA_{C_k}$, the battery consumption of the device over a time period $\Delta t = [T_a, T_b]$ is the result of the influence on $C$ of the set of event agents $S_{EA} = \{EA_{C_k,t_p}\}_{t_p \in [T_a, T_b]}$ occurring during this time period. The next section details the agent model used, starting with the design of the particle beam (the battery curve) and continuing with the event agents.


Agent Models

The Particle Beam. The particle beam is made of small particles that can be

considered as either electrons or positrons depending on their charge. The system

environment contains an electrical field in order to ensure that the beam moves

from left to right. If one assumes that the charge of the particles is positive,

the electrical field is oriented from left to right. Thus, the forces applied to each

particle can be defined as follows:

$$\vec{F}_{Field} = q \cdot \vec{E}$$

