5 Software - Kernel Interface Version 3


Efficient Hardware Acceleration of Recommendation Engines


that allow the transparent deployment of the accelerators. Specifically, we developed the libraries required to instantiate the kernel from a high-level language such as Python, which is widely used in machine learning tasks. The whole process was carried out on the PYNQ board, a prototype board from Digilent that ships with a Linux image containing Python libraries that help designers use kernels from Python scripts. The whole process is described below:


1. We created a bitstream for our IP targeting the new device (PYNQ), using

2. Then we wrote the software part of the algorithm in Python, using efficient libraries (numpy, scipy).

3. Finally, using the libraries that come with the PYNQ Linux image, we created the appropriate software driver responsible for the software-hardware


In the final step of this process we had to perform manually the operations that the SDSoC framework performs automatically. The Python libraries are wrappers around C code that is used for the interprocess communication. This wrapping is accomplished with a library called cffi, which allows Python scripts to execute C code supplied either precompiled or in source form. This means that the integration is not tied to PYNQ and can be carried out on any platform. Moreover, with cffi we can hide low-level implementation details from the developer under the Python hood.
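As a rough illustration of how cffi lets a Python function hide a C call, the sketch below uses cffi's ABI mode to call `abs` from the C standard library. It is not the driver described above: the wrapper name `c_abs` is made up for this example, and `ffi.dlopen(None)` assumes a POSIX system where the C library is already loaded into the process.

```python
from cffi import FFI

ffi = FFI()

# Declare the C signature we intend to call. In ABI mode no compiler is needed.
ffi.cdef("int abs(int x);")

# Assumption: POSIX system. dlopen(None) exposes the C standard library
# symbols that are already loaded into the running process.
libc = ffi.dlopen(None)


def c_abs(value: int) -> int:
    """Hypothetical wrapper: callers see plain Python, not the C call."""
    return libc.abs(value)


result = c_abs(-7)
```

A real accelerator driver would declare its own C functions in `cdef` and load its own shared library instead of the C standard library, but the calling pattern stays the same.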


Apache Spark Integration

Apache Spark [6] is a framework designed for fast, large-scale data processing. Spark stores data in a structure called a resilient distributed dataset (RDD) [7], which is a read-only collection of data (read-only to simplify coherency). Spark data operations are scheduled as a DAG. Each task consists of a series of transformations that generate new RDDs and an action, which corresponds to the reduce step of the map-reduce programming model. Spark performs lazy evaluation in order to combine as many transformations as possible in one step, so that it can schedule tasks more efficiently. In short, it is an improved version of Hadoop MapReduce [8]. At the moment, Spark is one of the most popular big-data frameworks.
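The difference between transformations and actions can be illustrated without Spark itself. The following toy class is an assumption-laden stand-in, not Spark's API: `LazyDataset` and its methods are invented here to show that transformations only record work, and that an action triggers a single fused pass over the read-only data.

```python
from functools import reduce


class LazyDataset:
    """Toy stand-in for an RDD: immutable data plus pending transformations."""

    def __init__(self, data, pending=()):
        self._data = tuple(data)        # read-only source collection
        self._pending = tuple(pending)  # recorded, not yet executed

    def map(self, fn):
        # A transformation returns a NEW dataset and does no work yet.
        return LazyDataset(self._data, self._pending + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._data, self._pending + (("filter", pred),))

    def _evaluate(self):
        # All recorded transformations are applied in one pass per element.
        for item in self._data:
            keep = True
            for kind, fn in self._pending:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    def reduce(self, fn):
        # An action triggers the evaluation of everything recorded so far.
        return reduce(fn, self._evaluate())


calls = []
squares = LazyDataset(range(5)).map(lambda x: calls.append(x) or x * x)
ran_before_action = list(calls)  # still empty: map() only recorded the work
total = squares.filter(lambda x: x % 2 == 0).reduce(lambda a, b: a + b)
```

Recording the transformations first is what gives a scheduler the chance to fuse them, which is the scheduling benefit the paragraph above attributes to lazy evaluation.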

In this step we built a prototype cluster of four PYNQ boards in order to run the algorithm both in parallel, using Apache Spark, and accelerated, using the programmable logic of each PYNQ [10]. The idea is that every worker of the cluster, each PYNQ board in our case, holds the bitstreams of the accelerator, and the Spark driver program instructs the workers to configure their FPGAs appropriately for the computation that is about to happen. This is done by calling a dummy map() function before the actual computational map() operation; its only purpose is to instruct the workers to load the appropriate bitstream as an overlay.
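The two-phase pattern (a configuration map followed by the computational map) can be sketched in plain Python. Everything here is hypothetical: the `Worker` class, the `cluster_map` helper and the bitstream name `als_kernel.bit` are stand-ins for the Spark workers and PYNQ overlay, and the "accelerated" function is a software placeholder for the kernel call.

```python
class Worker:
    """Toy stand-in for one PYNQ board in the cluster."""

    def __init__(self, name):
        self.name = name
        self.loaded_bitstream = None


def configure_fpga(worker, _partition):
    # Dummy map step: no computation, only a side effect on the worker.
    worker.loaded_bitstream = "als_kernel.bit"  # hypothetical file name
    return []


def accelerated_sum_of_squares(worker, partition):
    # The computational step relies on the configuration having happened.
    if worker.loaded_bitstream != "als_kernel.bit":
        raise RuntimeError(f"{worker.name}: FPGA not configured")
    return [sum(x * x for x in partition)]  # software placeholder for the kernel


def cluster_map(workers, partitions, fn):
    # Simplification: partition i always runs on worker i.
    results = []
    for worker, partition in zip(workers, partitions):
        results.extend(fn(worker, partition))
    return results


workers = [Worker(f"pynq{i}") for i in range(4)]
partitions = [[1, 2], [3, 4], [5, 6], [7, 8]]

cluster_map(workers, partitions, configure_fpga)  # phase 1: configure
partial_sums = cluster_map(workers, partitions, accelerated_sum_of_squares)  # phase 2
```

The sketch sidesteps the question of how a real scheduler assigns partitions to workers; it only shows why the configuration pass has to reach every worker before the computational pass starts.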


K. Katsantonis et al.

Fig. 3. Accelerator speedup against ARM-only execution. Points represent our measurements for different input sizes (nf = 80).



Performance Evaluation

Kernel-Only Performance Evaluation on Zedboard

The first implementation, created with the SDSoC framework, achieved a speedup of up to 120× over the ARM-only execution for input matrices of size 12000 × 80. The speedup increased with the input matrix size. Note that Fig. 3 refers only to the speedup of the accelerated part (the kernel) and not to the speedup of the whole ALS algorithm.


ALS Performance Evaluation on Zedboard

Embedding the Version 3 kernel in the ALS algorithm and running iterations on the movielens 1m and movielens 100k datasets, we obtain the speedup shown in Table 2. The table also lists the average number of ratings per movie/user, because this is a good indicator of the average size of the matrices that will be produced at runtime. From this metric, combined with Fig. 3 and Amdahl's law, we can estimate the anticipated speedup for a specific dataset. This estimate is confirmed by the measured speedups, which are very close to the anticipated ones.
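The estimate mentioned above follows Amdahl's law, S = 1 / ((1 - p) + p / s), where p is the fraction of the original runtime spent in the kernel and s is the kernel speedup read from Fig. 3. The sketch below only encodes this formula; the example values of p and s are placeholders chosen for illustration, not measurements from this work.

```python
def amdahl_speedup(accelerated_fraction: float, kernel_speedup: float) -> float:
    """Overall speedup when only part of the runtime is accelerated."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must lie in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    serial_part = 1.0 - accelerated_fraction
    return 1.0 / (serial_part + accelerated_fraction / kernel_speedup)


# Placeholder inputs (assumptions, not values from the paper):
# 95% of the runtime is in the kernel and the kernel runs 50x faster.
estimate = amdahl_speedup(0.95, 50.0)

# Upper bound for the same fraction with an infinitely fast kernel: 1 / (1 - p).
upper_bound = 1.0 / (1.0 - 0.95)
```

The bound shows why the non-accelerated part dominates once the kernel speedup is large: with p = 0.95 the overall speedup cannot exceed 20× no matter how fast the kernel becomes.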

We also compared this implementation against a software-only implementation on two other platforms, an Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz and an Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60 GHz. For this comparison we used the movielens 1m dataset as input; the implementation on the accelerated embedded system outperformed the i7 processor by a factor of 1.7× and the Xeon processor by a factor of 2.7×. Note that small datasets such as movielens 1m cannot demonstrate the kernel's full potential, shown in Fig. 3, because the small number of ratings per user/movie leads to matrix operations of very limited size.



Table 2. Execution time speedup of the ALS algorithm on the movielens 1m and movielens 100k datasets with nf = 80.

Dataset          Speed-up vs ARM-only   Average ratings per movie/user
movielens 100k   18.8×
movielens 1m





Power Consumption

In Figs. 4 and 5 we show the power consumption of the algorithm for one iteration on two different datasets. Although the accelerated version momentarily draws more power at the beginning of the execution, it runs for a much shorter time, which greatly improves the performance-per-watt metric compared to the ARM-only execution. Specifically, one iteration on the movielens 100k dataset consumed 12× less energy, while one iteration on movielens 1m consumed 27× less energy. Both the performance and the energy evaluation show that the kernel scales very well: as the input size increases, the performance speedup and the energy savings increase too (Table 3).
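Energy is the integral of power over time, which is why a run with a higher power peak can still consume far less energy when it finishes sooner. The sketch below integrates a sampled power trace with the trapezoidal rule; the two traces are made-up numbers for illustration and do not reproduce the measurements in Figs. 4 and 5.

```python
def energy_joules(times_s, power_w):
    """Trapezoidal integration of a sampled power trace (seconds, watts)."""
    if len(times_s) != len(power_w):
        raise ValueError("times and power samples must have the same length")
    total = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


# Hypothetical traces: the software run is long at constant power, the
# accelerated run starts with a higher peak but ends much earlier.
software_energy = energy_joules([0.0, 10.0, 20.0], [2.0, 2.0, 2.0])
accelerated_energy = energy_joules([0.0, 1.0, 2.0], [3.0, 2.5, 2.5])

energy_saving_factor = software_energy / accelerated_energy
```

With these placeholder traces the accelerated run draws more power at every sample yet still uses several times less energy, which mirrors the qualitative argument made in the text.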

Table 3. Energy savings of the ALS algorithm on the movielens 1m and movielens 100k datasets.

Dataset          Energy savings   Average ratings per movie/user
movielens 100k   12×
movielens 1m     27×




Fig. 4. Power consumption profiling of the system for 1m dataset (from one iteration)



Fig. 5. Power consumption profiling of the system for 100k dataset (from one iteration)


Python on Pynq

This implementation performed well, but the speedup achieved over an ARM-only execution was considerably lower than in the previous implementations. The reason is that a high-level language like Python and its libraries were not designed with hardware accelerator integration in mind; as a result, specific data conversions are needed that consume a large share of the execution time.


Apache Spark Integration

Four PYNQ boards, accelerated and coordinated by Spark, ran 4–5× faster than an ARM-only execution. In this case Spark adds many software components to the algorithm: serialization and deserialization tasks, data broadcasts over Ethernet, and more. As a result, the accelerated part accounts for a smaller share of the total execution time and, as a direct consequence of Amdahl's law, we expected a smaller speedup (Table 4).

Table 4. Execution time speedup of the ALS algorithm with nf = 80 on the movielens 1m and movielens 100k datasets.

                                    movielens 100k   movielens 1m
Speed-up of Python implementation



Conclusion and Future Work

In this paper we discussed reconfigurable architectures as an alternative computational path and presented a performance and energy evaluation of that path on embedded boards. Moreover, we tested this technology with a popular scripting language, Python, in order to make it more accessible and easier to use for software developers. The results are promising from both the power consumption and the performance perspective. However, in order to make these architectures commonplace, we must expand the library so that it contains multiple accelerators for many computationally intensive tasks. Moreover, there is a great need to make these accelerators easy to use by constructing a fine-tuned and efficient Python library that allows smooth transitions from software execution to hardware and vice versa. Besides Python, Apache Spark could be extended to natively support accelerated execution more efficiently, by integrating specific instructions for configuring the workers' programmable logic instead of forcing us to use "map()" calls that have no computational purpose and were written just for FPGA configuration.


Acknowledgment. This project has received funding from the European Union’s

Horizon 2020 research and innovation programme under grant agreement No. 687628

- VINEYARD: Versatile Integrated Heterogeneous Accelerator-based Data Centers.


1. Gupta, P.K.: Director of Intel Cloud Platform Technology, Xeon+FPGA Platform

for the Data Center (2015)

2. Esmaeilzadeh, H.: Dark Silicon and the End of Multicore Scaling. In: ISCA (2011)

3. Rajagopalan, V.: Xilinx Zynq-7000 EPP: An Extensible Processing Platform Family (2011)

4. Koren, Y.: Matrix Factorization Techniques for Recommender Systems. IEEE

Computer Society (2009)

5. Zhou, Y.: Large-Scale Parallel Collaborative Filtering for the Netflix Prize (2008)

6. Zaharia, M.: Spark: cluster computing with working sets. In: Proceedings of the

2nd USENIX Conference on Hot Topics in Cloud Computing (2010)

7. Zaharia, M.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (2012)

8. Shi, J.: Clash of the titans: MapReduce vs. spark for large scale data analytics.

In: Proceedings of the 41st International Conference on Very Large Data Bases,

Kohala Coast, Hawaii (2015)

9. Ma, X., Wang, C., Yu, Q., Li, X., Zhou, X.: An FPGA-based accelerator for

neighborhood-based collaborative filtering recommendation algorithms. In: 2015

IEEE International Conference on Cluster Computing (CLUSTER), September


10. Kachris, C., Koromilas, E., Stamelos, I., Soudris, D.: FPGA acceleration of Spark applications in a Pynq cluster. In: IEEE International Conference on Field-Programmable Logic and Applications, Ghent, Belgium, September 2017

11. Yang, D.: An FPGA Implementation for Solving Least Square Problem. IEEE


12. Ma, X.: An FPGA-based accelerator for neighborhood-based collaborative filtering

recommendation algorithms. In: IEEE International Conference on Cluster Computing (CLUSTER) (2015)



13. Huang, M.: Programming and runtime support to blaze FPGA accelerator deployment at datacenter scale. In: SoCC Proceedings of the Seventh ACM Symposium

on Cloud Computing (2016)

14. Lin, Z., Chow, P.: ZCluster: A Zynq-Based Hadoop Cluster, pp. 450–453. IEEE


15. Neshatpour, K., Malik, M., Ghodrat, M.A., Sasan, A., Homayoun, H.: Energy-efficient acceleration of big data analytics applications using FPGAs. In: IEEE International Conference on Big Data, pp. 115–123 (2015)

16. Fahmy, S.A., Vipin, K., Shreejith, S.: Virtualized FPGA accelerators for efficient

cloud computing. In: IEEE International Conference on Cloud Computing Technology and Science (CloudCom), Vancouver, Canada, 30 November–3 December,

pp. 430–435 (2015)

FPGA-based Design and CGRA


VerCoLib: Fast and Versatile

Communication for FPGAs

via PCI Express


Oğuzhan Sezenlik, Sebastian Schüller, and Joachim K. Anlauf

Technical Computer Science, Institute of Computer Science VI, University of Bonn,

Endenicher Allee 19 A, 53115 Bonn, Germany

{sezenlik,anlauf}@cs.uni-bonn.de, schueller@ti.uni-bonn.de

Abstract. PCI Express plays a vital role in including FPGA accelerators into high-performance computing systems. This also includes direct

communication between multiple FPGAs, without any involvement of

the main memory of the host. We present a highly configurable hardware interface that supports DMA-based connections to a host system

as well as direct communication between multiple FPGAs. Our implementation offers unidirectional channels to connect FPGAs, allowing for

precise adaptation to all kinds of use cases. Multiple channels to the same

endpoint can be used to realise independent data transmissions. While

the main focus of this work is flexibility, we are able to show maximum

throughput for connections between two FPGAs and up to 88% saturation of the available bandwidth for connections between the FPGA and

the host system.

Keywords: VerCoLib · PCI Express · Communication library · Transceiver




O. Sezenlik and S. Schüller: These authors contributed equally to this work.

© Springer International Publishing AG, part of Springer Nature 2018

N. Voros et al. (Eds.): ARC 2018, LNCS 10824, pp. 81–92, 2018.

Introduction

FPGAs are widely used to accelerate state-of-the-art algorithms or as coprocessors in heterogeneous high-performance computer systems. FPGA vendors offer affordable evaluation boards with high-end FPGAs, which are especially popular in academic research. Through the development of high-level synthesis tools like Xilinx Vivado HLS or Intel FPGA OpenCL, FPGAs became a more accessible and viable platform. While writing code for the FPGA accelerator itself is one part of a design, another important factor is to utilise its full performance by transferring data reliably and with sufficient throughput. Here a common high-bandwidth interface like PCI Express (PCIe in the following) is essential, the use of which requires fundamental knowledge about its protocol, the underlying computer hardware and kernel driver programming. Therefore developers often face the problem of implementing the complex logic required to interface the PCIe IP core provided by the vendor and of integrating access to the accelerator into their software. Furthermore, modern mainboards allow devices to communicate directly via PCIe, completely bypassing the main memory, a feature heavily used to connect multiple GPUs and build low-cost supercomputers for various scientific applications. The same idea also applies to FPGAs: one could simply plug several off-the-shelf FPGA boards into one standard desktop computer to improve the computational power. Such a feature is especially useful, since combining several smaller FPGAs is more cost-efficient than using high-end variants.

Our goal is to provide a highly configurable and easy-to-use open-source communication library (VERsatile COmmunication LIBrary) that allows FPGA programmers to focus on their primary objective, namely implementing their algorithms. With our interface it is very easy to build affordable multi-FPGA computers, where communication is performed between host and FPGA as well as directly between FPGAs via PCIe. The library includes configurable and generic modules used on the FPGA as well as a Linux kernel driver that sets up communication channels and performs the host part of the data transmission. After the channels between FPGAs have been set up, the host system is not involved in FPGA-FPGA communication at all.

This paper is structured as follows: In Sect. 2 we give an overview of existing commercial and open-source PCIe solutions and our motivation for the development of VerCoLib. The hardware architecture and features of our transceiver are then described in Sect. 3, followed by the software interface and driver in Sect. 4. Finally, the resource consumption and performance are evaluated in Sect. 5.


Related Work

There are already several other systems that provide a PCIe interface for FPGA

accelerators. In general the different solutions can be categorised based on the

PCIe configurations they support and the resulting theoretical maximum bandwidth. This bandwidth depends on the generation of the PCIe standard as well

as the number of PCIe lanes a device is connected to. For reference, the maximum bandwidth for a Gen2 device with 8 lanes is 4 GB/s. This also applies

to Gen3 devices with 4 lanes. Gen3 devices with 8 lanes theoretically reach a

bandwidth of 8 GB/s.
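The reference figures above follow from the per-lane signalling rate and the line encoding of each PCIe generation: Gen2 runs at 5 GT/s with 8b/10b encoding, Gen3 at 8 GT/s with 128b/130b encoding. The sketch below is our own back-of-the-envelope calculation (function and table names are invented here); it gives the raw per-direction bandwidth and ignores packet headers and other protocol overhead.

```python
# Per-lane signalling rate in GT/s and line-encoding efficiency per generation.
PCIE_GENERATIONS = {
    2: (5.0, 8 / 10),     # 8b/10b encoding
    3: (8.0, 128 / 130),  # 128b/130b encoding
}


def pcie_bandwidth_gb_per_s(generation: int, lanes: int) -> float:
    """Raw per-direction bandwidth in GB/s, before packet overhead."""
    rate_gt, efficiency = PCIE_GENERATIONS[generation]
    # GT/s * payload bits per line bit / 8 bits per byte = GB/s per lane
    return rate_gt * efficiency / 8 * lanes


gen2_x8 = pcie_bandwidth_gb_per_s(2, 8)
gen3_x4 = pcie_bandwidth_gb_per_s(3, 4)
gen3_x8 = pcie_bandwidth_gb_per_s(3, 8)
```

Gen2 with 8 lanes comes out at 4 GB/s, while the Gen3 values land slightly below the rounded 4 GB/s and 8 GB/s quoted above because of the 128b/130b encoding.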

Out of the available commercial solutions, the interfaces provided by Xillybus

[1] and Northwest Logic [2] are the most notable ones. The designs from Northwest are limited to Xilinx devices and are used in the reference design Xilinx

offers, supporting PCIe Gen3 with 8 lanes. Xillybus is available for both Xilinx and Intel FPGAs and provides host software for Linux and Windows

operating systems.

Academic solutions include RIFFA 2.2 [3], JetStream [4], ffLink [5], EPEE

[6] and DyRact [7]. Out of these solutions, PCIe Gen2 is supported by EPEE (8

lanes), DyRact (4 lanes) and RIFFA (8 lanes), Gen3 is supported by RIFFA (4

lanes), JetStream and ffLink (8 lanes).



RIFFA offers a host-FPGA interface for a wide range of devices and PCIe

configurations with drivers for both Linux and Windows as well as APIs for a

variety of programming languages. Their transceiver reaches a throughput of

up to 3.64 GB/s (upstream) and 3.43 GB/s (downstream).

EPEE is designed around the concept of a general purpose PCIe interface

including DMA communications with up to 3.28 GB/s, a set of IO registers

reachable from both hardware and software as well as user defined interrupts.

They support Xilinx Virtex-5, 6 and 7 series FPGAs on Linux systems. While

multiple independent DMA channels are supported as a plugin, there are no

measurements of the performance impact in terms of resource usage or throughput.

DyRact implements an interface for dynamic partial reconfiguration within

its PCIe solution, allowing for convenient and efficient reconfiguration of user

designs with PCIe connections. Since they only support Gen2 devices with up to

4 lanes, their bandwidth peaks at 1.54 GB/s.

The ffLink interface is created mainly out of IP-Cores supplied by Xilinx

and relies heavily on the AXI-4 [8] infrastructure. Strongly relying on IP-Cores

has the advantage of low development times and bug fixes from the IP-Core

developer. On the flip side, this also means a comparatively high resource usage

and makes it impossible to adapt the design to other vendors. The ffLink system

achieves a maximum throughput of 7.04 GB/s.

JetStream is the only other solution we have found that supports direct

PCIe FPGA-FPGA communication. They demonstrate the effectiveness of direct

FPGA-FPGA communication with a large FIR filter that spans multiple FPGAs.

They were able to show that distributing the data directly between FPGAs can

result in a reduction of memory bandwidth by up to 75%. Their host-FPGA

solution supports only Gen3 devices with 8 lanes with a maximum throughput

of 7.09 GB/s. However, the missing support for Gen2 devices effectively limits

the use of JetStream to high-end devices.


FPGA Transceiver Design

The central concept of VerCoLib deals with unidirectional, independent channels. A channel is the user interface to send or receive data, translating between

raw data and PCIe packets.

Every configuration of the transceiver has the same structure, consisting of

one endpoint module, an arbiter and an arbitrary number of channels, all of

them using the same handshake interface, equivalent to AXI4-Stream [8]. An

example is shown in Fig. 1.

The function of the endpoint is to handle global resources which need to be shared among the channels. This includes interfacing the Xilinx-specific hard IP core [9], handling internal communication with the software driver, managing the interrupts from all channel modules, and providing dynamic tag mapping for downstream DMA transfers.


[Figure 1: block diagram; recoverable block labels: Host RX, DMA Buffer, DMA Engine, User Design, Tag control, Interrupt ctrl, Driver config, Host TX]


Fig. 1. System overview of an example configuration. Best viewed in color. The colors indicate independent data streams. Black and gray connections may contain data from all streams, and fading colors indicate that data is being filtered. Note that a user can instantiate channel modules as necessary.

We use MSI-X interrupts, which allow a device to allocate up to 2048 interrupt vectors instead of the 32 vectors allowed by MSI. This makes it possible to map every channel uniquely to an MSI-X vector. The driver can therefore immediately identify the channel that issued an interrupt without further communication, which in turn reduces communication overhead and latency.
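The benefit of a one-to-one mapping can be shown with a small lookup table. This is only a conceptual sketch in Python rather than driver code: the channel names and the helper `assign_vectors` are hypothetical, and only the two vector limits come from the text.

```python
MSI_MAX_VECTORS = 32
MSIX_MAX_VECTORS = 2048


def assign_vectors(channel_names, max_vectors=MSIX_MAX_VECTORS):
    """Give every channel its own interrupt vector (vector index -> channel)."""
    if len(channel_names) > max_vectors:
        raise ValueError("more channels than available interrupt vectors")
    return dict(enumerate(channel_names))


# Hypothetical channel names for a small configuration.
vector_table = assign_vectors(["host_rx_0", "host_tx_0", "fpga_tx_0"])

# When vector 1 fires, the source channel is known from the table alone,
# without querying the device for an interrupt status register.
source = vector_table[1]
```

With a shared vector the handler would first have to ask the device which channel raised the interrupt, which is the extra round trip the unique mapping avoids.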

In a multi-channel PCIe transceiver design special consideration is required

when handling DMA transfers from host to FPGA. According to the PCIe specification [10], the receiver has to request data from the main memory of the host

system with memory read request packets. These are answered by the host via

memory completion packets with the requested data as payload. PCIe provides

a tag field with at most 256 different values in the packet header which is used

to determine the affiliation of each completion with a request. To reach optimal
