5 Software - Kernel Interface Version 3
that allow the transparent deployment of the accelerators. Specifically, we developed the libraries required to instantiate the kernel from a high-level language such as Python, which is widely used in machine-learning tasks. The whole process took place on the Pynq board, a prototype board from Digilent that ships with a Linux image containing Python libraries that help designers use hardware kernels from Python scripts. The whole process is described in the following steps:
1. We created a bitstream for our IP matching the new device (PYNQ).
2. Then we wrote the software part of the algorithm in Python, using efficient libraries (numpy, scipy).
3. Finally, using the libraries that come with the PYNQ Linux image, we created the appropriate software driver responsible for the software-hardware communication.
In this final step we had to perform manually the operations that the SDSoC framework performs automatically. The Python libraries are wrappers around C code and are used for the interprocess communication. The wrapping is accomplished with a library called cffi, which allows Python scripts to execute C code provided either precompiled or in source-code form. This means that the integration can happen on any platform, not only on Pynq. Moreover, with cffi we can hide the low-level implementation details from the developer under the Python hood.
Apache Spark Integration
Apache Spark is a framework designed for fast large-scale data processing. Spark stores data in a structure called a resilient distributed dataset (RDD), which is a read-only (for ease of coherency) collection of the data. Spark data operations are scheduled as a DAG. Each task consists of a series of transformations that generate new RDDs and an action, which corresponds to the reduce step of the map-reduce programming model. Spark performs lazy evaluation in order to perform as many transformations as possible in one step, so that it can achieve more efficient task scheduling. At a glance, it is an improved version of Hadoop MapReduce. At the moment, Spark is one of the most popular big-data frameworks.
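The transformation/action split and lazy evaluation can be seen in a few lines of PySpark; this is a generic illustration (the file name ratings.csv is a placeholder), not code from the ALS implementation.

```python
# Generic PySpark illustration of transformations vs. actions.
from pyspark import SparkContext

sc = SparkContext(appName="lazy-eval-demo")

ratings = sc.textFile("ratings.csv")               # RDD handle, nothing read yet
pairs = ratings.map(lambda line: line.split(","))  # transformation: lazy
counts = pairs.map(lambda r: (r[0], 1)) \
              .reduceByKey(lambda a, b: a + b)     # still lazy

# Only the action below triggers execution; Spark fuses the two map()
# transformations into a single stage before shuffling for reduceByKey.
print(counts.take(5))
```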
In this step we built a prototype cluster consisting of four Pynq boards in order to run the algorithm both in parallel, using Apache Spark, and accelerated, using the programmable logic of each PYNQ. The idea is that every worker of the cluster, i.e. each PYNQ board in our case, contains the bitstreams of the accelerator, and the Spark driver program commands the workers to configure their FPGAs appropriately for the computation that is about to happen. This happens by calling a dummy map() function, before the actual computational map() operation, whose purpose is to instruct the workers to overlay the accelerator bitstream onto their programmable logic, as sketched below.
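A hedged sketch of this trick: sc is the SparkContext of the driver program, and num_workers, the bitstream path, user_blocks and solve_with_accelerator() are illustrative placeholders. On Pynq the overlay would typically be loaded with the pynq.Overlay class.

```python
# Sketch: a dummy map() makes every worker program its FPGA before the real work.
def configure_fpga(_):
    from pynq import Overlay                  # runs on the worker (a Pynq board)
    Overlay("/home/xilinx/als_kernel.bit")    # program the PL with the accelerator
    return True

# Dummy map: its only purpose is to reach every worker once before computing.
sc.parallelize(range(num_workers), num_workers).map(configure_fpga).count()

# The actual computational map() now runs against an already-configured FPGA.
results = user_blocks.map(lambda block: solve_with_accelerator(block)).collect()
```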
Fig. 3. Accelerator speedup against ARM-only execution. Points represent our measurements for different input sizes (nf = 80).
Kernel-Only Performance Evaluation on Zedboard
The first implementation, created with the SDSoC framework, achieved a speedup of up to 120× for input matrices of size 12000 × 80 against the ARM-only execution. As the input matrix size increased, the speedup increased as well. It is important to note that Fig. 3 refers only to the speedup of the accelerated part (kernel) and not to the speedup of the whole ALS algorithm.
ALS Performance Evaluation on Zedboard
Embedding the Version 3 kernel in the ALS algorithm and running iterations on the movielens 1m and movielens 100k datasets, we get the speedup shown in Table 2. The column with the average number of ratings per movie/user is included because it is a good indicator of the average size of the matrices that will be produced at runtime. As a result, from this metric combined with Fig. 3 and Amdahl's law, we can estimate the anticipated speedup for a specific dataset. This observation is verified by the actual speedup measurements, which are very close to the anticipated ones.
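For reference, the estimate follows directly from Amdahl's law: if a fraction p of an ALS iteration is spent inside the kernel and the kernel itself is sped up by a factor s (read off Fig. 3 for the matrix sizes implied by the average ratings count), the overall speedup is bounded by

\[
S_{\text{overall}} = \frac{1}{(1 - p) + p/s}.
\]

With purely illustrative values, p = 0.95 and s = 100 give S_overall ≈ 1/(0.05 + 0.0095) ≈ 17×; these values of p and s are assumptions for the sake of the example, not measurements from the paper.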
We also compared this implementation against a software-only implementation on two other platforms, an Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz and an Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60 GHz. For this comparison we used the movielens 1m dataset as input; the implementation on the accelerated embedded system outperformed the i7 processor by a factor of 1.7× and the Xeon processor by a factor of 2.7×. It is important to note that small datasets such as movielens 1m cannot demonstrate the kernel's full potential shown in Fig. 3, because the small number of ratings per user/movie leads to matrix operations of very limited size.
Table 2. Execution-time speedup of the ALS algorithm on the movielens-1m and movielens-100k datasets with nf = 80.

Dataset          Speed-up vs ARM-only   Average ratings per movie/user
movielens 100k   18.8×
In Figs. 4 and 5 we show the power consumption of one iteration of the algorithm on the two datasets. Although the accelerated version momentarily consumes more power at the beginning of the execution, the fact that it runs for a much shorter time leads to a great improvement in the performance-per-Watt metric compared with the ARM-only execution. Specifically, one iteration on the movielens 100k dataset consumed 12× less energy, while an iteration on movielens 1m consumed 27× less energy. Both the performance and the energy-consumption evaluations show that the kernel scales very well: as the input size increases, the performance speedup and the energy savings increase too (Table 3).
Table 3. Energy savings of the ALS algorithm on the movielens-1m and movielens-100k datasets.

Dataset          Energy savings   Average ratings per movie/user
movielens 100k   12×
Fig. 4. Power consumption profiling of the system for the 1m dataset (one iteration)
Fig. 5. Power consumption profiling of the system for the 100k dataset (one iteration)
Python on Pynq
This implementation showed good performance, but the speedup achieved over an ARM-only execution was noticeably lower than the one achieved by the previous implementations. The reason is that a high-level language like Python and its corresponding libraries were not designed with hardware-accelerator integration in mind; as a result, specific data conversions are needed that consume a large fraction of the execution time.
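The overhead is essentially the cost of turning numpy/scipy objects into the contiguous, DMA-friendly buffers the overlay expects. A rough sketch of such a conversion follows; allocate() is the buffer API of recent PYNQ releases (older images use Xlnk.cma_array), and the float32 layout is an assumption about the kernel interface.

```python
# Rough sketch of the per-call conversion cost on Pynq (not the exact driver code).
import numpy as np
from pynq import allocate

def to_dma_buffer(mat):
    """Copy an arbitrary numpy/scipy block into physically contiguous memory."""
    mat = np.ascontiguousarray(mat, dtype=np.float32)  # dtype/layout conversion
    buf = allocate(shape=mat.shape, dtype=np.float32)  # CMA-backed buffer
    buf[:] = mat                                       # extra copy, paid on every call
    buf.flush()                                        # sync caches before the DMA reads it
    return buf
```

Every one of these steps runs in software on the ARM cores, which is why the end-to-end speedup drops compared with the kernel-only measurements.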
Apache Spark Integration
Four Pynqs, accelerated and coordinated by Spark, ran 4–5× faster than an ARM-only execution. In this case many software parts are added to the algorithm by Spark: serialization and deserialization tasks, data broadcasts over Ethernet, and more. As a result, the accelerated part is smaller relative to the total execution time, and as a direct consequence of Amdahl's law we expected a smaller speedup (Table 4).
Table 4. Execution-time speedup of the ALS algorithm with nf = 80 on the movielens-1m and movielens-100k datasets.

Speed-up of the Python implementation
Conclusion and Future Work
In this paper we discussed reconfigurable architectures as an alternative computational path, and we attempted to provide a performance and energy
evaluation of that path on embedded boards. Moreover, we attempted to combine this technology with a popular scripting language like Python, in order to make it more accessible and easier to use for software developers. The results are definitely promising from both the power-consumption and the performance perspective.
However, in order to make these architectures commonplace, we must expand the library so that it contains multiple accelerators for many computationally intensive tasks. Moreover, there is a great need to make these accelerators easy to use by building a fine-tuned and efficient Python library that allows smooth transitions from software execution to hardware and vice versa. Apart from Python, Apache Spark could be extended to natively support accelerated execution more efficiently, by integrating specific instructions for configuring the slaves' programmable logic instead of forcing us to use map() calls that have no computational intent and are written only for FPGA configuration.
Acknowledgment. This project has received funding from the European Union’s
Horizon 2020 research and innovation programme under grant agreement No. 687628
- VINEYARD: Versatile Integrated Heterogeneous Accelerator-based Data Centers.
1. Gupta, P.K.: Director of Intel Cloud Platform Technology, Xeon+FPGA Platform
for the Data Center (2015)
2. Esmaeilzadeh, H.: Dark Silicon and the End of Multicore Scaling. In: ISCA (2011)
3. Rajagopalan, V.: Xilinx Zynq-7000 EPP: An Extensible Processing Platform Family (2011)
4. Koren, Y.: Matrix Factorization Techniques for Recommender Systems. IEEE
Computer Society (2009)
5. Zhou, Y.: Large-Scale Parallel Collaborative Filtering for the Netﬂix Prize (2008)
6. Zaharia, M.: Spark: cluster computing with working sets. In: Proceedings of the
2nd USENIX Conference on Hot Topics in Cloud Computing (2010)
7. Zaharia, M.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (2012)
8. Shi, J.: Clash of the titans: MapReduce vs. spark for large scale data analytics.
In: Proceedings of the 41st International Conference on Very Large Data Bases,
Kohala Coast, Hawaii (2015)
9. Ma, X., Wang, C., Yu, Q., Li, X., Zhou, X.: An FPGA-based accelerator for neighborhood-based collaborative filtering recommendation algorithms. In: 2015 IEEE International Conference on Cluster Computing (CLUSTER), September 2015
10. Kachris, C., Koromilas, E., Stamelos, I., Soudris, D.: FPGA acceleration of Spark applications in a Pynq cluster. In: IEEE International Conference on Field-Programmable Logic and Applications, Ghent, Belgium, September 2017
11. Yang, D.: An FPGA Implementation for Solving Least Square Problem. IEEE
12. Ma, X.: An FPGA-based accelerator for neighborhood-based collaborative ﬁltering
recommendation algorithms. In: IEEE International Conference on Cluster Computing (CLUSTER) (2015)
13. Huang, M.: Programming and runtime support to blaze FPGA accelerator deployment at datacenter scale. In: SoCC Proceedings of the Seventh ACM Symposium
on Cloud Computing (2016)
14. Lin, Z., Chow, P.: ZCluster: A Zynq-Based Hadoop Cluster, pp. 450–453. IEEE
15. Neshatpour, K., Malik, M., Ghodrat, M.A., Sasan, A., Homayoun, H.: Energy-efficient acceleration of big data analytics applications using FPGAs. In: IEEE International Conference on Big Data, pp. 115–123 (2015)
16. Fahmy, S.A., Vipin, K., Shreejith, S.: Virtualized FPGA accelerators for eﬃcient
cloud computing. In: IEEE International Conference on Cloud Computing Technology and Science (CloudCom), Vancouver, Canada, 30 November–3 December,
pp. 430–435 (2015)
FPGA-based Design and CGRA
VerCoLib: Fast and Versatile Communication for FPGAs via PCI Express
Oğuzhan Sezenlik, Sebastian Schüller, and Joachim K. Anlauf
Technical Computer Science, Institute of Computer Science VI, University of Bonn, Endenicher Allee 19 A, 53115 Bonn, Germany
O. Sezenlik and S. Schüller contributed equally to this work.
Abstract. PCI Express plays a vital role in integrating FPGA accelerators into high-performance computing systems, including direct communication between multiple FPGAs without any involvement of
the main memory of the host. We present a highly conﬁgurable hardware interface that supports DMA-based connections to a host system
as well as direct communication between multiple FPGAs. Our implementation oﬀers unidirectional channels to connect FPGAs, allowing for
precise adaptation to all kinds of use cases. Multiple channels to the same
endpoint can be used to realise independent data transmissions. While
the main focus of this work is ﬂexibility, we are able to show maximum
throughput for connections between two FPGAs and up to 88% saturation of the available bandwidth for connections between the FPGA and
the host system.
Keywords: VerCoLib · PCI Express · Communication library · Transceiver
Introduction

FPGAs are widely used to accelerate state-of-the-art algorithms or as coprocessors in heterogeneous high-performance computer systems. FPGA vendors offer affordable evaluation boards with high-end FPGAs, which are especially popular in academic research. Through the development of high-level synthesis tools like Xilinx Vivado HLS or Intel FPGA OpenCL, FPGAs have become a more accessible and viable platform. While writing code for the FPGA accelerator itself is one part of a design, another important factor is to utilise its full performance by transferring data reliably and with sufficient throughput. Here a common high-bandwidth interface like PCI Express (PCIe in the following) is essential, the use of which requires fundamental knowledge about its protocol, the underlying computer hardware, as well as kernel driver programming. Therefore developers often face the problem of implementing the complex logic required to interface the
PCIe IP core provided by the vendor and integrate access to the accelerator
into their software. Furthermore modern mainboards allow devices to communicate directly via PCIe, completely bypassing the main memory, a feature heavily
used to connect multiple GPUs and build low-cost supercomputers for various
scientiﬁc applications. The same idea also applies to FPGAs: one could simply
plug several oﬀ-the-shelf FPGA boards into one standard desktop computer to
improve the computational power. Such a feature is especially useful, since combining several smaller FPGAs is more cost eﬃcient than using high-end variants.
Our goal is to provide a highly configurable and easy-to-use open-source communication library (VERsatile COmmunication LIBrary) that allows FPGA programmers to focus on their primary objective, namely implementing their algorithms. With our interface it is very easy to build affordable multi-FPGA computers, where communication is performed between host and FPGA as well as directly between FPGAs via PCIe. This includes configurable and generic modules used on the FPGA as well as a Linux kernel driver to set up communication channels and perform the host part of the data transmission. After setting up the channels between FPGAs, the host system is not involved in the FPGA-FPGA communication at all.
This paper is structured as follows: In Sect. 2 we give an overview of existing commercial and open-source PCIe solutions and our motivation for the development of VerCoLib. The hardware architecture and features of our transceiver are described in Sect. 3, followed by the software interface and driver in Sect. 4. Finally, the resource consumption and performance are evaluated in Sect. 5.
Related Work

There are already several other systems that provide a PCIe interface for FPGA accelerators. In general the different solutions can be categorised based on the PCIe configurations they support and the resulting theoretical maximum bandwidth. This bandwidth depends on the generation of the PCIe standard as well as the number of PCIe lanes a device is connected to. For reference, the maximum bandwidth for a Gen2 device with 8 lanes is 4 GB/s. This also applies to Gen3 devices with 4 lanes. Gen3 devices with 8 lanes theoretically reach a bandwidth of 8 GB/s.
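These figures follow from the per-lane line rates and encodings; as a back-of-the-envelope check (ignoring packet header and protocol overhead):

\[
\text{Gen2, 8 lanes:}\quad 8 \times 5\ \text{GT/s} \times \tfrac{8}{10} \div 8\ \tfrac{\text{bit}}{\text{byte}} = 4\ \text{GB/s}
\]
\[
\text{Gen3, 8 lanes:}\quad 8 \times 8\ \text{GT/s} \times \tfrac{128}{130} \div 8\ \tfrac{\text{bit}}{\text{byte}} \approx 7.9\ \text{GB/s}
\]

Gen3 with 4 lanes therefore lands at roughly the same 4 GB/s as Gen2 with 8 lanes, and the nominal 8 GB/s quoted for Gen3 x8 is the rounded raw figure.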
Out of the available commercial solutions, the interfaces provided by Xillybus and Northwest Logic are the most notable ones. The designs from Northwest are limited to Xilinx devices and are used in the reference design Xilinx offers, supporting PCIe Gen3 with 8 lanes. Xillybus is available for both Xilinx and Intel FPGAs and provides host software for Linux and Windows.
Academic solutions include RIFFA 2.2, JetStream, ffLink, EPEE and DyRact. Out of these solutions, PCIe Gen2 is supported by EPEE (8 lanes), DyRact (4 lanes) and RIFFA (8 lanes); Gen3 is supported by RIFFA (4 lanes), JetStream and ffLink (8 lanes).
RIFFA oﬀers a host-FPGA interface for a wide range of devices and PCIe
conﬁgurations with drivers for both Linux and Windows as well as APIs for a
variety of programming languages. Their transceiver reaches a throughput of
up to 3.64 GB/s (upstream) and 3.43 GB/s (downstream).
EPEE is designed around the concept of a general purpose PCIe interface
including DMA communications with up to 3.28 GB/s, a set of IO registers
reachable from both hardware and software as well as user deﬁned interrupts.
They support Xilinx Virtex-5, 6 and 7 series FPGAs on Linux systems. While
multiple independent DMA channels are supported as a plugin, there are no
measurements of the performance impact in terms of resource usage or throughput.
DyRact implements an interface for dynamic partial reconﬁguration within
its PCIe solution, allowing for convenient and eﬃcient reconﬁguration of user
designs with PCIe connections. Since they only support Gen2 devices with up to
4 lanes, their bandwidth peaks at 1.54 GB/s.
The ffLink interface is built mainly out of IP cores supplied by Xilinx and relies heavily on the AXI4 infrastructure. Strongly relying on IP cores has the advantage of low development times and bug fixes from the IP core developer. On the flip side, this also means a comparatively high resource usage and makes it impossible to adapt the design to other vendors. The ffLink system achieves a maximum throughput of 7.04 GB/s.
JetStream is the only other solution we have found that supports direct
PCIe FPGA-FPGA communication. They demonstrate the eﬀectiveness of direct
FPGA-FPGA communication with a large FIR ﬁlter that spans multiple FPGAs.
They were able to show that distributing the data directly between FPGAs can
result in a reduction of memory bandwidth by up to 75%. Their host-FPGA
solution supports only Gen3 devices with 8 lanes with a maximum throughput
of 7.09 GB/s. However, the missing support for Gen2 devices eﬀectively limits
the use of JetStream to high-end devices.
FPGA Transceiver Design
The central concept of VerCoLib deals with unidirectional, independent channels. A channel is the user interface to send or receive data, translating between
raw data and PCIe packets.
Every configuration of the transceiver has the same structure, consisting of one endpoint module, an arbiter and an arbitrary number of channels, all of them using the same handshake interface, equivalent to AXI4-Stream. An example is shown in Fig. 1.
The function of the endpoint is to handle the global resources which need to be shared among the channels. This includes interfacing the Xilinx-specific hard IP core and handling internal communication with the software driver, as well as managing the interrupts from all channel modules and providing dynamic tag mapping for downstream DMA transfers.
Fig. 1. System overview of an example configuration. Best viewed in color. The colors indicate independent data streams. Black and gray connections may contain data from all streams, and fading colors indicate data being filtered. Note that a user can instantiate channel modules as necessary.
We are using MSI-X interrupts, which allow devices to allocate up to 2048 interrupt vectors instead of only the 32 vectors allowed by MSI. This makes it possible to map every channel uniquely to an MSI-X vector. Thereby the driver is able to immediately identify the channel that issued an interrupt without further communication, which in turn reduces communication overhead and latency.
In a multi-channel PCIe transceiver design, special consideration is required when handling DMA transfers from host to FPGA. According to the PCIe specification, the receiver has to request data from the main memory of the host system with memory read request packets. These are answered by the host via memory completion packets carrying the requested data as payload. PCIe provides a tag field with at most 256 different values in the packet header, which is used to determine the affiliation of each completion with a request. To reach optimal