4 Performance Evaluation of dgeqrf, dgetrf and dgemm on Multicore and GPGPU

Efficient Realization of Kalman Filter on CGRA

Fig. 5. RDP configurations for KF and scheduling of different routines in KF: (a) different configurations of RDP corresponding to identified macro operations in dgemm, dgeqrf, and dgetrf; (b) software optimization in KF resulting in a reduction in the run-time (logical diagram)


Hardware Optimized KF

In the hardware optimized KF, we revisit the basic operations required in MFA, namely dgemm, dgeqrf, and dgetrf, and identify several macro operations within them. We realize these macro operations in the RDP depicted in Fig. 2. The configurations of the RDP corresponding to the identified macro operations in dgemm, dgeqrf, and dgetrf are shown in Fig. 5(a). With these configurations, we achieve up to 50% of the theoretical peak of the PE, where the theoretical peak of the PE is 4.9 Gflops at 700 MHz. The theoretical peak of the PE is higher here because the RDP consists of 4 multipliers and 3 adders.
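The quoted peak follows directly from the RDP's arithmetic resources; a quick back-of-the-envelope check, under the assumption that every multiplier and adder produces one result per cycle:

```python
# Sanity check of the PE's theoretical peak, assuming the RDP can
# retire a result from all 4 multipliers and 3 adders every cycle.
MULTIPLIERS = 4
ADDERS = 3
CLOCK_GHZ = 0.7  # 700 MHz

flops_per_cycle = MULTIPLIERS + ADDERS       # 7 floating-point ops/cycle
peak_gflops = flops_per_cycle * CLOCK_GHZ    # 7 * 0.7 = 4.9 Gflops

print(f"theoretical peak: {peak_gflops:.1f} Gflops")
```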

We perform a software optimization in MFA by analyzing the DAG of MFA. We overlap the dgeqrf, dgemm, and dgetrf routines as shown in Fig. 5(b); the diagram in Fig. 5(b) is the logical flow of computations after software optimization. To overlap these routines, we identify pipeline stalls in the PE during execution and insert independent instructions while maintaining operational correctness. Overlapping dgeqrf, dgemm, and dgetrf significantly reduces the run-time of MFA, which directly translates into a performance improvement in KF. After software optimization, we attain 65% of the theoretical peak of the PE, a 30% improvement. The performance improvement after each optimization is shown in Fig. 6, where we have also incorporated the performance attained on the multicore and the GPGPU.

It can be observed in Fig. 6 that the performance of KF on the multicore and the GPGPU is barely 20–30% of the theoretical peak of those platforms, while we achieve up to 65% of the theoretical peak of the PE in KF, which is 2.15x higher.


F. Merchant et al.






Fig. 6. Performance of KF in PE, multicore, and GPGPU (series: Kalman Filter (PLASMA), Kalman Filter (MAGMA), Kalman Filter (PE); x-axis: matrix size)


Parallel Realization and Results

For the parallel realization of KF, we use three different configurations of REDEFINE; two of them are shown in Fig. 7. In configuration 1 we use a 2 × 2 Tile array, in configuration 2 a 3 × 3 Tile array, and in configuration 3 a 4 × 4 Tile array. In our simulations, for configurations 1 and 2, we use the last column of the Tile array as memory, attaching memories in place of PEs.

Fig. 7. Different configurations of REDEFINE for KF realization and scheduling

In configuration 3, we use the entire Tile array for computations, and hence we attach a memory PE to the Router alongside the compute PE. The memory PE is divided into two segments: the second segment acts as a global memory accessible by all other Tiles, while the first segment is private to the compute PE and is used for computations on local data. Typically, we have 256 KB of memory per Tile, and each compute PE contains 256 registers of 64-bit width. The scheduling technique for REDEFINE is shown on the right side of Fig. 7. We divide the input matrix into k × k blocks, where k is the size of a row/column of the Tile array. These blocks are further divided into sub-blocks whose size depends on the number of local registers and the size of the local memory available to a PE. Blocks of the matrices are loaded, the computation is performed, and the result is stored to the global memory. The percentage of the theoretical peak performance attained for each configuration is shown in Fig. 8(a). It can be observed in Fig. 8(a) that the performance attained for each configuration saturates at 60% of the theoretical peak, which is 2x higher than the performance attained on the multicore and the GPGPU.

Fig. 8. (a) Performance in different configurations of REDEFINE in KF realization (series: Kalman Filter (2x2), Kalman Filter (3x3), Kalman Filter (4x4); x-axis: matrix size); (b) power performance comparison (Gflops/watt) of the PE with other platforms for KF, including Nvidia GTX 480 SM, Altera Stratix IV, ClearSpeed CSX700, Intel Core i7 Haswell, and Tesla C2075
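The block scheduling described above can be sketched as follows. This is a hypothetical helper for illustration only (the actual REDEFINE scheduler is not published as code); it partitions an n × n index space into a k × k grid of blocks, one block per Tile:

```python
def partition(n, k):
    """Split an n x n matrix index space into a k x k grid of blocks,
    mirroring the scheduling described above (illustrative sketch only;
    sub-blocking by register/local-memory capacity is omitted).
    Returns a list of (row_range, col_range) pairs, one per block."""
    step = -(-n // k)  # ceil(n / k): edge length of one block
    blocks = []
    for bi in range(0, n, step):
        for bj in range(0, n, step):
            blocks.append((range(bi, min(bi + step, n)),
                           range(bj, min(bj + step, n))))
    return blocks

# A 9x9 KF matrix on a 3x3 Tile array yields 9 blocks of 3x3 each.
blocks = partition(9, 3)
print(len(blocks))  # 9
```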

We evaluate the power performance of the PE based on the technique presented in [10] and compare the PE with other platforms for KF, as shown in Fig. 8(b). It can be observed that the PE achieves a 4-105x performance improvement over platforms such as the ClearSpeed CSX700, Intel Core i7, and Nvidia GPGPU.



Conclusion

In this paper, we presented an efficient realization of KF, using the versatile MFA as a tool. Based on the case studies presented on dgemm, dgetrf, and dgeqrf, we identified that the performance of these operations on multicore and GPGPU is not satisfactory, even with highly optimized software packages such as PLASMA and MAGMA. We also showed that the performance attained by KF on multicore and GPGPU is 20–30% of the theoretical peak performance of the underlying platform. To accelerate KF on a customizable platform like REDEFINE, we identified macro operations in the routines of MFA and realized them on the RDP. Our approach resulted in a 67% improvement in the performance of KF. A software tuning of MFA was also presented that yielded a further 30% improvement over the hardware optimized KF. Overall, our algorithm-architecture co-design approach resulted in a 116% performance improvement over the base realization of KF. In terms of Gflops/watt, KF is 2-105x better than on multicore and GPGPU platforms.
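The chained figures are mutually consistent: the quoted 67% hardware gain over the base implies the base attained roughly 30% of the PE's peak. Under that assumption, the numbers can be checked:

```python
# Sanity check of the chained speedups quoted above.  Assumption (from
# the 67% figure): the base realization attains ~30% of the PE's peak.
base, hw, sw = 0.30, 0.50, 0.65   # fraction of theoretical peak attained

hw_gain = hw / base - 1           # hardware optimization over base
sw_gain = sw / hw - 1             # software tuning over hardware-optimized KF
total   = sw / base - 1           # combined improvement over the base

# ~67%, 30%, and ~117% (the paper rounds the last to 116%)
print(f"{hw_gain:.0%} {sw_gain:.0%} {total:.0%}")
```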


References


1. Baluni, A., Merchant, F., Nandy, S.K., Balakrishnan, S.: A fully pipelined modular

multiple precision floating point multiplier with vector support. In: 2011 International Symposium on Electronic System Design, pp. 45–50, December 2011

2. Cerati, G., Elmer, P., Lantz, S., MacNeill, I., McDermott, K., Riley, D., Tadel, M., Wittich, P., Würthwein, F., Yagil, A.: Traditional tracking with Kalman filter on parallel architectures. J. Phys.: Conf. Ser. 608(1), 012057 (2015)

3. Das, S., Madhu, K.T., Krishna, M., Sivanandan, N., Merchant, F., Natarajan, S., Biswas, I., Pulli, A., Nandy, S.K., Narayan, R.: A framework for post-silicon realization of arbitrary instruction extensions on reconfigurable data-paths. J. Syst. Archit. - Embed. Syst. Des. 60(7), 592–614 (2014)

4. Gentleman, W.M., Kung, H.T.: Matrix triangularization by systolic arrays. In:

SPIE Proceedings, vol. 298, pp. 19–26 (1982)

5. Higham, N.J.: Exploiting fast matrix multiplication within the level 3 BLAS. ACM

Trans. Math. Softw. 16(4), 352–368 (1990)

6. Hill, M.D., Marty, M.R.: Amdahl’s law in the multicore era. Computer 41(7), 33–38 (2008)


7. Johnson, B., Thomas, N., Rani, J.S.: An FPGA based high throughput discrete

Kalman filter architecture for real-time image denoising. In: 2017 30th International Conference on VLSI Design and 2017 16th International Conference on

Embedded Systems (VLSID), January 2017

8. Mahadurkar, M., Merchant, F., Maity, A., Vatwani, K., Munje, I., Gopalan, N., Nandy, S.K., Narayan, R.: Co-exploration of NLA kernels and specification of compute elements in distributed memory CGRAs. In: XIVth International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, SAMOS 2014, Agios Konstantinos, Samos, Greece, 14–17 July 2014, pp. 225–232 (2014)

9. Merchant, F., Maity, A., Mahadurkar, M., Vatwani, K., Munje, I., Krishna, M.,

Nalesh, S., Gopalan, N., Raha, S., Nandy, S.K., Narayan, R.: Micro-architectural

enhancements in distributed memory CGRAs for LU and QR factorizations.

In: 2015 28th International Conference on VLSI Design (VLSID), pp. 153–158,

January 2015

10. Merchant, F.A., Vatwani, T., Chattopadhyay, A., Raha, S., Nandy, S.K., Narayan,

R.: Efficient realization of householder transform through algorithm-architecture

co-design for acceleration of QR factorization. IEEE Trans. Parallel Distrib. Syst.

PP(99), 1 (2018)

11. Merchant, F., Chattopadhyay, A., Garga, G., Nandy, S.K., Narayan, R., Gopalan, N.: Efficient QR decomposition using low complexity column-wise Givens rotation (CGR). In: 2014 27th International Conference on VLSI Design and 2014 13th International Conference on Embedded Systems, Mumbai, India, 5–9 January 2014, pp. 258–263 (2014)

12. Merchant, F., Chattopadhyay, A., Raha, S., Nandy, S.K., Narayan, R.: Accelerating BLAS and LAPACK via efficient floating point architecture design. Parallel

Process. Lett. 27(3–4), 1–17 (2017)

13. Merchant, F., Choudhary, N., Nandy, S.K., Narayan, R.: Efficient realization of

table look-up based double precision floating point arithmetic. In: 29th International Conference on VLSI Design and 15th International Conference on Embedded

Systems, VLSID 2016, Kolkata, India, 4–8 January 2016, pp. 415–420 (2016)



14. Merchant, F., Vatwani, T., Chattopadhyay, A., Raha, S., Nandy, S.K., Narayan, R.:

Achieving efficient QR factorization by algorithm-architecture co-design of householder transformation. In: 29th International Conference on VLSI Design and 15th

International Conference on Embedded Systems, VLSID 2016, Kolkata, India, 4–8

January 2016, pp. 98–103 (2016)

15. Nash, J.G., Hansen, S.: Modified Faddeeva algorithm for concurrent execution of

linear algebraic operations. IEEE Trans. Comput. 37(2), 129–137 (1988)

16. Rákossy, Z.E., Merchant, F., Acosta Aponte, A., Nandy, S.K., Chattopadhyay, A.: Efficient and scalable CGRA-based implementation of column-wise Givens rotation. In: ASAP, pp. 188–189 (2014)

17. Rákossy, Z.E., Merchant, F., Acosta Aponte, A., Nandy, S.K., Chattopadhyay, A.: Scalable and energy-efficient reconfigurable accelerator for column-wise Givens rotation. In: 22nd International Conference on Very Large Scale Integration, VLSI-SoC, Playa del Carmen, Mexico, 6–8 October 2014, pp. 1–6 (2014)

18. Sandhu, F., Selamat, H., Alavi, S.E., Behtaji Siahkal Mahalleh, V.: FPGA-based implementation of Kalman filter for real-time estimation of tire velocity and acceleration. IEEE Sens. J. 17(17), 5749–5758 (2017)

19. Smith, B.J.: R package magma: matrix algebra on GPU and multicore architectures, version 0.2.2, 3 September 2010. http://cran.r-project.org/package=magma

20. Thornton, C.L., Bierman, G.J.: Givens transformation techniques for Kalman filtering. Acta Astronaut. 4(7–8), 847–863 (1977)

21. Wang, Q., Zhang, X., Zhang, Y., Yi, Q.: AUGEM: automatically generate high

performance dense linear algebra kernels on x86 CPUs. In: Proceedings of the

International Conference on High Performance Computing, Networking, Storage

and Analysis, SC 2013, pp. 25:1–25:12. ACM, New York (2013)

22. Zhong, G., Niar, S., Prakash, A., Mitra, T.: Design of multiple-target tracking

system on heterogeneous system-on-chip devices. IEEE Trans. Veh. Technol. 65(6),

4802–4812 (2016)

FPGA-Based Memory Efficient Shift-And Algorithm for Regular Expression

Junsik Kim and Jaehyun Park

Department of Information and Communication Engineering, Inha University, Incheon 22212, Korea
jskim@emcl.org, jhyun@inha.ac.kr

Abstract. This paper proposes an FPGA-based reconfigurable regular expression matching engine for a network intrusion detection system (NIDS). In the proposed system, the Shift-And algorithm is used to perform regular expression matching. To improve the memory efficiency of the algorithm, especially for Non-deterministic Finite Automata (NFA) with a large number of states, this paper proposes a parallel matching module with a counter module and a priority encoder. In addition, in the proposed system, a large NFA can be divided into several NFAs and processed separately by the parallel matching modules. The proposed architecture with 265 regular expression matching modules is implemented on a Xilinx Zynq-7030 FPGA; it achieves 1.066 Gbps throughput and uses 54.81% of the LUTs.



Introduction

Recently, the importance of network security has increased as the damage caused by DDoS attacks and ransomware grows. To reduce this network security risk, Network Intrusion Detection Systems (NIDS), which protect computer systems by analyzing network packets, are widely adopted. A NIDS distinguishes malicious packets among the received network packets based on predefined rules that are usually written as Perl-compatible regular expressions. Snort [1] and Suricata [2] are the most widely used intrusion detection software packages that provide user-defined detection rules in Perl-compatible regular expressions for deep packet inspection. In spite of its flexibility and extensibility, software-based NIDS has some disadvantages. First of all, since Snort is not an intrusion prevention system but a pure intrusion detection system, the NIDS software itself can be infected and altered by other malicious software [3]; in that case, the normal intrusion detection function may be disrupted. Secondly, since the network pattern matching is performed in software, the processing speed depends on the performance of the CPU core, which means detection may become slower as the complexity of the rules grows.
© Springer International Publishing AG, part of Springer Nature 2018. N. Voros et al. (Eds.): ARC 2018, LNCS 10824, pp. 132–141, 2018

To overcome these shortcomings of software-based NIDS, several hardware-based NIDS systems, normally designed using high-performance FPGAs, have been proposed during the past decades. Clark proposed a parallel decoder

to process multiple input characters [4]. Sidhu suggested a way to convert regular expressions to Non-deterministic Finite Automata (NFA) and binary logic expressions [5]. Yamagaki proposed an efficient character-set comparison logic and an NFA configuration algorithm for parallel character processing [6]. However, these architectures must modify their internal logic whenever the regular expressions change, and in turn the logic blocks must be reconfigured.

Another approach to designing an FPGA-based NIDS is the memory-based pattern matching scheme. To implement an ordinary Deterministic Finite Automaton (DFA) that searches ASCII strings represented as a regular expression with N possible states, a state transition table with 256 × N entries is required in memory, which is very inefficient in memory usage and makes it almost impossible to handle large regular expressions in a practical NIDS. To improve memory efficiency, Hieu proposed a structure that combines redundant states into one [7], and Freire proposed a method using Huffman coding [8], which analyzes the user-defined rules and assigns short codes to frequently occurring characters to increase overall memory efficiency. Harwayne-Gidansky and Meghana proposed NIDS designs using the counting Bloom filter data structure [9,10]; the counting Bloom filter is suitable for black- or white-list methods because it produces no false negatives, although false positives can occur. String matching algorithms using hash functions have also been proposed [11–14]. Bando proposed a way to avoid overlap by specifying a character set as a symbol set in a range hash and transmitting only the packets included in the range to the main module, thereby reducing the load on the main module [11]. Lee proposed a structure that dynamically reconfigures regular expressions [15]; however, since it uses Xilinx's shift register LUT (SRL), it cannot be applied to other FPGA architectures. Divyasree proposed a structure based on the Shift-OR algorithm and a bit-map with the Glushkov NFA that showed an outstanding throughput of more than 4 Gbps [16]. Cronin proposed a structure using a countable Bit-Parallel Glushkov NFA (BPG-NFA) [17].

Meanwhile, the Extended Shift-And algorithm is useful for dynamically reconfigurable patterns, such as in DNA and text searching, because the NFA state can be updated at once using bit-parallelism [18]. To achieve high throughput in a NIDS, Kaneta implemented the Extended Shift-And algorithm on an FPGA [19]. However, in their architecture the maximum number of states depends on the register length, so it is hard to support regular expressions with many states. In addition, the number of NFA states can grow quickly due to constrained repetition; in a practical NIDS, detection rules with thousands of repetition expressions are common, which results in a very large number of states. Therefore, this paper proposes a memory efficient architecture for the Extended Shift-And algorithm to solve this problem.

This paper consists of five sections. Section 2 introduces the extended regular expression and the Shift-And algorithm using bit-parallelism. In Sect. 3, the overall system design and implementation details are described. Section 4 shows the implementation results, and Sect. 5 concludes this paper.



Pattern Matching Method


Regular expressions are often used to specify a set of strings, matching multiple strings with a single pattern. They are primarily used for text search, DNA pattern detection, and deep packet inspection in NIDS software. In this paper, regular expressions define the packet detection rules used by the FPGA-based NIDS engine. Table 1 summarizes the regular expressions supported by the proposed scheme.

Table 1. Regular expression description

Symbol   Description
.        Matches any one character
[]       Selects one of several characters, equivalent to using multiple |; a range can be specified with the "-" symbol
()       Binds multiple expressions together
*        Zero or more occurrences
{m, n}   m times or more, and n times or less
?        Zero or one occurrence
+        One or more occurrences
|        Matches one of the two alternatives
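The constructs in Table 1 follow standard regular expression semantics; as a quick sanity check, each one can be exercised with Python's re module (whose dialect may differ from the hardware engine in minor details):

```python
import re

# One test case per construct from Table 1.
cases = [
    (r"a.c",     "abc",  True),   # .      : any one character
    (r"[a-c]x",  "bx",   True),   # []     : character class with a range
    (r"(ab)+",   "abab", True),   # ()     : grouping (with +)
    (r"ab*c",    "ac",   True),   # *      : zero or more
    (r"a{2,3}",  "aa",   True),   # {m, n} : bounded repetition
    (r"ab?c",    "abc",  True),   # ?      : zero or one
    (r"cat|dog", "dog",  True),   # |      : alternation
    (r"a{2,3}",  "a",    False),  # below the lower bound -> no match
]

for pattern, text, expect in cases:
    assert bool(re.fullmatch(pattern, text)) == expect
print("all Table 1 constructs behave as expected")
```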

To apply the Extended Shift-And algorithm, regular expressions should be converted to a Non-deterministic Finite Automaton defined as in Eq. (1):

A = (Q, Σ, I, F, Δ)    (1)

where Q is the set of all states, Σ is the set of all characters in the regular expression, I is the start state, F is the end state, and Δ is the transition function, which determines the next state. When a new character α is received by the NFA, the next state is determined according to Eq. (2), where D is a function that determines the next state from the input character and the current state:

Δ = {(q, α, q′) | q ∈ Q, α ∈ Σ ∪ {ε}, q′ ∈ D(q, α)}    (2)


In Thompson [20] and Glushkov [21] NFAs, one state can move to several states, so multiple states can be active at the same time. However, to apply the Shift-And algorithm, the NFA must have the property of moving only in one direction. Figure 1 shows an NFA with that property.

Fig. 1. NFA of R = ba{2,3}c+de*f


Shift-And Algorithm

In this paper, the Extended Shift-And algorithm is used to process the regular expressions [18]. Since the active states are expressed as bits, state transitions can be performed through a shift operation; thus, if the register length is w bits, patterns of up to w consecutive characters can be detected.

The NFA of the algorithm is linear and the state transition proceeds step by step. B holds the transition conditions of all characters Σ in table form, and the input character α is used as the address into this table. If the current state of the NFA is D and a new character arrives, the next state D_n is updated by Eq. (3):

D_n ← ((D << 1) | 0^{m−1}1) & B[α]    (3)

In Eq. (3), the shifted current state D is masked with B[α] to determine the next state. A pattern match has occurred whenever D & 10^{m−1} ≠ 0.
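The exact-match recurrence of Eq. (3) can be transcribed in a few lines. This is a minimal software sketch of the classic bit-parallel algorithm, not the hardware implementation:

```python
def shift_and(pattern: str, text: str) -> bool:
    """Plain (exact-match) Shift-And, a direct transcription of Eq. (3):
    D_n = ((D << 1) | 1) & B[alpha], reporting a match whenever the bit
    for the final state, 10^(m-1), becomes set."""
    m = len(pattern)
    # B[c]: bitmask with bit i set iff pattern[i] == c
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << i)

    D = 0
    for alpha in text:
        D = ((D << 1) | 1) & B.get(alpha, 0)
        if D & (1 << (m - 1)):
            return True
    return False

print(shift_and("abc", "xxabcx"))  # True
print(shift_and("abc", "abx"))     # False
```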

Since Eq. (3) is an exact pattern matching, it can not be applied to extended

regular expressions. Therefore, R, F , I, and A are added for extend regular

expression and the next state Dn is calculated by the following equations.

D ← (((D << 1) | 0m−1 1)&B [α]) | (D&R [α])


Df ← D | F


Dn ← D | (A&((∼ (Df − I)) ⊕ Df ))


Equation (4) is used for exact pattern matching and iteration expressions.

Equations (5) and (6) are used where can occur, such as ?, +, {m, n}. R has a

table of repetition conditions for all characters Σ, and F, I, and A are used to

indicate the location of occurrence in that regular expression. However, since

R = α {m, n} is replaced by R = α?n−m αm , (n ≥ m), the number of states
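Equation (4) alone already covers self-loop repetition such as the + operator via the R table. The sketch below shows only that part; the ε-masks I, F, and A of Eqs. (5) and (6), needed for ?, * and {m, n}, are deliberately omitted:

```python
def shift_and_plus(pattern: str, text: str) -> bool:
    """Sketch of Eq. (4) only: D' = (((D << 1) | 1) & B[a]) | (D & R[a]).
    Supports literals plus the '+' operator (one or more) by setting a
    self-loop bit for the repeated character in R.  The epsilon masks
    I, F, A of Eqs. (5)-(6) are omitted, so '?', '*', '{m,n}' are not
    handled here."""
    B, R = {}, {}
    states = 0
    i = 0
    while i < len(pattern):
        c = pattern[i]
        B[c] = B.get(c, 0) | (1 << states)
        if i + 1 < len(pattern) and pattern[i + 1] == '+':
            R[c] = R.get(c, 0) | (1 << states)  # self-loop on this state
            i += 1
        i += 1
        states += 1

    D = 0
    for a in text:
        D = (((D << 1) | 1) & B.get(a, 0)) | (D & R.get(a, 0))
        if D & (1 << (states - 1)):
            return True
    return False

print(shift_and_plus("ab+c", "abbbc"))  # True
print(shift_and_plus("ab+c", "ac"))     # False
```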




Fig. 2. Overall system block diagram



Countable Pattern Matching Structure

Overall System Structure

Figure 2 shows the overall system architecture with parallel Regular Expression Matching Modules (REMM). A host interface is required to exchange the detection rules, written as regular expressions, generated by the host CPU. The main control module controls each REMM by generating module enable and chain enable signals. A REMM has a 32-bit state register, so it can represent a regular expression with at most 32 NFA states. If there are more states than this limit, the remaining states are processed by the next REMM, indicated by the Chain Out signal. For example, if the number of NFA states of a regular expression is 120, the number of modules N required is (120 >> 5) + 1 = 4, and the last module uses only 24 bits (120 mod 32). Therefore, the host needs to create a transition table for multi-matching during preprocessing for modules where two regular expressions overlap. Each REMM reads the bit patterns stored in the FPGA's block memory and performs the comparison operations. The matching signal of each module is encoded in the main control module and transmitted to the host. The main control module also updates the registers and memory of each REMM and controls the connection state of each module through the Chain Enable signal.
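The module-count calculation above can be sketched as a small helper. remm_layout is a hypothetical name for illustration; note that the paper's (n >> 5) + 1 shortcut assumes n is not an exact multiple of 32, while a ceiling division handles both cases:

```python
import math

def remm_layout(num_states: int, width: int = 32):
    """How many 32-bit REMMs a regular expression needs, and how many
    bits of the last module are used -- following the example above:
    120 states -> 4 modules, the last one using 120 mod 32 = 24 bits."""
    modules = math.ceil(num_states / width)
    last_bits = num_states - (modules - 1) * width
    return modules, last_bits

print(remm_layout(120))  # (4, 24)
```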


Regular Expression Matching Module

Figure 3 shows the structure of the regular expression matching module, which is similar to the original form of the Extended Shift-And structure proposed in [19]. To solve the unrolling issue, signals for the counter module and daisy-chain signals were added. Chain In is the signal from the previous module and Chain Out is the signal that activates the next module; the En and Chain En signals of the main control module control the activation of the next module. The Match Out signal passes the pattern matching result to the main control module.

Fig. 3. Regular expression matching module

The STATE, MOVE, and REPEAT blocks in Fig. 3 correspond to D, B, and R in Eqs. (3) and (4), respectively; likewise, EpsBEG, EpsEND, and EpsBLK represent I, F, and A. MOVE and REPEAT are block memories that use the input packet as an address; therefore, when a new packet arrives, the operation is performed in combinational logic within one clock cycle. The REMM also includes one counter module to support constrained repetition expressions. The update module updates the contents of the block memory when the main control module receives a new regular expression from the host.


Counter Module

Figure 4 shows the structure of the counter module. The counter module has a Constrained-repetition Register (CR) indicating the position of the constrained repetition.

Fig. 4. Counter module