4 Performance Evaluation of dgeqrf, dgetrf and dgemm on Multicore and GPGPU
Eﬃcient Realization of Kalman Filter on CGRA
Fig. 5. RDP configurations for KF and scheduling of different routines in KF: (a) different configurations of RDP corresponding to identified macro operations in dgemm, dgeqrf, and dgetrf; (b) software optimization in KF resulting in reduction in the runtime (logical diagram)
4.2 Hardware Optimized KF
In the hardware optimized KF, we revisit the basic operations required in MFA, such as dgemm, dgeqrf, and dgetrf, and identify several macro operations within these basic operations. We realize these macro operations in the RDP depicted in Fig. 2. Configurations of the RDP corresponding to the identified macro operations in dgemm, dgeqrf, and dgetrf are shown in Fig. 5(a). With these configurations, we achieve up to 50% of the theoretical peak of the PE, where the theoretical peak of the PE is 4.9 Gflops at 700 MHz. The theoretical peak of the PE is higher here because the RDP consists of 4 multipliers and 3 adders.
We perform a software optimization in MFA by analyzing the DAG of MFA. We overlap the dgeqrf, dgemm, and dgetrf routines as shown in Fig. 5(b). The optimization diagram in Fig. 5(b) is the logical flow of computations after software optimization. To overlap these routines, we identify pipeline stalls in the PE during execution and insert independent instructions while maintaining operational correctness. Overlapping dgeqrf, dgemm, and dgetrf results in a significant reduction in the runtime of MFA, which directly translates to a performance improvement in KF. After software optimization, we are able to attain 65% of the theoretical peak in the PE, which is a 30% improvement. The performance improvement after each optimization is shown in Fig. 6, in which we have also incorporated the performance attained on multicore and GPGPU.

It can be observed in Fig. 6 that the performance of KF on multicore and GPGPU is hardly 20–30% of the theoretical peak of these platforms, while we achieve up to 65% of the theoretical peak of the PE in KF, which is 2.15x higher.
F. Merchant et al.
Fig. 6. Performance of KF in PE, multicore, and GPGPU (percentage of theoretical peak vs. matrix size, 1x1 to 10x10 ×10³; legend: Kalman Filter (PLASMA), Kalman Filter (MAGMA), Kalman Filter (PE))
5 Parallel Realization and Results
For the parallel realization of KF, we use three different configurations of REDEFINE; two of them are shown in Fig. 7. In configuration 1 we use a 2 × 2 Tile array, in configuration 2 a 3 × 3 Tile array, and in configuration 3 a 4 × 4 Tile array. In our simulations, for configurations 1 and 2, we use the last column of the Tile array as memory, attaching memories in place of PEs.
Fig. 7. Diﬀerent conﬁgurations of REDEFINE for KF realization and scheduling
In configuration 3, we use the entire Tile array for computations and hence attach another memory PE to the Router alongside the compute PE. The memory PE is divided into two segments: the second segment acts as a global memory that is accessible by all other Tiles, while the first segment is private to the compute PE and is used for computations on local data. Typically, we have 256 KB of memory per Tile, and the compute PE consists of 256 registers of width 64 bits. The scheduling technique for REDEFINE is shown on the right side of Fig. 7. We divide the input matrix into k × k blocks, where k is the size of a row/column of the Tile array. These blocks are further divided into subblocks, where the size of a subblock depends on the number of local registers and the size of the local memory available to a PE. Blocks of the matrices are loaded, computation is performed, and the result is stored to the global memory. The percentage of the theoretical peak performance attained for each configuration is shown in Fig. 8(a). It can be observed in Fig. 8(a) that the performance attained for each configuration saturates at 60% of the theoretical peak performance, which is 2x higher than the performance attained on multicore and GPGPU.

Fig. 8. (a) Performance in different configurations of REDEFINE in KF realization (percentage of theoretical peak vs. matrix size, 1x1 to 10x10 ×10³; legend: Kalman Filter (2x2), (3x3), (4x4)); (b) power performance comparison (Gflops/watt) of PE with other platforms (Intel Core i7 Haswell, Nvidia GTX 480 SM, Altera Stratix IV, ClearSpeed CSX700, Nvidia Tesla C2075) for KF
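The blocking scheme described above can be sketched as follows. This is our illustrative reconstruction, not code from the paper: the function names, the nearly-equal split policy, and the clamping of edge subblocks are all assumptions. It shows how an n × n input matrix is split into k × k blocks matching the Tile array, with each block cut into subblocks bounded by a PE's local storage:

```python
def block_ranges(n, parts):
    """Split dimension n into `parts` nearly equal (start, stop) ranges."""
    base, extra = divmod(n, parts)
    ranges, start = [], 0
    for i in range(parts):
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges

def partition(n, k, sub):
    """k x k blocks for the Tile array, each cut into sub x sub subblocks.

    Returns a list of (block_bounds, subblock_list) pairs, where edge
    subblocks are clamped to the block boundary (an assumed policy).
    """
    blocks = []
    for r0, r1 in block_ranges(n, k):
        for c0, c1 in block_ranges(n, k):
            subs = [(i, min(i + sub, r1), j, min(j + sub, c1))
                    for i in range(r0, r1, sub)
                    for j in range(c0, c1, sub)]
            blocks.append(((r0, r1, c0, c1), subs))
    return blocks

# An 8x8 matrix on a 2x2 Tile array with 2x2 subblocks: 4 blocks, 4 subblocks each.
blks = partition(8, 2, 2)
print(len(blks), len(blks[0][1]))  # -> 4 4
```

In the paper's scheme, each subblock would be sized so that its operands fit in the 256 local registers and the private half of the Tile's memory.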
We evaluate the power performance of the PE based on the technique presented in [10]. We compare the PE with other platforms for KF as shown in Fig. 8(b). It can be observed that the PE is capable of achieving 4–105x higher performance improvement over platforms such as ClearSpeed CSX700, Intel Core i7, and Nvidia GPGPU.
6 Conclusion
In this paper, we presented an efficient realization of KF, using the versatile MFA as a tool. Based on the case studies presented on dgemm, dgetrf, and dgeqrf, it was identified that the performance of these operations on multicore and GPGPU is not satisfactory even with highly optimized software packages like PLASMA and MAGMA. It was also shown that the performance attained in KF on multicore and GPGPU is 20–30% of the theoretical peak performance of the underlying platform. To accelerate KF on a customizable platform like REDEFINE, we identified macro operations in the routines of MFA and realized them on the RDP. Our approach resulted in a 67% improvement in the performance of KF. A software tuning in MFA was also presented that resulted in a 30% performance improvement over the hardware optimized KF. Overall, our approach of algorithm-architecture co-design resulted in a 116% performance improvement over the base realization of KF. In terms of Gflops/watt, KF is 2–105x better than multicore and GPGPU platforms.
References
1. Baluni, A., Merchant, F., Nandy, S.K., Balakrishnan, S.: A fully pipelined modular multiple precision floating point multiplier with vector support. In: 2011 International Symposium on Electronic System Design, pp. 45–50, December 2011
2. Cerati, G., Elmer, P., Lantz, S., MacNeill, I., McDermott, K., Riley, D., Tadel, M., Wittich, P., Würthwein, F., Yagil, A.: Traditional tracking with Kalman filter on parallel architectures. J. Phys.: Conf. Ser. 608(1), 012057 (2015)
3. Das, S., Madhu, K.T., Madhu, K., Krishna, M., Sivanandan, N., Merchant, F., Natarajan, S., Biswas, I., Pulli, A., Nandy, S.K., Narayan, R.: A framework for post-silicon realization of arbitrary instruction extensions on reconfigurable data-paths. J. Syst. Archit. Embed. Syst. Des. 60(7), 592–614 (2014)
4. Gentleman, W.M., Kung, H.T.: Matrix triangularization by systolic arrays. In: SPIE Proceedings, vol. 298, pp. 19–26 (1982)
5. Higham, N.J.: Exploiting fast matrix multiplication within the level 3 BLAS. ACM Trans. Math. Softw. 16(4), 352–368 (1990)
6. Hill, M.D., Marty, M.R.: Amdahl's law in the multicore era. Computer 41(7), 33–38 (2008)
7. Johnson, B., Thomas, N., Rani, J.S.: An FPGA based high throughput discrete Kalman filter architecture for real-time image denoising. In: 2017 30th International Conference on VLSI Design and 2017 16th International Conference on Embedded Systems (VLSID), January 2017
8. Mahadurkar, M., Merchant, F., Maity, A., Vatwani, K., Munje, I., Gopalan, N., Nandy, S.K., Narayan, R.: Co-exploration of NLA kernels and specification of compute elements in distributed memory CGRAs. In: XIVth International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, SAMOS 2014, Agios Konstantinos, Samos, Greece, 14–17 July 2014, pp. 225–232 (2014)
9. Merchant, F., Maity, A., Mahadurkar, M., Vatwani, K., Munje, I., Krishna, M., Nalesh, S., Gopalan, N., Raha, S., Nandy, S.K., Narayan, R.: Micro-architectural enhancements in distributed memory CGRAs for LU and QR factorizations. In: 2015 28th International Conference on VLSI Design (VLSID), pp. 153–158, January 2015
10. Merchant, F.A., Vatwani, T., Chattopadhyay, A., Raha, S., Nandy, S.K., Narayan, R.: Efficient realization of Householder transform through algorithm-architecture co-design for acceleration of QR factorization. IEEE Trans. Parallel Distrib. Syst. PP(99), 1 (2018)
11. Merchant, F., Chattopadhyay, A., Garga, G., Nandy, S.K., Narayan, R., Gopalan, N.: Efficient QR decomposition using low complexity column-wise Givens rotation (CGR). In: 2014 27th International Conference on VLSI Design and 2014 13th International Conference on Embedded Systems, Mumbai, India, 5–9 January 2014, pp. 258–263 (2014)
12. Merchant, F., Chattopadhyay, A., Raha, S., Nandy, S.K., Narayan, R.: Accelerating BLAS and LAPACK via efficient floating point architecture design. Parallel Process. Lett. 27(3–4), 1–17 (2017)
13. Merchant, F., Choudhary, N., Nandy, S.K., Narayan, R.: Efficient realization of table lookup based double precision floating point arithmetic. In: 29th International Conference on VLSI Design and 15th International Conference on Embedded Systems, VLSID 2016, Kolkata, India, 4–8 January 2016, pp. 415–420 (2016)
14. Merchant, F., Vatwani, T., Chattopadhyay, A., Raha, S., Nandy, S.K., Narayan, R.: Achieving efficient QR factorization by algorithm-architecture co-design of Householder transformation. In: 29th International Conference on VLSI Design and 15th International Conference on Embedded Systems, VLSID 2016, Kolkata, India, 4–8 January 2016, pp. 98–103 (2016)
15. Nash, J.G., Hansen, S.: Modified Faddeeva algorithm for concurrent execution of linear algebraic operations. IEEE Trans. Comput. 37(2), 129–137 (1988)
16. Rákossy, Z.E., Merchant, F., Acosta Aponte, A., Nandy, S.K., Chattopadhyay, A.: Efficient and scalable CGRA-based implementation of column-wise Givens rotation. In: ASAP, pp. 188–189 (2014)
17. Rákossy, Z.E., Merchant, F., Acosta Aponte, A., Nandy, S.K., Chattopadhyay, A.: Scalable and energy-efficient reconfigurable accelerator for column-wise Givens rotation. In: 22nd International Conference on Very Large Scale Integration, VLSI-SoC, Playa del Carmen, Mexico, 6–8 October 2014, pp. 1–6 (2014)
18. Sandhu, F., Selamat, H., Alavi, S.E., Behtaji Siahkal Mahalleh, V.: FPGA-based implementation of Kalman filter for real-time estimation of tire velocity and acceleration. IEEE Sens. J. 17(17), 5749–5758 (2017)
19. Smith, B.J.: R package magma: matrix algebra on GPU and multicore architectures, version 0.2.2, 3 September 2010. http://cran.r-project.org/package=magma
20. Thornton, C.L., Bierman, G.J.: Givens transformation techniques for Kalman filtering. Acta Astronaut. 4(7–8), 847–863 (1977)
21. Wang, Q., Zhang, X., Zhang, Y., Yi, Q.: AUGEM: automatically generate high performance dense linear algebra kernels on x86 CPUs. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC 2013, pp. 25:1–25:12. ACM, New York (2013)
22. Zhong, G., Niar, S., Prakash, A., Mitra, T.: Design of multiple-target tracking system on heterogeneous system-on-chip devices. IEEE Trans. Veh. Technol. 65(6), 4802–4812 (2016)
FPGA-Based Memory Efficient Shift-And Algorithm for Regular Expression Matching

Junsik Kim and Jaehyun Park(B)

Department of Information and Communication Engineering, Inha University, Incheon 22212, Korea
jskim@emcl.org, jhyun@inha.ac.kr
Abstract. This paper proposes an FPGA-based reconfigurable regular expression matching engine for a network intrusion detection system (NIDS). In the proposed system, the Shift-And algorithm is used to process regular expression matching. To improve the memory efficiency of the algorithm, especially for Non-deterministic Finite Automata (NFA) with a large number of states, this paper proposes a parallel matching module with a counter module and a priority encoder. In addition, in the proposed system, a large NFA can be divided into several NFAs and processed separately by the parallel matching modules. The proposed architecture with 265 regular expression matching modules is implemented using a Xilinx Zynq-7030 FPGA; it shows 1.066 Gbps throughput and uses 54.81% of the LUTs.
1 Introduction
Recently, the importance of network security has increased as the damage caused by DDoS attacks and ransomware grows. To reduce this network security risk, a Network Intrusion Detection System (NIDS), which protects the computer system by analyzing network packets, is widely adopted. An NIDS distinguishes malicious packets among the received network packets based on predefined rules that are usually written as Perl-compatible regular expressions. Snort [1] and Suricata [2] are the most widely used intrusion detection software packages that provide user-defined detection rules in Perl-compatible regular expressions for deep packet inspection. In spite of its flexibility and expandability, software-based NIDS has some disadvantages. First of all, since Snort is not an intrusion prevention system but a pure intrusion detection system, the NIDS software itself can be infected and altered by other virus software [3]; in that case, the normal intrusion detection function may be disrupted. Secondly, since the network pattern matching is performed in software, the processing speed depends on the performance of the CPU core. This means the detection speed may become slower as the complexity of the rules grows.

To overcome these shortcomings of software-based NIDS, several hardware-based NIDS systems, normally designed using high performance FPGAs,

© Springer International Publishing AG, part of Springer Nature 2018
N. Voros et al. (Eds.): ARC 2018, LNCS 10824, pp. 132–141, 2018.
https://doi.org/10.1007/978-3-319-78890-6_11
have been proposed during the past decades. Clark proposed a parallel decoder to process multiple input characters [4]. Sidhu suggested a way to convert regular expressions to Non-deterministic Finite Automata (NFA) and binary logic expressions [5]. Yamagaki proposed an efficient character set comparison logic and an NFA configuration algorithm for parallel character processing [6]. However, these architectures require modifying the internal logic whenever the regular expressions change, and in turn, the logic blocks must be reconfigured.
Another approach to designing an FPGA-based NIDS is to use a memory-based pattern matching scheme. To implement an ordinary Deterministic Finite Automaton (DFA) that searches ASCII strings represented as a regular expression with N possible states, a state transition table with 256 × N entries is necessary in memory, which is very inefficient in memory usage and makes it almost impossible to handle large regular expressions in a practical NIDS system. To improve the memory efficiency, Hieu proposed a structure that combines redundant states into one [7], and Freire proposed a method using Huffman coding [8], which analyzes user-defined rules and assigns short codes to frequently occurring characters to increase overall memory efficiency. Also, Harwayne-Gidansky and Meghana proposed NIDSs using the data structure of the counting Bloom filter [9,10]. The counting Bloom filter is suitable for black- or white-list methods because it produces no false negatives, although false positives can occur. String matching algorithms using hash functions have also been proposed [11–14]. Bando proposed a way to avoid overlapping by specifying a character set as a symbol set in a range hash, and to transmit only the packets included in the range to the main module, thereby reducing the required throughput of the main module [11]. Lee proposed a structure that dynamically reconfigures regular expressions [15]. However, since it uses Xilinx's shift register LUT (SRL), it cannot be applied to other FPGA architectures. Divyasree proposed a structure based on the Shift-OR algorithm and a bitmap with the Glushkov NFA, which showed an outstanding throughput of more than 4 Gbps [16]. Cronin proposed a structure using a countable Bit-Parallel Glushkov NFA (BPG-NFA) [17].
Meanwhile, the Extended Shift-And algorithm is useful for efficiently and dynamically reconfiguring patterns, such as in DNA and text searching, because the NFA can be updated at once using bit-parallelism [18]. To achieve high throughput in an NIDS system, Kaneta implemented the Extended Shift-And algorithm on an FPGA [19]. However, in their architecture, the maximum number of states depends on the length of the register, so it is hard to apply a regular expression having many states. In addition, the number of states in the NFA can quickly increase due to constrained repetition. In a practical NIDS system, detection rules with thousands of repetition expressions are easily encountered, which results in a very large number of states. Therefore, in this paper, a memory efficient architecture that can be applied to the Extended Shift-And algorithm is proposed to solve this problem.
This paper consists of five sections. Section 2 introduces the extended regular expressions and the Shift-And algorithm using bit-parallelism. In Sect. 3, the
overall system design and implementation details are described. Section 4 shows
the implementation results and Sect. 5 concludes this paper.
2 Pattern Matching Method

2.1 Background
Regular expressions are often used to specify a set of strings, matching multiple strings with a single pattern. They are primarily used for character searches, DNA pattern detection, and deep packet inspection in NIDS software. In this paper, regular expressions are used to define the packet detection rules used by the FPGA-based NIDS engine. Table 1 summarizes the regular expressions supported by the proposed scheme.
Table 1. Regular expression description

Symbol   Description
.        Matches any one character
[ ]      Selects one of several characters, like using multiple '.'; a range can be specified with the '-' symbol
( )      Multiple expressions can be bound together
*        Zero or more occurrences
{m, n}   m times or more, and n times or less
?        0 or 1 occurrence
+        One or more occurrences
|        Matches one of the two alternatives
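As a quick sanity check of these operators, the pattern from Fig. 1, R = ba{2,3}c+de*f, can be exercised with an off-the-shelf regex engine. This is only an illustration of the operator semantics; the proposed hardware evaluates such patterns with the Shift-And algorithm, not with Python's backtracking matcher:

```python
import re

# Pattern from Fig. 1: 'b', then 2-3 'a's, one or more 'c's,
# 'd', zero or more 'e's, and a final 'f'.
pattern = re.compile(r"ba{2,3}c+de*f")

# Strings that satisfy every constrained repetition:
assert pattern.fullmatch("baacdf")        # a{2}, c{1}, e{0}
assert pattern.fullmatch("baaacccdeeef")  # a{3}, c{3}, e{3}

# Strings that each violate one operator:
assert pattern.fullmatch("bacdf") is None     # only one 'a'
assert pattern.fullmatch("baadf") is None     # no 'c'
assert pattern.fullmatch("baaaacdf") is None  # four 'a's exceed {2, 3}
```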
To apply the Extended Shift-And algorithm, regular expressions should be converted to a Non-deterministic Finite Automaton defined as in Eq. (1):

A = (Q, Σ, I, F, Δ)    (1)

where Q is the entire state set, Σ is the set of all characters in the regular expression, I is the start state, F is the end state, and Δ is the transition function. The transition function Δ determines the next state: when a new character α is received by the NFA, the next state is determined according to Eq. (2), where D is a specific function that determines the next state depending on the input character and the current state.

Δ = {(q, α, q′) | q ∈ Q, α ∈ Σ ∪ {ε}, q′ ∈ D(q, α)}    (2)

Thompson [20] and Glushkov [21] NFAs can move from one state to several states, so multiple states can be active at the same time. However, to apply the Shift-And algorithm, the NFA should have the property of moving only in one direction. Figure 1 shows an NFA with that property.
Fig. 1. NFA of R = ba{2,3}c+de*f
2.2 Shift-And Algorithm
In this paper, the Extended Shift-And algorithm is used to process the regular expressions [18]. Since the active states are expressed as bits, state transitions can be performed through a shift operation. Thus, if the register length is w bits, w consecutive characters can be detected.

The NFA of the algorithm is linear and the state transition proceeds step by step. B holds the transition conditions of all characters Σ in table form, and the input character α is used as the address into the table. If the current state of the NFA is D and a new character arrives, the next state Dn is updated by Eq. (3):

Dn ← ((D << 1) | 0^(m−1)1) & B[α]    (3)

In Eq. (3), the shifted current state D is masked with B[α] to determine the next state. A pattern match is detected by checking whether D & 10^(m−1) ≠ 0; if so, pattern matching has occurred.
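The bit-parallel update of Eq. (3) is compact enough to sketch in software. The following minimal Python model (the names `build_b` and `shift_and` are ours, not from the paper) builds the B table for an exact pattern and reports a match whenever the final state bit becomes active:

```python
def build_b(pattern):
    """B[c] has bit i set iff pattern[i] == c (the transition table of Eq. (3))."""
    b = {}
    for i, c in enumerate(pattern):
        b[c] = b.get(c, 0) | (1 << i)
    return b

def shift_and(text, pattern):
    """Yield end positions in text where pattern occurs exactly."""
    b = build_b(pattern)
    m = len(pattern)
    d = 0
    for pos, c in enumerate(text):
        # Dn <- ((D << 1) | 0^(m-1)1) & B[alpha]
        d = ((d << 1) | 1) & b.get(c, 0)
        if d & (1 << (m - 1)):  # final state bit active -> match
            yield pos

print(list(shift_and("abcabcab", "bca")))  # -> [3, 6]
```

Note that the hardware holds D in a fixed-width register (32 bits per REMM), whereas Python integers are unbounded; the shift here plays the role of the register shift in the FPGA datapath.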
Since Eq. (3) performs only exact pattern matching, it cannot be applied to extended regular expressions. Therefore, R, F, I, and A are added for the extended regular expressions and the next state Dn is calculated by the following equations:

D ← (((D << 1) | 0^(m−1)1) & B[α]) | (D & R[α])    (4)

Df ← D | F    (5)

Dn ← D | (A & ((∼(Df − I)) ⊕ Df))    (6)

Equation (4) is used for exact pattern matching and iteration expressions. Equations (5) and (6) are used where ε-transitions can occur, such as with ?, +, and {m, n}. R holds a table of repetition conditions for all characters Σ, and F, I, and A are used to indicate the locations of ε occurrences in the regular expression. However, since R = α{m, n} is replaced by R = α?^(n−m) α^m, (n ≥ m), the number of states increases.
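Equation (4) extends the update with a self-loop table R so that a state with a repetition can stay active. The following hedged sketch covers only the simple case of '+' applied to single characters; the parser, names, and pattern syntax are our assumptions, and the ε-handling of Eqs. (5) and (6) is omitted:

```python
def build_tables(pattern):
    """Parse a toy pattern of characters optionally followed by '+'.
    Returns (B, R, m): B[c] marks positions of c, R[c] marks positions
    that may repeat on c (the self-loops of Eq. (4))."""
    b, r = {}, {}
    i, pos = 0, 0
    while i < len(pattern):
        c = pattern[i]
        b[c] = b.get(c, 0) | (1 << pos)
        if i + 1 < len(pattern) and pattern[i + 1] == '+':
            r[c] = r.get(c, 0) | (1 << pos)  # state `pos` loops on c
            i += 1
        i += 1
        pos += 1
    return b, r, pos

def match_end(text, pattern):
    """Return True if an occurrence of pattern ends at the last character of text."""
    b, r, m = build_tables(pattern)
    d = 0
    for c in text:
        # Eq. (4): D <- (((D << 1) | 1) & B[c]) | (D & R[c])
        d = (((d << 1) | 1) & b.get(c, 0)) | (d & r.get(c, 0))
    return bool(d & (1 << (m - 1)))

print(match_end("abbbc", "ab+c"))  # -> True
print(match_end("ac", "ab+c"))     # -> False
```

The self-loop term `D & R[c]` is what lets "ab+c" absorb any number of extra 'b's without consuming additional NFA states.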
Fig. 2. Overall system block diagram
3 Countable Pattern Matching Structure

3.1 Overall System Structure
Figure 2 shows the overall system architecture with parallel Regular Expression Matching Modules (REMMs). A host interface is required to exchange the detection rules, written as regular expressions, generated by the host CPU. The main control module controls each REMM by generating module enable and chain enable signals. Each REMM has a 32-bit length, so it can represent a regular expression with at most 32 NFA states. If there are more states than this limit, the remaining states are processed by the next REMM, indicated by the Chain Out signal. For example, if the number of NFA states of a regular expression is 120, the number of modules N required is (120 >> 5) + 1 = 4, and the last module uses only 24 bits (120 mod 32). Therefore, the host needs to create a transition table for multi-matching in pre-processing for modules where two regular expressions overlap. Each REMM reads the bit patterns stored in the FPGA's block memory and performs comparison operations. The matching signal of each module is encoded in the main control module and transmitted to the host. The main control module performs register and memory updates of the REMMs and controls the connection state of each module through the Chain Enable signal.
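The module-count arithmetic above amounts to a ceiling division by the 32-bit REMM width. A small sketch (the function names are ours; the paper's (120 >> 5) + 1 formula agrees for state counts that are not exact multiples of 32, while the ceiling division below also covers exact multiples without allocating a spare module):

```python
REMM_WIDTH = 32  # NFA states representable per matching module

def modules_needed(n_states):
    """Number of chained REMMs for a regular expression with n_states NFA states."""
    return (n_states + REMM_WIDTH - 1) // REMM_WIDTH

def bits_in_last_module(n_states):
    """How many state bits the final REMM in the chain actually uses."""
    rem = n_states % REMM_WIDTH
    return rem if rem else REMM_WIDTH

# The paper's example: 120 NFA states -> 4 modules, 24 bits in the last one.
print(modules_needed(120), bits_in_last_module(120))  # -> 4 24
```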
3.2 Regular Expression Matching Module
Figure 3 shows the structure of the regular expression matching module, which is similar to the original form of the Extended Shift-And structure proposed in [19]. To solve the unrolling issue, signals for the counter module and daisy chain signals were added. Chain In is the signal from the previous module and Chain Out is the signal that activates the next module. The En and Chain En signals of the main control module control activation of the next module. The Match Out signal passes the pattern matching result to the main control module. The STATE, MOVE, and REPEAT in Fig. 3 correspond to D, B, and R in Eqs. (3) and (4), respectively. EpsBEG, EpsEND, and EpsBLK represent I, F, and A, respectively. MOVE and REPEAT are block memories and use the input packet as an address; therefore, when a new packet arrives, the operation is performed in the combinational circuit within one clock. The REMM also includes one counter module for supporting constrained repetition expressions. The update module updates the contents of the block memory when the main control module receives a new regular expression from the host.

Fig. 3. Regular expression matching module
3.3 Counter Module
Figure 4 shows the structure of the counter module. The counter module has a Constrained-repetition Register (CR) indicating the position of the constrained
Fig. 4. Counter module