2 Two's Complement to CSD Recoding with Fault Localization Support Using Scan Based Design





A. Palchaudhuri and A. S. Dhar



Fig. 3. Look-ahead based configuration of CSD recoder: (a) LA scheme; (b) single LA block slice of CSD recoder



Fig. 4. Scan based pipelined, serial architecture of two’s complement to CSD recoding.



(TD = 0), the circuit performs the original two’s complement to CSD recoding. In the test mode (TD = 1), the FFs are stitched together into a shift register using a multiplexer based realization. In the normal mode of operation, c0 = SD = 0, whereas during the test mode, c0 accepts the serial data input SD, preferably from a finite state machine. This



FPGA Architectures for High Speed CSD Recoding






multiplexing arrangement is encapsulated within the same LUT that realizes the original CSD recoding circuitry, as shown in Fig. 4. The additional control input TD and the serial data input obtained from the previous FF output are fed into the configured LUTs through their vacant inputs. The first pipeline stage of the CSD recoding circuitry in Fig. 2 establishes its scan path via LUTs and the carry chain. To encapsulate the scan based logic into the dual output LUTs of the first pipeline stage, at least 40% of the LUT inputs must be unused, a criterion that our proposed original architecture comfortably satisfies. For the subsequent pipeline stage logic, the dual outputs of each LUT are registered. Thus, to establish the scan path, a test mode input and two serial data inputs need to be driven through the vacant LUT inputs (at least 60% of the inputs must be unused) to avoid logic overhead, a criterion that our proposed architecture also satisfies. A multi-scan path arrangement is also feasible for our design by replacing any of the serial data inputs to each LUT with an external serial data input SD. Such an arrangement reduces the test time and increases the granularity of fault localization (Fig. 5).
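The two operating modes described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: it models the FFs as a Python list and ignores the LUT and carry chain mapping entirely, and the names `clock` and `scan_out` are hypothetical, not taken from the paper.

```python
def clock(ffs, functional_next, td, sd):
    """Apply one clock edge to the chain of FFs.

    td = 0 (normal mode): every FF captures its functional value.
    td = 1 (test mode): the FFs behave as a shift register fed by sd.
    """
    if td == 0:
        return list(functional_next)
    return [sd] + ffs[:-1]


def scan_out(ffs):
    """Shift the whole chain out in test mode with SD held at 0.

    Returns the bits in the order in which they leave the last FF.
    """
    state = list(ffs)
    observed = []
    for _ in range(len(ffs)):
        observed.append(state[-1])
        state = clock(state, state, td=1, sd=0)
    return observed


# Capture a (hypothetical) functional response in normal mode, then scan it out.
captured = clock([0, 0, 0, 0], functional_next=[1, 0, 1, 1], td=0, sd=0)
stream = scan_out(captured)
```

In this model the FF nearest the scan output is observed first, so the scanned out stream is the captured state in reverse order.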



Fig. 5. Scan based pipelined, LA architecture of two’s complement to CSD recoding.






Look-Ahead Based Architecture. For the scan based LA architecture, the CSD recoder for the lower half word is identical to that shown in Fig. 4. The CSD recoder for the upper half word, which accepts its carry input from the fast LA generator, undergoes a slight modification in its LUT configuration, as the initial carry chain multiplexer data input is no longer directly controllable. Hence, the LUT in the CSD recoder for the upper half word acts as the input source of the new serial scan data input. Note that the scan based approach does not apply to the fast LA generator, as there are no FFs attached to it. Instead, the carry chain response may be verified by applying appropriate test vectors. Successive LUTs accept inputs $X_{4i:4i-4}$, $X_{4i-4:4i-8}$, $X_{4i-8:4i-12}$ and so on. Hence, every LUT shares one common input, e.g., $X_{4i-8}$, with the preceding LUT, and another common input, e.g., $X_{4i-4}$, with its successive LUT. Thus, tying the common inputs to logic zero (low), $c_{4i+4} = c_{4i}$ if $\{c_{4i}, x_{4i-1:4i-3}\} \in \{00XX, 010X, 1101, 111X\}$, and $c_{4i+4} = \bar{c}_{4i}$ if $\{c_{4i}, x_{4i-1:4i-3}\} \in \{011X, 10XX, 1100\}$. Similarly, tying the common inputs to logic one (high), $c_{4i+4} = c_{4i}$ if $\{c_{4i}, x_{4i-1:4i-3}\} \in \{000X, 0010, 101X, 11XX\}$, and $c_{4i+4} = \bar{c}_{4i}$ if $\{c_{4i}, x_{4i-1:4i-3}\} \in \{0011, 01XX, 100X\}$.
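The test vector sets listed above can be cross-checked by brute force. The sketch below assumes the standard CSD carry recurrence $c_{j+1} = \mathrm{maj}(c_j, x_j, x_{j+1})$, which this excerpt does not state explicitly, and it reads the first set of each pair as the patterns for which the block carry-out equals the carry-in and the second set as the patterns for which it is complemented. The function names and the way the block is indexed are illustrative assumptions.

```python
from itertools import product


def maj(a, b, c):
    """Three-input majority function."""
    return (a & b) | (b & c) | (a & c)


def block_carry_out(c_in, x_hi, x_mid, x_lo, tie):
    """Carry after one four-bit LA block when both shared inputs are tied to `tie`.

    Assumed recurrence: c_{j+1} = maj(c_j, x_j, x_{j+1}).
    """
    bits = [tie, x_lo, x_mid, x_hi, tie]  # shared input, three free inputs, shared input
    c = c_in
    for j in range(4):
        c = maj(c, bits[j], bits[j + 1])
    return c


def expand(patterns):
    """Expand don't-care (X) patterns into the set of fully specified words."""
    words = set()
    for pattern in patterns:
        choices = [("0", "1") if ch == "X" else (ch,) for ch in pattern]
        words.update("".join(w) for w in product(*choices))
    return words


def classify(tie):
    """Split all 16 words {c, x_hi, x_mid, x_lo} by whether the carry is kept or flipped."""
    keep, flip = set(), set()
    for c, x_hi, x_mid, x_lo in product((0, 1), repeat=4):
        word = f"{c}{x_hi}{x_mid}{x_lo}"
        target = keep if block_carry_out(c, x_hi, x_mid, x_lo, tie) == c else flip
        target.add(word)
    return keep, flip


# Sets quoted in the text, for shared inputs tied low (0) and high (1).
KEEP = {0: ["00XX", "010X", "1101", "111X"], 1: ["000X", "0010", "101X", "11XX"]}
FLIP = {0: ["011X", "10XX", "1100"], 1: ["0011", "01XX", "100X"]}

matches = all(classify(t) == (expand(KEEP[t]), expand(FLIP[t])) for t in (0, 1))
```

Under the assumed recurrence, `matches` evaluates to `True`, which supports reading the second condition of each pair as the complemented carry.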



4 Results and Discussions



A Virtex-7 FPGA (device XC7VX330T, package FFG1157, speed grade -2) was chosen as the Xilinx target platform using ISE 14.7, and post place and route results are reported. Our proposed two’s complement to CSD recoding architecture is compared with a previously proposed binary to CSD recoder [2] (Table 1), where it was mentioned that every proposed processing element fits into one slice of a Virtex-4 FPGA comprising two LUTs. The Virtex-4 platform is now several generations out of date, but the noteworthy point remains that the “bypass” signal [2] generation logic and its subsequent cascadable design are not amenable to carry chain based implementation. As a result, the circuit is mapped solely using LUTs and relies upon the relatively slower programmable routing fabric, instead of the fast, dedicated, hardwired routing fabric of the carry chain, to route the bypass signal, leading to a slow realization. Additionally, the circuit of [2] is not amenable to forward path pipelining in a manner similar to our proposed architectures. We registered the inputs and outputs of the circuit proposed in [2] and mapped it onto the Virtex-7 FPGA to estimate its delay; the results are tabulated in Table 1 and compared with our proposed serial and LA based designs. Both the serial and LA based designs outperform the previously proposed CSD encoder [2] in speed. To the best of our knowledge, there are no other FPGA amenable CSD recoders available in the literature. Our proposed serial and LA based CSD recoders exploit the carry chain fabric to achieve higher speed. The LA based design further accelerates the computation of the carry signal by a factor of four compared to the serial design for the lower half word, with a nominal logic overhead of 6.25%. For an $n$-digit output wordlength, the proposed serial design occupies $4n$ FFs, $2n$ LUTs and $n/2$ slices. Similarly,






Table 1. Implementation results for two’s complement to CSD converter

| Operand width | Design style | #FF | #LUT | #Slice | #Pipeline stages | Frequency (MHz)/Delay (ns) |
|---|---|---|---|---|---|---|
| 32 | Behavioral realization of [2] | a | 46 | 29 | Non-pipelined | 401.77/2.489 |
| 32 | Proposed serial design | 128 | 64 | 16 | 2 | 852.51/1.173 |
| 32 | Proposed LA design | 128 | 68 | 17 | 2 | 987.17/1.013 |
| 64 | Behavioral realization of [2] | a | 94 | 61 | Non-pipelined | 292.48/3.419 |
| 64 | Proposed serial design | 256 | 128 | 32 | 2 | 624.22/1.602 |
| 64 | Proposed LA design | 256 | 136 | 34 | 2 | 781.25/1.280 |
| 96 | Behavioral realization of [2] | a | 142 | 85 | Non-pipelined | 278.78/3.587 |
| 96 | Proposed serial design | 384 | 192 | 48 | 2 | 491.64/2.034 |
| 96 | Proposed LA design | 384 | 204 | 51 | 2 | 645.99/1.548 |
| 128 | Behavioral realization of [2] | a | 190 | 112 | Non-pipelined | 223.26/4.479 |
| 128 | Proposed serial design | 512 | 256 | 64 | 2 | 402.41/2.485 |
| 128 | Proposed LA design | 512 | 272 | 68 | 2 | 550.36/1.817 |
| 160 | Behavioral realization of [2] | a | 238 | 142 | Non-pipelined | 204.46/4.891 |
| 160 | Proposed serial design | 640 | 320 | 80 | 2 | 344.47/2.903 |
| 160 | Proposed LA design | 640 | 340 | 85 | 2 | 478.93/2.088 |

a The behavioral designs are essentially combinational circuits. The delay and the frequency of operation were obtained by inserting FFs at the primary input and output ports of the circuits.






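Table 2. Implementation results for scan based two’s complement to CSD converter

| Operand width | Design style | #FF | #LUT | #Slice | #Scan paths | Frequency (MHz)/Delay (ns) |
|---|---|---|---|---|---|---|
| 32 | Serial scan | 128 | 64 | 16 | 1 | 782.47/1.278 |
| 32 | LA scan | 128 | 68 | 17 | 2 | 931.10/1.074 |
| 64 | Serial scan | 256 | 128 | 32 | 1 | 585.82/1.707 |
| 64 | Serial scan | 256 | 128 | 32 | 2 | 585.82/1.707 |
| 64 | LA scan | 256 | 136 | 34 | 2 | 777.00/1.287 |
| 96 | Serial scan | 384 | 192 | 48 | 1 | 467.51/2.139 |
| 96 | Serial scan | 384 | 192 | 48 | 2 | 467.51/2.139 |
| 96 | Serial scan | 384 | 192 | 48 | 3 | 467.51/2.139 |
| 96 | LA scan | 384 | 204 | 51 | 2 | 665.78/1.502 |
| 128 | Serial scan | 512 | 256 | 64 | 1 | 386.10/2.590 |
| 128 | Serial scan | 512 | 256 | 64 | 2 | 386.10/2.590 |
| 128 | Serial scan | 512 | 256 | 64 | 4 | 386.10/2.590 |
| 128 | LA scan | 512 | 272 | 68 | 2 | 581.73/1.719 |
| 128 | LA scan | 512 | 272 | 68 | 4 | 581.73/1.719 |
| 160 | Serial scan | 640 | 320 | 80 | 1 | 313.48/3.190 |
| 160 | Serial scan | 640 | 320 | 80 | 2 | 313.48/3.190 |
| 160 | Serial scan | 640 | 320 | 80 | 4 | 313.48/3.190 |
| 160 | LA scan | 640 | 340 | 85 | 2 | 514.67/1.943 |
| 160 | LA scan | 640 | 340 | 85 | 4 | 514.67/1.943 |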

for an $n$-digit output wordlength, the proposed LA design occupies $4n$ FFs, $(2n + n/8)$ LUTs and $(n/2 + n/32)$ slices.
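As a quick consistency check, the resource expressions for the serial design ($4n$ FFs, $2n$ LUTs, $n/2$ slices) and the LA design ($4n$ FFs, $2n + n/8$ LUTs, $n/2 + n/32$ slices) can be evaluated for the operand widths reported in Table 1. The sketch below is illustrative only; the function names are hypothetical, and integer division is used on the assumption that $n$ is a multiple of 32, as it is for every width in the table.

```python
def serial_resources(n):
    """FF, LUT and slice counts of the serial design for an n-digit output."""
    return {"ff": 4 * n, "lut": 2 * n, "slice": n // 2}


def la_resources(n):
    """FF, LUT and slice counts of the LA design for an n-digit output."""
    return {"ff": 4 * n, "lut": 2 * n + n // 8, "slice": n // 2 + n // 32}


def lut_overhead_percent(n):
    """Extra LUTs of the LA design relative to the serial design, in percent."""
    extra = la_resources(n)["lut"] - serial_resources(n)["lut"]
    return 100.0 * extra / serial_resources(n)["lut"]


for width in (32, 64, 96, 128, 160):
    print(width, serial_resources(width), la_resources(width), lut_overhead_percent(width))
```

The LUT overhead is $(n/8)/(2n) = 6.25\%$ for every width, which agrees with the nominal logic overhead quoted in the text.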

Table 2 shows the implementation results for the scan based design of two’s

complement to CSD recoding for both the serial and LA based design. The scan

based architectures do not consume additional logic in comparison to the original

designs. However, the minor differences in delay between the original and the scan based architectures may be attributed to the routing of the additional signals needed to establish the scan path. The architectures with multi-scan path arrangements have performance metrics identical to those of their single scan path equivalents. The area consumption of the original designs and their equivalent scan based designs is also the same. The scan paths of the LA based

CSD recoder for the lower and the upper half word are kept separate, hence the

minimum number of scan paths in the LA design as tabulated in Table 2 is two.



5 Design Flow and Automation



All the architectures have been conceived following the art of primitive instantiation and the placement directives to map the primitives on designated slice






Fig. 6. Highlighted changes in the LUT instantiation templates to facilitate scan insertion into the original design



coordinates often dictated by the proximity of the bit indices or successive stages of logic. The structural regularity of the circuits aids design automation: the circuit descriptions are generated by simple C programs with computational complexity O(n), where n is the output digit size. Consider the LUT instantiation template for the configured LUTs in the second stage of pipelining shown in Fig. 6. The unused inputs of the configured LUT have been grounded by attaching logic zero (1'b0) to them. For the scan based design, the vacant inputs are used to drive the serial data inputs and the mode input, and the truth-table contents are altered accordingly for the new function, as shown in Fig. 6. The C programs for the original and scan based designs may be generated independently, or one may be generated from the other.
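To make the truth-table alteration concrete, the sketch below computes the 64-bit initialization word of a six-input LUT before and after a scan multiplexer is folded into it. It relies on the usual Xilinx convention that bit $k$ of the INIT word holds the output for the input combination whose binary value is $k$, with I0 as the least significant bit. Everything else is an assumption made for illustration: the logic function `original` is a stand-in rather than the recoder logic of the paper, and the choice of I4 for serial data and I5 for the test mode input is hypothetical.

```python
def lut_init(func, width=6):
    """Pack the truth table of `func` into an INIT word (bit k = output for address k)."""
    init = 0
    for addr in range(1 << width):
        inputs = [(addr >> k) & 1 for k in range(width)]  # inputs[k] drives pin Ik
        init |= (func(*inputs) & 1) << addr
    return init


def original(i0, i1, i2, i3, i4, i5):
    """Hypothetical stand-in logic that uses I0..I2 only; I3..I5 are vacant."""
    return i0 ^ (i1 & i2)


def with_scan(i0, i1, i2, i3, i4, i5):
    """Same LUT with the vacant pins reused: I4 = serial data, I5 = test mode (assumed)."""
    return i4 if i5 else original(i0, i1, i2, 0, 0, 0)


init_original = lut_init(original)
init_scan = lut_init(with_scan)
print(f"64'h{init_original:016X}")
print(f"64'h{init_scan:016X}")
```

The lower half of the scan version (test mode input low) reproduces the original function, while the upper half simply passes the serial data input through, so no additional LUT is consumed.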

For fault localization, the first faulty bit response (if any) in the bit-stream scanned out during the test mode is noted, where the FFs are pre-initialized with a known string in the normal mode. Suppose each FPGA slice generates $r$ output bits, the circuit is mapped in a columnar fashion starting from the slice located at $X_iY_j$, the total number of FFs in the scan chain is $x$, and the first faulty bit appears at position $p$ from the right of the scanned out bit-stream. Then the LUT or FF located at the $S = (\mathrm{mod}(x - p, r) + 1)$-th position in the slice situated at the $X_iY_{j-1+\lceil (x-p)/r \rceil}$ coordinate may be the faulty candidate. For example, consider “101001010110001010101110”, a scanned out bit-stream from a single scan path, with the “1” at position $p = 15$ from the right as the first faulty response. If the circuit is mapped starting from the $X_iY_j$ coordinate in a columnar fashion, with the rightmost bit (0) of the above string being the first




scanned out element, and the total number of FFs in the scan chain is $x = 24$, the LUT or FF located at position $S = \mathrm{mod}(24 - 15, 4) + 1 = 2$ may be deemed faulty. Thus, the second LUT or the second FF within the slice located at the coordinate $X_iY_{j-1+\lceil (x-p)/r \rceil} = X_iY_{j+2}$ may be the faulty candidate. Such defective slices may be bypassed by providing Xilinx proprietary PROHIBIT constraints, and a new set of slice coordinates for the circuit may be generated by re-running the automation machinery, which accepts the CSD output digit size and the initial slice coordinates for mapping the circuit as its inputs. This practice enables real-time detection of any newly generated FPGA defects without any expensive test-bed set up or expertise in testing. A multi-scan path arrangement in such cases increases the granularity of fault localization, as it searches for the faulty logic element within a smaller radius of neighbourhood.
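The localization rule can be written as a short helper and checked against the worked example. This is a sketch under one explicit assumption: the bracket lost from the slice index in the extracted text is taken to be a ceiling, since that reading reproduces the $X_iY_{j+2}$ result of the example. The function name and return convention are hypothetical.

```python
def locate_fault(x, p, r):
    """Return (S, y_offset) for a scan chain of x FFs with r output bits per slice.

    p is the position, counted from the right, of the first faulty bit.
    The suspect is the S-th LUT or FF of the slice at X_i Y_{j + y_offset}.
    """
    k = x - p
    s = k % r + 1
    ceil_k_over_r = -(-k // r)  # integer ceiling of k / r (assumed bracket)
    return s, -1 + ceil_k_over_r


stream = "101001010110001010101110"
p = 15
faulty_bit = stream[-p]  # bit at position p counted from the right
position, y_offset = locate_fault(x=len(stream), p=p, r=4)
```

For the example stream this gives position 2 and an offset of 2 rows from the starting slice, in line with the second LUT or FF of slice $X_iY_{j+2}$ named in the text.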

The scan path cannot solely decide upon the fault coverage. The designer

shall now formulate the test vectors to achieve the desired percentage of fault

coverage. We have only provided for the hardware support in the form of designing the scan paths without logic overhead, and spelt out the control bit specifications for run-time reconfiguration to test mode. However, we have performed

post-route simulations by emulating certain faults in the FPGA and have been

able to localize the area from which the fault has emanated.



6 Conclusion



In this paper, we have proposed high speed, compact CSD recoding circuits on FPGAs by properly exploiting the carry chain fabric. The proposed bit-sliced circuits are amenable to forward path pipelining and comfortably outperform another state-of-the-art circuit proposed for FPGAs in [2] with respect to speed. The feasibility of automating the circuit descriptions makes them commercially attractive. The generated placement directives ensure a compact architecture and maximum resource utilization in every configured slice. The circuit descriptions are backward compatible for realization on Virtex-6 FPGAs and scalable to other 6 and 7 series FPGAs.



References

1. Parhi, K.K.: VLSI Digital Signal Processing Systems: Design and Implementation.

Wiley India Pvt. Limited, Delhi (2007)

2. Faust, M., Gustafsson, O., Chang, C.-H.: Fast and VLSI efficient binary-to-CSD

encoder using bypass signal. Electron. Lett. 47(1), 18–20 (2011)

3. Ruiz, G.A., Granda, M.: Efficient canonic signed digit recoding. Microelectron. J.

42(9), 1090–1097 (2011)

4. Herrfeld, A., Hentschke, S.: Look-ahead circuit for CSD-code carry determination.

Electron. Lett. 31(6), 434–435 (1995)

5. Koç, Ç.K.: Parallel canonical recoding. Electron. Lett. 32(22), 2063–2065 (1996)






6. He, Y., Ma, B., Li, J., Zhen, S., Luo, P., Li, Q.: A fast and energy efficient binary-to-pseudo CSD converter. In: 2015 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 838–841 (2015)

7. Tanaka, Y.: Efficient signed-digit-to-canonical-signed-digit recoding circuits.

Microelectron. J. 57, 21–25 (2016)

8. Modi, H., Athanas, P.: In-system testing of Xilinx 7-series FPGAs: part 1-logic.

In: IEEE International Conference for Military Communications (MILCOM), pp.

477–482 (2015)

9. Naouss, M., Marc, F.: Modelling delay degradation due to NBTI in FPGA lookup tables. In: 26th International Conference on Field Programmable Logic and

Applications (FPL), pp. 1–4 (2016)

10. Naouss, M., Marc, F.: FPGA LUT delay degradation due to HCI: experiment and

simulation result. Microelectron. Reliab. 64(C), 31–35 (2016)

11. Gupte, A., Vyas, S., Jones, P.H.: A fault-aware toolchain approach for FPGA fault

tolerance. ACM Trans. Design Autom. Electron. Syst. 20(2), 32:1–32:22 (2015)

12. Kyriakoulakos, K., Pnevmatikatos, D.: A novel SRAM-based FPGA architecture

for efficient TMR fault tolerance support. In: 19th International Conference on

Field Programmable Logic and Applications (FPL), pp. 193–198 (2009)

13. Nazar, G.L., Carro, L.: Fast error detection through efficient use of hardwired

resources in FPGAs. In: 17th IEEE European Test Symposium (ETS), pp. 1–6

(2012)

14. Palchaudhuri, A., Dhar, A.S.: Efficient implementation of scan register insertion

on integer arithmetic cores for FPGAs. In: 29th International Conference on VLSI

Design, pp. 433–438 (2016)

15. Basha, B.C., Pillement, S., Piestrak, S.J.: Fault-aware configurable logic block for reliable reconfigurable FPGAs. In: IEEE International Symposium on Circuits and Systems (ISCAS), pp. 2732–2735 (2015)

16. Wheeler, T., Graham, P., Nelson, B., Hutchings, B.: Using design-level scan to

improve FPGA design observability and controllability for functional verification. In: Brebner, G., Woods, R. (eds.) FPL 2001. LNCS, vol. 2147, pp. 483–492.

Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-44687-7_50

17. Ehliar, A.: Optimizing Xilinx designs through primitive instantiation. In: Proceedings of the 7th FPGAworld Conference, FPGAworld 2010, pp. 20–27. ACM, New

York (2010)

18. Palchaudhuri, A., Chakraborty, R.S.: High Performance Integer Arithmetic Circuit

Design on FPGA: Architecture Implementation and Design Automation. Springer

India, New Delhi (2016). https://doi.org/10.1007/978-81-322-2520-1

19. Xilinx Inc.: 7 Series FPGAs Configurable Logic Block User Guide UG474 (v1.8)

27 Sep 2016. https://www.xilinx.com/support/documentation/user_guides/ug474_7Series_CLB.pdf

20. Hwang, K.: Computer Arithmetic: Principles Architecture and Design. Wiley,

Hoboken (1979)

21. Zicari, P., Perri, S.: A fast carry chain adder for Virtex-5 FPGAs. In: 15th IEEE

Mediterranean Electrotechnical Conference (MELECON), pp. 304–308 (2010)

22. Källström, P., Gustafsson, O.: Fast and area efficient adder for wide data in recent Xilinx FPGAs. In: 26th International Conference on Field Programmable Logic and Applications (FPL), pp. 1–4 (2016)

23. Wheeler, T.B.: Improving design observability and controllability for functional

verification of FPGA-based circuits using design-level scan techniques. Master’s

thesis. Brigham Young University (2001)






24. Wheeler, T., Graham, P., Nelson, B., Hutchings, B.: Using design-level scan to

improve FPGA design observability and controllability for functional verification. In: Brebner, G., Woods, R. (eds.) FPL 2001. LNCS, vol. 2147, pp. 483–492.

Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-44687-7_50

25. Toutounchi, S., Lai, A.: FPGA test and coverage. In: International Test Conference,

pp. 599–607 (2002)

26. Palchaudhuri, A., Amresh, A.A., Dhar, A.S.: Efficient automated implementation

of testable cellular automata based pseudorandom generator circuits on FPGAs.

J. Cell. Autom. 12(3–4), 217–247 (2017)



Exploring Functional Acceleration of OpenCL on FPGAs and GPUs Through Platform-Independent Optimizations

Umar Ibrahim Minhas, Roger Woods, and George Karakonstantis

Queens University Belfast, Belfast, UK

u.minhas@qub.ac.uk



Abstract. OpenCL has been proposed as a means of accelerating functional computation using FPGA and GPU accelerators. Although it provides ease of programmability and code portability, questions remain

about the performance portability and underlying vendor’s compiler

capabilities to generate efficient implementations without user-defined,

platform specific optimizations. In this work, we systematically evaluate

this by formalizing a design space exploration strategy using platform-independent micro-architectural and application-specific optimizations

only. The optimizations are then applied across Altera FPGA, NVIDIA

GPU and ARM Mali GPU platforms for three computing examples,

namely matrix-matrix multiplication, binomial-tree option pricing and

3-dimensional finite difference time domain. Our strategy enables a fair

comparison across platforms in terms of throughput and energy efficiency

by using the same design effort. Our results indicate that FPGA provides better performance portability in terms of achieved percentage of

device’s peak performance (68%) compared to NVIDIA GPU (20%) and

also achieves better energy efficiency (up to 1.4×) for some of the considered cases without requiring in-depth hardware design expertise.



1 Introduction



The rapidly increasing use of heterogeneous accelerators such as Graphic Processing Unit (GPU) and Field Programmable Gate Array (FPGA) in data centres necessitates the adoption of a unified programming environment that also

maintains better throughput and energy efficiency [1]. This is hard to achieve,

however, due to widely varied architectures and technologies, which have been

traditionally programmed via specialized languages e.g. VHDL for FPGAs and

CUDA for NVIDIA GPUs, using detailed knowledge of underlying hardware. In

addition to programming inefficiency, this hinders fair comparison of achieved

performance and design cost across the different accelerating technologies.

To address this challenge, the Open Computing Language (OpenCL) [2] has

been introduced as a C-based platform-independent language, to allow parallelism to be expressed explicitly regardless of the underlying hardware. OpenCL

is now supported by a range of programmable accelerators including GPUs and

c Springer International Publishing AG, part of Springer Nature 2018

N. Voros et al. (Eds.): ARC 2018, LNCS 10824, pp. 551–563, 2018.

https://doi.org/10.1007/978-3-319-78890-6_44






U. I. Minhas et al.



FPGAs. However, OpenCL only provides functional portability and the application implementation needs to be optimized by the underlying accelerator vendor

compilers. Under such a reality, the question remains about performance portability of OpenCL applications on various accelerators. That is, how much performance an application can achieve across various platforms and how to gauge

the efficiency of the vendor-specific compilers to map OpenCL source code to

the targeted device with minimum or even no user-defined platform-specific optimizations. Also, it is questionable if FPGAs still require more in-depth knowledge

of underlying hardware compared to other technologies.

Achieving performance portability and fair evaluation is becoming extremely

important with increased usage of accelerators in data centres and cloud environments [3]. Researchers have approached these challenges from two angles. On one

hand, some studies compare programming languages such as a hardware description language (HDL) and Compute Unified Device Architecture (CUDA) with

OpenCL on the same platform, e.g. FPGAs [4] and GPUs [5]. On the other hand

there is portability evaluation of the same language e.g. OpenCL across multiple

platforms i.e. NVIDIA GPU, AMD GPU, Intel CPU and Sony/Toshiba/IBM

Cell Broadband Engine [6]. These works conclude that although platform independent language can lead to better portability, additional effort is required for

tuning kernels to each device to achieve comparable performance.

An architectural and programming model study on fractal video compression involving optimization of OpenCL on FPGA has been presented in [7] and

provides a series of FPGA-based optimizations on FPGA before comparing the

results with CPU and GPU for an optimized kernel. In [8], six benchmarks of

the Rodinia suite are evaluated using OpenCL and FPGA-specific optimizations are performed on kernels optimized for GPU-like devices, achieving up to

3.4x better energy efficiency compared to GPUs. However, this work requires

platform-specific optimizations, which partially nullifies the motivation behind

a software-based approach via a unified programming environment. In addition,

they compare the output with already-optimized implementations on other platforms and do not discuss performance portability.

In this paper, we develop and apply a systematic approach to gauging performance portability and fair evaluation. We apply a set of uniform micro-architectural optimizations for fair porting, optimization and evaluation of applications across platforms using OpenCL. The optimizations are based on carefully

selected common micro-architectural features that can be easily parametrized

via the OpenCL model. Initially, we take C source code of kernels for 3 accelerated computing applications, namely matrix-matrix multiplication (SGEMM),

Binomial-tree Option Pricing (BOP) and 3 dimensional Finite Difference Time

Domain (FDTD) and port them to OpenCL as base kernels before applying the

platform-independent optimizations.

We then evaluate these optimizations on three state-of-the-art platforms, namely

Altera FPGA, a high performance NVIDIA GPU and a low power ARM Mali

GPU. In doing so, we analyse the underlying compilers’ job in generating an

optimized implementation. We also compare the achieved performance to the


