6 Performance/Energy Relative to GPU




B. Liebig et al.

Table 6. Single FPGA kernel vs GK210 GPU

Test case   Latency per cell [s]  Throughput [cells/s]  Power [W]  Energy per cell [J]
A on FPGA   1.2                   0.81                  3.6        4.48
A on GPU    322.3                 12.1                  138.4      11.46
B on FPGA   1.0                   1.05                  3.0        2.89
B on GPU    91.3                  38.69                 131.6      3.41
C on FPGA   0.7                   1.35                  3.0        2.21
C on GPU    23.9                  34.60                 132.5      3.83
D on FPGA   7.9                   1.27                  3.1        2.44
D on GPU    72.0                  39.64                 145.7      3.67
E on FPGA   7.8                   1.28                  3.2        2.47
E on GPU    75.3                  42.98                 147.0      3.42

expected for a throughput-architecture such as a GPU). However, the FPGA is

more energy efficient (in terms of Joules per cell).
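The energy figures in Table 6 appear to follow from dividing power by throughput. The short Python sketch below assumes that relationship, which the text does not state explicitly. It recomputes energy per cell from the power and throughput columns and compares the result with the reported column. Small differences are expected because the table entries are rounded.

```python
# Values copied from Table 6:
# (power in W, throughput in cells/s, reported energy per cell in J)
TABLE6 = {
    "A on FPGA": (3.6, 0.81, 4.48),
    "A on GPU": (138.4, 12.1, 11.46),
    "B on FPGA": (3.0, 1.05, 2.89),
    "B on GPU": (131.6, 38.69, 3.41),
    "C on FPGA": (3.0, 1.35, 2.21),
    "C on GPU": (132.5, 34.60, 3.83),
    "D on FPGA": (3.1, 1.27, 2.44),
    "D on GPU": (145.7, 39.64, 3.67),
    "E on FPGA": (3.2, 1.28, 2.47),
    "E on GPU": (147.0, 42.98, 3.42),
}


def energy_per_cell(power_w: float, cells_per_s: float) -> float:
    """Assumed relationship: J per cell = W / (cells per second)."""
    return power_w / cells_per_s


if __name__ == "__main__":
    for name, (power, throughput, reported) in TABLE6.items():
        computed = energy_per_cell(power, throughput)
        print(f"{name}: computed {computed:.2f} J, reported {reported:.2f} J")
```

Under this reading, the GPU's higher throughput does not offset its much higher power draw, which is why the FPGA comes out ahead in Joules per cell for every test case.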



5 Conclusion and Future Work



We presented a new approach for hardware synthesis of larger CellML models

that offers superior latency and energy efficiency compared to CPU and GPU.

Furthermore, our specialized HLS tool significantly exceeds the quality-of-results

of a state-of-the-art industrial HLS system. The performance and size of the

accelerators created by our approach can be flexibly scaled, achieving significant

speed-ups in most cases even when dedicating just a quarter of a mid-size FPGA

to the accelerator circuit.

To extrapolate the potential of our approach beyond the Virtex 7-class devices,
which were introduced in 2010, to current-generation FPGAs, we performed
an initial experiment compiling and mapping model A to a modern XCVU13P-3
UltraScale+ FPGA. We used a total of 16 FP units and achieved an fmax of 316
MHz, which would yield a speed-up of 8.3x relative to the desktop-class CPU
in single-accelerator performance. As each accelerator requires only 2.9% of that
FPGA’s area, an additional speed-up could be achieved by tiling accelerators; for
example, 8 accelerators implemented in parallel still reach 282 MHz. This potential
makes further research on reconfigurable computing for cell simulation highly
promising.
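As a rough illustration of the tiling argument, the following Python sketch estimates an upper bound on the aggregate speed-up. It assumes, purely for illustration, that single-accelerator performance scales linearly with clock frequency and that tiled accelerators process independent cells without contending for memory or I/O. Neither assumption is verified in the text, and the function names are hypothetical.

```python
def tiled_speedup(single_speedup: float, single_fmax_mhz: float,
                  tiled_fmax_mhz: float, num_accelerators: int) -> float:
    """Idealized aggregate speed-up of several tiled accelerators.

    Assumption: per-accelerator speed-up scales linearly with fmax and the
    accelerators scale perfectly in parallel (an upper bound, not a result).
    """
    per_accelerator = single_speedup * tiled_fmax_mhz / single_fmax_mhz
    return per_accelerator * num_accelerators


def tiled_area_percent(area_percent_each: float, num_accelerators: int) -> float:
    """Total FPGA area share if every accelerator needs the same area."""
    return area_percent_each * num_accelerators


if __name__ == "__main__":
    # Figures from the text: 8.3x at 316 MHz, 8 tiles at 282 MHz, 2.9% area each.
    bound = tiled_speedup(8.3, 316.0, 282.0, 8)
    area = tiled_area_percent(2.9, 8)
    print(f"idealized aggregate speed-up: {bound:.1f}x using {area:.1f}% of the FPGA")
```

Any real design would fall below this bound once memory bandwidth and host communication are taken into account.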



Improved HLS for Complex CellML Models






An Intrusive Dynamic Reconfigurable Cycle-Accurate Debugging System for Embedded Processors

Habib ul Hasan Khan, Ahmed Kamal, and Diana Goehringer

Technische Universitaet Dresden (TUD), Dresden, Germany
{habib.khan,ahmed.kamal,diana.goehringer}@tu-dresden.de



Abstract. This paper presents a dynamically partially reconfigurable debugging
system for embedded processors based upon a device start and stop (DSAS)
approach [1]. Using this approach, a cycle-accurate debugging system can be
attached at runtime to any embedded processor-based design. The debugging
system offers lossless debugging because the design is stopped during data
transfer, so no data is lost. The data can be transferred over any available
communication interface, such as Ethernet or UART, and can be viewed with
open-source waveform viewers. Because the debugging system is loaded through
dynamic partial reconfiguration, the design does not need to be re-synthesized.

Keywords: FPGA · Debugging · Simulation · Device start and stop · DSAS · Device under test · Dynamic partial reconfiguration



1 Introduction

Debugging current embedded designs is becoming cumbersome because of increasing
design complexity. Verification is reported to take 35 to 45% of the total
development effort [2], and this fraction is likely to grow. Moreover, these
studies reveal that debugging constitutes about 60% of the total verification
effort, largely because FPGA-based designs offer limited visibility of internal
signals.

On-chip visibility can normally be enhanced by instrumenting the design [3]
before implementation. These instruments, called integrated logic analyzers (ILAs),
save a predetermined window of a subset of the signal data into memory blocks for
offline analysis. However, because of resource limitations, the signals have to be
selected before compilation. Hence a new set of signals can only be observed after
the circuit has been recompiled, a process that can take hours [4]. Moreover, such
trace-based solutions operate mainly on the design before place and route (PAR).
These tools instrument the original user circuit with trace buffers and their
connections before mapping, leaving fewer resources for the original design.
Inserting debug circuitry can also alter the place and route of the design, with
adverse effects: the design may no longer fit in the FPGA device, or timing issues
may arise because of the debugging circuitry.

© Springer International Publishing AG, part of Springer Nature 2018

N. Voros et al. (Eds.): ARC 2018, LNCS 10824, pp. 433–445, 2018.

https://doi.org/10.1007/978-3-319-78890-6_35






H. H. Khan et al.



With the advent of Dynamic Partial Reconfiguration (DPR) [5], the time-consuming
recompilation step can be avoided. Since reconfiguring an embedded design is
very fast compared to recompiling it (tens of milliseconds versus minutes to hours),
DPR can speed up the debug cycle.

This paper presents a DPR-based debugging system for embedded processors using
a device start and stop (DSAS) approach. In this methodology, the debugging system
resides in the dynamic partition and is configured at runtime. The debugging system
then starts and stops the Device Under Test (DUT), which is the static design, and
saves the data to external memory, providing a continuous, lossless stream of data
with no limit on the debug window. Moreover, because the debugging system is
attached to the design under test through DPR at runtime, recompilation of the
design is no longer required. Furthermore, the debugging data stored on the external
device can be viewed with open-source waveform viewers such as GTKWave.

The rest of the paper is organized as follows. Section 2 presents related work and
provides background information. Section 3 discusses the design methodology of the
proposed design. Section 4 discusses the results, and Section 5 concludes the paper.



2 Related Work

The commercial signal capture tools offered by the two major FPGA vendors, Xilinx’s
ChipScope Pro and Altera’s SignalTap II, are based upon embedded logic analyzer IP
which is instantiated into the user circuit during regular compilation. Synopsys
offers a device-neutral product, Identify, with similar functionality. It is possible
to modify the trigger conditions at runtime, but not the signal sets. Hence changing
the signals under observation requires FPGA recompilation. Furthermore, instrumentation
is normally done after a failure is observed, requiring another iteration of the
development process. Another tool, Certus by Mentor, allows pre-instrumentation of a
large set of interesting signals in the FPGA prior to compilation. During debugging,
a small subset of these signals can then be selected for observation. This gives
designers more runtime flexibility than the other tools, but it still requires a set
of signals to be preselected for observation before any information about possible
bugs is available.

A design-level scan was proposed in [6] to connect memory elements such as
flip-flops (FFs) and embedded RAMs in sequence by using the FPGA resources.
However, the main drawback of the technique is its high area overhead, because FPGA
resources are used to implement the scan chains in the design. In [7], the authors
proposed to insert trace buffers into the design in advance and then to perform
low-level bitstream modification using incremental techniques for connecting the
trace buffers to the desired signals. However, this technique still requires
pre-reservation of FPGA resources, making them unavailable to the original design.
Furthermore, once the debugging process is complete, the trace buffers need to be
removed, which may alter the place and route of the design.

A virtual overlay network was introduced in [8] which multiplexes the signals into
trace buffers instantiated in the free FPGA resources to avoid unnecessary re-spins.
However, this technique requires spare resources, which are not always available.
A framework called Dynamic Modular Development (DMD) [9] used the Xilinx partial
reconfiguration flow to accelerate the embedded design process by partitioning the
design modules into separate partially reconfigurable regions and automatically
merging modules which no longer need to be modified into the surrounding static
region. Consequently, rapid turnaround times can be achieved by placing frequently
modified modules in separate partially reconfigurable regions [10].

A bitstream modification technique was presented in [11] which allows the bitstream
to be modified after the PAR process. The embedded logic analyzer is instantiated in
the design prior to netlisting. The signals of interest can be connected to the
embedded design by changing the partial bitstream, reducing the time spent in the
PAR process. However, when the set of signals for tracing is changed, re-routing
needs to be performed, which can significantly affect the design’s time to market.
Furthermore, the logic analyzer needs to be removed from the design after validation,
which can affect the behavior of the validated design. Software-like debug features
such as watchpoints and breakpoints, intended to enhance debug capability in
reconfigurable platforms, were presented in [12]. However, changing the watchpoints
or breakpoints required recompilation of the design.

A methodology based upon the reconfigurability of FPGAs was proposed in [13]
which permits monitoring a large number of internal signals for an arbitrary number
of clock cycles using only a limited number of external pins, eliminating the need
for repeated iterations of the re-synthesis, placement and routing processes. A
multiplexer (MUX) is instantiated in the design, with all the signals that may need
to be traced as MUX inputs. Different signals can then be selected by reconfiguring
the bitstream for the select signals of the MUX. The main disadvantage of this
methodology is that the contents of the registers need to be shifted within one clock
cycle, which greatly affects the maximum frequency (Fmax) of the design.

A design-for-debug infrastructure, namely a distributed reconfigurable fabric, was
proposed in [14] whose components can be distributed widely in the FPGA and can
debug a large number of signals. The reconfigurable logic is programmed to implement
various debug paradigms, such as assertions, signal capture and what-if analysis,
which can accelerate the debugging process. However, the design still needs to be
synthesized and implemented after placement of the debug architecture, which also
needs significant hardware resources. A programmable logic core (PLC) based debugging
system [15] comprising an access network was introduced, in which the PLC controls
the access network to select the signals to be debugged.

In some intrusive debugging works [16, 17], the clock of the embedded design was
controlled to obtain debugging data. However, these works required breakpoints to
stop the clock, and hence only the system state very close to the breakpoint could
be monitored. An intrusive debug approach based upon stopping the clock by monitoring
the occupancy of trace buffers was proposed in [18]. However, the approach needs a
lot of scarce FPGA resources (1 MB RAM) and emulation hardware, and also requires
external intervention for data handling. In our previous work [1], we introduced a
debugging solution which requires only 4 KB of RAM for saving the data, so even small
FPGAs can be equipped with the debugging system, and which automates the data saving
process. However, the debugging system has to be instantiated before the synthesis
and PAR processes, which in some cases require a lot of time.






From the above discussion, it is evident that clock management in response to
memory occupancy can be used to get a continuous, cycle-accurate stream of debugging
data. This methodology can be augmented with DPR to save the time spent
on the iterative process of synthesis and PAR.
The main contributions of the work are:
• A device start and stop approach, combined with an access network, for complete
debugging of microprocessors.
• Use of DPR to attach our debugging system to the embedded design as a
reconfigurable module on demand, reducing the time spent on the iterative
PAR process of traditional debugging solutions.



3 Debugging Methodology

In this section, we describe our methodology for a dynamically partially
reconfigurable debugging system for embedded processors based upon a device start
and stop (DSAS) approach. In this methodology, the device under test (DUT) is the
static partition and the debugging system is configured as the dynamically
reconfigurable part. The embedded processor can keep performing its task without the
debugging system when debugging is not required. Once required, the debugging system
is dynamically configured into the design using a partial bitstream. It then clocks
the DUT on the static partition and logs data to the trace buffers. Once the trace
buffers are full, the debugging system stops the clock so that no data is lost, saves
the data to external memory, and then starts clocking the DUT again. This provides a
continuous, lossless stream of data with an effectively unlimited debug window.
Moreover, since the debugging system is attached to the DUT through a partial
bitstream, re-implementation of the design is not required. Furthermore, the debugging
data can be sent to a terminal using a UART or Ethernet interface and saved in a log
file on the external device, where it can be inspected with open-source waveform
viewers such as GTKWave. A block diagram of the debugging methodology is shown in Fig. 1.
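The start and stop loop described above can be mimicked in software. The following Python sketch is only a behavioral model under simplifying assumptions: one integer stands for the traced signals of one DUT clock cycle, and the buffer depth and the function name `dsas_capture` are hypothetical. Because the generator is lazy, the modelled DUT does not advance while a full buffer is being handed to the consumer, which mirrors stopping the clock during the transfer.

```python
from typing import Iterable, Iterator, List


def dsas_capture(dut_samples: Iterable[int], buffer_depth: int = 4) -> Iterator[List[int]]:
    """Behavioral model of the device start and stop (DSAS) loop.

    The DUT only advances while the trace buffer has room. When the buffer
    is full, the clock is stopped, the buffer is drained to the host, and
    the clock is started again.
    """
    buffer: List[int] = []
    for sample in dut_samples:           # one iteration == one DUT clock cycle
        buffer.append(sample)            # log this cycle into the trace buffer
        if len(buffer) == buffer_depth:  # buffer full: the DUT clock is stopped
            yield buffer                 # transfer to external memory / host
            buffer = []                  # buffer empty: the DUT clock restarts
    if buffer:                           # flush a partially filled buffer
        yield buffer


if __name__ == "__main__":
    trace = list(range(10))              # stand-in for ten cycles of DUT activity
    received = [s for chunk in dsas_capture(trace, buffer_depth=4) for s in chunk]
    print("lossless:", received == trace)
```

The point of the model is that the debug window is bounded by the host's storage rather than by the buffer depth, since every cycle is eventually delivered in order.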

Fig. 1. Debugging methodology (block diagram: embedded processor, access network with selection register, trace buffers, clock management, processor control and terminal; the controller selects 16 of possibly thousands of signals, and data is transferred by either Ethernet or UART)

The main benefits of the proposed technique are lossless debugging of embedded
processors, re-utilization of the same FPGA resources for other applications thanks
to DPR, no requirement for any specific data acquisition interface (even a UART can
be used), and no requirement for an external emulation system.
Furthermore, open-source waveform viewers are used, removing the dependency on
proprietary software. Hence, a cost-effective solution is presented.

3.1 Device Under Test (DUT)



The debugging solution is generic and can be used for any embedded design. However,
the methodology is ideally suited for complex embedded microprocessors, where it is
difficult to identify bugs in the absence of a continuous stream of lossless data.
The methodology has been validated by using two different embedded microprocessors
as DUT. The embedded processor is treated as a black box and all the interfaces
originating from the processor are monitored continuously, providing a complete
picture of the embedded processor's activities. The details of the two processors
are described below.

3.1.1. The first embedded processor is the Xilinx MicroBlaze [19]. MicroBlaze is
debugged by connecting its interfaces to the debugging system. AXI interconnects can
also be connected. MicroBlaze is already equipped with a special debugging port
(trace port) which can provide debugging data, including the status of the internal
registers.
The proposed debugging system can be connected to the trace port for a continuous
stream of data without any loss. The trace port also provides access to some internal
registers which are not available on other MicroBlaze interfaces. A debugging
solution for MicroBlaze through the trace port was provided by Lauterbach [20], but
it requires external hardware to be connected to the trace port and hence incurs
extra cost. With our proposed debugging system, any interface (not limited to the
trace port) can be debugged without extra cost.

3.1.2. We have chosen an embedded processor based upon the RISC-V architecture as
the second DUT to highlight the usability of our proposed debugging system. The
microprocessor (ORCA, developed by VectorBlox) [21] is an open-source core based upon
the RV32IM architecture. Software can be compiled with the available RISC-V
cross-compiler toolchain. The core was chosen because of its low hardware
utilization, which makes it suitable for small FPGAs [22]. However, the core does not
have a debugging solution and hence is hard to debug. The proposed debugging system
can be used for complete debugging of the core.

We used the black box approach for debugging the microprocessor. Hence all
the exposed interfaces (including AXI interfaces) are connected to the debugging
system for monitoring. The microprocessor fetches instructions from memory, decodes
them and then executes them. The execution of an instruction results either in data
being saved to memory or in data being used for processing the next instruction. In
the first case, the data can be acquired by the debugging system as it is written to
memory. In the second case, the data will be saved to memory after further
processing. Since our methodology provides a continuous lossless stream of data,
monitoring the interfaces is sufficient for debugging the microprocessor. The
internal registers can also be debugged by making them visible to the debugging
system. One important feature of the processor is that it can be stalled by
writing to a Control and Status Register (CSR, 0x800).

3.2 Clock Management



When the trace buffer is full, the microprocessor needs to be halted so that the
data can be sent to the terminal without data loss. Halting the microprocessor is
necessary for debugging it at runtime because the data communication is not fast
enough to keep up with the traced data and guarantee its completeness. As already
mentioned, the processor can be stalled by writing to a specific register; however,
we chose to halt the processor by managing its clock instead. For this purpose, a
custom clock manager was developed which stops the clock of the embedded
processor once the connected trace buffers are full.
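A back-of-the-envelope calculation shows why the link cannot keep up and the DUT has to be halted. The Python sketch below uses illustrative numbers that are not taken from the paper: a 100 MHz DUT clock, a 115200 baud UART and a 1 Gbit/s Ethernet link, with protocol overhead ignored. It also assumes that filling and draining the buffer never overlap, as in the start and stop scheme.

```python
def trace_rate_bits_per_s(num_signals: int, dut_clock_hz: float) -> float:
    """Bits produced per second if every signal is sampled on every DUT cycle."""
    return num_signals * dut_clock_hz


def dut_duty_cycle(link_bits_per_s: float, num_signals: int, dut_clock_hz: float) -> float:
    """Fraction of wall-clock time during which the DUT is clocked.

    Filling a buffer of B bits takes B / produced seconds and draining it
    takes B / link seconds, so the running fraction is
    link / (link + produced), independent of the buffer size.
    """
    produced = trace_rate_bits_per_s(num_signals, dut_clock_hz)
    return link_bits_per_s / (link_bits_per_s + produced)


if __name__ == "__main__":
    signals, clock_hz = 16, 100e6        # 16 traced signals; the clock is assumed
    for name, link in (("UART", 115_200.0), ("Ethernet", 1e9)):
        share = dut_duty_cycle(link, signals, clock_hz)
        print(f"{name}: DUT runs {share:.4%} of the time")
```

With these assumed figures the DUT produces far more data than either link can carry, so lossless capture is only possible if the DUT waits while the buffer is drained.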

However, another solution is available for Xilinx FPGAs. The power-down pin of the
Xilinx Clocking Wizard IP [23] can also be used for stopping the clock. Xilinx
provides the power-down function for power gating, but the same function can be used
for debugging without the need to develop any custom IP. However, if the design
contains any logic which resets in the absence of a clock, that logic needs to be
removed. Otherwise, it will not be possible to get a continuous stream of debugging
data from the embedded processor (Fig. 2).



Fig. 2. Clock management (blocks and signals: clocking wizard, device main clock, clock manager, DSAS controlled clock, DUT, buffer full signal)



3.3 Concentration Network



In order to keep resource utilization low, our proposed debugging system has been
configured to debug 16 signals simultaneously. However, an embedded design may
contain a large number of nodes that need to be debugged. To provide for connecting
a large number of nodes, a concentration network can be used. A concentration network
has more inputs than outputs. The controlling processor can select any output set
from the input nodes of the concentration network simply by writing the corresponding
parameter to a selection register, without the need to re-synthesize the block. The
concentration network proposed in [24] can be used to connect the DUT with the
debugging system. The concentration network increases the observability of the
embedded design at the expense of some logic resources.
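The role of the selection register can be illustrated with a small software model. The Python sketch below is a deliberate simplification and not the network of [24]: it assumes that the selection value picks one contiguous group of 16 inputs, whereas a real concentration network may allow more general subsets. The function name `concentrate` and the zero padding of the last group are hypothetical.

```python
from typing import List, Sequence

OUTPUT_WIDTH = 16  # the debugging system observes 16 signals at a time


def concentrate(inputs: Sequence[int], selection: int,
                width: int = OUTPUT_WIDTH) -> List[int]:
    """Return the group of `width` inputs chosen by the selection register.

    Simplifying assumption: selection value k picks inputs
    [k * width, (k + 1) * width); a short last group is padded with zeros.
    """
    groups = (len(inputs) + width - 1) // width
    if not 0 <= selection < groups:
        raise ValueError(f"selection must be in 0..{groups - 1}")
    start = selection * width
    window = list(inputs[start:start + width])
    return window + [0] * (width - len(window))


if __name__ == "__main__":
    nodes = [i % 2 for i in range(40)]   # 40 hypothetical DUT nodes
    for sel in range(3):                 # rewriting the register changes the view
        print(sel, concentrate(nodes, sel))
```

Changing the observed set is then a register write by the controlling processor rather than a new synthesis run, which is the property the text relies on.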



3.4 Microprocessor Interfacing



Since the design was verified on the ZedBoard, the ARM processor has been used in
the design as the main controller for data transfer. However, an embedded processor
can also be used instead of the ARM processor to make the design independent of any
specific processor. Hence, the debugging approach remains valid not only for Xilinx
Zynq SoCs but also for other FPGA families without an ARM processor. Furthermore, the
data can be transmitted by either an Ethernet interface or a UART (whichever is
available). Data transmission through Ethernet is faster than through UART and hence
is preferred. However, since the processor under test is not being clocked during
the transfer in either case, no data is lost.
The data is received in a log file in *.txt format. First, a de-multiplexing
operation is performed and then the data is converted to the Value Change Dump
(VCD) format. Since the *.txt format is not directly convertible to VCD, an
application has been created for this conversion so that the design can be inspected
with any open-source waveform viewer such as GTKWave.
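The conversion step can be sketched as follows. The paper does not specify the layout of the log file, so this Python example assumes a hypothetical format in which each line holds one hexadecimal word whose bits are the traced signals of one DUT cycle. The signal names `sig0`, `sig1`, ... and the 10 ns timescale are placeholders. The output follows the basic structure of a VCD file: a header with variable declarations, then timestamps followed by value changes.

```python
from typing import Iterable, List


def log_to_vcd(lines: Iterable[str], num_signals: int = 16,
               timescale: str = "10 ns") -> str:
    """Convert a hex-per-cycle log (assumed format) into VCD text."""
    # VCD identifier codes are printable ASCII characters, starting at '!'.
    ids = [chr(33 + i) for i in range(num_signals)]
    out: List[str] = [f"$timescale {timescale} $end", "$scope module dut $end"]
    for i, ident in enumerate(ids):
        out.append(f"$var wire 1 {ident} sig{i} $end")
    out += ["$upscope $end", "$enddefinitions $end"]

    previous = None
    cycle = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue                     # ignore blank lines in the log
        word = int(line, 16)
        # Emit every signal on the first cycle, afterwards only the changes.
        changes = [
            f"{(word >> i) & 1}{ids[i]}"
            for i in range(num_signals)
            if previous is None or ((word ^ previous) >> i) & 1
        ]
        if changes:
            out.append(f"#{cycle}")
            out.extend(changes)
        previous = word
        cycle += 1
    out.append(f"#{cycle}")              # final timestamp marks the end of the trace
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(log_to_vcd(["0001", "0001", "0003"], num_signals=2))
```

Writing only the changes keeps the file small for slowly toggling signals, and the resulting text can be saved with a .vcd extension and opened in a waveform viewer.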

3.5 Dynamic Partial Reconfiguration






Dynamic Partial Reconfiguration (DPR) is the ability to reconfigure a portion of the
FPGA at runtime while the rest of the FPGA remains active [25]. DPR offers the
flexibility to change a part of the system’s hardware components, switching it to
another mode of operation and reusing the same hardware resources on the FPGA without
halting the rest of the system. In this work, DPR is used to load the proposed
Debugging System (DS) to debug the embedded microprocessors at runtime without
the need to repeat the FPGA design flow to add the DS to the DUT and re-implement the
whole system on the FPGA. An added advantage is that the hardware resources consumed
by the DS can be reused for other Reconfigurable Modules (RMs) at runtime after the
debugging phase has ended, as shown in Fig. 3.



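Fig. 3. Using DPR to load the Debugging System (DS) (FPGA floorplan over time: the static region holds the DUT, while the reconfigurable region holds the DS, a Reconfigurable Application (RA) or a Blank module in configuration modes 1 to 3)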



The Xilinx DPR design flow [25] is used for our proposed debugging system. The DPR
design flow requires partitioning the system into a static region and a
Reconfigurable Region (RR). In our case, the static region is the DUT, which does not
change during runtime, and the RR is allocated for the DS or any RM that will be
configured in the same RR after the debugging phase is over. The Hardware Description
Language (HDL) files of the different constituents of the DUT and the debugging system
were used as input for the DPR design flow. Floorplanning was carried out to ensure
efficient utilization of the hardware resources. The reconfiguration time t_reconf is
the time consumed to switch to a new operation mode. As t_reconf depends on the size
of the RR (Fig. 3), the RR should be sized to be just large enough to host the
largest RM.
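The dependence of t_reconf on the size of the RR can be approximated to first order as the partial bitstream size divided by the throughput of the configuration path. The Python sketch below uses this approximation with purely hypothetical numbers. The bitstream sizes and the 100 MB/s throughput are placeholders rather than measurements from this work, and the model ignores software and driver overhead.

```python
def reconfiguration_time_s(partial_bitstream_bytes: int,
                           config_throughput_bytes_per_s: float) -> float:
    """First-order estimate: t_reconf = bitstream size / configuration throughput."""
    if config_throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    return partial_bitstream_bytes / config_throughput_bytes_per_s


if __name__ == "__main__":
    throughput = 100e6                   # hypothetical configuration path, bytes/s
    for size in (250_000, 500_000, 1_000_000):   # hypothetical RR sizes in bytes
        t_ms = reconfiguration_time_s(size, throughput) * 1e3
        print(f"{size} byte partial bitstream: about {t_ms:.1f} ms")
```

Since the partial bitstream covers the whole RR regardless of how much of it the loaded RM uses, an oversized RR lengthens every mode switch, which is why the RR is sized for the largest RM and no larger.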

In Fig. 3, the proposed reconfigurable system has three RMs: the DS, a Blank module
and a Reconfigurable Application (RA) standing for another application. The RR on the
FPGA is dynamically reconfigured with one of these RMs according to the time slot. A
full configuration mode is the DUT together with one of the RMs. The output of the
DPR flow is a set of partial bitstream files, one per RM of the system, and a set of
full bitstream files, one per configuration mode.

In the proposed DPR-based debugging system, it is possible to load RMs for another
application and so reuse the resources allocated to the reconfigurable region when
the DS is not active (Fig. 3). Therefore, the routing or interconnections between
the DUT in the static region and the DS or any other RM in the RR have to change
according to the configuration mode. Hence, to keep the data flow between the DUT and
the RM valid, a reconfigurable re-routing technique should be used, as shown in
Fig. 4. A re-routing technique that reconfigures the interconnections between the
static region and the RR of a DPR design at runtime was presented in previous
work [26].

Fig. 4. Routing between the static region and the RR (a re-routing technique connects the reconfigurable partition, which holds the RA, the DS or a Blank module in configuration modes 1 to 3, with the DUT in the static part)



4 Results

The proposed methodology has been tested on the Digilent ZedBoard, which has an
XC7Z020-484 FPGA. We used Xilinx Vivado 2017.1 for the design process, carried out
on an Intel Core i7-6700 CPU running at 3.4 GHz with 16 GB of RAM. The design process
took about 23 min when the debugging circuitry was synthesized as a reconfigurable
module, compared with 17 min for the traditional flow without DPR. The difference in
design time between the two methodologies, about 6 min, is minor. The main advantage
of the presented methodology is the

capability of dynamic reconfiguration. The DPR-based debugging system provides the

flexibility to load debugging circuitry to the DUT at runtime without the need to repeat


