
constructing a data pipeline. However, an additional parallelization like a superscalar pipeline is still conceivable. The user has to extend the application with a few pragma annotations (#pragma microstreams task) to indicate the different tasks in the application. Pragmas can be placed before function calls, loops or ordinary statements/statement blocks. A created task's code extends until the next microstreams task pragma, until the end of the scope in which the pragma was placed, or until an explicit task end pragma is reached; the task end pragma is, however, rarely necessary. A processing pipeline is automatically created from the pragmas if the data dependencies allow it. Based on the dependency analysis, a communication infrastructure between the tasks is automatically created to exchange the non-exclusive variables. Each task is mapped to one processing core, and a hardware communication infrastructure with simple FIFO buffers is designed. In [6], it was demonstrated that this methodology remains applicable even when large data arrays are exchanged between the tasks. In addition to the automated firmware splitting, the tool also offers evaluation techniques to help users judge the created design. It can create an environment that measures, with cycle counters, the time spent on task execution and on data transmission to the next task. With those measurements, the user can determine whether the pipeline stages are balanced and see the communication overhead. The pipeline's processing speed is determined by the longest task's execution time, including communication time.
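To make the annotation style concrete, the following minimal sketch shows how a firmware main loop might be annotated for µStreams. Only the pragma itself (#pragma microstreams task) is taken from the description above; the helper functions and the loop body are invented placeholders for real application code.

  #include <stdint.h>

  static int16_t read_sample(void) { return 0; }  /* placeholder: acquire input   */
  static int16_t filter(int16_t s) { return s; }  /* placeholder: some processing */
  static void    emit(int16_t s)   { (void)s; }   /* placeholder: write output    */

  int main(void)
  {
      while (1) {
          #pragma microstreams task
          int16_t raw = read_sample();   /* task 1: ends at the next task pragma    */

          #pragma microstreams task
          int16_t flt = filter(raw);     /* task 2: processing stage                */

          #pragma microstreams task
          emit(flt);                     /* task 3: runs until the end of the scope */
      }
      return 0;
  }

Each annotated block becomes one pipeline stage mapped to its own core, and the communication for the shared variables raw and flt is generated automatically.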

As output, µStreams delivers several firmware files and a HW system description. Additionally, we showed that µStreams is able to extract the used peripherals/hardware from the firmware [5] and add them to the system description. This enables building the required hardware completely from the underlying firmware with little or no aid from the user.

Enhancements for fast DSE. To sum up, the following steps are automated by µStreams and help the user to do a fast DSE: (1) Split the application at points defined by pragmas and create a processing pipeline. (2) Do a

variable dependency analysis between the split code parts. (3) Resolve variable

dependencies by adding source code to communicate variable states in between

the split parts. (4) Optionally instantiate cycle counters that measure variable

receive/send times and program execution time for easy system evaluation. (5)

Create a hardware design description to import into the system builder.

While the manual source code splitting is relatively easy, manually analyzing the variable dependencies of bigger programs is very tedious and error prone. Depending on the number of variables used, this step can easily take hours, not accounting for errors that are highly likely. Thus, using µStreams allows many more design points to be created for evaluation in a shorter amount of time.

3.3 Synthesis Acceleration Tool



Repeatedly synthesizing similar designs performs many tasks over and over.

Doing them just once can save a considerable amount of time. HMFlow [11]






synthesizes all modules in the design into hard macros, which are internally

placed and routed. Due to the regularity of FPGAs, they can still be moved,

although the use of less frequent primitives (like DSP blocks or RAM blocks)

may limit the number of possible locations. It is based on RapidSmith [12], a tool for interfacing with Xilinx's ISE. HMFlow's input is a Xilinx System Generator design,

which is focused on modules as small as individual flip flops or multiplexers and

therefore is not particularly suitable for multi-core SoC designs.

RapidSoC [18] employs a similar approach, but is focused on SoC designs.

Every module (processor core, peripherals) is separately synthesized, placed and

internally routed. All inputs and outputs are connected to FPGA pins, and placement constraints force the modules into a rectangular shape. The resulting design is then loaded with RapidSmith as well, and the pins used in the separate synthesis are disconnected to obtain a reusable macro.

Once the user provides a configuration, the corresponding modules are

inserted into a single design. The modules are then connected as specified by

the system configuration. Afterwards, the placement of modules and the routing

between them needs to be calculated. The placement and routing problems that

need to be solved contain far fewer items than a regular synthesis. This reduction

in problem size allows for a significant speedup. However, the maximum clock

frequency is reduced, because the placement is not as optimal as in a regular

synthesis flow. If the FPGA’s utilization is very high, placement might even fail.

Enhancements for fast DSE. To speed up intermediate synthesis runs, they

are done using RapidSoC. Only the final synthesis is performed with the vendor

tools. RapidSoC achieves a lower target clock frequency than traditional synthesis, but the number of cycles until the result is computed stays the same. Since tight timing constraints increase routing time, intermediate synthesis runs are therefore performed at operating frequencies below the maximum possible.

The workflow does not require the user to modify any module definitions.

This is the best possible case for RapidSoC, because all modules can be created

ahead of time.

Using RapidSoC reduces the waiting time to deploy a design point on the

FPGA roughly by a factor of three.

3.4 Evaluation Platform



We use the SpartanMC SoC Kit (http://spartanmc.de/) as evaluation platform for convenience, since

it is supported by the synthesis acceleration tool as well as the parallelization

tool and is freely available. The SpartanMC [7] is a soft-core SoC Kit with an

instruction and data width of 18 bit. The 18 bit width makes optimal use of

the structure of modern FPGAs, since internal memory blocks and arithmetic

blocks are 18 bit wide. It comes with a library of hardware building blocks, system builder software, and a software toolchain consisting of GCC, GNU binutils, GDB and a cycle-accurate simulator.

As the processor core and the infrastructure occupy very few resources, it

is natural to put more than one core on an FPGA. For this purpose, a set

of specialized FIFO based communication peripherals (core-connectors) can be

used for efficient message exchange.
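As a rough illustration of how two pipeline stages could exchange a value over such a core-connector, consider the following sketch. The driver functions core_connector_send()/core_connector_recv() are hypothetical placeholders (stubbed here so the example compiles); the real SpartanMC peripheral API may differ, and only the blocking FIFO semantics are taken from the text.

  #include <stdint.h>

  /* Hypothetical core-connector driver, standing in for the real peripheral API. */
  static void    core_connector_send(int32_t w) { (void)w;  /* push to HW FIFO */ }
  static int32_t core_connector_recv(void)      { return 0; /* pop from HW FIFO */ }

  /* Firmware of the producing core: one pipeline stage forwards its result. */
  void stage_n(void)
  {
      for (;;) {
          int32_t result = 42;             /* placeholder computation            */
          core_connector_send(result);     /* would block while the FIFO is full */
      }
  }

  /* Firmware of the consuming core: the next stage receives and continues. */
  void stage_n_plus_1(void)
  {
      for (;;) {
          int32_t value = core_connector_recv();  /* would block until data arrives */
          (void)value;                            /* placeholder further processing */
      }
  }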

Each SoC can be tailored in terms of peripherals and processor features to

perfectly match the nature of the application. Also, multi-core systems can be

employed to improve the response behavior of individual applications; in this case, all critical code blocks are executed on a dedicated core. By spreading the application over several cores, a multi-core system improves the overall performance of the system.



4 Methodology for Application Profiling with Fast Design Space Exploration



The goal of the approach is to increase the processing rate of an application by constructing a processing pipeline. We do not expect the maximum clock

frequency to vary much between different pipeline arrangements, because the

critical path is inside the processor core and is not affected by its connectivity.

Therefore, the number of cycles until the result is computed is a good indicator

to compare different systems.

To be able to successfully transform a legacy application to fit new design

requirements, the user will go through different steps, as shown in Fig. 1. The

steps are described in detail in the following sections.

4.1 Initial Application Profiling



To be able to parallelize an existing application, the user first needs to identify

the critical parts of the software. This can be done by: (1) Creating an evaluation environment for the application with AutoPerf. The tool instantiates cycle counters, injects the corresponding calls into the firmware, and measures the cycles needed to execute each function, loop, and statement block. (2) Finding the system's maximum operating frequency. Since Xilinx's router stops optimizing the design once the target frequency is met, the design needs to be synthesized repeatedly with increasing target frequencies. To this end, a script is provided which uses a binary search to find the absolute maximum frequency (a sketch of the idea follows below). (3) Reading the application profile provided by AutoPerf and matching it against the design requirements. If the overall execution time of the application matches the required runtime, the user is done. If the design requirements are not met, the user starts with the parallelization.
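The binary search over target frequencies mentioned in step (2) can be sketched as follows. The helper synthesis_meets_timing() is a hypothetical stand-in for one complete synthesis run at the given target frequency; the real script drives the vendor tools and is not reproduced here.

  #include <stdio.h>
  #include <stdbool.h>

  /* Hypothetical stand-in: pretend timing closes up to 85 MHz so the sketch runs. */
  static bool synthesis_meets_timing(int target_mhz) { return target_mhz <= 85; }

  /* Binary search for the highest target frequency that still meets timing. */
  static int find_max_frequency(int low_mhz, int high_mhz)
  {
      int best = low_mhz;
      while (low_mhz <= high_mhz) {
          int mid = (low_mhz + high_mhz) / 2;
          if (synthesis_meets_timing(mid)) {   /* timing met: try a higher target   */
              best = mid;
              low_mhz = mid + 1;
          } else {                             /* timing missed: try a lower target */
              high_mhz = mid - 1;
          }
      }
      return best;
  }

  int main(void)
  {
      printf("maximum frequency: %d MHz\n", find_max_frequency(40, 120));
      return 0;
  }

Each probe of synthesis_meets_timing() corresponds to one synthesis run, so the number of runs grows only logarithmically with the size of the searched frequency range.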

4.2 Application Parallelization



[Fig. 1. Workflow: fast DSE for parallelizing legacy applications (manual steps in ellipses, automated steps in boxes). The figure shows the loop over the steps of Sects. 4.1-4.3: (1) AutoPerf: inject profiling code; (2) find maximum operating frequency of the system; (3) read application profile, design requirements met?; (4) place/change pragmas; (5) µStreams: parallelize with evaluation HW/SW; (6) RapidSoC: synthesize & run on FPGA; (7) read parallelization report, design requirements met?; once met, µStreams parallelizes without evaluation HW/SW and the final design is synthesized at maximum frequency.]

Since the timing of the different steps is known from the initial application profiling phase, the developer can start parallelizing the application. If the user



already has an estimate of how many times faster he wants the application to run, then he already knows into how many tasks the code has to be split at least. In an ideal case, where the transmission overhead of the variables is negligible and every pipeline stage has the same execution time, the speedup factor equals the number of created tasks.
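This relationship can be written compactly as follows; the notation is introduced here only for illustration and does not appear in the original text:

  S = T_seq / max_i (T_i + C_i),

where T_seq is the single-core execution time and T_i, C_i are the computation and communication times of pipeline stage i. For n perfectly balanced stages with negligible communication (T_i = T_seq / n, C_i ≈ 0) this reduces to the ideal speedup S = n.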

Under normal conditions, a split into equally long-running tasks is barely possible, and a task's execution time might additionally depend on the input data to be processed. In order to meet the timing constraints of the application, the user might run through several iterations of the following process: (4) Add a pragma annotation in the code for each task to be created,

while trying to balance task execution times. (5) Parallelize the application with

µStreams with evaluation hardware and execution time measurements in the

firmware. Read the design into the system builder to instantiate the system.

(6) Synthesize the design with RapidSoC and run it on the FPGA. (7) Read

the parallelization report of each task. The runtime of the longest pipeline stage

defines the speed of the whole pipeline. Check whether the processing time plus the variable send/receive time meets the timing requirements. If the timing requirement is not met, the number of tasks should be increased and another parallelization iteration has to be started with more task pragmas or a better pragma placement. If the design constraints are met, one can go on to the final design tuning.



4.3 Final Design Tuning



Once the parallelization report shows that the design requirements are met, one can remove the profiling HW/SW by simply running µStreams again without evaluation

options. As a final step, the system is synthesized at its maximum frequency.

The parallelized system’s maximum frequency should be equal to the single core’s

frequency.



5 Evaluation

5.1 Use Case



To evaluate our tool and methodology, we use Adaptive Differential Pulse Code Modulation [2] (ADPCM). It is a compression approach used in many places, such as the ITU audio codec G.726, or for signal compression in wireless sensing applications. We focus on the encoding procedure, as it was observed to consume more processing power.

ADPCM is based on differential pulse code modulation, where only the difference between consecutive values is transmitted (together with one initial absolute

value). Due to the continuous nature of most signals, this leads to a reduced variance of the transmitted values and thus to smaller codes (given that differences

are efficiently encoded, e.g. with Huffman encoding). The overall computation

steps of ADPCM can be found in Table 1.
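As a reminder of the differential idea, the following minimal C sketch shows plain DPCM encoding: one absolute start value is kept and only the differences between consecutive samples are transmitted. It illustrates the principle only; the adaptive quantization that distinguishes ADPCM [2] is not shown.

  #include <stdint.h>
  #include <stddef.h>

  void dpcm_encode(const int16_t *in, int16_t *out, size_t n)
  {
      if (n == 0)
          return;
      out[0] = in[0];                   /* initial absolute value     */
      for (size_t i = 1; i < n; i++)
          out[i] = in[i] - in[i - 1];   /* small-variance differences */
  }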

5.2 Evaluation Stages



Profile Source Application. To parallelize the ADPCM application, we first

profiled the application as described in Sect. 4.1 with AutoPerf. The measured

cycles for each processing step are shown in Table 1 column 2.

1st Parallelization Iteration. Computation steps 1–6 take about as long as

steps 7–8. Thus, the first try shall be to put steps 1–6 into one task and 7–8 into

another one. This is done by adding a pragma annotation before step 7. After

running µStreams with the pragma annotated code and synthesizing the system

with RapidSoC, the runtime of the different tasks including the communication

overhead can be measured (see Table 1 column 3). In this case, sending and

receiving variables (316 cycles) is negligible compared to the calculation time.

However, this situation might change if the pragma is set at a different position.

2nd Parallelization Iteration. We assume that the application requirements

are not met and the target is to support an input stream at a higher data rate.

The processing time of the pipeline stages needs to be decreased. Looking at the

application profile in Table 1 column 2, step 7 consumes the most processing time,

forming the longest pipeline stage. Thus, this step should be optimized. This can

be done by splitting the compression loop in step 7 into two smaller loops (7.1



and 7.2), calculating the first and the second half of the samples in separate tasks.

Table 1. Execution time of the different processing steps and the parallelized variants (in units of 10^3 cycles; for the multi-core variants, each entry gives the runtime of one pipeline stage together with the steps it covers)

  Processing step              1 Core   2 Cores            3 Cores             4 Cores
  1: Read input                   729   899 (steps 1-6)    729 (step 1)        729 (step 1)
  2: Auto correlation               7                      722 (steps 2-7.1)   171 (steps 2-6)
  3: Extract equation system        1
  4: Solve equation system        151
  5: Backsubstitution               9
  6: Write coefficients             1
  7: Compression loop            1113   1205 (steps 7-8)   653 (steps 7.2-8)   552 (step 7.1)
  8: Write results                 98                                          653 (steps 7.2-8)

This optimization would most certainly reveal the adaptive pass as the longest

pipeline stage. Thus, steps 1–6 are also split into two tasks, as shown in Table 1

column 5. After processing the application with µStreams a second time, building

and synthesizing the system with RapidSoC, another parallelization report can

be read. As shown in Table 1 column 5, it was possible to increase the pipeline

processing speed. Let’s now assume that the design fulfills the requirements.

Refine/Optimize Parallelization Hardware. However, to further optimize the design, it is obvious from Table 1 column 2 that processing steps 2-7.1 in total do not take longer than step 1. So steps 2-7.1 can form one task instead of

two, as shown in Table 1 column 4. This step saves hardware without decreasing

the processing speed of the pipeline.
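A quick back-of-the-envelope check with the column-2 numbers supports this step (the arithmetic is derived here and is not stated explicitly in the text):

  T_2-6 + T_7.1 ≈ (7 + 1 + 151 + 9 + 1 + 1113/2) * 10^3 ≈ 725 * 10^3 cycles ≤ 729 * 10^3 cycles = T_1,

so merging steps 2-7.1 into a single stage does not lengthen the longest pipeline stage, which remains step 1.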

Further Parallelization. Depending on the application, it might get harder and harder with each additional pipeline stage to find a solution with a balanced workload. Possible ways to further increase the pipeline's processing speed are either to restructure the application (e.g., loop splitting of processing step 1) to obtain smaller processing steps or to create a superscalar pipeline.

5.3 Results



Synthesis time, maximum clock frequencies and required pipeline execution

cycles for each design point are shown in Table 2. Typically the user would perform intermediate synthesis runs with RapidSoC at a low frequency and then

final regular synthesis runs to get the maximum frequency. The data generated

by those runs is printed in bold. The remaining data serves to evaluate the

approach.






Table 2. Synthesis time, maximum clock frequency and required cycles for different pipeline arrangements (synthesis run on an i7-6700 with 16 GB RAM for an XC6SLX45)

  Design    Synthesis time (s)    Synthesis   Maximum clock frequency (MHz)   Execution time
            Regular   RapidSoC    speedup     Regular   RapidSoC              @40 MHz (cycles)
  1-Core         77         27        2.8        84.8         56                     2012609
  2-Cores        92         34        2.7        86.4         54                     1205689
  3-Cores       124         44        2.8        87.7         53                      729424
  4-Cores       167         53        3.1        87.7         52                      729424
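For orientation, the cycle counts in Table 2 translate into the following cycle-count speedups over the single-core design (derived here from the table values, not stated in the text):

  2012609 / 1205689 ≈ 1.67 for the two-core pipeline,
  2012609 / 729424  ≈ 2.76 for the three- and four-core pipelines,

the latter being limited by the 729·10^3-cycle input stage, which now forms the longest pipeline stage.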



As can be seen, the usefulness of RapidSoC is reflected in the synthesis time speedup. The maximum achievable frequency is about 30 MHz lower than that of the classically synthesized system. This is not dramatic, since in a typical DSE process a known working frequency is usually used to test the functionality, and the maximum operating frequency will be evaluated after the functionality has

been thoroughly tested. The maximum operating frequency was evaluated with

the tool described in Sect. 4.3. It is the highest result of ten synthesis runs with

different seeds.

Looking at the maximum achievable frequencies in Table 2, two opposing phenomena can be observed. With regular synthesis, the more cores there are on the FPGA, the higher the maximum achievable frequency gets, because each core needs less memory. When synthesizing with RapidSoC, another effect dominates: more cores on the FPGA result in lower frequencies. The cost function for placement only optimizes individual net lengths and does not weight combinational paths spanning multiple modules more heavily. The critical path in multi-core designs spans one more module than in single-core designs, leading to lower clock frequencies.

To get a rough idea of how much faster our DSE is, we have measured the required manual steps of the DSE process with and without the helper tools. For the manual source code transformations, we have counted the lines of code that have to be added/deleted/modified and the number of necessary clicks in the system builder. Table 3 shows the manual steps needed for the application profiling of the ADPCM example and for one parallelization iteration towards the three-core design variant. The user only needs four manual steps, instead of 57, to profile the application with the fast design space exploration tools. For a parallelization iteration, 8 user interactions are required with fast DSE versus 174 with regular DSE. It has to be noted that the parallelization iteration with the ADPCM example was executed three times in total. The runtime of all tools except for synthesis and RapidSoC is less than one second and is not noticeably influenced by the input data size. Another important point is that the more the system is parallelized, the harder and more time-consuming an iteration in regular DSE becomes, since the hardware design grows bigger and more variable dependencies have to be resolved.




Table 3. Manual effort comparison DSE vs fast DSE

  Step                                  DSE                                        Fast DSE
  Add cycle counter calls in code       36 code insertions                         Tool run: <1 s
  Create hardware design                19 clicks in the system builder            Tool run: <1 s
  Synthesize & run on FPGA              Tool run: 77 s                             Tool run: 27 s
  Analyze application profile           Analyze Table 1 column 2                   Analyze Table 1 column 2
  Manual steps for app. profiling       57                                         4
  Place pragmas/manually split code     2x copy firmware, 4 code-block deletions   3 code insertions
  Resolve variable dependencies         31 code insertions                         Tool run: <1 s
  Add cycle counters for evaluation     28 code insertions                         Tool run: <1 s
  Create hardware design                48 clicks in system builder                Tool run: <1 s
  Synthesize & run on FPGA              Tool run: 124 s                            Tool run: 44 s
  Analyze parallelization report        Analyze Table 1 column 4                   Analyze Table 1 column 4
  Remove evaluation code and HW         Delete 59 code lines                       Tool run: <1 s
  Manual steps for parallelization      174                                        8



To get an idea of the time saved through fast DSE, we ourselves tried to parallelize the three-core variant by hand, and it took us roughly 45 min even though we knew exactly what to do. Using the proposed tools, we needed less than 2 min.



6 Conclusion and Outlook



In this contribution, we have shown that the design space exploration for multi-core embedded systems can be considerably shortened by using tools to firstly speed up the creation of different design points and secondly speed up the implementation of those design points. Additionally, we show that profiling support in the target system can be added automatically. As a result, the pure implementation of the different designs becomes ≈3 times faster. The time saved through the automated parallelization might be much bigger for medium-sized designs.

RapidSoC currently only works with Xilinx ISE. We are trying to interface with Vivado in a similar way. Unfortunately, first attempts such as RapidSmith2 exhibit a

much higher latency for individual manipulations compared to RapidSmith.

Although µStreams already greatly relieves the user, more automation can be envisioned. For example, it would be helpful if µStreams automatically suggested split points depending on the user's design requirements. Also, in some cases a replication of critical tasks could be used to reduce the processing time of one stage. Finally, splitting loops is necessary in some cases to achieve the required processing rate of an application. It would be very helpful if µStreams could perform this modification on its own.






References

1. Abdallah, A.B.: Multicore Systems On-Chip: Practical Software/Hardware Design.

Atlantis Press, Amsterdam (2013). https://doi.org/10.2991/978-94-91216-92-3

2. Cummiskey, P., Jayant, N., Flanagan, J.: Adaptive quantization in differential

PCM coding of speech. Bell Syst. Techn. J. 52, 1105–1118 (1973)

3. Dave, C., Bae, H., Min, S.J., Lee, S., Eigenmann, R., Midkiff, S.: Cetus: a source-to-source compiler infrastructure for multicores. Computer 42, 36–42 (2009)

4. Ha, S., Lee, C., Yi, Y., Kwon, S., Joo, Y.P.: Hardware-software codesign of multimedia embedded systems: the PeaCE approach. In: RTCSA (2006)

5. Heid, K., Wirsch, R., Hochberger, C.: Automated inference of SoC configuration through firmware source code analysis. In: FPGAs for Software Programmers

(FSP), pp. 1–9 (2016)

6. Heid, K., Weber, J., Hochberger, C.: µStreams: a tool for automated streaming

pipeline generation on soft-core processors. In: FPGAs for General Purpose Computing (2016)

7. Hempel, G., Hochberger, C.: A resource optimized SoC kit for FPGAs. In: International Conference on Field Programmable Logic and Applications, pp. 761–764

(2007)

8. Kangas, T., Kukkala, P., Orsila, H., Salminen, E., Hännikäinen, M., Hämäläinen, T., Riihimäki, J., Kuusilinna, K.: UML-based MPSoC design framework. ACM TECS 5, 281–320 (2006)

9. Keinert, J., Streubühr, M., Schlichter, T., Falk, J., Gladigau, J., Haubelt, C., Teich,

J., Meredith, M.: SystemCoDesigner - an automatic ESL synthesis approach by

design space exploration and behavioral synthesis for streaming applications. ACM

TODAES 14, 1:1–1:23 (2009)

10. Kinsy, M.A., Pellauer, M., Devadas, S.: Heracles: a tool for fast RTL-based design

space exploration of multicore processors. In: ACM/SIGDA FPGA, pp. 125–134

(2013)

11. Lavin, C., Padilla, M., Lamprecht, J., Lundrigan, P., Nelson, B., Hutchings, B.:

HMFlow: accelerating FPGA compilation with hard macros for rapid prototyping. In: Field-Programmable Custom Computing Machines (FCCM), pp. 117–124

(2011)

12. Lavin, C., Padilla, M., Lundrigan, P., Nelson, B., Hutchings, B.: Rapid prototyping

tools for FPGA designs: RapidSmith. In: FPT, pp. 353–356 (2010)

13. Monchiero, M., Canal, R., González, A.: Design space exploration for multicore

architectures: a power/performance/thermal view. In: ICS, pp. 177–186 (2006)

14. Munk, H., Ayguadé, E., Bastoul, C., Carpenter, P., Chamski, Z., Cohen, A., Cornero, M., Dumont, P., Duranton, M., Fellahi, M., Ferrer, R., Ladelsky, R., Lindwer, M., Martorell, X., Miranda, C., Nuzman, D., Ornstein, A., Pop, A., Pop, S., Pouchet, L.N., Ramírez, A., Ródenas, D., Rohou, E., Rosen, I., Shvadron, U., Trifunović, K., Zaks, A.: ACOTES project: advanced compiler technologies for

embedded streaming. Int. J. Parallel Program. 39, 397–450 (2010)

15. Pop, A., Cohen, A.: OpenStream: expressiveness and data-flow compilation of

OpenMP streaming programs. ACM TACO 9, 53 (2013)

16. Dolbeau, R., Bihan, S., Bodin, F.: HMPP: a hybrid multi-core parallel programming environment. In: Workshop on General Purpose Processing on GPU (2007)






17. Thompson, M., Nikolov, H., Stefanov, T., Pimentel, A.D., Erbas, C., Polstra, S.,

Deprettere, E.F.: A framework for rapid system-level exploration, synthesis, and

programming of multimedia MP-SoCs. In: IEEE/ACM/IFIP CODES+ISSS, pp.

9–14 (2007)

18. Wenzel, J., Hochberger, C.: RapidSoC: short turnaround creation of FPGA based

SoCs. In: International Symposium on Rapid System Prototyping, pp. 86–92 (2016)



Control Flow Analysis for Embedded Multi-core Hybrid Systems

Augusto W. Hoppe1,2(B), Fernanda Lima Kastensmidt2, and Jürgen Becker1

1 Institute for Information Processing Technologies (ITIV), KIT, Karlsruhe, Germany
{augusto.hoppe,becker}@kit.edu
2 Instituto de Informática – PGMICRO,
Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre, Brazil
fglima@inf.ufrgs.br



Abstract. The use of program tracing subsystems is already ubiquitous during the validation phase of an application’s life-cycle. However,

these functionalities are also extremely useful in the domain of embedded fault tolerance. In this paper we explore the ARM CoreSight Debug

and Trace architecture as a new tool for fault diagnosis and control flow

assurance. The CoreSight is a dedicated ARM architecture that provides

support for Program Flow Tracing without overhead costs for the running application. New FPGA integrated System-on-Chips (SoCs) enable

the implementation of Hardware modules with direct access to system

peripherals, bypassing the use of external control interfaces such as JTAG

or Serial Wire Debug (SWD). We show here a new implementation for

an integrated configurable hardware controller that can collect and send

program trace data for a ARM Cortex-A9 integrated FPGA SoC. We

also propose the use of this interface to measure hang latency, the time

between the occurrence of a fault and failure detection.



Keywords: ARM CoreSight · Online trace · Fault injection · FPGA · Soft-error · Control flow

1 Introduction



Real-time embedded systems are in the spotlight of current fault-tolerance

research. Safety-critical applications show a clear and still increasing demand for

digital processing power, e.g., for automated driving and interconnected intelligent systems with real-time requirements. The usage of multi-core technologies is

an imperative for embedded systems in the near future. The ARAMIS-II project

[1] aims at the development of processes, tools and platforms for the efficient use

of multi-core architectures in such safety-critical domains. New fault-tolerant and fault-safe techniques must be implemented to work with high-performance systems. The project focuses on three major domains related to critical systems: automotive, avionic and industrial applications. Safety standards related to

such domains recommend extensive branch coverage and conditional execution

© Springer International Publishing AG, part of Springer Nature 2018

N. Voros et al. (Eds.): ARC 2018, LNCS 10824, pp. 485–496, 2018.

https://doi.org/10.1007/978-3-319-78890-6_39


