3 ReneGENE-Novo: Measure of Accuracy and Performance on Platform P1


ReneGENE-Novo






Fig. 3. The ReneGENE-Novo hardware



3.4 ReneGENE-Novo Performance Analysis on Platform P2



ReneGENE-Novo was implemented on a reconfigurable hardware platform with multiple FPGAs. The prototype was modelled in Verilog HDL and embedded within the acceleration framework. The application workflow is shown in Fig. 3.

The Novo firmware is multithreaded and runs on the host system. It implements the de novo assembly workflow in Fig. 1 with the support of the accelerator hardware, which consists of Xilinx 6vlx550tff1759-2 Virtex-6 devices. The firmware is built around custom APIs, supported by stand-alone libraries for ReneGENE-Novo, which set up and maintain a streaming interface with the hardware. The firmware also hosts the drivers for the scalable Novo hardware on the reconfigurable platform; these drivers initialize the hardware and load the application and algorithm parameters. The Novo hardware abstraction layer interacts with the hardware during configuration and takes part in control and data transfers during the assembly run.

Table 3. ReneGENE-Novo configuration and occupancy on P2

Feature utilization  8 units                     70 units                    128 units
Slice registers      27756 out of 687360 (4%)    83137 out of 687360 (12%)   213082 out of 687360 (31%)
Slice LUTs           16247 out of 343680 (4%)    124908 out of 343680 (36%)  292128 out of 343680 (85%)
Bonded IOBs          34 out of 840 (4%)          34 out of 840 (4%)          34 out of 840 (4%)
BRAM/FIFO            63 out of 632 (9%)          63 out of 632 (9%)          63 out of 632 (9%)
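The occupancy percentages in Table 3 are ratios of used to available resources. The following minimal sketch recomputes them for the slice registers and slice LUTs. It assumes device capacities of 687,360 slice registers and 343,680 slice LUTs (the figures listed in the 8-unit column) and assumes that percentages are truncated to whole numbers; the names `CAPACITY`, `USED` and `occupancy_percent` are illustrative and do not come from the paper.

```python
# Resource counts transcribed from Table 3 for the 8-, 70- and 128-unit builds.
# Capacities are an assumption taken from the 8-unit column of the table.
CAPACITY = {"slice_registers": 687_360, "slice_luts": 343_680}

USED = {
    "slice_registers": {8: 27_756, 70: 83_137, 128: 213_082},
    "slice_luts": {8: 16_247, 70: 124_908, 128: 292_128},
}


def occupancy_percent(resource: str, units: int) -> int:
    """Return used/available as a whole percentage, truncated (assumed convention)."""
    return (100 * USED[resource][units]) // CAPACITY[resource]


if __name__ == "__main__":
    for resource, per_config in USED.items():
        for units in per_config:
            pct = occupancy_percent(resource, units)
            print(f"{resource:16s} {units:4d} units: {pct}%")
```

The LUT occupancy, rather than the register occupancy, is the resource that approaches the device limit as the number of units grows.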



As seen in Table 3, three different configurations were realized on the hardware, with 8, 70 and 128 parallel instances of the ReneGENE-Novo module, with






S. Natarajan et al.



each instance growing one seed to a readtig. The growth of one seed to a readtig is independent of the growth of any other seed or readtig. The performance numbers are extracted from a simulation of ReneGENE-Novo using Xilinx proprietary simulation tools. The performance parameters for ReneGENE-Novo on P2 are given in Table 4.

Table 4. ReneGENE-Novo scalability and performance parameters on P2

Symbol             Description
$L$                Short read length
$N$                Total number of reads
$NOP_{read}$       Total Novo operations per seed for a single read,
                   $NOP_{read} = 4 \times L$
$\tau_{clk}$       Operating clock period of the ReneGENE-Novo module, in seconds
$T_{config}$       Time taken to configure the ReneGENE-Novo module on the hardware, in seconds
$T_{ContigN}$      Time taken, in seconds, per seed or Novo module, to grow readtigs from $N$ reads,
                   $T_{ContigN} = \tau_{clk} \times (3 \times N + 1)$
$T_{Total}$        Time taken, in seconds, to grow readtigs from $N$ reads for $n$ modules
                   over $B$ iterations ($n$ seeds per iteration),
                   $T_{Total} = B \times T_{ContigN} + T_{config}$
$NOP_{Nreads}$     Total Novo operations for $N$ reads, per seed,
                   $NOP_{Nreads} = 3 \times N \times NOP_{read}$
$n$                Number of Novo instances within the FPGA
$NOP_{n}$          Total Novo operations across $n$ parallel modules within the hardware,
                   $NOP_{n} = n \times NOP_{Nreads}$
$B$                Number of batches required to cover the reads
$NOP_{nAssembly}$  Total Novo operations across $B$ batches,
                   $NOP_{nAssembly} = B \times NOP_{n}$
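The definitions in Table 4 chain together, so they can be written as a small model. The sketch below is illustrative only: the class `NovoConfig` and its attribute names are hypothetical, and the batch count is assumed to be the number of seeds divided by the number of modules, rounded up, which the table does not state explicitly.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class NovoConfig:
    """Parameters of Table 4 (names are illustrative, not from the paper)."""
    read_length: int       # L, bases per short read
    num_reads: int         # N, total number of reads
    num_modules: int       # n, parallel Novo instances in the FPGA
    clock_period_s: float  # tau_clk, in seconds

    @property
    def nop_read(self) -> int:
        # Novo operations per seed for a single read: 4 * L
        return 4 * self.read_length

    @property
    def nop_nreads(self) -> int:
        # Novo operations per seed for N reads: 3 * N * NOP_read
        return 3 * self.num_reads * self.nop_read

    @property
    def nop_n(self) -> int:
        # Novo operations across n parallel modules
        return self.num_modules * self.nop_nreads

    @property
    def t_contig_n(self) -> float:
        # Seconds per seed to grow readtigs from N reads
        return self.clock_period_s * (3 * self.num_reads + 1)

    def batches(self, num_seeds: int) -> int:
        # Assumption: B = ceil(seeds / n), one seed per module per batch
        return math.ceil(num_seeds / self.num_modules)

    def t_total(self, num_seeds: int, t_config_s: float) -> float:
        # T_Total = B * T_ContigN + T_config
        return self.batches(num_seeds) * self.t_contig_n + t_config_s
```

For example, `NovoConfig(64, 10_000, 8, 4.4e-9).nop_nreads` evaluates to 7,680,000 operations per seed for reads of length 64.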



The performance of a group of $n$ ReneGENE-Novo modules, over a single batch of $n$ seeds, covering readtig growth across $N$ reads per seed, measured in Giga Novo Operations Per Second (GNOPS), is given by:

$$P_{Novo} = \frac{NOP_{n}}{T_{ContigN} \times 10^{9}} \tag{1}$$



The performance of a group of $n$ ReneGENE-Novo modules (considering the assembly time only), over $B$ batches with $n$ seeds grown per batch, covering readtig growth across $N$ reads per seed, measured in GNOPS, is given by:

$$P_{NovoAssembly} = \frac{NOP_{nAssembly}}{B \times T_{ContigN} \times 10^{9}} \tag{2}$$
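Substituting the Table 4 definition $NOP_{nAssembly} = B \times NOP_{n}$ into Eq. (2) shows that the batch count cancels, so the assembly-only figure coincides with the single-batch figure of Eq. (1). This short derivation is offered as a reading aid; it is consistent with the identical $P_{Novo}$ and $P_{NovoAssembly}$ rows reported in Table 5.

```latex
\begin{align*}
P_{NovoAssembly}
  &= \frac{NOP_{nAssembly}}{B \times T_{ContigN} \times 10^{9}} \\
  &= \frac{B \times NOP_{n}}{B \times T_{ContigN} \times 10^{9}} \\
  &= \frac{NOP_{n}}{T_{ContigN} \times 10^{9}} = P_{Novo}
\end{align*}
```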



The performance of a group of $n$ ReneGENE-Novo modules (considering assembly time and configuration time), over $B$ batches of $n$ seeds each, covering readtig growth across $N$ reads, measured in GNOPS, is given by:

$$P_{NovoTotal} = \frac{B \times NOP_{n}}{T_{Total} \times 10^{9}} \tag{3}$$



Table 5 summarizes the performance of ReneGENE-Novo for all three configurations. As the number of Novo modules increases, the performance of the parallel ReneGENE-Novo hardware increases substantially, from 461.24 GNOPS with 8 modules to 3956.4 GNOPS with 128 modules. This growth is limited only by the maximum number of Novo modules that can be accommodated within a single FPGA. Beyond the boundaries of a single FPGA, the hardware can be scaled across multiple FPGAs, thereby improving performance further.
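The $P_{Novo}$ entries of Table 5 can be approximated from Eq. (1) and the operating frequencies listed there. The sketch below assumes that the clock period is the reciprocal of the operating frequency (rather than the rounded periods printed in the table). The function names are illustrative, and the configuration time passed to `p_novo_total` is a placeholder, since this section does not report $T_{config}$.

```python
# Values transcribed from Table 5: module count n -> operating frequency (MHz).
FREQ_MHZ = {8: 225.202, 70: 122.585, 128: 120.746}
READ_LENGTH = 64    # L
NUM_READS = 10_000  # N


def t_contig_n(freq_mhz: float, num_reads: int = NUM_READS) -> float:
    """Seconds for one module to grow a readtig from N reads."""
    tau_clk = 1.0 / (freq_mhz * 1e6)  # assumed: clock period = 1 / F
    return tau_clk * (3 * num_reads + 1)


def nop_n(num_modules: int) -> int:
    """Novo operations across n modules for one batch: n * 3 * N * 4 * L."""
    return num_modules * 3 * NUM_READS * 4 * READ_LENGTH


def p_novo(num_modules: int) -> float:
    """Eq. (1): GNOPS for a single batch of n seeds."""
    return nop_n(num_modules) / (t_contig_n(FREQ_MHZ[num_modules]) * 1e9)


def p_novo_total(num_modules: int, num_batches: int, t_config_s: float) -> float:
    """Eq. (3): GNOPS over B batches, including configuration time."""
    t_total = num_batches * t_contig_n(FREQ_MHZ[num_modules]) + t_config_s
    return num_batches * nop_n(num_modules) / (t_total * 1e9)


if __name__ == "__main__":
    for n in FREQ_MHZ:
        print(f"n = {n:3d}: P_Novo = {p_novo(n):.2f} GNOPS")
    # t_config_s = 0.5 is a placeholder value, not a figure from the paper.
    print(f"P_NovoTotal (n = 8, B = 1250): {p_novo_total(8, 1250, 0.5):.2f} GNOPS")
```

Under these assumptions the computed $P_{Novo}$ values land within about 0.1% of the corresponding Table 5 entries. Because $T_{Total}$ includes a fixed configuration term, $P_{NovoTotal}$ is always lower than $P_{Novo}$ for any positive configuration time.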

3.5 ReneGENE-Novo Performance Analysis on Platform P3



ReneGENE-Novo was configured on the Intel Xeon Phi co-processors of the Cray XC40 platform [11]. The Cray XC40 has a total of 48 Intel Xeon Phi co-processors; each node combines 12 CPU cores with one Xeon Phi co-processor. The host CPUs are Intel Xeon Ivy Bridge E5-2695 v2 12-core processors operating at 2.4 GHz. The Intel Xeon Phi co-processor 5120D (Knights Corner) on each node has 60 cores. There are 64 GB of main memory and 8 GB of device memory available for data handling. The nodes are interconnected through the proprietary Cray Aries interconnect with a Dragonfly topology.

The results of the experiments conducted on 24 Intel Xeon Phi co-processors on P3 are provided in Table 6. Here, we have considered the short reads from the E. coli organism (SRR1948068 of the NCBI database), consisting of around 600,000 short reads, each read with 251 bases. We have compared the performance with a workstation named Helix, based on the AMD Opteron(TM) Processor 6284 SE and having 64 cores.

Table 5. ReneGENE-Novo performance analysis on P2

Feature                        8 units    70 units   128 units
Operating frequency F (MHz)    225.202    122.585    120.746
$\tau_{clk}$                   4.4 ns     8.16 ns    8.4 ns
$L$                            64         64         64
$N$                            10000      10000      10000
$n$                            8          70         128
$P_{Novo}$ (GNOPS)             461.24     2196.55    3956.4
$P_{NovoAssembly}$ (GNOPS)     461.24     2196.55    3956.4
$P_{NovoTotal}$ (GNOPS)        115.23     143.69     149.45

Table 6. ReneGENE-Novo performance analysis on P3

Feature            Helix                    P3
Parallelization    64 processes             288 processes
Heuristics         NIL, accurate overlap    NIL, accurate overlap
Number of reads    596,100                  596,100
Time taken         42 min 45 s              3211 min 13 s

P3 reports better performance, as the irregular computation scales well across the 288 processes in P3 compared with the 64 cores in Helix. Although we could not achieve linear scaling in performance, the parallelization at both the algorithmic and architectural levels, along with architecture-specific optimizations, offers a notable improvement in performance.

3.6 ReneGENE-Novo: Effect of Algorithm-Architecture Co-design on Various Platforms



Here, we have compiled the performance of ReneGENE-Novo across the platforms P1, P2 and P3. The results of the algorithm-architecture co-design of ReneGENE-Novo, where the algorithm and hardware parameters have together been tailored for the individual platforms, are presented in Table 7. The GNOPS-based performance analysis is done on P1, P2 and P3 while varying the number of processes/units, for a total of 76,800,000,000 Novo operations to extend all seeds against all reads across all processes/units. The results suggest that the hardware-based implementation of ReneGENE-Novo on P2 (115.23 to 149.45 GNOPS) outperforms the single-process and eight-process variations of P1 (0.0452 and 3.07 GNOPS). The presence of 288 processes in P3 provides superior performance compared to P2 (2518.38 GNOPS), due to a more optimal and parallel distribution of data, control and synchronization.

Table 7. ReneGENE-Novo: analysis of effects of algorithm-architecture co-design on various platforms

Feature                       P1: 1      P1: 8      P2: 8      P2: 70     P2: 128    P3: 288
                              process    processes  units      units      units      processes
1. Number of reads            10000      10000      10000      10000      10000      10000
2. Number of parallel         1          8          8          70         128        288
   units/processes
3. Number of seeds per        10000      1250       8          70         128        35
   process/unit
4. Total number of Novo       7680000    7680000    7680000    7680000    7680000    7680000
   operations to extend a
   single seed against all
   reads in a single
   process/unit
5. Time taken (seconds) to    0.17       0.02       0.000133   0.000244   0.000248   0.000590
   extend a single seed
   against all reads in a
   single process/unit
6. Time taken (seconds) to    1700       25         0.667      0.535      0.520      0.0305
   extend all seeds against
   all reads across all
   processes/units
7. ReneGENE-Novo performance  0.0452     3.07       115.23     143.69     149.45     2518.38
   for multiple
   processes/units in GNOPS
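The GNOPS figures in row 7 of Table 7 can be approximated by dividing the total operation count by the total time in row 6. The sketch below makes that relationship explicit. It assumes that row 7 is derived in this way, which the text implies but does not state as a formula, and the names `TOTAL_TIME_S`, `gnops` and `speedup_over` are illustrative.

```python
# Total Novo operations quoted in the text for extending all seeds against all reads.
TOTAL_NOVO_OPS = 76_800_000_000

# Row 6 of Table 7: time in seconds to extend all seeds against all reads.
TOTAL_TIME_S = {
    "P1: 1 process": 1700.0,
    "P1: 8 processes": 25.0,
    "P2: 8 units": 0.667,
    "P2: 70 units": 0.535,
    "P2: 128 units": 0.520,
    "P3: 288 processes": 0.0305,
}


def gnops(total_ops: int, seconds: float) -> float:
    """Giga Novo operations per second for a complete run."""
    return total_ops / (seconds * 1e9)


def speedup_over(baseline: str, platform: str) -> float:
    """Ratio of total run times, baseline over platform."""
    return TOTAL_TIME_S[baseline] / TOTAL_TIME_S[platform]


if __name__ == "__main__":
    for name, seconds in TOTAL_TIME_S.items():
        rate = gnops(TOTAL_NOVO_OPS, seconds)
        gain = speedup_over("P1: 1 process", name)
        print(f"{name:18s} {rate:10.4f} GNOPS  {gain:10.1f}x vs P1 single process")
```

With the rounded timings of row 6, the computed values fall within roughly 1 to 2% of the row 7 entries.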

There is further scope for parallelism on P2, as we are currently using only a single FPGA. ReneGENE-Novo is scalable across multiple FPGAs, which should result in improved performance. Reconfigurable hardware platforms are therefore a good choice for parallel and scalable ReneGENE-Novo deployment. With many reconfigurable HPC platforms available with high-end FPGAs of larger logic capacity from vendors like Xilinx at affordable costs, we can deploy more parallel units of our model with optimal timing and power ratings. Platform P2 offers a more cost-effective model than a supercomputing platform like P3, as the power consumption of such deployments is much lower, along with lower maintenance and Non-Recurring Engineering (NRE) costs.



4 ReneGENE-Novo Use Case: Deployment for Genome Informatics



ReneGENE-Novo is currently used as part of a larger Genome Informatics (GI) pipeline, namely ReneGENE-GI. Fig. 4 illustrates the ReneGENE-GI pipeline, which performs Short Read Mapping (SRM) [12,13]. Through SRM, the short reads are mapped or aligned against a reference genome string. The novelty of the ReneGENE-GI pipeline lies in its blend of comparative genomics and de novo sequence assembly, offering precise SRM. The comparative genomics module exploits the parallel Dynamic Programming [14] methodology to accurately map the short reads against the reference genome. The alignment is backed by accurate indexing and lookup of reads against the reference, using a parallel implementation of the dynamic Monotonic Minimal Perfect Hashing (MMPH) method [15–17]. The ReneGENE-Novo module generates readtigs as explained in the previous sections. These readtigs



Fig. 4. The ReneGENE-GI pipeline


