1 de Novo Assembly: A Complex Big-Data Engineering Problem



S. Natarajan et al.

The assembly is guided by the amount of similarity among the short read

substrings, which helps in grouping reads with sufficient neighbourhood and

originating from nearby locations within the target genome. However, sequencing a

repeat region produces short reads which are very similar, but originate

from highly separated repeat locations in the genome. The presence of short

reads from genomic repeats provides an exponential increase in the number of

possible overlaps of short reads to form contigs, compounded by an exponential increase in the number of contig re-arrangements involving similar contigs,

at each location of the target repeat region. In addition, an assembler cannot

accurately generate contigs from short reads that are from NGS platforms with

sufficiently large sequencing error rates. As a result, the resultant contigs diverge

more from the actual target genome segments than what the assembler’s heuristics alone would have offered. In the presence of such ambiguous inputs, a reliable

and efficient reconstruction of genome fragments is not possible without involving read-error corrections prior to assembly and computational heuristics during

assembly. De novo assembly involves random input data, irregular memory access patterns and spontaneous growth of contigs, and thus presents an irregular computing pattern.

The choice of an assembler [3] defines the throughput and performance of

the de novo assembly pipeline. Sequence assemblers are chosen predominantly

based on the short read characteristics. Highly accurate short reads of smaller

length (<200 bases) are typically assembled using the de Bruijn graph based

approaches [4,5]. As the short reads get longer and data is loaded with sequencing

errors, overlap-based approaches like Overlap Layout Consensus (OLC) [6] and

string-graphs [7,8] are preferred. To apply de Bruijn graphs for longer reads,

a pre-processing stage is used in the assembly pipeline to correct the errors in


The de Bruijn graph based methods avoid the computational complexity of

extracting overlaps among billions of short reads and accelerate the assembly

process. The OLC and string-graph based methods use highly memory-efficient

data structures and compressed storage formats which allow for a highly scalable

assembler. Attempts have been made to accelerate various stages of the assembly

pipeline on reconfigurable and heterogeneous platforms, with profiling studies

and extensive design space exploration of the algorithmic implementations [9,10].

Nevertheless, all these approaches suffer from the need for heuristics-based and

ad hoc preprocessing involved in handling longer reads, error correction, memory

management and identifying contigs in repeat regions of genome. The large data

redundancy observed in billions of raw short read substrings, compounded by the

target genome length of billions of bases, requires a complex big data engineering


In this context, we present ReneGENE-Novo, a co-designed algorithm-architecture model for running de novo assembly of short reads. Our solution

computationally extends the small short reads to more accurate “readtigs” prior

to the assembly, as against conventional assemblers that use heuristic and ad

hoc corrections for length and errors. This helps in harvesting the full potential of short reads from NGS and de novo assembly. ReneGENE-Novo takes randomly presented short reads from NGS platforms and extends them iteratively and accurately to an appropriate length by identifying overlaps among them, aiding high-coverage assembly with minimal error rates. This task is parallelized across multiple processes, to allow parallel read assembly with performance scalability. Supported by parallel algorithms, multi-dimensional data structures and fine-grain synchronization, the module realises irregular computing for de novo assembly. The extended readtigs can then be subjected to normal assembly using conventional assemblers. ReneGENE-Novo also serves to increase the coverage in the comparative genomics pipeline by extending the unaligned reads which were left aside by the short read mapping algorithms while aligning them against a reference genome. The unaligned reads are extended to readtigs alongside the aligned reads, which allows the large gaps in an otherwise poor alignment to be covered.

Fig. 2. ReneGENE-Novo workflow



The ReneGENE-Novo module, shown in Fig. 2, works on the principle of identifying overlap among the incoming reads presented at random, by a novel and

parallel scheme, thereby generating readtigs or extended reads in a deterministic number of cycles and operations, as captured in Algorithm 1. An overlap

is said to occur, when a seed/partially grown readtig has substrings which are

computationally similar to the substrings of the incoming read, over regions spanning the string boundaries.

Algorithm 1
// Purpose: de novo assembly function to generate readtigs
// Input: Input short reads R in fastq format
// Output: Assembled Readtigs C in fastq format
-------------------------------------------
Partition the short reads R into Sn read sets
for each read set Si in Sn do
    Load each Mi in Mn readtig maps with reads from Si to form readtig seeds
end for
for each readtig map Mi in Mn parallel maps do
    for each read r in R in the forward direction do
        if r overlaps with a readtig seed from Mi then
            Assemble r with the seed
        end if
    end for
    for each read r in R in reverse direction do
        if r overlaps with a readtig seed from Mi then
            Assemble r with the seed
        end if
    end for
end for
for each readtig map Mi in Mn do
    Merge contents to form readtig set C
end for

Once an overlap is identified, the non-overlapping regions and the overlapped segments are concatenated to enable readtig growth.

For example, for the seed AAAATGCA (length = 8) and the incoming read

string TGCAGGGG (length = 8), the readtig can be constructed as AAAATGCAGGGG (length = 12). The overlap criteria (extent of overlap and mismatches within the overlapped string region), applied across iterations of readtig growth, determine the length of the readtig.
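The seed and read example above can be reproduced with a small suffix-prefix overlap routine. The following is a minimal Python sketch, not the authors' implementation: the function names `find_overlap` and `extend_seed` and the parameters `min_overlap` and `max_mismatches` are hypothetical, and the sketch assumes that the overlap is searched from the longest candidate downwards and that the seed keeps its own bases inside the overlapped region.

```python
from typing import Optional


def find_overlap(seed: str, read: str, min_overlap: int, max_mismatches: int = 0) -> int:
    """Length of the longest suffix of `seed` that matches a prefix of `read`
    with at most `max_mismatches` mismatches; 0 if no overlap of at least
    `min_overlap` bases exists."""
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    longest = min(len(seed), len(read))
    for length in range(longest, min_overlap - 1, -1):
        suffix = seed[len(seed) - length:]
        prefix = read[:length]
        mismatches = sum(a != b for a, b in zip(suffix, prefix))
        if mismatches <= max_mismatches:
            return length
    return 0


def extend_seed(seed: str, read: str, min_overlap: int, max_mismatches: int = 0) -> Optional[str]:
    """Concatenate the seed with the non-overlapping tail of the read.
    Returns None when the overlap criteria are not met.
    Assumption: the seed's bases are kept inside the overlapped region."""
    overlap = find_overlap(seed, read, min_overlap, max_mismatches)
    if overlap == 0:
        return None
    return seed + read[overlap:]


if __name__ == "__main__":
    # Seed and read taken from the example in the text.
    print(extend_seed("AAAATGCA", "TGCAGGGG", min_overlap=4))
```

Raising `min_overlap` or lowering `max_mismatches` makes the criteria stricter, so fewer reads qualify and the readtig grows less; this is the sense in which the overlap criteria determine the readtig length.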

Here, a single read can share a similar overlap relationship with several of

its sequence neighbours, resulting in a single seed growing into many tangible

readtigs. This is again decided at run time and hence the computations are

clearly irregular due to the irregularity in the relationships among the input data

sets. To accommodate the readtigs that grow on the fly, the de novo assembly

module implements dynamically growing multi-dimensional data structures cast

in the map-reduce framework, hence allowing a parallel deployment.
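The flow of Algorithm 1 can be illustrated with a small sequential Python sketch. It is an illustration under stated assumptions, not the ReneGENE-Novo code: the names `overlap_len`, `partition`, `grow_map` and `assemble` are hypothetical, overlaps are exact, the forward pass is assumed to extend a seed at its right end and the reverse pass at its left end, and each seed grows into a single readtig (the first qualifying overlap is taken) rather than branching into several.

```python
from typing import Dict, List


def overlap_len(left: str, right: str, min_overlap: int) -> int:
    """Longest exact overlap between a suffix of `left` and a prefix of
    `right` that is at least `min_overlap` long, else 0."""
    for length in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def partition(reads: List[str], n_maps: int) -> List[Dict[int, str]]:
    """Distribute the reads round-robin over n_maps readtig maps.
    Every read becomes a seed, keyed by its index in `reads`."""
    maps: List[Dict[int, str]] = [dict() for _ in range(n_maps)]
    for idx, read in enumerate(reads):
        maps[idx % n_maps][idx] = read
    return maps


def grow_map(readtig_map: Dict[int, str], reads: List[str], min_overlap: int) -> Dict[int, str]:
    """Grow every seed of one map against all reads (map step)."""
    grown: Dict[int, str] = {}
    for seed_id, readtig in readtig_map.items():
        for read in reads:  # forward pass: extend at the right end
            k = overlap_len(readtig, read, min_overlap)
            if k:
                readtig = readtig + read[k:]
        for read in reversed(reads):  # reverse pass: extend at the left end
            k = overlap_len(read, readtig, min_overlap)
            if k:
                readtig = read[:len(read) - k] + readtig
        grown[seed_id] = readtig
    return grown


def assemble(reads: List[str], n_maps: int, min_overlap: int) -> List[str]:
    """Partition, grow each map independently, then merge (reduce step)."""
    maps = partition(reads, n_maps)
    # The maps do not share state, so this loop could run in parallel.
    grown_maps = [grow_map(m, reads, min_overlap) for m in maps]
    merged: Dict[int, str] = {}
    for m in grown_maps:
        merged.update(m)
    return [merged[i] for i in range(len(reads))]
```

Because the maps are independent, the list comprehension in `assemble` is the natural place to distribute work across processes; only the final merge needs all results.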

The readtigs, coming from the ReneGENE-Novo module, are now accurately

extended forms of the short reads, combined with the base quality information of

the reads. The accuracy of readtig growth stems from the fact that the two reads

are expected to have absolutely matching substrings over the overlap distance.

The overlap distance decides the extent and rate of growth during the initial

rounds of readtig growth. Once a reasonable number of reads have participated

in the initial rounds, an iterative growth among the already formed readtigs and

the residual reads would result in a set of final readtigs long enough to advance to contig growth. The overlap distance is a function of the read length. This value is provided as a user input, where the user can choose to start from a lower overlap distance and close with a larger one, or vice versa.
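The effect of a user-supplied sequence of overlap distances can be sketched as follows. This is a hypothetical illustration in Python: `longest_overlap`, `grow_with_schedule` and the `schedule` parameter are assumed names, each overlap distance is interpreted as a minimum exact overlap, and merging is greedy (the first qualifying pair is joined), none of which is specified in the text.

```python
from typing import List, Sequence


def longest_overlap(a: str, b: str, min_overlap: int) -> int:
    """Longest exact suffix(a)/prefix(b) overlap of at least min_overlap, else 0."""
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def grow_with_schedule(fragments: Sequence[str], schedule: Sequence[int]) -> List[str]:
    """Merge fragments round by round, one round per overlap distance.
    Within a round, pairs are merged greedily until no pair qualifies."""
    frags = list(fragments)
    for min_overlap in schedule:
        merged = True
        while merged:
            merged = False
            for i in range(len(frags)):
                for j in range(len(frags)):
                    if i == j:
                        continue
                    k = longest_overlap(frags[i], frags[j], min_overlap)
                    if k:
                        joined = frags[i] + frags[j][k:]
                        # Replace the two merged fragments by their union.
                        frags = [f for idx, f in enumerate(frags) if idx not in (i, j)]
                        frags.append(joined)
                        merged = True
                        break
                if merged:
                    break
    return frags
```

In this sketch, a schedule such as `[4, 3]` first joins fragments that share at least four bases and only then admits the weaker three-base overlaps, whereas `[4]` alone leaves the weakly overlapping fragment unmerged.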

The final set of readtigs can now be assembled using any conventional read assembler such as Velvet, leading to more accurate contigs and scaffolds. Since the growth of each readtig is independent of the others, this provides a good coverage

of the short reads from the repeat region of the genome, thus preventing purging

of valid reads that can give rise to contigs for target repeats. While conventional

assemblers tend to work only with unique seeds and unique contigs from seeds,

ReneGENE-Novo allows every read to hold the status of the seed during the

readtig growth stage. This allows all reads to grow to readtigs and hence to

contigs that encompass the repeat regions.



Prototypes and Results

Experimental Setup

The prototype for the ReneGENE-Novo was deployed on three different platforms, as detailed in Table 1. The first platform is a workstation, with an Intel

Core i7-4770 based 8-core processor and 32 GB of system memory. The second

platform is an accelerator platform based on the Intel Xeon E5620 host processor, supported by multiple Xilinx Virtex-6 6vlx550tff1759-2 FPGAs, each capable of hosting a maximum of 256 instances of ReneGENE-Novo hardware models, developed and implemented in Verilog HDL. The inherent parallel nature

of the reconfigurable hardware provides additional room for parallelizing the

already parallel Novo algorithm. The third platform is Cray XC40, a hybrid High Performance Computing (HPC/supercomputing) system, configured with Intel’s Xeon-Phi 5120D (KNC) based cards. Each platform comes with its own set of compiler-specific and architecture specifications, which have their effects on the related performance numbers.

Table 1. ReneGENE-Novo platform details

Platforms: P1: Workstation; P2: Reconfigurable Hardware Accelerators; P3: Cray XC40
C++/Verilog HDL
GNU 4.8.2; Intel 4.8.1
MPICH 3.1.4; MPICH 3.1.4; Cray MPICH 7.4.3
Intel Core i7-4770; Intel Xeon E5620; Intel Xeon-Phi Coprocessor 5120D (KNC) and host CPU of Intel Xeon Ivybridge E5-2695 v2
Number of nodes: 1
32 GB; 48 GB; 64 GB host memory, 8 GB co-processor memory
Proprietary Cray Aries Interconnect with Dragonfly


ReneGENE-Novo Test Data

To verify the correctness and analyse the performance scaling of ReneGENE-Novo, we have run experiments on a non-random short read set synthetically

derived from the reference genome of the organism E. coli. The reads are derived

contiguously at various read lengths, ranging through 32, 64, 128, 256, 512,

1024 and 2048 bases. If the reads are extended accurately, the extended readtigs

are expected to grow to the full length of the reference from which the reads

are derived, provided an overlap exists for all reads. This check ensures the

functional correctness of the algorithm. The choice of the read length is expected

to influence the performance of the prototypes. The performance of the module

is compared for the various implementations as shown in Table 1.
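The correctness check described above can be mimicked on a toy scale. The sketch below is an assumption-laden illustration in Python, not the authors' test harness: it uses a random string in place of the E. coli reference, assumes that "derived contiguously" means fixed-length windows taken at a fixed step with the tail covered by one final window, and uses the hypothetical names `random_reference`, `derive_reads` and `extend_in_order`.

```python
import random
from typing import List


def random_reference(length: int, seed: int = 0) -> str:
    """Stand-in for a reference genome: a reproducible random ACGT string."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


def derive_reads(reference: str, read_length: int, step: int) -> List[str]:
    """Cut fixed-length reads at contiguous positions `step` bases apart.
    A final read is added so that the end of the reference is covered."""
    if read_length > len(reference) or step < 1:
        raise ValueError("read_length must fit the reference and step must be positive")
    last_start = len(reference) - read_length
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)
    return [reference[s:s + read_length] for s in starts]


def extend_in_order(reads: List[str], min_overlap: int) -> str:
    """Extend the first read with each following read using the longest
    exact suffix-prefix overlap; fail if a read has no qualifying overlap."""
    readtig = reads[0]
    for read in reads[1:]:
        for k in range(min(len(readtig), len(read)), min_overlap - 1, -1):
            if readtig.endswith(read[:k]):
                readtig += read[k:]
                break
        else:
            raise ValueError("read does not overlap the growing readtig")
    return readtig


if __name__ == "__main__":
    reference = random_reference(2000)
    for read_length in (32, 64, 128):
        reads = derive_reads(reference, read_length, step=read_length // 2)
        readtig = extend_in_order(reads, min_overlap=read_length // 2)
        print(read_length, len(reads), readtig == reference)
```

The check passes only when the readtig reproduces the reference in full, which mirrors the condition in the text that an overlap must exist for all reads.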


ReneGENE-Novo: Measure of Accuracy and Performance on Platform P1

The performance comparison of the prototype for ReneGENE-Novo on platform P1, for various read length options, is compiled in Table 2. As mentioned

in Sect. 2, ReneGENE-Novo validates the accuracy of the readtigs by allowing all the reads to become seeds and then grow independently to readtigs.

The performance numbers in this paper account for the computations done by

ReneGENE-Novo to not only grow the seeds to form readtigs, but to also validate their accuracy. All the seeds in ReneGENE-Novo grow to the same readtig,

as the seeds are extracted from contiguous locations of the same genome. This

essentially means that we have successfully achieved accurate readtig growth.

ReneGENE-Novo generates readtigs of the same length and coverage.

Table 2. ReneGENE-Novo performance on P1

Columns: Read length; ReneGENE-Novo (single process): time in seconds; ReneGENE-Novo (8 processes): time in; number of readtigs


Fig. 3. The ReneGENE-Novo hardware


ReneGENE-Novo Performance Analysis on Platform P2

ReneGENE-Novo was implemented on a reconfigurable hardware platform,

with multiple FPGAs. The prototype was modelled in Verilog HDL, embedded

within the acceleration framework. The application workflow is shown in Fig. 3.

Coded in a multithreaded fashion, the Novo firmware runs on the host system. The firmware implements the de novo assembly workflow in Fig. 1, with

the support of the accelerator hardware consisting of the Xilinx 6vlx550tff1759-2 Virtex-6 devices. The firmware is built around custom APIs, supported by

stand-alone libraries for ReneGENE-Novo. These APIs help in setting up and

maintaining a streaming interface with the hardware. The firmware hosts drivers

for the scalable Novo hardware on the reconfigurable platform. These drivers help in

initializing and setting up the hardware, with application and algorithm parameters. The Novo hardware abstraction layer interacts with hardware during configuration, and also participates in control and data transfers during assembly

run time.

Table 3. ReneGENE-Novo configuration and occupancy on P2

Feature utilization   8 units                    70 units                     128 units
Slice registers       27756 out of 687360 (4%)   83137 out of 687360 (12%)    213082 out of 687360 (31%)
Slice LUTs            16247 out of 343680 (4%)   124908 out of 343680 (36%)   292128 out of 343680 (85%)
Bonded IOBs           34 out of 840 (4%)         34 out of 840 (4%)           34 out of 840 (4%)
                      63 out of 632 (9%)         63 out of 632 (9%)           63 out of 632 (9%)

As seen in Table 3, three different configurations were realized on the hardware, with 8, 70 and 128 parallel instances of the ReneGENE-Novo module, with
