Scalable de Novo Genome Assembly Using a Pregel-Like Graph-Parallel System

Guimu Guo, Hongzhi Chen, Da Yan, James Cheng, Jake Y. Chen, Zechen Chong

Research output: Contribution to journal › Article › peer-review

2 Scopus citations


De novo genome assembly is the process of stitching short DNA sequences together to generate longer DNA sequences, without using any reference sequence for alignment. It enables high-throughput genome sequencing and thus accelerates the discovery of new genomes. In this paper, we present a toolkit, called PPA-Assembler, for de novo genome assembly in a distributed setting. The operations in our toolkit provide strong performance guarantees and can be assembled to implement various sequencing strategies. PPA-Assembler adopts the popular de Bruijn graph-based approach for sequencing, and each operation is implemented as a program in Google's Pregel framework, which can be easily deployed in a generic cluster. Experiments on large real and simulated datasets demonstrate that PPA-Assembler is much more efficient than the state of the art while providing comparable sequencing quality. PPA-Assembler has been open-sourced at
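As a rough illustration of the de Bruijn graph idea mentioned in the abstract, the sketch below builds a graph from short reads and walks an unambiguous path to recover a longer sequence. It is a single-machine toy, not PPA-Assembler's distributed Pregel implementation; the function names `build_de_bruijn` and `unambiguous_contig`, the example reads, and the choice of k are all assumptions made for illustration.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer contributes one edge: its prefix -> its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def unambiguous_contig(graph, start):
    """Extend `start` while the path has exactly one way forward and one way in."""
    # Count incoming edges so that merge points stop the walk.
    indegree = defaultdict(int)
    for successors in graph.values():
        for succ in successors:
            indegree[succ] += 1

    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if indegree[nxt] != 1 or nxt in visited:
            break  # branch, merge, or cycle: stop extending
        contig += nxt[-1]  # each step adds one new base
        visited.add(nxt)
        node = nxt
    return contig


# Hypothetical reads that overlap by four bases.
reads = ["ATGGCG", "GGCGTA"]
graph = build_de_bruijn(reads, k=4)
print(unambiguous_contig(graph, "ATG"))  # prints ATGGCGTA
```

In a vertex-centric system such as Pregel, each (k-1)-mer would be a vertex and path merging would proceed through message passing between neighbors rather than a sequential walk; the sequential loop here only conveys the underlying graph logic.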

Original language: English (US)
Article number: 8731736
Pages (from-to): 731-744
Number of pages: 14
Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics
Issue number: 2
State: Published - Mar 1 2021
Externally published: Yes

All Science Journal Classification (ASJC) codes

  • Biotechnology
  • Genetics
  • Applied Mathematics

