TY - JOUR
T1 - Scalable de Novo Genome Assembly Using a Pregel-Like Graph-Parallel System
AU - Guo, Guimu
AU - Chen, Hongzhi
AU - Yan, Da
AU - Cheng, James
AU - Chen, Jake Y.
AU - Chong, Zechen
N1 - Funding Information:
The research of Guimu Guo and Da Yan is supported by NSF OAC 1755464 and NSF DGE 1723250. The research of Hongzhi Chen and James Cheng is supported by ITF 6904945 and GRF 14222816. The research of Jake Chen is supported by NIH/NCATS U54TR002731 and NCI/NIH/DHHS U01CA223976. The research of Zechen Chong is supported by NIMHD U54MD000502, NHGRI 3U01HG007301-06S1, and AHA 17IF33890015.
Publisher Copyright:
© 2004-2012 IEEE.
PY - 2021/3/1
Y1 - 2021/3/1
N2 - De novo genome assembly is the process of stitching short DNA sequences to generate longer DNA sequences, without using any reference sequence for alignment. It enables high-throughput genome sequencing and thus accelerates the discovery of new genomes. In this paper, we present a toolkit, called PPA-Assembler, for de novo genome assembly in a distributed setting. The operations in our toolkit provide strong performance guarantees, and can be assembled to implement various sequencing strategies. PPA-Assembler adopts the popular de Bruijn graph-based approach for sequencing, and each operation is implemented as a program in Google's Pregel framework, which can be easily deployed in a generic cluster. Experiments on large real and simulated datasets demonstrate that PPA-Assembler is much more efficient than state-of-the-art systems while providing comparable sequencing quality. PPA-Assembler has been open-sourced at https://github.com/yaobaiwei/PPA-Assembler.
AB - De novo genome assembly is the process of stitching short DNA sequences to generate longer DNA sequences, without using any reference sequence for alignment. It enables high-throughput genome sequencing and thus accelerates the discovery of new genomes. In this paper, we present a toolkit, called PPA-Assembler, for de novo genome assembly in a distributed setting. The operations in our toolkit provide strong performance guarantees, and can be assembled to implement various sequencing strategies. PPA-Assembler adopts the popular de Bruijn graph-based approach for sequencing, and each operation is implemented as a program in Google's Pregel framework, which can be easily deployed in a generic cluster. Experiments on large real and simulated datasets demonstrate that PPA-Assembler is much more efficient than state-of-the-art systems while providing comparable sequencing quality. PPA-Assembler has been open-sourced at https://github.com/yaobaiwei/PPA-Assembler.
UR - http://www.scopus.com/inward/record.url?scp=85104047012&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85104047012&partnerID=8YFLogxK
U2 - 10.1109/TCBB.2019.2920912
DO - 10.1109/TCBB.2019.2920912
M3 - Article
C2 - 31180898
AN - SCOPUS:85104047012
SN - 1545-5963
VL - 18
SP - 731
EP - 744
JO - IEEE/ACM Transactions on Computational Biology and Bioinformatics
JF - IEEE/ACM Transactions on Computational Biology and Bioinformatics
IS - 2
M1 - 8731736
ER -