TY - JOUR
T1 - Discovering the unknown
T2 - Improving detection of novel species and genera from short reads
AU - Rosen, Gail L.
AU - Polikar, Robi
AU - Caseiro, Diamantino A.
AU - Essinger, Steven D.
AU - Sokhansanj, Bahrad A.
PY - 2011
Y1 - 2011
N2 - High-throughput sequencing technologies enable metagenome profiling, simultaneous sequencing of multiple microbial species present within an environmental sample. Since metagenomic data includes sequence fragments (reads) from organisms that are absent from any database, new algorithms must be developed for the identification and annotation of novel sequence fragments. Homology-based techniques have been modified to detect novel species and genera, but composition-based methods have not been adapted. We develop a detection technique that can discriminate between known and unknown taxa, which can be used with composition-based methods, as well as a hybrid method. Unlike previous studies, we rigorously evaluate all algorithms for their ability to detect novel taxa. First, we show that the integration of a detector with a composition-based method performs significantly better than homology-based methods for the detection of novel species and genera, with best performance at finer taxonomic resolutions. Most importantly, we evaluate all the algorithms by introducing an unknown class and show that the modified version of PhymmBL has similar or better overall classification performance than the other modified algorithms, especially for the species-level and ultrashort reads. Finally, we evaluate the performance of several algorithms on a real acid mine drainage dataset.
AB - High-throughput sequencing technologies enable metagenome profiling, simultaneous sequencing of multiple microbial species present within an environmental sample. Since metagenomic data includes sequence fragments (reads) from organisms that are absent from any database, new algorithms must be developed for the identification and annotation of novel sequence fragments. Homology-based techniques have been modified to detect novel species and genera, but composition-based methods have not been adapted. We develop a detection technique that can discriminate between known and unknown taxa, which can be used with composition-based methods, as well as a hybrid method. Unlike previous studies, we rigorously evaluate all algorithms for their ability to detect novel taxa. First, we show that the integration of a detector with a composition-based method performs significantly better than homology-based methods for the detection of novel species and genera, with best performance at finer taxonomic resolutions. Most importantly, we evaluate all the algorithms by introducing an unknown class and show that the modified version of PhymmBL has similar or better overall classification performance than the other modified algorithms, especially for the species-level and ultrashort reads. Finally, we evaluate the performance of several algorithms on a real acid mine drainage dataset.
UR - http://www.scopus.com/inward/record.url?scp=79959326732&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79959326732&partnerID=8YFLogxK
U2 - 10.1155/2011/495849
DO - 10.1155/2011/495849
M3 - Article
C2 - 21541181
AN - SCOPUS:79959326732
SN - 2314-6133
VL - 2011
JO - BioMed Research International
JF - BioMed Research International
M1 - 495849
ER -