TY - GEN
T1 - Incremental and Semi-Supervised Learning of 16S-rRNA Genes for Taxonomic Classification
AU - Ozdogan, Emrecan
AU - Sabin, Norman C.
AU - Gracie, Thomas
AU - Portley, Steven
AU - Halac, Mali
AU - Coard, Thomas
AU - Trimble, William
AU - Sokhansanj, Bahrad
AU - Rosen, Gail
AU - Polikar, Robi
N1 - Publisher Copyright: © 2021 IEEE.
PY - 2021
Y1 - 2021
N2 - Genome sequencing generates large volumes of data and hence requires ever greater computational resources. The growing data problem is even more acute in metagenomics applications, where data from an environmental sample include many organisms rather than the single organism of conventional single-organism sequencing. Traditional taxonomic classification and clustering approaches and platforms, while designed to be computationally efficient, cannot incrementally update a previously trained system when new data arrive, and therefore require complete retraining on the augmented (old plus new) data. Such complete retraining is inefficient and leads to poor utilization of computational resources. The ability to update a classification system with only new data offers a much lower run-time as new data are presented and does not require retraining on the entire previous dataset. In this paper, we propose Incremental VSEARCH (I-VSEARCH) and its semi-supervised version for taxonomic classification, as well as a threshold-independent VSEARCH (TI-VSEARCH), as wrappers around VSEARCH, a well-established (unsupervised) clustering algorithm for metagenomics. We show, on a 16S rRNA gene dataset, that I-VSEARCH, running incrementally only on the new batches of data that become available over time, does not lose any accuracy relative to VSEARCH run on the full data, while providing attractive computational benefits.
UR - http://www.scopus.com/inward/record.url?scp=85125795556&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85125795556&partnerID=8YFLogxK
U2 - 10.1109/SSCI50451.2021.9660093
DO - 10.1109/SSCI50451.2021.9660093
M3 - Conference contribution
AN - SCOPUS:85125795556
T3 - 2021 IEEE Symposium Series on Computational Intelligence, SSCI 2021 - Proceedings
BT - 2021 IEEE Symposium Series on Computational Intelligence, SSCI 2021 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2021 IEEE Symposium Series on Computational Intelligence, SSCI 2021
Y2 - 5 December 2021 through 7 December 2021
ER -