TY - GEN
T1 - Semi-supervised and Incremental VSEARCH for Metagenomic Classification
AU - Ozdogan, Emrecan
AU - Fasino, Adriana
AU - Nguyen, Rachel
AU - Sokhansanj, Bahrad
AU - Rosen, Gail
AU - Polikar, Robi
N1 - Publisher Copyright: © 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - DNA sequencing of microbial communities from environmental samples generates large volumes of data, which can be analyzed using various bioinformatics pipelines. Unsupervised clustering algorithms are usually an early and critical step in an analysis pipeline, since much of such data is unlabeled, unstructured, or novel. However, curated reference databases that provide taxonomic label information are also growing, which can help in the classification of sequences, and not just clustering. In this contribution, we report on our progress in developing a semi-supervised approach for genomic clustering algorithms, such as U/VSEARCH. The primary contribution of this approach is the ability to recognize both previously seen and novel sequences using an incremental approach: for sequences whose examples were previously seen by the algorithm, the algorithm can predict a correct label. For previously unseen novel sequences, the algorithm assigns a temporary label and then updates that label with a permanent one if/when such a label is established in a future reference database. The incremental learning aspect of the proposed approach provides the additional capability to process the data continuously as new datasets become available. This functionality is notable because most sequence data processing platforms are static in nature, designed to run on a single batch of data; their only remedy for processing additional data is to combine the new and old data and rerun the entire analysis. We report our promising preliminary results on an extended 16S rRNA database.
AB - DNA sequencing of microbial communities from environmental samples generates large volumes of data, which can be analyzed using various bioinformatics pipelines. Unsupervised clustering algorithms are usually an early and critical step in an analysis pipeline, since much of such data is unlabeled, unstructured, or novel. However, curated reference databases that provide taxonomic label information are also growing, which can help in the classification of sequences, and not just clustering. In this contribution, we report on our progress in developing a semi-supervised approach for genomic clustering algorithms, such as U/VSEARCH. The primary contribution of this approach is the ability to recognize both previously seen and novel sequences using an incremental approach: for sequences whose examples were previously seen by the algorithm, the algorithm can predict a correct label. For previously unseen novel sequences, the algorithm assigns a temporary label and then updates that label with a permanent one if/when such a label is established in a future reference database. The incremental learning aspect of the proposed approach provides the additional capability to process the data continuously as new datasets become available. This functionality is notable because most sequence data processing platforms are static in nature, designed to run on a single batch of data; their only remedy for processing additional data is to combine the new and old data and rerun the entire analysis. We report our promising preliminary results on an extended 16S rRNA database.
UR - http://www.scopus.com/inward/record.url?scp=85147796781&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85147796781&partnerID=8YFLogxK
U2 - 10.1109/SSCI51031.2022.10022184
DO - 10.1109/SSCI51031.2022.10022184
M3 - Conference contribution
AN - SCOPUS:85147796781
T3 - Proceedings of the 2022 IEEE Symposium Series on Computational Intelligence, SSCI 2022
SP - 1119
EP - 1126
BT - Proceedings of the 2022 IEEE Symposium Series on Computational Intelligence, SSCI 2022
A2 - Ishibuchi, Hisao
A2 - Kwoh, Chee-Keong
A2 - Tan, Ah-Hwee
A2 - Srinivasan, Dipti
A2 - Miao, Chunyan
A2 - Trivedi, Anupam
A2 - Crockett, Keeley
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2022 IEEE Symposium Series on Computational Intelligence, SSCI 2022
Y2 - 4 December 2022 through 7 December 2022
ER -