TY - JOUR
T1 - Information-theoretic approaches to SVM feature selection for metagenome read classification
AU - Garbarine, Elaine
AU - Depasquale, Joseph
AU - Gadia, Vinay
AU - Polikar, Robi
AU - Rosen, Gail
N1 - Funding Information:
This work was supported by the National Science Foundation CAREER award #0845827 and DOE award DE-SC0004335.
PY - 2011/6
Y1 - 2011/6
N2 - Analysis of DNA sequences isolated directly from the environment, known as metagenomics, produces a large quantity of genome fragments that need to be classified into specific taxa. Most composition-based classification methods use all features instead of a subset of features that may maximize classifier accuracy. We show that feature selection methods can boost the performance of taxonomic classifiers. This work proposes three filter-based feature selection methods that stem from information theory: (1) a technique that combines Kullback-Leibler divergence, mutual information, and distance information, (2) a text-mining technique, term frequency-inverse document frequency (TF-IDF), and (3) minimum redundancy maximum relevance (mRMR). The feature selection methods are compared by how well they improve support vector machine classification of genomic reads. Overall, the 6-mer mRMR method performs well, especially at the phylum level. If the total number of features is very large, feature selection becomes difficult because a small subset of features that captures a majority of the data variance is less likely to exist. We therefore conclude that there is a trade-off between feature set size and feature selection method when optimizing classification performance. For larger feature set sizes, TF-IDF works better at finer resolutions, while mRMR performs best of any method for N = 6 at all taxonomic levels.
AB - Analysis of DNA sequences isolated directly from the environment, known as metagenomics, produces a large quantity of genome fragments that need to be classified into specific taxa. Most composition-based classification methods use all features instead of a subset of features that may maximize classifier accuracy. We show that feature selection methods can boost the performance of taxonomic classifiers. This work proposes three filter-based feature selection methods that stem from information theory: (1) a technique that combines Kullback-Leibler divergence, mutual information, and distance information, (2) a text-mining technique, term frequency-inverse document frequency (TF-IDF), and (3) minimum redundancy maximum relevance (mRMR). The feature selection methods are compared by how well they improve support vector machine classification of genomic reads. Overall, the 6-mer mRMR method performs well, especially at the phylum level. If the total number of features is very large, feature selection becomes difficult because a small subset of features that captures a majority of the data variance is less likely to exist. We therefore conclude that there is a trade-off between feature set size and feature selection method when optimizing classification performance. For larger feature set sizes, TF-IDF works better at finer resolutions, while mRMR performs best of any method for N = 6 at all taxonomic levels.
UR - http://www.scopus.com/inward/record.url?scp=79959749487&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79959749487&partnerID=8YFLogxK
U2 - 10.1016/j.compbiolchem.2011.04.007
DO - 10.1016/j.compbiolchem.2011.04.007
M3 - Article
C2 - 21704267
AN - SCOPUS:79959749487
SN - 1476-9271
VL - 35
SP - 199
EP - 209
JO - Computational Biology and Chemistry
JF - Computational Biology and Chemistry
IS - 3
ER -
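
Note: the sketch below is not part of the cited record or the authors' implementation. It is a rough, hypothetical illustration of the kind of pipeline the abstract describes: N-mer composition features extracted from reads, a simple mutual-information filter standing in for mRMR (true mRMR also penalizes redundancy among selected features), and a linear SVM trained on the selected features with scikit-learn. The reads, labels, N = 3, and k = 8 are invented for illustration only.

from itertools import product

import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.svm import LinearSVC

N = 3  # N-mer length; the paper evaluates several sizes, up to N = 6
KMERS = ["".join(p) for p in product("ACGT", repeat=N)]
INDEX = {k: i for i, k in enumerate(KMERS)}

def kmer_profile(read):
    """Return the normalized N-mer frequency vector of one read."""
    counts = np.zeros(len(KMERS))
    for i in range(len(read) - N + 1):
        idx = INDEX.get(read[i:i + N])
        if idx is not None:
            counts[idx] += 1
    total = counts.sum()
    return counts / total if total else counts

# Toy reads and taxon labels, purely for illustration.
reads = ["ACGTACGTGGAT", "TTGGCCAATTGG", "ACGTTTACGTAC",
         "GGCCGGCCAATT", "TACGTACGATGG", "CCAATTGGCCTT"]
labels = np.array([0, 1, 0, 1, 0, 1])
X = np.vstack([kmer_profile(r) for r in reads])

# Filter-style selection: keep the k N-mers with the highest mutual
# information with the class label (a simplification of mRMR).
k = 8
mi = mutual_info_classif(X, labels, random_state=0)
selected = np.argsort(mi)[-k:]

# Train a linear SVM on the reduced feature set.
clf = LinearSVC().fit(X[:, selected], labels)
print("selected N-mers:", [KMERS[i] for i in selected])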