
EnSVMB: Metagenomics Fragments Classification using Ensemble SVM and BLAST
Author(s) -
Yuan Jiang,
Jun Wang,
Dawen Xia,
Guoxian Yu
Publication year - 2017
Publication title -
Scientific Reports
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.24
H-Index - 213
ISSN - 2045-2322
DOI - 10.1038/s41598-017-09947-y
Subject(s) - metagenomics , support vector machine , false positive paradox , computer science , set (abstract data type) , pattern recognition (psychology) , sensitivity (control systems) , task (project management) , data mining , training set , artificial intelligence , biology , engineering , biochemistry , systems engineering , electronic engineering , gene , programming language
Metagenomics brings new discoveries and insights into the uncultured microbial world. One fundamental task in metagenomics analysis is to determine the taxonomy of raw sequence fragments. Modern sequencing technologies produce relatively short fragments and greatly increase their number, which makes taxonomic classification considerably more difficult than before. Fast and accurate techniques are therefore needed to classify large-scale fragment sets. We propose EnSVM (Ensemble Support Vector Machine) and its advanced variant EnSVMB (EnSVM with BLAST) to accurately classify fragments. EnSVM divides fragments into a large confident set and a small diffident set, based on whether the fragments get consistent or inconsistent predictions from linear SVMs trained with different k-mers. An empirical study shows that the sensitivity and specificity of EnSVM on the confident set are higher than 90% and 97%, respectively, but on the diffident set they are lower than 60% and 75%. To further improve performance on the diffident set, EnSVMB uses the best hits of BLAST to reclassify the fragments in that set. Experimental results show that EnSVM can efficiently and effectively divide fragments into confident and diffident sets, and that EnSVMB achieves higher accuracy, higher sensitivity and more true positives than related state-of-the-art methods, while holding specificity comparable with the best of them.
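The confident/diffident split described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (kmer_profile, split_by_agreement, reclassify), the fragment IDs and the taxon labels are hypothetical. The per-k SVM predictions and the BLAST best hits are supplied as plain dictionaries instead of being computed. The majority-vote fallback for diffident fragments without a BLAST hit is an assumption, since the abstract does not say how such fragments are handled.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k):
    """Return normalized k-mer frequencies of seq over the ACGT alphabet.

    One such profile per value of k would serve as the feature vector
    for the linear SVM trained on that k (hypothetical helper).
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers) or 1  # avoid division by zero
    return [counts[m] / total for m in kmers]


def split_by_agreement(predictions):
    """Split fragments by whether all k-specific classifiers agree.

    predictions maps a fragment ID to a list of labels, one per classifier.
    Returns (confident, diffident): a dict of agreed labels and a list of
    fragment IDs with inconsistent predictions.
    """
    confident, diffident = {}, []
    for frag_id, labels in predictions.items():
        if len(set(labels)) == 1:
            confident[frag_id] = labels[0]
        else:
            diffident.append(frag_id)
    return confident, diffident


def reclassify(diffident, best_hits, predictions):
    """Relabel diffident fragments from (precomputed) BLAST best hits.

    Assumption: a fragment with no hit falls back to the majority vote
    of its classifier labels; the source does not specify this case.
    """
    final = {}
    for frag_id in diffident:
        if frag_id in best_hits:
            final[frag_id] = best_hits[frag_id]
        else:
            final[frag_id] = Counter(predictions[frag_id]).most_common(1)[0][0]
    return final


# Toy input: three classifiers (e.g. three values of k) per fragment.
predictions = {
    "f1": ["Escherichia", "Escherichia", "Escherichia"],
    "f2": ["Escherichia", "Bacillus", "Escherichia"],
    "f3": ["Bacillus", "Bacillus", "Escherichia"],
}
best_hits = {"f2": "Bacillus"}  # stand-in for parsed BLAST output

confident, diffident = split_by_agreement(predictions)
final = {**confident, **reclassify(diffident, best_hits, predictions)}
```

In this toy input, f1 lands in the confident set because all three classifiers agree, while f2 and f3 go to the diffident set. Only the diffident fragments are passed to the BLAST-based step, so the slower alignment is run on a small part of the data.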