
Feature Selection Approach for Solving Imbalanced Data Problem in Single Nucleotide Polymorphism Discovery
Author(s) - Rossy Nurhasanah, Lailan Sahrina Hasibuan, Wisnu Ananta Kusuma
Publication year - 2020
Publication title - Journal of Physics: Conference Series
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.21
H-Index - 85
eISSN - 1742-6596
pISSN - 1742-6588
DOI - 10.1088/1742-6596/1566/1/012035
Subject(s) - snp , single nucleotide polymorphism , feature selection , classifier (uml) , selection (genetic algorithm) , artificial intelligence , computer science , feature (linguistics) , computational biology , pattern recognition (psychology) , biology , data mining , machine learning , genetics , genotype , gene , linguistics , philosophy
Single Nucleotide Polymorphism (SNP) is a type of molecular marker that underlies phenotypic variation between individuals of a species. In recent years, SNPs have been widely used in many fields, for instance in designing precision medicine for humans and in developing superior cultivars in plant breeding. The main challenge in SNP discovery is the imbalanced distribution of data between classes: true SNPs are far fewer than false SNPs. While the benefit of feature selection in classification problems has been widely reported, its use in addressing the class imbalance problem remains an open research topic. In this study, we selected the features that contribute most to identifying SNPs using the Feature Assessment by Sliding Thresholds (FAST) method. FAST evaluates the contribution of each feature to SNP identification based on its Area Under the ROC Curve (AUC) value. SNP identification using the 4 best features improved classifier performance in terms of G-Means compared with using 24 features. In addition, feature selection can reduce computation time and save computational resources.
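
The two quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`feature_auc`, `rank_features`, `g_mean`), the equal-frequency placement of thresholds, and the default of 10 bins are assumptions made for illustration. Each feature is treated as a one-dimensional classifier, its ROC curve is traced by sliding a threshold over the feature values, and features are ranked by the resulting AUC. G-Means is computed as the geometric mean of sensitivity and specificity.

```python
import numpy as np

def feature_auc(x, y, n_bins=10):
    """Approximate AUC of one feature used alone as a classifier.

    Assumption: thresholds sit at equal-frequency quantiles of the feature,
    and both classes are present in y.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y).astype(bool)
    # Interior quantiles only; the ROC end points are added explicitly below.
    thresholds = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    n_pos, n_neg = y.sum(), (~y).sum()
    tpr, fpr = [0.0], [0.0]
    # Lowering the threshold labels more samples as positive.
    for t in np.sort(thresholds)[::-1]:
        pred = x > t
        tpr.append((pred & y).sum() / n_pos)
        fpr.append((pred & ~y).sum() / n_neg)
    tpr.append(1.0)
    fpr.append(1.0)
    tpr, fpr = np.array(tpr), np.array(fpr)
    # Trapezoidal area under the piecewise-linear ROC curve.
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    # A feature that separates the classes in the opposite direction is equally useful.
    return max(auc, 1.0 - auc)

def rank_features(X, y, k):
    """Return the column indices of the k features with the highest AUC."""
    X = np.asarray(X, dtype=float)
    aucs = np.array([feature_auc(X[:, j], y) for j in range(X.shape[1])])
    return [int(j) for j in np.argsort(aucs)[::-1][:k]]

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    sensitivity = (y_pred & y_true).sum() / y_true.sum()
    specificity = (~y_pred & ~y_true).sum() / (~y_true).sum()
    return float(np.sqrt(sensitivity * specificity))
```

Under these assumptions, `rank_features(X, y, k=4)` would return four column indices of a 24-column feature matrix `X`, and `g_mean` would then score a classifier trained on those columns. Because G-Means multiplies the per-class recalls, a classifier that ignores the minority class scores zero regardless of its overall accuracy, which is why the metric suits imbalanced data.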