A machine learning method for selection of genetic variants to increase prediction accuracy of type 2 diabetes mellitus using sequencing data
Author(s) - Jung Luann C., Wang Haiyan, Li Xukun, Wu Cen
Publication year - 2020
Publication title - Statistical Analysis and Data Mining: The ASA Data Science Journal
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.381
H-Index - 33
eISSN - 1932-1872
pISSN - 1932-1864
DOI - 10.1002/sam.11456
Subject(s) - feature selection, classifier (uml), computer science, single nucleotide polymorphism, artificial intelligence, logistic regression, population, random forest, machine learning, computational biology, genetics, biology, genotype, medicine, gene, environmental health
Type 2 diabetes mellitus (T2DM) affects millions of people through its life‐altering complications. Worldwide, 3.4 million people die of diabetes annually. Studies of the effect of genetic polymorphisms on T2DM have been hampered by limited sample sizes. A 2016 Nature Reviews article concluded that the accuracy of predicting future type 2 diabetes from genetic polymorphisms is very low at the population level. Innumerable associations between genes, environmental factors, and type 2 diabetes remain to be discovered. This research presents a method that identifies subtle effects of genetic variants using whole genome sequencing data and improves the prediction accuracy of T2DM at the population level. To achieve this, a new feature selection procedure and a new classifier are proposed. The method involves (a) applying sparse principal component analysis to the genotype data to obtain orthogonal features; (b) building a new classifier that uses single nucleotide polymorphism (SNP)‐specific regularization parameters to reduce the false positive rate of feature selection; and (c) verifying feature relevance through penalized logistic regression. Applied to a dataset containing 625 597 SNPs and 23 environmental variables from each of 3326 individuals, the method identified 271 genetic variants with subtle effects on T2DM prediction. These variants led to greatly improved prediction accuracy for new patients at the population level. The proposed method is also computationally efficient, running over 15 times faster than random forest and extreme gradient boosting (XGBoost) classifiers, and thus provides a promising tool for large‐scale genome‐wide association studies.
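The abstract does not describe the classifier in detail, so the following is only a rough illustration of the idea behind steps (b) and (c), not the authors' method. It is a minimal sketch of L1-penalized logistic regression in which each feature has its own regularization parameter, fitted by proximal gradient descent. The function names, penalty values, step size, and synthetic genotype data are all assumptions made for the example.

```python
import math
import random


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink value toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_weighted_l1_logistic(X, y, penalties, step=0.1, n_iter=500):
    """Logistic regression with a separate L1 penalty per feature.

    penalties[j] is the (hypothetical) feature-specific regularization
    parameter for column j; a larger value makes that feature more
    likely to be dropped. The intercept is not penalized.
    """
    n, p = len(X), len(X[0])
    intercept, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        grad0, grad = 0.0, [0.0] * p
        for row, label in zip(X, y):
            z = intercept + sum(b * x for b, x in zip(beta, row))
            resid = sigmoid(z) - label
            grad0 += resid
            for j in range(p):
                grad[j] += resid * row[j]
        # Gradient step on the smooth loss, then per-feature shrinkage.
        intercept -= step * grad0 / n
        beta = [
            soft_threshold(beta[j] - step * grad[j] / n, step * penalties[j])
            for j in range(p)
        ]
    return intercept, beta


if __name__ == "__main__":
    # Synthetic data (assumption): genotypes coded 0/1/2, and only
    # feature 0 influences the outcome.
    random.seed(0)
    n, p = 300, 5
    X = [[random.choice([0, 1, 2]) for _ in range(p)] for _ in range(n)]
    y = [1 if random.random() < sigmoid(-1.5 + 1.5 * row[0]) else 0 for row in X]

    # Illustrative per-feature penalties; the last is large enough to
    # exclude that feature entirely.
    penalties = [0.01, 0.01, 0.05, 0.05, 1e6]
    intercept, beta = fit_weighted_l1_logistic(X, y, penalties)
    selected = [j for j, b in enumerate(beta) if b != 0.0]
    print("selected features:", selected)
```

Features whose coefficients are shrunk exactly to zero are dropped, so raising the penalty for a given feature makes a false-positive selection of that feature less likely. How the paper chooses its SNP-specific parameters is not stated in the abstract.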