Bootstrap Aggregating of Alternating Decision Trees to Detect Sets of SNPs That Associate With Disease
Author(s) -
Guy Richard T.,
Santago Peter,
Langefeld Carl D.
Publication year - 2012
Publication title - Genetic Epidemiology
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.301
H-Index - 98
eISSN - 1098-2272
pISSN - 0741-0395
DOI - 10.1002/gepi.21608
Subject(s) - snp , decision tree , tree (set theory) , single nucleotide polymorphism , computer science , tag snp , logistic regression , genetic algorithm , machine learning , set (abstract data type) , decision tree learning , artificial intelligence , computational biology , data mining , mathematics , biology , genetics , combinatorics , genotype , gene , programming language
Complex genetic disorders result from a combination of genetic and nongenetic factors, all potentially interacting. Machine learning methods hold the potential to identify the multilocus and environmental associations thought to drive complex genetic traits. Decision trees, a popular machine learning technique, offer a computationally low-complexity algorithm capable of detecting associated sets of single nucleotide polymorphisms (SNPs) of arbitrary size, including modern genome-wide SNP scans. However, interpreting the importance of an individual SNP within these trees can present challenges. We present a new decision tree algorithm, denoted Bagged Alternating Decision Trees (BADTrees), that is based on identifying common structural elements in a bootstrapped set of Alternating Decision Trees (ADTrees). The algorithm is of order $nk^2$, where $n$ is the number of SNPs considered and $k$ is the number of SNPs in the tree constructed. Our simulation study suggests that BADTrees have higher power and lower type I error rates than ADTrees alone, and comparable power with lower type I error rates compared to logistic regression. We illustrate the application of the method using simulated data as well as data from the Lupus Large Association Study 1 (7,822 SNPs in 3,548 individuals). Our results suggest that BADTrees hold promise as a low computational order algorithm for detecting complex combinations of SNP and environmental factors associated with disease.
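The general idea of bootstrapping a tree learner and looking for elements that recur across the resampled fits can be illustrated with a small sketch. This is not the authors' BADTrees implementation: a one-level decision stump is swapped in for a full ADTree, and "common structural elements" is reduced to how often each SNP is chosen as the stump's split. The function names (`best_stump`, `bagged_snp_frequency`), the 0/1/2 genotype coding, and the parameters `n_boot` and `seed` are all assumptions made for illustration.

```python
import random
from collections import Counter


def best_stump(genotypes, labels):
    """Return the index of the SNP whose best single split misclassifies
    the fewest samples. A stump is a stand-in for a full ADTree here."""
    n_snps = len(genotypes[0])
    best_idx, best_err = None, len(labels) + 1
    for j in range(n_snps):
        # Assumed coding: genotypes are minor-allele counts 0, 1, or 2.
        for threshold in (0, 1):
            hi = [y for x, y in zip(genotypes, labels) if x[j] > threshold]
            lo = [y for x, y in zip(genotypes, labels) if x[j] <= threshold]
            # Each side predicts its majority class, so its error count
            # is the size of its minority class.
            err = (min(sum(hi), len(hi) - sum(hi))
                   + min(sum(lo), len(lo) - sum(lo)))
            if err < best_err:  # ties keep the lowest SNP index
                best_idx, best_err = j, err
    return best_idx


def bagged_snp_frequency(genotypes, labels, n_boot=50, seed=0):
    """Fit a stump on each bootstrap resample and return, per SNP index,
    the fraction of resamples in which that SNP was selected."""
    rng = random.Random(seed)
    n = len(labels)
    counts = Counter()
    for _ in range(n_boot):
        # Bootstrap: draw n samples with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        boot_x = [genotypes[i] for i in idx]
        boot_y = [labels[i] for i in idx]
        counts[best_stump(boot_x, boot_y)] += 1
    return {j: c / n_boot for j, c in counts.items()}
```

SNPs that are selected in a large fraction of the bootstrap fits are the ones this toy procedure would flag as associated; a SNP picked only occasionally is more likely to reflect sampling noise. Case/control labels are assumed to be coded 1/0.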