Open Access
Efficient population assignment and outlier detection in human populations using biallelic markers chosen by principal component–based rankings
Author(s) -
Ryan L. Raaum,
Alex B. Wang,
Ali AlMeeri,
Connie J. Mulligan
Publication year - 2010
Publication title -
BioTechniques
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.617
H-Index - 131
eISSN - 1940-9818
pISSN - 0736-6205
DOI - 10.2144/000113426
Subject(s) - outlier, population, principal component analysis, biology, genetics, selection (genetic algorithm), evolutionary biology, identification (biology), genetic variation, ancestry informative marker, genome, population genetics, allele frequency, sample (material), allele, statistics, gene, computer science, demography, artificial intelligence, mathematics, botany, chemistry, chromatography, sociology
Whole-genome studies of genetic variation are now performed routinely and have accelerated the identification of disease-associated allelic variants, positive selection, recombination, and structural variation. However, these studies are sensitive to the presence of outlier data from individuals of different ancestry than the rest of the sample. Currently, the most common method of excluding outlier individuals is to collect a population sample and exclude outliers after genome-wide data have been collected. Here we show that a small collection of 20–27 polymorphic Alu insertions, selected using a principal component–based method with genetic ancestry estimates, may be used to easily assign Africans, East Asians, and Europeans to their population of origin. In addition, we show that samples from a geographically and genetically intermediate population (in our study, samples from India) can be identified within the original sample of Africans, East Asians, and Europeans. Finally, we show that outlier individuals from neighboring geographic regions (in our study, Yemen and sub-Saharan Africa) can be identified. These results will be of value in preselection of samples for more in-depth analysis as well as customized identification of maximally informative polymorphic markers for regional studies.
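The general idea of ranking biallelic markers by their principal component loadings and then assigning individuals from a small panel can be illustrated with a minimal sketch. This is not the authors' procedure: it omits the genetic ancestry estimates mentioned in the abstract, uses simulated genotypes rather than Alu insertion data, and substitutes a simple nearest-centroid rule for population assignment. The function names, the simulated allele frequencies (0.85 versus 0.15), and the panel size of 25 are all assumptions chosen for illustration.

```python
import numpy as np

def rank_markers_by_pca(genotypes, n_components=2):
    """Rank markers (columns) by variance-weighted squared loadings on the top PCs."""
    centered = genotypes - genotypes.mean(axis=0)
    # Rows of vt are the principal axes; s holds the singular values.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = (s[:n_components, None] ** 2 * vt[:n_components] ** 2).sum(axis=0)
    return np.argsort(scores)[::-1]  # most informative marker first

def assign_by_nearest_centroid(train, labels, test):
    """Assign each test individual to the population with the closest mean genotype."""
    classes = np.unique(labels)
    centroids = np.stack([train[labels == c].mean(axis=0) for c in classes])
    dists = ((test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dists.argmin(axis=1)]

def simulate(rng, freqs, n_per_pop):
    """Draw 0/1/2 genotypes per marker from each population's allele frequencies."""
    geno = np.concatenate(
        [rng.binomial(2, f, size=(n_per_pop, f.size)) for f in freqs]
    )
    labels = np.repeat(np.arange(len(freqs)), n_per_pop)
    return geno.astype(float), labels

# Hypothetical data: 3 populations, 120 markers, the first 30 of which differ
# strongly in frequency between populations (assumed values, not real data).
rng = np.random.default_rng(0)
n_markers, n_informative = 120, 30
freqs = np.tile(rng.uniform(0.1, 0.9, n_markers), (3, 1))
freqs[:, :n_informative] = 0.15
for j in range(n_informative):
    freqs[j % 3, j] = 0.85  # each informative marker is common in one population

train, train_labels = simulate(rng, freqs, n_per_pop=60)
test, test_labels = simulate(rng, freqs, n_per_pop=60)

# Select a small panel from the reference sample, then assign new individuals.
panel = rank_markers_by_pca(train)[:25]
predicted = assign_by_nearest_centroid(train[:, panel], train_labels, test[:, panel])
accuracy = (predicted == test_labels).mean()
print(f"panel size: {panel.size}, assignment accuracy: {accuracy:.2f}")
```

In this toy setting the ranking is computed on a reference sample only, and the held-out individuals are typed at just the selected markers, which mirrors the preselection use case described in the abstract. Real data would call for more care, for example handling missing genotypes and checking that the leading components reflect ancestry rather than other structure.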