Predicting gene expression in massively parallel reporter assays: A comparative study
Author(s) -
Kreimer Anat,
Zeng Haoyang,
Edwards Matthew D.,
Guo Yuchun,
Tian Kevin,
Shin Sunyoung,
Welch Rene,
Wainberg Michael,
Mohan Rahul,
Sinnott-Armstrong Nicholas A.,
Li Yue,
Eraslan Gökcen,
Amin Talal Bin,
Tewhey Ryan,
Sabeti Pardis C.,
Göke Jonathan,
Mueller Nikola S.,
Kellis Manolis,
Kundaje Anshul,
Beer Michael A.,
Keles Sunduz,
Gifford David K.,
Yosef Nir
Publication year - 2017
Publication title - Human Mutation
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.981
H-Index - 162
eISSN - 1098-1004
pISSN - 1059-7794
DOI - 10.1002/humu.23197
Subject(s) - biology, expression quantitative trait loci, computational biology, genetics, chromatin, transcription factor, gene, regulation of gene expression, locus (genetics), allele, regulatory sequence, single nucleotide polymorphism, genotype
In many human diseases, associated genetic changes tend to occur within noncoding regions, whose effect might be related to transcriptional control. A central goal in human genetics is to understand the function of such noncoding regions: given a region that is statistically associated with changes in gene expression (expression quantitative trait locus [eQTL]), does it in fact play a regulatory role? And if so, how is this role “coded” in its sequence? These questions were the subject of the Critical Assessment of Genome Interpretation eQTL challenge. Participants were given a set of sequences that flank eQTLs in humans and were asked to predict whether these are capable of regulating transcription (as evaluated by massively parallel reporter assays), and whether this capability changes between alternative alleles. Here, we report lessons learned from this community effort. By inspecting predictive properties in isolation, and conducting meta‐analysis over the competing methods, we find that using chromatin accessibility and transcription factor binding as features in an ensemble of classifiers or regression models leads to the most accurate results. We then characterize the loci that are harder to predict, putting the spotlight on areas of weakness, which we expect to be the subject of future studies.