Open Access
How to Train Your Genome
Author(s) - Caryn Navarro
Publication year - 2019
Publication title - Cell
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 26.304
H-Index - 776
eISSN - 1097-4172
pISSN - 0092-8674
DOI - 10.1016/j.cell.2019.03.003
Subject(s) - biology , genome , computational biology , evolutionary biology , genetics , gene
Artificial intelligence, machine learning, deep learning: Do these terms sound like plots from a science fiction novel? Perhaps, but they are also powerful tools currently being used in genomics to understand sequence variation and how such variation leads to disease, development, and evolution in human populations. High-throughput sequencing (HTS), omics technology, and genome-wide association studies (GWAS) have led to a massive increase in the amount of data available to decipher the human genome. These data are often complex, and the significance of single-nucleotide changes can be hard to predict from sequence alone. So, what is the best and most efficient way to analyze all the data and find meaning in the information? Not only does deep learning provide a powerful method to analyze data from one specific type of analysis, but it can also combine data from several complementary approaches to identify genes and pathways that could be important in understanding the development of disease.

What is deep learning? Deep learning is a subcategory of machine learning, a programming approach that allows computers to learn solutions to a particular task (Zou et al., 2019; Ching et al., 2018). A deep learning model is built from multiple layers of integrated computation: information from one layer is passed on to the next, allowing more refined patterns to emerge at each layer, until the final output or prediction is reached. Through training, machines can learn highly complex functions. In deep learning, all data integration is done by the machine with no preconceived model, allowing the machine to make predictions without strong assumptions about biological mechanism. Deep learning methods have been successfully applied across disciplines to understand and manipulate data from many sources, including images, speech, and language (Zou et al., 2019; Ching et al., 2018).
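The layered idea described above can be made concrete with a toy sketch. Nothing here comes from the article or any published model: the motif "TATA", the sequence length, the layer sizes, and the training settings are all arbitrary illustrative choices. The point is only to show information passing from one layer to the next, with the weights of both layers adjusted by training rather than by a preconceived model.

```python
import numpy as np

# Illustrative sketch only (not a method from the article): a tiny
# two-layer network, trained by gradient descent, that learns to flag
# short DNA sequences containing the motif "TATA".
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a flat 4-per-base binary vector."""
    x = np.zeros((len(seq), 4))
    for i, b in enumerate(seq):
        x[i, BASES.index(b)] = 1.0
    return x.ravel()

rng = np.random.default_rng(0)

def random_seq(n, motif=None):
    s = "".join(rng.choice(list(BASES), size=n))
    if motif:  # plant the motif at a random position
        pos = int(rng.integers(0, n - len(motif) + 1))
        s = s[:pos] + motif + s[pos + len(motif):]
    return s

seqs = [random_seq(8, "TATA") for _ in range(100)] + \
       [random_seq(8) for _ in range(100)]
y = np.array([1.0 if "TATA" in s else 0.0 for s in seqs])
X = np.stack([one_hot(s) for s in seqs])

# Layer 1 learns intermediate features; layer 2 combines them into a
# prediction. Each layer refines the output of the one before it.
W1 = rng.normal(0, 0.1, (X.shape[1], 16)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.1, (16, 1));          b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

loss_history = []
lr = 0.5
for epoch in range(500):
    h = np.tanh(X @ W1 + b1)             # layer 1: learned features
    p = sigmoid(h @ W2 + b2).ravel()     # layer 2: prediction
    loss_history.append(
        -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9)))
    # Backpropagate the cross-entropy gradient through both layers.
    g = ((p - y) / len(y))[:, None]
    gW2, gb2 = h.T @ g, g.sum(0)
    gh = (g @ W2.T) * (1 - h**2)
    gW1, gb1 = X.T @ gh, gh.sum(0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

acc = float(((p > 0.5) == (y > 0.5)).mean())
```

The network is never told what "TATA" is or where it might sit; the falling training loss shows it discovering a usable rule from examples alone, which is the sense in which deep learning needs no preconceived model.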
With the rise of HTS and omics technology, along with the plethora of GWAS data available, deep learning presents a great opportunity to harness the information contained in these data in a manner that humans alone do not have the time or resources to achieve. The resulting new approaches for modeling biological processes and integrating multiple types of omics data could eventually help predict how these processes are disrupted in disease. Recently, deep learning has been used to find patterns in genomic data in order to identify transcription factor binding sites, splice sites, and enhancer and promoter sequences and has also been used in population biology applications (Schrider and Kern, 2018; Ching et al., 2018). These large omics datasets contain a myriad of information, and it is practically impossible for humans alone to extract all the useful data. Deep learning methods can capitalize on this information, especially when complementary datasets (e.g., methylation, transcription factor binding, chromatin accessibility) are used in the same method, as the machines have the capability of finding unexpected, often nonlinear, interactions among the data. Although deep learning methods have been broadly applied across human genomics and healthcare, a few key recent examples, in which deep learning has been implemented to analyze genome sequence and function, illustrate the importance and potential of this approach. Accurate variant calling from HTS data has historically been challenging, as many sources of error contribute to the sequence that is ultimately compared to the reference genome. These can stem from the sequencing read itself, the properties of the instrument, data processing tools, and/or the reference genome (Telenti et al., 2018). Additionally, different sequencing technologies and library preparation protocols can affect the signals obtained.
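To make the pattern-finding concrete, here is a toy sketch (not any specific published model) of the mechanism underlying motif detection in genomics networks: the first layer of such a network is essentially a position weight matrix slid along one-hot-encoded DNA. The filter below is hand-set to respond to the motif "GATA"; in a trained network, the filter values would instead be learned from data such as transcription factor binding experiments.

```python
import numpy as np

# Sketch: a convolutional "motif filter" scanning a DNA sequence.
# Filter values are hand-set for illustration; a real network would
# learn them from data.
BASES = "ACGT"

def one_hot(seq):
    """Encode DNA as a 4 x L binary matrix (rows = A, C, G, T)."""
    x = np.zeros((4, len(seq)))
    for i, b in enumerate(seq):
        x[BASES.index(b), i] = 1.0
    return x

# A 4x4 filter that rewards each matching base of "GATA".
filt = one_hot("GATA")

def conv_scan(x, f):
    """Valid cross-correlation of filter f along sequence x."""
    L, w = x.shape[1], f.shape[1]
    return np.array([np.sum(x[:, i:i + w] * f) for i in range(L - w + 1)])

scores = conv_scan(one_hot("TTGATACC"), filt)
best = int(np.argmax(scores))  # position of the strongest activation
```

Scanning "TTGATACC" gives a perfect score of 4.0 at position 2, where "GATA" begins; later layers of a real network would combine many such filter activations, which is where the nonlinear interactions among datasets mentioned above come in.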
Often, a variety of statistical techniques are combined to assess these sources of error in order to identify real variants. To address this problem, Poplin et al. (2018) recently reported a technology called DeepVariant. DeepVariant is a deep learning method designed to replace the statistical modeling components, allowing the machine to learn the rules for predicting error directly from the data. DeepVariant has been shown to identify true variants more accurately than statistical modeling, reducing the numbers of both false positives and false negatives. Additionally, this method has the advantage of being adaptable to multiple sequencing platforms, which previous statistical packages were not. Even with clean variant calls, many variants are found in non-coding regions of the genome, making it hard to predict their functional significance. Additionally, predicting a variant as functional due to a change in sequence alone is often not enough, as functionality can be influenced by the surrounding genomic context. Combinatorial approaches taking advantage of large datasets, including transcription factor binding sites, histone marks, DNA accessibility, and 3D genome organization, have the power to harness complementary information to provide more accurate predictions of how
