Accurate Annotation of Protein‐coding Small Open Reading Frames in the Human Genome
Author(s) - Martinez Thomas F., Chu Qian, Donaldson Cynthia, Tan Dan, Shokhirev Maxim N., Saghatelian Alan
Publication year - 2020
Publication title - The FASEB Journal
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.709
H-Index - 277
eISSN - 1530-6860
pISSN - 0892-6638
DOI - 10.1096/fasebj.2020.34.s1.03051
Subject(s) - open reading frame , annotation , genome , gene , computational biology , human genome , biology , genome project , transcriptome , workflow , genetics , gene expression , computer science , database , peptide sequence
Functional protein‐coding small open reading frames (smORFs) are emerging as an important class of genes. Several smORF‐encoded microproteins have been characterized and implicated in a variety of critical processes, including regulation of mRNA decay, DNA repair, and muscle formation. Thus, rigorous and comprehensive annotation of protein‐coding smORFs is of fundamental importance. Here, we integrate de novo transcriptome assembly and Ribo‐Seq into an improved workflow that overcomes obstacles with previous methods to more confidently annotate thousands of novel smORFs across multiple human cell lines. Over 1,500 smORFs are found in two or more cell lines, and ~40% lack a canonical AUG start codon. Evolutionary conservation analyses suggest that hundreds of smORF‐encoded microproteins are likely functional. Additionally, monitoring RNA expression and translational efficiency during cellular stress revealed regulated smORFs, demonstrating an approach for identifying biologically relevant smORFs. We also find that smORF‐derived peptides are detectable on human leukocyte antigen complexes, positioning smORFs as a source of novel antigens. The annotation of protein‐coding smORFs radically alters the current view of the human genome’s coding capacity and will provide a rich pool of unexplored, functional human genes.

Support or Funding Information

This research was supported by NIH/NIGMS (R01 GM102491, A.S.), Leona M. and Harry B. Helmsley Charitable Trust grant (A.S.), Dr. Frederick Paulsen Chair/Ferring Pharmaceuticals (A.S.), NIH/NIGMS postdoctoral fellowship (F32 GM123685, T.F.M.), George E. Hewitt Foundation for Medical Research (Q.C.), and the Pioneer Fellowship (D.T.). This work was also supported by the Razavi Newman Integrative Genomics and Bioinformatics Core and the Next Generation Sequencing Core Facilities of the Salk Institute with funding from the NIH‐NCI CCSG (P30 014195) and the Chapman Foundation.