
A probabilistic gene expression barcode for annotation of cell types from single-cell RNA-seq data
Author(s) -
Isabella N. Grabski,
Rafael A. Irizarry
Publication year - 2022
Publication title -
Biostatistics
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 3.493
H-Index - 82
eISSN - 1468-4357
pISSN - 1465-4644
DOI - 10.1093/biostatistics/kxac021
Subject(s) - overfitting, annotation, computer science, RNA-seq, computational biology, barcode, gene, data mining, biology, artificial intelligence, gene expression, genetics, transcriptome, artificial neural network, operating system
Single-cell RNA sequencing (scRNA-seq) quantifies gene expression for individual cells in a sample, which allows distinct cell-type populations to be identified and characterized. An important step in many scRNA-seq analysis pipelines is the annotation of cells into known cell types. While this can be achieved using experimental techniques, such as fluorescence-activated cell sorting, these approaches are impractical for large numbers of cells. This motivates the development of data-driven cell-type annotation methods. We find that current approaches are limited either by their reliance on known marker genes or by overfitting caused by systematic differences, or batch effects, between studies. Here, we present a statistical approach that leverages public data sets to combine information across thousands of genes, uses a latent variable model to define cell-type-specific barcodes and account for batch effect variation, and probabilistically annotates cell-type identity from a reference of known cell types. The barcoding approach also provides a new way to discover marker genes. Using a range of data sets, including those generated to represent imperfect real-world reference data, we demonstrate that our approach substantially outperforms current reference-based methods, particularly when predicting across studies.
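To make the idea of probabilistic annotation from barcodes concrete, the following is a minimal toy sketch and not the authors' model. It assumes, purely for illustration, that each reference cell type has a binary on/off barcode over genes, that counts follow a Poisson distribution with one hypothetical rate for "off" genes and another for "on" genes, and that all cell types are equally likely a priori. The function names, rates, and example barcodes are all invented here; the abstract does not specify the likelihood, and this sketch ignores batch effects entirely.

```python
import math


def poisson_logpmf(count, rate):
    """Log-probability of observing `count` under a Poisson with mean `rate`."""
    return count * math.log(rate) - rate - math.lgamma(count + 1)


def annotate(counts, barcodes, rate_off=0.1, rate_on=5.0):
    """Return posterior probabilities over cell types for one cell.

    counts   : list of gene counts for the cell
    barcodes : dict mapping cell-type name -> list of 0/1 gene states
    rate_off, rate_on : hypothetical Poisson means for off/on genes
    """
    # Sum per-gene log-likelihoods under each cell type's barcode.
    log_lik = {}
    for cell_type, code in barcodes.items():
        log_lik[cell_type] = sum(
            poisson_logpmf(x, rate_on if on else rate_off)
            for x, on in zip(counts, code)
        )
    # Normalize with a uniform prior, subtracting the max for stability.
    top = max(log_lik.values())
    weights = {k: math.exp(v - top) for k, v in log_lik.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# Illustrative barcodes over four hypothetical genes.
barcodes = {"T cell": [1, 1, 0, 0], "B cell": [0, 0, 1, 1]}
posterior = annotate([4, 6, 0, 1], barcodes)
best = max(posterior, key=posterior.get)
```

Here `best` is the cell type with the highest posterior probability for the example cell. A model of the kind the abstract describes would estimate the on/off states and the batch-effect variation from reference data rather than fixing them by hand.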