How does the structure of data impact cell–cell similarity? Evaluating how structural properties influence the performance of proximity metrics in single cell RNA-seq data

Ebony Watson; Ariane Mora; Atefeh Taherian Fard; Jessica C. Mar

Open Access

Publication year - 2022

Publication title - Briefings in Bioinformatics

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 3.204

H-Index - 113

eISSN - 1477-4054

pISSN - 1467-5463

DOI - 10.1093/bib/bbac387

Subject(s) - cluster analysis, metric (unit), benchmarking, computer science, euclidean distance, data mining, curse of dimensionality, dimensionality reduction, population, similarity (geometry), pairwise comparison, neighbourhood (mathematics), artificial intelligence, mathematics, engineering, mathematical analysis, operations management, demography, marketing, sociology, business, image (mathematics)

Accurately identifying cell populations is paramount to the quality of downstream analyses and the overall interpretation of single-cell RNA-seq (scRNA-seq) datasets, but it remains a challenge. The quality of single-cell clustering depends on the proximity metric used to generate cell-to-cell distances. Accordingly, proximity metrics have been benchmarked for scRNA-seq clustering, typically with results averaged across datasets to identify a single highest-performing metric. However, the 'best-performing' metric varies between studies, with performance differing significantly between datasets. This suggests that the unique structural properties of an scRNA-seq dataset, specific to the biological system under study, have a substantial impact on proximity metric performance. Previous benchmarking studies have not factored these structural properties into their evaluations. To address this gap, we developed a framework for the in-depth evaluation of the performance of 17 proximity metrics with respect to core structural properties of scRNA-seq data, including sparsity, dimensionality, cell-population distribution and rarity. We find that clustering performance can be improved substantially by selecting a proximity metric and neighbourhood size appropriate to the structural properties of a dataset, in addition to performing suitable pre-processing and dimensionality reduction. Furthermore, popular metrics such as Euclidean and Manhattan distance performed poorly in comparison to several less commonly applied metrics, suggesting that the default metric for many scRNA-seq methods should be re-evaluated. Our findings highlight the critical nature of tailoring scRNA-seq analysis pipelines to the dataset under study and provide practical guidance for researchers looking to optimize cell-similarity search for the structural properties of their own data.
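
To make the idea of comparing proximity metrics on one dataset concrete, the following is a minimal sketch and not the authors' framework. It assumes a toy simulated dataset (two populations, a crude dropout model for sparsity) and uses k-nearest-neighbour label purity as a simple stand-in for clustering quality. The function names, the simulation parameters and the choice of three metrics are all hypothetical and chosen only for illustration.

```python
import math
import random


def simulate_cells(n_per_group=30, n_genes=100, dropout=0.7, seed=0):
    """Toy log-expression profiles for two populations (hypothetical data)."""
    rng = random.Random(seed)
    base = [rng.uniform(1.0, 8.0) for _ in range(n_genes)]
    cells, labels = [], []
    for label in (0, 1):
        for _ in range(n_per_group):
            profile = []
            for g, mu in enumerate(base):
                if label == 1 and g < 15:
                    mu *= 3.0  # 15 marker genes up-regulated in population 1
                value = max(0.0, rng.gauss(mu, math.sqrt(mu)))  # noisy expression
                if rng.random() < dropout:
                    value = 0.0  # crude dropout model producing sparsity
                profile.append(math.log1p(value))
            cells.append(profile)
            labels.append(label)
    return cells, labels


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Treat an all-zero profile as maximally distant (an assumption of this sketch).
    return 1.0 - dot / norm if norm > 0 else 1.0


def knn_purity(cells, labels, metric, k=5):
    """Mean fraction of each cell's k nearest neighbours sharing its label."""
    total = 0.0
    for i, cell in enumerate(cells):
        others = [(metric(cell, other), j) for j, other in enumerate(cells) if j != i]
        nearest = sorted(others)[:k]
        total += sum(labels[j] == labels[i] for _, j in nearest) / k
    return total / len(cells)


cells, labels = simulate_cells()
METRICS = {"euclidean": euclidean, "manhattan": manhattan, "cosine": cosine_distance}
purity = {name: knn_purity(cells, labels, fn) for name, fn in METRICS.items()}
for name, score in purity.items():
    print(f"{name:10s} purity@5 = {score:.3f}")
```

Varying `dropout`, `n_genes`, the group sizes or `k` in this sketch is one way to explore how sparsity, dimensionality, population balance and neighbourhood size can each shift the ranking of metrics. Results on simulated data of this kind should not be read as evidence for or against any metric on real scRNA-seq data.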
