CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices



Open Access


Author(s) - Shaopeng Liu, David Koslicki

Publication year - 2022

Publication title - Bioinformatics

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 3.599

H-Index - 390

eISSN - 1367-4811

pISSN - 1367-4803

DOI - 10.1093/bioinformatics/btac237

Subject(s) - python (programming language), computer science, Jaccard index, data mining, algorithm, cluster analysis, artificial intelligence, operating system

K-mer-based methods are used ubiquitously in computational biology. However, choosing the optimal value of k for a specific application often remains heuristic. Simply reconstructing a new k-mer set with another k-mer size is computationally expensive, especially in metagenomic analysis, where datasets are large. Here, we introduce a hashing-based technique that leverages a kind of bottom-m sketch as well as a k-mer ternary search tree (KTST) to obtain k-mer-based similarity estimates for a range of k values. By truncating k-mers stored in a pre-built KTST with a large k = kmax value, we can simultaneously obtain k-mer-based estimates for all k values up to kmax. This truncation approach circumvents the reconstruction of new k-mer sets when changing k values, making analysis more time- and space-efficient.
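The truncation idea in the abstract can be illustrated with a minimal Python sketch. This is not the CMash implementation: it substitutes plain Python sets for both the KTST and the bottom-m sketch, computes exact rather than estimated indices, and all function names (`kmers`, `truncate`, `multi_k_similarity`) are hypothetical. It only shows how one set of kmax-mers, built once, can yield Jaccard and containment values for every smaller k by prefix truncation.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def truncate(kmer_set, k):
    """Prefix-truncate every stored k_max-mer to length k."""
    return {x[:k] for x in kmer_set}


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a, b):
    """Containment index |A & B| / |A|: fraction of A found in B."""
    return len(a & b) / len(a) if a else 0.0


def multi_k_similarity(seq_a, seq_b, k_max, k_values):
    """Build the k_max-mer sets once, then reuse them for each k <= k_max."""
    a_max = kmers(seq_a, k_max)
    b_max = kmers(seq_b, k_max)
    result = {}
    for k in k_values:
        if k > k_max:
            raise ValueError("k must not exceed k_max")
        a_k = truncate(a_max, k)
        b_k = truncate(b_max, k)
        result[k] = (jaccard(a_k, b_k), containment(a_k, b_k))
    return result


if __name__ == "__main__":
    sims = multi_k_similarity("ACGTACGTGA", "ACGTACGTCC", k_max=4, k_values=[2, 3, 4])
    for k, (j, c) in sorted(sims.items()):
        print(f"k={k}: Jaccard={j:.3f}, containment={c:.3f}")
```

One caveat of this simplified version: the last kmax − k k-mers of each sequence never appear as prefixes of a kmax-mer, so the truncated set is only a subset of the true k-mer set. On long genomic sequences this affects a handful of k-mers, but on toy inputs like the one above it can shift the values noticeably.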
