Catwalk: identifying closely related sequences in large microbial sequence databases
Author(s) - Denis Volk, Fan Yang-Turner, Xavier Didelot, Derrick W. Crook, David Wyllie
Publication year - 2022
Publication title - Microbial Genomics
Language(s) - English
Resource type - Journals
ISSN - 2057-5858
DOI - 10.1099/mgen.0.000850
Subject(s) - python (programming language) , genome , computer science , reference genome , annotation , computational biology , pairwise comparison , sequence alignment , biology , data mining , database , genetics , artificial intelligence , programming language , gene , peptide sequence
There is a need to identify microbial sequences that may form part of transmission chains, or that may represent importations across national boundaries, amidst large numbers of SARS-CoV-2 and other viral or bacterial sequences. Reference-based compression is a sequence analysis technique that allows both compact storage of sequence data and comparisons between sequences. Published implementations of the approach are being challenged by the large sample collections now being generated. Our aim was to develop fast software for detecting highly similar sequences in large collections of microbial genomes, including millions of SARS-CoV-2 genomes. To do so, we developed Catwalk, a tool that bypasses bottlenecks in the generation, comparison and in-memory storage of microbial genomes generated by reference mapping. It is a compiled solution, coded in Nim to increase performance. It can be accessed via command-line, REST API or web server interfaces. We tested Catwalk using both SARS-CoV-2 and Mycobacterium tuberculosis genomes generated by prospective public-health sequencing programmes. Pairwise sequence comparisons, using clinically relevant similarity cut-offs, took about 0.39 and 0.66 μs, respectively; in 1 s, between 1 and 2 million sequences can be searched. Catwalk operates about 1700 times faster than, and uses about 8% of the RAM of, a Python reference-based compression and comparison tool in current use for outbreak detection. Catwalk can rapidly identify close relatives of a SARS-CoV-2 or M. tuberculosis genome amidst millions of samples.
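
To make the idea of reference-based compression concrete, the following is a minimal Python sketch, not Catwalk's implementation (which is written in Nim and is not described in detail here). It assumes that each sample is already aligned to the reference at equal length, that any character other than A, C, G or T is treated as unknown, and that positions unknown in either sample are ignored when counting differences. The names `CompressedSeq`, `compress`, `snp_distance` and `neighbours` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

BASES = "ACGT"


@dataclass
class CompressedSeq:
    """A sample stored only as its differences from the reference (illustrative)."""
    variants: dict       # base -> frozenset of positions where the sample has that base
    unknown: frozenset   # positions where the sample base is not A, C, G or T


def compress(reference: str, sample: str) -> CompressedSeq:
    """Record only the positions at which the sample differs from the reference."""
    if len(reference) != len(sample):
        raise ValueError("sample must be aligned to the reference (equal length)")
    variants = {base: set() for base in BASES}
    unknown = set()
    for pos, (ref_base, sample_base) in enumerate(zip(reference, sample)):
        if sample_base not in BASES:
            unknown.add(pos)                 # e.g. 'N' or '-': uncertain call
        elif sample_base != ref_base:
            variants[sample_base].add(pos)   # a difference from the reference
    return CompressedSeq(
        variants={base: frozenset(p) for base, p in variants.items()},
        unknown=frozenset(unknown),
    )


def snp_distance(a: CompressedSeq, b: CompressedSeq) -> int:
    """Count positions where a and b differ, ignoring positions unknown in either."""
    differing = set()
    for base in BASES:
        # A position present in exactly one of the two sets means the samples disagree there.
        differing |= a.variants[base] ^ b.variants[base]
    differing -= a.unknown | b.unknown
    return len(differing)


def neighbours(query: CompressedSeq, collection: dict, cutoff: int) -> dict:
    """Return {name: distance} for stored samples within `cutoff` differences of the query."""
    hits = {}
    for name, other in collection.items():
        distance = snp_distance(query, other)
        if distance <= cutoff:
            hits[name] = distance
    return hits
```

Because closely related genomes differ from the reference at few positions, the stored sets are small, and comparing two samples reduces to set operations on those positions rather than a scan of the full genome. A search against a collection then amounts to calling `neighbours(compress(reference, query), collection, cutoff)`, where `cutoff` plays the role of the similarity cut-off mentioned above.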