z-logo
open-access-imgOpen Access
DSK: k-mer counting with very low memory usage
Author(s) -
Guillaume Rizk,
Dominique Lavenier,
Rayan Chikhi
Publication year - 2013
Publication title -
bioinformatics
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 3.599
H-Index - 390
eISSN - 1367-4811
pISSN - 1367-4803
DOI - 10.1093/bioinformatics/btt020
Subject(s) - k mer , computer science , hash table , hash function , substring , auxiliary memory , parallel computing , bloom filter , data structure , algorithm , operating system , dna sequencing , dna , biology , genetics , computer security
Counting all the k-mers (substrings of length k) in DNA/RNA sequencing reads is the preliminary step of many bioinformatics applications. However, state of the art k-mer counting methods require that a large data structure resides in memory. Such structure typically grows with the number of distinct k-mers to count. We present a new streaming algorithm for k-mer counting, called DSK (disk streaming of k-mers), which only requires a fixed user-defined amount of memory and disk space. This approach realizes a memory, time and disk trade-off. The multi-set of all k-mers present in the reads is partitioned, and partitions are saved to disk. Then, each partition is separately loaded in memory in a temporary hash table. The k-mer counts are returned by traversing each hash table. Low-abundance k-mers are optionally filtered. DSK is the first approach that is able to count all the 27-mers of a human genome dataset using only 4.0 GB of memory and moderate disk space (160 GB), in 17.9 h. DSK can replace a popular k-mer counting software (Jellyfish) on small-memory servers.

The content you want is available to Zendy users.

Already have an account? Click here to sign in.
Having issues? You can contact us here
Accelerating Research

Address

John Eccles House
Robert Robinson Avenue,
Oxford Science Park, Oxford
OX4 4GP, United Kingdom