MetaSpark: a spark-based distributed processing tool to recruit metagenomic reads to reference genomes

Wei Zhou, Ruilin Li, Shuo Yuan, Changchun Liu, Shaowen Yao, Jing Luo, Beifang Niu

Open Access

Publication year - 2016

Publication title - Bioinformatics

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 3.599

H-Index - 390

eISSN - 1367-4811

pISSN - 1367-4803

DOI - 10.1093/bioinformatics/btw750

Subject(s) - spark (programming language) , metagenomics , computer science , genome , computational biology , software , biology , genetics , programming language , gene

With the advent of next-generation sequencing, traditional bioinformatics tools are challenged by massive raw metagenomic datasets. One of the bottlenecks of metagenomic studies is the lack of large-scale data analysis tools suited to cloud computing. In this paper, we propose a Spark-based tool, called MetaSpark, to recruit metagenomic reads to reference genomes. MetaSpark benefits from Spark's resilient distributed dataset (RDD), which lets it cache data in memory across cluster nodes and scale well with dataset size. MetaSpark recruited significantly more reads than programs such as SOAP2, BWA and LAST, and it recruited about 4% more reads than FR-HIT on a test with 1 million reads and 0.75 GB of reference sequences. Different test cases demonstrate MetaSpark's scalability and overall high performance.
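To make "recruiting reads to a reference" concrete, here is a minimal single-machine sketch in plain Python. It is not MetaSpark's algorithm, which the abstract does not describe: the k-mer seeding, the ungapped identity score, and every name and threshold (`build_kmer_index`, `recruit_read`, `recruit_all`, `k`, `min_identity`) are assumptions made for illustration only.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def identity(a, b):
    """Fraction of matching positions between two equal-length strings."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def recruit_read(read, reference, index, k, min_identity):
    """Return (ref_start, identity) of the best ungapped hit, or None."""
    best = None
    seen = set()  # candidate start positions already scored
    for offset in range(len(read) - k + 1):
        # Each shared k-mer (seed) proposes one candidate placement.
        for pos in index.get(read[offset:offset + k], ()):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue  # read would hang off the reference
            if start in seen:
                continue
            seen.add(start)
            score = identity(read, reference[start:start + len(read)])
            if score >= min_identity and (best is None or score > best[1]):
                best = (start, score)
    return best


def recruit_all(reads, reference, k=4, min_identity=0.8):
    """Recruit each named read independently; unrecruited reads are omitted."""
    index = build_kmer_index(reference, k)
    hits = {}
    for name, seq in reads.items():
        hit = recruit_read(seq, reference, index, k, min_identity)
        if hit is not None:
            hits[name] = hit
    return hits


if __name__ == "__main__":
    # Toy data, hypothetical: one exact read, one with a mismatch, one unrelated.
    reference = "ACGTTGCATGCCGATAGCTAGGCTA"
    reads = {"r1": "TGCATGCCGA", "r2": "TGCATGACGA", "r3": "TTTTTTTTTT"}
    print(recruit_all(reads, reference))
```

Because each read is scored independently against a shared reference index, the per-read step is the kind of work that a framework such as Spark can spread across cluster nodes while keeping the data cached in memory, which is the property the abstract credits for MetaSpark's scalability.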
