z-logo
open-access-imgOpen Access
CFSAN SNP Pipeline: an automated method for constructing SNP matrices from next-generation sequence data
Author(s) -
Steve Davis,
James Pettengill,
Yan Luo,
Justin L. Payne,
Al Shpuntoff,
Hugh Rand,
Errol Strain
Publication year - 2015
Publication title -
peerj computer science
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.806
H-Index - 24
ISSN - 2376-5992
DOI - 10.7717/peerj-cs.20
Subject(s) - snp , computer science , computational biology , single nucleotide polymorphism , python (programming language) , pipeline (software) , data mining , software , genetics , biology , gene , programming language , genotype
The analysis of next-generation sequence (NGS) data is often a fragmented step-wise process. For example, multiple pieces of software are typically needed to map NGS reads, extract variant sites, and construct a DNA sequence matrix containing only single nucleotide polymorphisms (i.e., a SNP matrix) for a set of individuals. The management and chaining of these software pieces and their outputs can often be a cumbersome and difficult task. Here, we present CFSAN SNP Pipeline, which combines into a single package the mapping of NGS reads to a reference genome with Bowtie2, processing of those mapping (BAM) files using SAMtools, identification of variant sites using VarScan, and production of a SNP matrix using custom Python scripts. We also introduce a Python package (CFSAN SNP Mutator) that when given a reference genome will generate variants of known position against which we validate our pipeline. We created 1,000 simulated Salmonella enterica sp. enterica Serovar Agona genomes at 100× and 20× coverage, each containing 500 SNPs, 20 single-base insertions and 20 single-base deletions. For the 100× dataset, the CFSAN SNP Pipeline recovered 98.9% of the introduced SNPs and had a false positive rate of 1.04 × 10−6; for the 20× dataset 98.8% of SNPs were recovered and the false positive rate was 8.34 × 10−7. Based on these results, CFSAN SNP Pipeline is a robust and accurate tool that it is among the first to combine into a single executable the myriad steps required to produce a SNP matrix from NGS data. Such a tool is useful to those working in an applied setting (e.g., food safety traceback investigations) as well as for those interested in evolutionary questions

The content you want is available to Zendy users.

Already have an account? Click here to sign in.
Having issues? You can contact us here
Accelerating Research

Address

John Eccles House
Robert Robinson Avenue,
Oxford Science Park, Oxford
OX4 4GP, United Kingdom