
A comparison of three variant calling pipelines using simulated data
Author(s) -
Nguyen Van Tung,
Nguyễn Thị Kim Liên,
Nguyen Huy Hoang
Publication year - 2021
Publication title -
Academia Journal of Biology
Language(s) - English
Resource type - Journals
eISSN - 2815-5920
pISSN - 2615-9023
DOI - 10.15625/2615-9023/16006
Subject(s) - computer science , pipeline (software) , pipeline transport , data mining , computational biology , biology , environmental engineering , engineering , programming language
Advances in next-generation sequencing allow DNA to be sequenced rapidly and at relatively low cost. Multiple bioinformatics methods have been developed to identify genomic variants from whole-genome or whole-exome sequencing data. The development of better variant calling methodologies is limited by the difficulty of assessing the accuracy and completeness of a new method. Computational methods are commonly benchmarked on simulated data, which can be generated in any desired quantity and under controlled scenarios. In this study, we compared three variant calling pipelines, Samtools/VarScan, Samtools/Bcftools, and Picard/GATK, using two simulated datasets. The results showed a significant difference among the three pipelines in both cases. In the chromosome 6 dataset, the GATK and Bcftools pipelines each detected more than 90% of variants, whereas the VarScan pipeline detected only 82.19%. In the NA12878 dataset, the GATK pipeline was more sensitive than the Bcftools and VarScan pipelines. All pipelines showed a high positive predictive value. By run time, VarScan ranked highest among the pipelines, although GATK offers a multithreading option that can make it run faster. We therefore conclude that GATK is more effective than Bcftools and VarScan for variant calling on a lower-coverage dataset.
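The two accuracy measures named in the abstract can be illustrated with a minimal sketch. It assumes the standard definitions, sensitivity = TP / (TP + FN) and positive predictive value = TP / (TP + FP), and assumes that a called variant counts as correct only when its chromosome, position, reference allele and alternate allele all match a simulated variant exactly. The function name `evaluate_calls` and the toy variants are hypothetical and are not taken from the study.

```python
def evaluate_calls(truth, called):
    """Compare called variants against the simulated (known) variants.

    Both arguments are iterables of (chrom, pos, ref, alt) tuples.
    Exact-match comparison is an assumption of this sketch.
    """
    truth, called = set(truth), set(called)
    tp = len(truth & called)   # simulated variants that were called
    fp = len(called - truth)   # calls with no matching simulated variant
    fn = len(truth - called)   # simulated variants that were missed
    sensitivity = tp / (tp + fn) if truth else float("nan")
    ppv = tp / (tp + fp) if called else float("nan")
    return {"TP": tp, "FP": fp, "FN": fn,
            "sensitivity": sensitivity, "PPV": ppv}


# Toy data, invented purely for illustration.
truth = {
    ("chr6", 100, "A", "G"),
    ("chr6", 250, "C", "T"),
    ("chr6", 400, "G", "A"),
    ("chr6", 900, "T", "C"),
}
called = {
    ("chr6", 100, "A", "G"),
    ("chr6", 250, "C", "T"),
    ("chr6", 400, "G", "A"),
    ("chr6", 700, "A", "C"),
}

metrics = evaluate_calls(truth, called)
print(metrics)
```

In practice the tuples would be read from the simulator's truth file and from each pipeline's VCF output, and indels may need normalization before comparison; both steps are outside this sketch.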