Biospark: scalable analysis of large numerical datasets from biological simulations and experiments using Hadoop and Spark
Author(s) - Max C. Klein, Rati Sharma, Christopher H. Bohrer, Cameron M. Avelis, Elijah Roberts
Publication year - 2016
Publication title - Bioinformatics
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 3.599
H-Index - 390
eISSN - 1367-4811
pISSN - 1367-4803
DOI - 10.1093/bioinformatics/btw614
Subject(s) - spark (programming language), computer science, scalability, license, open source, source code, mit license, big data, code (set theory), domain (mathematical analysis), data mining, informatics, python (programming language), software, data science, database, programming language, operating system, mathematical analysis, mathematics, set (abstract data type), electrical engineering, engineering
Data-parallel programming techniques can dramatically decrease the time needed to analyze large datasets. While these methods have provided significant improvements for sequencing-based analyses, other areas of biological informatics have not yet adopted them. Here, we introduce Biospark, a new framework for performing data-parallel analysis on large numerical datasets. Biospark builds upon the open source Hadoop and Spark projects, bringing domain-specific features for biology.