Knowledge and Theme Discovery across Very Large Biological Data Sets Using Distributed Queries: A Prototype Combining Unstructured and Structured Data

Author(s) - Uma Mudunuri, Mohamad Khouja, Stephen Repetski, Girish Venkataraman, Anney Che, Brian T. Luke, F. Pascal Girard, Robert M. Stephens

Open Access

Publication year - 2013

Publication title - PLOS ONE

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.99

H-Index - 332

ISSN - 1932-6203

DOI - 10.1371/journal.pone.0080503

Subject(s) - computer science, unstructured data, big data, data science, pace, set (abstract data type), data mining, domain (mathematical analysis), knowledge extraction, information retrieval, data set, biological data, bioinformatics, artificial intelligence, mathematical analysis, mathematics, geodesy, biology, programming language, geography

As the discipline of biomedical science continues to apply new technologies capable of producing unprecedented volumes of noisy and complex biological data, it has become evident that available methods for deriving meaningful information from such data are simply not keeping pace. In order to achieve useful results, researchers require methods that consolidate, store and query combinations of structured and unstructured data sets efficiently and effectively. As we move towards personalized medicine, the need to combine unstructured data, such as medical literature, with large amounts of highly structured and high-throughput data such as human variation or expression data from very large cohorts, is especially urgent. For our study, we investigated a likely biomedical query using the Hadoop framework. We ran queries using native MapReduce tools we developed as well as other open source and proprietary tools. Our results suggest that the available technologies within the Big Data domain can reduce the time and effort needed to utilize and apply distributed queries over large datasets in practical clinical applications in the life sciences domain. The methodologies and technologies discussed in this paper set the stage for a more detailed evaluation that investigates how various data structures and data models are best mapped to the proper computational framework.
