Transactions on Large-Scale Data- and Knowledge-Centered Systems XLVI

Open Access

Author(s) - Abdelkader Hameurlain, A Min Tjoa

Publication year - 2020

Publication title - Lecture Notes in Computer Science

Language(s) - English

Resource type - Book series

SCImago Journal Rank - 0.249

H-Index - 400

eISSN - 1611-3349

pISSN - 0302-9743

DOI - 10.1007/978-3-662-62386-2

Subject(s) - computer science, schema (genetic algorithms), scalability, cloud computing, knowledge extraction, big data, volume (thermodynamics), data mining, distributed computing, data science, machine learning, database, physics, quantum mechanics, operating system

Advances in high-throughput sequencing technologies have drastically reduced the price of genome sequencing and led to exponential growth in the generation of genomic sequencing data. Genomic data is often stored in shared repositories and is both heterogeneous and unstructured in nature. Technically and culturally, it resides in the big data domain because of the challenges of volume, velocity, and variety. Appropriate data storage, management, processing, and analytic models are required to meet the growing challenges of genomic and clinical data. Existing research on the storage, management, and analysis of genomic and clinical data does not provide a comprehensive solution: it offers either a Hadoop-based solution that lacks robust computing support for data mining and knowledge discovery, or a distributed in-memory solution that is effective in reducing runtime but lacks robustness in data storage, resource management, reservation, and scheduling. In this paper, we present a scalable and elastic framework for genomic data storage, management, and processing that addresses the weaknesses of existing approaches. Fundamental to our framework is a distributed resource management system with a plug-and-play NoSQL component and an in-memory, distributed computing framework with machine learning and visualisation plugin tools. We evaluate the Avro, CSV, HBase, ORC, and Parquet datastores and benchmark their performance. A case study of machine-learning-based genotype clustering is presented to demonstrate and evaluate the effectiveness of the presented framework. The results show an overall performance improvement of 49% in the genomics data analysis pipeline over existing approaches. Finally, we make recommendations on state-of-the-art technologies and tools and on effective architectural approaches for the management of, and knowledge discovery from, large datasets.
