Towards optimizing the execution of spark scientific workflows using machine learning‐based parameter tuning




Author(s) - Oliveira Douglas, Porto Fábio, Boeres Cristina, Oliveira Daniel

Publication year - 2020

Publication title - Concurrency and Computation: Practice and Experience

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.309

H-Index - 67

eISSN - 1532-0634

pISSN - 1532-0626

DOI - 10.1002/cpe.5972

Subject(s) - workflow, spark (programming language), computer science, set (abstract data type), domain (mathematical analysis), machine learning, big data, data mining, task (project management), artificial intelligence, database, programming language, systems engineering, engineering, mathematical analysis, mathematics

Summary In the last few years, Apache Spark has become the de facto standard framework for big data systems in both industry and academic projects. Spark is used to execute compute‐ and data‐intensive workflows in distinct areas like biology and astronomy. Although Spark is an easy‐to‐install framework, it has more than one hundred parameters to be set, in addition to the domain‐specific parameters of each workflow. Thus, to execute Spark‐based workflows efficiently, the user has to fine‐tune a myriad of Spark and workflow parameters (eg, partitioning strategy, the average size of a DNA sequence). This configuration task cannot reasonably be performed manually in a trial‐and‐error manner, since doing so is tedious and error‐prone. This article proposes an approach that focuses on generating interpretable predictive machine learning models (ie, decision trees) and then extracting useful rules (ie, patterns) from these models that can be applied to configure the parameters of future executions of the workflow and Spark for nonexpert users. In the experiments presented in this article, the proposed parameter configuration approach led to better performance in processing Spark workflows. Finally, the approach introduced here reduced the number of parameters to be configured by identifying the most relevant domain‐specific ones related to the workflow performance in the predictive model.
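The idea in the summary (learn a decision tree from past executions, then read configuration rules off it) can be illustrated with a minimal sketch. This is not the authors' implementation: the parameter names (`num_partitions`, `executor_memory_gb`, `avg_seq_len`), the "fast"/"slow" labels, the toy execution log, and the choice of a small CART-style learner with Gini impurity are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical execution log: (parameter settings, observed performance label).
# All names and values are invented for illustration only.
RUNS = [
    ({"num_partitions": 8,  "executor_memory_gb": 2, "avg_seq_len": 500},  "slow"),
    ({"num_partitions": 8,  "executor_memory_gb": 4, "avg_seq_len": 500},  "slow"),
    ({"num_partitions": 8,  "executor_memory_gb": 2, "avg_seq_len": 2000}, "slow"),
    ({"num_partitions": 8,  "executor_memory_gb": 4, "avg_seq_len": 2000}, "slow"),
    ({"num_partitions": 64, "executor_memory_gb": 2, "avg_seq_len": 500},  "fast"),
    ({"num_partitions": 64, "executor_memory_gb": 4, "avg_seq_len": 500},  "fast"),
    ({"num_partitions": 64, "executor_memory_gb": 2, "avg_seq_len": 2000}, "slow"),
    ({"num_partitions": 64, "executor_memory_gb": 4, "avg_seq_len": 2000}, "slow"),
]


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows):
    """Return the (feature, threshold) that lowers weighted Gini most, or None."""
    best, best_score = None, gini([y for _, y in rows])
    for feat in rows[0][0]:
        values = sorted({x[feat] for x, _ in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [y for x, y in rows if x[feat] <= thr]
            right = [y for x, y in rows if x[feat] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score - 1e-12:  # accept strict improvements only
                best, best_score = (feat, thr), score
    return best


def build_tree(rows, depth=0, max_depth=3):
    """Build a tree: a leaf is a label string, a node is (feat, thr, left, right)."""
    split = best_split(rows) if depth < max_depth else None
    if split is None:
        return Counter(y for _, y in rows).most_common(1)[0][0]
    feat, thr = split
    left = [r for r in rows if r[0][feat] <= thr]
    right = [r for r in rows if r[0][feat] > thr]
    return (feat, thr,
            build_tree(left, depth + 1, max_depth),
            build_tree(right, depth + 1, max_depth))


def extract_rules(tree, conditions=()):
    """Turn every root-to-leaf path into a (list of conditions, label) rule."""
    if isinstance(tree, str):
        return [(list(conditions), tree)]
    feat, thr, left, right = tree
    return (extract_rules(left, conditions + (f"{feat} <= {thr}",))
            + extract_rules(right, conditions + (f"{feat} > {thr}",)))


def used_features(tree):
    """Parameters the tree actually tests; the rest did not help prediction."""
    if isinstance(tree, str):
        return set()
    feat, _, left, right = tree
    return {feat} | used_features(left) | used_features(right)


tree = build_tree(RUNS)
for conditions, label in extract_rules(tree):
    print(" AND ".join(conditions), "=>", label)
print("relevant parameters:", sorted(used_features(tree)))
```

In this toy log the learned tree never tests `executor_memory_gb`, which mirrors, in a very simplified way, how an interpretable model can narrow the set of parameters a nonexpert user needs to configure. A real study would train on measured execution times from many workflow runs and would likely use an established library rather than a hand-written learner.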
