Predicting Hadoop misconfigurations using machine learning
Author(s) -
Robert Andrew,
Gupta Apaar,
Shenoy Vinayak,
Sitaram Dinkar,
Kalambur Subramaniam
Publication year - 2020
Publication title -
Software: Practice and Experience
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.437
H-Index - 70
eISSN - 1097-024X
pISSN - 0038-0644
DOI - 10.1002/spe.2790
Subject(s) - computer science, workload, Spark (programming language), crash, big data, set (abstract data type), support vector machine, computer cluster, machine learning, resource, decision tree, variety, distributed computing, data mining, database, artificial intelligence, operating system, computer network, programming language
Summary - Distributed applications are popular for heavy workloads where the resources of a single machine are not sufficient. These applications come with many parameters to tune so that cluster resources can be used effectively. However, misconfiguring any of the available parameters may result in suboptimal performance of one or more machines in the cluster; such events may go unnoticed or can result in crashes. This problem has no straightforward solution because of the variety of parameters and the vastly different workloads being processed. In this article, we propose a methodology for machine learning‐based detection of misconfigurations. We collect data mined from system resource utilization, Hadoop logs, and job‐level metrics to train models using a decision tree and a support vector machine. The models are used to identify whether a set of configuration parameters could result in a crash or a slowdown for a specific workload. The approach explained in this article can be extended to other distributed big data applications, such as Spark, Hive, and Pig.
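The classification setup the summary describes can be sketched as follows. This is an illustrative sketch only, not the paper's implementation: the authors' features are mined from system resource utilization, Hadoop logs, and job-level metrics, whereas here the feature columns, the labeling rule, and all thresholds are synthetic assumptions made up for the example. It shows the two model families the summary names (a decision tree and a support vector machine) being trained to flag configurations likely to cause a crash or slowdown.

```python
# Hypothetical sketch of the crash/slowdown classifier described in the
# summary. All features, thresholds, and the labeling rule below are
# invented for illustration; they are NOT taken from the paper.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

n = 600
# Synthetic stand-ins for configuration and runtime features:
# [mapper heap (MB), reducer count, avg CPU util (%), avg memory util (%)]
X = np.column_stack([
    rng.uniform(256, 4096, n),   # assumed heap-size range
    rng.integers(1, 64, n),      # assumed reducer-count range
    rng.uniform(0, 100, n),
    rng.uniform(0, 100, n),
])
# Toy labeling rule (an assumption): a small heap under high memory
# pressure is labeled as a crash/slowdown-prone configuration (1).
y = ((X[:, 0] < 768) & (X[:, 3] > 70)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The two model families named in the summary.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
svm = SVC(kernel="rbf", gamma="scale").fit(X_tr, y_tr)

print(f"decision tree accuracy: {tree.score(X_te, y_te):.2f}")
print(f"svm accuracy: {svm.score(X_te, y_te):.2f}")
```

In a real deployment the feature matrix would be built from the mined Hadoop logs and job-level metrics rather than random draws, and the labels would come from observed crashes and slowdowns.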