Predicting Hadoop misconfigurations using machine learning
Author(s) -
Robert Andrew,
Gupta Apaar,
Shenoy Vinayak,
Sitaram Dinkar,
Kalambur Subramaniam
Publication year - 2020
Publication title -
Software: Practice and Experience
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.437
H-Index - 70
eISSN - 1097-024X
pISSN - 0038-0644
DOI - 10.1002/spe.2790
Subject(s) - computer science, workload, Spark (programming language), crash, big data, set (abstract data type), support vector machine, computer cluster, machine learning, resource, decision tree, variety, distributed computing, data mining, database, artificial intelligence, operating system, computer network, programming language
Summary - Distributed applications are popular for heavy workloads where the resources of a single machine are not sufficient. These applications come with many parameters to tune so that cluster resources can be used effectively. However, misconfiguring any of the available parameters may result in suboptimal performance of one or more machines in the cluster; such events may go unnoticed or can result in crashes. This problem has no straightforward solution because of the variety of parameters and the vastly different workloads being processed. In this article, we propose a methodology for machine learning‐based detection of misconfigurations. We collect data mined from system resource utilization, Hadoop logs, and job‐level metrics to train models using a decision tree and a support vector machine. The models are used to identify whether a set of configuration parameters could result in a crash or a slowdown for a specific workload. The approach explained in this article can be extended to other distributed big data applications, such as Spark, Hive, and Pig.
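The classification setup the summary describes can be sketched as follows. This is an illustrative sketch only, not the paper's implementation: the authors' features are mined from system resource utilization, Hadoop logs, and job-level metrics, whereas here the feature columns, the labeling rule, and all thresholds are synthetic assumptions made up for the example. It shows the two model families the summary names (a decision tree and a support vector machine) being trained to flag configurations likely to cause a crash or slowdown.

```python
# Hypothetical sketch of the crash/slowdown classifier described in the
# summary. All features, thresholds, and the labeling rule below are
# invented for illustration; they are NOT taken from the paper.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

n = 600
# Synthetic stand-ins for configuration and runtime features:
# [mapper heap (MB), reducer count, avg CPU util (%), avg memory util (%)]
X = np.column_stack([
    rng.uniform(256, 4096, n),   # assumed heap-size range
    rng.integers(1, 64, n),      # assumed reducer-count range
    rng.uniform(0, 100, n),
    rng.uniform(0, 100, n),
])
# Toy labeling rule (an assumption): a small heap under high memory
# pressure is labeled as a crash/slowdown-prone configuration (1).
y = ((X[:, 0] < 768) & (X[:, 3] > 70)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The two model families named in the summary.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
svm = SVC(kernel="rbf", gamma="scale").fit(X_tr, y_tr)

print(f"decision tree accuracy: {tree.score(X_te, y_te):.2f}")
print(f"svm accuracy: {svm.score(X_te, y_te):.2f}")
```

In a real deployment the feature matrix would be built from the mined Hadoop logs and job-level metrics rather than random draws, and the labels would come from observed crashes and slowdowns.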