
SQL QUERY EXECUTION OPTIMIZATION ON SPARK SQL
Author(s) - G. Mozhaiskii, Vladimir Korkhov, Ivan Gankevich
Publication year - 2021
Publication title - 9th international conference "distributed computing and grid technologies in science and education"
Language(s) - English
Resource type - Conference proceedings
DOI - 10.54546/mlit.2021.37.73.001
Subject(s) - computer science , spark (programming language) , sql , database , skew , big data , in memory processing , query by example , programming paradigm , variety (cybernetics) , upgrade , operating system , programming language , world wide web , web search query , search engine , telecommunications , artificial intelligence
The Spark and Hadoop ecosystem includes a wide variety of components and can be integrated with any tool required for Big Data today. From release to release, the developers of these frameworks optimize the internals of the components and make their usage more flexible and elaborate. Nevertheless, since the invention of MapReduce as a programming model and the first Hadoop releases, data skew has been the main problem of distributed data processing. Data skew leads to performance degradation, i.e., a slowdown of application execution due to idling while waiting for resources to become available. The newest versions of the Spark framework handle this situation easily out of the box. However, in corporate environments with multiple large-scale projects whose development started years ago, there is no opportunity to upgrade tool versions and the corresponding logic. In this article we consider approaches to optimizing SQL query execution in the presence of data skew, using a concrete example with HDFS and Spark SQL 2.3.2.
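
To make the notion of data skew concrete, the following is a minimal sketch in plain Python (no Spark required) of how a single "hot" key overloads one partition under hash partitioning, and how key salting spreads that load. Salting is a commonly used workaround and is shown here only as an illustration; the abstract does not state which approaches the authors actually evaluate. All names and parameters (`NUM_PARTITIONS`, `SALT_BUCKETS`, `make_skewed_keys`, and so on) are hypothetical and chosen for this example.

```python
import random
from collections import Counter

NUM_PARTITIONS = 8   # hypothetical number of shuffle partitions
SALT_BUCKETS = 16    # hypothetical number of salt values per key


def make_skewed_keys(n_rows=100_000, hot_share=0.9, n_cold_keys=1_000, seed=42):
    """Generate keys where a single hot key (0) holds most of the rows."""
    rng = random.Random(seed)
    return [
        0 if rng.random() < hot_share else rng.randint(1, n_cold_keys)
        for _ in range(n_rows)
    ]


def salt_keys(keys, salt_buckets=SALT_BUCKETS, seed=7):
    """Attach a random salt to every key, yielding (key, salt) pairs."""
    rng = random.Random(seed)
    return [(k, rng.randrange(salt_buckets)) for k in keys]


def partition_sizes(keys, num_partitions=NUM_PARTITIONS):
    """Count rows per partition when rows are hash-partitioned by key."""
    counts = Counter(hash(k) % num_partitions for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew_ratio(sizes):
    """Largest partition divided by the mean partition size (1.0 = balanced)."""
    return max(sizes) / (sum(sizes) / len(sizes))


if __name__ == "__main__":
    keys = make_skewed_keys()
    plain = partition_sizes(keys)
    salted = partition_sizes(salt_keys(keys))
    print("plain  partition sizes:", plain, "skew ratio: %.2f" % skew_ratio(plain))
    print("salted partition sizes:", salted, "skew ratio: %.2f" % skew_ratio(salted))
```

In the unsalted case, every row with the hot key lands in the same partition, so one task processes most of the data while the others sit idle. In a real join, salting one side also requires replicating the matching rows of the other side once per salt value, which trades extra data volume for a more even distribution of work.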