The impact of columnar file formats on SQL‐on‐hadoop engine performance: A study on ORC and Parquet

Ivanov Todor; Pergolesi Matteo



Author(s) - Ivanov Todor, Pergolesi Matteo

Publication year - 2019

Publication title - Concurrency and Computation: Practice and Experience

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.309

H-Index - 67

eISSN - 1532-0634

pISSN - 1532-0626

DOI - 10.1002/cpe.5523

Subject(s) - computer science, benchmark (surveying), database, sql, file format, compression ratio, operating system, engineering, geodesy, automotive engineering, geography, internal combustion engine

Summary - Columnar file formats provide an efficient way to store data that is queried by SQL‐on‐Hadoop engines. Related works evaluate the processing engine and the file format together, which makes it impossible to isolate the impact of each. In this work, we propose an alternative approach: by running each file format on the same processing engine, we compare the file formats themselves as well as their parameter settings. We apply this strategy to two processing engines, Hive and SparkSQL, and evaluate the performance of two columnar file formats, ORC and Parquet. We use BigBench (TPCx‐BB), a standardized application‐level benchmark for Big Data scenarios. Our experiments confirm that the choice of file format and its configuration significantly affect overall performance. We show that ORC generally performs better on Hive, whereas Parquet achieves the best performance with SparkSQL. Using ZLIB compression brings up to a 60.2% improvement with ORC, while Parquet achieves up to a 7% improvement with Snappy. The exceptions are queries involving text processing, which do not benefit from any compression.
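
The improvement figures quoted in the summary are relative comparisons between configurations of one file format on one engine. A minimal sketch of how such a percentage could be derived is given below. It assumes that "improvement" means the relative reduction in mean query runtime against an uncompressed baseline, which this page does not confirm. The function names, configuration labels, and runtimes are all hypothetical and are not results from the study.

```python
from statistics import mean


def relative_improvement(baseline_seconds, candidate_seconds):
    """Percentage by which the candidate runtime is lower than the baseline.

    Positive values mean the candidate is faster. This definition is an
    assumption made for illustration, not the paper's stated metric.
    """
    if baseline_seconds <= 0:
        raise ValueError("baseline runtime must be positive")
    return (baseline_seconds - candidate_seconds) / baseline_seconds * 100.0


def compare_configs(runtimes, baseline):
    """Compare every configuration against the named baseline.

    `runtimes` maps a configuration label to a list of repeated runtimes
    (in seconds) measured on the same engine, so that only the file format
    settings differ between entries.
    """
    base = mean(runtimes[baseline])
    return {
        name: relative_improvement(base, mean(times))
        for name, times in runtimes.items()
        if name != baseline
    }


# Hypothetical runtimes for one query on one engine (made-up numbers).
runtimes = {
    "orc_none": [100.0, 102.0, 98.0],
    "orc_zlib": [50.0, 52.0, 48.0],
    "orc_snappy": [80.0, 79.0, 81.0],
}

improvements = compare_configs(runtimes, baseline="orc_none")
for name, pct in sorted(improvements.items()):
    print(f"{name}: {pct:.1f}% faster than orc_none")
```

Keeping the engine fixed and varying only the format or compression setting mirrors the comparison strategy described in the summary. Averaging repeated runs is one common way to reduce measurement noise, though the study may have aggregated its measurements differently.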
