
SQL Query Execution Optimization on Spark SQL

2021, 9th International Conference "Distributed Computing and Grid Technologies in Science and Education"

Abstract

The Spark and Hadoop ecosystem includes a wide variety of components and can be integrated with any tool required for Big Data nowadays. From release to release, the developers of these frameworks optimize the inner workings of the components and make their usage more flexible and elaborate. Nevertheless, ever since the invention of MapReduce as a programming model and the first Hadoop releases, data skew has been the main problem of distributed data processing. Data skew leads to performance degradation, i.e., a slowdown of application execution due to idling while waiting for resources to become available. The newest Spark framework versions handle this situation out of the box. However, upgrading tool versions and the corresponding logic is often impossible in corporate environments with multiple large-scale projects whose development started years ago. In this article we consider approaches to the execution optimization of a SQL query in the case of data skew on a concrete example wi...

Key takeaways

  • The required tools are the Apache Spark framework version 2.3.2 and its SQL module.
  • The data were read using the Dask Python 3 library, then converted to Pandas data frames, and finally read by PySpark.
  • Examination of the query execution process with the Spark analytics module gives the following results: the data extraction process takes 7.6 seconds and leads to an unbalanced shuffle read/write of 877.4 MB (fig. 2).
  • To optimize the time efficiency of the SQL query (fig. 1), the following approaches were used: BroadcastHashJoin, tuning of the spark.sql.shuffle.partitions parameter, and bucketing.
  • A table summarizes the optimization techniques and their effect on query execution time.