
SQL Query Execution Optimization on Spark SQL

2021, 9th International Conference "Distributed Computing and Grid Technologies in Science and Education"

Abstract

The Spark and Hadoop ecosystem includes a wide variety of components and can be integrated with any tool required for Big Data nowadays. From release to release, the developers of these frameworks optimize the inner workings of the components and make their usage more flexible and elaborate. Nevertheless, ever since the invention of MapReduce as a programming model and the first Hadoop releases, data skew has been the main problem of distributed data processing. Data skew leads to performance degradation, i.e., a slowdown of application execution due to idling while waiting for resources to become available. The newest Spark framework versions handle this situation out of the box. However, upgrading tool versions and the corresponding logic is often impossible in corporate environments with multiple large-scale projects whose development started years ago. In this article we consider approaches to the execution optimization of a SQL query in the case of data skew on a concrete example wi...

Key takeaways

  • The required tools are the Apache Spark framework version 2.3.2 and its SQL module.
  • The data were read using the Dask Python 3 library, then converted to Pandas data frames, and finally read by PySpark.
  • Examination of the query execution process with the Spark analytics module gives the following results: the data extraction process takes 7.6 seconds and leads to an unbalanced shuffle read/write of 877.4 MB (fig. 2).
  • To optimize the time efficiency of the SQL query (fig. 1), the following approaches were used: BroadcastHashJoin, tuning of the spark.sql.shuffle.partitions parameter, and bucketing.
  • A table summarizes the optimization techniques and their effect on query execution time.