Parallel Query Processing and Optimization

Parallel query processing in database systems enhances performance by executing parts of a query simultaneously across multiple processors. It involves various techniques such as I/O parallelism, intra-query parallelism, and inter-query parallelism to optimize query execution. The approach improves efficiency, scalability, and resource utilization in distributed databases, addressing challenges like data skew and synchronization.


Query processing is the process through which a Database Management System (DBMS) parses, verifies, and optimizes a given query before creating low-level code that the database understands.

Like any other high-level language (HLL), where code is first generated and then executed to perform various operations, query processing in a DBMS has two phases: compile time and runtime.

The use of declarative query languages and query optimization is one of the main factors contributing to the success of RDBMS technology. Any database allows users to write queries that request specific data, and the database then uses effective methods to locate the requested data.

The task of the query optimizer is to assess all possible ways to execute a query and choose the most effective one. This is done during the optimization process.

The development of advanced database systems has coincided with notable advances in processor and distributed computing technology. Parallel database systems have emerged as a result of combining these two trends.

Data management and parallel processing techniques are combined in parallel database systems to offer scalability, high performance, and high availability for data-intensive applications.

Subject Matter
Parallel query processing in database systems has been the subject of a great deal of research over the past two decades, and its results are now used as components of major commercial DBMSs. The majority of the research focused on shared-nothing (SN) architectures, such as the research prototypes Gamma and Bubba.

The shared-nothing approach is well suited to current grid and cluster computing, while symmetric multiprocessing (SMP) architectures were the focus of strategies such as XPRS and Volcano.

A database optimization approach based on CMP (chip multiprocessing) has been studied by numerous other researchers, but the majority of their effort went into optimizing join operations while taking into account the L2 cache and the parallel buffers of shared main memory.

By dividing a query into portions that may be executed in parallel, parallelism enables a single query to run across multiple processors, and it also lets numerous queries execute at the same time. A shared-nothing architecture can help with this.

With the availability of more and more resources, such as processors and discs, parallelism is also employed to speed up query execution.

The following techniques can be used to make a query parallel:

I/O parallelism
Intra-query parallelism
Inter-query parallelism
Intra-operation parallelism
Inter-operation parallelism
I/O parallelism
This type of parallelism involves partitioning the relations across the discs in order to speed up their retrieval.

The input data is partitioned, and each partition is processed simultaneously. After all of the partitioned data has been processed, the results are combined. This is also known as data partitioning.

Hash partitioning is best suited for point queries based on the partitioning attribute, and it has the benefit of offering an even distribution of data across the discs.

It should be mentioned that partitioning is beneficial for sequential scans of a full table: with the table stored across “n” discs, scanning it takes roughly 1/n of the time needed on a single-disc system. In I/O parallelism, there are four different methods of partitioning:

Hash partitioning
A hash function is a quick mathematical operation. The partitioning attributes of each row in the original relation are hashed to choose a disc.

Let’s say that the data is to be partitioned across 4 drives, numbered disk0, disk1, disk2, and disk3. A row is then stored on disk3 if the hash function returns 3.
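As an illustration, hash partitioning can be sketched in a few lines of Python (the row layout, the "city" partitioning attribute, and the four-disc count are assumptions; Python's built-in hash stands in for the DBMS's hash function):

```python
# Assign each row to a disc by hashing its partitioning attribute.
# Illustrative sketch: row layout, attribute, and disc count are assumptions.
def hash_partition(rows, key, n_disks):
    disks = [[] for _ in range(n_disks)]
    for row in rows:
        disks[hash(row[key]) % n_disks].append(row)
    return disks

rows = [{"id": i, "city": c} for i, c in enumerate(["A", "B", "C", "A"])]
disks = hash_partition(rows, "city", 4)
```

Rows with the same partitioning-attribute value always hash to the same disc, which is exactly why this scheme serves point queries well.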

Range partitioning
Each disc receives a contiguous range of attribute values when using range partitioning. For instance, if we are range-partitioning three discs numbered 0, 1, and 2, tuples with a value less than 5 are written to disk0, values from 5 to 40 are sent to disk1, and values above 40 are written to disk2.

It has several benefits, such as placing tuples whose attribute values lie within a specified range on the same disc.
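The three-disc example above can be sketched as follows (a minimal Python illustration of the <5 / 5–40 / >40 split; a real DBMS would drive the boundaries from catalog metadata):

```python
def range_partition(values):
    # <5 -> disk0, 5..40 -> disk1, >40 -> disk2 (as in the example above)
    disks = [[], [], []]
    for v in values:
        if v < 5:
            disks[0].append(v)
        elif v <= 40:
            disks[1].append(v)
        else:
            disks[2].append(v)
    return disks

disks = range_partition([3, 17, 55, 40])
# -> [[3], [17, 40], [55]]
```

Because each disc holds one contiguous range, a range predicate such as "value between 10 and 30" touches only disk1.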

Round-robin partitioning
Tuples can be read in any order in this method. It sends the i-th tuple to disc number (i mod n).

Therefore, discs receive new rows of data in turn. For applications that read the full relation sequentially for each query, this strategy ensures an even distribution of tuples across drives.
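The i mod n rule can be sketched directly (the tuple values and three-disc count are assumptions for illustration):

```python
def round_robin_partition(rows, n_disks):
    # The i-th tuple goes to disc number (i mod n), so discs fill in turn.
    disks = [[] for _ in range(n_disks)]
    for i, row in enumerate(rows):
        disks[i % n_disks].append(row)
    return disks

disks = round_robin_partition(list(range(7)), 3)
# -> [[0, 3, 6], [1, 4], [2, 5]]
```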

Schema Partitioning
Various tables inside a database are put on different discs using a
technique called schema partitioning.
Intra-query parallelism
Using a shared-nothing parallel architecture, intra-query parallelism refers to processing a single query in parallel on many CPUs. It employs two different strategies:

First method — each CPU executes a duplicate of the same task, but on its own small portion of the data.

Second method — the task is broken up into various subtasks, and each CPU carries out a separate subtask.
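The first method can be sketched as follows (illustrative only: the data, the filter task, and the four-way split are assumptions, and Python threads stand in for the CPUs of a real parallel system):

```python
from concurrent.futures import ThreadPoolExecutor

# First method: every worker runs the SAME task (a filter) on its own
# partition of the data; partial results are merged afterwards.
rows = list(range(100))                      # hypothetical relation

def scan_partition(part):
    return [r for r in part if r % 7 == 0]   # the duplicated task

partitions = [rows[i::4] for i in range(4)]  # round-robin split for 4 CPUs
with ThreadPoolExecutor(max_workers=4) as pool:
    results = pool.map(scan_partition, partitions)
matches = sorted(r for part in results for r in part)
```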

Inter-query parallelism
Each CPU executes numerous transactions when inter-query parallelism is used. This is known as parallel transaction processing. To support inter-query parallelism, the DBMS leverages transaction dispatching.

We can also employ a variety of techniques, such as efficient lock management. Without such parallelism, each query runs sequentially, which slows down the overall running time.

In such circumstances, the DBMS must be aware of the locks acquired by the various transactions running on different processors. Inter-query parallelism on a shared-storage architecture works well when concurrent transactions do not access the same data.

Additionally, transaction throughput is boosted, and it is the simplest form of parallelism in a DBMS.

Intra-operation parallelism
In this type of parallelism, we execute each individual operation of
a task, such as sorting, joins, projections, and so forth, in parallel.
Intra-operation parallelism has a very high parallelism level.

Database systems naturally employ this kind of parallelism. Consider the following SQL example:

SELECT * FROM vehicles ORDER BY model_number;

Since a relation might contain a large number of records, the relational operation in the aforementioned query is sorting.

Because this operation can be performed on distinct subsets of the relation on several processors, sorting the data takes less time.
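This parallel sort can be sketched by sorting disjoint subsets concurrently and then merging the sorted runs (the data and the four-way split are assumptions; threads stand in for processors):

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Intra-operation parallelism: sort disjoint subsets in parallel,
# then merge the sorted runs into one ordered result.
data = [42, 7, 19, 3, 88, 61, 5, 27, 14, 70]
chunks = [data[i::4] for i in range(4)]      # 4 subsets for 4 processors
with ThreadPoolExecutor(max_workers=4) as pool:
    sorted_runs = list(pool.map(sorted, chunks))
result = list(heapq.merge(*sorted_runs))     # merge step is sequential
```

The merge step is cheap compared with the sorts, which is why the subset sorts dominate and benefit from running in parallel.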

Inter-operation parallelism
This term refers to the concurrent execution of many operations
within a query expression. They come in two varieties:

Pipelined parallelism — in pipelined parallelism, a second operation consumes rows of the first operation’s output before the first operation has finished producing its whole set of output rows.

Additionally, it is feasible to run these two operations concurrently on separate CPUs, so that one operation consumes tuples in parallel while the other produces them.

It is advantageous for systems with a limited number of CPUs and avoids writing intermediate results to disc.
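Pipelining can be modelled with Python generators, where the downstream operator pulls rows as soon as the upstream operator yields them (the scan/project operators and the row layout are assumptions; a real system would run the stages on separate CPUs):

```python
# Pipelined parallelism sketch: the consumer processes each row as soon
# as the producer emits it, so no full intermediate result is stored.
def scan(rows):
    for row in rows:
        yield row           # produce one tuple at a time

def project(rows):
    for row in rows:
        yield row["id"]     # consume tuples as they arrive

rows = [{"id": i} for i in range(5)]
ids = list(project(scan(rows)))
```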

Independent parallelism — in this form of parallelism, operations contained within a query expression that are independent of one another may be carried out concurrently. This form of parallelism is mainly useful at a lower degree of parallelism.

Execution of a Parallel Query
The relational model has been favoured over the earlier hierarchical and network models in commercial database technology. Data independence and high-level query languages (e.g., SQL) are the key advantages that relational database systems (RDBMSs) have over their forerunners.

Programmer productivity is increased, and routine optimization is encouraged.

Additionally, distributed database management is made easier by the relational model’s set-oriented structure. RDBMSs may now offer performance levels comparable to older systems thanks to a decade of development and tuning.

They are therefore widely employed in commercial data processing for OLTP (online transaction processing) or decision-support systems. Parallel processing makes use of multiprocessor computers, with many processors working together, to run application programmes and boost performance.

It is most commonly used in scientific computing, where it speeds up the responses of numerical applications.
The development of parallel database systems is an example of how database management and parallel computing can work together. In a parallel database system, parallel query optimization (PQO) divides a given SQL statement so that its components can run concurrently on several processors in a multiprocessor machine.

Full table scans, sorting, sub-queries, data loading, and other common operations can all be performed in parallel.

As a form of parallel database optimization, parallel query execution enables the division of SELECT or DML operations into many smaller chunks that can be executed by PQ slaves on different CPUs in a single box.

In the first phase shown in the figure, sorting and rewriting, the order of joins and the method for computing each join are fixed. The second phase, parallelization, turns the query tree into a parallel plan; this phase is divided into two parts: extraction of parallelism and scheduling.

Query Processing and Optimization Approach in Parallel Databases
A huge task is broken down into numerous smaller tasks using parallel processing, which then runs each of the smaller tasks on various nodes and processors at the same time.
Parallel query optimization means improving how database
queries run by using multiple processors at the same time. This is
usually done by splitting the data into parts so different
processors can work on them simultaneously.

The greater task thus gets finished faster. In sequential processing, by contrast, separate jobs compete for the same resource: only task 1 can start immediately, task 2 cannot begin until task 1 is finished, and task 3 must wait in turn.

A larger portion of the CPU is allocated to the jobs during parallel processing. There is no waiting involved because each autonomous task starts working right away on its own processor.

Concurrency management, task synchronisation, resource sharing, data placement, and network scaling are qualities that a parallel database system should retain. Synchronisation is a vital component of success.

In a shared-nothing or loosely coupled design, locking mechanisms are primarily used to synchronise the database; the same method is implemented for data placement in a tightly coupled architecture. The load balancing of a parallel database system depends on the placement of the data: ideally, each parallel process should operate on a separate dataset to prevent interference between processes.

By declustering (horizontal partitioning) the relations based on a hash function or range index and assigning each partition to a distinct memory module, it is possible to create these independent datasets. The following execution plan is suggested:

First, reduce the overall amount of work required to evaluate the query.

Then, distribute that work as evenly as possible across the available processors.

Parallel Query Optimization in Distributed Databases

Definition: It is the process of enhancing query performance by executing parts of a query in parallel across multiple nodes in a distributed database.

Types of parallelism:

Intra-query parallelism: the entire query is split into parallel tasks.
Intra-operation parallelism: a single operation (e.g., scan, join) runs in parallel.
Inter-operation parallelism: multiple query operations run concurrently.

Key steps in optimization:

Data partitioning: data is split across nodes to enable parallel access.
Task scheduling: tasks are assigned to different processors or nodes to balance the load.
Minimizing communication: the optimizer reduces data movement between nodes to save time.
Cost estimation: different execution plans are compared, and the most efficient is selected.

Advantages:

Faster query execution.
Efficient use of distributed hardware.
Scalable to large datasets.

Challenges:

Data skew leading to load imbalance.
Synchronization and coordination issues.
Network delays and node failures.

Conclusion: Parallel query optimization is crucial in distributed databases for improving performance and scalability by effectively using multiple processors or nodes.

