Advanced Database system 2019
Chapter 1
Query Processing and Optimization
Introduction
In this chapter we shall discuss the techniques used by a DBMS to process, Optimize and
execute high level queries.
The techniques used to split complex queries into multiple simple operations and methods
of implementing these low-level operations.
The query optimization techniques are used to choose an efficient execution plan that
will minimize the runtime as well as many other types of resources such as number of
disk I/O, CPU time and so on.
What is Query Processing?
The procedure of transforming high level SQL query into a correct and efficient
execution plan expressed in low-level language.
When a database system receives a query for update or retrieval of information, it
goes through a series of compilation steps, called execution plan.
It goes through various phases.
1. First phase is called syntax checking phase: -the system parses the query and
checks that it follows the syntax rules or not. It then matches the objects in the
query syntax with the view tables and columns listed in the system table. This
phase is divided into three: -Scanning, Parsing, Validating
A. Scanner: The scanner identifies the language tokens such as SQL
Keywords, attribute names, and relation names in the text of the query.
B. Parser: The parser checks the query syntax to determine whether it is
formulated according to the syntax rules of the query language.
C. Validation: The query must be validated by checking that all attributes
and relation names are valid and semantically meaningful names in the
schema of the particular database being queried.
AUWC, School of Technology and Informatics Page 1
Advanced Database system 2019
2. In second phase the SQL query is translated in to an algebraic expression
using various rules. So that the process of transforming a high level SQL query
into a relational algebraic form is called Query Decomposition. The relational
algebraic expression now passes to the query optimizer.
3. In third phase Optimization is performed by substituting equivalent
expression depends on the factors such that the existence of certain database
structures, whether or not a given file is stored, the presence of different
indexes and so on. Query optimization module work in cycle with the join
manager module to improve the order in which joins are performed.
At this stage the cost model and several other estimation formulas
are used to rewrite the query.
The modified query is written to utilize system resources so as to bring
the optimal performance.
The query optimizer then generates an action plan also called
execution plan. This action plans are converted into a query codes that
are finally executed by a run time database processor.
Query Code Generator: It generates the code to execute the plan.
The run time database processor estimate the cost of each action plan
and chose the optimal one for the execution. It has the task of running
the query code whether in compiled or interpreted mode. If a runtime
error results an error message is generated by the runtime database
processor.
AUWC, School of Technology and Informatics Page 2
Advanced Database system 2019
Figure 1: -Steps in Processing High-Level Query
What is the aim of query processing?
To transform a query written in a high-level language, typically SQL, into a correct
and efficient execution strategy expressed in a low-level language (implementing the
relational algebra), and to execute the strategy to retrieve the required data.
Query Analyzer
The syntax analyzer takes the query from the users, parses it into tokens and analyses the
tokens (symbols) and their order to make sure they follow the rules of language grammar.
If the error is found in the query submitted by the user, it is rejected and an error
code together with an explanation of why the query was rejected is return to the user.
Query Decomposition
In query decomposition the query processing aims are to transfer the high-level query
into a relational algebra query and to check whether that query is syntactically or
semantically correct. Thus the query decomposition is start with a high level query
and transform into query graph of low-level operations, which satisfy the query.
AUWC, School of Technology and Informatics Page 3
Advanced Database system 2019
The SQL query is decomposed into query blocks (low-level operations), which form
the basic unit. Hence nested queries within a query are identified as separate query
blocks.
The query decomposer goes through five stages of processing for decomposition
into low-level operation and translation into algebraic expressions.
Query Analysis
During the query analysis phase, the query is syntactically analyzed using the
programming language compiler (parser). A syntactically legal query is then
validated, using the system catalog, to ensure that all data objects (relations and
attributes) referred to by the query are defined in the database.
The type specification of the query qualifiers and result is also checked at this stage.
Example: -SELECT emp_nm FROM EMPLOYEE WHERE emp_desg>100
This query will be rejected because the comparison ">100" is
incompatible with the data type of emp_desg which is a variable character
string.
AUWC, School of Technology and Informatics Page 4
Advanced Database system 2019
At the end of query analysis phase, the high-level query (SQL) is transformed into
some internal representation that is more suitable for processing. This internal
representation is typically a kind of query tree.
A Query Tree is a tree data structure that corresponds expression.
A Query Tree is also called a relational algebra tree.
Leaf node of the tree, representing the base input relations of the query.
Internal nodes result of applying an operation in the algebra.
Root of the tree representing a result of the query.
SELECT (P.proj_no, P.dept_no, E.name, E.add, E.dob)
FROM PROJECT P, DEPARTMENT D, EMPLOYEE E
WHERE P.dept_no = D.d_no AND D.mgr_id = E.emp_id AND
P.proj_loc = `Mumbai) ;
The three relations PROJECT, DEPARTMENT, EMPLOYEE are represent as a
leaf nodes P, D and E, while the relational algebra operations of the represented
by internal tree nodes.
Same SQL query can have many different relational algebra expressions and
hence many different query trees.
The query parser typically generates a standard initial (canonical) query tree.
AUWC, School of Technology and Informatics Page 5
Advanced Database system 2019
Query Normalization
The primary phase of the normalization is to avoid redundancy. The normalization
phase converts the query into a normalized form that can be more easily manipulated.
In the normalization phase, a set of equivalency rules are applied so that the projection
and selection operations included on the query are simplified to avoid redundancy.
The projection operation corresponds to the SELECT clause of SQL query and the
selection operation correspond to the predicate found in WHERE clause.
The equivalency transformation rules are applied.
Semantic Analyzer
The objective of this phase of query processing is to reduce the number of predicates.
The semantic analyzer rejects the normalized queries that are incorrectly formulated.
A query is incorrectly formulated if components do not contribute to the generation
of result. This happens in case of missing join specification. A query is contradictory if
its predicate cannot satisfy by any tuple in the relation.
The semantic analyzer examine the relational calculus query (SQL) to make sure it contains only data
objects that is table, columns, views, indexes that are defined in the database catalog. It makes sure
that each object in the query is referenced correctly according to its data type.
In case of missing join specifications the components do not contribute to the
generation of the results, and thus, a query may be incorrect formulated.
Query Simplifier
The objectives of this phase are: -
To detect redundant qualification,
To eliminate common sub-expressions and
To transform sub-graph too semantically equivalent but easier and efficiently
computed form.
Why to simplify?
Commonly integrity constraints, view definitions and access restrictions are introduced into
the graph at this stage of analysis so that the query must be simplified as much as possible.
AUWC, School of Technology and Informatics Page 6
Advanced Database system 2019
Integrity constraints defines constants which must holds for all state of database, so any
query that contradict an integrity constraints must be avoid and can be rejected without
accessing the database.
Query Restructuring
In the final stage of the query decomposition, the query can be restructured to give a
more efficient implementation. Transformation rules are used to convert one
relational algebra expression into an equivalent form that is more efficient.
The query can now be regarded as a relational algebra program, consisting of a series
of operations on relation.
Query Optimization
The primary goal of query optimization is of choosing an efficient execution strategy for
processing a query. The query optimizer attempts to minimize the use of certain
resources (mainly the number of I/O and CPU time) by selecting a best execution
plan (access plan).
A query optimization start during the validation phase by the system to validate the
user has appropriate privileges. Now an action plan is generate to perform the query.
AUWC, School of Technology and Informatics Page 7
Advanced Database system 2019
Relational algebra query tree generated by the query simplifier module of query
decomposer.
Estimation formulas used to determine the cardinality of the intermediate result
table.
A cost Model.
Statistical data from the database catalogue.
The output of the query optimizer is the execution plan in form of optimized
relational algebra query.
A query typically has many possible execution strategies, and the process of choosing
a suitable one for processing a query is known as Query Optimization.
The basic issues in Query Optimization
How to use available indexes?
How to use memory to accumulate information and perform immediate steps such
as sorting?
How to determine the order in which joins should be performed?
AUWC, School of Technology and Informatics Page 8
Advanced Database system 2019
Objective of query optimization
The term query optimization does not mean giving always an optimal (best) strategy
as the execution plan. It is just a responsibly efficient strategy for execution of the
query.
The decomposed query block of SQL is translating into an equivalent extended
relational algebra expression and then optimized.
Techniques for Query Optimization
1. The first technique is based on Heuristic Rules for ordering the operations in a
query execution strategy.
2. The second technique involves the systematic estimation of the cost of the
different execution strategies and choosing the execution plan with the lowest
cost.
3. The third technique is Semantic query optimization: - it is used with the
combination of the heuristic query transformation rules. It uses constraints
specified on the database schema such as unique attributes and other more
complex constraints, in order to modify one query into another query that
is more efficient to execute.
Heuristic Rules
The heuristic rules are used as an optimization technique to modify the internal
representation of query. Usually, heuristic rules are used in the form of query tree of
query graph data structure, to improve its performance.
One of the main heuristic rules is to apply SELECT operation before applying the
JOIN or other BINARY operations. This is because the size of the file resulting
from a binary operation such as JOIN is usually a multi-value function of the sizes of
the input files.
The main idea behind is to reduce intermediate results. This includes performing
SELECT operation to reduce the number of tuples &
PROJECT operation to reduce number of attributes.
AUWC, School of Technology and Informatics Page 9
Advanced Database system 2019
The SELECT and PROJECT reduced the size of the file and hence, should be
applied before the JOIN or other binary operation. Heuristic query optimizer
transforms the initial (canonical) query tree into final query tree using equivalence
transformation rules. This final query tree is efficient to execute.
Examples for query Optimization: Identify all managers who work in a London branch
SQL:-
SELECT * FROM Staff s, Branch b WHERE s.branchNo = b.branchNo AND
s.position = ‘Manager’ AND b.city = ‘london’;
Results in these equivalent relational algebra statements
1. S(position =’Manager’) ^(city=’London’) ^(Staff.branchNo=Branch.branchNo) (Staff X Branch)
2. S(position =’Manager’) ^(city=’London’) (Staff Staff.branchNo = Branch.branchNo Branch)
3. [S(position =’Manager’)( Staff)] Staff.branchNo = Branch.branchNo [s(city=‘London’)
(Branch)]
Assume:
1000 tuples in Staff.
50 Managers
50 tuples in Branch.
5 London branches
No indexes or sort keys
All temporary results are written back to disk (memory is small)
Tuples are accessed one at a time (not in blocks)
Query 1 (Bad)
Requires (1000+50) disk accesses to read from Staff and Branch relations
Creates temporary relation of Cartesian Product (1000*50) tuples
Requires (1000*50) disk access to read in temporary relation and test predicate
Total Work = (1000+50) + 2*(1000*50) = 101,050 I/O operations
AUWC, School of Technology and Informatics Page 10
Advanced Database system 2019
Query 2 (Better)
Again requires (1000+50) disk accesses to read from Staff and Branch
Joins Staff and Branch on branchNo with 1000 tuples (1 employee : 1 branch )
Requires (1000) disk access to read in joined relation and check predicate
Total Work = (1000+50) + 2*(1000) = 3050 I/O operations
3300% Improvement over Query 1
Query 3 (Best)
Read Staff relation to determine ‘Managers’ (1000 reads)
Create 50 tuple relation(50 writes)
Read Branch relation to determine ‘London’ branches (50 reads)
Create 5 tuple relation(5 writes)
Join reduced relations and check predicate (50 + 5 reads)
Total Work = 1000 + 2*(50) + 5 + (50 + 5) = 1160 I/O operations
8700% Improvement over Query 1
Summary of Heuristics for Algebraic Optimization:
1. The main heuristic is to apply first the operations that reduce the size of intermediate
results.
2. Perform select operations as early as possible to reduce the number of tuples and
perform project operations as early as possible to reduce the number of attributes.
(This is done by moving select and project operations as far down the tree as
possible.)
3. The select and join operations that are most restrictive should be executed before
other similar operations. (This is done by reordering the leaf nodes of the tree among
themselves and adjusting the rest of the tree appropriately.)
AUWC, School of Technology and Informatics Page 11