
Query Processing and Optimization
Unit 2
Chapter Outline
0. Introduction to Query Processing
1. Translating SQL Queries into Relational Algebra
2. Algorithms for External Sorting
3. Algorithms for SELECT and JOIN Operations
4. Algorithms for PROJECT and SET Operations
5. Implementing Aggregate Operations and Outer Joins
6. Combining Operations using Pipelining
7. Using Heuristics in Query Optimization
8. Using Selectivity and Cost Estimates in Query Optimization
9. Overview of Query Optimization in Oracle
10. Semantic Query Optimization
Overview of Query Processing
• Query processing is the process of retrieving data from a database in response to a user query.
• Query Processing
• Translation of high-level queries into low-level expressions that can be used at the physical level of the file system, query optimization, and actual execution of the query to get the result.
• Query Optimization
• The process in which multiple query-execution plans for satisfying a query are examined and the most efficient query plan is identified for execution.
• The process of choosing a suitable execution strategy for processing a query.
• Query processing comprises the activities involved in parsing, validating, optimizing, and executing a query.
• Two internal representations of a query:
• Query Tree
• Query Graph
Query Processing and Optimization

Query Processing Steps
1. Parsing and translation: translate the query into its internal form. This is then translated into relational algebra.
 • Query tree, query graph
 • The parser checks syntax and verifies relations.
2. Optimization: the process of choosing a suitable execution strategy for processing a query.
3. Evaluation: the query-evaluation engine takes a query-execution plan, executes that plan, and returns the answers to the query.
Translating SQL Queries into Relational Algebra
• A query submitted to the database system is first decomposed into query blocks.
• Query block:
• The basic unit that can be translated into the algebraic operators and optimized.
• A query block contains a single SELECT-FROM-WHERE expression, as well as GROUP BY and HAVING clauses if these are part of the block.
• Nested queries within a query are identified as separate query blocks.
• Aggregate operators in SQL must be included in the extended algebra.
Translating SQL Queries into Relational Algebra
SELECT LNAME, FNAME
FROM EMPLOYEE
WHERE SALARY > ( SELECT MAX (SALARY)
                 FROM EMPLOYEE
                 WHERE DNO = 5 );

This query decomposes into an outer block and an inner block:

Outer block:            Inner block:
SELECT LNAME, FNAME     SELECT MAX (SALARY)
FROM EMPLOYEE           FROM EMPLOYEE
WHERE SALARY > C        WHERE DNO = 5

π_LNAME,FNAME(σ_SALARY>C(EMPLOYEE))    ℱ_MAX SALARY(σ_DNO=5(EMPLOYEE))

where C is the result returned by the inner block. The optimizer would then choose an execution plan for each query block.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Additional Operators: Semi-Join and Anti-Join
• Semi-join
• Generally used for unnesting EXISTS, IN, and ANY subqueries
• Syntax: T1.X S= T2.Y
• T1 is the left table and T2 is the right table of the semi-join
• A row of T1 is returned as soon as T1.X finds a match with any value of T2.Y, without searching for further matches

Additional Operators: Semi-Join and Anti-Join (cont'd.)
Q (SJ): SELECT COUNT(*)
        FROM DEPARTMENT D
        WHERE D.Dnumber IN ( SELECT E.Dno
                             FROM EMPLOYEE E
                             WHERE E.Salary > 200000 )

is rewritten using a semi-join as:

SELECT COUNT(*)
FROM EMPLOYEE E, DEPARTMENT D
WHERE D.Dnumber S= E.Dno AND E.Salary > 200000;

Additional Operators: Semi-Join and Anti-Join (cont'd.)
• Anti-join
• Used for unnesting NOT EXISTS, NOT IN, and ALL subqueries
• Syntax: T1.x A= T2.y
• T1 is the left table and T2 is the right table of the anti-join
• A row of T1 is rejected as soon as T1.x finds a match with any value of T2.y
• A row of T1 is returned only if T1.x does not match with any value of T2.y

Additional Operators: Semi-Join and Anti-Join (cont'd.)
SELECT COUNT(*)
FROM EMPLOYEE
WHERE EMPLOYEE.Dno NOT IN ( SELECT DEPARTMENT.Dnumber
                            FROM DEPARTMENT
                            WHERE Zipcode = 30332 )

is rewritten using an anti-join as:

SELECT COUNT(*)
FROM EMPLOYEE, DEPARTMENT
WHERE EMPLOYEE.Dno A= DEPARTMENT.Dnumber AND Zipcode = 30332
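The semi-join and anti-join semantics can be sketched in a few lines of Python over in-memory rows. This is only an illustration of what the operators return (the row data and attribute names below are hypothetical); a real DBMS evaluates these operators over disk blocks and indexes.

```python
# Sketch of semi-join and anti-join semantics on in-memory tables.
def semi_join(left, right, l_key, r_key):
    # A left row is returned as soon as its key matches ANY right value;
    # no further matches are searched and no right columns are returned.
    right_keys = {row[r_key] for row in right}
    return [row for row in left if row[l_key] in right_keys]

def anti_join(left, right, l_key, r_key):
    # A left row is returned only if its key matches NO right value.
    right_keys = {row[r_key] for row in right}
    return [row for row in left if row[l_key] not in right_keys]

department = [{"Dnumber": 1}, {"Dnumber": 5}]
employee = [{"Ssn": "123", "Dno": 5}, {"Ssn": "456", "Dno": 7}]

# Departments that have at least one employee (semi-join):
print(semi_join(department, employee, "Dnumber", "Dno"))  # [{'Dnumber': 5}]
# Employees whose department is not in DEPARTMENT (anti-join):
print(anti_join(employee, department, "Dno", "Dnumber"))  # [{'Ssn': '456', 'Dno': 7}]
```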

Query Tree
• A query tree is a data structure that represents a relational algebra expression.
• It is a tree-like structure, with each node representing a relational algebra operator and each edge representing a relation.
• The root node of the tree represents the entire query, while the leaf nodes represent the base (input) relations.
• Two query trees are equivalent if their root relations (the query result) are the same.
• A query tree may have different execution plans.
• Some execution plans for a query tree are more efficient to execute than others.
Query Tree
• It is used to represent relational algebra expressions.
• Tree data structure:
• Input relations are represented as leaf nodes
• Relational algebra operations as internal nodes
• It specifies what operations to apply, and the order to apply them, but not how to actually implement the operations.
• Example:
• Relation: Book(Bid, B_name, Category, A_id, Price)
• SQL Query:
  SELECT B_name, Price
  FROM Book
  WHERE Price > 50;
Query Tree: Example
SELECT B_name, Price
FROM Book
WHERE Price > 50;

Two equivalent relational algebra expressions (and query trees):

σ_Price>50(π_B_name,Price(Book))        π_B_name,Price(σ_Price>50(Book))

Tree 1: σ_Price>50 at the root,         Tree 2: π_B_name,Price at the root,
over π_B_name,Price,                    over σ_Price>50,
over the leaf Book                      over the leaf Book
Query Evaluation Plan
• A query evaluation plan (QEP) is a sequence of operations that are used to evaluate a query.
• It is generated by the query optimizer, and it specifies how the query will be executed on the underlying hardware.
• The annotations in the query tree specify the algorithm or the index to be used to evaluate each operation.
• This allows the query optimizer to generate a more efficient execution plan for the query.

π_B_name,Price(σ_Price>50(Book))
(query tree: π_B_name,Price at the root, over σ_Price>50, over the leaf Book)

• Note:
• Different evaluation plans for a given query can have different costs. It is the responsibility of the query optimizer to generate a least costly plan.
• The best query evaluation plan is finally submitted to the query evaluation engine for actual execution.
Algorithms for External Sorting
• Sorting is an often-used algorithm in query processing.
• External sorting:
• Algorithms suitable for large files that do not fit entirely in main memory
• Sort-merge strategy based on sorting smaller subfiles (runs) and merging the sorted runs
• Requires buffer space in main memory
• Buffer space is the DBMS cache, an area in the computer's main memory that is controlled by the DBMS.
• The buffer space is divided into individual buffers, where each buffer is the same size in bytes as one disk block.
• Thus, one buffer can hold the contents of exactly one disk block.
• External sort-merge is the most suitable method for external sorting.
Algorithms for External Sorting (cont’d.)
• Sort-Merge strategy:
• The sort-merge strategy is a divide-and-conquer algorithm for sorting
large datasets.
• It works by dividing the data into smaller subfiles (runs), sorting each run,
and then merging the sorted runs together to form a single sorted file.
• The sorting phase of the sort-merge strategy is as follows:
• The data is divided into subfiles of size nB, where nB is the available buffer space.
• Each subfile is sorted using a standard sorting algorithm, such as merge sort or
quicksort.
• The sorted subfiles are written to temporary files.
• The merging phase of the sort-merge strategy is as follows:
• The sorted subfiles are merged together in pairs, creating larger sorted subfiles.
• The merging continues until there is a single sorted file.
• In the merging phase, the sorted runs are merged during one or more merge passes.
Algorithms for External Sorting (cont'd.)
• Sorting phase:
• nR = ⌈b/nB⌉
• Merging phase:
• dM = min(nB − 1, nR)
• nP = ⌈log_dM(nR)⌉
• Where:
• nR: number of initial runs
• b: number of file blocks
• nB: available buffer space (in blocks)
• dM: degree of merging
• nP: number of merge passes
• Degree of merging
• Number of sorted subfiles that can be merged in each merge step
• Performance of the sort-merge algorithm
• Number of disk block reads and writes before sorting is completed
Example: External Sorting Using Sort-Merge
• If the number of available main memory buffers nB = 5 disk blocks and the size of the file b = 1,024 disk blocks, then:
• nR = ⌈b/nB⌉ = ⌈1,024/5⌉ = 205 initial runs, each of size 5 blocks (except the last run, which will have only 4 blocks).
• Hence, after the sorting phase, 205 sorted runs (205 sorted subfiles of the original file) are stored as temporary subfiles on disk.
• In the merging phase, the sorted runs are merged during one or more merge passes. Each merge pass can have one or more merge steps.
• The degree of merging (dM) is the number of sorted subfiles that can be merged in each merge step.
• During each merge step, one buffer block is needed to hold one disk block from each of the sorted subfiles being merged, and one additional buffer is needed to hold one disk block of the merge result, which will produce a larger sorted file that is the result of merging several smaller sorted subfiles.
• Hence, dM is the smaller of (nB − 1) and nR, and the number of merge passes is ⌈log_dM(nR)⌉.
• In our example, where nB = 5, dM = 4 (four-way merging):
• The 205 initial sorted runs are merged 4 at a time in each step into 52 larger sorted subfiles at the end of the first merge pass.
• These 52 sorted files are then merged 4 at a time into 13 sorted files, which are then merged into 4 sorted files, and finally into 1 fully sorted file, which means that four passes are needed.
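The run-and-pass arithmetic of this example can be sketched directly in a few lines (a small helper, assuming nR > 1 so the logarithm is well defined):

```python
# Reproduces the external sort-merge cost arithmetic:
# b file blocks sorted with nB buffer blocks.
import math

def external_sort_cost(b, nB):
    nR = math.ceil(b / nB)             # initial runs from the sorting phase
    dM = min(nB - 1, nR)               # degree of merging (one buffer reserved for output)
    nP = math.ceil(math.log(nR, dM))   # number of merge passes
    return nR, dM, nP

print(external_sort_cost(1024, 5))  # (205, 4, 4)
```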
Algorithms for External Sorting (Sort-Merge Algorithm)
(figure: sort-merge algorithm pseudocode)
Algorithms for SELECT Operation
• SELECT operation: there are many algorithms for executing a SELECT operation, which is basically a search operation to locate the records in a disk file that satisfy a certain condition.
• Examples:
• (OP1): σ_SSN='123456789'(EMPLOYEE)
• (OP2): σ_DNUMBER>5(DEPARTMENT)
• (OP3): σ_DNO=5(EMPLOYEE)
• (OP4): σ_DNO=5 AND SALARY>30000 AND SEX='F'(EMPLOYEE)
• (OP5): σ_ESSN='123456789' AND PNO=10(WORKS_ON)
Algorithms for SELECT Operation…
• Search Methods for Simple Selection:
• S1 Linear search (brute force):
• Retrieve every record in the file, and test whether its attribute values
satisfy the selection condition.
• S2 Binary search:
• If the selection condition involves an equality comparison on a key
attribute on which the file is ordered, binary search (which is more
efficient than linear search) can be used. (See OP1).
• S3 Using a primary index or hash key to retrieve a single record:
• If the selection condition involves an equality comparison on a key
attribute with a primary index (or a hash key), use the primary index (or
the hash key) to retrieve the record.
• Note that this condition returns at most a single record.
Algorithms for SELECT Operation…
• Search Methods for Simple Selection:
• S4 Using a primary index to retrieve multiple records:
• If the comparison condition is >, ≥, <, or ≤ on a key field with a primary index, use the index
to find the record satisfying the corresponding equality condition, then retrieve all
subsequent records in the (ordered) file. (see OP2)
• S5 Using a clustering index to retrieve multiple records:
• If the selection condition involves an equality comparison on a non-key attribute with a
clustering index, use the clustering index to retrieve all the records satisfying the selection
condition (see OP3).
• S6 Using a secondary (B+-tree) index:
• On an equality comparison, this search method can be used to retrieve a single record if the
indexing field has unique values (is a key) or to retrieve multiple records if the indexing field
is not a key.
• In addition, it can be used to retrieve records on conditions involving >,>=, <, or <=. (FOR
RANGE QUERIES)

Algorithms for SELECT Operation…
• Search Methods for Complex Selections
• Conjunctive selection:
• A conjunctive selection is of the form σ_θ1∧θ2∧…∧θn(r).
• The database can use an index on one of the conditions to retrieve candidate rows, and then test the remaining conditions on each retrieved row.
• Conjunctive selection using a composite index:
• If two or more attributes are involved in equality conditions in the conjunctive condition and a composite index (or hash structure) exists on the combined fields, we can use the index directly.
• Disjunctive selection conditions:
• A disjunctive selection is of the form σ_θ1∨θ2∨…∨θn(r).
• A disjunctive condition is satisfied by the union of all records satisfying the individual simple conditions θi.
• Negation:
• The result of the selection σ_¬θ(r) is the set of tuples of r for which the condition θ evaluates to false.
Algorithms for SELECT Operation…
• Whenever a single condition specifies the selection, we can only check
whether an access path exists on the attribute involved in that condition.
• If an access path exists, the method corresponding to that access path is
used; otherwise, the “brute force” linear search approach of method S1 is
used. (See OP1, OP2 and OP3)
• For conjunctive selection conditions, whenever more than one of the
attributes involved in the conditions have an access path, query
optimization should be done to choose the access path that retrieves the
fewest records in the most efficient way.

Algorithms for JOIN Operations
• Like selection, the join operation can be implemented in a variety of
ways. In terms of disk access, the joining can be very expensive, so
implementing and utilizing efficient join algorithms is critical in
minimizing a query’s execution time.
• Methods for implementing joins :
• Nested-Loop Join (brute force):
• Block Nested Loop Join (variation of nested loop join)
• Sort Merge Join
• Hash Join

Nested-Loop Join
• For each record tr in R (outer loop), retrieve every record ts from S (inner loop) and test whether the two records satisfy the join condition tr[A] = ts[B].
• This algorithm consists of an inner for loop nested within an outer for loop.
• We use the following notation:
• r, s: relations r and s
• tr: tuple in relation r
• ts: tuple in relation s
• nr: number of records in relation r
• ns: number of records in relation s
• br: number of blocks holding records of relation r
• bs: number of blocks holding records of relation s

for each tuple tr in r {
    for each tuple ts in s {
        if the join condition is true for (tr, ts)
            add tuple tr · ts to the result
    }
}

• In the algorithm, tr and ts are tuples of relations r and s respectively.
• tr · ts denotes the tuple constructed by concatenating the attribute values of tr and ts.
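The pseudocode above can be sketched as a runnable in-memory version, with relations as lists of dicts (the table data and attribute names are illustrative only):

```python
# Minimal in-memory sketch of the tuple-at-a-time nested-loop join;
# the join condition is tr[A] == ts[B].
def nested_loop_join(r, s, A, B):
    result = []
    for tr in r:                         # outer loop over r
        for ts in s:                     # inner loop over s
            if tr[A] == ts[B]:           # join condition
                result.append({**tr, **ts})  # concatenate tr and ts
    return result

emp = [{"name": "Ann", "dno": 5}, {"name": "Bob", "dno": 4}]
dept = [{"dnumber": 5, "dname": "Research"}]
print(nested_loop_join(emp, dept, "dno", "dnumber"))
# [{'name': 'Ann', 'dno': 5, 'dnumber': 5, 'dname': 'Research'}]
```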
Block Nested Loop Join
• When the buffer is too small to hold either relation entirely in memory, we can
still obtain a major saving in block accesses if we process the relations on a per
block, rather than on a per tuple basis.
• The primary difference in cost between the block nested loop join and the
basic nested-loop join is that, in the worst case, each block in the inner relation
s is read only once for each block in the outer relation, instead of once for each
tuple in the outer relation.
• Clearly, it is more efficient to use the smaller relation as the outer relation, in
case neither of the relations fits in memory.

Block Nested-Loop Join
• The following algorithm shows the block nested-loop join, where every block of the inner relation is paired with every block of the outer relation.
• Within each pair of blocks, every tuple in one block is paired with every tuple in the other block, to generate all pairs of tuples.
• As before, all pairs of tuples that satisfy the join condition are added to the result.

for each block br of r {
    for each block bs of s {
        for each tuple tr in br {
            for each tuple ts in bs {
                if the join condition is true for (tr, ts)
                    add tuple tr · ts to the result
            }
        }
    }
}
Sort Merge Join
• If the records of R and S are physically sorted (ordered) by value of the
join attributes A and B, respectively, we can implement the join in the
most efficient way possible.
• Both files are scanned in order of the join attributes, matching the records
that have the same values for A and B.
• This algorithm can be used to perform natural joins and equi-joins and
requires that each relation r and s be sorted by the common attributes
between R ∩ S.
• Note that each record in r and s is scanned only once, producing a worst- and best-case cost of br + bs.

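A sort-merge join over already-sorted inputs can be sketched as follows: both inputs are scanned once, and groups of equal keys are matched. This is an in-memory illustration with hypothetical data, not the block-oriented algorithm a DBMS would run:

```python
# Sketch of sort-merge join over relations already sorted on the join
# keys A and B; each input is scanned once, matching groups of equal keys.
def sort_merge_join(r, s, A, B):
    result = []
    i = j = 0
    while i < len(r) and j < len(s):
        if r[i][A] < s[j][B]:
            i += 1
        elif r[i][A] > s[j][B]:
            j += 1
        else:
            # Collect the group of s-tuples sharing this key, then pair
            # every r-tuple with the same key against that group.
            key = r[i][A]
            j_end = j
            while j_end < len(s) and s[j_end][B] == key:
                j_end += 1
            while i < len(r) and r[i][A] == key:
                result.extend({**r[i], **s[k]} for k in range(j, j_end))
                i += 1
            j = j_end
    return result

emp = [{"name": "Bob", "dno": 4}, {"name": "Ann", "dno": 5}]          # sorted on dno
dept = [{"dnumber": 4, "dname": "HR"}, {"dnumber": 5, "dname": "Research"}]  # sorted on dnumber
print(sort_merge_join(emp, dept, "dno", "dnumber"))
```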
Hash Join
• The records of files R and S are both hashed to the same hash file, using the same hashing function on the join attributes A of R and B of S as hash keys.
• A single pass through the file with fewer records (say, R) hashes its records to the hash file buckets.
• A single pass through the other file (S) then hashes each of its records to the appropriate bucket, where the record is combined with all matching records from R.

• Like the sort-merge join, the hash-join algorithm can be used to perform natural joins and equi-joins.
• The concept behind the hash-join algorithm is to partition the tuples of each given relation into sets (hash buckets).
• The partitioning is done on the basis of the same hash value on the join attributes; the hash function provides the hash value.
• The main goal of using the hash function in the algorithm is to reduce the number of comparisons and increase the efficiency of completing the join operation on the relations.
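The build/probe structure of the hash join can be sketched in memory (a simplified illustration; a real DBMS partitions both relations into disk-resident buckets first):

```python
# Sketch of the classic hash join: build a hash table on the smaller
# relation's join attribute, then probe it with the other relation.
from collections import defaultdict

def hash_join(r, s, A, B):
    # Build phase: hash the (smaller) relation r on attribute A.
    buckets = defaultdict(list)
    for tr in r:
        buckets[tr[A]].append(tr)
    # Probe phase: each s-tuple is compared only against its own bucket.
    return [{**tr, **ts} for ts in s for tr in buckets.get(ts[B], [])]

dept = [{"dnumber": 5, "dname": "Research"}]
emp = [{"name": "Ann", "dno": 5}, {"name": "Bob", "dno": 4}]
print(hash_join(dept, emp, "dnumber", "dno"))
# [{'dnumber': 5, 'dname': 'Research', 'name': 'Ann', 'dno': 5}]
```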
Algorithms for SELECT and JOIN Operations
• Factors affecting JOIN performance:
• Available buffer space
• Join selection factor
• Choice of inner vs. outer relation
• For small datasets, nested-loop join may be sufficient. For large datasets, hash join may be a better choice.
• Nested-loop join is a simple and straightforward method, but it can be inefficient for large datasets, because it has to scan the entire first table even if only a few rows in the second table match the join condition.
• Hash join is a more efficient method for implementing joins, especially for large datasets. It works by creating a hash table on the first table and then using the hash table to find the rows in the second table that match the join condition.
Algorithms for PROJECT and SET Operations (1)
• Algorithm for PROJECT operations: π_<attribute list>(R)
1. If <attribute list> includes a key of relation R, extract all tuples from R with only the values for the attributes in <attribute list>.
2. If <attribute list> does NOT include a key of relation R, duplicate tuples must be removed from the results.
• Methods to remove duplicate tuples:
1. Sorting
2. Hashing
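Both duplicate-elimination methods can be sketched over in-memory rows (illustrative data and attribute names; on disk, the sort would be an external sort and the hash table a bucket file):

```python
# Sketch of PROJECT with duplicate elimination when the projected
# attributes do not include a key.
def project_sort(R, attrs):
    # Sorting makes equal projected tuples adjacent, so duplicates
    # can be dropped in a single scan.
    projected = sorted(tuple(row[a] for a in attrs) for row in R)
    result = []
    for t in projected:
        if not result or result[-1] != t:
            result.append(t)
    return result

def project_hash(R, attrs):
    # Hashing checks each projected tuple against a set of seen tuples.
    seen, result = set(), []
    for row in R:
        t = tuple(row[a] for a in attrs)
        if t not in seen:
            seen.add(t)
            result.append(t)
    return result

book = [{"Bid": 1, "B_name": "X", "Price": 10}, {"Bid": 2, "B_name": "X", "Price": 10}]
print(project_sort(book, ["B_name", "Price"]))  # [('X', 10)]
print(project_hash(book, ["B_name", "Price"]))  # [('X', 10)]
```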
Algorithms for PROJECT and SET Operations (2)
• Algorithm for SET operations
• Set operations:
• UNION, INTERSECTION, SET DIFFERENCE and CARTESIAN PRODUCT
• CARTESIAN PRODUCT:
• The Cartesian product of relations R and S includes all possible combinations of records from R and S.
• The attributes of the result include all attributes of R and S.
• Cost analysis of CARTESIAN PRODUCT:
• If R has n records and j attributes and S has m records and k attributes, the result relation will have n*m records and j+k attributes.
• The CARTESIAN PRODUCT operation is very expensive and should be avoided if possible.
Algorithms for PROJECT and SET Operations (3)
• Algorithm for SET operations (contd.)
• UNION
• Sort the two relations on the same attributes.
• Scan and merge both sorted files concurrently; whenever the same tuple exists in both relations, only one is kept in the merged result.
• INTERSECTION
• Sort the two relations on the same attributes.
• Scan and merge both sorted files concurrently; keep in the merged result only those tuples that appear in both relations.
• SET DIFFERENCE R − S
• Keep in the merged result only those tuples that appear in relation R but not in relation S.
Use of Anti-Join for SET DIFFERENCE (or EXCEPT or MINUS in SQL)
• The MINUS operator in SQL is transformed into an anti-join as follows.
• Suppose we want to find out which departments have no employees:
• SELECT Dnumber FROM DEPARTMENT MINUS SELECT Dno FROM EMPLOYEE;
• The above query can be converted into the following:
• SELECT DISTINCT DEPARTMENT.Dnumber FROM DEPARTMENT, EMPLOYEE WHERE DEPARTMENT.Dnumber A= EMPLOYEE.Dno
• We use the nonstandard notation for anti-join, "A=", where DEPARTMENT is on the left of the anti-join and EMPLOYEE is on the right.
Implementing Aggregate Operations and Outer
Joins (1)
• Implementing Aggregate Operations:
• Aggregate operators:
• MIN, MAX, SUM, COUNT and AVG
• Options to implement aggregate operators:
• Table Scan
• Index
• Example
• SELECT MAX (SALARY) FROM EMPLOYEE;
• If an (ascending) B+-tree index on SALARY exists for the employee relation, then the optimizer can
decide on using the Salary index to search for the largest Salary value in the index by following the
rightmost pointer in each index node from the root to the rightmost leaf.
• That node would include the largest Salary value as its last entry.
• In most cases, this would be more efficient than a full table scan of EMPLOYEE, since no actual
records need to be retrieved.
• The MIN function can be handled in a similar manner.
Implementing Aggregate Operations and Outer
Joins (2)
• Implementing Aggregate Operations (contd.):
• SUM, COUNT and AVG
• For a dense index (each record has one index entry):
• Apply the associated computation to the values in the index.
• For a non-dense index:
• Actual number of records associated with each index entry (value) must be accounted for a
correct computation.
• This can be done if the number of records associated with each value in the index is stored in
each index entry.
• For the COUNT aggregate function, the number of values can be also computed from the
index in a similar manner.
• If a COUNT(*) function is applied to a whole relation, the number of records currently in each
relation are typically stored in the catalog, and so the result can be retrieved directly from the
catalog.
Implementing Aggregate Operations and Outer Joins (2)…
• Implementing Aggregate Operations (contd.):
• SUM, COUNT and AVG
• SELECT Dno, AVG(Salary) FROM EMPLOYEE GROUP BY Dno;
• With GROUP BY: the aggregate operator must be applied separately to each group of tuples.
• Use sorting or hashing on the grouping attributes to partition the file into the appropriate groups;
• Compute the aggregate function for the tuples in each group.
• What if we have a clustering index on the grouping attributes?
• If a clustering index exists on the grouping attribute(s), then the records are already partitioned (grouped) into the appropriate subsets.
• In this case, it is only necessary to apply the computation to each group.
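The hash-based variant of grouped aggregation can be sketched for the AVG query above (in-memory and illustrative; the rows are hypothetical):

```python
# Sketch of hash-based grouping for SELECT Dno, AVG(Salary) ... GROUP BY Dno:
# one pass partitions tuples into groups keyed by the grouping attribute,
# keeping a running (sum, count) so AVG needs no second scan.
from collections import defaultdict

def group_avg(rows, group_attr, agg_attr):
    acc = defaultdict(lambda: [0.0, 0])    # group key -> [sum, count]
    for row in rows:
        a = acc[row[group_attr]]
        a[0] += row[agg_attr]
        a[1] += 1
    return {g: s / c for g, (s, c) in acc.items()}

employee = [{"Dno": 5, "Salary": 40000}, {"Dno": 5, "Salary": 60000},
            {"Dno": 4, "Salary": 30000}]
print(group_avg(employee, "Dno", "Salary"))  # {5: 50000.0, 4: 30000.0}
```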
Implementing Aggregate Operations and Outer
Joins (3)
• Implementing Outer Join:
• Outer Join Operators:
• LEFT OUTER JOIN
• RIGHT OUTER JOIN
• FULL OUTER JOIN.
• The full outer join produces a result which is equivalent to the union of the results of the left and
right outer joins.
• Example (LEFT OUTER JOIN):
SELECT FNAME, DNAME
FROM (EMPLOYEE LEFT OUTER JOIN DEPARTMENT
ON DNO = DNUMBER);
• Note: The result of this query is a table of employee names and their associated departments.
• It is similar to a regular join result, with the exception that if an employee does not have an
associated department, the employee's name will still appear in the resulting table, although the
department name would be indicated as null.
• Outer join can be looked upon as a combination of inner join and anti-join.
Implementing Aggregate Operations and Outer Joins (4)
• Implementing Outer Join (contd.):
• Modifying Join Algorithms:
• Nested-loop or sort-merge joins can be modified to implement outer join. E.g.,
• For a left outer join, use the left relation as the outer loop and construct the result from every tuple in the left relation.
• If there is a matching tuple in the other relation, the concatenated tuple is saved in the result.
• However, if an outer tuple does not match, the tuple is still included in the result but is padded with null value(s).
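A nested-loop join modified this way for LEFT OUTER JOIN can be sketched as follows (in-memory, with None standing in for SQL NULL; rows and attribute names are illustrative):

```python
# Sketch of a nested-loop join modified for LEFT OUTER JOIN: every tuple
# of the left (outer-loop) relation reaches the result; unmatched tuples
# are padded with nulls (None) for the right relation's attributes.
def left_outer_join(left, right, A, B, right_attrs):
    result = []
    for tl in left:
        matched = False
        for tr in right:
            if tl[A] == tr[B]:
                result.append({**tl, **tr})
                matched = True
        if not matched:
            result.append({**tl, **{a: None for a in right_attrs}})
    return result

emp = [{"fname": "Ann", "dno": 5}, {"fname": "Bob", "dno": 7}]
dept = [{"dnumber": 5, "dname": "Research"}]
print(left_outer_join(emp, dept, "dno", "dnumber", ["dnumber", "dname"]))
# Bob has no matching department, so his dnumber/dname are padded with None.
```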
Implementing Aggregate Operations and Outer Joins (5)
• Implementing Outer Join (contd.):
• Executing a combination of relational algebra operators.
• Implement the previous left outer join example:
• {Compute the JOIN of the EMPLOYEE and DEPARTMENT tables}
• TEMP1 ← π_FNAME,DNAME(EMPLOYEE ⋈_DNO=DNUMBER DEPARTMENT)
• {Find the EMPLOYEEs that do not appear in the JOIN}
• TEMP2 ← π_FNAME(EMPLOYEE) − π_FNAME(TEMP1)
• This MINUS operation can be achieved by performing an anti-join on FNAME between EMPLOYEE and TEMP1.
• {Pad each tuple in TEMP2 with a null DNAME field}
• TEMP2 ← TEMP2 × {'null'}
• {UNION the temporary tables to produce the LEFT OUTER JOIN}
• RESULT ← TEMP1 ∪ TEMP2
• The cost of the outer join, as computed above, would include the cost of the associated steps (i.e., join, projections, set difference and union).
Cost Components for Query
Execution
• Cost of query evaluation can be measured in terms of different resources,
including:
• Access cost to secondary storage. This is the cost of transferring (reading and
writing) data blocks between secondary disk storage and main memory buffers. This
is also known as disk I/O (input/output) cost
• Disk storage cost. This is the cost of storing on disk any intermediate files that are
generated by an execution strategy for the query.
• Computation cost. This is the cost of performing in-memory operations on the
records within the data buffers during query execution. Such operations include
searching for and sorting records, merging records for a join or a sort operation, and
performing computations on field values. This is also known as CPU (central
processing unit) cost.
• Memory usage cost. This is the cost pertaining to the number of main memory
buffers needed during query execution.
• Communication cost. This is the cost of shipping the query and its results from the database site to the site or terminal where the query originated.
Evaluation of Expression
• One obvious way to evaluate an expression is simply to evaluate one operation at a time, in an appropriate order.
• There are two approaches to how a query expression tree can be evaluated:
• Materialization:
• Compute the result of an evaluation primitive and materialize (store) the new relation on the disk
• Bottom-to-top tree order
• Pipelining:
• Pass tuples on to the parent operation even while an operation is still being executed
• Simultaneous execution of several operations
Materialization
• It is intuitively easier to understand the evaluation of an expression by looking at a pictorial representation of the expression in an operator tree.
• Consider the expression: π_staff_name(σ_dept_id=3(Department) ⋈ Staff)
  (operator tree: π_staff_name at the root, over a join ⋈ whose children are σ_dept_id=3(Department) and Staff)
Materialization
• In this method, the given expression evaluates one relational operation at a time. Also, each operation is evaluated in an appropriate sequence or order.
• After evaluating all the operations, the outputs are materialized in a temporary relation for their subsequent uses.
• The example of the above figure is computed as follows:
1. Compute σ_dept_id=3(Department) and store the result in relation1
2. Compute relation1 ⋈ Staff and store the result in relation2
3. Compute π_staff_name on the materialized relation2
• By repeating the process, we will eventually evaluate the operation at the root of the tree, giving the final result of the expression.
• Disadvantages:
1. Creation of temporary relations and storage of temporary relations on the disk (wastage of space)
2. Cost of writing intermediate results to the disk
3. Only one operation at a time (starting from the lowest level)
Pipelining
• Combining several operations into one and avoiding the writing of temporary results to disk is called pipelining or stream-based processing.
• In this method, the DBMS does not store records in temporary tables; instead, the results of each operation are passed directly to the next operation as they are produced.
• Pipelining evaluates multiple operations simultaneously by passing the results of one operation to the next one without storing the tuples on disk.
• More than one operation is done at a time.
• This is cheaper than materialization.
• In the above example, all three operations can be placed in a pipeline, which passes the results of the selection to the join as they are generated. In turn, it passes the results of the join to the projection as they are generated.
• Pipelining cannot be used with blocking operations such as sorting and hashing, because they need the complete result of the previous operation before they can produce output.
Combining Operations Using
Pipelining
• SQL query translated into relational algebra expression
• Sequence of relational operations
• Materialized evaluation
• Creating, storing, and passing temporary results
• General query goal: minimize the number of temporary files
• Pipelining or stream-based processing
• Combines several operations into one
• Avoids writing temporary files
Combining Operations using Pipelining (cont’d.)
• Motivation
• A query is mapped into a sequence of operations.
• Each execution of an operation produces a temporary result.
• Generating and saving temporary files on disk is time consuming and
expensive.
• Alternative
• Avoid constructing temporary results as much as possible.
• Pipeline the data through multiple operations - pass the result of a previous
operator to the next without waiting to complete the previous operation.
• Pipelined evaluation benefits
• Avoiding cost and time delay associated with writing intermediate results to
disk
• Being able to start generating results as quickly as possible
Combining Operations Using
Pipelining (cont’d.)
• Iterator
• Operation implemented in such a way that it outputs one tuple at a time
• Many iterators may be active at one time
• Iterator interface methods
• Open()
• Get_Next()
• Close()
• Iterator concept can also be applied to access methods
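The Open()/Get_Next()/Close() iterator interface maps naturally onto Python generators, sketched here for the earlier Book query (the data and operator names are illustrative, not a DBMS API):

```python
# Sketch of pipelined, iterator-based evaluation: each operator pulls
# tuples from its child one at a time, so a select feeding a project
# never materializes an intermediate file.
def scan(table):                    # access method as an iterator
    for row in table:
        yield row

def select(child, pred):            # σ: pass matching tuples down the pipeline
    for row in child:
        if pred(row):
            yield row

def project(child, attrs):          # π: emit only the requested attributes
    for row in child:
        yield {a: row[a] for a in attrs}

book = [{"Bid": 1, "B_name": "DB Systems", "Price": 80},
        {"Bid": 2, "B_name": "Intro SQL", "Price": 30}]
plan = project(select(scan(book), lambda r: r["Price"] > 50), ["B_name", "Price"])
print(list(plan))  # [{'B_name': 'DB Systems', 'Price': 80}]
```

Creating `plan` runs nothing by itself; tuples flow only when the consumer pulls them, which is exactly the Get_Next() behavior.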
Combining Operations using
Pipelining (cont’d.)
• Example:
• if a join operation follows two select operations on base relations,
the tuples resulting from each select are provided as input for the
join algorithm in a stream or pipeline as they are produced.
• For a 2-way join, combine the 2 selections on the input and one
projection on the output with the Join.
• Dynamic generation of code to allow for multiple operations to
be pipelined.
• Results of a select operation are fed in a "Pipeline" to the join
algorithm.
• Also known as stream-based processing.
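The same idea can be sketched with Python generators, where each operator yields tuples to its consumer as they are produced, so no temporary file is written between operators (relation contents below are made up):

```python
# Stream-based processing for the 2-way join described above.
def select(relation, predicate):
    for tup in relation:
        if predicate(tup):
            yield tup                  # pass each qualifying tuple downstream

def nested_loop_join(left, right, condition):
    right_buf = list(right)            # inner input must be rescanned, so buffer it
    for l in left:
        for r in right_buf:
            if condition(l, r):
                yield l + r            # concatenated tuple flows to the next operator

def project(stream, columns):
    for tup in stream:
        yield tuple(tup[c] for c in columns)

emp  = [("Smith", 5), ("Wong", 4)]       # (name, dept_no)
dept = [(5, "Research"), (4, "Admin")]   # (dept_no, dept_name)

# Two selections feed the join, and the join feeds the projection.
pipeline = project(
    nested_loop_join(select(emp, lambda e: e[1] == 5),
                     select(dept, lambda d: d[0] == 5),
                     lambda e, d: e[1] == d[0]),
    columns=(0, 3))
print(list(pipeline))                    # [('Smith', 'Research')]
```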
Parallel Algorithms for Query
Processing
• Parallel database architecture approaches
• Shared-memory architecture
• Multiple processors can access common main memory region
• Shared-disk architecture
• Every processor has its own memory
• Machines have access to all disks
• Shared-nothing architecture
• Each processor has own memory and disk storage
• Most commonly used in parallel database systems
Parallel Algorithms for Query Processing
(cont’d.)
• Goals of parallel processing:
• Linear speed-up
• Linear reduction in time taken for operations
• Linear speedup is a measure of how much faster a parallel program runs as the number of
processors increases.
• For example, under ideal linear speed-up, a program that takes 10 seconds on one processor takes 1 second on 10 processors.
• Linear scale-up
• Linear scaleup is a measure of how much the problem size can be increased while still
maintaining the same execution time as the original problem size.
• Constant sustained performance by increasing the number of processors and disks
• For example, under ideal linear scale-up, a system that solves a problem of size 100 on one processor solves a problem of size 1,000 on 10 processors in the same time.
Parallel Algorithms for Query Processing
(cont’d.)
• Operator-level parallelism
• Operator-level parallelism (OLP) is a type of parallelism that occurs when
multiple operators in a query plan can be executed concurrently.
• In the operations that can be implemented with parallel algorithms, one of the
main strategies is to partition data across disks.
• Horizontal partitioning
• Horizontal partitioning of a relation corresponds to distributing the tuples across disks
based on some partitioning method.
• Round-robin partitioning
• Given n disks, assigning the ith tuple to disk (i mod n) is called round-robin partitioning.
• Range partitioning
• Tuples are equally distributed (as much as possible) by dividing the range of values of some attribute.
• For example, employee tuples from the EMPLOYEE relation may be assigned to 10 disks by dividing the age
range into 10 ranges—say 22–25, 26–28, 29–30, and so on—such that each has roughly one-tenth of the
total number of employees.
• Hash partitioning
• A tuple whose partitioning attribute has value v is assigned to disk h(v), where h is the hashing function.
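The three partitioning methods above can be sketched as follows; a Python list stands in for each disk, and `key` extracts the partitioning attribute from a tuple (all names and sample data are illustrative):

```python
def round_robin_partition(tuples, n):
    disks = [[] for _ in range(n)]
    for i, tup in enumerate(tuples):
        disks[i % n].append(tup)          # ith tuple goes to disk i mod n
    return disks

def range_partition(tuples, key, boundaries):
    # boundaries [b0, b1, ...]: disk j gets key <= bj; last disk gets the rest
    disks = [[] for _ in range(len(boundaries) + 1)]
    for tup in tuples:
        for j, b in enumerate(boundaries):
            if key(tup) <= b:
                disks[j].append(tup)
                break
        else:
            disks[-1].append(tup)
    return disks

def hash_partition(tuples, key, n, h=hash):
    disks = [[] for _ in range(n)]
    for tup in tuples:
        disks[h(key(tup)) % n].append(tup)   # disk chosen by hash of the attribute
    return disks

emps = [("A", 23), ("B", 27), ("C", 31), ("D", 24), ("E", 29)]  # (name, age)
print(round_robin_partition(emps, 2))
print(range_partition(emps, key=lambda t: t[1], boundaries=[25, 28]))
```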
Parallel Algorithms for Query
Processing (cont’d.)
• Operator-level parallelism for various operations:
• Sorting
• If data has been range-partitioned on an attribute:
• Each partition can be sorted separately in parallel
• Results concatenated
• Reduces sorting time
• Selection
• If condition is an equality condition on an attribute used for range partitioning:
• Perform selection only on partition to which the value belongs
• In other cases, the selection would be performed in parallel on all the processors and the results merged.
• Projection without duplicate elimination
• Perform operation in parallel as data is read from each partition
• Duplicate elimination
• Sort tuples and discard duplicates
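The claim for parallel sorting can be checked directly: with range partitioning on the sort attribute, each partition can be sorted independently (each sort could run on a different processor) and the sorted runs simply concatenated. A small sketch with made-up data:

```python
ages = [31, 27, 45, 22, 39, 30, 26]
parts = [[], [], []]                        # ranges: <=25, 26..30, >30
for a in ages:
    if a <= 25:
        parts[0].append(a)
    elif a <= 30:
        parts[1].append(a)
    else:
        parts[2].append(a)

runs = [sorted(p) for p in parts]           # independent, parallelizable sorts
result = [a for run in runs for a in run]   # concatenate in range order
assert result == sorted(ages)               # concatenation is globally sorted
```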
Parallel Algorithms for Query
Processing (cont’d.)
• Join: The basic idea of parallel join is to split the relations to be joined, say R and S, in such a way
that the join is divided into multiple n smaller joins, and then perform these smaller joins in parallel
on n processors and take a union of the result.
• Parallel joins divide the join into n smaller joins
• Perform smaller joins in parallel on n processors
• Take a union of the result
• Parallel join techniques
• Equality-based partitioned join
• Inequality join with partitioning and replication
• Asymmetric case
• Symmetric case
• Parallel partitioned hash join
[Figures: Equality-based partitioned join; Inequality join with partitioning and replication — asymmetric case and symmetric case; Parallel partitioned hash join]
Parallel Algorithms for Query
Processing (cont’d.)
• Aggregation
• Achieved by partitioning on the grouping attribute and then computing the
aggregate function locally at each processor
• Set operations
• For union, intersection, and set difference operations, if the argument
relations R and S are partitioned using the same hash function, they can be
done in parallel on each processor
Parallel Algorithms for Query Processing (cont’d.)
• Intraquery parallelism
• Approaches
• Use parallel algorithm for each operation, with appropriate partitioning of the data input to that
operation
• Execute independent operations in parallel
• If the output of one of the operations can be generated tuple-by-tuple and fed into another operator,
the result is pipelined parallelism.
• An operator that does not produce any output until it has consumed all its inputs is said to block the
pipelining
• Interquery parallelism
• Execution of multiple queries in parallel
• Goal: scale up (i.e., to increase the overall rate at which queries or transactions can be
processed by increasing the number of processors)
• Difficult to achieve on shared-disk or shared-nothing architectures
• Database systems using shared memory parallel architecture can achieve this type of
parallelism more easily without significant changes.
Query Optimization
• The process of selecting the most efficient query evaluation plan from the available possible plans for processing a given query.
• The function of the query optimization engine is to find an evaluation plan that reduces the overall execution
cost of a query.
• Two phases:
• Selection of the most efficient expression equivalent to the given expression
• Selection of detailed strategies: choosing algorithms, setting up indices, etc.
• It is clear that the costs for performing particular operations such as select, and join can vary
dramatically.
• Example: Consider two relations r and s with the following characteristics
• 10000 = nr = Number of tuples in r
• 1000 = ns = Number of tuples in s
• 1000 = br = Number of blocks with tuples in r
• 100 = bs = Number of blocks with tuples in s
• Selecting a single record from r on a non-key attribute can have:
• Cost of ⎡log2(br)⎤ = 10 [binary search]
• Cost of br/2 = 500 [linear search]
• Notice that the costs of the two selects differ by a factor of 50.
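Assuming br = 1000 blocks as above, the two access costs can be computed directly:

```python
import math

b_r = 1000                                  # blocks holding tuples of r
binary = math.ceil(math.log2(b_r))          # binary search: ~10 block accesses
linear = b_r // 2                           # linear search: 500 on average
print(binary, linear, linear // binary)     # 10 500 50
```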
Query Optimization
• The steps involved in query optimization are:
1.Identifying the different execution plans for the query: There can be many different ways to
execute a query. The query optimizer needs to identify all of the possible execution plans.
2.Estimating the cost of each execution plan: The query optimizer needs to estimate the cost of
each execution plan. The cost of an execution plan is determined by the number of operations it
requires, the size of the data involved, and the speed of the DBMS.
3.Selecting the most efficient execution plan: The query optimizer selects the most efficient
execution plan. The most efficient execution plan is the one that has the lowest cost.
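The three steps above reduce to: enumerate candidate plans, estimate each plan's cost, and pick the cheapest. A toy sketch with made-up plan names and cost figures:

```python
candidate_plans = {
    "linear scan":       500.0,   # step 1: identify possible execution plans
    "binary search":      10.0,   # step 2: each plan carries a cost estimate
    "hash-index lookup":   2.0,
}
best = min(candidate_plans, key=candidate_plans.get)   # step 3: lowest cost wins
print(best)   # hash-index lookup
```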
• By optimizing queries, the DBMS can retrieve data more quickly and efficiently.
• Query optimization is the process of finding the most efficient query execution plan
and this involves considering the following factors:
• The size and structure of the database
• The specific operations that need to be performed
• The cost of each operation
• Here are some of the benefits of query processing and optimization:
• Improved performance: Optimized queries can be executed more quickly, which can improve
the overall performance of the DBMS.
• Reduced resource usage: Optimized queries can use fewer resources, such as CPU time and
memory, which can improve the overall efficiency of the DBMS.
• Increased scalability: Optimized queries can scale better to handle larger datasets, which can
improve the performance of the DBMS as the amount of data grows.
Query Optimizer
Using Heuristics in Query Optimization (1)
• Process for heuristics optimization
1. The parser of a high-level query generates an initial internal representation;
2. Apply heuristics rules to optimize the internal representation.
3. A query execution plan is generated to execute groups of operations based
on the access paths available on the files involved in the query.
• The main heuristic is to apply first the operations that reduce the
size of intermediate results.
• E.g., Apply SELECT and PROJECT operations before applying the JOIN or
other binary operations.
Using Heuristics in Query Optimization (2)
• Query tree:
• A tree data structure that corresponds to a relational algebra expression.
• It represents the input relations of the query as leaf nodes of the tree and the relational
algebra operations as internal nodes.
• An execution of the query tree consists of executing an internal node operation whenever
its operands are available and then replacing that internal node by the relation that results
from executing the operation.
• Query graph:
• A graph data structure that corresponds to a relational calculus expression.
• It does not indicate an order in which operations are performed.
• There is only a single graph corresponding to each query.
• Notation:
• Relation nodes displayed as single circles
• Constants represented by constant nodes
• Double circles or ovals
• Selection or join conditions represented as edges
• Attributes to be retrieved displayed in square brackets
Using Heuristics in Query Optimization (3)
• Example:
• For every project located in ‘Stafford’, retrieve the project number, the controlling
department number and the department manager’s last name, address and birthdate.
• Relational algebra:
πPNUMBER, DNUM, LNAME, ADDRESS, BDATE (((σPLOCATION=‘STAFFORD’ (PROJECT))
⋈DNUM=DNUMBER (DEPARTMENT)) ⋈MGRSSN=SSN (EMPLOYEE))
• SQL query:
Q2: SELECT P.PNUMBER, P.DNUM, E.LNAME, E.ADDRESS, E.BDATE
FROM PROJECT AS P,DEPARTMENT AS D, EMPLOYEE AS E
WHERE P.DNUM=D.DNUMBER AND D.MGRSSN=E.SSN AND
P.PLOCATION=‘STAFFORD’;
Using Heuristics in Query Optimization (4)
Using Heuristics in Query Optimization (5)
Using Heuristics in Query Optimization (6)
• Heuristic Optimization of Query Trees:
• The same query could correspond to many different relational algebra expressions —
and hence many different query trees.
• The task of heuristic optimization of query trees is to find a final query tree that is
efficient to execute.
• Example:
Q: SELECT LNAME
FROM EMPLOYEE, WORKS_ON, PROJECT
WHERE PNAME = ‘AQUARIUS’
AND PNUMBER=PNO AND ESSN=SSN
AND BDATE > ‘1957-12-31’;
Using Heuristics in Query Optimization (7)
• The most restrictive SELECT can mean either the one that produces a relation with the fewest tuples or the one that produces a relation with the smallest absolute size.
Using Heuristics in Query Optimization (8)
Using Heuristics in Query Optimization (9)
• General Transformation Rules for Relational Algebra Operations:
1. Cascade of σ: A conjunctive selection condition can be broken up into a cascade (sequence) of
individual σ operations:
• σc1 AND c2 AND ... AND cn(R) = σc1 (σc2 (...(σcn(R))...))
2. Commutativity of σ: The σ operation is commutative:
• σc1 (σc2(R)) = σc2 (σc1(R))
3. Cascade of π: In a cascade (sequence) of π operations, all but the last one can be ignored:
• πList1 (πList2 (...(πListn(R))...)) = πList1(R)
4. Commuting σ with π: If the selection condition c involves only the attributes A1, ..., An in the
projection list, the two operations can be commuted:
• πA1, A2, ..., An (σc (R)) = σc (πA1, A2, ..., An (R))
Using Heuristics in Query Optimization (10)
• General Transformation Rules for Relational Algebra Operations (contd.):
5. Commutativity of ⋈ (and ×): The ⋈ operation is commutative, as is the × operation:
• R ⋈C S = S ⋈C R
• R × S = S × R
6. Commuting σ with ⋈ (or ×): If all the attributes in the selection condition c involve only the
attributes of one of the relations being joined—say, R—the two operations can be commuted as
follows:
• σc (R ⋈ S) = (σc (R)) ⋈ S
• Alternatively, if the selection condition c can be written as (c1 AND c2), where condition c1
involves only the attributes of R and condition c2 involves only the attributes of S, the operations
commute as follows:
• σc (R ⋈ S) = (σc1 (R)) ⋈ (σc2 (S))
Using Heuristics in Query Optimization (11)
• General Transformation Rules for Relational Algebra Operations (contd.):
7. Commuting π with ⋈ (or ×): Suppose that the projection list is L = {A1, ..., An,
B1, ..., Bm}, where A1, ..., An are attributes of R and B1, ..., Bm are attributes of S.
If the join condition c involves only attributes in L, the two operations can be
commuted as follows:
• πL (R ⋈C S) = (πA1, ..., An (R)) ⋈C (πB1, ..., Bm (S))
• If the join condition C contains additional attributes not in L, these must be added
to the projection list, and a final π operation is needed.
Using Heuristics in Query Optimization (12)
• General Transformation Rules for Relational Algebra Operations (contd.):
8. Commutativity of set operations: The set operations ∪ and ∩ are commutative,
but − is not.
9. Associativity of ⋈, ×, ∪, and ∩: These four operations are individually
associative; that is, if θ stands for any one of these four operations (throughout
the expression), we have
• (R θ S) θ T = R θ (S θ T)
10. Commuting σ with set operations: The σ operation commutes with ∪, ∩, and
−. If θ stands for any one of these three operations, we have
• σc (R θ S) = (σc (R)) θ (σc (S))
Using Heuristics in Query Optimization (13)
• General Transformation Rules for Relational Algebra Operations (contd.):
• 11. The π operation commutes with ∪:
• πL (R ∪ S) = (πL (R)) ∪ (πL (S))
• 12. Converting a (σ, ×) sequence into ⋈: If the condition c of a σ that follows a ×
corresponds to a join condition, convert the (σ, ×) sequence into a ⋈ as follows:
• (σC (R × S)) = (R ⋈C S)
• 13. Pushing σ in conjunction with set difference:
• σc (R − S) = σc (R) − σc (S)
• However, σ may be applied to only one relation:
• σc (R − S) = σc (R) − S
• 14. Pushing σ to only one argument in ∩:
• If all attributes in the condition c are from relation R, then:
• σc (R ∩ S) = σc (R) ∩ S
• 15. Some trivial transformations:
• If S is empty, then R ∪ S = R
• If the condition c in σc is true for the entire R, then σc (R) = R
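Two of these rules can be spot-checked on toy relations, treating σ as a filter and ⋈ as a nested-loop equijoin on the first attribute (the relations and conditions are illustrative):

```python
R = [(1, "a"), (2, "b"), (3, "c")]
S = [(1, "x"), (3, "y")]

def sigma(cond, rel):
    return [t for t in rel if cond(t)]

def join(r, s):                      # equijoin on attribute 0 of each relation
    return [a + b for a in r for b in s if a[0] == b[0]]

c1 = lambda t: t[0] > 1              # involves only R's attributes
c2 = lambda t: t[1] != "b"

# Rule 1 (cascade of σ): σ_{c1 AND c2}(R) = σ_{c1}(σ_{c2}(R))
assert sigma(lambda t: c1(t) and c2(t), R) == sigma(c1, sigma(c2, R))

# Rule 6 (commuting σ with ⋈): σ_{c1}(R ⋈ S) = (σ_{c1}(R)) ⋈ S
assert sigma(c1, join(R, S)) == join(sigma(c1, R), S)
```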
Using Heuristics in Query
Optimization (13)…
• Other possible transformations:
• For example, a selection or join condition c can be converted into an
equivalent condition by using the following standard rules from
Boolean algebra (De Morgan’s laws):
• NOT (c1 AND c2) ≡ (NOT c1) OR (NOT c2)
• NOT (c1 OR c2) ≡ (NOT c1) AND (NOT c2)
Using Heuristics in Query Optimization (14)
• Outline of a Heuristic Algebraic Optimization Algorithm:
1. Using rule 1, break up any select operations with conjunctive conditions into a
cascade of select operations. This permits a greater degree of freedom in moving
SELECT operations down different branches of the tree.
2. Using rules 2, 4, 6 and 10,13,14 concerning the commutativity of select with other
operations, move each select operation as far down the query tree as is permitted
by the attributes involved in the select condition.
3. Using rules 5 and 9 concerning associativity of binary operations, rearrange the leaf
nodes of the tree so that the leaf node relations with the most restrictive select
operations are executed first in the query tree representation.
4. Using Rule 12, combine a Cartesian product operation with a subsequent select
operation in the tree into a join operation.
5. Using rules 3, 4, 7, and 11 concerning the cascading of project and the commuting
of project with other operations, break down and move lists of projection attributes
down the tree as far as possible by creating new project operations as needed.
6. Identify subtrees that represent groups of operations that can be executed by a
single algorithm.
Using Heuristics in Query Optimization (15)
• Summary of Heuristics for Algebraic Optimization:
1. The main heuristic is to apply first the operations that reduce the size of
intermediate results.
2. Perform select operations as early as possible to reduce the number of tuples and
perform project operations as early as possible to reduce the number of attributes.
(This is done by moving select and project operations as far down the tree as
possible.)
3. The select and join operations that are most restrictive should be executed before
other similar operations. (This is done by reordering the leaf nodes of the tree
among themselves and adjusting the rest of the tree appropriately.)
Using Heuristics in Query Optimization (16)
• Query Execution Plans
• An execution plan for a relational algebra query consists of a combination of the
relational algebra query tree and information about the access methods to be used
for each relation as well as the methods to be used in computing the relational
operators stored in the tree.
• Materialized evaluation: the result of an operation is stored as a temporary relation.
• Pipelined evaluation: as the result of an operator is produced, it is forwarded to the
next operator in sequence.
Choice of Query Execution Plans…
• The choice of query execution plan can significantly impact the performance of the
query.
• Techniques to choose query execution plans:
• Alternatives for Query Evaluation
• materialized evaluation vs pipelined evaluation
• Nested Subquery Optimization
• Unnesting the subquery using semi-join and anti-join
• Subquery (View) Merging Transformation
• This FROM clause subquery is often referred to as an inline view
• The view-merging operation merges the tables in the view with the tables from the outer query block
• Example (schemas: EMP (Ssn, Fn, Ln, Dno); DEPT (Dno, Dname, Dmgrname, Bldg_id); BLDG (Bldg_id, No_storeys, Addr, Phone)):
SELECT E.Ln, V.Addr, V.Phone
FROM EMP E, ( SELECT D.Dno, D.Dname, B.Addr, B.Phone
FROM DEPT D, BLDG B
WHERE D.Bldg_id = B.Bldg_id ) V
WHERE V.Dno = E.Dno AND E.Fn = “John”;
--this query transforms to the following query:
SELECT E.Ln, B.Addr, B.Phone
FROM EMP E, DEPT D, BLDG B
WHERE D.Bldg_id = B.Bldg_id AND D.Dno = E.Dno AND E.Fn = “John”;
• Cost-based optimization vs Heuristic-based optimization
Choice of Query Execution Plans…
• Materialized Views
• A view is defined in the database as a query, and a materialized view stores the
results of that query.
• Using materialized views to avoid some of the computation involved in a query
is another query optimization technique.
• A materialized view may be stored temporarily to allow more queries to be
processed against it or permanently, as is common in data warehouses.
• A materialized view constitutes derived data because its content can be
computed as a result of processing the defining query of the materialized view.
• The main idea behind materialization is that it is much cheaper to read it when
needed and query against it than to recompute it from scratch.
• The savings can be significant when the view involves costly operations like join,
aggregation, and so forth.
Using Selectivity and Cost Estimates in Query
Optimization (1)
•Cost-based query optimization:
• Estimate and compare the costs of executing a query using different execution strategies and
choose the strategy with the lowest cost estimate.
• This is based on the cost of the query. The query can use different paths based on indexes,
constraints, sorting methods etc.
• The cost of an execution plan is estimated based on several factors,
including the size of the database, the number of rows that need to be
processed, and the cost of each operation.
• The cost of each operation is typically estimated based on the number of CPU
cycles and I/O operations that the operation will require.
• This method mainly uses the statistics like record size, number of records , number of records
per block, number of blocks, table size, whether whole table fits in a block, uniqueness of
column values, size of columns etc.
• (Compare to heuristic query optimization)
• Issues
• Cost function (these are estimates, not exact costs)
• The cost function depends heavily on selectivity, i.e., the fraction of the input records that forms the output of the select operation.
Using Selectivity and Cost Estimates in Query
Optimization (2)
• Cost Components for Query Execution
1. Access cost to secondary storage
2. Storage cost
3. Computation cost
4. Memory usage cost
5. Communication cost

• Note: Different database systems may focus on different cost


components.
Using Selectivity and Cost Estimates in Query
Optimization (3)
• Catalog Information Used in Cost Functions
• Information about the size of a file
• number of records (tuples) (r),
• record size (R),
• number of blocks (b)
• blocking factor (bfr) (number of records per block)
• Information about indexes and indexing attributes of a file
• Number of levels (x) of each multilevel index
• Number of first-level index blocks (bI1)
• Number of distinct values (d) of an attribute
• xA: Number of levels of the index for attribute A
• bI1A: Number of first-level blocks of the index on attribute A
• Selectivity (sl) of an attribute which is the fraction of records satisfying an equality condition on the attribute
• For example, if a table has 1000 records and a query has a condition that only matches 10 of those records, the
selectivity of the condition is 10/1000 = 0.01.
• Selection cardinality (s) of an attribute. (s = sl * r)
• which is the average number of records that will satisfy an equality selection condition on that attribute
• So, in the example above, the equation would be s = 0.01 * 1000 = 10.
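The slide's numbers (1,000 records, 10 matches) can be checked directly:

```python
r = 1000            # number of records in the file
matching = 10       # records satisfying the equality condition
sl = matching / r   # selectivity: fraction of records that qualify
s = sl * r          # selection cardinality: expected matching records
print(sl, s)        # 0.01 10.0
```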
Histograms
• Histograms are tables or data structures maintained by the
DBMS to record information about the distribution of data.
• Histograms divide the attribute over important ranges (called
buckets) and store the total number of records that belong to
that bucket in that relation.
• Histograms are used by the database optimizer to estimate the
cost of queries and to choose the most efficient execution
plan.
• The optimizer uses the histogram to estimate the selectivity of
conditions on the column.
• For example, if a condition specifies that the value of a column
must be between 30 and 40, the optimizer can use the
histogram to estimate the percentage of rows that satisfy the
condition.
• A common variations of histograms are:
• Equi-width histograms, the range of values is divided into equal
subranges.
• Equi-height histograms, the buckets are so formed that each one
contains roughly the same number of records.
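A small sketch of the two histogram variants on a toy Age column, plus an optimizer-style selectivity estimate (all values are illustrative):

```python
ages = [22, 24, 25, 27, 31, 33, 35, 38, 41, 47]

# Equi-width: equal-sized subranges 20-29, 30-39, 40-49
equi_width = {(20, 29): 0, (30, 39): 0, (40, 49): 0}
for a in ages:
    for lo, hi in equi_width:
        if lo <= a <= hi:
            equi_width[(lo, hi)] += 1
            break

# Equi-height: sort, then cut into buckets holding roughly equal counts
sorted_ages = sorted(ages)
per_bucket = len(sorted_ages) // 5
equi_height = [sorted_ages[i:i + per_bucket]
               for i in range(0, len(sorted_ages), per_bucket)]

# Estimated selectivity of 30 <= Age <= 39: one full equi-width bucket
est_sel = equi_width[(30, 39)] / len(ages)
print(equi_width, est_sel)   # {(20, 29): 4, (30, 39): 4, (40, 49): 2} 0.4
```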
Cost Functions for SELECT
• Cost functions are used to estimate the cost of a SELECT operation.
• The cost of a SELECT operation depends on several factors, including the size of the table, the
selectivity of the condition, and the type of access method used.
• Examples of Cost Functions for SELECT
• S1. Linear search (brute force) approach
• CS1a = b;
• For an equality condition on a key, CS1a = (b/2) if the record is found; otherwise CS1a = b.
• S2. Binary search:
• CS2 = ⎡log2 b⎤ + ⎡s/bfr⎤ − 1
• For an equality condition on a unique (key) attribute, CS2 = ⎡log2 b⎤ (because s = 1 in this case)
• S3. Using a primary index (S3a) or hash key (S3b) to retrieve a single record
• CS3a = x + 1;
• CS3b = 1 for static or linear hashing;
• CS3b = 2 for extendible hashing;
Cost Functions for SELECT…
• S4. Using an ordering index to retrieve multiple records:
• For the comparison condition on a key field with an ordering index, CS4 = x + (b/2)
• If the comparison condition is >, >=, <, or <= on a key field with an ordering index, roughly
half the file records will satisfy the condition.
• S5. Using a clustering index to retrieve multiple records:
• CS5 = x + ┌ (s/bfr) ┐
• S6. Using a secondary (B+-tree) index:
• For an equality comparison, CS6a = x + s;
• For a comparison condition such as >, <, >=, or <=,
• CS6a = x + (bI1/2) + (r/2)
Cost Functions for SELECT…
• S7. Conjunctive selection:
• Use either S1 or one of the methods S2 to S6 to solve.
• For the latter case, use one condition to retrieve the records and then check in the
memory buffer whether each retrieved record satisfies the remaining conditions in
the conjunction.
• S8. Conjunctive selection using a composite index:
• Same as S3a, S5 or S6a, depending on the type of index.
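The SELECT cost formulas above can be written as small functions; parameter names follow the catalog notation (b, bfr, x, s), and the sample numbers in the usage line are illustrative:

```python
from math import ceil, log2

def cs1_linear(b):
    return b                                   # S1: brute-force scan of all blocks

def cs2_binary(b, bfr, s=1):
    return ceil(log2(b)) + ceil(s / bfr) - 1   # S2: binary search

def cs3a_primary_index(x):
    return x + 1                               # S3a: index levels + one data block

def cs5_clustering_index(x, s, bfr):
    return x + ceil(s / bfr)                   # S5: s records clustered in blocks

def cs6a_secondary_index_eq(x, s):
    return x + s                               # S6a: one block access per record

# b = 1000 blocks, bfr = 10, a 2-level index, s = 40 matching records
print(cs1_linear(1000), cs2_binary(1000, 10),
      cs3a_primary_index(2), cs5_clustering_index(2, 40, 10),
      cs6a_secondary_index_eq(2, 40))          # 1000 10 3 6 42
```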
Cost Functions for JOIN
• An estimate for the size (number of tuples) of the file that results after JOIN
operation is required to develop reasonably accurate cost function for JOIN
operations.
• Examples of Cost Functions for JOIN
• Join selectivity (js)
• Ratio of the size (number of tuples) of the resulting join file to the size of the CARTESIAN PRODUCT file
• js = |R ⋈C S| / |R × S| = |R ⋈C S| / (|R| * |S|)
• If there is no join condition C, js = 1;
• If no tuples from the relations satisfy condition C, js = 0;
• Usually, 0 <= js <= 1;
• Simple formula for the join selectivity of an equijoin R.A = S.B:
• js = 1/max(NDV(A, R), NDV(B, S)), where NDV is the number of distinct values
• Size of the result file after join operation (join cardinality (jc))
• jc = |R ⋈C S| = js * |R| * |S|
Cost Functions for JOIN (contd.)
• J1. Nested-loop join: (Use R for outer loop)
• For three memory buffer blocks:
• CJ1 = bR + (bR * bS) + ((js * |R| * |S|)/bfrRS)
• For nB memory buffer blocks:
• CJ1 = bR + (⎡bR/(nB − 2)⎤ * bS) + ((js * |R| * |S|)/bfrRS)
• The last part of the formula is the cost of writing the resulting file to disk
• J2. Index-based nested-loop join (using an access structure to retrieve the matching
record(s))
• If an index exists for the join attribute B of S with index levels xB, we can retrieve each
record s in R and then use the index to retrieve all the matching records t from S that
satisfy t[B] = s[A].
• Where A is the joining attribute of R and B is the joining attribute of S
• The cost depends on the type of index.
Cost Functions for JOIN (contd.)
• J2. Index-based nested-loop join (contd.)
• For a secondary index,
• CJ2a = bR + (|R| * (xB + sB)) + ((js* |R|* |S|)/bfrRS);
• For a clustering index,
• CJ2b = bR + (|R| * (xB + (sB/bfrB))) + ((js* |R|* |S|)/bfrRS);
• where sB is the selection cardinality for the join attribute B of S.
• Selection cardinality was defined as the average number of records that satisfy an equality condition on an
attribute, which is the average number of records that have the same value for the attribute and hence will be
joined to a single record in the other file.
• For a primary index,
• CJ2c = bR + (|R| * (xB + 1)) + ((js* |R|* |S|)/bfrRS);
• If a hash key exists for one of the two join attributes — B of S
• CJ2d = bR + (|R| * h) + ((js* |R|* |S|)/bfrRS);
• where h ≥ 1 is the average number of block accesses to retrieve a record, given its hash
key value.
• Usually, h is estimated to be 1 for static and linear hashing and 2 for extendible hashing.
This is an optimistic estimate, and typically h ranges from 1.2 to 1.5 in practical
situations.
Cost Functions for JOIN (contd.)
• J3. Sort-merge join:
• If the files are already sorted on the join attributes, the cost function for this method is
• CJ3a = bR + bS + ((js * |R| * |S|)/bfrRS)
• If we must sort the files, the cost of sorting must be added, and the cost function will be:
• CJ3a = CS + bR + bS + ((js * |R| * |S|)/bfrRS)
• (CS: cost of sorting the files)
• J4—Partition–hash join (or just hash join).
• The records of files R and S are partitioned into smaller files. Each of the files, R and S is partitioned
separately.
• The partitioning of each file is done using the same hashing function h on the join attribute A of R (for
partitioning file R) and B of S (for partitioning file S).
• The partitioning phase has two iterations. After the first iteration, the first file R is partitioned into the
subfiles R1, R2, … , RM, where all the records that hashed to the same buffer are in the same partition.
After the second iteration, the second file S is similarly partitioned.
• The cost of this join can be approximated as:
• CJ4 = 3 * (bR + bS) + ((js * |R| * |S|)/bfrRS)
• since each record is read once and written back to disk once during the partitioning phase. During the
joining (probing) phase, each record is read a second time to perform the join.
How Buffer Space and Choice of Outer-Loop File Affect Performance of
Nested-Loop Join
• Assume that the number of buffers available in main memory for implementing the join is nB = 7 blocks (buffers).
• For illustration, assume that the DEPARTMENT file consists of rD = 50 records stored in bD = 10 disk blocks and that
the EMPLOYEE file consists of rE = 6,000 records stored in bE = 2,000 disk blocks.
• Note that keeping one block for reading from the inner file and one block for writing to the output file, nB − 2 blocks
are available to read from the outer relation.
• The algorithm can then read one block at a time for the inner-loop file and use its records to probe (that is, search) the outer-loop
blocks that are currently in main memory for matching records.
• If EMPLOYEE is used for the outer loop
• Total number of blocks accessed (read) for outer-loop file = bE
• Number of times (nB − 2) blocks of outer file are loaded into main memory = ⎡bE/(nB – 2) ⎤
• Total number of blocks accessed (read) for inner-loop file = bD * ⎡bE/(nB – 2) ⎤
• Hence, we get the following total number of block read accesses:
• bE + ( ⎡bE/(nB − 2)⎤ * bD) = 2000 + ( ⎡(2000/5)⎤ * 10) = 6000 block accesses
• On the other hand, if we use the DEPARTMENT records in the outer loop, by symmetry we get the following total
number of block accesses:
• bD + ( ⎡bD/(nB − 2)⎤ * bE) = 10 + ( ⎡(10/5)⎤ * 2000) = 4010 block accesses
• If the result file of the join operation has bRES disk blocks, each block is written once to disk, so an additional b RES
block accesses (writes) should be added to the preceding formulas in order to estimate the total cost of the join
operation.
• As this example shows, it is advantageous to use the file with fewer blocks as the outer-loop file in the nested-loop join.
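The two block-access counts in this example can be verified with a one-line cost function for the formula b_outer + ⎡b_outer/(nB − 2)⎤ * b_inner (the result-writing cost is omitted here, as in the example):

```python
from math import ceil

nB, bD, bE = 7, 10, 2000   # buffers, DEPARTMENT blocks, EMPLOYEE blocks

def nested_loop_reads(b_outer, b_inner, nB):
    # outer file read once; inner file read once per (nB - 2)-block outer chunk
    return b_outer + ceil(b_outer / (nB - 2)) * b_inner

print(nested_loop_reads(bE, bD, nB))   # EMPLOYEE as outer: 6000
print(nested_loop_reads(bD, bE, nB))   # DEPARTMENT as outer: 4010
```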
Example of Join Optimization Based on Cost Formulas
• Suppose that we have the EMPLOYEE file consists of rE=10,000 records stored in bE=2000 disk blocks and
DEPARTMENT file consists of rD = 125 records stored in bD = 13 disk blocks.
• Consider the following two join operations:
• OP6: EMPLOYEE ⋈Dno=Dnumber DEPARTMENT
• OP7: DEPARTMENT ⋈Mgr_ssn=Ssn EMPLOYEE
• Suppose that we have a primary index on Dnumber of DEPARTMENT with xDnumber= 1 level and a
secondary index on Mgr_ssn of DEPARTMENT with selection cardinality sMgr_ssn= 1 and levels
xMgr_ssn= 2.
• Assume that the join selectivity for OP6 is jsOP6 = (1/|DEPARTMENT|) = 1/125 because Dnumber is a key
of DEPARTMENT.
• jsOP6 = 1/max(NDV(Dno, EMPLOYEE), NDV(Dnumber, DEPARTMENT)) = 1/max(125, 125) = 1/125.
• NDV is number of distinct value
• Also assume that the blocking factor for the resulting join file is bfrED= 4 records per block.
• We can estimate the worst-case costs for the JOIN operation OP6 using the applicable methods J1 and J2
as follows:
1. Using method J1 with EMPLOYEE as outer loop:
CJ1 = bE + (bE * bD) + (( jsOP6 * rE* rD)/bfrED) = 2,000 + (2,000 * 13) + (((1/125) * 10,000 * 125)/4) = 30,500
2. Using method J1 with DEPARTMENT as outer loop:
CJ1 = bD + (bE * bD) + (( jsOP6* rE* rD)/bfrED) = 13 + (13 * 2,000) + (((1/125) * 10,000 * 125/4) = 28,513
Example of Join Optimization Based
on Cost Formulas…
3. Using method J2 with EMPLOYEE as outer loop:
CJ2c = bE + (rE * (xDnumber + 1)) + ((jsOP6 * rE * rD)/bfrED) = 2,000 + (10,000 * 2) + (((1/125) * 10,000 * 125)/4) = 24,500
4. Using method J2 with DEPARTMENT as outer loop:
CJ2a = bD + (rD * (xDno + sDno)) + ((jsOP6 * rE * rD)/bfrED) = 13 + (125 * (2 + 80)) + (((1/125) * 10,000 * 125)/4) = 12,763
5. Using method J4:
CJ4 = 3 * (bD + bE) + ((jsOP6 * rE * rD)/bfrED) = 3 * (13 + 2,000) + 2,500 = 8,539
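The five estimates above can be re-derived with the numbers assumed in this example (js = 1/125 is folded into the result-writing term, which is exactly 2,500 blocks):

```python
bE, rE = 2000, 10000       # EMPLOYEE: blocks, records
bD, rD = 13, 125           # DEPARTMENT: blocks, records
write_cost = rE * rD / 125 / 4        # (js * rE * rD) / bfrED = 2500.0

cj1_emp_outer  = bE + bE * bD + write_cost    # J1, EMPLOYEE as outer loop
cj1_dept_outer = bD + bE * bD + write_cost    # J1, DEPARTMENT as outer loop
cj2c = bE + rE * (1 + 1) + write_cost         # J2, primary index, x = 1
cj2a = bD + rD * (2 + 80) + write_cost        # J2, secondary index, x = 2, s = 80
cj4  = 3 * (bD + bE) + write_cost             # J4, partition-hash join

print(cj1_emp_outer, cj1_dept_outer, cj2c, cj2a, cj4)
# 30500.0 28513.0 24500.0 12763.0 8539.0
```

As the comparison shows, the partition-hash join (J4) is the cheapest plan here, which is why cost-based optimizers enumerate and compare all applicable methods.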
Additional Issues Related to Query
Optimization
• Displaying the system’s query execution plan
• Oracle syntax
• EXPLAIN PLAN FOR <SQL query>;
• The query may involve INSERT, DELETE, and UPDATE statements;
• The output goes into a table called PLAN_TABLE.
• Showing Execution Plans:
• select * from table(dbms_xplan.display);
• IBM DB2 syntax
• EXPLAIN PLAN SELECTION [additional options] FOR <SQL-query>
• SQL server syntax
• SET SHOWPLAN_TEXT ON or SET SHOWPLAN_XML ON or SET SHOWPLAN_ALL ON
• PostgreSQL Syntax
• EXPLAIN [set of options] <SQL query>;
• where the options include ANALYZE, VERBOSE, COSTS, BUFFERS, TIMING, etc.
100
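As a runnable illustration of the same facility, SQLite (not one of the systems listed above) offers an EXPLAIN QUERY PLAN statement; the table and index names here are invented for the demo.

```python
# Minimal sketch: asking SQLite to display its chosen execution plan.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (ssn TEXT PRIMARY KEY, dno INTEGER)")
conn.execute("CREATE INDEX idx_emp_dno ON employee (dno)")

# Each row of the plan output describes one step of the chosen strategy,
# e.g. an index search on idx_emp_dno instead of a full table scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM employee WHERE dno = 5"
).fetchall()
for row in plan:
    print(row)
conn.close()
```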
Additional Issues Related to Query Optimization (cont’d.)
• Size estimation of other operations
• Projection
• For a projection of the form πList(R), expressed as SELECT List FROM R, the estimated number of tuples in
the result is |R|, since SQL treats the result as a multiset.
• If the DISTINCT option is used, then size of πA (R) is NDV (A, R).
• Set operations
• If the arguments for an intersection, union, or set difference are made of selections on the same relation,
they can be rewritten as conjunction, disjunction, or negation, respectively.
• For example, σc1 (R) ∩ σc2 (R) can be rewritten as σc1 AND c2 (R); and σc1 (R) ∪ σc2 (R) can be rewritten as
σc1 OR c2 (R).
• The size estimation can be made based on the selectivity of conditions c1 and c2.
• Otherwise, the estimated upper bound on the size of r ∩ s is the minimum of the sizes of r and s;
• the estimated upper bound on the size of r ∪ s is the sum of their sizes.
• Aggregation
• The size of G ℑ Aggregate-function(A) (R) is NDV(G, R), since there is one group for each unique value of G.
• Outer join
• The size of R LEFT OUTER JOIN S would be |R ⋈ S| plus |R ANTI-JOIN S|. Similarly, the size of R FULL OUTER JOIN
S would be |R ⋈ S| plus |R ANTI-JOIN S| plus |S ANTI-JOIN R|.
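The estimation rules above can be sketched as small helper functions; cardinalities and NDV values would come from catalog statistics, and the sample numbers below are invented.

```python
# Sketch of result-size estimation rules for projection, set operations,
# and aggregation, following the rules stated in the text.

def size_projection(card_R: int, ndv_A: int, distinct: bool) -> int:
    """SELECT [DISTINCT] A FROM R: |R| as a multiset, NDV(A, R) with DISTINCT."""
    return ndv_A if distinct else card_R

def size_intersection_upper(card_r: int, card_s: int) -> int:
    return min(card_r, card_s)      # upper bound on |r ∩ s|

def size_union_upper(card_r: int, card_s: int) -> int:
    return card_r + card_s          # upper bound on |r ∪ s|

def size_groupby(ndv_G: int) -> int:
    return ndv_G                    # one group per distinct value of G

print(size_projection(10_000, 125, distinct=True))   # 125
print(size_intersection_upper(10_000, 125))          # 125
print(size_union_upper(10_000, 125))                 # 10125
print(size_groupby(125))                             # 125
```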
Additional Issues Related to Query Optimization (cont'd.)
• Plan caching
• The plan is stored by the query optimizer for later reuse when the same query is submitted again with
different parameter values.
• Storing of the plan and reusing it is referred to as plan caching.
• Top k-results optimization
• When the output of a query is expected to be large, sometimes the user is
satisfied with only the top-k results based on some sort order.
• Limits strategy generation
• For such top-k queries, the optimizer can restrict plan generation to strategies that deliver the first k
results quickly, for example pipelined plans that avoid materializing and sorting the full result.
102
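The benefit of knowing k up front can be seen with Python's heap-based selection, which keeps only a bounded buffer instead of sorting everything (the sample data is invented):

```python
# With a top-k limit, an engine can keep a bounded heap instead of sorting
# the whole result; heapq.nlargest shows the idea.
import heapq

salaries = [30_000, 55_000, 41_000, 72_000, 48_000, 66_000]
top3 = heapq.nlargest(3, salaries)   # O(n log k) rather than a full O(n log n) sort
print(top3)   # [72000, 66000, 55000]
```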
Query Optimization in Data Warehouses
• Star transformation optimization
• Goal: to access a reduced set of data from the fact table and avoid using a full table scan
on it
• Two types of star transformation optimizations are possible:
• Both these optimizations are performed on the basis of comparative costs of the original and the
transformed queries.
• Classic star transformation
• In this optimization, the filters (such as D1.A > 5) are first applied to each dimension table, and then a
Cartesian product of the filtered dimension tables is formed.
• Note that generally there are no join predicates between dimension tables.
• The result of this Cartesian product is then joined with the fact table using B-tree indexes (if any) on the joining
keys of the fact table.
• Bitmap index star transformation
• Bitmap indexes are widely used in data warehousing environments.
• The requirement with this optimization is that there must be bitmap indexes on the fact-table joining keys
referenced in the query.
• For example, in QSTAR, there must be bitmap indexes on FACT.Fk1, FACT.Fk2, and FACT.Fk3 attributes; each bit
in the bitmap corresponds to a row in the fact table.
• The bit is set if the key value of the attribute appears in a row of the fact table.
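A toy sketch of the bitmap idea, with invented data: each dimension filter yields a bitmap over fact-table rows, and ANDing the bitmaps selects the matching rows without a full scan of the fact table.

```python
# Toy bitmap-index star transformation: one bit per fact-table row; a set
# bit means the row's foreign-key value survives the dimension filter.

fact = [  # (fk1, fk2, measure)
    (1, 10, 5.0), (2, 10, 7.0), (1, 20, 3.0), (3, 20, 9.0),
]

def bitmap_for(fk_index: int, keys: set) -> int:
    """Bitmap over fact rows whose foreign-key column value is in `keys`."""
    bm = 0
    for i, row in enumerate(fact):
        if row[fk_index] in keys:
            bm |= 1 << i
    return bm

dim1_keys = {1, 2}   # DIM1 rows surviving that dimension's filter predicate
dim2_keys = {10}     # DIM2 rows surviving its filter predicate

# ANDing the per-dimension bitmaps gives the qualifying fact rows.
combined = bitmap_for(0, dim1_keys) & bitmap_for(1, dim2_keys)
rows = [fact[i] for i in range(len(fact)) if combined >> i & 1]
print(rows)   # [(1, 10, 5.0), (2, 10, 7.0)]
```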
Query Optimization in Data Warehouses…
• The given query QSTAR is transformed into Q2STAR as shown below.
• [Figure: the transformed query Q2STAR]
104
Query Optimization in Data Warehouses…
• Joining back
• The subquery bitmap trees filter the fact table based on the filter predicates
on the dimension tables; therefore, it may still be necessary to join the
dimension tables back to the relevant rows in the fact table using the original
join predicates.
• The join back of a dimension table can be avoided if the columns of the
dimension table are not referenced in the SELECT and GROUP-BY clauses.
• Note that in Q2STAR, the table DIM3 is not joined back to the FACT table,
since it is not referenced in the SELECT and GROUP-BY clauses, and DIM3.Pk is
unique.
105