Unit 2: Query Processing and Optimization
Chapter Outline
0. Introduction to Query Processing
1. Translating SQL Queries into Relational Algebra
2. Algorithms for External Sorting
3. Algorithms for SELECT and JOIN Operations
4. Algorithms for PROJECT and SET Operations
5. Implementing Aggregate Operations and Outer Joins
6. Combining Operations using Pipelining
7. Using Heuristics in Query Optimization
8. Using Selectivity and Cost Estimates in Query Optimization
9. Overview of Query Optimization in Oracle
10. Semantic Query Optimization
Overview of Query Processing
• Query processing is the process of retrieving data from a database in
response to a user query.
• Query Processing
• Translation of high-level queries into low-level expressions that can be used at the physical level
of the file system, query optimization, and actual execution of the query to get the result.
• Query Optimization
• Process in which multiple query-execution plans for satisfying a query are examined and the most
efficient query plan is identified for execution.
• The process of choosing a suitable execution strategy for processing a query.
• Query processing comprises the activities involved in parsing, validating, optimizing, and
executing a query.
• Two internal representations of a query:
• Query Tree
• Query Graph
Query Processing and Optimization
Query Processing Steps
Query Processing Steps
Translating SQL Queries into Relational Algebra
SELECT LNAME, FNAME
FROM EMPLOYEE
WHERE SALARY > ( SELECT MAX (SALARY)
FROM EMPLOYEE
WHERE DNO = 5);
The query is decomposed into an inner and an outer query block; the optimizer would then choose an execution plan for each query block
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Additional Operators: Semi-Join and Anti-Join
Semi-join
Generally used for unnesting EXISTS, IN, and ANY subqueries
Syntax: T1.X S= T2.Y
T1 is the left table and T2 is the right table of the semi-join
A row of T1 is returned as soon as T1.X finds a match with any value
of T2.Y without searching for further matches
SELECT COUNT(*)
FROM EMPLOYEE E, DEPARTMENT D
WHERE D.Dnumber S= E.Dno and E.Salary > 200000;
Anti-join
Generally used for unnesting NOT EXISTS and NOT IN subqueries
A row of T1 is returned only when T1.X has no matching value in T2.Y
SELECT COUNT(*)
FROM EMPLOYEE, DEPARTMENT
WHERE EMPLOYEE.Dno A= DEPARTMENT.Dnumber AND Zipcode = 30332;
• SQL Query
Select B_name, Price
From Book
Where Price > 50;
Query Tree: Example
Select B_name, Price
From Book
Where Price > 50;
[Query tree figure: the Book relation at the leaf, the selection Price > 50 above it, and the projection of B_name and Price at the root]
Query Evaluation Plan
• A query evaluation plan (QEP) is a sequence of operations that are used to evaluate a
query.
• It is generated by the query optimizer, and it specifies how the query will be executed on
the underlying hardware.
• The annotations in the query tree specify the algorithm or the index to be used to
evaluate each operation.
• This allows the query optimizer to generate a more efficient execution plan for the query.
• Example: the annotated tree for π B_name, Price (σ Price>50 (Book)) evaluates the selection on Book first, then the projection.
• Note:
• Different evaluation plans for a given query can have different costs. It is the responsibility of the query optimizer to generate a least costly plan.
• The best query evaluation plan is finally submitted to the query evaluation engine for actual execution.
Algorithms for External Sorting
• Degree of merging
• Number of sorted subfiles that can be merged in each merge step
• Performance of the sort-merge algorithm
• Number of disk block reads and writes before sorting is completed
Example: External Sorting Using Sort-Merge
Example
• If the number of available main memory buffers nB = 5 disk blocks and the size of the file b
= 1,024 disk blocks, then
• nR = ⎡(b/nB)⎤ = 205 initial runs, each of size 5 blocks (except the last run, which will have only 4
blocks).
• Hence, after the sorting phase, 205 sorted runs (or 205 sorted subfiles of the original file) are stored
as temporary subfiles on disk.
• In the merging phase, the sorted runs are merged during one or more merge passes. Each
merge pass can have one or more merge steps.
• The degree of merging (dM) is the number of sorted subfiles that can be merged in each merge step.
• During each merge step, one buffer block is needed to hold one disk block from each of the sorted
subfiles being merged, and one additional buffer is needed for containing one disk block of the merge
result, which will produce a larger sorted file that is the result of merging several smaller sorted
subfiles.
• Hence, dM is the smaller of (nB − 1) and nR, and the number of merge passes is ⎡(logdM(nR)) ⎤.
• In our example, where nB = 5, dM = 4 (four-way merging),
• so the 205 initial sorted runs would be merged 4 at a time in each step into 52 larger sorted subfiles at
the end of the first merge pass.
• These 52 sorted files are then merged 4 at a time into 13 sorted files, which are then merged into 4
sorted files, and then finally into 1 fully sorted file, which means that four passes are needed.
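The run and pass counts in the example above can be checked with a short calculation (an illustrative sketch; the variable names mirror the notation on this slide):

```python
import math

b, nB = 1024, 5                        # file blocks and available buffer blocks
nR = math.ceil(b / nB)                 # number of initial sorted runs
dM = min(nB - 1, nR)                   # degree of merging (one buffer reserved for output)
passes = math.ceil(math.log(nR, dM))   # number of merge passes
print(nR, dM, passes)                  # 205 4 4
```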
Algorithms for External Sorting (Sort-Merge Algorithm)
Algorithms for SELECT Operation
• SELECT operation : There are many algorithms for executing a SELECT operation,
which is basically a search operation to locate the records in a disk file that
satisfy a certain condition.
• Examples:
• (OP1): σ SSN='123456789' (EMPLOYEE)
• (OP2): σ DNUMBER>5 (DEPARTMENT)
• (OP3): σ DNO=5 (EMPLOYEE)
• (OP4): σ DNO=5 AND SALARY>30000 AND SEX='F' (EMPLOYEE)
• (OP5): σ ESSN='123456789' AND PNO=10 (WORKS_ON)
Algorithms for SELECT Operation…
• Search Methods for Simple Selection:
• S1 Linear search (brute force):
• Retrieve every record in the file, and test whether its attribute values
satisfy the selection condition.
• S2 Binary search:
• If the selection condition involves an equality comparison on a key
attribute on which the file is ordered, binary search (which is more
efficient than linear search) can be used. (See OP1).
• S3 Using a primary index or hash key to retrieve a single record:
• If the selection condition involves an equality comparison on a key
attribute with a primary index (or a hash key), use the primary index (or
the hash key) to retrieve the record.
• Note that this condition returns a single record (at most).
Algorithms for SELECT Operation…
• Search Methods for Simple Selection:
• S4 Using a primary index to retrieve multiple records:
• If the comparison condition is >, ≥, <, or ≤ on a key field with a primary index, use the index
to find the record satisfying the corresponding equality condition, then retrieve all
subsequent records in the (ordered) file. (see OP2)
• S5 Using a clustering index to retrieve multiple records:
• If the selection condition involves an equality comparison on a non-key attribute with a
clustering index, use the clustering index to retrieve all the records satisfying the selection
condition (see OP3).
• S6 Using a secondary (B+-tree) index:
• On an equality comparison, this search method can be used to retrieve a single record if the
indexing field has unique values (is a key) or to retrieve multiple records if the indexing field
is not a key.
• In addition, it can be used to retrieve records on conditions involving >,>=, <, or <=. (FOR
RANGE QUERIES)
Algorithms for SELECT Operation…
• Search Methods for Complex Selections
• Conjunctive selection:
• A conjunctive selection is of the form
• σ θ1 ∧ θ2 ∧ … ∧ θn ( r ).
• The DBMS can use an index on one of the simple conditions to retrieve candidate records, and
then test each retrieved record against the remaining conditions in memory.
• Conjunctive selection using a composite index
• If two or more attributes are involved in equality conditions in the conjunctive condition and a
composite index (or hash structure) exists on the combined field, we can use the index directly.
• Disjunctive selection conditions
• A disjunctive selection is a selection of the form:
• σ θ1 ∨ θ2 ∨ … ∨ θn ( r ).
• A disjunctive condition is satisfied by the union of all records satisfying the individual simple
conditions θi.
• Negation:
• The result of a selection σ ¬θ ( r ) is the set of tuples of r for which the condition θ evaluates to
false.
Algorithms for SELECT Operation…
• Whenever a single condition specifies the selection, we can only check
whether an access path exists on the attribute involved in that condition.
• If an access path exists, the method corresponding to that access path is
used; otherwise, the “brute force” linear search approach of method S1 is
used. (See OP1, OP2 and OP3)
• For conjunctive selection conditions, whenever more than one of the
attributes involved in the conditions have an access path, query
optimization should be done to choose the access path that retrieves the
fewest records in the most efficient way.
Algorithms for JOIN Operations
• Like selection, the join operation can be implemented in a variety of
ways. In terms of disk access, the joining can be very expensive, so
implementing and utilizing efficient join algorithms is critical in
minimizing a query’s execution time.
• Methods for implementing joins :
• Nested-Loop Join (brute force):
• Block Nested Loop Join (variation of nested loop join)
• Sort Merge Join
• Hash Join
Nested-Loop Join
• For each record t in R (outer loop), retrieve every record s from S (inner loop) and test whether the two records satisfy the join
condition t[A] = s[B].
• This algorithm consists of an inner for loop nested within an outer for loop.
• We use following notations
• r,s : Relations r and s
• tr : Tuple in relation r
• ts : Tuple in relation s
• nr : Number of records in relation r
• ns : Number of records in relation s
• br : Number of blocks with records in relation r
• bs : Number of blocks with records in relation s
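The tuple-at-a-time algorithm above can be sketched as follows (an illustrative in-memory sketch using Python lists of dictionaries, not the actual disk-based implementation; the sample EMPLOYEE/DEPARTMENT data are made up):

```python
def nested_loop_join(R, S, A, B):
    """Brute-force join: test t[A] = s[B] for every pair of records."""
    result = []
    for tr in R:              # outer loop over relation R
        for ts in S:          # inner loop over relation S
            if tr[A] == ts[B]:
                result.append({**tr, **ts})
    return result

emp = [{"name": "Ann", "dno": 5}, {"name": "Bob", "dno": 4}]
dept = [{"dname": "Research", "dnumber": 5}]
print(nested_loop_join(emp, dept, "dno", "dnumber"))
```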
Block Nested Loop Join
• The following algorithm shows block nested loop join, where every block of the inner relation is paired with
every block of outer relation.
• Within each pair of blocks, every tuple in one block is paired with every tuple in the other block,
to generate all pairs of tuples.
• As before, all pairs of tuples that satisfy the join condition are added to the result.
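The block-at-a-time pairing can be sketched as follows (an illustrative in-memory sketch; real implementations read whole disk blocks into buffers, which Python lists only approximate):

```python
def block_nested_loop_join(R, S, A, B, block_size=2):
    """Pair each block of the outer relation with each block of the
    inner relation; join tuples within each pair of blocks."""
    result = []
    for i in range(0, len(R), block_size):          # read one outer block
        r_block = R[i:i + block_size]
        for j in range(0, len(S), block_size):      # scan inner relation once per outer block
            s_block = S[j:j + block_size]
            for tr in r_block:
                for ts in s_block:
                    if tr[A] == ts[B]:
                        result.append({**tr, **ts})
    return result

R = [{"a": 1}, {"a": 2}, {"a": 3}]
S = [{"b": 2}, {"b": 3}, {"b": 4}]
print(block_nested_loop_join(R, S, "a", "b"))
```

The saving over the plain nested loop is in disk accesses, not comparisons: the inner relation is scanned once per outer block rather than once per outer tuple.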
Hash Join
• The records of files R and S are both hashed to the same hash file, using the same hashing
function on the join attributes A of R and B of S as hash keys.
• A single pass through the file with fewer records (say, R) hashes its records to the hash file
buckets.
• A single pass through the other file (S) then hashes each of its records to the appropriate
bucket, where the record is combined with all matching records from R.
• Like the sort-merge join, the hash join algorithm can be used to perform natural joins and equi-
joins.
• The concept behind the hash-join algorithm is to partition the tuples of each given relation
into sets (hash buckets).
• The partitioning is done on the basis of the same hash value on the join attributes; the hash
function provides the hash value.
• The main goal of using the hash function in the algorithm is to reduce the number of
comparisons and increase the efficiency of completing the join operation on the relations.
Algorithms for PROJECT and SET Operations
• Set operations:
• UNION, INTERSECTION, SET DIFFERENCE and CARTESIAN PRODUCT
• CARTESIAN PRODUCT :
• Cartesian Product of relations R and S include all possible combinations of
records from R and S.
• The attributes of the result include all attributes of R and S.
• Cost analysis of CARTESIAN PRODUCT
• If R has n records and j attributes and S has m records and k attributes, the result
relation will have n*m records and j+k attributes.
• CARTESIAN PRODUCT operation is very expensive and should be avoided if possible.
Algorithms for PROJECT and SET Operations (3)
Implementing Aggregate Operations and Outer Joins (1)
• Implementing Aggregate Operations:
• Aggregate operators:
• MIN, MAX, SUM, COUNT and AVG
• Options to implement aggregate operators:
• Table Scan
• Index
• Example
• SELECT MAX (SALARY) FROM EMPLOYEE;
• If an (ascending) B+-tree index on SALARY exists for the employee relation, then the optimizer can
decide on using the Salary index to search for the largest Salary value in the index by following the
rightmost pointer in each index node from the root to the rightmost leaf.
• That node would include the largest Salary value as its last entry.
• In most cases, this would be more efficient than a full table scan of EMPLOYEE, since no actual
records need to be retrieved.
• The MIN function can be handled in a similar manner.
Implementing Aggregate Operations and Outer Joins (2)
• Implementing Aggregate Operations (contd.):
• SUM, COUNT and AVG
• For a dense index (each record has one index entry):
• Apply the associated computation to the values in the index.
• For a non-dense index:
• The actual number of records associated with each index entry (value) must be accounted for
to obtain a correct computation.
• This can be done if the number of records associated with each value in the index is stored in
each index entry.
• For the COUNT aggregate function, the number of values can be also computed from the
index in a similar manner.
• If a COUNT(*) function is applied to a whole relation, the number of records currently in each
relation are typically stored in the catalog, and so the result can be retrieved directly from the
catalog.
Implementing Aggregate Operations and Outer Joins (3)
• Implementing Outer Join:
• Outer Join Operators:
• LEFT OUTER JOIN
• RIGHT OUTER JOIN
• FULL OUTER JOIN.
• The full outer join produces a result which is equivalent to the union of the results of the left and
right outer joins.
• Example (LEFT OUTER JOIN):
SELECT FNAME, DNAME
FROM (EMPLOYEE LEFT OUTER JOIN DEPARTMENT
ON DNO = DNUMBER);
• Note: The result of this query is a table of employee names and their associated departments.
• It is similar to a regular join result, with the exception that if an employee does not have an
associated department, the employee's name will still appear in the resulting table, although the
department name would be indicated as null.
• Outer join can be looked upon as a combination of inner join and anti-join.
Implementing Aggregate Operations and Outer Joins (4)
Materialization
• It is intuitively easier to understand the evaluation of an expression by looking at a
pictorial representation of the expression in an operator tree.
• Consider the expression: π staff_name (σ dept_id=3 (Department) ⋈ Staff)
[Operator tree: Department at the leaf, σ dept_id=3 above it, joined (⋈) with Staff, and π staff_name at the root]
Materialization
• In this method, the given expression evaluates one relational operation at a time. Also,
each operation is evaluated in an appropriate sequence or order.
• After evaluating all the operations, the outputs are materialized in a temporary relation
for their subsequent uses.
• The example above is computed as follows:
1. Compute σ dept_id=3 (Department) and store the result in relation1
2. Compute relation1 ⋈ Staff and store the result in relation2
3. Compute π staff_name on the materialized relation2
By repeating the process, we will eventually evaluate the operation at the root of the
tree, giving the final result of the expression.
Disadvantages:
1. Creation of temporary relations and storage of temporary relations on the disk (wastage of
space)
2. Cost of writing intermediate results to the disk
3. Only one operation at a time (starting from the lowest level)
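The three materialization steps can be mimicked in memory as follows (an illustrative sketch; the sample Department and Staff data are made up):

```python
Department = [{"dept_id": 3, "dname": "HR"}, {"dept_id": 7, "dname": "IT"}]
Staff = [{"staff_name": "Ann", "dept_id": 3}, {"staff_name": "Bob", "dept_id": 7}]

# Step 1: compute the selection and materialize it as relation1
relation1 = [d for d in Department if d["dept_id"] == 3]

# Step 2: join relation1 with Staff and materialize the result as relation2
relation2 = [{**d, **s} for d in relation1
             for s in Staff if s["dept_id"] == d["dept_id"]]

# Step 3: project staff_name from the materialized relation2
result = [t["staff_name"] for t in relation2]
print(result)  # ['Ann']
```

Each step fully builds its intermediate list before the next step starts, which is exactly the cost pipelining avoids.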
Pipelining
• Combining several operations into one and avoiding the writing of temporary results
to disk is called pipelining or stream-based processing.
• In this method, the DBMS does not store records in temporary tables. Instead, it evaluates
each operation and passes its result directly to the next operation, and so on.
• Pipelining evaluates multiple operations simultaneously by passing results of one
operation to the next one without storing the tuples on the disk.
• More than one operation done at a time.
• This is cheaper than materialization.
• In the above example, all three operations can be placed in a pipeline, which passes
the results of the selection to the join as they are generated. In turn, it passes the
results of the join to the projection as they are generated.
• Pipelining cannot be used for operations such as sorting and hashing, which must consume
their entire input before producing any output.
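Python generators give a compact model of stream-based processing (an illustrative sketch; each operator yields tuples to its consumer one at a time, and nothing is written to a temporary table — the sample Book data follow the earlier Price > 50 example):

```python
def scan(table):
    for t in table:
        yield t                      # produce tuples one at a time

def select(rows, pred):
    for t in rows:
        if pred(t):
            yield t                  # passed straight to the consumer, never materialized

def project(rows, attr):
    for t in rows:
        yield t[attr]

Book = [{"B_name": "DBMS", "Price": 60}, {"B_name": "OS", "Price": 40}]
pipeline = project(select(scan(Book), lambda t: t["Price"] > 50), "B_name")
print(list(pipeline))  # ['DBMS']
```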
Combining Operations Using Pipelining
• SQL query translated into relational algebra expression
• Sequence of relational operations
• Materialized evaluation
• Creating, storing, and passing temporary results
• General query goal: minimize the number of temporary files
• Pipelining or stream-based processing
• Combines several operations into one
• Avoids writing temporary files
Combining Operations using Pipelining (cont’d.)
• Motivation
• A query is mapped into a sequence of operations.
• Each execution of an operation produces a temporary result.
• Generating and saving temporary files on disk is time consuming and
expensive.
• Alternative
• Avoid constructing temporary results as much as possible.
• Pipeline the data through multiple operations - pass the result of a previous
operator to the next without waiting to complete the previous operation.
• Pipelined evaluation benefits
• Avoiding cost and time delay associated with writing intermediate results to
disk
• Being able to start generating results as quickly as possible
Combining Operations Using Pipelining (cont’d.)
• Iterator
• Operation implemented in such a way that it outputs one tuple at a time
• Many iterators may be active at one time
• Iterator interface methods
• Open()
• Get_Next()
• Close()
• Iterator concept can also be applied to access methods
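A minimal sketch of the iterator interface above (illustrative; the class and method names simply mirror the Open()/Get_Next()/Close() convention, with a table scan as the leaf access method):

```python
class TableScan:
    """Leaf iterator over a stored table."""
    def __init__(self, table):
        self.table = table
    def open(self):
        self.pos = 0
    def get_next(self):
        if self.pos < len(self.table):
            t = self.table[self.pos]
            self.pos += 1
            return t
        return None                   # end of input
    def close(self):
        pass

class SelectIterator:
    """Selection operator in Open()/Get_Next()/Close() style."""
    def __init__(self, child, pred):
        self.child, self.pred = child, pred
    def open(self):
        self.child.open()
    def get_next(self):
        while (t := self.child.get_next()) is not None:
            if self.pred(t):
                return t              # outputs one tuple at a time
        return None
    def close(self):
        self.child.close()

sel = SelectIterator(TableScan([1, 2, 3, 4]), lambda x: x > 2)
sel.open()
out = []
while (t := sel.get_next()) is not None:
    out.append(t)
sel.close()
print(out)  # [3, 4]
```

Because each operator pulls tuples from its child on demand, many such iterators can be active at once, forming a pipeline.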
Combining Operations using Pipelining (cont’d.)
• Example:
• if a join operation follows two select operations on base relations,
the tuples resulting from each select are provided as input for the
join algorithm in a stream or pipeline as they are produced.
• For a 2-way join, combine the 2 selections on the input and one
projection on the output with the Join.
• Dynamic generation of code to allow for multiple operations to
be pipelined.
• Results of a select operation are fed in a "Pipeline" to the join
algorithm.
• Also known as stream-based processing.
Parallel Algorithms for Query Processing
• Parallel database architecture approaches
• Shared-memory architecture
• Multiple processors can access common main memory region
• Shared-disk architecture
• Every processor has its own memory
• Machines have access to all disks
• Shared-nothing architecture
• Each processor has own memory and disk storage
• Most commonly used in parallel database systems
Parallel Algorithms for Query Processing (cont’d.)
• Goals of parallel processing:
• Linear speed-up
• Linear reduction in time taken for operations
• Linear speedup is a measure of how much faster a parallel program runs as the number of
processors increases.
• For example, if a parallel program takes 10 seconds to run on one processor, then with linear
speed-up it would take 1 second to run on 10 processors.
• Linear scale-up
• Linear scaleup is a measure of how much the problem size can be increased while still
maintaining the same execution time as the original problem size.
• Constant sustained performance is obtained by increasing the number of processors and disks in
proportion to the problem size.
• For example, if a parallel program can solve a problem of size 100 on one processor, then with
linear scale-up 10 processors can solve a problem of size 1,000 in the same time.
Parallel Algorithms for Query Processing (cont’d.)
• Operator-level parallelism
• Operator-level parallelism (OLP) is a type of parallelism that occurs when
multiple operators in a query plan can be executed concurrently.
• In the operations that can be implemented with parallel algorithms, one of the
main strategies is to partition data across disks.
• Horizontal partitioning
• Horizontal partitioning of a relation corresponds to distributing the tuples across disks
based on some partitioning method.
• Round-robin partitioning
• Given n disks, assigning the ith tuple to disk i mod n is called round-robin partitioning
• Range partitioning
• Tuples are equally distributed (as much as possible) by dividing the range of values of some attribute.
• For example, employee tuples from the EMPLOYEE relation may be assigned to 10 disks by dividing the age
range into 10 ranges—say 22–25, 26–28, 29–30, and so on—such that each has roughly one-tenth of the
total number of employees.
• Hash partitioning
• A tuple is assigned to disk h(v), where v is its value on the partitioning attribute and h is the hashing function.
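The three partitioning methods can be sketched as simple disk-assignment functions (illustrative only; a real system maps tuples to physical disks, and the range boundaries below reuse the age example above):

```python
def round_robin(i, n):
    """Round-robin: the ith tuple goes to disk i mod n."""
    return i % n

def range_partition(value, boundaries):
    """Range partitioning: boundaries [25, 28, 30] give disks for
    values <= 25, 26-28, 29-30, and > 30 (as in the age example)."""
    for disk, upper in enumerate(boundaries):
        if value <= upper:
            return disk
    return len(boundaries)

def hash_partition(value, n):
    """Hash partitioning: the tuple goes to disk h(value) mod n."""
    return hash(value) % n

print(round_robin(7, 3), range_partition(27, [25, 28, 30]), hash_partition(12345, 10))
```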
Parallel Algorithms for Query Processing (cont’d.)
• Operator-level parallelism for various operations:
• Sorting
• If data has been range-partitioned on an attribute:
• Each partition can be sorted separately in parallel
• Results concatenated
• Reduces sorting time
• Selection
• If condition is an equality condition on an attribute used for range partitioning:
• Perform selection only on partition to which the value belongs
• In other cases, the selection would be performed in parallel on all the processors and the results merged.
• Projection without duplicate elimination
• Perform operation in parallel as data is read from each partition
• Duplicate elimination
• Sort tuples and discard duplicates
Parallel Algorithms for Query Processing (cont’d.)
• Join: The basic idea of parallel join is to split the relations to be joined, say R and S, in such a way
that the join is divided into n smaller joins, then to perform these smaller joins in parallel on n
processors and take the union of the results.
• Parallel joins divide the join into n smaller joins
• Perform smaller joins in parallel on n processors
• Take a union of the result
• Parallel join techniques
• Equality-based partitioned join
• Inequality join with partitioning and replication
• Asymmetric case
• Symmetric case
• Parallel partitioned hash join
[Figures: inequality join with partitioning and replication (asymmetric and symmetric cases); equality-based partitioned join]
Parallel Algorithms for Query Processing (cont’d.)
• Aggregation
• Achieved by partitioning on the grouping attribute and then computing the
aggregate function locally at each processor
• Set operations
• For union, intersection, and set difference operations, if the argument
relations R and S are partitioned using the same hash function, they can be
done in parallel on each processor
Parallel Algorithms for Query Processing (cont’d.)
• Intraquery parallelism
• Approaches
• Use parallel algorithm for each operation, with appropriate partitioning of the data input to that
operation
• Execute independent operations in parallel
• If the output of one of the operations can be generated tuple-by-tuple and fed into another operator,
the result is pipelined parallelism.
• An operator that does not produce any output until it has consumed all its inputs is said to block the
pipelining
• Interquery parallelism
• Execution of multiple queries in parallel
• Goal: scale up (i.e., to increase the overall rate at which queries or transactions can be
processed by increasing the number of processors)
• Difficult to achieve on shared-disk or shared-nothing architectures
• Database systems using shared memory parallel architecture can achieve this type of
parallelism more easily without significant changes.
Query Optimization
• Process of selecting the most efficient query evaluation from the available possible plans for processing a
given query.
• The function of the query optimization engine is to find an evaluation plan that reduces the overall
execution cost of a query.
• Two phases:
• Selection of most efficient expression equivalent to given expression
• Selection of detailed strategies: Choosing algorithms, setup indices etc
• It is clear that the costs of performing particular operations such as select and join can vary
dramatically.
• Example: Consider two relations r and s with the following characteristics:
• nr = 10,000 (number of tuples in r)
• ns = 1,000 (number of tuples in s)
• br = 1,000 (number of blocks with tuples in r)
• bs = 100 (number of blocks with tuples in s)
• Selecting a single record from r on a key attribute on which the file is ordered can cost:
• ⎡log2(br)⎤ = 10 block accesses (binary search)
• br/2 = 500 block accesses on average (linear search)
• Notice that the costs of the two selects differ by a factor of 50.
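A quick check of the arithmetic (an illustrative sketch; br follows the example above):

```python
import math

br = 1000                                # blocks holding tuples of r
binary = math.ceil(math.log2(br))        # binary search: ~10 block accesses
linear = br // 2                         # linear search: 500 block accesses on average
print(binary, linear, linear // binary)  # 10 500 50
```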
Query Optimization
• The steps involved in query optimization are:
1.Identifying the different execution plans for the query: There can be many different ways to
execute a query. The query optimizer needs to identify all of the possible execution plans.
2.Estimating the cost of each execution plan: The query optimizer needs to estimate the cost of
each execution plan. The cost of an execution plan is determined by the number of operations it
requires, the size of the data involved, and the speed of the DBMS.
3.Selecting the most efficient execution plan: The query optimizer selects the most efficient
execution plan. The most efficient execution plan is the one that has the lowest cost.
• By optimizing queries, the DBMS can retrieve data more quickly and efficiently.
• Query optimization is the process of finding the most efficient query execution plan
and this involves considering the following factors:
• The size and structure of the database
• The specific operations that need to be performed
• The cost of each operation
• Here are some of the benefits of query processing and optimization:
• Improved performance: Optimized queries can be executed more quickly, which can improve
the overall performance of the DBMS.
• Reduced resource usage: Optimized queries can use fewer resources, such as CPU time and
memory, which can improve the overall efficiency of the DBMS.
• Increased scalability: Optimized queries can scale better to handle larger datasets, which can
improve the performance of the DBMS as the amount of data grows.
Query Optimizer
Using Heuristics in Query Optimization (1)
• Example:
• For every project located in ‘Stafford’, retrieve the project number, the controlling
department number and the department manager’s last name, address and birthdate.
• Relation algebra:
π PNUMBER, DNUM, LNAME, ADDRESS, BDATE (((σ PLOCATION=‘STAFFORD’ (PROJECT))
⋈ DNUM=DNUMBER (DEPARTMENT)) ⋈ MGRSSN=SSN (EMPLOYEE))
• SQL query:
Q2: SELECT P.PNUMBER, P.DNUM, E.LNAME, E.ADDRESS, E.BDATE
FROM PROJECT AS P,DEPARTMENT AS D, EMPLOYEE AS E
WHERE P.DNUM=D.DNUMBER AND D.MGRSSN=E.SSN AND
P.PLOCATION=‘STAFFORD’;
Using Heuristics in Query Optimization (4)–(6)
Using Heuristics in Query Optimization (14)
• Outline of a Heuristic Algebraic Optimization Algorithm:
1. Using rule 1, break up any select operations with conjunctive conditions into a
cascade of select operations. This permits a greater degree of freedom in moving
SELECT operations down different branches of the tree.
2. Using rules 2, 4, 6, 10, 13, and 14 concerning the commutativity of select with other
operations, move each select operation as far down the query tree as is permitted
by the attributes involved in the select condition.
3. Using rules 5 and 9 concerning associativity of binary operations, rearrange the leaf
nodes of the tree so that the leaf node relations with the most restrictive select
operations are executed first in the query tree representation.
4. Using Rule 12, combine a Cartesian product operation with a subsequent select
operation in the tree into a join operation.
5. Using rules 3, 4, 7, and 11 concerning the cascading of project and the commuting
of project with other operations, break down and move lists of projection attributes
down the tree as far as possible by creating new project operations as needed.
6. Identify subtrees that represent groups of operations that can be executed by a
single algorithm.
Using Heuristics in Query Optimization (15)
Using Selectivity and Cost Estimates in Query Optimization (1)
•Cost-based query optimization:
• Estimate and compare the costs of executing a query using different execution strategies and
choose the strategy with the lowest cost estimate.
• This is based on the cost of the query. The query can use different paths based on indexes,
constraints, sorting methods etc.
• The cost of an execution plan is estimated based on several factors,
including the size of the database, the number of rows that need to be
processed, and the cost of each operation.
• The cost of each operation is typically estimated based on the number of CPU
cycles and I/O operations that the operation will require.
• This method mainly uses statistics such as record size, number of records, number of records
per block, number of blocks, table size, whether the whole table fits in a block, uniqueness of
column values, size of columns, etc.
• Size of the result file after a join operation (join cardinality, jc):
• jc = |(R ⋈C S)| = js * |R| * |S|, where js is the join selectivity
Cost Functions for JOIN (contd.)
• The last part of the formula is the cost of writing the resulting file to disk
• J2. Index-based nested-loop join (using an access structure to retrieve the matching
record(s))
• If an index exists for the join attribute B of S with index levels xB, we can retrieve each
record s in R and then use the index to retrieve all the matching records t from S that
satisfy t[B] = s[A].
• Where A is the joining attribute of R and B is the joining attribute of S
• The cost depends on the type of index.
Cost Functions for JOIN (contd.)
• J2. Index-based nested-loop join (contd.)
• For a secondary index,
• CJ2a = bR + (|R| * (xB + sB)) + ((js* |R|* |S|)/bfrRS);
• For a clustering index,
• CJ2b = bR + (|R| * (xB + (sB/bfrB))) + ((js* |R|* |S|)/bfrRS);
• where sB is the selection cardinality for the join attribute B of S.
• Selection cardinality was defined as the average number of records that satisfy an equality condition on an
attribute, which is the average number of records that have the same value for the attribute and hence will be
joined to a single record in the other file.
• For a primary index,
• CJ2c = bR + (|R| * (xB + 1)) + ((js* |R|* |S|)/bfrRS);
• If a hash key exists for one of the two join attributes — B of S
• CJ2d = bR + (|R| * h) + ((js* |R|* |S|)/bfrRS);
• where h ≥ 1 is the average number of block accesses to retrieve a record, given its hash
key value.
• Usually, h is estimated to be 1 for static and linear hashing and 2 for extendible hashing.
This is an optimistic estimate, and typically h ranges from 1.2 to 1.5 in practical
situations.
Cost Functions for JOIN (contd.)
• J3. Sort-merge join:
• If the files are already sorted on the join attributes, the cost function for this method is
• CJ3a = bR + bS + ((js* |R|* |S|)/bfrRS);
• If we must sort the files, the cost of sorting must be added, and the cost function becomes:
• CJ3a = CS + bR + bS + ((js * |R| * |S|)/bfrRS), where CS is the cost of sorting the files
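As a sketch of how such cost functions are applied, CJ3a and CJ2c translate directly into code (illustrative only; the sample parameter values reuse the r and s statistics from the earlier optimization example, while js, xB, bfrRS, and the sort cost are assumed values):

```python
def cost_sort_merge_join(bR, bS, js, R, S, bfrRS, sort_cost=0):
    """CJ3a = CS + bR + bS + (js * |R| * |S|) / bfrRS."""
    return sort_cost + bR + bS + (js * R * S) / bfrRS

def cost_index_nlj_primary(bR, R, xB, js, S, bfrRS):
    """CJ2c = bR + |R| * (xB + 1) + (js * |R| * |S|) / bfrRS."""
    return bR + R * (xB + 1) + (js * R * S) / bfrRS

# with bR = 1000, bS = 100, |R| = 10,000, |S| = 1,000, js = 0.001, bfrRS = 10:
print(cost_sort_merge_join(1000, 100, 0.001, 10000, 1000, 10))   # 2100.0
print(cost_index_nlj_primary(1000, 10000, 2, 0.001, 1000, 10))   # 32000.0
```

Comparing such estimates across candidate plans is exactly what the cost-based optimizer does.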
Additional Issues Related to Query Optimization
• Displaying the system’s query execution plan
• Oracle syntax
• EXPLAIN PLAN FOR <SQL query>;
• The query may involve INSERT, DELETE, and UPDATE statements;
• The output goes into a table called PLAN_TABLE.
• Showing Execution Plans:
• select * from table(dbms_xplan.display);
• IBM DB2 syntax
• EXPLAIN PLAN SELECTION [additional options] FOR <SQL-query>
• SQL server syntax
• SET SHOWPLAN_TEXT ON or SET SHOWPLAN_XML ON or SET SHOWPLAN_ALL ON
• PostgreSQL Syntax
• EXPLAIN [set of options] <SQL-query>
• where the options include ANALYZE, VERBOSE, COSTS, BUFFERS, TIMING, etc.
Additional Issues Related to Query Optimization (cont’d.)
• Size estimation of other operations
• Projection
• For a projection of the form πList (R), expressed in SQL as SELECT List FROM R, since SQL treats
the result as a multiset, the estimated number of tuples in the result is |R|.
• If the DISTINCT option is used, then size of πA (R) is NDV (A, R).
• Set operations
• If the arguments for an intersection, union, or set difference are made of selections on the same relation,
they can be rewritten as conjunction, disjunction, or negation, respectively.
• For example, σc1 (R) ∩ σc2 (R) can be rewritten as σc1 AND c2 (R); and σc1 (R) ∪ σc2 (R) can be rewritten as
σc1 OR c2 (R).
• The size estimation can be made based on the selectivity of conditions c1 and c2.
• Otherwise, the estimated upper bound on the size of r ∩ s is the minimum of the sizes of r and s;
• the estimated upper bound on the size of r ∪ s is the sum of their sizes.
• Aggregation
• The size of G ℑ Aggregate-function(A) (R) is NDV (G, R), since there is one group for each unique value of G.
• Outer join
• The size of R LEFT OUTER JOIN S would be |R ⋈ S| plus |R anti-join S|. Similarly, the size of R FULL OUTER
JOIN S would be |R ⋈ S| plus |R anti-join S| plus |S anti-join R|.
Additional Issues Related to Query Optimization (cont’d.)
• Plan caching
• Plan stored by query optimizer for later use by same queries with different
parameters.
• Storing of the plan and reusing it is referred to as plan caching.
• Top-k results optimization
• When the output of a query is expected to be large, sometimes the user is
satisfied with only the top-k results based on some sort order.
• Limits strategy generation: knowing that only the first k results are needed allows
the optimizer to restrict the execution strategies it considers.
Query Optimization in Data Warehouses
• Star transformation optimization
• Goal: to access a reduced set of data from the fact table and avoid using a full table scan
on it
• Two types of star transformation optimizations are possible:
• Both these optimizations are performed on the basis of comparative costs of the original and the
transformed queries.
• Classic star transformation
• In this optimization, a Cartesian product of the dimension tables is performed first after applying the filters
(such as D1.A > 5 ) to each dimension table.
• Note that generally there are no join predicates between dimension tables.
• The result of this Cartesian product is then joined with the fact table using B-tree indexes (if any) on the joining
keys of the fact table.
• Bitmap index star transformation
• Bitmap indexes are widely used in data warehousing environments.
• The requirement with this optimization is that there must be bitmap indexes on the fact-table joining keys
referenced in the query.
• For example, in QSTAR, there must be bitmap indexes on FACT.Fk1, FACT.Fk2, and FACT.Fk3 attributes; each bit
in the bitmap corresponds to a row in the fact table.
• The bit is set if the key value of the attribute appears in a row of the fact table.
Query Optimization in Data Warehouses…
Query Optimization in Data Warehouses…
• Joining back
• The subquery bitmap trees filter the fact table based on the filter predicates
on the dimension tables; therefore, it may still be necessary to join the
dimension tables back to the relevant rows in the fact table using the original
join predicates.
• The join back of a dimension table can be avoided if the columns of the
dimension table are not referenced in the SELECT and GROUP-BY clauses.
• Note that in Q2STAR, the table DIM3 is not joined back to the FACT table,
since it is not referenced in the SELECT and GROUP-BY clauses, and DIM3.Pk is
unique.