Relational Algebra for Query Optimization
When a query is submitted, it is first scanned, parsed, and validated. An internal representation of
the query is then created, such as a query tree or a query graph. Alternative execution
strategies are then devised for retrieving results from the database tables. The process of choosing the
most appropriate execution strategy for query processing is called query optimization.
Query Optimization Issues in DDBMS
In a DDBMS, query optimization is a crucial task. The complexity is high, since the number of
alternative strategies may increase exponentially due to the following factors −
The presence of a number of fragments.
Distribution of the fragments or tables across various sites.
The speed of communication links.
Disparity in local processing capabilities.
Hence, in a distributed system, the target is often to find a good execution strategy for query
processing rather than the best one. The time to execute a query is the sum of the following −
Time to communicate queries to databases.
Time to execute local query fragments.
Time to assemble data from different sites.
Time to display results to the application.
Query Processing
Query processing is the set of all activities starting from query submission to displaying the results
of the query. The steps are as shown in the following diagram −
Relational Algebra
Relational algebra defines the basic set of operations of the relational database model. A sequence of
relational algebra operations forms a relational algebra expression. The result of this expression
represents the result of a database query.
The basic operations are −
Projection
Selection
Union
Intersection
Minus
Join
Projection
The projection operation displays a subset of the fields of a table. This gives a vertical partition of the
table.
Syntax in Relational Algebra
$$\pi_{<{AttributeList}>}{(<{Table Name}>)}$$
For example, let us consider the following Student database −
STUDENT
Roll_No Name Course Semester Gender
2 Amit Prasad BCA 1 Male
4 Varsha Tiwari BCA 1 Female
5 Asif Ali MCA 2 Male
6 Joe Wallace MCA 1 Male
8 Shivani Iyengar BCA 1 Female
If we want to display the names and courses of all students, we will use the following relational
algebra expression −
$$\pi_{Name,Course}{(STUDENT)}$$
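As a sketch, projection can be modelled in Python by representing a relation as a list of dictionaries, one per tuple. This is an illustrative in-memory model only, not how a DBMS actually stores tables, and the helper name `project` is our own:

```python
# Minimal sketch: a relation as a list of dictionaries (one per tuple).
STUDENT = [
    {"Roll_No": 2, "Name": "Amit Prasad",     "Course": "BCA", "Semester": 1, "Gender": "Male"},
    {"Roll_No": 4, "Name": "Varsha Tiwari",   "Course": "BCA", "Semester": 1, "Gender": "Female"},
    {"Roll_No": 5, "Name": "Asif Ali",        "Course": "MCA", "Semester": 2, "Gender": "Male"},
    {"Roll_No": 6, "Name": "Joe Wallace",     "Course": "MCA", "Semester": 1, "Gender": "Male"},
    {"Roll_No": 8, "Name": "Shivani Iyengar", "Course": "BCA", "Semester": 1, "Gender": "Female"},
]

def project(relation, attributes):
    """pi_{attributes}(relation): keep only the named fields, dropping duplicates."""
    seen, result = set(), []
    for tup in relation:
        row = tuple(tup[a] for a in attributes)
        if row not in seen:            # relational algebra is set-based
            seen.add(row)
            result.append(dict(zip(attributes, row)))
    return result

print(project(STUDENT, ["Name", "Course"]))
```

Note the duplicate check: since relations are sets of tuples, projection must remove any duplicate rows it produces.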
Selection
The selection operation displays a subset of the tuples of a table that satisfy certain conditions. This
gives a horizontal partition of the table.
Syntax in Relational Algebra
$$\sigma_{<{Conditions}>}{(<{Table Name}>)}$$
For example, in the Student table, if we want to display the details of all students who have opted
for MCA course, we will use the following relational algebra expression −
$$\sigma_{Course = \small "MCA"}{(STUDENT)}$$
Combination of Projection and Selection Operations
For most queries, we need a combination of projection and selection operations. There are two
ways to write these expressions −
Using sequence of projection and selection operations.
Using rename operation to generate intermediate results.
For example, to display names of all female students of the BCA course −
Relational algebra expression using sequence of projection and selection operations
$$\pi_{Name}(\sigma_{Gender = \small "Female" \: AND \: Course = \small "BCA"}{(STUDENT)})$$
Relational algebra expression using rename operation to generate intermediate results
$$FemaleBCAStudent \leftarrow \sigma_{Gender = \small "Female" \: AND \: Course = \small "BCA"}{(STUDENT)}$$
$$Result \leftarrow \pi_{Name}{(FemaleBCAStudent)}$$
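The combined expression above can be sketched the same way: the selection is applied first, then the projection (in-memory list-of-dict model for illustration only):

```python
# Sketch of: pi_Name(sigma_{Gender = "Female" AND Course = "BCA"}(STUDENT))
STUDENT = [
    {"Name": "Amit Prasad",     "Course": "BCA", "Gender": "Male"},
    {"Name": "Varsha Tiwari",   "Course": "BCA", "Gender": "Female"},
    {"Name": "Asif Ali",        "Course": "MCA", "Gender": "Male"},
    {"Name": "Joe Wallace",     "Course": "MCA", "Gender": "Male"},
    {"Name": "Shivani Iyengar", "Course": "BCA", "Gender": "Female"},
]

# sigma: keep tuples satisfying the condition (horizontal partition)
female_bca = [t for t in STUDENT
              if t["Gender"] == "Female" and t["Course"] == "BCA"]

# pi: keep only the Name field (vertical partition); a set removes duplicates
result = sorted({t["Name"] for t in female_bca})
print(result)   # → ['Shivani Iyengar', 'Varsha Tiwari']
```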
Union
If P is the result of one operation and Q is the result of another operation, the union of P and Q ($P \cup Q$) is the set of all tuples that are either in P or in Q or in both, without duplicates.
For example, to display all students who are either in Semester 1 or are in BCA course −
$$Sem1Student \leftarrow \sigma_{Semester = 1}{(STUDENT)}$$
$$BCAStudent \leftarrow \sigma_{Course = \small "BCA"}{(STUDENT)}$$
$$Result \leftarrow Sem1Student \cup BCAStudent$$
Intersection
If P is the result of one operation and Q is the result of another operation, the intersection of P and Q ($P \cap Q$) is the set of all tuples that are in both P and Q.
For example, given the following two schemas −
EMPLOYEE
EmpID Name City Department Salary
PROJECT
PId City Department Status
To display the names of all cities where a project is located and also an employee resides −
$$CityEmp \leftarrow \pi_{City}{(EMPLOYEE)}$$
$$CityProject \leftarrow \pi_{City}{(PROJECT)}$$
$$Result \leftarrow CityEmp \cap CityProject$$
Minus
If P is the result of one operation and Q is the result of another operation, P − Q is the set of all tuples
that are in P and not in Q.
For example, to list all the departments which do not have an ongoing project (projects with
status = ongoing) −
$$AllDept \leftarrow \pi_{Department}{(EMPLOYEE)}$$
$$ProjectDept \leftarrow \pi_{Department}(\sigma_{Status = \small "ongoing"}{(PROJECT)})$$
$$Result \leftarrow AllDept - ProjectDept$$
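Because relational algebra is set-based, the three set operations map directly onto Python's set operators. A minimal sketch, with the city values invented for illustration:

```python
# Hypothetical single-attribute relations represented as Python sets.
city_emp     = {"Delhi", "Mumbai", "Chennai"}   # pi_City(EMPLOYEE), assumed values
city_project = {"Mumbai", "Kolkata"}            # pi_City(PROJECT), assumed values

union        = city_emp | city_project   # tuples in either relation, no duplicates
intersection = city_emp & city_project   # tuples in both relations
minus        = city_emp - city_project   # tuples in the first but not the second

print(sorted(union))          # → ['Chennai', 'Delhi', 'Kolkata', 'Mumbai']
print(sorted(intersection))   # → ['Mumbai']
print(sorted(minus))          # → ['Chennai', 'Delhi']
```

Note that all three operations require the two relations to be union-compatible, i.e. to have the same attributes.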
Join
The join operation combines related tuples of two different tables (or query results) into a single
table.
For example, consider two schemas, Customer and Branch in a Bank database as follows −
CUSTOMER
CustID AccNo TypeOfAc BranchID DateOfOpening
BRANCH
BranchID BranchName IFSCcode Address
To list the customer details along with branch details −
$$Result \leftarrow CUSTOMER \bowtie_{CUSTOMER.BranchID = BRANCH.BranchID}{(BRANCH)}$$
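A minimal sketch of this join, with the table contents invented for illustration, compares the BranchID of every customer tuple with that of every branch tuple and concatenates the matches:

```python
# Sketch of CUSTOMER join BRANCH on BranchID (invented data, in-memory model).
CUSTOMER = [
    {"CustID": 101, "AccNo": "A1", "BranchID": 1},
    {"CustID": 102, "AccNo": "A2", "BranchID": 2},
]
BRANCH = [
    {"BranchID": 1, "BranchName": "Central"},
    {"BranchID": 2, "BranchName": "North"},
]

result = []
for c in CUSTOMER:
    for b in BRANCH:
        if c["BranchID"] == b["BranchID"]:   # join condition
            result.append({**c, **b})        # the duplicate BranchID field collapses

print(result)
```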
Translating SQL Queries into Relational Algebra
SQL queries are translated into equivalent relational algebra expressions before optimization. A
query is at first decomposed into smaller query blocks. These blocks are translated to equivalent
relational algebra expressions. Optimization includes optimization of each block and then
optimization of the query as a whole.
Examples
Let us consider the following schemas −
EMPLOYEE
EmpID Name City Department Salary
PROJECT
PId City Department Status
WORKS
EmpID PID Hours
Example 1
To display the details of all employees who earn a salary less than the average salary, we write
the SQL query −
SELECT * FROM EMPLOYEE
WHERE SALARY < ( SELECT AVG(SALARY) FROM EMPLOYEE ) ;
This query contains one nested sub-query. So, this can be broken down into two blocks.
The inner block is −
SELECT AVG(SALARY) FROM EMPLOYEE ;
If the result of this query is AvgSal, then outer block is −
SELECT * FROM EMPLOYEE WHERE SALARY < AvgSal;
Relational algebra expression for inner block −
$$AvgSal \leftarrow \Im_{AVG(Salary)}{(EMPLOYEE)}$$
Relational algebra expression for outer block −
$$\sigma_{Salary < AvgSal}{(EMPLOYEE)}$$
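The evaluation of the two blocks can be sketched procedurally: the inner block's aggregate is computed first, and its result is then substituted into the outer selection (employee data invented for illustration):

```python
# Sketch of evaluating the nested query block by block (invented data).
EMPLOYEE = [
    {"EmpID": 1, "Name": "Ravi",  "Salary": 30000},
    {"EmpID": 2, "Name": "Meena", "Salary": 50000},
    {"EmpID": 3, "Name": "John",  "Salary": 40000},
]

# Inner block: AvgSal <- AVG(Salary)
avg_sal = sum(e["Salary"] for e in EMPLOYEE) / len(EMPLOYEE)

# Outer block: sigma_{Salary < AvgSal}(EMPLOYEE)
below_avg = [e for e in EMPLOYEE if e["Salary"] < avg_sal]
print(below_avg)   # only the employees earning strictly less than the average
```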
Example 2
To display the project ID and status of all projects of employee 'Arun Kumar', we write the SQL
query −
SELECT PID, STATUS FROM PROJECT
WHERE PID = ( SELECT PID FROM WORKS
WHERE EMPID = ( SELECT EMPID FROM EMPLOYEE
WHERE NAME = 'ARUN KUMAR'));
This query contains two nested sub-queries. Thus, it can be broken down into three blocks, as
follows −
SELECT EMPID FROM EMPLOYEE WHERE NAME = 'ARUN KUMAR';
SELECT PID FROM WORKS WHERE EMPID = ArunEmpID;
SELECT PID, STATUS FROM PROJECT WHERE PID = ArunPID;
(Here ArunEmpID and ArunPID are the results of inner queries)
Relational algebra expressions for the three blocks are −
$$ArunEmpID \leftarrow \pi_{EmpID}(\sigma_{Name = \small "Arun Kumar"}{(EMPLOYEE)})$$
$$ArunPID \leftarrow \pi_{PID}(\sigma_{EmpID = ArunEmpID}{(WORKS)})$$
$$Result \leftarrow \pi_{PID, Status}(\sigma_{PID = ArunPID}{(PROJECT)})$$
Computation of Relational Algebra Operators
The computation of relational algebra operators can be done in many different ways, and each
alternative is called an access path.
The computation alternative depends upon three main factors −
Operator type
Available memory
Disk structures
The time to perform execution of a relational algebra operation is the sum of −
Time to process the tuples.
Time to fetch the tuples of the table from disk to memory.
Since the time to process a tuple is much smaller than the time to fetch it from
storage, particularly in a distributed system, disk access is very often taken as the metric for
calculating the cost of a relational expression.
Computation of Selection
Computation of selection operation depends upon the complexity of the selection condition and
the availability of indexes on the attributes of the table.
Following are the computation alternatives depending upon the indexes −
No Index − If the table is unsorted and has no indexes, then the selection process
involves scanning all the disk blocks of the table. Each block is brought into the memory
and each tuple in the block is examined to see whether it satisfies the selection condition.
If the condition is satisfied, it is displayed as output. This is the costliest approach since
each tuple is brought into memory and each tuple is processed.
B+ Tree Index − Most database systems are built upon the B+ Tree index. If the
selection condition is based upon the field, which is the key of this B+ Tree index, then
this index is used for retrieving results. However, processing selection statements with
complex conditions may involve a larger number of disk block accesses and in some
cases complete scanning of the table.
Hash Index − If hash indexes are used and its key field is used in the selection condition,
then retrieving tuples using the hash index becomes a simple process. A hash index uses a
hash function to find the address of a bucket where the key value corresponding to the
hash value is stored. In order to find a key value in the index, the hash function is
executed and the bucket address is found. The key values in the bucket are searched. If a
match is found, the actual tuple is fetched from the disk block into the memory.
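A toy sketch of this lookup path, with the bucket count, heap file, and `lookup` helper all invented for illustration: the hash function picks the bucket, only that bucket is searched, and the matching tuple is then fetched by its stored address:

```python
# Toy hash index: bucket -> list of (key, address) entries (all structures invented).
NUM_BUCKETS = 4
heap_file = {0: {"Roll_No": 2, "Name": "Amit"},
             1: {"Roll_No": 5, "Name": "Asif"},
             2: {"Roll_No": 8, "Name": "Shivani"}}        # disk address -> tuple

index = {b: [] for b in range(NUM_BUCKETS)}
for addr, tup in heap_file.items():
    index[tup["Roll_No"] % NUM_BUCKETS].append((tup["Roll_No"], addr))

def lookup(key):
    for k, addr in index[key % NUM_BUCKETS]:   # search one bucket only
        if k == key:
            return heap_file[addr]             # fetch the actual tuple
    return None

print(lookup(5))   # → {'Roll_No': 5, 'Name': 'Asif'}
```

This is why equality selections on the hash key are cheap: only one bucket is examined, regardless of the size of the table.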
Computation of Joins
When we want to join two tables, say P and Q, each tuple in P has to be compared with each
tuple in Q to test whether the join condition is satisfied. If it is, the corresponding
tuples are concatenated, duplicate fields are eliminated, and the combined tuple is appended to the
result relation. Consequently, join is the most expensive operation.
The common approaches for computing joins are −
Nested-loop Approach
This is the conventional join approach. It can be illustrated through the following pseudocode
(Tables P and Q, with tuples tuple_p and tuple_q and joining attribute a) −
For each tuple_p in P
    For each tuple_q in Q
        If tuple_p.a = tuple_q.a Then
            Concatenate tuple_p and tuple_q and append to Result
        End If
    Next tuple_q
Next tuple_p
Sort-merge Approach
In this approach, the two tables are individually sorted on the joining attribute, and the
sorted tables are then merged. External sorting techniques are adopted, since the number of records
is typically too high to be accommodated in memory. Once the individual tables are sorted,
one page of each sorted table is brought into memory, the pages are merged on the joining
attribute, and the joined tuples are written out.
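The merge step can be sketched as follows, assuming for illustration that the sorting happens in memory (a real system sorts runs externally); blocks of equal keys on the two sides are joined pairwise:

```python
# Sketch of a sort-merge join on attribute `attr` (in-memory, invented data).
def sort_merge_join(P, Q, attr):
    P = sorted(P, key=lambda t: t[attr])
    Q = sorted(Q, key=lambda t: t[attr])
    result, i, j = [], 0, 0
    while i < len(P) and j < len(Q):
        if P[i][attr] < Q[j][attr]:
            i += 1
        elif P[i][attr] > Q[j][attr]:
            j += 1
        else:
            # gather the block of equal keys on each side, join them pairwise
            key, i2, j2 = P[i][attr], i, j
            while i2 < len(P) and P[i2][attr] == key:
                i2 += 1
            while j2 < len(Q) and Q[j2][attr] == key:
                j2 += 1
            for p in P[i:i2]:
                for q in Q[j:j2]:
                    result.append({**p, **q})
            i, j = i2, j2
    return result

P = [{"a": 1, "x": "p1"}, {"a": 3, "x": "p2"}]
Q = [{"a": 3, "y": "q1"}, {"a": 2, "y": "q2"}]
print(sort_merge_join(P, Q, "a"))   # → [{'a': 3, 'x': 'p2', 'y': 'q1'}]
```

Once both inputs are sorted, the merge makes a single pass over each table, which is what makes this approach attractive for large relations.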
Hash-join Approach
This approach comprises two phases: a partitioning phase and a probing phase. In the partitioning
phase, the tables P and Q are each broken into a set of disjoint partitions. A common hash function
is decided upon and used to assign tuples to partitions. In the probing phase, the
tuples in a partition of P are compared with the tuples of the corresponding partition of Q. If they
match, they are written out.
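A minimal in-memory sketch of the two phases (partition count and table contents invented for illustration); because both tables use the same hash function, matching tuples always land in corresponding partitions:

```python
# Sketch of a partitioned hash join (everything in memory for illustration).
NUM_PARTITIONS = 2

def partition(table, attr):
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for t in table:
        parts[hash(t[attr]) % NUM_PARTITIONS].append(t)
    return parts

def hash_join(P, Q, attr):
    result = []
    for pp, qp in zip(partition(P, attr), partition(Q, attr)):
        # probing phase: build a dict on one partition, probe with the other
        lookup = {}
        for q in qp:
            lookup.setdefault(q[attr], []).append(q)
        for p in pp:
            for q in lookup.get(p[attr], []):
                result.append({**p, **q})
    return result

P = [{"a": 1, "x": "p1"}, {"a": 2, "x": "p2"}]
Q = [{"a": 2, "y": "q1"}]
print(hash_join(P, Q, "a"))   # → [{'a': 2, 'x': 'p2', 'y': 'q1'}]
```

In a real system each partition is sized to fit in memory, so every tuple of each table is read from disk only a small, constant number of times.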
What is heuristic optimization in DBMS?
Cost-based optimization is expensive. Heuristics are used to reduce the number of choices that
must be made in a cost-based approach.
Rules
Heuristic optimization transforms the expression-tree by using a set of rules which improve the
performance. These rules are as follows −
Perform selection operations as early as possible in the query. Applying selections first
decreases the number of records the rest of the query has to process, rather than
carrying entire tables through the query.
Perform projections as early as achievable in the query. Somewhat like selection,
this method helps by decreasing the number of columns in intermediate results.
Perform the most restrictive selection and join operations first, that is, those that
produce relations with relatively fewer records. Obviously any query will execute
better when tables with few records are joined.
Some systems use only heuristics and the others combine heuristics with partial cost-based
optimization.
Steps in heuristic optimization
Let’s see the steps involved in heuristic optimization, which are explained below −
Deconstruct the conjunctive selections into a sequence of single selection operations.
Move the selection operations down the query tree for the earliest possible execution.
First execute those selection and join operations that will produce the smallest relations.
Replace a Cartesian product operation followed by a selection operation with a join
operation.
Deconstruct projection operations and move them down the query tree as far as possible.
Identify those subtrees whose operations can be pipelined.
Using Selectivity and Cost Estimates in Query Optimization
A query optimizer does not depend solely on heuristic rules; it also estimates and compares the
costs of executing a query using different execution strategies and algorithms, and it then
chooses the strategy with the lowest cost estimate. For this approach to work, accurate cost
estimates are required so that different strategies can be compared fairly and realistically. In
addition, the optimizer must limit the number of execution strategies to be considered; otherwise,
too much time will be spent making cost estimates for the many possible execution strategies.
Hence, this approach is more suitable for compiled queries where the optimization is done at
compile time and the resulting execution strategy code is stored and executed directly at runtime.
For interpreted queries, where the entire process shown in Figure 19.1 occurs at runtime, a full-
scale optimization may slow down the response time. A more elaborate optimization is indicated
for compiled queries, whereas a partial, less time-consuming optimization works best for
interpreted queries.
This approach is generally referred to as cost-based query optimization. It uses traditional
optimization techniques that search the solution space to a problem for a solution that minimizes
an objective (cost) function. The cost functions used in query optimization are estimates and not
exact cost functions, so the optimization may select a query execution strategy that is not the
optimal (absolute best) one. In Section 19.8.1 we discuss the components of query execution
cost. In Section 19.8.2 we discuss the type of information needed in cost functions. This
information is kept in the DBMS catalog. In Section 19.8.3 we give examples of cost functions
for the SELECT operation, and in Section 19.8.4 we discuss cost functions for two-way JOIN
operations. Section 19.8.5 discusses multiway joins, and Section 19.8.6 gives an example.
Semantic Query Optimization
A different approach to query optimization, called semantic query optimization, has been
suggested. This technique, which may be used in combination with the techniques discussed
previously, uses constraints specified on the database schema—such as unique attributes and
other more complex constraints—in order to modify one query into another query that is more
efficient to execute. We will not discuss this approach in detail but we will illustrate it with a
simple example. Consider the SQL query:
SELECT E.Lname, M.Lname
FROM EMPLOYEE AS E, EMPLOYEE AS M
WHERE E.Super_ssn=M.Ssn AND E.Salary > M.Salary
This query retrieves the names of employees who earn more than their supervisors. Suppose that
we had a constraint on the database schema that stated that no employee can earn more than his
or her direct supervisor. If the semantic query optimizer checks for the existence of this
constraint, it does not need to execute the query at all because it knows that the result of the
query will be empty. This may save considerable time if the constraint checking can be done
efficiently. However, searching through many constraints to find those that are applicable to a
given query and that may semantically optimize it can also be quite time-consuming. With the
inclusion of active rules and additional metadata in database systems (see Chapter 26), semantic
query optimization techniques are being gradually incorporated into the DBMSs.
Concurrency Control Using Locks in DBMS
Locks are an integral part of maintaining concurrency control in a DBMS. A transaction in any system
implementing lock-based concurrency control cannot read or write a data item until it has
acquired the required locks.
There are two types of locks in lock-based protocols. These are:
Binary Locks - These can be in only one of two states, locked or unlocked.
Shared/Exclusive Locks - A shared lock is acquired when only a read operation is to be
performed. Shared locks can be held by multiple transactions simultaneously, as no data
is being altered. An exclusive lock is acquired when a write operation is performed. Only the
transaction holding the exclusive lock is allowed to make changes to the data value.
The different locking protocols are −
Simplistic Lock Protocol
A lock is obtained by the transaction on the data value before the write operation is performed.
After the write operation, the lock can be released. An example of Simplistic Lock Protocol is:
T1              T2
R(A)
                R(A)
Lock(B)
R(B)
W(B)
Unlock(B)
                Lock(C)
                R(C)
                W(C)
                Unlock(C)
Commit
                Commit
There are two transactions, T1 and T2, shown above. No locks are required for the read
operations, but before each write operation the transaction acquires a lock and releases it
afterwards.
Two-Phase Locking Protocol
The two-phase locking protocol has two phases, namely the growing and shrinking phase. The
transaction can only acquire locks when it is in the growing phase. When it enters the shrinking
phase, it can release the previously acquired locks but cannot acquire new locks. The exclusive
locks are represented by X and the shared locks are represented by S. An example of Two phase
locking protocol is −
T1              T2
S(A)
R(A)
                S(A)
                R(A)
X(B)
R(B)
W(B)
                X(C)
                R(C)
                W(C)
                Unlock(C)
Unlock(A)
Unlock(B)
                Unlock(A)
Commit
                Commit
In the above example, T1 and T2 share the variable A using a shared lock as only read operation
is performed on A. T1 acquires an exclusive lock on B for the write operation and releases it
soon after. T2 does the same with C.
Strict Two-Phase Locking Protocol
Strict two-phase locking protocol is similar to the two-phase locking protocol. The only difference is
that in strict 2PL, all the exclusive locks acquired by a transaction must be held until
the transaction either commits or aborts. An example of the strict two-phase locking protocol is:
T1              T2
S(A)
R(A)
                S(A)
                R(A)
X(B)
R(B)
W(B)
                X(C)
                R(C)
                W(C)
Unlock(A)
                Unlock(A)
Commit
Unlock(B)
                Commit
                Unlock(C)
In the above example, T1 and T2 share the variable A using a shared lock as only read operation
is performed on A. T1 acquires an exclusive lock on B for the write operation and T2 does the
same with C. The exclusive locks are released only after the transactions have committed.
However, there is no such bound for the shared locks.
Rigorous two-phase locking protocol
Rigorous two-phase locking protocol is an extension of the two-phase locking protocol and the
strict two-phase locking protocol. Here, all the locks held by a transaction, whether shared or
exclusive, are released only once the transaction commits or aborts. An example of the rigorous
two-phase locking protocol is:
T1              T2
S(A)
R(A)
                S(A)
                R(A)
X(B)
R(B)
W(B)
                X(C)
                R(C)
                W(C)
Commit
Unlock(A)
Unlock(B)
                Commit
                Unlock(A)
                Unlock(C)
In the above example, T1 and T2 share the variable A using a shared lock as only read operation
is performed on A. T1 acquires an exclusive lock on B for the write operation and T2 does the
same with C. Both the shared locks and the exclusive locks are only released after the
transactions have committed.