Basic concepts of query processing;
Query processing is the set of activities a database management system
(DBMS) performs to translate a user's high-level query (like a SQL statement)
into a series of low-level, executable operations. The main goal is to execute
the query as efficiently as possible to minimize the time and resources needed
to retrieve the data.
This process is generally broken down into three main phases:
1. Parsing and Translation
This is the first step where the DBMS analyzes the query. It's essentially a
"compilation" phase.
• Lexical Analysis: The query string is broken down into meaningful
components or "tokens," such as keywords (e.g., SELECT, FROM,
WHERE), attribute names, and values.
• Syntactic Analysis: The system checks the query's syntax to ensure it
follows the rules of the language (e.g., SQL). If there's a typo, like
"SELCT," an error is reported.
• Semantic Analysis: The DBMS verifies that the query is logically sound. It
checks the database's catalog to ensure that all tables and columns
mentioned actually exist and that the user has the necessary
permissions.
• Translation: The validated query is then converted into an internal
representation, often a relational algebra expression or a query tree,
which is a more structured format the system can work with.
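To make the lexical-analysis step concrete, here is a minimal sketch of a toy tokenizer for a SQL-like string. The token categories and the regular expression are simplifications for illustration, not how any production parser is organized.
Python
import re

# Minimal lexical analysis: split a SQL-like string into keyword,
# identifier, literal, and symbol tokens. Real scanners handle far more.
KEYWORDS = {"SELECT", "FROM", "WHERE"}
TOKEN_RE = re.compile(r"\s*(?:(\d+)|(\w+)|(>=|<=|<>|[=<>*,;()]))")

def tokenize(query):
    tokens, pos = [], 0
    while pos < len(query):
        match = TOKEN_RE.match(query, pos)
        if not match:
            raise SyntaxError(f"unexpected character at position {pos}")
        number, word, symbol = match.groups()
        if number:
            tokens.append(("literal", int(number)))
        elif word:
            kind = "keyword" if word.upper() in KEYWORDS else "identifier"
            tokens.append((kind, word))
        else:
            tokens.append(("symbol", symbol))
        pos = match.end()
    return tokens

print(tokenize("SELECT Fname FROM Employee WHERE Salary > 50000;"))
# [('keyword', 'SELECT'), ('identifier', 'Fname'), ('keyword', 'FROM'), ...]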
2. Optimization
This is the most critical phase for performance. The optimizer takes the internal
representation of the query and generates multiple possible execution plans to
retrieve the same data. It then estimates the cost of each plan and chooses the
one with the lowest cost.
• Generating Plans: The optimizer considers different ways to execute the
query. For example, it might decide to use an index to find specific rows,
or perform a full table scan. If the query involves multiple tables, it will
consider different join orders and join algorithms (e.g., nested-loop join,
hash join).
• Cost Estimation: The cost of each plan is calculated based on factors like
the number of disk I/O operations, CPU usage, and estimated number of
rows to be processed. This is where the database's statistics on tables
and indexes are crucial.
• Selecting the Best Plan: The optimizer chooses the most efficient plan,
which is often referred to as the query execution plan.
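The sketch below illustrates the kind of back-of-the-envelope comparison a cost-based optimizer makes between a full table scan and an index-based plan. The statistics (page count, index height, selectivity) are invented, and the cost formulas are the simplified I/O-only versions used in textbooks.
Python
# A simplified, I/O-only cost comparison using invented statistics.
def full_scan_cost(table_pages):
    # Read every page of the table once.
    return table_pages

def index_scan_cost(index_height, matching_rows):
    # Traverse the B+-tree, then fetch (at worst) one page per matching row
    # for a non-clustering index.
    return index_height + matching_rows

table_pages = 10_000      # pages in the Employee table (assumed)
index_height = 3          # levels in the B+-tree index (assumed)
total_rows = 500_000      # rows in the table (assumed)
selectivity = 0.001       # estimated fraction of rows with Salary > 50000

matching = int(total_rows * selectivity)                        # 500 rows
print("full scan :", full_scan_cost(table_pages))               # 10000 page reads
print("index scan:", index_scan_cost(index_height, matching))   # 503 page reads
With a much less selective predicate, the estimated index cost would exceed the full-scan cost, which is why up-to-date statistics matter so much to the optimizer.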
3. Evaluation
In this final phase, the chosen execution plan is executed. The query execution
engine takes the plan and runs the necessary operations to produce the final
result.
• Execution: The engine carries out the primitive operations defined in the
plan, such as reading data from disk, applying filters, performing joins,
sorting, and aggregating.
• Result Retrieval: The final result set is assembled and returned to the
user or application that submitted the query.
Converting SQL queries into Relational Algebra;
Converting a SQL query into a Relational Algebra expression is a crucial step in
database query processing. Relational Algebra is a procedural query language,
which means it describes how to get a result, while SQL is a declarative
language, which describes what to get. The database management system's
(DBMS) query optimizer translates the declarative SQL into a procedural
Relational Algebra expression to generate an efficient execution plan.
The core idea is to map the clauses of a SQL query to the fundamental
operators of relational algebra.
Key Mappings Between SQL and Relational Algebra
• FROM Clause → Cartesian Product (×) or Join (⋈): The FROM clause,
which specifies the tables, is typically translated into a Cartesian Product
of those tables. This combines every row from the first table with every
row from the second table. However, since a Cartesian product can be
inefficient, the optimizer often converts this into a Join operation (⋈),
which is a more direct way of combining related tuples from different
tables.
• WHERE Clause → Selection (σ): The WHERE clause's filtering condition
is represented by the Selection operator (σ). This unary operator is
applied to a relation and filters out the rows (tuples) that do not satisfy
the specified condition. For example, WHERE Salary > 50000 becomes
σSalary>50000.
• SELECT Clause → Projection (π): The SELECT clause, which lists the
columns to be returned, is translated into the Projection operator (π).
This unary operator is used to select specific columns (attributes) from a
relation, effectively creating a new relation with a vertical subset of the
original table. For example, SELECT Name, Age becomes
πName, Age.
• GROUP BY and Aggregate Functions (e.g., COUNT, SUM) →
Aggregation (γ): Queries with GROUP BY and aggregate functions are
translated using the Aggregation operator (γ). This operator
partitions the tuples into groups based on the GROUP BY clause and then
applies aggregate functions to each group.
A Simple Example
Consider the following SQL query:
SQL
SELECT Fname, Lname
FROM Employee
WHERE Salary > 50000;
This query can be converted into the following Relational Algebra expression:
πFname, Lname(σSalary>50000(Employee))
This expression shows the order of operations:
1. Selection (σ): Filter the Employee table to keep only the rows where
Salary is greater than 50,000.
2. Projection (π): Take the result of the selection and project it onto the
Fname and Lname columns.
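The same expression can be read as two nested function applications. The sketch below models relations as Python lists of dictionaries (purely for illustration) and applies selection and then projection in that order.
Python
# Relations modeled as lists of dicts; selection and projection as functions.
employee = [
    {"Fname": "Ana", "Lname": "Diaz",  "Salary": 60000},
    {"Fname": "Raj", "Lname": "Patel", "Salary": 45000},
]

def select(relation, predicate):          # sigma
    return [t for t in relation if predicate(t)]

def project(relation, attributes):        # pi
    # (duplicate elimination, required by pure relational algebra, is omitted)
    return [{a: t[a] for a in attributes} for t in relation]

# pi_{Fname, Lname}(sigma_{Salary > 50000}(Employee))
result = project(select(employee, lambda t: t["Salary"] > 50000),
                 ["Fname", "Lname"])
print(result)   # [{'Fname': 'Ana', 'Lname': 'Diaz'}]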
Basic Algorithms for executing query operations;
Selection (WHERE Clause)
For a SELECT statement with a WHERE clause, the database system must find
all tuples that satisfy a given condition.
• File Scan: The most basic approach is a full table scan or file scan. The
system reads every single tuple from the table and checks if it meets the
condition. This is simple but can be very slow for large tables.
• Index-based Selection: A much more efficient method is to use an index.
If a WHERE clause condition involves an indexed column, the system can
use the index (like a book's index) to quickly locate the relevant tuples
without scanning the entire table.
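A small sketch of the two approaches, using an in-memory list as the table and a plain Python dictionary as a stand-in for a real disk-based index:
Python
# Full scan: examine every tuple. Index-based: jump straight to candidates.
employees = [
    {"id": 1, "dept": "Sales", "salary": 60000},
    {"id": 2, "dept": "HR",    "salary": 45000},
    {"id": 3, "dept": "Sales", "salary": 52000},
]

def file_scan(table, predicate):
    return [row for row in table if predicate(row)]      # touches every row

# Build a hash index on dept (stands in for a real disk-based index).
dept_index = {}
for row in employees:
    dept_index.setdefault(row["dept"], []).append(row)

def index_lookup(index, key):
    return index.get(key, [])                            # touches only matches

print(file_scan(employees, lambda r: r["dept"] == "Sales"))
print(index_lookup(dept_index, "Sales"))   # same result, fewer rows examined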
Projection (SELECT Clause)
The projection operation (the SELECT clause) extracts specific columns from a
relation.
• Tuple-by-tuple: This is the simplest algorithm. The system reads each
tuple, creates a new tuple containing only the desired columns, and
writes it to the output.
• Columnar storage: In some modern databases, data is stored in columns
instead of rows. This makes projection very efficient, as the system only
needs to read the relevant columns from the disk, ignoring the others
entirely.
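The following sketch contrasts the two layouts on a toy data set; in the columnar layout, projecting two attributes touches only those two lists.
Python
# Row store: one record per row. Column store: one list per attribute.
row_store = [
    ("Ana", 34, "Sales"),
    ("Raj", 29, "HR"),
]
column_store = {
    "name": ["Ana", "Raj"],
    "age":  [34, 29],
    "dept": ["Sales", "HR"],
}

# Projection on (name, age):
# the row store must read every full record and strip the unwanted field ...
row_projection = [(name, age) for (name, age, _dept) in row_store]
# ... while the column store reads only the two requested columns.
col_projection = list(zip(column_store["name"], column_store["age"]))

print(row_projection)   # [('Ana', 34), ('Raj', 29)]
print(col_projection)   # [('Ana', 34), ('Raj', 29)]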
Join Operation
The join operation combines data from two or more tables based on a join
condition. This is often the most expensive operation and is why a query
optimizer is so important.
• Nested Loop Join: This is the most basic join algorithm. For each tuple in
the outer relation (the first table), the algorithm iterates through every
tuple in the inner relation (the second table) and checks if the join
condition is met.
o Cost: High, especially for large tables, because it requires a full
scan of the inner relation for every tuple of the outer one.
• Sort-Merge Join: This algorithm works well for large inputs, especially
when they are already sorted on the join attributes or when sorted
output is useful later in the plan.
1. Sort Phase: The algorithm first sorts both relations on their
respective join attributes.
2. Merge Phase: It then scans both sorted relations simultaneously.
Pointers are used to find matching tuples, and the results are
combined.
• Hash Join: This is often the fastest join algorithm for large, unsorted
relations with an equi-join condition.
1. Build Phase: The algorithm builds a hash table on the smaller of
the two relations, using the join attribute as the hash key.
2. Probe Phase: It then scans the larger relation, and for each tuple,
it uses the hash function to look for matching tuples in the hash
table.
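Below are minimal in-memory sketches of the nested-loop and hash join algorithms over lists of dictionaries. Disk access, partitioning, and memory spills, which dominate real implementations, are ignored.
Python
# In-memory sketches of two join algorithms on lists of dicts.
employees = [{"emp": "Ana", "dept_id": 1}, {"emp": "Raj", "dept_id": 2}]
departments = [{"dept_id": 1, "dept": "Sales"}, {"dept_id": 2, "dept": "HR"}]

def nested_loop_join(outer, inner, key):
    # For every outer tuple, scan the whole inner relation.
    return [{**o, **i} for o in outer for i in inner if o[key] == i[key]]

def hash_join(build, probe, key):
    # Build phase: hash the smaller relation on the join key.
    table = {}
    for row in build:
        table.setdefault(row[key], []).append(row)
    # Probe phase: one pass over the larger relation.
    return [{**p, **b} for p in probe for b in table.get(p[key], [])]

print(nested_loop_join(employees, departments, "dept_id"))
print(hash_join(departments, employees, "dept_id"))
# Both print: [{'emp': 'Ana', 'dept_id': 1, 'dept': 'Sales'},
#              {'emp': 'Raj', 'dept_id': 2, 'dept': 'HR'}]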
Aggregation (GROUP BY)
Aggregation functions like COUNT, SUM, AVG, MIN, and MAX summarize data.
When used with a GROUP BY clause, the system must first group the data
before calculating the aggregate value for each group.
• Sorting-based Aggregation: The relation is first sorted on the grouping
attributes. Then, a single scan is performed, and aggregate values are
calculated for each group of consecutive, identical values.
• Hashing-based Aggregation: A hash table is built in memory, with the
grouping attributes as keys and the aggregate values as values. The
system scans the relation and updates the aggregate value for each
group found in the hash table. This is often faster than sorting if the hash
table fits in memory.
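A sketch of hashing-based aggregation for a query like SELECT dept, AVG(salary) ... GROUP BY dept, keeping one running sum and count per group (and assuming, as above, that the hash table fits in memory):
Python
# Hashing-based aggregation: one scan, one accumulator per group.
rows = [
    {"dept": "Sales", "salary": 60000},
    {"dept": "HR",    "salary": 45000},
    {"dept": "Sales", "salary": 52000},
]

groups = {}                              # dept -> [running_sum, running_count]
for row in rows:                         # single scan of the relation
    acc = groups.setdefault(row["dept"], [0, 0])
    acc[0] += row["salary"]
    acc[1] += 1

averages = {dept: s / c for dept, (s, c) in groups.items()}
print(averages)                          # {'Sales': 56000.0, 'HR': 45000.0}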
Query tree and query graph;
Both a query tree and a query graph are internal representations of a query
used by a database system during the optimization and execution phases.
However, they differ in how they model the query and the information they
convey about the execution plan.
Query Tree
A query tree is a tree data structure that represents a specific Relational
Algebra expression. It shows a hierarchical, procedural plan for executing a
query.
• Structure:
o Leaf nodes: Represent the input relations (tables) from the FROM
clause.
o Internal nodes: Represent the relational algebra operators (like σ
for Selection, π for Projection, and ⋈ for Join) that will be applied
to the relations.
o Edges: Show the flow of data and the order of operations, with
the root node representing the final result.
• Key Characteristic: A query tree explicitly defines a specific order of
operations. This makes it a direct representation of a query execution
plan. For a single SQL query, a database optimizer may generate many
different, equivalent query trees to find the most efficient one.
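One way to picture a query tree in code is as nested (operator, argument, children) nodes; the sketch below builds the tree for the earlier πFname, Lname(σSalary>50000(Employee)) example. This representation is purely illustrative, not any particular system's internal format.
Python
# A query tree for pi_{Fname,Lname}(sigma_{Salary>50000}(Employee)),
# represented as nested (operator, argument, children) tuples.
query_tree = ("PROJECT", ["Fname", "Lname"], [
    ("SELECT", "Salary > 50000", [
        ("TABLE", "Employee", []),
    ]),
])

def show(node, depth=0):
    op, arg, children = node
    print("  " * depth + f"{op} {arg}")
    for child in children:
        show(child, depth + 1)

show(query_tree)
# PROJECT ['Fname', 'Lname']
#   SELECT Salary > 50000
#     TABLE Employee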
Query Graph
A query graph is a graph data structure (which can have cycles) that represents
a query's Relational Calculus or declarative form. It represents the
relationships and conditions in a query without specifying an execution order.
• Structure:
o Nodes: Represent the relations (tables) involved in the query.
o Edges: Represent the join conditions and selection predicates
(from the WHERE clause) that connect the relations. Attributes to
be retrieved are sometimes shown within the nodes.
• Key Characteristic: A query graph is a non-procedural representation. It
shows all the constraints and relationships but leaves the ordering of
operations up to the query optimizer. There is usually only one canonical
query graph for a given query, from which multiple query trees can be
derived.
Heuristic optimization of query tree, functional dependencies, normal forms;
Heuristic Optimization of Query Tree
Heuristic optimization is a rule-based approach to improving a query's
execution plan. The primary goal is to transform an initial, unoptimized query
tree into a more efficient, equivalent one. This is done by applying a set of rules
that, while not guaranteeing the absolute best plan, significantly reduce the
search space and lead to a good plan. The most important heuristic rule is to
perform selections and projections as early as possible to minimize the size of
intermediate results.
Key Heuristic Rules:
• Push Selections Down: Move the selection (σ) operation as close to the
leaf nodes (tables) as possible. This filters out a large number of rows
early, reducing the data that subsequent operations (like joins) have to
process.
• Push Projections Down: Similarly, move the projection (π) operation
down the tree. This reduces the number of attributes in intermediate
results, saving memory and disk I/O.
• Combine Operations: Replace a Cartesian product followed by a
selection with a more efficient join operation (⋈), as joins are generally
faster than performing a full Cartesian product.
• Reorder Joins: Reorder the join operations to process the most
restrictive joins first. A join that produces a smaller intermediate result is
more efficient, as it reduces the input size for the next join.
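As an illustration of the first rule, the sketch below rewrites a tree in the (operator, argument, children) format used earlier, moving a selection that refers only to one join input below the join. A real optimizer would inspect which attributes the predicate mentions; here, for simplicity, the caller states which side it applies to.
Python
# Heuristic rewrite: push a selection that refers only to one join input
# below the join. Trees use the (operator, argument, children) format above.
before = ("SELECT", "E.Salary > 50000", [
    ("JOIN", "E.DeptId = D.DeptId", [
        ("TABLE", "Employee E", []),
        ("TABLE", "Department D", []),
    ]),
])

def push_selection_down(tree, applies_to_left):
    # Simplified: the caller says which join input the predicate refers to.
    _op, pred, (join,) = tree
    join_op, join_cond, (left, right) = join
    if applies_to_left:
        left = ("SELECT", pred, [left])
    else:
        right = ("SELECT", pred, [right])
    return (join_op, join_cond, [left, right])

after = push_selection_down(before, applies_to_left=True)
# after: JOIN( SELECT_{E.Salary>50000}(Employee E), Department D )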
Functional Dependencies
A functional dependency (FD) is a constraint between two sets of attributes in
a relational database. It states that the value of one set of attributes (the
determinant) uniquely determines the value of another set of attributes (the
dependent). It's written as X→Y, which means that for any two tuples, if they
have the same value for attribute set X, they must also have the same value for
attribute set Y.
• Trivial Functional Dependency: A dependency X→Y is trivial if Y is a
subset of X. For example, {SSN, EmployeeID} → {SSN} is trivial.
• Transitive Dependency: A dependency X→Z is transitive if there is an
intermediate attribute set Y such that X→Y and Y→Z, and Y does not
determine X.
• Partial Dependency: A dependency X→Y is partial if X is a composite key
and Y is determined by a proper subset of X. For example, if {StudentID,
CourseID} is a key and {StudentID} → {StudentName} holds, this is a
partial dependency.
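The sketch below checks whether a dependency X→Y holds in a concrete table instance (a list of dictionaries with invented data): any two rows that agree on X must also agree on Y. Note that this only tests one instance; a declared FD is a constraint on all possible instances.
Python
# Check whether X -> Y holds in a table given as a list of dicts:
# tuples that agree on X must also agree on Y.
def fd_holds(rows, X, Y):
    seen = {}
    for row in rows:
        x_val = tuple(row[a] for a in X)
        y_val = tuple(row[a] for a in Y)
        if x_val in seen and seen[x_val] != y_val:
            return False
        seen[x_val] = y_val
    return True

employees = [
    {"SSN": "111", "EmployeeID": "E1", "Dept": "Sales"},
    {"SSN": "222", "EmployeeID": "E2", "Dept": "Sales"},
]
print(fd_holds(employees, ["SSN"], ["Dept"]))   # True in this instance
print(fd_holds(employees, ["Dept"], ["SSN"]))   # False: same Dept, different SSN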
Functional dependencies are the theoretical basis for database normalization.
Normalization is a process for organizing data in a database to reduce
redundancy and improve data integrity. It involves a series of rules, called
normal forms, that help you design a logical schema. The main goal is to break
down a large table into smaller, well-structured tables and establish
relationships between them.
The process helps prevent three types of data anomalies:
• Insertion Anomaly: Certain facts cannot be recorded without also
supplying unrelated data (for example, a new department cannot be
added until it has at least one employee).
• Deletion Anomaly: Deleting a record unintentionally removes other,
unrelated facts stored in the same row.
• Update Anomaly: The same fact is stored in multiple places, so an
update must change every copy or the data becomes inconsistent.
The Most Common Normal Forms
First Normal Form (1NF)
A table is in 1NF if all its attribute values are atomic, meaning each cell contains
only a single value. It also eliminates repeating groups of data.
• Rule: Eliminate repeating groups and ensure each column has a single,
non-divisible value.
Second Normal Form (2NF)
A table is in 2NF if it's in 1NF and all non-key attributes are fully functionally
dependent on the primary key. This eliminates partial dependencies, where a
non-key attribute is dependent on only part of a composite primary key.
• Rule: Be in 1NF and remove partial dependencies.
Third Normal Form (3NF)
A table is in 3NF if it's in 2NF and has no transitive dependencies. A transitive
dependency occurs when a non-key attribute is determined by another non-
key attribute.
• Rule: Be in 2NF and remove transitive dependencies.
Denormalization
While normalization is crucial for data integrity, it can sometimes reduce query
performance because it requires complex joins to combine data from multiple
tables. Denormalization is a technique used in performance-critical
applications (like data warehousing) to intentionally add some controlled
redundancy to a previously normalized database. The goal is to speed up read
operations by reducing the need for costly joins, at the expense of potentially
slower write operations and increased data redundancy. It's a trade-off that
database designers make to optimize for specific use cases.
Boyce-Codd Normal Form (BCNF)
Boyce-Codd Normal Form (BCNF) is a stricter version of Third Normal Form
(3NF) that addresses a specific type of anomaly not handled by 3NF. A table is
in BCNF if and only if, for every non-trivial functional dependency X→Y, X is a
superkey.
Key Differences from 3NF
While 3NF ensures that non-key attributes are not transitively dependent on
the primary key, it does not rule out a subtler case: a dependency whose
determinant is not a superkey but whose dependent is a prime attribute (part
of a candidate key). This is where BCNF comes in. A table can be in 3NF but
not in BCNF if:
• The table has two or more overlapping candidate keys.
• There is a functional dependency where a determinant is not a superkey,
and the dependent is part of another candidate key.
In simple terms, 3NF focuses on dependencies involving non-key attributes,
while BCNF applies to all functional dependencies.
Example
Consider a table Student_Course_Advisor with attributes {Student, Course,
Advisor}. The functional dependencies are:
1. {Student, Course} → Advisor (a student enrolled in a course has a single
advisor)
2. Advisor → Student (each advisor advises only one student)
From these dependencies, the candidate keys are {Student, Course} and
{Advisor, Course}.
This table is in 3NF because every attribute is prime (each belongs to some
candidate key), so no 3NF violation is possible. However, it is not in BCNF
because the dependency Advisor → Student has Advisor as the determinant,
and Advisor is not a superkey on its own; it is only part of a candidate key. To
normalize the table into BCNF, we decompose it into
two:
• Student_Advisor ({Student, Advisor})
• Advisor_Course ({Advisor, Course})
In Student_Advisor, the determinant of Advisor → Student (namely Advisor) is
a key of that table, so both resulting tables are in BCNF and the anomaly is
eliminated. The trade-off is that the dependency {Student, Course} → Advisor
can no longer be enforced within a single table, a known side effect of BCNF
decomposition.
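A standard way to verify claims like these is the attribute-closure algorithm: X is a superkey exactly when its closure X+ contains every attribute of the relation. The sketch below applies it to the example's dependencies.
Python
# Attribute closure X+ under a set of FDs; X is a superkey iff X+ covers
# all attributes of the relation.
def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

R = {"Student", "Course", "Advisor"}
fds = [({"Student", "Course"}, {"Advisor"}),
       ({"Advisor"}, {"Student"})]

print(sorted(closure({"Advisor"}, fds)))
# ['Advisor', 'Student'] -- not all of R, so Advisor is not a superkey:
# the BCNF condition is violated
print(sorted(closure({"Student", "Course"}, fds)))
# ['Advisor', 'Course', 'Student'] -- covers R, so {Student, Course} is a key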