DBMS Unit 1 13m
The goal of the three schema architecture is to separate the user applications and the physical
database. The schemas can be defined at the following levels:
1. The internal level – has an internal schema which describes the physical storage
structure of the database. Uses a physical data model and describes the complete details
of data storage and access paths for the database.
2. The conceptual level – has a conceptual schema which describes the structure of the
database for users. It hides the details of the physical storage structures, and
concentrates on describing entities, data types, relationships, user operations and
constraints. Usually a representational data model is used to describe the conceptual
schema.
3. The External or View level – includes external schemas or user views. Each external
schema describes the part of the database that a particular user group is interested in and
hides the rest of the database from that user group. Usually represented using a
representational data model.
The three schema architecture is used to visualize the schema levels in a database. The three
schemas are only descriptions of data; the data itself actually exists only at the physical level.
2. With the help of a neat block diagram explain the basic architecture of a
database management system? (NOV 2015) ( or)
Briefly explain about Database System Architecture? (JUN 2016) (or)
Explain the overall architecture of the database system in detail? (MAY 2017) (OR)
State and explain the architecture of DBMS? (NOV 2017, Nov 2019)
The architecture of a database system is greatly influenced by the underlying computer system on which the
database system runs. Database systems can be centralized, or client-server, where one server machine executes
work on behalf of multiple client machines. Database systems can also be designed to exploit parallel computer
architectures. Distributed databases span multiple geographically separated machines.
A database system is partitioned into modules that deal with each of the responsibilities of the overall system. The
functional components of a database system can be broadly divided into the storage manager and the query
processor components. The storage manager is important because databases typically require a large amount of
storage space. The query processor is important because it helps the database system simplify and facilitate access
to data.
It is the job of the database system to translate updates and queries written in a nonprocedural language, at the logical
level, into an efficient sequence of operations at the physical level.
Database applications are usually partitioned into two or three parts, as in Figure 1.4. In a two-tier architecture, the
application resides at the client machine, where it invokes database system functionality at the server machine
through query language statements. Application program interface standards like ODBC and JDBC are used for
interaction between the client and the server. In contrast, in a three-tier architecture, the client machine acts as
merely a front end and does not contain any direct database calls. Instead, the client end communicates with an
application server, usually through a forms interface.
The application server in turn communicates with a database system to access data. The business logic of the
application, which says what actions to carry out under what conditions, is embedded in the application server, instead
of being distributed across multiple clients. Three-tier applications are more appropriate for large applications,
and for applications that run on the World Wide Web.
Figure 1.4: Two-tier and three-tier architectures.
Query Processor:
The query processor components include
· DDL interpreter, which interprets DDL statements and records the definitions in the data dictionary.
· DML compiler, which translates DML statements in a query language into an evaluation plan consisting of
low-level instructions that the query evaluation engine understands.
A query can usually be translated into any of a number of alternative evaluation plans that all give the same
result. The DML compiler also performs query optimization, that is, it picks the lowest cost evaluation plan
from among the alternatives.
· Query evaluation engine, which executes low-level instructions generated by the DML compiler.
Storage Manager:
A storage manager is a program module that provides the interface between the low-level data stored in the database
and the application programs and queries submitted to the system. The storage manager is responsible for the
interaction with the file manager. The raw data are stored on the disk using the file system, which is usually provided
by a conventional operating system. The storage manager translates the various DML statements into low-level file-
system commands. Thus, the storage manager is responsible for storing, retrieving, and updating data in the
database.
The storage manager components include:
· Authorization and integrity manager, which tests for the satisfaction of integrity constraints and checks the
authority of users to access data.
· Transaction manager, which ensures that the database remains in a consistent (correct) state despite
system failures, and that concurrent transaction executions proceed without conflicting.
· File manager, which manages the allocation of space on disk storage and the data structures used to
represent information stored on disk.
· Buffer manager, which is responsible for fetching data from disk storage into main memory, and deciding what
data to cache in main memory. The buffer manager is a critical part of the database system, since it enables the
database to handle data sizes that are much larger than the size of main memory.
Transaction Manager:
A transaction is a collection of operations that performs a single logical function in a database application. Each
transaction is a unit of both atomicity and consistency. Thus, we require that transactions do not violate any
database-consistency constraints. That is, if the database was consistent when a transaction started, the database
must be consistent when the transaction successfully terminates. Transaction - manager ensures that the database
remains in a consistent (correct) state despite system failures (e.g., power failures and operating system
crashes) and transaction failures.
COMPONENTS OF DBMS
Database Users
Users are differentiated by the way they expect to interact with the system
• Application programmers
• Sophisticated users
• Naïve users
• Database Administrator
• Specialized users etc,.
Application programmers:
Professionals who write application programs and using these application
programs they interact with the database system
Sophisticated users :
These users interact with the database system without writing programs, but they
submit queries to retrieve the information.
Specialized users:
Who write specialized database applications to interact with the database system.
Naïve users:
Interact with the database system by invoking application programs that have been written
previously by application programmers.
Eg: people accessing the database over the web
Database Administrator:
Coordinates all the activities of the database system; the database administrator has a
good understanding of
the enterprise’s information resources and needs.
Schema definition
Access method definition
Schema and physical organization modification
Granting user authority to access the database
Monitoring performance
Storage Manager
The Storage Manager includes the following components/modules:
Authorization Manager
Transaction Manager
File Manager
Buffer Manager
Storage manager is a program module that provides the interface between the low-
level data stored in the database and the application programs and queries submitted to
the system.
The storage manager is responsible for the following tasks:
interaction with the file manager
efficient storing, retrieving and updating of data
Authorization Manager
Checks whether the user is an authorized person or not
Tests the satisfaction of integrity constraints
Transaction Manager
Responsible for concurrent transaction execution. It ensures that the database remains in
a consistent state despite system failures.
Schema – the overall design of the database is called the schema. The schema at the conceptual or logical
level is called the logical schema.
Instance – the actual content of the database at a particular point in time is called the instance of the
schema
Data Independence
The ability to change a schema at one level without affecting the higher level schemas is called as
data independence. There are two types of data independence:
Physical Data Independence – is the ability to modify the physical or internal schema without
changing the logical or external schemas.
Logical Data Independence – is the ability to modify logical or conceptual schema without
changing the external schemas or application programs.
5. Explain select, project and Cartesian product operations in relational algebra with an
example? (NOV 2016, APR 2018)
Relational Algebra
Relational algebra is a procedural query language, which takes instances of relations as input and yields
instances of relations as output. It uses operators to perform queries. An operator can be either unary or
binary. They accept relations as their input and yield relations as their output. Relational algebra is
performed recursively on a relation and intermediate results are also considered relations.
The fundamental operations of relational algebra are as follows −
Select
Project
Union
Set difference
Cartesian product
Rename
We will discuss all these operations in the following sections.
Select Operation (σ)
It selects tuples that satisfy the given predicate from a relation.
Notation − σp(r)
where σ stands for the selection predicate and r stands for the relation. p is a propositional logic formula which
may use connectives like and, or, and not. These terms may use relational operators like =, ≠, ≥, <, >, ≤.
For example −
σsubject = "database"(Books)
Output − Selects tuples from books where subject is
'database'. σsubject = "database" and price = "450"(Books)
Output − Selects tuples from books where subject is 'database' and 'price' is 450.
σsubject = "database" and price = "450" or year > "2010"(Books)
Output − Selects tuples from books where subject is 'database' and 'price' is 450 or those books
published after 2010.
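For comparison, the same selections can be sketched in SQL (a minimal sketch, assuming a Books table with subject, price and year columns):
SELECT * FROM Books WHERE subject = 'database';
SELECT * FROM Books WHERE subject = 'database' AND price = 450;
SELECT * FROM Books WHERE (subject = 'database' AND price = 450) OR year > 2010;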
Project Operation (∏)
It projects the column(s) listed in the subscript from a relation.
Notation − ∏ A1, A2, …, An (r)
where A1, A2, …, An are attribute names of relation r.
Duplicate rows are automatically eliminated, as relation is a set.
For example −
∏subject, author (Books)
Selects and projects columns named as subject and author from the relation Books.
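The corresponding SQL sketch (DISTINCT mirrors the automatic elimination of duplicate rows in the relational projection):
SELECT DISTINCT subject, author FROM Books;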
Union Operation (∪)
It performs binary union between two given relations and is defined as −
r ∪ s = { t | t ∈ r or t ∈ s}
Notation − r ∪ s
Where r and s are either database relations or relation result set (temporary relation).
For a union operation to be valid, the following conditions must hold −
r, and s must have the same number of attributes.
Attribute domains must be compatible.
Duplicate tuples are automatically eliminated.
∏ author (Books) ∪ ∏ author (Articles)
Output − Projects the names of the authors who have either written a book or an article or both.
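A SQL sketch of the same union (UNION also eliminates duplicates; both tables are assumed to have an author column):
SELECT author FROM Books
UNION
SELECT author FROM Articles;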
Set Difference (−)
The result of set difference operation is tuples, which are present in one relation but are not in the second
relation.
Notation − r − s
Finds all the tuples that are present in r but not in s.
∏ author (Books) − ∏ author (Articles)
Output − Provides the name of authors who have written books but not articles.
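A SQL sketch of the set difference; the keyword is EXCEPT in standard SQL (MINUS in Oracle):
SELECT author FROM Books
EXCEPT
SELECT author FROM Articles;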
Cartesian Product (Χ)
Combines information of two different relations into one.
Notation − r Χ s
Where r and s are relations and their output will be
defined as − r Χ s = { q t | q ∈ r and t ∈ s}
σauthor = 'tutorialspoint'(Books Χ Articles)
Output − Yields a relation, which shows all the books and articles written by tutorialspoint.
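A SQL sketch of the product followed by the selection (assuming the author column belongs to Books):
SELECT *
FROM Books CROSS JOIN Articles
WHERE Books.author = 'tutorialspoint';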
Rename Operation (ρ)
The results of relational algebra are also relations but without any name. The rename operation allows us
to rename the output relation. 'rename' operation is denoted with small Greek letter rho ρ.
Notation − ρ x (E)
where the result of expression E is saved under the name x.
Additional operations are −
Set intersection
Assignment
Natural join
5. Consider the given relation schema . ( MAY 2017, NOV 2018 )
Employee(empno,name,office,age)
Books(isbn,title,authors,publisher)
Loan(empno,isbn,date)
(i) Find the names of all employees who have borrowed a book published by McGraw-Hill.
(ii) Find the names of all employees who have borrowed all books published by McGraw-Hill.
(iii) Find the names of employees who have borrowed more than five different books published
by McGraw-Hill.
(iv) For each publisher, find the names of employees who have borrowed more than five books of
that publisher.
(i) select name from employee e, books b, loan l
    where e.empno = l.empno and l.isbn = b.isbn and b.publisher = 'McGraw-Hill'

(ii) select name from employee e
     join loan l on e.empno = l.empno
     join (select isbn from books where publisher = 'McGraw-Hill') x on l.isbn = x.isbn
     group by e.empno, name
     having count(distinct l.isbn) = (select count(*) from books where publisher = 'McGraw-Hill')

(iii) select name from employee, loan
      where employee.empno = loan.empno
        and isbn in (select isbn from books where publisher = 'McGraw-Hill')
      group by employee.empno, name
      having count(distinct isbn) > 5

(iv) select books.publisher, name from employee, loan, books
     where employee.empno = loan.empno and books.isbn = loan.isbn
     group by books.publisher, employee.empno, name
     having count(distinct loan.isbn) > 5
6. Explain about SQL Fundamentals? (JUN 2016) (OR)
State and Explain DDL, DML, DCL with suitable examples. (NOV 2017), (NOV 2019)
SQL FUNDAMENTALS:
SQL is a standard computer language for accessing and manipulating databases.
What is SQL?
SQL stands for Structured Query Language
SQL allows you to access a database
SQL is an ANSI standard computer language
SQL can execute queries against a database
SQL can retrieve data from a database
SQL can insert new records in a database
SQL can delete records from a database
SQL can update records in a database
SQL is easy to learn
i)DDL
ii)DML
iii)Embedded SQL (NOV 2014)
i) Data Definition Language (DDL):
A database schema is specified by a set of definitions expressed in a data definition language (DDL). The output
of DDL compilation is placed in the data dictionary, which contains metadata, that is, data about data. The data
dictionary is considered to be a special type of table, which can only be accessed and updated by the database system
itself. A database system consults the data dictionary before reading or modifying actual data.
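As an illustrative sketch, typical DDL statements look like the following (using the employee relation from the earlier question; column types are assumed and exact syntax varies slightly between systems):
CREATE TABLE employee (
    empno  INT PRIMARY KEY,
    name   VARCHAR(50),
    office VARCHAR(20),
    age    INT
);
ALTER TABLE employee ADD COLUMN salary DECIMAL(10,2);   -- modify the schema
DROP TABLE employee;                                     -- remove the schema and its data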
ii) Data Manipulation Language:
A data manipulation language (DML) is a language that enables users to access or manipulate data
organized by the appropriate data model. The types of access are:
o Retrieval of information stored in the database
o Insertion of new information into the database
o Deletion of information from the database
o Modification of information stored in the database
There are basically two types:
Procedural DMLs require a user to specify what data are needed and how to get
those data.
Declarative DMLs (also referred to as nonprocedural DMLs) require a user to specify what
data are needed without specifying how to get those data.
A query is a statement requesting the retrieval of information. The portion of a DML that involves
information retrieval is called a query language.
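Each kind of access listed above has a corresponding SQL statement. A minimal sketch (the employee table and the values shown are assumed for illustration):
SELECT name, age FROM employee WHERE office = 'HQ';                               -- retrieval
INSERT INTO employee (empno, name, office, age) VALUES (101, 'Akon', 'HQ', 25);   -- insertion
DELETE FROM employee WHERE empno = 101;                                           -- deletion
UPDATE employee SET age = 26 WHERE empno = 101;                                   -- modification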
Embedded SQL:
An embedded SQL program must be processed by a special preprocessor prior to compilation. The
preprocessor replaces embedded SQL requests with host-language declarations and procedure calls that
allow run-time execution of the database accesses. Then, the resulting program is compiled by the host
language compiler. To identify embedded SQL requests to the preprocessor, we use the EXEC SQL
statement. It has the form
EXEC SQL <embedded SQL statement> END-EXEC
The exact syntax for embedded SQL requests depends on the language in which SQL is embedded.
For instance, a semicolon is used instead of END-EXEC when SQL is embedded in C. The java embedding
of SQL called (SQLJ) uses the syntax
#SQL{<embedded SQL statement>}
The statement SQL INCLUDE is placed in the program to identify the place where the preprocessor
should insert the special variables used for communication between the program and the database system.
Variables of the host language can be used within embedded SQL statements, but they must be preceded by
a colon (:) to distinguish them from SQL variables.
Before executing any SQL statements, the program must first connect to the database. This is done
using
EXEC SQL connect to server user user-name END-EXEC
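A minimal embedded-SQL-in-C sketch (host-variable and column names are assumed; in C a semicolon replaces END-EXEC, as noted above):
EXEC SQL BEGIN DECLARE SECTION;
    int  v_empno;
    char v_name[51];
EXEC SQL END DECLARE SECTION;

/* host variables are prefixed with a colon inside the SQL statement */
EXEC SQL SELECT name INTO :v_name FROM employee WHERE empno = :v_empno;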
ER Model: Relationships
When an Entity is related to another Entity, they are said to have a relationship. For example, a
Class entity is related to a Student entity; because students study in classes, this is a relationship.
Depending upon the number of entities involved, a degree is assigned to relationships.
For example, if 2 entities are involved, it is said to be Binary relationship, if 3 entities are involved, it is
said to be Ternary relationship, and so on.
Working with ER Diagrams
ER Diagram is a visual representation of data that describes how data is related to each other. In ER Model,
we disintegrate data into entities, attributes and setup relationships between entities, all this can be represented
visually using the ER diagram.
Components of ER Diagram
Entity, Attributes, Relationships etc. form the components of an ER Diagram and there are defined symbols
and shapes to represent each one of them.
Let's see how we can represent these in our ER Diagram.
Entity
Simple rectangular box represents an Entity.
Weak Entity
A weak Entity is represented using double rectangular boxes. It is generally connected to another entity.
ER Diagram: Entity
An Entity can be any object, place, person or class. In ER Diagram, an entity is represented using rectangles.
Consider an example of an Organisation- Employee, Manager, Department, Product and many more can be
taken as entities in an Organisation.
One to One Relationship
The above example describes that one student can enroll only for one course and a course will also have only one
student. This is not what you will usually see in real-world relationships.
One to Many Relationship
The below example showcases this relationship, which means that 1 student can opt for many courses, but a
course can only have 1 student. Sounds weird! This is how it is.
Many to Many Relationship
The above diagram represents that one student can enroll for more than one course, and a course can have
more than one student enrolled in it.
ER Diagram: Recursive Relationship
When an Entity is related with itself it is known as Recursive Relationship.
For example, in the diagram above, we have three related entities, Company, Product and Sector. To understand
the relationship better or to define rules around the model, we should relate two entities and then derive the third
one.
A Company produces many Products/ each product is produced by exactly one
company. A Company operates in only one Sector / each sector has many companies
operating in it.
Considering the above two rules or relationships, we see that although the complete relationship involves
three entities, but we are looking at two entities at a time.
The Enhanced ER Model
As the complexity of data increased in the late 1980s, it became more and more difficult to use the traditional
ER Model for database modelling. Hence some improvements or enhancements were made to the existing ER
Model to make it able to handle the complex applications better.
Hence, as part of the Enhanced ER Model, along with other improvements, three new concepts were added to
the existing ER Model, they were:
1. Generalization
2. Specialization
3. Aggregation
Generalization
Generalization is a bottom-up approach in which two lower level entities combine to form a higher level
entity. In generalization, the higher level entity can also combine with other lower level entities to make
further higher level entity.
It's more like Superclass and Subclass system, but the only difference is the approach, which is bottom-up.
Hence, entities are combined to form a more generalised entity, in other words, sub-classes are combined to
form a super-class.
For example, the Saving and Current account entities can be generalised and an entity with the name Account
can be created, which covers both.
Specialization
Specialization is the opposite of Generalization. It is a top-down approach in which one higher level entity can be
broken down into two or more lower level entities. In specialization, it is possible that a higher level entity has no
lower-level entity sets at all.
Aggregation
Aggregation is a process in which a relationship between two entities is treated as a single entity.
In the diagram above, the relationship between Center and Course together is acting as an Entity, which is in a
relationship with another entity, Visitor. Now in the real world, if a Visitor or a Student visits a Coaching Center,
he/she will never enquire about the center only or just about the course; rather, he/she will enquire about
both.
ER Model to Relational Model
ER Model can be represented using ER Diagrams which is a great way of designing and representing the
database design in more of a flow chart form.
It is very convenient to design the database using the ER Model by creating an ER diagram and later on
converting it into relational model to design your tables.
Not all the ER Model constraints and components can be directly transformed into relational model, but
an approximate schema can be derived.
Let us take a few examples of ER diagrams and convert them into relational model schemas, thereby creating tables in an RDBMS.
Entity becomes Table
Entity in ER Model is changed into tables, or we can say for every Entity in ER model, a table is created
in Relational Model.
And the attributes of the Entity get converted to columns of the table.
And the primary key specified for the entity in the ER model, will become the primary key for the table
in relational model.
For example, for the below ER Diagram in ER Model,
A table with name Student will be created in relational model, which will have
4 columns, id, name, age, address and id will be the primary key for this table.
Table:Student
As discussed above, an entity gets mapped to a table; hence we will create a table for Teacher and a table for Student
with all the attributes converted into columns.
Now, an additional table will be created for the relationship, for example StudentTeacher or give it any name
you like. This table will hold the primary key for both Student and Teacher, in a tuple to describe the
relationship, which teacher teaches which student.
If there are additional attributes related to this relationship, then they become the columns for this table, like
subject name.
Also proper foreign key constraints must be set for all the tables.
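A hedged SQL sketch of this mapping (the Student attributes come from the ER example above; the Teacher attributes and all column types are assumptions for illustration):
CREATE TABLE Student (
    id      INT PRIMARY KEY,
    name    VARCHAR(50),
    age     INT,
    address VARCHAR(100)
);

CREATE TABLE Teacher (
    teacher_id INT PRIMARY KEY,
    name       VARCHAR(50)
);

-- relationship table holding the primary keys of both entities,
-- plus any attribute of the relationship itself (e.g. subject_name)
CREATE TABLE StudentTeacher (
    id           INT,
    teacher_id   INT,
    subject_name VARCHAR(50),
    PRIMARY KEY (id, teacher_id),
    FOREIGN KEY (id) REFERENCES Student(id),
    FOREIGN KEY (teacher_id) REFERENCES Teacher(teacher_id)
);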
Functional Dependency
The functional dependency is a relationship that exists between two attributes. It typically exists between
the primary key and non-key attribute within a table.
X → Y
The left side of FD is known as a determinant, the right side of the production is known as a dependent.
For example:
Assume we have an employee table with attributes: Emp_Id, Emp_Name, Emp_Address.
Here the Emp_Id attribute can uniquely identify the Emp_Name attribute of the employee table because if we know
the Emp_Id, we can tell the employee name associated with it.
Functional dependency can be written as:
Emp_Id → Emp_Name
Types of Functional dependency
Anomalies in DBMS
In the table above, we have data of 4 Computer Sci. students. As we can see, data for the fields branch,
hod (Head of Department) and office_tel is repeated for the students who are in the same branch in the college;
this is Data Redundancy.
Insertion Anomaly
Suppose for a new admission, until and unless a student opts for a branch, data of the student cannot be inserted,
or else we will have to set the branch information as NULL.
Also, if we have to insert data of 100 students of same branch, then the branch information will be repeated for
all those 100 students.
These scenarios are nothing but Insertion anomalies.
Updation Anomaly
What if Mr. X leaves the college? or is no longer the HOD of computer science department? In that case all
the student records will have to be updated, and if by mistake we miss any record, it will lead to data
inconsistency. This is Updation anomaly.
Deletion Anomaly
In our Student table, two different kinds of information are kept together: Student information and Branch
information. Hence, at the end of the academic year, if student records are deleted, we will also lose the
branch information. This is the Deletion anomaly.
Normalization Rule
Normalization rules are divided into the following normal forms:
1. First Normal Form
2. Second Normal Form
3. Third Normal Form
4. BCNF
5. Fourth Normal Form
6. Fifth Normal Form
First Normal Form (1NF)
For a table to be in the First Normal Form, it should follow the following 4 rules:
1. It should only have single(atomic) valued attributes/columns.
2. Values stored in a column should be of the same domain
3. All the columns in a table should have unique names.
4. And the order in which data is stored, does not matter.
Rules for First Normal Form
The first normal form expects you to follow a few simple rules while designing your database, and they are:
Rule 1: Single Valued Attributes
Each column of your table should be single valued which means they should not contain multiple values. We
will explain this with help of an example later, let's see the other rules for now.
Rule 2: Attribute Domain should not change
This is more of a "Common Sense" rule. In each column the values stored must be of the same kind or type.
For example: If you have a column dob to save date of births of a set of people, then you cannot or you must
not save 'names' of some of them in that column along with 'date of birth' of others in that column. It should
hold only 'date of birth' for all the records/rows.
Rule 3: Unique name for Attributes/Columns
This rule expects that each column in a table should have a unique name. This is to avoid confusion at the time
of retrieving data or performing any other operation on the stored data.
If one or more columns have same name, then the DBMS system will be left confused.
Rule 4: Order doesn't matter
This rule says that the order in which you store the data in your table doesn't matter.
EXAMPLE
Create a table to store student data which will have student's roll no., their name and the name of subjects they
have opted for.
Here is the table, with some sample data added to it.
The table already satisfies 3 rules out of the 4 rules, as all our column names are unique, we have stored data
in the order we wanted to and we have not inter-mixed different type of data in columns.
But out of the 3 different students in our table, 2 have opted for more than 1 subject. And we have stored the
subject names in a single column. But as per the 1st Normal form each column must contain atomic value.
It's very simple, because all we have to do is break the values into atomic values.
Here is our updated table and it now satisfies the First Normal Form.
roll_no name subject
101 Akon OS
101 Akon CN
103 Ckon Java
102 Bkon C
102 Bkon C++
By doing so, although a few values are getting repeated, the values for the subject column are now atomic for
each record/row. Using the First Normal Form, data redundancy increases, as there will be many columns with the
same data in multiple rows, but each row as a whole will be unique.
Second Normal Form (2NF)
For a table to be in the Second Normal Form,
1. It should be in the First Normal form.
2. And, it should not have Partial Dependency.
Dependency
Let's take an example of a Student table with columns student_id, name, reg_no(registration
number), branch and address(student's home address).
In this table, student_id is the primary key and will be unique for every row; hence we can use student_id to
fetch any row of data from this table.
Even in a case where student names are the same, if we know the student_id we can easily fetch the correct
record.
student_id name reg_no branch address
10 Akon 07-WY CSE Kerala
11 Akon 08-WY IT Gujarat
Hence we can say a Primary Key for a table is the column or a group of columns(composite key) which can
uniquely identify each record in the table.
I can ask for the branch name of the student with student_id 10, and I can get it. Similarly, if I ask for the name of
the student with student_id 10 or 11, I will get it. So all I need is student_id, and every other column depends on it,
or can be fetched using it. This is Dependency and we also call it Functional Dependency.
Partial Dependency
Now that we know what dependency is, we are in a better state to understand what partial dependency is.
For a simple table like Student, a single column like student_id can uniquely identify all the records in a table.
But this is not true all the time. So now let's extend our example to see if more than 1 column together can act
as a primary key.
Let's create another table for Subject, which will have subject_id and subject_name fields and subject_id will
be the primary key.
subject_id   subject_name
1            Java
2            C++
3            Php
Now we have a Student table with student information and another table Subject for storing subject
information.
Let's create another table Score, to store the marks obtained by students in the respective subjects. We will also
be saving name of the teacher who teaches that subject along with marks.
score_id   student_id   subject_id   marks   teacher
1 10 1 70 Java Teacher
2 10 2 75 C++ Teacher
3 11 1 80 Java Teacher
In the score table we are saving the student_id to know which student's marks are these and subject_id to
know for which subject the marks are for.
Together, student_id + subject_id forms a Candidate Key which can be the Primary key.
To get the marks of the student with student_id 10, can you get them from this table? No, because you don't know
for which subject. And if I give you the subject_id, you would not know for which student. Hence we need student_id
+ subject_id to uniquely identify any row.
But where is Partial Dependency?
Now if you look at the Score table, we have a column named teacher which is only dependent on the subject:
for Java it's Java Teacher and for C++ it's C++ Teacher, and so on.
Now as we just discussed that the primary key for this table is a composition of two columns which
is student_id & subject_id but the teacher's name only depends on subject, hence the subject_id, and has
nothing to do with student_id.
This is Partial Dependency, where an attribute in a table depends on only a part of the primary key and not on
the whole key.
How to remove Partial Dependency?
There can be many different solutions for this, but our objective is to remove the teacher's name from the Score table.
The simplest solution is to remove the teacher column from the Score table and add it to the Subject table. Hence, the
Subject table will become:
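The decomposition can be sketched in SQL as follows (column types are assumed; teacher now lives with Subject, on which it fully depends, and Score keeps only the composite key and marks):
CREATE TABLE Subject (
    subject_id   INT PRIMARY KEY,
    subject_name VARCHAR(50),
    teacher      VARCHAR(50)   -- depends only on subject_id
);

CREATE TABLE Score (
    score_id   INT PRIMARY KEY,
    student_id INT,
    subject_id INT,
    marks      INT,
    FOREIGN KEY (subject_id) REFERENCES Subject(subject_id)
);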
Score Table
In the Score table, we need to store some more information, which is the exam name and total marks, so let's
add 2 more columns to the Score table.
score_id   student_id   subject_id   marks
1 10 1 70
2 10 2 75
3 11 1 80
Transitive Dependency
With exam_name and total_marks added to our Score table, it saves more data now. Primary key for the Score
table is a composite key, which means it's made up of two attributes or columns → student_id + subject_id.
The new column exam_name depends on both student and subject. For example, a mechanical engineering
student will have Workshop exam but a computer science student won't. And for some subjects you have
Practical exams and for some you don't. So we can say that exam_name is dependent on
both student_id and subject_id.
And what about our second new column total_marks? Does it depend on our Score table's primary key?
Well, the column total_marks depends on exam_name as with exam type the total score changes. For example,
practicals are of less marks while theory exams are of more marks.
But, exam_name is just another column in the score table. It is not a primary key or even a part of the primary
key, and total_marks depends on it.
This is Transitive Dependency. When a non-prime attribute depends on other non-prime attributes rather than
depending upon the prime attributes or primary key.
How to remove Transitive Dependency
Again the solution is very simple. Take out the columns exam_name and total_marks from Score table and put
them in an Exam table and use the exam_id wherever required.
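A hedged SQL sketch of the resulting design (column types are assumed; exam_name and total_marks move to an Exam table and Score keeps only a reference to it):
CREATE TABLE Exam (
    exam_id     INT PRIMARY KEY,
    exam_name   VARCHAR(50),
    total_marks INT            -- depends on the exam, not on the score row
);

CREATE TABLE Score (
    score_id   INT PRIMARY KEY,
    student_id INT,
    subject_id INT,
    marks      INT,
    exam_id    INT,
    FOREIGN KEY (exam_id) REFERENCES Exam(exam_id)
);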
Score Table: In 3rd Normal Form
score_id   student_id   subject_id   marks   exam_id
The above table can be decomposed into the following three tables; therefore it is not in 5NF:
<EmployeeSkills>
EmpName EmpSkills
Tom Networking
Harry Web Development
Katie Programming
<EmployeeJob>
EmpName EmpJob
Tom EJ001
Harry EJ002
<JobSkills> Katie EJ002
In the above table, Rose takes both the Mathematics and Physics classes for Semester 1, but she does not take the Physics
class for Semester 2. In this case, a combination of all these 3 fields is required to identify valid data. Imagine
we want to add a new class, Semester3, but do not know which Subject and who will be taking that subject.
We would simply be inserting a new entry with Class as Semester3 and leaving Lecturer and Subject as NULL.
As we discussed above, it is not good to have such entries. Moreover, since all three columns together act as the
primary key, we cannot leave the other two columns blank!
Hence we have to decompose the table in such a way that it satisfies all the rules up to 4NF and, when we join the
tables using their keys, it yields the correct records. Here, we can represent each lecturer's subject area and their
classes in a better way. We can divide the above table into three - (SUBJECT, LECTURER), (LECTURER,
CLASS), (SUBJECT, CLASS).
Now, each of the combinations is in three different tables. If we need to identify who is teaching which subject to
which semester, we need to join the keys of each table and get the result.
For example, to find who teaches Physics to Semester 1, we would select Physics and Semester 1 from table 3
above, join with table 1 using Subject to filter out the lecturer names, and then join with table 2 using Lecturer to
get the correct lecturer name. That is, we joined the key columns of each table to get the correct data. Hence there is
no loss of data or spurious data, satisfying the 5NF condition.
UNIT III - TRANSACTIONS
Transaction Concepts – ACID Properties – Schedules – Serializability – Concurrency Control – Need
for Concurrency – Locking Protocols – Two Phase Locking – Deadlock – Transaction Recovery - Save
Points – Isolation Levels – SQL Facilities for Concurrency and Recovery
1. TRANSACTION CONCEPTS
What is Transaction?
A set of logically related operations is known as transaction. The main operations of a transaction are:
Read(A): Read operations Read(A) or R(A) reads the value of A from the database and stores it in a buffer
in main memory.
Write (A): Write operation Write(A) or W(A) writes the value back to the database from buffer.
Let us take a debit transaction from an account which consists of following operations:
1.R(A);
2.A=A-1000;
3.W(A);
Assume A’s value before starting of transaction is 5000.
The first operation reads the value of A from database and stores it in a buffer.
Second operation will decrease its value by 1000. So buffer will contain 4000.
Third operation will write the value from buffer to database. So A’s final value will be 4000.
But it may also be possible that transaction may fail after executing some of its operations. The failure can be
because of hardware, software or power etc. For example, if debit transaction discussed above fails after
executing operation 2, the value of A will remain 5000 in the database which is not acceptable by the bank.
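The same debit transaction can be sketched in SQL (an account(acc_no, balance) table is assumed; BEGIN/START TRANSACTION varies by system):
START TRANSACTION;
UPDATE account SET balance = balance - 1000 WHERE acc_no = 'A';   -- R(A), A = A - 1000, W(A)
COMMIT;   -- if a failure occurs before COMMIT, the DBMS rolls the change back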
States of Transactions
State                 Description
Active State          A transaction enters the active state when execution begins. During this state
                      read or write operations can be performed.
Partially Committed   A transaction goes into the partially committed state after the end of the transaction,
                      i.e., after its final operation has been executed.
Committed State       When a transaction reaches the committed state, it has completed its execution
                      successfully and all of its changes are recorded in the database permanently.
Failed State          A transaction is considered failed when any one of the checks fails or if the transaction is
                      aborted while it is in the active state.
Terminated State      A transaction reaches the terminated state when it leaves the system and cannot be restarted.
2. ACID PROPERTIES
A transaction is a single logical unit of work which accesses and possibly modifies the contents of a
database. Transactions access data using read and write operations.
In order to maintain consistency in a database, before and after transaction, certain properties
are followed. These are called ACID properties.
Atomicity
By this, we mean that either the entire transaction takes place at once or doesn’t happen at all. There is
no midway i.e. transactions do not occur partially. Each transaction is considered as one unit and either
runs to completion or is not executed at all. It involves following two operations.
—Abort: If a transaction aborts, changes made to database are not visible.
—Commit: If a transaction commits, changes made are visible.
Atomicity is also known as the ‘All or nothing rule’.
Consider the following transaction T consisting of T1 and T2: Transfer of 100 from account X to
account Y.
If the transaction fails after completion of T1 but before completion of T2.( say, after write(X) but
before write(Y)), then amount has been deducted from X but not added to Y. This results in an
inconsistent database state. Therefore, the transaction must be executed in entirety in order to ensure
correctness of database state.
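A minimal SQL sketch of the transfer (an account(acc_no, balance) table is assumed); atomicity means either both UPDATE statements take effect or neither does:
START TRANSACTION;
UPDATE account SET balance = balance - 100 WHERE acc_no = 'X';   -- T1: deduct 100 from X
UPDATE account SET balance = balance + 100 WHERE acc_no = 'Y';   -- T2: add 100 to Y
COMMIT;
-- if the system fails between the two updates, the transaction is rolled back,
-- so the deduction from X is undone and the database stays consistent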
Consistency
This means that integrity constraints must be maintained so that the database is consistent before and
after the transaction. It refers to correctness of a database.
Referring to the example above,
The total amount before and after the transaction must be maintained.
Total before T occurs = 500 + 200 = 700.
Total after T occurs = 400 + 300 = 700.
Therefore, database is consistent. Inconsistency occurs in case T1 completes but T2 fails. As a result T is
incomplete.
Isolation
This property ensures that multiple transactions can occur concurrently without leading to
inconsistency of database state. Transactions occur independently without interference. Changes occurring
in a particular transaction will not be visible to any other transaction until that particular change in that
transaction is written to memory or has been committed. This property ensures that executing
transactions concurrently results in a state that is equivalent to the state achieved had they been executed
serially in some order.
Let X= 500, Y = 500.
Consider two transactions T and T”.
Suppose T has been executed till Read(Y) and then T’’ starts. As a result, interleaving of
operations takes place, due to which T’’ reads the correct value of X but an incorrect value of Y, and the sum
computed by
T’’: (X + Y = 50,000 + 500 = 50,500)
is thus not consistent with the sum at the end of transaction
T: (X + Y = 50,000 + 450 = 50,450).
This results in database inconsistency, due to a loss of 50 units. Hence, transactions must take place in
isolation and changes should be visible only after they have been written to the main memory.
Durability:
This property ensures that once the transaction has completed execution, the updates and
modifications to the database are stored in and written to disk and they persist even if system failure
occurs. These updates now become permanent and are stored in a non-volatile memory. The effects of the
transaction, thus, are never lost.
3. SCHEDULES
1. Serial Schedules
Schedules in which the transactions are executed non-interleaved, i.e., no transaction starts until the
running transaction has ended, are called serial schedules.
Example: Consider the following schedule involving two transactions T1 and T2.
T1 T2
R(A)
W(A)
R(B)
W(B)
R(A)
R(B)
where R(A) denotes that a read operation is performed on some data item ‘A’
This is a serial schedule since the transactions perform serially in the order T1 —> T2
2. Complete Schedules
Schedules in which the last operation of each transaction is either abort or commit are called
complete schedules.
Example: Consider the following schedule involving three transactions T1, T2 and T3.
T1 T2 T3
R(A)
W(A)
R(B)
W(B)
commit
commit
abort
This is a complete schedule since the last operation performed under every transaction is either
“commit” or “abort”.
3. Recoverable Schedules
Schedules in which transactions commit only after all transactions whose changes they read commit
are called recoverable schedules. In other words, if some transaction Tj is reading value updated or written
by some other transaction Ti, then the commit of Tj must occur after the commit of Ti.
Example – Consider the following schedule involving two transactions T1 and T2.
T1 T2
R(A)
W(A)
W(A)
R(A)
commit
commit
This is a recoverable schedule since T1 commits before T2, which makes the value read by T2 correct.
4. Cascadeless Schedules –
Also called Avoids cascading aborts/rollbacks (ACA). Schedules in which transactions read values
only after all transactions whose changes they are going to read commit are called cascadeless schedules.
Avoids that a single transaction abort leads to a series of transaction rollbacks. A strategy to prevent
cascading aborts is to disallow a transaction from reading uncommitted changes from another transaction
in the same schedule.
In other words, if some transaction Tj wants to read a value updated or written by some other transaction Ti,
then Tj must read it only after Ti commits.
Example: Consider the following schedule involving two transactions T1 and T2.
T1 T2
R(A)
W(A)
W(A)
commit
R(A)
commit
This schedule is cascadeless. Since the updated value of A is read by T2 only after the updating
transaction i.e. T1 commits.
5. Strict Schedules
A schedule is strict if for any two transactions Ti, Tj, if a write operation of Ti precedes a conflicting
operation of Tj (either read or write), then the commit or abort event of Ti also precedes that conflicting
operation of Tj.
In other words, Tj can read or write updated or written value of Ti only after Ti commits/aborts.
Example: Consider the following schedule involving two transactions T1 and T2.
T1 T2
R(A)
R(A)
W(A)
commit
W(A)
R(A)
commit
This is a strict schedule since T2 reads and writes A which is written by T1 only after the commit of T1.
Note – It can be seen that:
1. Cascadeless schedules are stricter than recoverable schedules, i.e., they are a subset of recoverable schedules.
2. Strict schedules are stricter than cascadeless schedules, i.e., they are a subset of cascadeless schedules.
3. Serial schedules satisfy the constraints of all recoverable, cascadeless and strict schedules and hence are a
subset of strict schedules.
4. SERIALIZABILITY
When multiple transactions are running concurrently then there is a possibility that the database may
be left in an inconsistent state. Serializability is a concept that helps us to check which schedules are
serializable. A serializable schedule is the one that always leaves the database in consistent state.
What is a serializable schedule?
A serializable schedule always leaves the database in a consistent state. A serial schedule is always a
serializable schedule because in a serial schedule a transaction only starts when the other transaction has
finished execution. However, a non-serial schedule needs to be checked for serializability.
Types of Serializability
There are two types of Serializability.
I. Conflict Serializability
II. View Serializability
1. Conflict Serializability
Conflict Serializability is one of the types of serializability, which can be used to check whether a
non-serial schedule is conflict serializable or not.
What is Conflict Serializability?
A schedule is called conflict serializable if we can convert it into a serial schedule after swapping its
non-conflicting operations.
Conflicting operations
Two operations are said to be in conflict, if they satisfy all the following three conditions:
1. Both the operations should belong to different transactions.
2. Both the operations are working on same data item.
3. At least one of the operations is a write operation.
Let’s see some examples to understand this:
Example 1: Operation W(X) of transaction T1 and operation R(X) of transaction T2 are conflicting
operations, because they satisfy all the three conditions mentioned above. They belong to different
transactions, they are working on the same data item X, and one of the operations is a write operation.
Example 2: Similarly, Operations W(X) of T1 and W(X) of T2 are conflicting operations.
Example 3: Operations W(X) of T1 and W(Y) of T2 are non-conflicting operations because both the
write operations are not working on same data item so these operations don’t satisfy the second condition.
Example 4: Similarly, R(X) of T1 and R(X) of T2 are non-conflicting operations because none of them is
write operation.
Example 5: Similarly, W(X) of T1 and R(X) of T1 are non-conflicting operations because both the
operations belong to same transaction T1.
Conflict Equivalent Schedules
Two schedules are said to be conflict Equivalent if one schedule can be converted into other schedule
after swapping non-conflicting operations.
Conflict Serializable check
Let’s check whether a schedule is conflict serializable or not. If a schedule is conflict Equivalent to its
serial schedule then it is called Conflict Serializable schedule. Let’s take few examples of schedules.
Example of Conflict Serializability
Let’s consider this schedule:
T1 T2
R(A)
R(B)
R(A)
R(B)
W(B)
W(A)
To convert this schedule into a serial schedule we would have to swap the R(A) operation of transaction
T2 with the W(A) operation of transaction T1. However, we cannot swap these two operations because
they are conflicting operations; thus we can say that this given schedule is not Conflict Serializable.
Let’s take another example:
T1 T2
----- ------
R(A)
R(A)
R(B)
W(B)
R(B)
W(A)
Let’s swap non-conflicting operations:
After swapping R(A) of T1 and R(A) of T2 we get:
T1 T2
----- ------
R(A)
R(A)
R(B)
W(B)
R(B)
W(A)
After swapping R(A) of T1 and R(B) of T2 we get:
T1 T2
----- ------
R(A)
R(B)
R(A)
W(B)
R(B)
W(A)
After swapping R(A) of T1 and W(B) of T2 we get:
T1 T2
----- ------
R(A)
R(B)
W(B)
R(A)
R(B)
W(A)
We finally got a serial schedule after swapping all the non-conflicting operations so we can say that the
given schedule is Conflict Serializable.
2. View Serializability
View Serializability is a process to find out that a given schedule is view serializable or not.
To check whether a given schedule is view serializable, we need to check whether the given schedule
is View Equivalent to its serial schedule. Let's take an example to understand this.
Given Schedule:
T1 T2
----- ------
R(X)
W(X)
R(X)
W(X)
R(Y)
W(Y)
R(Y)
W(Y)
Serial Schedule of the above given schedule:
As we know that in Serial schedule a transaction only starts when the current running transaction is
finished. So the serial schedule of the above given schedule would look like this:
T1 T2
----- ------
R(X)
W(X)
R(Y)
W(Y)
R(X)
W(X)
R(Y)
W(Y)
If we can prove that the given schedule is View Equivalent to its serial schedule then the given schedule
is called view Serializable.
Testing for Serializability
5. CONCURRENCY CONTROL
In concurrency control, multiple transactions can be executed simultaneously.
This may affect the results of the transactions, so it is highly important to control the order of execution of those
transactions.
6. NEED FOR CONCURRENCY
Problems of concurrency control
Several problems can occur when concurrent transactions are executed in an uncontrolled manner.
Following are the three problems in concurrency control.
Lost updates
Dirty read
Unrepeatable read
1. Lost update problem
When two transactions that access the same database items perform their operations in a way that
makes the value of some database item incorrect, the lost update problem occurs.
If two transactions T1 and T2 read a record and then update it, the effect of the first
update will be overwritten by the second update.
Example:
Here,
o At time t2, transaction-X reads A's value.
o At time t3, Transaction-Y reads A's value.
o At time t4, Transaction-X writes A's value on the basis of the value seen at time t2.
o At time t5, Transaction-Y writes A's value on the basis of the value seen at time t3.
o So at time t5, the update of Transaction-X is lost because Transaction-Y overwrites it
without looking at its current value.
o Such a problem is known as the Lost Update Problem, as the update made by one transaction
is lost here.
2. Dirty Read
o The dirty read occurs in the case when one transaction updates an item of the database,
and then the transaction fails for some reason. The updated database item is accessed by another
transaction before it is changed back to the original value.
o A transaction T1 updates a record which is read by T2. If T1 aborts then T2 now has
values which have never formed part of the stable database.
Example:
o Transaction-X is doing the sum of all balance while transaction-Y is transferring an amount 50
from Account-1 to Account-3.
o Here, transaction-X produces the result of 550 which is incorrect. If we write this produced
result in the database, the database will become an inconsistent state because the actual sum is 600.
o Here, transaction-X has seen an inconsistent state of the database.
Concurrency Control Protocol
Concurrency control protocols ensure atomicity, isolation, and serializability of concurrent
transactions. Concurrency control protocols can be divided into the following categories:
1. Lock based protocol
2. Time-stamp protocol
7. LOCKING PROTOCOLS
Lock-Based Protocol
In this type of protocol, any transaction cannot read or write data until it acquires an appropriate lock on it.
There are two types of lock:
1. Shared lock:
o It is also known as a Read-only lock. With a shared lock, the data item can only be read by
the transaction.
o It can be shared between transactions because when a transaction holds a shared lock, it
can't update the data on the data item.
2. Exclusive lock:
o With an exclusive lock, the data item can be both read and written by the transaction.
o This lock is exclusive; under this lock, multiple transactions cannot modify the same
data simultaneously.
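Most SQL systems expose these two lock modes through statements like the following (a hedged sketch; exact syntax is dialect-dependent, e.g. PostgreSQL/MySQL 8):
SELECT balance FROM account WHERE acc_no = 'A' FOR SHARE;    -- shared (read) lock on the selected rows
SELECT balance FROM account WHERE acc_no = 'A' FOR UPDATE;   -- exclusive (write) lock on the selected rows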
8. TWO PHASE LOCKING
The following shows how unlocking and locking work with 2-PL (two-phase locking).
Transaction T1:
o Growing phase: from step 1-3
o Shrinking phase: from step 5-7
o Lock point: at 3
Transaction T2:
o Growing phase: from step 2-6
o Shrinking phase: from step 8-9
o Lock point: at 6
Types of Two Phase Locking (2PL)
1. Strict Two-phase locking (Strict-2PL)
o The first phase of Strict-2PL is similar to 2PL. In the first phase, after acquiring all the locks,
the transaction continues to execute normally.
o The only difference between 2PL and strict 2PL is that Strict-2PL does not release a lock
after using it.
o Strict-2PL waits until the whole transaction commits, and then it releases all the locks at
a time.
o Strict-2PL protocol does not have shrinking phase of lock release.
Problems
9. DEADLOCK
A deadlock is a condition wherein two or more tasks are waiting for each other in order to be finished,
but none of the tasks is willing to give up the resources that the other task needs. In this situation no task ever
gets finished and all of them remain in the waiting state forever.
Coffman conditions
Coffman stated four conditions for a deadlock occurrence. A deadlock may occur if all the following
conditions hold true.
Mutual exclusion condition: There must be at least one resource that cannot be used by more than one
process at a time.
Hold and wait condition: A process that is holding a resource can request for additional resources that
are being held by other processes in the system.
No preemption condition: A resource cannot be forcibly taken from a process. Only the process
can release a resource that is being held by it.
Circular wait condition: A condition where one process is waiting for a resource that is being held
by a second process, the second process is waiting for a third process, and so on, and the last process is
waiting for the first process, thus making a circular chain of waiting.
For example: In the student table, transaction T1 holds a lock on some rows and needs to update some
rows in the grade table. Simultaneously, transaction T2 holds locks on some rows in the grade table and needs
to update the rows in the Student table held by Transaction T1.
Now, the main problem arises. Now Transaction T1 is waiting for T2 to release its lock and similarly,
transaction T2 is waiting for T1 to release its lock. All activities come to a halt state and remain at a standstill.
It will remain in a standstill until the DBMS detects the deadlock and aborts one of the transactions.
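The scenario can be sketched as two concurrent SQL sessions (table and column names are assumed for illustration):
-- Session 1 (transaction T1)
START TRANSACTION;
UPDATE student SET name = 'X' WHERE roll_no = 1;     -- T1 locks a Student row
-- meanwhile Session 2 (transaction T2) runs:
--   START TRANSACTION;
--   UPDATE grade SET marks = 90 WHERE roll_no = 2;  -- T2 locks a Grade row
UPDATE grade SET marks = 80 WHERE roll_no = 2;       -- T1 now waits for T2's lock
-- T2 then issues: UPDATE student SET name = 'Y' WHERE roll_no = 1;
-- T2 waits for T1's lock, a cycle forms, and the DBMS must detect it and abort one transaction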
Deadlock Avoidance
o When a database is stuck in a deadlock state, it is better to avoid the deadlock rather than aborting
and restarting the transactions, as that wastes time and resources.
o A wait-for graph is maintained by the system for every transaction that is waiting for some
data held by the others. The system keeps checking whether there is any cycle in the graph.
The wait-for graph for the above scenario is shown below:
Deadlock Prevention
o Deadlock prevention method is suitable for a large database. If the resources are allocated in such a
way that deadlock never occurs, then the deadlock can be prevented.
o The database management system analyzes the operations of the transaction to determine whether they can
create a deadlock situation or not. If they can, then the DBMS never allows that transaction to be executed.
I. Wait-Die scheme
In this scheme, if a transaction requests for a resource which is already held with a conflicting lock by
another transaction then the DBMS simply checks the timestamp of both transactions. It allows the older
transaction to wait until the resource is available for execution.
Let's assume there are two transactions Ti and Tj, and let TS(T) be the timestamp of any transaction T. If
Tj holds a lock on some resource and Ti is requesting the resource held by Tj, then the following
actions are performed by the DBMS:
1. Check if TS(Ti) < TS(Tj) - If Ti is the older transaction and Tj has held some resource, then Ti is
allowed to wait until the data-item is available for execution. That means if the older transaction is
waiting for a resource which is locked by the younger transaction, then the older transaction is allowed
to wait for resource until it is available.
2. Check if TS(Ti) < TS(Tj) - If Ti is older transaction and has held some resource and if Tj is waiting
for it, then Tj is killed and restarted later with the random delay but with the same timestamp.
II. Wound wait scheme
o In the wound-wait scheme, if the older transaction requests a resource which is held by the younger
transaction, then the older transaction forces the younger one to abort (it wounds the younger) and release the
resource. After a small delay, the younger transaction is restarted but with the same timestamp.
o If the older transaction has held a resource which is requested by the Younger transaction, then
the younger transaction is asked to wait until older releases it.
Here is the table representation of resource allocation for each algorithm. Both of these
algorithms take transaction age (timestamp) into consideration while determining the best possible way of
resource allocation for deadlock avoidance. One of the famous deadlock avoidance algorithms is the Banker's
algorithm.
10. TRANSACTION RECOVERY
UNIT IV IMPLEMENTATION TECHNIQUES
RAID – File Organization – Organization of Records in Files – Indexing and Hashing –Ordered
Indices – B+ tree Index Files – B tree Index Files – Static Hashing – Dynamic Hashing – Query
Processing Overview – Algorithms for SELECT and JOIN operations – Query optimization using
Heuristics and Cost Estimation
RAID
RAID (redundant array of independent disks; originally redundant array of inexpensive disks) is a way of
storing the same data in different places on multiple hard disks to protect data in the case of a drive failure.
RAID: Redundant Arrays of Independent Disks
Disk organization techniques that manage a large number of disks, providing a view of a single disk
of high capacity and high speed by using multiple disks in parallel, and high reliability by storing data
redundantly, so that data can be recovered even if a disk fails.
Motivation for RAID
Just as additional memory in the form of a cache can improve system performance, in the same way
additional disks can also improve system performance.
In RAID we can use an array of disks which operate independently; since there are many disks,
multiple I/O requests can be handled in parallel if the data required is on separate disks.
A single I/O operation can be handled in parallel if the data required is distributed across multiple
disks.
Benefits of RAID
Data loss can be very dangerous for an organization
RAID technology prevents data loss due to disk failure
RAID technology can be implemented in hardware or software
Servers make use of RAID Technology
RAID Level 0 – Striping (non-redundant)
RAID level 0 divides data into block units and writes them across a number of disks. Because data is placed across multiple disks, it is also called "data striping".
The advantage of distributing data over disks is that if different I/O requests are pending for two different blocks of data, there is a good chance that the requested blocks are on different disks and can be serviced in parallel (see the sketch after the disadvantages below).
There is no parity checking of the data, so if one drive fails or its data gets corrupted, all the data in the array is effectively lost; thus RAID 0 does not support data recovery. Spanning is another term used with RAID level 0 because the logical disk spans all the physical drives. A RAID 0 implementation requires a minimum of 2 disks.
Advantages
I/O performance is greatly improved by spreading the I/O load across many channels and drives.
Best performance is achieved when data is striped across multiple controllers with only one drive per controller.
Disadvantages
It is not fault-tolerant; the failure of one drive results in all data in the array being lost.
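As a rough illustration of striping, logical block i can be placed on disk (i mod N) at stripe position (i div N). The sketch below assumes this simple round-robin placement; it is not the layout of any particular RAID implementation.

# Sketch: round-robin block striping across N disks (RAID 0, no redundancy).
def place_block(logical_block, num_disks):
    disk = logical_block % num_disks       # which disk holds the block
    stripe = logical_block // num_disks    # position of the block on that disk
    return disk, stripe

for b in range(8):
    print(b, place_block(b, num_disks=4))
# blocks 0..3 land on disks 0..3 in stripe 0, blocks 4..7 in stripe 1, and so on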
RAID Level 1: Mirroring (or shadowing)
Also known as disk mirroring, this configuration consists of at least two drives that duplicate the stored data. There is no striping.
Read performance is improved since either disk can be read at the same time; write performance is the same as for single-disk storage.
Every write is carried out on both disks. If one disk in a pair fails, the data is still available on the other.
Data loss would occur only if a disk fails and its mirror disk also fails before the system is repaired; the probability of this combined event is very small.
RAID Level 2:
This configuration uses striping across disks, with some disks storing error checking and correcting (ECC)
information. It has no advantage over RAID 3 and is no longer used.
RAID Level 5:
RAID 5 uses striping as well as parity for redundancy. It is well suited to workloads with heavy reads and relatively few writes.
Block-Interleaved Distributed Parity; partitions data and parity among all N + 1 disks, rather
than storing data in N disks and parity in 1 disk.
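The parity block of a stripe is simply the bitwise XOR of the data blocks in that stripe, so any single lost block can be recomputed by XOR-ing the surviving blocks with the parity. A minimal sketch (toy block contents; real implementations work on whole disk blocks):

from functools import reduce

def parity(blocks):
    # XOR all blocks of a stripe, byte by byte
    return bytes(reduce(lambda a, b: a ^ b, byte_group) for byte_group in zip(*blocks))

d0, d1, d2 = b"\x0f\xf0", b"\xaa\x55", b"\x01\x02"   # three data blocks of a stripe
p = parity([d0, d1, d2])                              # parity block stored on a fourth disk

# if the disk holding d1 fails, XOR-ing the surviving blocks with the parity recovers d1
recovered = parity([d0, d2, p])
assert recovered == d1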
RAID Level 6:
This technique is similar to RAID 5, but includes a second parity scheme that is distributed across
the drives in the array. The use of additional parity allows the array to continue to function even if
two disks fail simultaneously. However, this extra protection comes at a cost.
P+Q Redundancy scheme; similar to Level 5, but stores extra redundant information to guard
against multiple disk failures.
- Better reliability than Level 5 at a higher cost; not used as widely.
File Organization
The database is stored as a collection of files.
Each file is a sequence of records.
A record is a sequence of fields.
Classifications of records
– Fixed length record
– Variable length record
Fixed length record approach:
Assume the record size is fixed, each file has records of one particular type only, and different files are used for different relations.
This is the simplest approach, and record access is simple.
Example pseudocode:
type account = record
    account_number char(10);
    branch_name char(22);
    balance numeric(8);
end
Total: 40 bytes per record.
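A quick way to see the fixed 40-byte layout is to pack one record with Python's struct module. The field widths follow the pseudocode above; this is only an illustration of fixed-length storage, not actual DBMS code (the balance is stored here as an 8-character string for simplicity).

import struct

RECORD_FMT = "10s22s8s"   # account_number (10) + branch_name (22) + balance (8) = 40 bytes

def pack_account(account_number, branch_name, balance):
    return struct.pack(RECORD_FMT,
                       account_number.encode().ljust(10),
                       branch_name.encode().ljust(22),
                       str(balance).encode().ljust(8))

rec = pack_account("A-102", "Perryridge", 400)
print(len(rec), struct.calcsize(RECORD_FMT))   # 40 40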
Two problems:
– It is difficult to delete a record from this structure.
– Some records will cross block boundaries, that is, part of the record is stored in one block and part in another; reading or writing such a record requires two block accesses.
Alternatives for reusing the free space when record i is deleted:
– move records i + 1, . . ., n to i, . . ., n – 1
– move the final record (record n) into the place of the deleted record
– do not move records, but link all free records on a free list
Free Lists
Store the address of the first deleted record in the file header.
Use this first record to store the address of the second deleted record, and so on
Variable-Length Records
Byte-string representation
– Attach an end-of-record control character to the end of each record.
– Difficulty with deletion.
Example (byte-string representation of variable-length account records):
0 perryridge A-102 400 A-201 900
1 roundhill A-305 350
Slotted page header contains:
– number of record entries
– end of free space in the block
– location and size of each record
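The header fields listed above can be captured in a small sketch (the class and field names are hypothetical; a real system keeps this information packed at the front of the disk block):

from dataclasses import dataclass, field

@dataclass
class SlottedPage:
    block_size: int = 4096
    num_entries: int = 0                        # number of record entries
    free_space_end: int = 4096                  # end of free space in the block
    slots: list = field(default_factory=list)   # (offset, size) of each record

    def insert(self, record_size):
        # records grow from the end of the block towards the header
        self.free_space_end -= record_size
        self.slots.append((self.free_space_end, record_size))
        self.num_entries += 1
        return self.num_entries - 1             # slot number used to refer to the record

page = SlottedPage()
page.insert(40)
page.insert(64)
print(page.num_entries, page.free_space_end, page.slots)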
Pointer Method
– A variable-length record is represented by a list of fixed-length records, chained together by pointers.
Indexing
An index file consists of index entries of the form (search-key, pointer).
Index files are typically much smaller than the original file
Two basic kinds of indices:
– Ordered indices: search keys are stored in sorted order
– Hash indices: search keys are distributed across "buckets" using a "hash function", which determines the bucket for each key.
Ordered Indices
In an ordered index, index entries are stored sorted on the search key value.
Primary index: in a sequentially ordered file, the index whose search key specifies the sequential
order of the file.
Secondary index: an index whose search key specifies an order different from the sequential
order of the file.
Types of Ordered Indices
Dense index
Sparse index
Dense Index Files
Dense index — Index record appears for every search-key value in the file.
Sparse Index Files
A sparse index contains index records for only some search-key values.
To locate a record with search-key value K:
– Find the index record with the largest search-key value that is less than or equal to K.
– Search the file sequentially, starting at the record to which that index record points (see the sketch below).
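A sparse-index lookup can be sketched as a search for the largest indexed key that is less than or equal to K, followed by a sequential scan from the block that entry points to. The index contents below are invented for illustration, and Python's bisect module stands in for the index search:

import bisect

# one (search_key, block_number) entry per block, kept sorted by search key
sparse_index = [("Brighton", 0), ("Mianus", 1), ("Redwood", 2)]
keys = [k for k, _ in sparse_index]

def lookup(search_key):
    pos = bisect.bisect_right(keys, search_key) - 1   # largest indexed key <= search_key
    if pos < 0:
        return None                                   # key is smaller than every indexed key
    return sparse_index[pos][1]                       # block at which the sequential scan starts

print(lookup("Perryridge"))   # 1 -> start scanning sequentially from block 1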
Multilevel Index
If primary index does not fit in memory, access becomes expensive.
To reduce number of disk accesses to index records, treat primary index kept on disk as a
sequential file and construct a sparse index on it.
– outer index – a sparse index of primary index
– inner index – the primary index file
If even outer index is too large to fit in main memory, yet another level of index can be created, and
so on.
Index Update: Deletion
If deleted record was the only record in the file with its particular search-key value, the search-key is
deleted from the index also.
Single-level index deletion:
– Dense indices – deletion of search-key is similar to file record deletion.
– Sparse indices – if an entry for the search key exists in the index, it is deleted by replacing
the entry in the index with the next search-key value in the file (in search-key order). If the
next search-key value already has an index entry, the entry is deleted instead of being
replaced.
Index Update: Insertion
Single-level index insertion:
– Perform a lookup using the search-key value appearing in the record to be inserted.
– Dense indices – if the search-key value does not appear in the index, insert it.
– Sparse indices – if index stores an entry for each block of the file, no change needs to be
made to the index unless a new block is created. In this case, the first search-key value
appearing in the new block is inserted into the index.
Secondary Index on balance field of account
B+-Tree Index Files
Example of a B+-tree with n = 5:
– Non-leaf nodes other than the root must have between ⌈n/2⌉ = 3 and n = 5 children.
– The root must have at least 2 children.
Observations about B+-trees
Since the inter-node connections are made by pointers, "logically" close blocks need not be "physically" close.
The B+-tree contains a relatively small number of levels, so searches can be conducted efficiently.
Insertions and deletions to the main file can be handled efficiently.
Updates on B+-Trees: Insertion
Find the leaf node in which the search-key value would appear (see the search sketch below).
If the search-key value is already present in the leaf node, the record is added to the file and, if necessary, a pointer is inserted into the bucket.
If the search-key value is not present, add the record to the main file and create a bucket if necessary. Then:
– If there is room in the leaf node, insert the (key-value, pointer) pair into the leaf node; otherwise, split the node.
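A minimal sketch of the descent to the correct leaf and of the search itself is given below. The node layout (keys, children, records, leaf flag) is a simplification chosen for illustration, not the exact structure used in any particular system:

class Node:
    def __init__(self, keys, children=None, records=None, leaf=False):
        self.keys, self.children, self.records, self.leaf = keys, children, records, leaf

def find_leaf(node, key):
    # walk down from the root, choosing the child whose key range contains `key`
    while not node.leaf:
        i = 0
        while i < len(node.keys) and key >= node.keys[i]:
            i += 1
        node = node.children[i]
    return node

def search(root, key):
    leaf = find_leaf(root, key)
    for k, rec in zip(leaf.keys, leaf.records):
        if k == key:
            return rec
    return None

# tiny example: a root with one key separating two leaves
leaf1 = Node(keys=["Brighton", "Clearview"], records=["r1", "r2"], leaf=True)
leaf2 = Node(keys=["Mianus", "Perryridge"], records=["r3", "r4"], leaf=True)
root = Node(keys=["Mianus"], children=[leaf1, leaf2])
print(search(root, "Perryridge"))   # r4
print(search(root, "Downtown"))     # None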
Example: B+-Tree before and after insertion of “Clearview”
Updates on B+-Trees: Deletion
• The removal of the leaf node containing "Downtown" did not result in its parent having too few pointers, so the cascaded deletions stopped with the deleted leaf node's parent.
Deletion of "Perryridge" from the result of the previous example:
• The node containing "Perryridge" becomes empty and is merged with its sibling.
• The root node is then left with only one child; it is deleted and its child becomes the new root node.
B+-Tree File Organization
• The leaf nodes in a B+-tree file organization store records, instead of pointers.
• Since records are larger than pointers, the maximum number of records that can be stored in a
leaf node is less than the number of pointers in a nonleaf node.
• Leaf nodes are still required to be half full.
• Insertion and deletion are handled in the same way as insertion and deletion of entries in a B+-tree index.
B-Tree Index Files
• Similar to B+-tree, but B-tree allows search-key values to appear only once; eliminates redundant
storage of search keys.
• Search keys in nonleaf nodes appear nowhere else in the B-tree; an additional pointer field for
each search key in a nonleaf node must be included.
HASHING
• Hashing is an effective technique to calculate the direct location of a data record on the disk
without using index structure.
• Hashing uses hash functions with search keys as parameters to generate the address of a data record.
Hash Organization
Bucket
A hash file stores data in bucket format. Bucket is considered a unit of storage. A bucket
typically stores one complete disk block, which in turn can store one or more records.
Hash Function
A hash function, h, is a mapping function that maps all the set of search-keys K to the
address where actual records are placed. It is a function from search keys to bucket addresses.
Worst hash function maps all search-key values to the same bucket.
An ideal hash function is uniform, i.e., each bucket is assigned the same number of search-key
values from the set of all possible values.
An ideal hash function is random, so that each bucket will have roughly the same number of records.
Types of Hashing
• Static Hashing
• Dynamic Hashing
Static Hashing
In static hashing, when a search-key value is provided, the hash function always computes the same
address.
For example, if a mod-4 hash function is used, it generates only 4 values (0 to 3). The output address is always the same for a given search-key value.
The number of buckets provided remains unchanged at all times.
Example of Hash File Organization
There are 10 buckets,
The hash function returns the sum of the binary representations of the characters modulo 10
– E.g. h(Perryridge) = 5 h(Round Hill) = 3 h(Brighton) = 3
Operation
Insertion − When a record is required to be entered using static hash, the hash function h
computes the bucket address for search key K, where the record will be stored.
Bucket address = h(K)
Search − When a record needs to be retrieved, the same hash function can be used to retrieve the
address of
the bucket where the data is stored.
Delete − This is simply a search followed by a deletion operation.
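A minimal static-hashing sketch is shown below. The hash function sums the character codes of the search key modulo the (fixed) number of buckets, in the spirit of the example above; the exact bucket numbers it produces may differ from those in the figure, and the record layout is invented for illustration:

B = 10                                    # fixed number of buckets
buckets = [[] for _ in range(B)]

def h(search_key):
    return sum(search_key.encode()) % B   # toy hash: sum of character codes mod B

def insert(record, search_key):
    buckets[h(search_key)].append(record)               # bucket address = h(K)

def search(search_key):
    return [r for r in buckets[h(search_key)] if r[0] == search_key]

def delete(record, search_key):
    buckets[h(search_key)].remove(record)               # a search followed by a deletion

insert(("Perryridge", "A-102", 400), "Perryridge")
print(search("Perryridge"))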
Handling of Bucket Overflows
Bucket overflow can occur because of
– Insufficient buckets
– Skew in distribution of records. This can occur due to :
• multiple records have same search-key value
Although the probability of bucket overflow can be reduced, it cannot be eliminated; it is handled
by using overflow buckets.
Overflow chaining – the overflow buckets of a given bucket are chained together in a linked list.
Above scheme is called closed hashing.
– An alternative, called open hashing, which does not use overflow buckets, is not
suitable for database applications.
Hash Indices
• Hashing can be used not only for file organization, but also for index-structure creation.
• A hash index organizes the search keys, with their associated record pointers, into a hash file
structure.
• Hash indices are always secondary indices
Dynamic Hashing – Extendable Hashing
In extendable hashing, buckets are split and coalesced as the database grows and shrinks. An i-bit prefix of the hash value (i is the global depth) indexes a bucket address table, and each bucket j has a local depth ij ≤ i.
Insertion in Extendable Hash Structure
To split a bucket j when inserting record with search-key value Kj:
If i > ij (more than one bucket address table entry points to bucket j):
– allocate a new bucket z, and set ij = iz = (ij + 1)
– update the second half of the bucket address table entries originally pointing to j so that they point to z
– remove each record in bucket j and reinsert it (into j or z)
– recompute the bucket for Kj and insert the record in that bucket (further splitting is required if the bucket is still full)
If i = ij (only one pointer to bucket j)
– If i reaches some limit b, or too many splits have happened in this insertion, create an
overflow bucket
Else
– increment i and double the size of the bucket address table.
– replace each entry in the table by two entries that point to the same bucket.
– recompute new bucket address table entry for Kj
Now i > ij so use the first case above.
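The following is a compact sketch of extendable-hash insertion with directory doubling. It uses the low-order bits of the hash value instead of a bit prefix, omits overflow buckets, and assumes search-key values are reasonably distinct; the class and attribute names are illustrative only:

class Bucket:
    def __init__(self, depth, capacity=2):
        self.local_depth, self.capacity, self.items = depth, capacity, []

class ExtendableHash:
    def __init__(self):
        self.global_depth = 1                       # i: number of hash bits in use
        self.directory = [Bucket(1), Bucket(1)]     # bucket address table with 2**i entries

    def _index(self, key):
        return hash(key) & ((1 << self.global_depth) - 1)

    def insert(self, key, record):
        bucket = self.directory[self._index(key)]
        if len(bucket.items) < bucket.capacity:
            bucket.items.append((key, record))
            return
        if bucket.local_depth == self.global_depth:
            # only one table entry points to this bucket: double the table, increment i
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # split the overflowing bucket: half of the entries pointing to it now point to a new bucket
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        for idx, b in enumerate(self.directory):
            if b is bucket and (idx >> (bucket.local_depth - 1)) & 1:
                self.directory[idx] = new_bucket
        # redistribute the old records plus the new one (may trigger further splits)
        old_items, bucket.items = bucket.items, []
        for k, r in old_items + [(key, record)]:
            self.insert(k, r)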
Deletion in Extendable Hash Structure
To delete a key value,
– locate it in its bucket and remove it.
– The bucket itself can be removed if it becomes empty (with appropriate updates to the
bucket address table).
– Coalescing of buckets can be done (a bucket can coalesce only with a "buddy" bucket that has the same value of ij and the same ij − 1 bit prefix, if such a bucket exists).
– Decreasing bucket address table size is also possible
• Note: decreasing bucket address table size is an expensive operation and should
be done only if number of buckets becomes much smaller than the size of the table
Example
Initial Hash structure, bucket size = 2
Hash structure after insertion of one Brighton and two Downtown records
Hash structure after insertion of Mianus record
Algorithms for SELECT Operations
Conjunctive selection using multiple secondary indexes: each applicable index is used to retrieve a set of record pointers.
– The intersection of these sets of record pointers gives the record pointers that satisfy the conjunctive condition.
– If only some of the conditions have secondary indexes, each retrieved record is further tested to determine whether it satisfies the remaining conditions.
Algorithms for JOIN Operations
– Two-way join: join on two files.
– Multiway join: involving more than two files.
The following examples of the two-way JOIN operation (R ⋈A=B S) will be used:
– OP6: EMPLOYEE ⋈DNO=DNUMBER DEPARTMENT
– OP7: DEPARTMENT ⋈MGRSSN=SSN EMPLOYEE
J1: Nested-loop join (brute force)
– For each record t in R (outer loop), retrieve every record s from S (inner loop) and test
whether the two records satisfy the join condition t[A] = s[B].
J2: Single-loop join (using an access structure to retrieve the matching records)
– If an index (or hash key) exists for one of the two join attributes (e.g B of S), retrieve each
record t in R, one at a time (single loop), and then use the access structure to retrieve directly
all matching records s from S that satisfy s[B] = t[A]
J3: Sort-merge join:
– If the records of R and S are physically sorted (ordered) by value of the join attributes A
and B, respectively, we can implement the join in the most efficient way.
– Both files are scanned concurrently in order of the join attributes, matching the records that
have the same values for A and B.
– If the files are not sorted, they may be sorted first by using external sorting.
– Pairs of file blocks are copied into memory buffers in order, and the records of each file are scanned only once for matching with the other file (if A and B are key attributes).
– The method is slightly modified in the case where A and B are not key attributes.
J4: Hash-join
– The records of files R and S are both hashed to the same hash file using the same hashing
function on the join attributes A of R and B of S as hash keys.
Partitioning Phase
– First, a single pass through the file with fewer records (say, R) hashes its records to the
hash file buckets.
– Assumption: The smaller file fits entirely into memory buckets after the first phase.
– (If the above assumption is not satisfied, the method becomes more complex, and a number of variations have been proposed to improve efficiency: partition hash join and hybrid hash join.)
Probing Phase
– A single pass through the other file (S) then hashes each of its records to probe appropriate
bucket, and that record is combined with all matching records from R in that bucket.
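To make J1 and J4 concrete, here is a toy sketch of a nested-loop join and an in-memory hash join over Python lists of tuples. The table contents and attribute positions are invented for illustration; a real implementation operates on disk blocks and buffers:

# R plays EMPLOYEE with join attribute DNO at position 0;
# S plays DEPARTMENT with join attribute DNUMBER at position 0.
R = [(1, "Smith"), (2, "Wong")]
S = [(1, "Research"), (3, "Headquarters")]

def nested_loop_join(R, S, a, b):
    # J1: for each record t in R, scan all of S and test the join condition t[a] = s[b]
    return [t + s for t in R for s in S if t[a] == s[b]]

def hash_join(R, S, a, b):
    # J4 partitioning phase: hash the smaller file (R) on its join attribute
    table = {}
    for t in R:
        table.setdefault(t[a], []).append(t)
    # J4 probing phase: one pass over S, probing the matching bucket for each record
    return [t + s for s in S for t in table.get(s[b], [])]

print(nested_loop_join(R, S, 0, 0))   # [(1, 'Smith', 1, 'Research')]
print(hash_join(R, S, 0, 0))          # same result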
Heuristic-Based Query Optimization
1. Break up SELECT operations with conjunctive conditions into a cascade of SELECT operations
2. Using the commutativity of SELECT with other operations, move each SELECT operation as far
down the query tree as is permitted by the attributes involved in the select condition
3. Using commutativity and associativity of binary operations, rearrange the leaf nodes of the tree
4. Combine a CARTESIAN PRODUCT operation with a subsequent SELECT operation in the
tree into a JOIN operation, if the condition represents a join condition
5. Using the cascading of PROJECT and the commuting of PROJECT with other operations, break
down and move lists of projection attributes down the tree as far as possible by creating new
PROJECT operations as needed
6. Identify sub-trees that represent groups of operations that can be executed by a single algorithm
Example
Query
"Find the last names of employees born after 1957 who work on a project named ‘Aquarius’."
SQL
SELECT LNAME
FROM EMPLOYEE, WORKS_ON, PROJECT
WHERE PNAME='Aquarius' AND PNUMBER=PNO AND ESSN=SSN AND BDATE > '1957-12-31';
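Applying the heuristic steps to this query gives, roughly, the following transformation (a sketch only; the attribute names follow the usual EMPLOYEE/WORKS_ON/PROJECT schema assumed by the query, and the intermediate query trees are omitted):
Canonical form: π LNAME ( σ PNAME='Aquarius' AND PNUMBER=PNO AND ESSN=SSN AND BDATE>'1957-12-31' (EMPLOYEE × WORKS_ON × PROJECT) )
After steps 1–2 (cascade the SELECT and push each condition down): σ PNAME='Aquarius' is applied directly to PROJECT and σ BDATE>'1957-12-31' directly to EMPLOYEE.
After steps 3–5 (rearrange leaves, convert products with join conditions into JOINs, push projections down):
π LNAME ( ( σ PNAME='Aquarius' (PROJECT) ⋈PNUMBER=PNO WORKS_ON ) ⋈ESSN=SSN σ BDATE>'1957-12-31' (EMPLOYEE) )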