Unit 3 Notes1
Unit 3 Notes1
Topic: schema refinement and normalization Class: 2nd CSE-B ,1ST SEM.
UNIT-3
Schema Refinement:
• The Schema Refinement refers to refine the schema by using some technique. The
best technique of schema refinement is decomposition.
• Normalization or Schema Refinement is a technique of organizing the data in the
database.
• It is a systematic approach of decomposing tables to eliminate data redundancy and
undesirable characteristics like Insertion, Update and Deletion Anomalies.
Redundancy:
• Redundancy refers to repetition of same data or duplicate copies of same data stored
in different locations.
• Redundancy means having multiple copies of same data in the database.
As it can be observed that values of attribute college name, college rank, course is
being repeated which can lead to problems
Anomalies:
• Anomalies refers to the problems occurred after poorly planned and normalized
databases where all the data is stored in one table which is sometimes called a flat file
database.
• Problems caused due to redundancy are:
1
DATA BASE MANAGEMENT SYSTEMS
Topic: schema refinement and normalization Class: 2nd CSE-B ,1ST SEM.
Here all the data is stored in a single table which causes redundancy of data or say anomalies
as SID and Sname are repeated once for same CID . Let us discuss anomalies one by one.
Due to redundancy of data we may get the following problems
Insertion anomalies: It may not be possible to store some information unless some other
information is stored as well.
Redundant storage: some information is stored repeatedly
Update anomalies: If one copy of redundant data is updated, then inconsistency is created
unless all redundant copies of data are updated.
Deletion anomalies: It may not be possible to delete some information without losing some
other information as well.
2
DATA BASE MANAGEMENT SYSTEMS
Topic: schema refinement and normalization Class: 2nd CSE-B ,1ST SEM.
3
DATA BASE MANAGEMENT SYSTEMS
Topic: schema refinement and normalization Class: 2nd CSE-B ,1ST SEM.
Because of insertion of some data, It is forced to insert some other dummy data.
Deletion Anomaly:
• Deletion of S3 student cause the deletion of course.
• Because of deletion of some data forced to delete some other useful data.
Solutions To Anomalies
Decomposition of Tables –
Schema Refinement
4
DATA BASE MANAGEMENT SYSTEMS
Topic: schema refinement and normalization Class: 2nd CSE-B ,1ST SEM.
5
DATA BASE MANAGEMENT SYSTEMS
FUNCTIONAL DEPENDANCY:
• It is a relationship that exists when one attribute uniquely determines another attribute
X Y
ID NAME SUR_NAME
S1 BHANU P
S2 SIVA G
S3 PRIYA M
• ID as identification key
• IDNAME
• ID__SUR_NAME (Functionally dependant table)
6
DATA BASE MANAGEMENT SYSTEMS
Functional dependencies:
• Functional dependency is a relationship that exist when one attribute uniquely
determines another attribute.
• Functional dependency is a form of integrity constraint that can identify schema with
redundant storage problems and to suggest refinement.
• A functional dependency AB in a relation holds true if two tuples having the same
value of attribute A also have the same value of attribute B
• IF t1.X=t2.X then t1.Y=t2.Y where t1,t2 are tuples and X,Y are attributes.
Fully functional dependencies
• Example:ABCD (D is fully FD on ABC)
• ABC combining ,uniquely determines D
• D cannot depends on any subset of ABC
As ..BCD (not possible as FFD)
CD
AD
Because BC cannot determines D,C cannot determines D and A cannot determines D
s1 pinky p2 5
s2 lucky p1 6
7
DATA BASE MANAGEMENT SYSTEMS
8
DATA BASE MANAGEMENT SYSTEMS
Attribute closure
Attribute closure of an attribute set can be defined as set of attributes which can be
functionally determined from it.
NOTE:
To find attribute closure of an attribute set-
1) Add elements of attribute set to the result set.
2) Recursively add elements to the result set which can be functionally determined from the
elements of result set.
9
DATA BASE MANAGEMENT SYSTEMS
Attributes which are parts of any candidate key of relation are called as prime attribute, others
are non-prime attributes.
Candidate Key:
• Candidate Key is minimal set of attributes of a relation which can be used to identify
a tuple uniquely.
• Consider student table: student(sno, sname,sphone,age)
• We can take sno as candidate key. we can have more than 1 candidate key in a table.
types of candidate keys:
• simple(having only one attribute)
• composite(having multiple attributes as candidate key)
Super Key is set of attributes of a relation which can be used to identify a tuple uniquely.
– Adding zero or more attributes to candidate key generates super key.
• A candidate key is a super key but vice versa is not true
Normalization:
Normalization is a process of designing a consistent database with minimum redundancy
which support data integrity by grating or decomposing given relation into smaller relations
preserving constraints on the relation
Characteristics of normalisation:
• Scalar values in each field
• Minimal use of null values
• Absence of redundancy
10
DATA BASE MANAGEMENT SYSTEMS
Normalization removes data redundancy and it will helps in designing a good data base
which involves a set of normal forms as follows –
• 1)First normal form(1NF)
• 2)Second normal form(2NF)
• 3)Third normal form(3NF)
• 4)Boyce coded normal form(BCNF)
• 5)Forth normal form(4NF)
• 6)Fifth normal form(5NF
•
• )
11
DATA BASE MANAGEMENT SYSTEMS
12
DATA BASE MANAGEMENT SYSTEMS
Domain Courses
Programming C , java
• The above table consist of multiple values in single columns which can be reduced
into atomic values by using first normal form as follows-
Domain Courses
Programming C
Programming Java
13
DATA BASE MANAGEMENT SYSTEMS
Here (student id, project id) are key attributes and (student name, project name) are non-
prime attributes.
It is decomposed as-
14
DATA BASE MANAGEMENT SYSTEMS
(Or)
student
subject
sub_id Sub_name
score
Marks depends on all the key values but teacher depends on part of the primary key
TEACHER column depends on part of the primary key leads partial dependency
15
DATA BASE MANAGEMENT SYSTEMS
16
DATA BASE MANAGEMENT SYSTEMS
Total_marks as non prime attribute depends on another nonprime attribute exam _name
score
exam
17
DATA BASE MANAGEMENT SYSTEMS
STD COURSE
Pinky DBMS
lucky DBMS
deepu COA
sonu COA
jhansi DBMS
18
DATA BASE MANAGEMENT SYSTEMS
COURSE TEACHER
DBMS Priya
DBMS Madhu
COA bhanu
21 Computer Dancing
21 Math Singing
19
DATA BASE MANAGEMENT SYSTEMS
34 Chemistry Dancing
74 Biology Cricket
59 Physics Hockey
The given STUDENT table is in 3NF, but the COURSE and HOBBY are two independent
entity. Hence, there is no relationship between COURSE and HOBBY
• In the STUDENT relation, a student with STU_ID, 21 contains two
courses, Computer and Math and two hobbies, Dancing and Singing. So there is a
Multi-valued dependency on STU_ID, which leads to unnecessary repetition of data.
• So to make the above table into 4NF, we can decompose it into two tables:
STUDENT_COURSE
STU_ID COURSE
21 Computer
21 Math
34 Chemistry
74 Biology
59 Physics
STUDENT_HOBBY
STU_ID HOBBY
20
DATA BASE MANAGEMENT SYSTEMS
21 Dancing
21 Singing
34 Dancing
74 Cricket
59 Hockey
21
DATA BASE MANAGEMENT SYSTEMS
What is decomposition?
• Decomposition is the process of breaking down in parts or elements.
• It replaces a relation with a collection of smaller relations.
• It breaks the table into multiple tables in a database.
• It should always be lossless, because it confirms that the information in the original
relation can be accurately reconstructed based on the decomposed relations.
• If there is no proper decomposition of the relation, then it may lead to problems like
loss of information.
Properties of Decomposition
Following are the properties of Decomposition,
1. Lossless Decomposition
2 .Loosy decomposition
3. Dependency Preservation
4. Lack of Data Redundancy
1. Lossless Decomposition
* If the information is not lost from the relation that is decomposed, then the
decomposition will be lossless.
* The lossless decomposition guarantees that the join of relations will result in the same
relation as it was decomposed.
* The relation is said to be lossless decomposition if natural joins of all the
decomposition give the original relation.
22
DATA BASE MANAGEMENT SYSTEMS
Example
EMPLOYEE_DEPARTMENT table:
The above relation is decomposed into two relations EMPLOYEE and DEPARTMENT
EMPLOYEE table:
EMP_ID EMP_NAME EMP_AGE EMP_CITY
22 Denim 28 Mumbai
33 Alina 25 Delhi
46 Stephan 30 Bangalore
52 Katherine 36 Mumbai
60 Jack 40 Noida
23
DATA BASE MANAGEMENT SYSTEMS
DEPARTMENT table
DEPT_ID EMP_ID DEPT_NAME
827 22 Sales
438 33 Marketing
869 46 Finance
575 52 Production
678 60 Testing
Now, when these two relations are joined on the common column "EMP_ID", then the
resultant relation will look like:
Employee⋈ Department
Lossy Decomposition:
"The decomposition of relation R into R1 and R2 is lossy when the join of R1 and R2 does
not yield the same relation as in R."
One of the disadvantages of decomposition into two or more relational schemes (or tables) is
that some information is lost during retrieval of original relation or table.
Consider that we have table STUDENT with three attribute roll_no , sname and department
24
DATA BASE MANAGEMENT SYSTEMS
STUDENT:
Roll_no Sname Dept
111 parimal
222 parimal
Name_dept :
Sname Dept
parimal COMPUTER
parimal ELECTRICAL
In lossy decomposition, spurious tuples are generated when a natural join is applied to the
relations in the decomposition
stu_joined:
25
DATA BASE MANAGEMENT SYSTEMS
3. Dependency Preserving
Join dependency
* a join dependency is a constraint on the set of legal relations over a database scheme.
* A table T is subject to a join dependency if T can always be recreated by joining
multiple tables each having a subset of the attributes of T. If one of the tables in the
join has all the attributes of the table T, the join dependency is called trivial.
* The join dependency plays an important role in the Fifth normal form, also known
as project-join normal form, because it can be proven that if a scheme R is
decomposed in tables R1to Rn, the decomposition will be a lossless-join
decomposition if the legal relations on R are restricted to a join dependency on
R called *(R1,R2,.. ,Rn).
26
DATA BASE MANAGEMENT SYSTEMS
* A relation is in 5NF if it is in 4NF and not contains any join dependency and joining
should be lossless.
* 5NF is satisfied when all the tables are broken into as many tables as possible in order
to avoid redundancy.
* 5NF is also known as Project-join normal form (PJ/NF).
Example
In the above table, John takes both Computer and Math class for Semester 1 but he doesn't
take Math class for Semester 2. In this case, combination of all these fields required to
identify a valid data.
* Suppose we add a new Semester as Semester 3 but do not know about the subject and
who will be taking that subject so we leave Lecturer and Subject as NULL. But all
three columns together acts as a primary key, so we can't leave other two columns
blank.
* So to make the above table into 5NF, we can decompose it into three relations P1, P2
& P3:
27
DATA BASE MANAGEMENT SYSTEMS
P1
SEMESTER SUBJECT
Semester 1 Computer
Semester 1 Math
Semester 1 Chemistry
Semester 2 Math
P2
SUBJECT LECTURER
Computer Anshika
Computer John
Math John
Math Akash
Chemistry Praveen
P3
SEMSTER LECTURER
Semester 1 Anshika
Semester 1 John
Semester 1 John
Semester 2 Akash
Semester 1 Praveen
28
DATA BASE MANAGEMENT SYSTEMS
Transaction Management
Transactions in DBMS:
Transactions are a set of operations used to perform a logical set of work. A transaction
usually means that the data in the database has changed. One of the major uses of DBMS is
to protect the user’s data from system failures. It is done by ensuring that all the data is
restored to a consistent state when the computer is restarted after a crash. The transaction is
any one execution of the user program in a DBMS. Executing the same program multiple
times will generate multiple transactions.
Example –
Set of Operations:
Consider the following example for transaction operations as follows.
Example -ATM transaction steps.
Transaction Start.
Insert your ATM card.
Select language for your transaction.
Select Savings Account option.
Enter the amount you want to withdraw.
Enter your secret pin.
Wait for some time for processing.
Collect your Cash.
Transaction Completed.
Three operations can be performed in a transaction as follows.
1. Read/Access data (R).
2. Write/Change data (W).
3. Commit.
Example –
Transfer of 50₹ from Account A to Account B. Initially A= 500 ₹, B= 800 ₹. This data is
brought to RAM from Hard Disk.
29
DATA BASE MANAGEMENT SYSTEMS
All instructions before commit come under a partially committed state and are stored in
RAM. When the commit is read the data is fully accepted and is stored in Hard Disk.
If the data is failed anywhere before commit we have to go back and start from the
beginning. We can’t continue from the same state. This is known as Roll Back.
Uses of Transaction Management:
The DBMS is used to schedule the access of data concurrently. It means that the user
can access multiple data from the database without being interfered with each other.
Transactions are used to manage concurrency.
It is also used to satisfy ACID properties.
It is used to solve Read/Write Conflict.
It is used to implement Recoverability, Serializability, and Cascading.
Transaction Management is also used for Concurrency Control Protocols and Locking
of data.
A transaction is a set of logically related operations. For example, you are transferring
money from your bank account to your friend’s account, the set of operations would be like
this:
This whole set of operations can be called a transaction. Although I have shown you
read, write and update operations in the above example but the transaction can have
operations like read, write, insert, update, delete.
30
DATA BASE MANAGEMENT SYSTEMS
Let’s say your account is A and your friend’s account is B, you are transferring 10000 from A
to B, the steps of the transaction are:
1. R(A);
2. A = A - 10000;
3. W(A);
4. R(B);
5. B = B + 10000;
6. W(B);
In the above transaction R refers to the Read operation and W refers to the write operation.
Transaction failure in between the operations
Now that we understand what transaction is, we should understand what are the problems
associated with it.
The main problem that can happen during a transaction is that the transaction can fail before
finishing the all the operations in the set. This can happen due to power failure, system crash
etc. This is a serious problem that can leave database in an inconsistent state. Assume that
transaction fail after third operation (see the example above) then the amount would be
deducted from your account but your friend will not receive it.
Commit: If all the operations in a transaction are completed successfully then commit those
changes to the database permanently.
Rollback: If any of the operation fails then rollback all the changes done by previous
operations.
Even though these operations can help us avoiding several issues that may arise during
transaction but they are not sufficient when two transactions are running concurrently. To
handle those problems we need to understand database ACID properties.
31
DATA BASE MANAGEMENT SYSTEMS
To ensure the integrity of data during a transaction (A transaction is a unit of program that
updates various data items, read more about it here), the database system maintains the
following properties. These properties are widely known as ACID properties:
Atomicity: This property ensures that either all the operations of a transaction reflect in
database or none. Let’s take an example of banking system to understand this: Suppose
Account A has a balance of 400$ & B has 700$. Account A is transferring 100$ to
Account B. This is a transaction that has two operations a) Debiting 100$ from A’s
balance b) Creating 100$ to B’s balance. Let’s say first operation passed successfully
while second failed, in this case A’s balance would be 300$ while B would be having
700$ instead of 800$. This is unacceptable in a banking system. Either the transaction
should fail without executing any of the operation or it should process both the
operations. The Atomicity property ensures that.
Consistency: To preserve the consistency of database, the execution of transaction
should take place in isolation (that means no other transaction should run concurrently
when there is a transaction already running). For example account A is having a
balance of 400$ and it is transferring 100$ to account B & C both. So we have two
transactions here. Let’s say these transactions run concurrently and both the
transactions read 400$ balance, in that case the final balance of A would be 300$
instead of 200$. This is wrong. If the transaction were to run in isolation then the
second transaction would have read the correct balance 300$ (before debiting 100$)
once the first transaction went successful.
Isolation: For every pair of transactions, one transaction should start execution only
when the other finished execution. I have already discussed the example of Isolation in
the Consistency property above.
Durability: Once a transaction completes successfully, the changes it has made into
the database should be permanent even if there is a system failure. The recovery-
management component of database systems ensures the durability of transaction.
In this guide, we will discuss the states of a transaction in DBMS. A transaction in DBMS
can be in one of the following states.
32
DATA BASE MANAGEMENT SYSTEMS
Active State
Failed State
As we can see in the above diagram that a transaction goes into “partially committed” state
from the active state when there are read and write operations present in the transaction.
A transaction contains number of read and write operations. Once the whole transaction is
successfully executed, the transaction goes into partially committed state where we have all
the read and write operations performed on the main memory (local memory) instead of the
actual database.
The reason why we have this state is because a transaction can fail during execution so if we
are making the changes in the actual database instead of local memory, database may be left
in an inconsistent state in case of any failure. This state helps us to rollback the changes
made to the database in case of a failure during execution
33
DATA BASE MANAGEMENT SYSTEMS
Committed State
If a transaction completes the execution successfully then all the changes made in the local
memory during partially committed state are permanently stored in the database. You can
also see in the above diagram that a transaction goes from partially committed state to
committed state when everything is successful.
Committed State
If a transaction completes the execution successfully then all the changes made in the local
memory during partially committed state are permanently stored in the database. You can
also see in the above diagram that a transaction goes from partially committed state to
committed state when everything is successful.
Aborted State
As we have seen above, if a transaction fails during execution then the transaction goes into a
failed state. The changes made into the local memory (or buffer) are rolled back to the
previous consistent state and the transaction goes into aborted state from the failed state.
Refer the diagram to see the interaction between failed and aborted state.
We know that transactions are set of instructions and these instructions perform operations on
database. When multiple transactions are running concurrently then there needs to be a
sequence in which the operations are performed because at a time only one operation can be
performed on the database. This sequence of operations is known as Schedule.
The following sequence of operations is a schedule. Here we have two transactions T1 & T2
which are running concurrently. This schedule determines the exact order of operations that
are going to be performed on database. In this example, all the instructions of transaction T1
are executed before the instructions of transaction T2,
34
DATA BASE MANAGEMENT SYSTEMS
T1 T2
---- ----
R(X)
W(X)
R(Y)
R(Y)
R(X)
W(Y)
We have various types of schedules in DBMS. Let’s discuss them one by one.
Serial schedule: A schedule where the operations of each transaction are executed
consequently without any interleaved operations from other transactions.
35
DATA BASE MANAGEMENT SYSTEMS
Non serial schedule: A schedule the operations from a set of concurrent transactions are
interleaved.
TIME T1 T2 T1 T2 T1 T2
36
DATA BASE MANAGEMENT SYSTEMS
A transaction consists of many steps. Some involve I/O activity; others involve CPU activity.
The CPU and the disks in a computer system can operate in parallel. Therefore, I/O activity
can be done in parallel with processing at the CPU.
The parallelism of the CPU and the I/O system can therefore be exploited to run multiple
transactions in parallel.
While a read or write on behalf of one transaction is in progress on one disk, another
transaction can be running in the CPU, while another disk may be executing a read or write
on behalf of a third transaction.
All of this increases the throughput of the system—that is, the number of transactions
executed in a given amount of time.
Correspondingly, the processor and disk utilization also increase; in other words, the
processor and disk spend less time idle, or not performing any useful work.
There may be a mix of transactions running on a system, some short and some long.
If transactions run serially, a short transaction may have to wait for a preceding long
transaction to complete, which can lead to unpredictable delays in running a transaction.
If the transactions are operating on different parts of the database, it is better to let them run
concurrently, sharing the CPU cycles and disk accesses among them.
Concurrent execution reduces the unpredictable delays in running transactions.
Moreover, it also reduces the average response time: the average time for a transaction to be
completed after it has been submitted.
37
DATA BASE MANAGEMENT SYSTEMS
The database system must control the interaction among the concurrent transactions to
prevent them from destroying the consistency of the database. It is achieved
using concurrency-control schemes.
DBMS Serializability
When multiple transactions are running concurrently then there is a possibility that the
database may be left in an inconsistent state. Serializability is a concept that helps us to check
which schedules are serializable. A serializable schedule is the one that always leaves the
database in consistent state.
A serializable schedule always leaves the database in consistent state. A serial schedule is
always a serializable schedule because in serial schedule, a transaction only starts when the
other transaction finished execution. However a non-serial schedule needs to be checked for
Serializability.A non-serial schedule of n number of transactions is said to be serializable
schedule, if it is equivalent to the serial schedule of those n transactions. A serial schedule
doesn’t allow concurrency, only one transaction executes at a time and the other starts when
the already running transaction finished.
Types of Serializability
Conflict Serializability is one of the type of Serializability, which can be used to check
whether a non-serial schedule is conflict serializable or not.What is Conflict Serializability?
A schedule is called conflict serializable if we can convert it into a serial schedule after
swapping its non-conflicting operations.
Conflicting operations
Two operations are said to be in conflict, if they satisfy all the following three conditions:
38
DATA BASE MANAGEMENT SYSTEMS
T1 T2
----- ------
R(A)
R(B)
R(A)
R(B)
W(B)
W(A)
To convert this schedule into a serial schedule we must have to swap the R(A) operation of
transaction T2 with the W(A) operation of transaction T1. However we cannot swap these
39
DATA BASE MANAGEMENT SYSTEMS
two operations because they are conflicting operations, thus we can say that this given
schedule is not Conflict Serializable.
T1 T2
----- ------
R(A)
R(A)
R(B)
W(B)
R(B)
W(A)
T1 T2
----- ------
R(A)
R(A)
R(B)
W(B)
R(B)
W(A)
T1 T2
----- ------
R(A)
R(B)
R(A)
W(B)
R(B)
W(A)
After swapping R(A) of T1 and W(B) of T2 we get:
T1 T2
----- ------
R(A)
R(B)
W(B)
R(A)
R(B)
W(A)
40
DATA BASE MANAGEMENT SYSTEMS
We finally got a serial schedule after swapping all the non-conflicting operations so we can
say that the given schedule is Conflict Serializable.
View Serializability is a process to find out that a given schedule is view serializable or not.
To check whether a given schedule is view serializable, we need to check whether the given
schedule is View Equivalent to its serial schedule. Lets take an example to understand what I
mean by that.
Given Schedule:
T1 T2
----- ------
R(X)
W(X)
R(X)
W(X)
R(Y)
W(Y)
R(Y)
W(Y)
41
DATA BASE MANAGEMENT SYSTEMS
T1 T2
----- ------
R(X)
W(X)
R(Y)
W(Y)
R(X)
W(X)
R(Y)
W(Y)
If we can prove that the given schedule is View Equivalent to its serial schedule then the
given schedule is called view Serializable.
Recoverability in DBMS
As discussed, a transaction may not execute completely due to hardware failure, system
crash or software issues. In that case, we have to roll back the failed transaction. But some
other transaction may also have used values produced by the failed transaction. So we have
to roll back those transactions as well.
Recoverable Schedules:
Schedules in which transactions commit only after all transactions whose changes they
read commit are called recoverable schedules. In other words, if some transaction T j is
reading value updated or written by some other transaction T i, then the commit of
Tj must occur after the commit of T i.
Example 1:
S1: R1(x), W1(x), R2(x), R1(y), R2(y),
W2(x), W1(y), C1, C2;
Given schedule follows order of Ti->Tj => C1->C2. Transaction T1 is executed before T2
hence there is no chances of conflict occur. R1(x) appears before W1(x) and transaction T1
is committed before T2 i.e. completion of first transaction performed first update on data
item x, hence given schedule is recoverable.
42
DATA BASE MANAGEMENT SYSTEMS
Example 2: Consider the following schedule involving two transactions T 1 and T2.
T1 T2
R(A)
W(A)
W(A)
R(A)
commit
commit
This is a recoverable schedule since T 1 commits before T2, that makes the value read by
T2 correct.
43