MODULE 4-
Normalization
Different anomalies in designing a database, The
idea of normalization, Functional dependency,
Armstrong’s Axioms (proofs not required),
Closures and their computation, Equivalence of
Functional Dependencies (FD), Minimal Cover
(proofs not required). First Normal Form (1NF),
Second Normal Form (2NF), Third Normal Form
(3NF), Boyce Codd Normal Form (BCNF),
Lossless join and dependency preserving
decomposition, Algorithms for checking Lossless
Join
(LJ) and Dependency Preserving (DP) properties.
Different anomalies in
designing a database
Ifa table is not properly normalized
and have data redundancy then it
will not only eat up extra memory
space but will also make it difficult
to handle and update the database,
without facing data loss.
Insertion, Updation and Deletion
Anomalies are very frequent if
database is not normalized.
Student table
rollno name branch hod office_tel
401 Akon CSE Mr. X 53337
402 Bkon CSE Mr. X 53337
403 Ckon CSE Mr. X 53337
404 Dkon CSE Mr. X 53337
In the table above, we have data
of 4 Computer Sci. students.
As we can see, data for the
fields branch, hod(Head of
Department) and office_tel is
repeated for the students who
are in the same branch in the
college, this is Data
Redundancy.
Insertion Anomaly
Suppose for a new admission, until
and unless a student opts for a
branch, data of the student cannot
be inserted, or else we will have to
set the branch information as NULL.
Also, if we have to insert data of 100
students of same branch, then the
branch information will be repeated
for all those 100 students.
These scenarios are nothing
but Insertion anomalies.
Updation Anomaly
What if Mr. X leaves the college?
or is no longer the HOD of
computer science department?
In that case all the student
records will have to be updated,
and if by mistake we miss any
record, it will lead to data
inconsistency.
This is Updation anomaly.
Deletion Anomaly
In our Student table, two different
informations are kept together,
Student information and Branch
information.
Hence, at the end of the academic
year, if student records are
deleted, we will also lose the
branch information.
This is Deletion anomaly.
What is Normalization?
Normalization is a database
design technique that reduces data
redundancy and eliminates
undesirable characteristics like
Insertion, Update and Deletion
Anomalies.
Normalization rules divides larger
tables into smaller tables and links
them using relationships.
The purpose of Normalisation in
SQL is to eliminate redundant
Functional dependencies in
DBMS
A functional dependency is a
constraint that specifies the
relationship between two sets of
attributes where one set can
accurately determine the value of
other sets.
It is denoted as X → Y, where X is
a set of attributes that is capable
of determining the value of Y.
X is called Determinant,
dept_nam dept_buildi
roll_no name
e ng
42 abc CO A4
43 pqr IT A3the
44 xyz CO A4
valid functional dependencies:
roll_no
→ { name, dept_name,
dept_building },→
◦ Here, roll_no can determine values of fields
name, dept_name and dept_building, hence a
valid Functional dependency
roll_no → dept_name ,
◦ Since, roll_no can determine whole set of
{name, dept_name, dept_building}, it can
determine its subset dept_name also.
dept_name → dept_building ,
◦ Dept_name can identify the dept_building
accurately, since departments with different
dept_name will also have a different
dept_building
invalid functional
dependencies:
name → dept_name
◦ Students with the same name can
have different dept_name, hence this
is not a valid functional dependency.
dept_building → dept_name
◦ There can be multiple departments
in the same building, For example, in
the above table departments ME and
EC are in the same building B2,
◦ hence dept_building → dept_name is
an invalid functional dependency.
Armstrong’s axioms/properties
of functional dependencies:
Reflexivity: If Y is a subset of X, then X→Y holds by
reflexivity rule
◦ For example, {roll_no, name} → name is valid.
Augmentation: If X → Y is a valid dependency, then
XZ → YZ is also valid by the augmentation rule.
◦ For example, If {roll_no, name} → dept_building is valid,
hence {roll_no, name, dept_name} → {dept_building,
dept_name} is also valid.→
Transitivity: If X → Y and Y → Z are both valid
dependencies, then X→Z is also valid by the Transitivity
rule.
◦ For example, roll_no → dept_name & dept_name →
dept_building, then roll_no → dept_building is also valid.
Types of Functional dependencies
in DBMS:
Fully functional dependency
Partial Functional Dependency
Trivial functional dependency
Non-Trivial functional
dependency
Multivalued functional
dependency
Transitive functional dependency
Trivial functional dependency
A → B has trivial functional
dependency if B is a subset of A.
◦ {Emp_id, Emp_name} -> Emp_id is a
trivial functional dependency as
Emp_id is a subset of
{Emp_id,Emp_name}.
Non Trivial Functional Dependency
when A->B holds true where B is
not a subset of A. In a
relationship, if attribute B is not a
subset of attribute A, then it is
conside
◦ (Company} -> {CEO} (if we know
the Company, we knows the CEO
name)
◦ But CEO is not a subset of Company,
and hence it's non-trivial functional
dependency.
Multivalued Dependency
Multivalued dependency occurs in
the situation where there are
multiple independent multivalued
attributes in a single table.
A multivalued dependency is a
complete constraint between two
sets of attributes in a relation.
It requires that certain tuples be
present in a relation.
Car(carmodel,mafyear,colour)
◦ maf_year and color are independent
of each other but dependent on
car_model.
◦ In this example, these two columns
are said to be multivalue dependent
on car_model.
◦ This dependence can be represented
like this:
◦ car_model -> maf_year
Transitive Dependency in DBMS
Company} -> {CEO} (if we know
the compay, we know its CEO's
name)
{CEO } -> {Age} If we know the
CEO, we know the Age
Therefore according to the rule of
transitive dependency:
{ Company} -> {Age} should
hold, that makes sense because if
we know the company name, we
Closure of an Attribute
Closure of an Attribute: Closure
of an Attribute can be defined as a
set of attributes that can be
functionally determined from it.
Closure of an attribute X is X+
Find the closure of A,B,C,D given
R(A,B,C,D),FD : {A->B,B->D,C->B}
A+=A->B->D
◦ A+=ABD
B+=B->D
=BD
C+=C->B->D
=CBD
D+=D
Closure of attribute A-
Consider a A+ ={A}
relation R ( A , B = { A , B , C }
,C,D,E,F,G ( Using A → BC )
) with the = { A , B , C , D , E }
functional
( Using BC → DE )
dependencies-
= { A , B , C , D , E ,
A → BC
F } ( Using D → F )
BC → DE
= { A , B , C , D , E ,
D → F
F , G } ( Using CF →
CF → G
G)
Thus,
A+ = { A , B , C ,
A → BC D+ ={D}
BC → DE = { D , F }
D → F ( Using D → F )
CF → G
A → BC { B , C }+= { B , C }
BC → DE = { B , C , D , E }
D → F ( Using BC → DE )
CF → G = { B , C , D , E , F } (
Using D → F )
= { B , C , D , E , F , G
} ( Using CF → G )
Thus,
{ B , C } + = { B , C ,
D,E,F,G}
Given relational schema R( P Q R S T U
V) having following attribute P Q R S T U and V,
also there is a set of functional dependency
denoted by FD = { P->Q, QR->ST, PTV-
>V }. Determine Closure of (QR)+ and (PR)+
QR+ = QR FD QR→ST
=QRST
PR + = PR → P → Q
=PRQ →ST
=PRQST
Given relational schema R( P Q R S T) having
following attributes P Q R S and T, also there is
a set of functional dependency denoted by FD
= { P->QR, RS->T, Q->S, T-> P }. Determine
Closure of ( T )+
T+=T → P → QR → S
=TPQRS
Different kinds of keys
candidate key
A candidate key may be defined
as-
◦ A set of minimal attribute(s) that can
identify each tuple uniquely in the
given relation is called as a
candidate key.
OR
◦ A minimal super key is called as a
candidate key.
Considerthe following Student schema-
Student ( roll , name , sex , age ,
address , class , section )
Given below are the examples of candidate
keys-
( class , section , roll )
( name , address )
These are candidate keys because each set
consists of minimal attributes required to
identify each student uniquely in the Student
table.
Let R = (A, B, C, Determine all
D, E, F) be a essential
relation scheme attributes of the
with the given relation.
following Essential
dependencies- attributes of the
C → F relation are- C
E → A and E.
EC → D So, attributes C
A → B and E will
definitely be a
part of every
So, we have-
{ CE }+
C →F
= { C , E }
E → A
= { C , E , F } ( Using C → F
EC → D )
A → B = { A , C , E , F } ( Using E
→A)
= { A , C , D , E , F }
( Using EC → D )
= { A , B , C , D , E , F }
( Using A → B )
We conclude that CE can
determine all the attributes
of the given relation.
So, CE is the only possible
Let R = (A, B,
C, D, E) be a
relation
scheme with
the following
dependencie
s-
AB → C
C → D
B → E
Determine
the total
number of
Closures of a set of functional dependencies
A Closure is a set of FDs is a set
of all possible FDs that can be
derived from a given set of FDs.
It is also referred as
a Complete set of FDs.
If F is used to donate the set of
FDs for relation R, then a closure
of a set of FDs implied by F is
denoted by F+.
Find F+
Closure Algorithm
Closure=X
Repeat
{
Foreach FD u v in F
Such that u C closure
Then set closure = closure U v
}
Until there is no change to closure
Ssn+
a) ssn+
result=ssn
repeat
{
pno →(pname,ploc) (u v)
Pno C ssn
}
result=ssn
repeat
{
ssn →ename (u v)
ssn C ssn
Then result ssn U ename
Result=(ssn,ename)
}
Normalization Rule
Normalization rules are divided
into the following normal forms:
First Normal Form
Second Normal Form
Third Normal Form
BCNF
Fourth Normal Form
Database Normalization is a technique of
organizing the data in the database.
Normalization is a systematic approach of
decomposing tables to eliminate data
redundancy(repetition) and undesirable
characteristics like Insertion, Update and
Deletion Anomalies.
It is a multi-step process that puts data
into tabular form, removing duplicated
data from the relation tables.
Normalization is used for mainly two
purposes,
◦ Eliminating redundant(useless) data.
◦ Ensuring data dependencies make sense i.e
First Normal Form (1NF)
For a table to be in the First
Normal Form, it should follow the
following 4 rules:
It should only have single(atomic)
valued attributes/columns.
Values stored in a column should be of
the same domain
All the columns in a table should have
unique names.
And the order in which data is stored,
does not matter.
roll_no name subject
101 Akon OS, CN
103 Ckon Java
102 Bkon C, C++
Our table already satisfies 3 rules out
of the 4 rules, as all our column names
are unique, we have stored data in the
order we wanted to and we have not
inter-mixed different type of data in
columns.
But out of the 3 different students in
our table, 2 have opted for more than
1 subject.
And we have stored the subject names
in a single column. But as per the 1st
Normal form each column must
contain atomic value.
How to solve this Problem?
roll_no name subject
101 Akon OS
101 Akon CN
103 Ckon Java
102 Bkon C
102 Bkon C++
By doing so, although a few
values are getting repeated but
values for the subject column are
now atomic for each record/row.
Using the First Normal Form, data
redundancy increases, as there
will be many columns with same
data in multiple rows but each
row as a whole will be unique.
Second Normal Form
For a table to be in the Second
Normal Form, it must satisfy two
conditions:
◦ The table should be in the First
Normal Form.
◦ There should be no Partial
Dependency.
What is Dependency?
Let's take an example of
a Student table with
columns student_id, name, reg_no
branch and address .
In this table, student_id is the primary
key and will be unique for every row,
hence we can use student_id to fetch
any row of data from this table
Even for a case, where student names
are same, if we know
the student_id we can easily fetch the
correct record.
student_i name reg_no branch address
d
10 Akon 07-WY CSE Kerala
11 Akon 08-WY IT Gujarat
Hence we can say a Primary
Key for a table is the column or a
group of columns(composite key)
which can unique
if I ask for name of student
with student_id 10 or 11, I will
get it.
So all I need is student_id and
every other column depends on
it, or can be fetched using it.
This is Dependency and we also
Partial Dependency
For a simple table like Student, a
single column like student_id can
uniquely identfy all the records in
a table.
But this is not true all the time.
Let's create another table
for Subject, which will
have subject_id and subject_name
fields and subject_id will be the
primary key.
Subject
subject_id(Primary Key) subject_name
1 Java
2 C++
3 Php
Now we have a Student table
with student information and
another table Subject for storing
subject information.
Let's create another table Score,
to store the marks obtained by
students in the respective
subjects.
We will also be saving name of
the teacher who teaches that
subject along with marks.
Score
score_id student_i subject_id marks teacher
d
1 10 1 70 Java
Teacher
2 10 2 75 C++
Teacher
3 11 1 80 Java
Teacher
In the score table we are saving
the student_id to know which student's marks
are these and subject_id to know for which
subject the marks are for.
Together, student_id + subject_id forms
a Candidate Key for this table, which can be
the Primary key
Now if you look at the Score table, we have a
column names teacher which is only dependent
on the subject, for Java it's Java Teacher and for
C++ it's C++ Teacher & so on.
primary key for this table is a composition of
two columns which
is student_id & subject_id but the teacher's
name only depends on subject, hence
the subject_id, has nothing to do
How to remove Partial Dependency
There can be many different
solutions for this, but out objective
is to remove teacher's name from
Score table.
The simplest solution is to remove
columns teacher from Score table
and add it to the Subject table.
Hence, the Subject table will
become:
The simplest solution is to remove
columns teacher from Score table and
add it to the Subject table. Hence, the
Subject table will become:
id subject_name teacher
1 Java Java Teacher
2 C++ C++ Teacher
3 Php Php Teacher
And our Score table is now in the second
normal form, with no partial dependency
score_id student_id subject_id marks
1 10 1 70
2 10 2 75
3 11 1 80
Third Normal Form (3NF)
A relation will be in 3NF
◦ if it is in 2NF and not contain any
transitive partial dependency.
3NF is used to reduce the data
duplication.
It is also used to achieve the data
integrity.
If there is no transitive
dependency for non-prime
attributes, then the relation must
Employee Table
EMP_ID EMP_NAM EMP_ZIP EMP_STAT EMP_CITY
E E
222 Harry 201010 UP Noida
333 Stephan 02228 US Boston
444 Lan 60007 US Chicago
Super key in the table above:
◦ {EMP_ID}, {EMP_ID, EMP_NAME}, {EMP_ID
, EM
Non-prime attributes: In the given
table, all attributes except EMP_ID are
non-prime.
◦ Here, EMP_STATE & EMP_CITY dependent
on EMP_ZIP and
◦ EMP_ZIP dependent on EMP_ID.
◦ The non-prime attributes (EMP_STATE,
EMP_CITY) transitively dependent on super
key(EMP_ID). It violates the rule of third
normal form.(ab,bc ac)
◦ So need to move the EMP_CITY and
EMP_STATE to the new <EMPLOYEE_ZIP>
Employee Table
EMP_ID EMP_NAME EMP_ZIP
222 Harry 201010
333 Stephan 02228
444 Lan 60007
EMPLOYEE_ZIP table
EMP_ZIP EMP_STATE EMP_CITY
201010 UP Noida
02228 US Boston
60007 US Chicago
Boyce Codd normal form (BCNF)
BCNF is the advance version of
3NF. It is stricter than 3NF.
◦ A table is in BCNF if every functional
dependency X → Y, X is the super
key of the table.
◦ For BCNF, the table should be in 3NF,
and for every FD, LHS is super key.
EMPLOYEE table
EMP_ID EMP_COU EMP_DEPT DEPT_TYP EMP_DEPT
NTRY E _NO
264 India Designing D394 283
264 India Testing D394 300
364 UK Stores D283 232
Inthe above table Functional
dependencies are as follows:
◦ EMP_ID → EMP_COUNTRY
◦ EMP_DEPT → {DEPT_TYPE, EMP_DE
PT_NO}
Candidate key: {EMP-ID,
EMP-DEPT}
The table is not in BCNF because
neither EMP_DEPT nor EMP_ID
alone are keys.
To convert the given table into
BCNF, we decompose it into three
EMP_COUNTRY table:
EMP_ID EMP_COUNTRY
264 India
264 India
EMP_DEPT table:
EMP_DEPT DEPT_TYPE EMP_DEPT_NO
Designing D394 283
Testing D394 300
Stores D283 232
Developing D283 549
EMP_DEPT_MAPPING table
EMP_ID EMP_DEPT
D394 283
D394 300
D283 232
D283 549
Candidate keys:
◦ For the first table: EMP_ID
For the second table: EMP_DEPT
For the third table: {EMP_ID,
EMP_DEPT}
Functional dependencies:
◦ EMP_ID → EMP_COUNTRY
◦ EMP_DEPT → {DEPT_TYPE, EMP_DE
PT_NO}
Now, this is in BCNF because left
side part of both the functional
dependencies is a key.
Fourth Normal Form (4NF)
For a table to satisfy the Fourth
Normal Form, it should satisfy the
following two conditions:
◦ It should be in the Boyce-Codd
Normal Form.
◦ And, the table should not have
any Multi-valued Dependency.
STUDENT
STU_ID COURSE HOBBY
21 Computer Dancing
21 Math Singing
34 Chemistry Dancing
The given STUDENT table is in 3NF, but
the COURSE and HOBBY are two
independent entity. Hence, there is no
relationship between COURSE and
HOBBY.
In the STUDENT relation, a student with
STU_ID, 21 contains two
courses, Computer and Math and two
hobbies, Dancing and Singing. So
there is a Multi-valued dependency on
STU_ID, which leads to unnecessary
repetition of data.
So to make the above table into 4NF, we
can decompose it into two tables:
STUDENT_COURSE
STU_ID COURSE
21 Computer
21 Math
34 Chemistry
STUDENT_HOBBY
STU_ID HOBBY
21 Dancing
21 Singing
34 Dancing
Relational Decomposition
When a relation in the relational model is
not in appropriate normal form then the
decomposition of a relation is required.
In a database, it breaks the table into
multiple tables.
If the relation has no proper
decomposition, then it may lead to
problems like loss of information.
Decomposition is used to eliminate some
of the problems of bad design like
anomalies, inconsistencies, and
redundancy.
Types of Decomposition
Lossless Decomposition
If the information is not lost from
the relation that is decomposed,
then the decomposition will be
lossless.
The lossless decomposition
guarantees that the join of relations
will result in the same relation as it
was decomposed.
The relation is said to be lossless
decomposition if natural joins of all
Lossless Join Decomposition
If we decompose a relation R into
relations R1 and R2,
Decomposition is lossy if R1 ⋈
R2 ⊃ R
Decomposition is lossless if R1 ⋈
R2 = R
EMPLOYEE_DEPARTMENT
table:
EMP_ID EMP_NA EMP_AG EMP_CI DEPT_ID DEPT_N
ME E TY AME
22 Denim 28 Mumbai 827 Sales
33 Alina 25 Delhi 438 Marketi
ng
46 Stephan 30 Bangalo 869 Finance
re
52 Katheri 36 Mumbai 575 Producti
ne on
60 Jack 40 Noida 678 Testing
The above relation is decomposed
into two relations EMPLOYEE and
DEPARTMENT
EMP_ID EMP_NAME EMP_AGE EMP_CITY
22 Denim 28 Mumbai
33 Alina 25 Delhi
46 Stephan 30 Bangalore
52 Katherine 36 Mumbai
60 Jack 40 Noida
DEPARTMENT table
DEPT_ID EMP_ID DEPT_NAME
827 22 Sales
438 33 Marketing
869 46 Finance
575 52 Production
678 60 Testing
Employee ⋈
Department
EMP_ID EMP_NA EMP_AG EMP_CI DEPT_ID DEPT_N
ME E TY AME
22 Denim 28 Mumbai 827 Sales
33 Alina 25 Delhi 438 Marketi
ng
46 Stephan 30 Bangalo 869 Finance
re
52 Katheri 36 Mumbai 575 Producti
ne on
60 Jack 40 Noida 678 Testing
To check for lossless join
decomposition using FD set,
following conditions must hold:
Union of Attributes of R1 and R2 must
be equal to attribute of R. Each attribute
of R must be either in R1 or in R2.
◦ Att(R1) U Att(R2) = Att(R)
Intersectionof Attributes of R1 and R2
must not be NULL.
◦ Att(R1) ∩ Att(R2) ≠ Φ
Common attribute must be a key for at
least one relation (R1 or R2)
◦ Att(R1) ∩ Att(R2) -> Att(R1) or Att(R1) ∩
Att(R2) -> Att(R2)
A relation R (A, B, C, D) with FD set{A-
>BC}.Perform decomposition and check
whether it is lossy or lossless ?
R is decomposed into R1(ABC) and
R2(AD)
First condition holds true as Att(R1) U
Att(R2) = (ABC) U (AD) = (ABCD) =
Att(R).
Second condition holds true as Att(R1)
∩ Att(R2) = (ABC) ∩ (AD) ≠ Φ
Third condition holds true as Att(R1) ∩
Att(R2) = A is a key of R1(ABC)
because A->BC is given.
Dependency Preserving
It is an important constraint of the database.
In the dependency preservation, at least one
decomposed table must satisfy every
dependency.
If a relation R is decomposed into relation R1
and R2, then the dependencies of R either
must be a part of R1 or R2 or must be derivable
from the combination of functional
dependencies of R1 and R2.
For example, suppose there is a relation R (A,
B, C, D) with functional dependency set (A-
>BC).
The relational R is decomposed into R1(ABC)
and R2(AD) which is dependency preserving