• Data Normalization process and the normal forms
• Introduction to data normalization
• 1st Normal Form(1st NF)
• 2nd Normal Form (2nd NF)
• 3rd Normal Form (3rd NF)
Introduction
• Conceptual Modeling is a subjective process
• Therefore, the schema after the logical database design
phases may not be very good. (contain redundancies)
• However, there are formalisms to ensure that the
schema is good.
• This process is called Normalization.
Normalization
• Relational database schema = set of relations
• Relations = set of attributes
• However we group the attributes to relations is very
important.
• Too many attributes in a relation
❑ Waste space
❑ Anomalies
• Decomposing the relation into too smaller set of relations.
❑ Loss-less join property
❑ Dependency preserving property.
❑ Example Lecturer (id,name,address,salary,department,building)
Anomalies
• Insertion anomaly
Inserting a new lecturer to the LECTURER table. Department
information is reported (ensure that correct department
information is inserted)
Note: Inserting a department with no employees (impossible
– because null values for id is not allowed)
• Deletion anomaly
Deleting the last lecturer from the department will lose
information about the department.
Anomalies
• Update Anomaly
Updating the department’s building needs to be done for all
lecturers working for that department.
• When redundancies exits , we should decompose the
relations to smaller relations.
Anomalies
• Decomposing the relation into too smaller relations
• Less-less join property : we might lose information if we
decompose relations…
• Dependency-preserving property: The set of dependencies in
S can be verified by a set of dependencies in R1 and R2.
• Example
S R1 R2
S P D S P P D
S1 P1 D1 S1 P1 P1 D1
S2 P2 D2 S2 P2 P2 D2
S3 P3 D3 S3 S3 P1 D3
Anomalies
• Joining them together, we get spurious
tuples…
S P D
S1 P1 D1
S1 P1 D3
S2 P2 D2
S3 P1 P1
S3 P1 D3
Normalization
• To avoid the above mentioned issues in the relational schema,
we can apply a formal process called Normalizations.
• Normalization is based on functional dependence.
Functional Dependency
• Functional dependencies (FDs) are used to specify formal
measures of the "goodness" of relational designs.
• FDs and keys are used to define normal forms for
relations.
• FDs are constraints that are derived from the meaning
and interrelationships of the data attributes.
• A set of attributes X functionally determines a set of
attributes Y if the value of X determines a unique value
for Y.
Functional Dependency
• X -> Y holds if whenever two tuples have the same value
for X, they must have the same value for Y.
• For any two tuples t1 and t2 in any relation instance r(R):
If t1[X]=t2[X], then t1[Y]=t2[Y]
• X -> Y in R specifies a constraint on all relation instances
r(R) Written as X -> Y;
• can be displayed graphically on a relation schema as in
Figures
• FDs are derived from the real-world constraints on the
attributes
Functional Dependency
• social security number determines employee name
SSN -> ENAME
• project number determines project name and location
PNUMBER -> {PNAME, PLOCATION}
• employee ssn and project number determines the hours per
week that the employee works on the project
{SSN, PNUMBER} -> HOURS
Functional Dependency
• An FD is a property of the attributes in the schema R
• The constraint must hold on every relation instance r(R)
• If K is a key of R, then K functionally determines all attributes
in R (since we never have two distinct tuples with t1[K]=t2[K])
Keys
• A superkey of a relation schema R = {A1, A2, ...., An} is a set of
attributes S subset-of R with the property that no two tuples
t1 and t2 in any legal relation state r of R will have t1[S] =t2[S]
• A key K is a superkey with the additional property that
removal of any attribute from K will cause K not to be a
superkey any more.
Keys
• If a relation schema has more than one key, each is called a
candidate key. One of the candidate keys is arbitrarily
designated to be the primary key, and the others are called
secondary keys.
• A Prime attribute must be a member of some candidate key
• A Nonprime attribute is not a prime attribute—that is, it is
not a member of any [Link] key
1st Normal Form
• Disallows composite attributes, multivalued attributes, and
nested relations; attributes whose values for an individual
tuple are non-atomic.
• A relation R is in first normal form (1NF) if domains of all
attirbutes in the relations are atomic.
• Considered to be part of the definition of relation
1st Normal Form
2nd Normal Form
• Uses the concepts of FDs, primary key
• Prime attribute - attribute that is member of the primary
key K
• Full functional dependency - a FD Y -> Z where removal
of any attribute from Y means the FD does not hold any
more
{SSN, PNUMBER} -> HOURS is a full FD since neither SSN
-> HOURS nor PNUMBER -> HOURS hold
- {SSN, PNUMBER} -> ENAME is not a full FD (it is called a
partial dependency ) since SSN -> ENAME also holds
2nd Normal Form
• A relation schema R is in second normal form (2NF) if every
non-prime attribute A in R is fully functionally dependent on
the primary key
• R can be decomposed into 2NF relations via the process of
2NF normalization
2nd Normal Form
3rd Normal Form
• Transitive functional dependency
• a FD X -> Z that can be derived from two FDs X -> Y
and Y -> Z
- SSN -> DMGRSSN is a transitive FD since
SSN -> DNUMBER and DNUMBER -> DMGRSSN hold
-SSN -> ENAME is non-transitive since there is no set of
attributes X where SSN -> X and X -> ENAME
3rd Normal Form
• A relation schema R is in third normal form (3NF) if it is in 2NF
and no non-prime attribute A in R is transitively dependent
on the primary key
• R can be decomposed into 3NF relations via the process of
3NF normalization
NOTE:
In X -> Y and Y -> Z, with X as the primary key, we consider this
a problem only if Y is not a candidate key. When Y is a
candidate key, there is no problem with the transitive
dependency .
E.g., Consider EMP (SSN, Emp#, Salary ).
Here, SSN -> Emp# -> Salary and Emp# is a candidate key
• Normalization
• Redancadancy
• Anomalies
• Insert, delete and update anomalies
• Candidate key, primary key, prime attribute , non prime
attribute.
• 1st normal, 2nd normal form, 3 rd normal form.
• There are many Normal Forms proposed to reduce
redundancies.
• Some of the well-known ones are:
• 1st Normal Form
• 2nd Normal Form
• 3rd Normal Form
• Each key of a relation is called candidate key.
• A candidate key is chosen to be the primary key.
• An attribute key which is a member of a candidate key is
prime attribute.
• Discuss the attribute semantics as an informal measure of
goodness for a relation schema.
• What is insertion, deletion and modification anomalies.
• Why are they considered bad? Illustrate with example.
• What is functional dependency?
• Define the first, second and third normal forms with a
example.