Normalization
Normalization is the process of structuring a database to reduce data redundancy and improve data integrity. It
organizes tables and columns to eliminate duplicate data and ensure that data dependencies make logical sense.
The primary goal is to break down a large table into smaller, more manageable tables and establish relationships
between them.
Unnormalized Table Example
Let's start with an unnormalized table named Student Course Enrollment that contains information about
students, their courses, and their instructors. An illustrative version (the exact values are assumed here
for the sake of the example) might look like this:

StudentID   StudentName   Courses          Instructor
101         Alice         Math, Physics    Dr. Smith, Dr. Jones
102         Bob           Chemistry        Dr. Lee
101         Alice         Biology          Dr. Brown

Problems:
Multiple values in the Courses and Instructor columns (bad for searching and updating)
Redundancy: "Alice" appears twice
First Normal Form (1NF)
A table is in 1NF if it eliminates repeating groups and ensures that all data values are atomic, meaning they are
indivisible. This means a single cell in a table should not contain multiple values, like a comma-separated list of
items or an array. To achieve this, you break up a table with repeating information into separate rows.
Rule: A table is in 1NF if it meets the following criteria:
1. Atomicity: Each column contains atomic (indivisible) values.
2. No Repeating Groups: There are no repeating groups of columns.
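As a minimal sketch of the fix (using SQLite via Python; the table and column names are illustrative, not taken from a specific schema in the text), each course gets its own row so that every cell holds a single atomic value:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# 1NF: one atomic course value per row -- no comma-separated lists.
cur.execute("""CREATE TABLE Enrollment (
    StudentID   INTEGER,
    StudentName TEXT,
    Course      TEXT,
    Instructor  TEXT,
    PRIMARY KEY (StudentID, Course)
)""")
cur.executemany(
    "INSERT INTO Enrollment VALUES (?, ?, ?, ?)",
    [(101, "Alice", "Math",      "Dr. Smith"),
     (101, "Alice", "Physics",   "Dr. Jones"),
     (102, "Bob",   "Chemistry", "Dr. Lee")])

# Each cell now holds a single value, so filtering by course is simple.
rows = cur.execute(
    "SELECT StudentName FROM Enrollment WHERE Course = 'Physics'").fetchall()
print(rows)  # [('Alice',)]
```

Note that the composite primary key (StudentID, Course) also removes the repeating group: a student with three courses simply has three rows.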
Second Normal Form (2NF)
A table is in 2NF if it's already in 1NF and all its non-key attributes are fully dependent on the entire primary
key. This rule primarily applies to tables with a composite primary key (a key made up of two or more
columns). If a non-key column depends on only part of the primary key, it creates a partial dependency. To fix
this, you split the table. The columns with the partial dependency are moved to a new table with the partial key
as its primary key.
Rule:
Must be in 1NF
A table is in 2NF if it is in 1NF and all non-key attributes are fully functionally dependent on the entire
primary key.
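A sketch of the 2NF split, again with assumed names: StudentName depends only on StudentID, which is just part of the composite key (StudentID, Course), so it moves to its own Students table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Partial dependency fixed: StudentName depends only on StudentID,
# so it moves to a table keyed by StudentID alone.
cur.execute("CREATE TABLE Students (StudentID INTEGER PRIMARY KEY, StudentName TEXT)")
cur.execute("""CREATE TABLE Enrollment (
    StudentID  INTEGER,
    Course     TEXT,
    Instructor TEXT,
    PRIMARY KEY (StudentID, Course)
)""")
cur.execute("INSERT INTO Students VALUES (101, 'Alice')")
cur.executemany("INSERT INTO Enrollment VALUES (?, ?, ?)",
                [(101, "Math", "Dr. Smith"), (101, "Physics", "Dr. Jones")])

# The name is stored once; a join reassembles the original view.
rows = cur.execute("""SELECT s.StudentName, e.Course
                      FROM Students s JOIN Enrollment e
                        ON s.StudentID = e.StudentID
                      ORDER BY e.Course""").fetchall()
print(rows)  # [('Alice', 'Math'), ('Alice', 'Physics')]
```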
Third Normal Form (3NF)
A table is in 3NF if it's in 2NF and all non-key attributes have no transitive dependency. A transitive
dependency occurs when a non-key attribute is dependent on another non-key attribute, which in turn is
dependent on the primary key. In simpler terms, a non-key field must depend on "the key, the whole key, and
nothing but the key."
Rule:
Must be in 2NF
A table is in 3NF if it is in 2NF and no non-key attribute is transitively dependent on the primary key.
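A sketch of removing a transitive dependency, with hypothetical Students and Departments tables: DeptName depends on DeptID, a non-key attribute, so it moves to its own table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Transitive dependency fixed: DeptName depends on DeptID (a non-key
# attribute), not directly on StudentID, so it lives in Departments.
cur.execute("CREATE TABLE Departments (DeptID INTEGER PRIMARY KEY, DeptName TEXT)")
cur.execute("""CREATE TABLE Students (
    StudentID   INTEGER PRIMARY KEY,
    StudentName TEXT,
    DeptID      INTEGER REFERENCES Departments(DeptID)
)""")
cur.execute("INSERT INTO Departments VALUES (10, 'Biology')")
cur.execute("INSERT INTO Students VALUES (101, 'Alice', 10)")

# Renaming a department is now a single-row update, not a scan of Students.
cur.execute("UPDATE Departments SET DeptName = 'Life Sciences' WHERE DeptID = 10")
row = cur.execute("""SELECT s.StudentName, d.DeptName FROM Students s
                     JOIN Departments d ON s.DeptID = d.DeptID""").fetchone()
print(row)  # ('Alice', 'Life Sciences')
```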
Database Language
Database languages are specialized languages used to communicate with and manage database systems. They
are typically used to define the structure of a database, manipulate the data within it, control user access, and
manage transactions. The most well-known of these is SQL (Structured Query Language), which is a
comprehensive language that incorporates different types of commands to handle all these functions.
The four main types of database languages are:
Data Definition Language (DDL)
DDL commands are used to define and manage the structure of a database. They deal with the database's
schema, which is the skeleton that holds the data. You use DDL to create, alter, and delete database objects like
tables, views, indexes, and schemas. In many database systems (MySQL and Oracle, for example), DDL
commands are auto-committed as soon as they execute, so their changes cannot be rolled back.
Common DDL Commands:
CREATE: Creates a new database object (e.g., CREATE TABLE Students).
ALTER: Modifies the structure of an existing database object (e.g., ALTER TABLE Students ADD
Email VARCHAR(255)).
DROP: Deletes a database object and all its data (e.g., DROP TABLE Students).
TRUNCATE: Deletes all data from a table while keeping its structure intact.
RENAME: Changes the name of a database object.
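The schema-changing commands above can be tried out against an in-memory SQLite database (a sketch only; support for TRUNCATE and RENAME varies by system, and SQLite, for instance, has no TRUNCATE):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE and ALTER operate on the schema, not on the rows.
cur.execute("CREATE TABLE Students (StudentID INTEGER PRIMARY KEY, StudentName TEXT)")
cur.execute("ALTER TABLE Students ADD COLUMN Email VARCHAR(255)")

# Inspect the resulting structure: PRAGMA table_info lists the columns.
cols = [c[1] for c in cur.execute("PRAGMA table_info(Students)")]
print(cols)  # ['StudentID', 'StudentName', 'Email']

# DROP removes the object and all its data.
cur.execute("DROP TABLE Students")
```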
Data Manipulation Language (DML)
DML commands are used to manipulate the data stored within the database objects. These are the most
frequently used commands, as they allow you to insert, retrieve, update, and delete the actual information. In
most systems, DML changes are not committed automatically, so they can be rolled back before a COMMIT.
Common DML Commands:
SELECT: Retrieves data from a database (e.g., SELECT * FROM Students WHERE StudentID = 101).
This is often considered a subcategory called Data Query Language (DQL).
INSERT: Adds new rows of data to a table (e.g., INSERT INTO Students (StudentID, StudentName)
VALUES (104, 'Dave')).
UPDATE: Modifies existing data within a table (e.g., UPDATE Students SET StudentMajor = 'Biology'
WHERE StudentID = 101).
DELETE: Removes rows of data from a table (e.g., DELETE FROM Students WHERE StudentID =
102).
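The inline examples above can be run end-to-end as a sketch against an in-memory SQLite database (the sample rows are assumed for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE Students (
    StudentID    INTEGER PRIMARY KEY,
    StudentName  TEXT,
    StudentMajor TEXT)""")

# INSERT, UPDATE, and DELETE change rows; SELECT reads them back.
cur.execute("INSERT INTO Students VALUES (101, 'Alice', 'Math')")
cur.execute("INSERT INTO Students VALUES (104, 'Dave', 'History')")
cur.execute("UPDATE Students SET StudentMajor = 'Biology' WHERE StudentID = 101")
cur.execute("DELETE FROM Students WHERE StudentID = 104")

rows = cur.execute("SELECT * FROM Students").fetchall()
print(rows)  # [(101, 'Alice', 'Biology')]
```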
Data Control Language (DCL)
DCL commands are used to control user access and permissions within the database. They are crucial for
database security, allowing administrators to grant and revoke privileges to different users and roles. This
ensures that only authorized users can perform specific actions on the data and its structure.
Common DCL Commands:
GRANT: Gives a user or role specific privileges (e.g., GRANT SELECT ON Students TO user1).
REVOKE: Removes previously granted privileges from a user or role (e.g., REVOKE DELETE ON
Students FROM user1).
Transaction Control Language (TCL)
TCL commands are used to manage transactions, which are a sequence of database operations that are treated
as a single, indivisible unit. These commands ensure data integrity by allowing changes to be saved
permanently, undone, or managed in a controlled way.
Common TCL Commands:
COMMIT: Permanently saves the changes made by a transaction.
ROLLBACK: Undoes all changes made since the last COMMIT or SAVEPOINT.
SAVEPOINT: Creates a temporary point in a transaction to which you can later roll back.
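A small SQLite sketch of these three commands working together (the table and savepoint names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # let us manage transactions manually
cur = conn.cursor()
cur.execute("CREATE TABLE Accounts (Name TEXT PRIMARY KEY, Balance INTEGER)")

cur.execute("BEGIN")
cur.execute("INSERT INTO Accounts VALUES ('Alice', 100)")
cur.execute("SAVEPOINT before_bob")      # temporary point inside the transaction
cur.execute("INSERT INTO Accounts VALUES ('Bob', 50)")
cur.execute("ROLLBACK TO before_bob")    # undoes only Bob's insert
cur.execute("COMMIT")                    # Alice's row is now permanent

rows = cur.execute("SELECT * FROM Accounts").fetchall()
print(rows)  # [('Alice', 100)]
```

Only the work done after the savepoint is undone; everything before it survives the COMMIT, which is exactly the controlled partial-undo that SAVEPOINT exists for.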
Centralized Database
A centralized database stores all its data and related processing on a single, central server. All users access this
one physical location over a network to retrieve and manage information. This model is straightforward and is
commonly used by smaller organizations or for applications where data doesn't need to be globally distributed.
The primary advantage is simplicity: it's easy to manage, secure, and maintain because all of the data resides in
one place. The main drawback is that the server is a single point of failure: if it goes down, the entire system
becomes unavailable, which can cause significant downtime. As the number of users and the amount of data
grow, a centralized database can also suffer from performance bottlenecks, since all traffic is directed to a
single server. This makes it less scalable and less suitable for large, geographically diverse organizations that
require high availability and local data access.
Advantages
1. Simplified Management: It's much easier to manage, maintain, and secure since all data is in one place.
Tasks like backups and administration are straightforward.
2. Strong Data Integrity: With all data consolidated in a single location, it's easier to enforce data
consistency and integrity, as there are no synchronization issues between different data copies.
3. Lower Cost: The initial setup and operational costs are generally lower. You only need to invest in one
powerful server and the necessary software licenses.
4. Security Control: Security is more manageable as you only need to protect a single point of entry with
firewalls, access controls, and other security measures.
5. Simplified Development: Application development is simpler because there's no need to consider data
fragmentation or complex distributed transaction protocols.
Disadvantages
1. Single Point of Failure: If the central server goes down, the entire database becomes inaccessible,
leading to a complete service outage.
2. Performance Bottleneck: All user queries and requests are routed to a single server. As the number of
users grows, the server can become overloaded, leading to slow response times.
3. Scalability Issues: Scaling a centralized database is difficult and often requires expensive hardware
upgrades (vertical scaling) to handle increased load.
4. Reduced Availability: The database is only available if the central server is running. Any downtime,
planned or unplanned, affects all users.
5. High Latency: For geographically distant users, network latency can be a significant issue, as all data
requests must travel to and from the central server.
Distributed Database
A distributed database is a database in which data is stored across multiple computers, also known as nodes,
that are interconnected by a network. It differs from a centralized database where all data is on a single
machine. The system is designed to appear as a single, logical database to the user, who doesn't need to know
where the data is physically located. This architecture is crucial for handling large volumes of data and high
user traffic, as it can distribute the workload and data across multiple servers.
Distributed databases use two main strategies to manage data across nodes:
Fragmentation: The database is broken into smaller logical units or fragments, and each fragment is
stored on a different node. For example, a global company might store customer data for a specific
region on a server located within that region.
Replication: Copies of the same data are stored on multiple nodes. This ensures that if one node fails,
the data is still available from a replica on another node, increasing reliability and fault tolerance.
A Distributed Database Management System (DDBMS) is the software that manages this complex
environment, handling tasks like data fragmentation, replication, and query processing to ensure consistency
and transparency for the user.
Advantages
1. High Availability: There is no single point of failure. If one server goes down, the rest of the network
can continue to operate, ensuring uninterrupted service.
2. Scalability: Distributed databases can be easily scaled horizontally by adding more nodes, which is
often more cost-effective than upgrading a single, powerful server.
3. Improved Performance: Data can be stored closer to the users who access it most frequently, reducing
network latency and improving response times.
4. Fault Tolerance: Data is often replicated across multiple nodes. If a node fails, its data can still be
accessed from a healthy replica, making the system resilient to failures.
5. Modular Growth: New nodes can be added to the network incrementally to handle growth in data or
user traffic without impacting the rest of the system.
Disadvantages
1. Complexity: They are significantly more complex to design, implement, and manage than centralized
databases. Ensuring data consistency across all nodes is a major challenge.
2. Higher Cost: The initial setup and ongoing management can be more expensive due to the need for
multiple servers, complex networking, and specialized administrative skills.
3. Data Consistency Issues: Keeping all replicated data consistent in real-time is difficult. Network delays
or failures can lead to temporary data inconsistencies.
4. Security Risks: A distributed system has multiple access points, increasing the overall security risk and
making it more challenging to protect the data.
5. Lack of Standardization: There is a lack of universal standards for distributed database systems, which
can make it challenging to integrate different systems or migrate data between them.
Data Security
Data security in a Database Management System (DBMS) is the set of measures and controls designed to
protect the confidentiality, integrity, and availability of data. This involves not only safeguarding the data itself
but also the DBMS software, the physical infrastructure, and any applications that access the data. Key
components of database security include robust access control, which uses authentication to verify a user's
identity and authorization to define what they are allowed to do. Encryption, both for data at rest (on disk) and
in transit (over a network), serves as a final line of defense against unauthorized access. Additionally, constant
auditing and monitoring of database activity is essential for detecting suspicious behavior, investigating
security incidents, and ensuring compliance with regulations. Common threats that database security aims to
mitigate include SQL injection attacks, insider threats, and a lack of proper configuration, all of which can lead
to data breaches or corruption. By implementing a multi-layered security strategy, organizations can
significantly reduce the risk of data loss and maintain the trustworthiness of their information.
Key Security Measures
Access Control: This is the most fundamental layer. It ensures only the right people can access the
database. This includes:
Authentication: Verifying a user's identity, usually with a username and password. Multi-factor
authentication adds an extra layer of security.
Authorization: Defining what an authenticated user is allowed to do. The principle of least
privilege is key here—users are given only the minimum permissions needed for their job.
Data Encryption: This is a vital last line of defense that makes data unreadable to anyone who doesn't
have the decryption key. It's used in two main ways:
Encryption at Rest: Encrypting data while it's stored on a disk. Even if a server is stolen, the
data is useless.
Encryption in Transit: Encrypting data as it travels over a network between the user and the
database, preventing interception.
Auditing and Monitoring: This involves continuously tracking and logging all database activity to
detect suspicious behavior. Auditing creates a record of who did what and when, which is crucial for
investigations and compliance. Real-time monitoring systems can alert administrators to unusual
activity, such as multiple failed login attempts.
Common Threats
Database security must address a range of threats, including SQL injection attacks, where malicious code is
inserted into an application's input fields, and insider threats, which are risks posed by employees or
contractors with authorized access. Other threats include privilege abuse by a legitimate user and
vulnerabilities caused by weak passwords or system misconfigurations.
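As a sketch of defending against the first of these threats, a parameterized query (shown here with SQLite; the table and input are illustrative) makes the driver treat attacker-controlled input strictly as a value, never as SQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Students (StudentID INTEGER, StudentName TEXT)")
cur.execute("INSERT INTO Students VALUES (101, 'Alice')")

# Malicious input that would dump every row if it were concatenated
# directly into the SQL string ("... WHERE StudentID = 101 OR 1=1").
user_input = "101 OR 1=1"

# Parameterized query: the ? placeholder binds the input as a value,
# so the injected OR 1=1 cannot change the query's logic.
rows = cur.execute(
    "SELECT * FROM Students WHERE StudentID = ?", (user_input,)).fetchall()
print(rows)  # [] -- no row has the literal ID '101 OR 1=1'
```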
Database Administrator (DBA)
A Database Administrator (DBA) is a specialized IT professional who manages, maintains, and secures a
database system. They ensure that data is correctly stored, organized, and available to users when needed. The
role is critical in any organization that relies on data for its operations, as the DBA is the primary person
responsible for the performance, integrity, and security of the database.
Installation and Configuration: DBAs install and configure the database management system
(DBMS) software and related tools, customizing them to meet an organization's specific requirements.
Backup and Recovery: They create and implement comprehensive backup strategies to prevent data
loss. This includes scheduling regular backups and testing recovery procedures to ensure data can be
restored quickly and accurately after a system failure.
Security Management: DBAs are the gatekeepers of the database. They manage user accounts, roles,
and permissions to ensure only authorized individuals can access and modify sensitive data. They also
monitor for security breaches.
Performance Tuning: They constantly monitor database performance, analyze query efficiency, and
fine-tune system parameters to ensure fast data retrieval and a smooth user experience.
Data Integrity: DBAs enforce data integrity rules to ensure data is consistent and accurate. They often
implement checks and constraints to prevent bad data from entering the database.
Troubleshooting: When issues arise, such as a slow query, a system crash, or data corruption, the DBA
is responsible for diagnosing the root cause and implementing a solution to fix it.
Maintenance: This includes routine tasks like applying software patches and updates, managing storage
space, and archiving old data to maintain a healthy database environment.
Database Design: DBAs collaborate with developers and architects to design the database schema,
including tables, relationships, and indexes, to ensure it is logical and efficient.
Capacity Planning: They monitor database growth and usage trends to predict future resource needs,
such as storage and memory, and plan for upgrades or new hardware.
Data Migration: When an organization needs to move data, whether between different servers, from
on-premises to the cloud, or to a new database system, the DBA is responsible for planning and
executing the migration smoothly and without data loss.
ER Diagram
An Entity-Relationship (ER) diagram is a visual tool used to model the logical structure of a database. It acts
as a blueprint, showing how different pieces of information, or entities, relate to each other. By using
standardized symbols, it helps developers and analysts understand the relationships, attributes, and constraints
of a database before it's actually built.
Key Figures and Their Workings
ER diagrams are built using three core components: entities, attributes, and relationships. Different notations
exist (like Chen or Crow's Foot), but the basic figures and their meanings are consistent.
Entities: Represented by a rectangle, an entity is a real-world object or concept that you want to store
data about. Think of entities as the nouns in your database.
Example: Student, Course, Instructor.
Attributes: Represented by an oval or listed within the entity rectangle, attributes are the properties or
characteristics of an entity.
Example: A Student entity might have attributes like StudentID, StudentName, and
StudentMajor. A key attribute, which uniquely identifies an entity (like StudentID), is often
underlined.
Relationships: Represented by a diamond, a relationship is how two or more entities are associated
with each other. It's the verb that connects the nouns.
Example: A Student enrolls in a Course. The diamond would be labeled "Enrolls In."
The lines connecting these figures also show cardinality, which defines the number of instances of one entity
that can be related to instances of another. The three main types are:
One-to-One (1:1): Each entity instance relates to exactly one instance of another entity. (e.g., a person
has one social security card).
One-to-Many (1:N): One entity instance can relate to many instances of another. (e.g., an instructor
teaches many courses).
Many-to-Many (M:N): Many instances of one entity can relate to many instances of another. (e.g.,
many students enroll in many courses).
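As a sketch of how these cardinalities become tables (names assumed for illustration), a many-to-many relationship is typically implemented as a junction table whose composite primary key pairs the two entities' keys:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# M:N (Student enrolls in Course) becomes the junction table Enrolls,
# whose composite key pairs the two entities' primary keys.
cur.executescript("""
CREATE TABLE Student (StudentID INTEGER PRIMARY KEY, StudentName TEXT);
CREATE TABLE Course  (CourseID  INTEGER PRIMARY KEY, Title TEXT);
CREATE TABLE Enrolls (
    StudentID INTEGER REFERENCES Student(StudentID),
    CourseID  INTEGER REFERENCES Course(CourseID),
    PRIMARY KEY (StudentID, CourseID)
);
INSERT INTO Student VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO Course  VALUES (10, 'Math'), (20, 'Physics');
INSERT INTO Enrolls VALUES (1, 10), (1, 20), (2, 10);
""")

# Alice is in two courses and Math has two students: many-to-many.
rows = cur.execute("""SELECT s.StudentName, c.Title
                      FROM Enrolls e
                      JOIN Student s ON s.StudentID = e.StudentID
                      JOIN Course  c ON c.CourseID  = e.CourseID
                      ORDER BY s.StudentName, c.Title""").fetchall()
print(rows)  # [('Alice', 'Math'), ('Alice', 'Physics'), ('Bob', 'Math')]
```

A 1:N relationship needs no junction table: a plain foreign key on the "many" side (e.g. a CourseID column in an Enrollment row) is enough.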
Symbols Used in ER Model
The ER model describes the logical view of the system from a data perspective using a small set of standard
(Chen-notation) symbols:
Rectangle: an entity; a double rectangle marks a weak entity.
Oval: an attribute; a double oval marks a multivalued attribute, a dashed oval a derived attribute, and an
underlined attribute name a key attribute.
Diamond: a relationship; a double diamond marks an identifying relationship with a weak entity.
Lines: connect attributes to their entities and entities to their relationships.
Key Advantages of ER Diagrams
1. Clarity and Simplicity: ER diagrams provide a simple, graphical representation of a complex database
structure. This visual format is much easier to understand than a detailed textual description or a list of
tables. It helps everyone involved—from developers and business analysts to clients—get a clear picture
of how the data is organized.
2. Effective Communication: By acting as a universal blueprint, ER diagrams facilitate communication
among various teams. Business users can validate the model to ensure it accurately represents their
needs, while developers can use the diagram to build the database without misunderstanding the
requirements.
3. Better Database Design: ER diagrams are a crucial step in the design process. They help designers
identify and correct flaws like redundancy, missing relationships, or inconsistent data before any code is
written. This proactive approach leads to a more efficient, logical, and robust final database.
4. Reduced Redundancy: The process of creating an ER diagram naturally encourages normalization. By
clearly mapping out entities and their relationships, designers can spot repeating data and structure the
database to minimize redundancy and improve data integrity.
5. Blueprint for Implementation: The ER diagram serves as a direct blueprint for building the database.
Once the diagram is finalized, developers can easily translate entities into tables, attributes into columns,
and relationships into foreign key constraints, significantly speeding up the implementation process.
What is an Entity?
An entity represents a real-world object, concept, or thing about which data is stored in a database. It acts as a
building block of a database: tables in a relational database represent these entities.
Example of entities:
Real-World Objects: Person, Car, Employee etc.
Concepts: Course, Event, Reservation etc.
Things: Product, Document, Device etc.
Types of Entity
Strong Entity
A strong entity is an independent entity that can exist on its own and has a primary key to uniquely identify its
instances. It does not rely on any other entity for its existence.
For example, in a university database, Student is a strong entity because a student can exist independently of a
course. Their primary key, StudentID, is sufficient to identify them.
Weak Entity
A weak entity is an entity that cannot exist without its related strong entity. It does not have a primary key of
its own; instead, it combines a partial key (a discriminator) with the primary key of its parent strong entity to
form its own unique identifier.
For example, a Dependent entity (a child of an employee) is a weak entity because a dependent cannot exist in
the database unless they are associated with a specific Employee (the strong entity).
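A SQLite sketch of this (table and column names assumed): Dependent is identified by the owner's EmpID plus a partial key, and cannot outlive its owner:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked

cur.executescript("""
CREATE TABLE Employee (EmpID INTEGER PRIMARY KEY, Name TEXT);
-- Weak entity: Dependent has no key of its own; its identifier is the
-- owner's EmpID plus a partial key (the dependent's name).
CREATE TABLE Dependent (
    EmpID   INTEGER NOT NULL REFERENCES Employee(EmpID) ON DELETE CASCADE,
    DepName TEXT NOT NULL,
    PRIMARY KEY (EmpID, DepName)
);
INSERT INTO Employee VALUES (1, 'Carol');
INSERT INTO Dependent VALUES (1, 'Sam');
""")

# Deleting the owning Employee removes the Dependent as well,
# reflecting the weak entity's existence dependency.
cur.execute("DELETE FROM Employee WHERE EmpID = 1")
count = cur.execute("SELECT COUNT(*) FROM Dependent").fetchone()
print(count)  # (0,)
```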
Data Integrity
Data integrity refers to the overall accuracy, completeness, and consistency of data throughout its entire
lifecycle. It's about ensuring that data is reliable and trustworthy. Maintaining data integrity is crucial for
making sound business decisions and for meeting legal and regulatory compliance requirements.
There are two main types of data integrity:
Entity Integrity: This type of integrity ensures that each row in a table is uniquely identified. It is
enforced by the primary key, which must contain unique values and cannot be NULL. This prevents
duplicate rows and guarantees that every entity instance is distinct.
Referential Integrity: This ensures that relationships between tables remain consistent. It is enforced
using foreign keys. A foreign key in one table must either point to a valid, existing record in another
table or be NULL. For example, a student record with a foreign key referencing a course ID must refer
to a course that actually exists in the courses table. This prevents "orphan" records.
Data integrity is protected through various constraints and rules defined within the database schema, such as
UNIQUE, NOT NULL, CHECK, and FOREIGN KEY constraints. These measures prevent the entry of
incorrect, incomplete, or duplicate data.
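A small SQLite sketch of both kinds of integrity (schema names illustrative); note that SQLite only enforces foreign keys when the pragma is enabled:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked

cur.executescript("""
CREATE TABLE Courses  (CourseID INTEGER PRIMARY KEY, Title TEXT NOT NULL);
CREATE TABLE Students (
    StudentID INTEGER PRIMARY KEY,  -- entity integrity: unique, never NULL
    CourseID  INTEGER REFERENCES Courses(CourseID)  -- referential integrity
);
INSERT INTO Courses  VALUES (10, 'Math');
INSERT INTO Students VALUES (101, 10);
""")

# A student pointing at a non-existent course would be an "orphan"
# record, so the FOREIGN KEY constraint rejects the insert.
err = None
try:
    cur.execute("INSERT INTO Students VALUES (102, 99)")
except sqlite3.IntegrityError as e:
    err = e
print(err)  # FOREIGN KEY constraint failed
```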