UNIT I – Introduction to Database Management Systems (DBMS)
1. Introduction to Database
In any organization—be it a business, educational institute, bank, hospital, or government
office—data plays a central role. Whether it’s customer details, employee records, student
grades, financial transactions, or inventory stock, managing data accurately and efficiently is
critical. This is where the concept of a database comes in.
A database is an organized collection of data. It stores information in a structured format so
that it can be easily accessed, managed, and updated. Unlike random files stored on a
computer, a database is built using specific structures and rules that make it suitable for
regular use, fast access, and long-term reliability.
For example, a retail company may maintain a database to store customer profiles, product
details, and sales transactions. Instead of having different files scattered across multiple
systems, a database combines all this information into a unified format where relationships
between data can be established and tracked easily.
2. Understanding File, Record, and Fields
Before diving deeper into databases, it’s important to understand how data is organized
within a traditional system. Earlier, organizations relied heavily on file-based systems to
store data.
2.1 File
A file is a collection of related information stored on a storage device. Think of a file as a
document that holds data. In computer systems, files are usually structured in a way that
makes it easier to read and write information.
For instance, an Employee.txt file might contain details like this:
101, Rohan Mehta, Manager, 50000
102, Priya Shah, Analyst, 35000
This file contains multiple records.
2.2 Record
A record is a complete set of data that relates to one item or individual. In the above
example, the entire line for Rohan Mehta (with ID, name, designation, and salary) is one
record.
Each record provides all the necessary details about one person or item.
2.3 Field
A field refers to a single piece of data within a record. For instance, “101” is the Employee ID
field, “Rohan Mehta” is the Name field, and so on. A record is made up of several fields.
To summarize:
• A file contains multiple records.
• A record contains multiple fields.
• A field contains a single value.
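The file/record/field hierarchy above can be sketched in a few lines of Python, using the Employee data from section 2.1 (the file contents are the illustrative ones shown earlier, held in a string rather than read from disk):

```python
# A file holds records; each record holds fields.
file_contents = """101, Rohan Mehta, Manager, 50000
102, Priya Shah, Analyst, 35000"""

# The file contains multiple records (one per line).
records = file_contents.splitlines()

# Each record contains multiple fields (one comma-separated value each).
fields = [field.strip() for field in records[0].split(",")]

print(len(records))  # 2
print(fields)        # ['101', 'Rohan Mehta', 'Manager', '50000']
```

Each field then holds a single value, completing the hierarchy.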
3. Problems with File-Based Systems
Before databases became popular, most organizations used file-based systems to handle
their data. While simple to implement, these systems had many limitations, especially as the
organization grew. Some of the key problems are discussed below:
3.1 Data Redundancy
This means the same data is stored in multiple files. For example, the name and contact
details of a customer might be saved in both the sales file and the delivery file. This
repetition leads to:
• Wasted storage space
• More time required to update data in multiple places
• Higher chances of inconsistency
3.2 Data Inconsistency
When data is stored in multiple places and not updated everywhere, it can become
inconsistent. For example, if the customer’s phone number is changed in one file but not in
another, the system has conflicting information.
3.3 Lack of Data Integration
Different departments may use separate file formats or storage methods, making it difficult
to combine or analyze data. This limits the organization’s ability to make strategic decisions
based on all available information.
3.4 Difficult Data Access
To retrieve information, programmers had to write long and complex programs. There were
no standard query tools, so even simple reports were time-consuming.
3.5 Security Problems
File systems offered very basic security. If someone had access to a file, they could view or
change all the data without restrictions.
3.6 No Concurrent Access
File systems didn’t handle multiple users accessing or modifying the same data at the same
time. This often led to errors or data corruption.
3.7 Poor Backup and Recovery
If a file was accidentally deleted or corrupted, there was no reliable way to recover the data
unless a manual backup had been made.
These drawbacks led to the development of more structured and secure systems—Database
Management Systems.
4. What is a DBMS (Database Management System)?
A Database Management System (DBMS) is software that helps users store, organize,
access, and modify data in a database. It provides a layer between users and the raw data,
ensuring that data is:
• Easily accessible
• Stored efficiently
• Secure and consistent
• Protected from unauthorized access or system failures
With a DBMS, users don’t need to know where the data is stored or how it's structured.
They can interact with the database through queries, forms, or applications.
Some of the most widely used DBMSs include:
• Oracle
• Microsoft SQL Server
• MySQL
• PostgreSQL
• IBM DB2
5. Categorization of DBMS
There are different ways to organize and structure data in a database. Over time, several
models have been developed. The three main types of database models are:
5.1 Hierarchical Database
In this model, data is organized in a tree-like structure, where each record has a parent-child
relationship. It is similar to folders in a computer, where each folder can have multiple
subfolders, but each subfolder has only one parent.
Example:
• A company database may have a hierarchy like:
o Company
▪ Departments
▪ Employees
This model works well for situations with clear one-to-many relationships but struggles with
complex relationships.
5.2 Network Database
The network model is more flexible than the hierarchical model. In this system, records can
have multiple parent and child records, forming a graph-like structure.
Example:
• A student can enroll in multiple courses, and each course can have multiple students.
• Therefore, there’s a many-to-many relationship.
This model is powerful but complex to manage and understand.
5.3 Relational Database
The relational model is the most commonly used and easiest to understand. It stores data in
tables, and relationships between data are created using keys (like primary and foreign
keys).
Example:
• A “Customers” table with fields like CustomerID, Name, and Address.
• An “Orders” table with fields like OrderID, CustomerID, and Amount.
The CustomerID in the Orders table refers to the Customers table, linking the two.
Relational DBMSs are widely used due to their simplicity, flexibility, and powerful querying
capabilities using SQL (Structured Query Language).
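The Customers/Orders link described above can be sketched with Python's built-in `sqlite3` module (the table layouts follow the example; the row values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enable FK enforcement in SQLite

# CustomerID is the primary key of Customers...
conn.execute("CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Name TEXT, Address TEXT)")
# ...and a foreign key in Orders, linking each order to its customer.
conn.execute("""CREATE TABLE Orders (
    OrderID INTEGER PRIMARY KEY,
    CustomerID INTEGER REFERENCES Customers(CustomerID),
    Amount REAL)""")

conn.execute("INSERT INTO Customers VALUES (1, 'Riya Mehta', 'Pune')")
conn.execute("INSERT INTO Orders VALUES (101, 1, 2500.0)")

# The shared key lets SQL join the two tables back together.
row = conn.execute("""SELECT c.Name, o.Amount
                      FROM Orders o
                      JOIN Customers c ON o.CustomerID = c.CustomerID""").fetchone()
print(row)  # ('Riya Mehta', 2500.0)
```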
6. Applications of DBMS
DBMS is used in almost every sector that deals with structured data. Here are some real-life
applications:
6.1 Banking
Banks use DBMS to manage customer accounts, transactions, loans, and more. Every ATM
transaction updates the bank’s database in real time.
6.2 Education
Colleges and universities use databases to manage student records, courses, attendance,
fees, and grades.
6.3 Healthcare
Hospitals store patient records, lab results, treatment history, and appointment schedules
using DBMS.
6.4 Retail
Retail stores use DBMS for inventory management, sales tracking, billing, and customer
loyalty programs.
6.5 Airlines
Flight schedules, passenger bookings, and ticketing are all managed through complex
databases.
6.6 Government
Tax records, voter databases, census data, and vehicle registrations are managed using large
government database systems.
7. Three-Layered Architecture of DBMS
To understand how a DBMS works internally, it's useful to study its three-layered
architecture. This architecture defines different levels of abstraction for how data is viewed
and stored.
7.1 External Level (View Level)
This is the topmost layer where users interact with the data. Different users can have
different views depending on their access rights. For example, a student may only see their
own marks, while a teacher can see marks for all students.
7.2 Conceptual Level (Logical Level)
This layer defines the structure of the entire database. It describes what data is stored, the
relationships between them, and the constraints. This level hides the physical details and
provides a logical view.
7.3 Internal Level (Physical Level)
The lowest layer deals with how the data is actually stored on disk. It includes file formats,
indexing, compression, and data blocks. Users do not interact directly with this level.
Benefits of the Three-Layered Architecture
• Provides data abstraction and data independence
• Improves security by hiding sensitive information
• Supports multiple user views without changing the physical structure
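The external level is typically implemented with SQL views. The sketch below (using SQLite; table and column names are illustrative) shows a view that exposes only one student's rows, without changing how the underlying table is stored:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Marks (StudentID INTEGER, Subject TEXT, Score INTEGER)")
conn.executemany("INSERT INTO Marks VALUES (?, ?, ?)",
                 [(1, 'DBMS', 78), (2, 'DBMS', 85)])

# External view for student 1: only their own marks are visible.
conn.execute("""CREATE VIEW Student1Marks AS
                SELECT Subject, Score FROM Marks WHERE StudentID = 1""")

rows = conn.execute("SELECT * FROM Student1Marks").fetchall()
print(rows)  # [('DBMS', 78)]
```

A teacher's view could select from the full table, giving each user class its own window onto the same conceptual schema.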
8. Advantages of DBMS
A Database Management System offers many benefits over traditional file systems:
8.1 Reduced Redundancy
Centralized data storage eliminates duplicate data entries.
8.2 Improved Consistency
Since the data is updated in one place, consistency is maintained across the system.
8.3 Enhanced Security
User roles and permissions restrict access to sensitive data.
8.4 Easy Data Retrieval
Using SQL, users can retrieve complex information quickly.
8.5 Better Backup and Recovery
Most DBMSs offer automated backups and restore functions to protect against data loss.
8.6 Concurrent Access
Multiple users can access and modify data at the same time, without interference.
8.7 Integrity and Validation
Data validation rules (like age limits or required fields) can be enforced at the database level.
9. Disadvantages of DBMS
While DBMSs offer numerous advantages, they also have certain limitations:
9.1 Cost
Licensing and maintaining enterprise DBMS software can be expensive.
9.2 Complexity
DBMS systems can be complex to set up and require trained professionals for
administration.
9.3 Performance Overhead
For very small applications, a DBMS might add unnecessary overhead.
9.4 Hardware Requirements
Large databases need more powerful hardware for smooth performance.
UNIT II – Components and Design of Database Management Systems
1. Introduction
In the previous unit, we explored the foundational concepts of databases and the evolution
from file-based systems to Database Management Systems (DBMS). We also discussed the
various types of DBMS models and their real-world applications. Building upon that
foundation, this unit delves into the key components that constitute a DBMS, the roles and
responsibilities associated with database management, and the design principles that
ensure data integrity and efficiency.
Understanding these components and design principles is crucial for effectively managing
and utilizing databases in any organizational context. Whether you're overseeing data as a
manager or interacting with databases as an end-user, a clear grasp of these concepts will
enhance your ability to make informed decisions and optimize business processes.
2. Important Components in Database Management
A DBMS is a complex system comprising various components that work together to ensure
efficient data management. Key components include:
2.1 Database Administrator (DBA)
The Database Administrator (DBA) is a pivotal role responsible for the overall management
and maintenance of the database system. The DBA's responsibilities encompass:
• Installation and Configuration: Setting up the DBMS software and configuring it to
meet organizational requirements.
• Security Management: Implementing measures to protect data from unauthorized
access and ensuring compliance with data protection regulations.
• Performance Monitoring: Continuously monitoring the system's performance and
making necessary adjustments to optimize efficiency.
• Backup and Recovery: Establishing procedures for data backup and developing
recovery strategies to safeguard against data loss.
• User Management: Managing user accounts, defining access levels, and ensuring
that users have the necessary permissions to perform their tasks.
The DBA acts as the custodian of the database, ensuring its availability, integrity, and
security.
2.2 Database
At the heart of a DBMS lies the database itself—a structured collection of data stored
electronically. The database serves as the repository where all data is organized and
managed. It consists of:
• Tables: Structures that store data in rows and columns, representing entities and
their attributes.
• Schemas: Blueprints that define the structure of the database, including tables,
fields, relationships, and constraints.
• Indexes: Data structures that improve the speed of data retrieval operations on
database tables.
The database is the core component where data resides, and its design significantly impacts
the efficiency and effectiveness of data management.
2.3 Application Programs
Application programs are software applications that interact with the database to perform
various operations such as data entry, retrieval, update, and deletion. These programs
provide user-friendly interfaces, enabling users to interact with the database without
needing to understand complex query languages. Examples include:
• Customer Relationship Management (CRM) Systems: Manage customer data and
interactions.
• Enterprise Resource Planning (ERP) Systems: Integrate core business processes.
• Inventory Management Systems: Track stock levels, orders, and deliveries.
Application programs serve as the bridge between end-users and the database, facilitating
seamless interactions and operations.
2.4 Data Definition Language (DDL)
Data Definition Language (DDL) comprises a set of SQL commands used to define and
modify the database structure. DDL statements include:
• CREATE: Establishes new database objects such as tables, indexes, or views.
• ALTER: Modifies existing database objects, allowing changes to structures like adding
or dropping columns.
• DROP: Deletes existing database objects from the system.
DDL commands are essential for structuring and organizing the database to meet specific
requirements.
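The three DDL statements can be run in sequence against SQLite (object names here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# CREATE: establish a new table.
conn.execute("CREATE TABLE Employee (EmpID INTEGER PRIMARY KEY, Name TEXT)")

# ALTER: modify the existing structure by adding a column.
conn.execute("ALTER TABLE Employee ADD COLUMN Salary REAL")
columns = [row[1] for row in conn.execute("PRAGMA table_info(Employee)")]
print(columns)  # ['EmpID', 'Name', 'Salary']

# DROP: delete the object from the system.
conn.execute("DROP TABLE Employee")
tables = conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
print(tables)  # []
```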
2.5 Data Manipulation Language (DML)
Data Manipulation Language (DML) consists of SQL commands that enable users to
manipulate data within the database. DML statements include:
• SELECT: Retrieves data from one or more tables.
• INSERT: Adds new records into a table.
• UPDATE: Modifies existing records in a table.
• DELETE: Removes records from a table.
DML commands are crucial for interacting with and managing the data stored in the
database.
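All four DML statements fit in one short session (again using SQLite; the data is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student (StudentID INTEGER PRIMARY KEY, Name TEXT, Grade TEXT)")

# INSERT: add new records.
conn.execute("INSERT INTO Student VALUES (101, 'Riya', 'B')")
conn.execute("INSERT INTO Student VALUES (102, 'Aman', 'C')")

# UPDATE: modify an existing record.
conn.execute("UPDATE Student SET Grade = 'A' WHERE StudentID = 101")

# DELETE: remove a record.
conn.execute("DELETE FROM Student WHERE StudentID = 102")

# SELECT: retrieve what remains.
rows = conn.execute("SELECT StudentID, Name, Grade FROM Student").fetchall()
print(rows)  # [(101, 'Riya', 'A')]
```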
3. Components of DBMS Architecture
The architecture of a DBMS encompasses various components that work collaboratively to
manage, process, and secure data. Key components include:
3.1 Query Processor
The Query Processor interprets and executes database queries written in SQL. It comprises
several sub-components:
• DDL Interpreter: Processes DDL commands to define or modify database structures.
• DML Compiler: Translates DML statements into low-level instructions that the DBMS
can execute.
• Query Optimizer: Analyzes queries to determine the most efficient execution plan,
enhancing performance.
The Query Processor ensures that user queries are interpreted correctly and executed
efficiently.
3.2 Data Dictionary
The Data Dictionary, also known as the system catalog, is a centralized repository that
stores metadata—data about data. It contains information such as:
• Descriptions of Tables and Columns: Names, data types, and constraints.
• Relationships Between Data Elements: Foreign keys and dependencies.
• Access Privileges: Information on user permissions and roles.
The Data Dictionary serves as a reference guide for the DBMS, aiding in query processing,
integrity enforcement, and security management.
4. Physical Database Structures
The physical structure of a database refers to how data is stored on storage media.
Understanding these structures is vital for optimizing performance and ensuring data
integrity. Key physical components include:
4.1 Data Files
Data Files are the physical files on disk that store the database's data. They hold the actual
records and are organized into blocks or pages for efficient access.
4.2 Indexes
Indexes are special lookup tables that the DBMS uses to speed up data retrieval. An index on
a column creates a data structure that allows for faster searches, much like an index in a
book.
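The effect of an index is visible in the query plan. In the SQLite sketch below (table contents are illustrative), an equality search switches from a full table scan to an index search once the index exists:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Product (ProductID INTEGER, Name TEXT)")
conn.executemany("INSERT INTO Product VALUES (?, ?)",
                 [(i, f"Item{i}") for i in range(1000)])

# Build the lookup structure on the searched column.
conn.execute("CREATE INDEX idx_product_id ON Product(ProductID)")

# The plan's detail text now names the index rather than a table scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM Product WHERE ProductID = 500").fetchone()
print(plan[-1])  # plan text mentions idx_product_id
```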
4.3 Tablespaces
A Tablespace is a logical storage unit that groups related data files. It allows for better
organization and management of data storage, enabling the DBA to allocate storage
resources effectively.
4.4 Redo Logs
Redo Logs are files that record all changes made to the database. They are crucial for
recovery operations, allowing the DBMS to reconstruct data in the event of a system failure.
Understanding these physical structures enables efficient storage management and ensures
the reliability of the database system.
5. Normalization and Logical Design
Normalization is a systematic approach to organizing data in a database to reduce
redundancy and improve data integrity. It involves decomposing tables into smaller, related
tables without losing data. The process is divided into normal forms, each with specific rules
and requirements.
5.1 First Normal Form (1NF)
A table is in 1NF if:
• All columns contain atomic (indivisible) values.
• Each column contains values of a single type.
• Each column has a single, unique name, and each row must be unique.
Example (Before 1NF):
StudentID | Name       | Courses
101       | Riya Mehta | DBMS, Java, Python
In the above example, the Courses column contains multiple values. This violates the rule of
atomicity.
After applying 1NF:
StudentID | Name       | Course
101       | Riya Mehta | DBMS
101       | Riya Mehta | Java
101       | Riya Mehta | Python
Now, each field contains only one value, satisfying 1NF.
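The 1NF conversion above amounts to expanding each multi-valued Courses entry into one row per course, which can be sketched directly:

```python
# Unnormalized: the Courses field holds several values at once.
unnormalized = [(101, "Riya Mehta", "DBMS, Java, Python")]

# 1NF: one atomic course value per row.
first_normal_form = [
    (student_id, name, course.strip())
    for student_id, name, courses in unnormalized
    for course in courses.split(",")
]
print(first_normal_form)
# [(101, 'Riya Mehta', 'DBMS'), (101, 'Riya Mehta', 'Java'), (101, 'Riya Mehta', 'Python')]
```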
5.2 Second Normal Form (2NF)
A table is in 2NF if:
• It is already in 1NF.
• It has no partial dependency, i.e., no non-prime attribute should depend only on a
part of the composite primary key.
Example (Before 2NF):
Consider a table where the primary key is (StudentID, CourseID):
StudentID | CourseID | StudentName | CourseName
101       | C01      | Riya        | DBMS
101       | C02      | Riya        | Java
Here, StudentName depends only on StudentID, and CourseName depends only on
CourseID, violating 2NF.
After applying 2NF:
• Student Table
StudentID | StudentName
101       | Riya
• Course Table
CourseID | CourseName
C01      | DBMS
C02      | Java
• Enrollment Table
StudentID | CourseID
101       | C01
101       | C02
This eliminates partial dependency and satisfies 2NF.
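The decomposed design can be verified in SQLite: a join over the three tables reconstructs the original information, so nothing is lost by removing the partial dependencies (the data is the same illustrative data as above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Student (StudentID INTEGER PRIMARY KEY, StudentName TEXT);
CREATE TABLE Course (CourseID TEXT PRIMARY KEY, CourseName TEXT);
CREATE TABLE Enrollment (StudentID INTEGER, CourseID TEXT,
                         PRIMARY KEY (StudentID, CourseID));
INSERT INTO Student VALUES (101, 'Riya');
INSERT INTO Course VALUES ('C01', 'DBMS'), ('C02', 'Java');
INSERT INTO Enrollment VALUES (101, 'C01'), (101, 'C02');
""")

# Joining the three tables recovers the pre-2NF rows.
rows = conn.execute("""
    SELECT e.StudentID, e.CourseID, s.StudentName, c.CourseName
    FROM Enrollment e
    JOIN Student s ON e.StudentID = s.StudentID
    JOIN Course c ON e.CourseID = c.CourseID
    ORDER BY e.CourseID
""").fetchall()
print(rows)
# [(101, 'C01', 'Riya', 'DBMS'), (101, 'C02', 'Riya', 'Java')]
```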
5.3 Third Normal Form (3NF)
A table is in 3NF if:
• It is in 2NF.
• It has no transitive dependencies, meaning non-prime attributes should not depend
on other non-prime attributes.
Example (Before 3NF):
EmployeeID | EmployeeName | Department | DeptLocation
201        | Neha Sharma  | HR         | 3rd Floor
Here, DeptLocation depends on Department, which is not a primary key. This is a transitive
dependency.
After applying 3NF:
• Employee Table
EmployeeID | EmployeeName | Department
201        | Neha Sharma  | HR
• Department Table
Department | DeptLocation
HR         | 3rd Floor
Now, all attributes depend only on the primary key, satisfying 3NF.
5.4 Importance of Normalization
• Eliminates Redundancy: Avoids duplicate data storage.
• Maintains Integrity: Ensures data accuracy through relationships and keys.
• Improves Maintainability: Easier to update and modify.
• Efficient Querying: Simplifies structure for better performance.
6. Logical Design of Databases
Logical design is the process of converting a conceptual design (often from an Entity-
Relationship Diagram) into a logical structure that can be implemented using a DBMS. While
conceptual design focuses on real-world entities and relationships, logical design translates
those into tables, columns, keys, and constraints.
6.1 Steps in Logical Design
Step 1: Identify Entities and Attributes
Start by identifying the main entities (e.g., Employee, Department) and their attributes (e.g.,
EmployeeName, DepartmentName).
Step 2: Define Relationships
Define how entities are related (e.g., one-to-many, many-to-many).
Step 3: Create Tables
Transform entities into tables with appropriate columns and assign primary keys.
Step 4: Establish Foreign Keys
Define relationships using foreign keys. For example, in an Employee table, DepartmentID
may be a foreign key referencing the Department table.
Step 5: Apply Normalization
Ensure all tables follow normalization rules to avoid redundancy and improve integrity.
Step 6: Define Constraints
Apply constraints such as NOT NULL, UNIQUE, CHECK, and DEFAULT to enforce business
rules.
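Step 6 can be sketched as a single CREATE TABLE with the constraints declared in place, so the DBMS itself enforces the business rules. The table, columns, and rules below are illustrative choices, run against SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE Employee (
    EmployeeID INTEGER PRIMARY KEY,
    Name       TEXT NOT NULL,              -- required field
    Email      TEXT UNIQUE,                -- no duplicate addresses
    Age        INTEGER CHECK (Age >= 18),  -- business rule: adults only
    Status     TEXT DEFAULT 'Active'       -- supplied when omitted
)""")
conn.execute("""INSERT INTO Employee (EmployeeID, Name, Email, Age)
                VALUES (1, 'Neha', 'neha@example.com', 30)""")

# The CHECK constraint rejects an under-age row at the database level.
try:
    conn.execute("INSERT INTO Employee (EmployeeID, Name, Age) VALUES (2, 'X', 15)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

status = conn.execute("SELECT Status FROM Employee WHERE EmployeeID = 1").fetchone()[0]
print(rejected, status)  # True Active
```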
6.2 Example: University Database Design
Entities Identified:
• Student
• Course
• Instructor
Relationships:
• A student can enroll in many courses.
• An instructor can teach many courses.
Logical Schema:
• Student Table
StudentID | StudentName | Email
1001      | Riya        | [email protected]
• Course Table
CourseID | CourseName
C101     | DBMS
• Instructor Table
InstructorID | InstructorName
I01          | Dr. Joshi
• Enrollment Table
StudentID | CourseID
1001      | C101
• Teaching Table
InstructorID | CourseID
I01          | C101
6.3 Significance of Logical Design
Logical design ensures that:
• Data structures reflect the business model accurately.
• The system supports all required operations and constraints.
• Future changes can be implemented with minimal disruption.
UNIT III – Relational Database Management Systems and Transaction Management
1. Introduction to Relational Database Management Systems (RDBMS)
In the modern business environment, data is a critical asset that drives decision-making and
strategic planning. Efficiently managing this data is paramount for organizations aiming to
maintain a competitive edge. This necessity has led to the widespread adoption of
Relational Database Management Systems (RDBMS), which offer a structured and efficient
approach to data management.
1.1 What is an RDBMS?
A Relational Database Management System (RDBMS) is a type of database management
system that stores data in a structured format using rows and columns. This structure allows
for easy data retrieval and management. The term "relational" refers to the way data is
organized into tables (also known as relations), which can be linked based on common data
attributes, enabling complex queries and data analysis.
1.2 Evolution from Traditional DBMS to RDBMS
Traditional Database Management Systems (DBMS) managed data in a hierarchical or
network model, which often led to complexities in data retrieval and redundancy. The
introduction of the relational model by Dr. Edgar F. Codd in 1970 revolutionized data
management by introducing a simple and flexible structure for data storage and retrieval.
1.3 Key Features of RDBMS
• Data Structuring: Data is organized into tables with rows and columns, making it
intuitive and easy to understand.
• Data Integrity: Ensures accuracy and consistency of data through constraints and
relationships.
• Data Security: Provides robust mechanisms to control access and protect data from
unauthorized users.
• Scalability: Capable of handling large volumes of data and numerous users
simultaneously.
• Flexibility: Allows for complex queries and data analysis through Structured Query
Language (SQL).
1.4 Advantages of RDBMS
• Elimination of Data Redundancy: By normalizing data, RDBMS reduces duplication
and ensures efficient storage.
• Data Consistency: Maintains uniformity across the database, ensuring that all users
access the same data.
• Enhanced Data Sharing: Multiple users can access and manipulate data concurrently
without conflicts.
• Improved Data Security: Implements access controls and permissions to safeguard
sensitive information.
1.5 Popular RDBMS Software
• Oracle Database: Known for its robustness and scalability, widely used in enterprise
environments.
• Microsoft SQL Server: Offers a comprehensive set of tools for data management and
business intelligence.
• MySQL: An open-source RDBMS popular for web applications and small to medium-
sized businesses.
• PostgreSQL: An advanced open-source RDBMS known for its extensibility and
standards compliance.
2. Entity-Relationship (E-R) Model and E-R Diagrams
Designing a database requires a clear understanding of the data and its relationships. The
Entity-Relationship (E-R) Model is a conceptual framework used to represent the data and
its interconnections within a system.
2.1 Understanding the E-R Model
The E-R Model provides a graphical representation of entities, attributes, and relationships
in a database. It serves as a blueprint for designing databases by illustrating how data
elements relate to each other.
2.2 Components of the E-R Model
• Entities: Objects or things in the real world that have distinct existence. For example,
in a university database, entities could be Students, Courses, and Instructors.
• Attributes: Characteristics or properties of an entity. For instance, a Student entity
may have attributes like StudentID, Name, and Date of Birth.
• Relationships: Associations between entities. For example, a Student enrolls in a
Course, establishing a relationship between the Student and Course entities.
2.3 E-R Diagrams
An E-R Diagram is a visual representation of the E-R Model, depicting entities as rectangles,
attributes as ovals, and relationships as diamonds connecting the entities.
Example: University Enrollment System
Consider a university enrollment system with the following entities and relationships:
• Entities:
o Student: Attributes include StudentID, Name, and Email.
o Course: Attributes include CourseID, CourseName, and Credits.
o Instructor: Attributes include InstructorID, Name, and Department.
• Relationships:
o Enrolls: A Student enrolls in a Course.
o Teaches: An Instructor teaches a Course.
E-R Diagram Representation:
[Student]----(Enrolls)----[Course]----(Teaches)----[Instructor]
In the diagram:
• Rectangles represent entities (Student, Course, Instructor).
• Diamonds represent relationships (Enrolls, Teaches).
• Ovals connected to rectangles represent attributes of the entities.
2.4 Creating an E-R Diagram: Step-by-Step
1. Identify Entities: Determine the main objects that need to be represented in the
database.
2. Determine Attributes: For each entity, list the relevant attributes that describe it.
3. Establish Relationships: Define how entities interact with each other.
4. Draw the Diagram: Use standardized symbols to represent entities, attributes, and
relationships.
2.5 Exercise: Designing an E-R Diagram
Scenario: Design an E-R diagram for a Library Management System.
Entities:
• Book: Attributes include ISBN, Title, Author, and Publisher.
• Member: Attributes include MemberID, Name, and MembershipDate.
• Loan: Attributes include LoanID, LoanDate, and ReturnDate.
Relationships:
• Borrows: A Member borrows a Book.
Solution:
1. Draw Entities: Represent Book, Member, and Loan as rectangles.
2. Add Attributes: Connect ovals to each rectangle representing the attributes.
3. Define Relationships: Use diamonds to represent the Borrows relationship between
Member and Book.
This exercise helps in understanding how to translate real-world scenarios into database
designs using the E-R Model.
3. E.F. Codd’s 12 Rules for Relational Databases
When the relational model was first introduced in 1970 by Dr. Edgar F. Codd, it
revolutionized the way data was organized, accessed, and managed. To ensure that database
systems followed the relational model properly, Dr. Codd proposed 12 rules (plus a
foundational Rule 0) that any system must satisfy to be considered truly relational.
Although most modern relational database systems do not strictly follow all 12 rules, the
rules are still widely used to evaluate how “relational” a database management system truly is.
Rule 0: Foundation Rule
A system must manage data using only its relational capabilities.
This rule sets the stage—any system claiming to be relational must support and rely solely
on relational principles for storing and retrieving data.
Rule 1: Information Rule
All information in a relational database is represented explicitly as values in tables.
This means that everything—including metadata, user data, and structural definitions—
should be stored in table format, with rows and columns.
Rule 2: Guaranteed Access Rule
Every individual data value is accessible by using a combination of table name, primary key,
and column name.
This rule ensures that there’s a consistent and logical way to access any data point, making
systems more predictable and structured.
Rule 3: Systematic Treatment of Null Values
Nulls must be supported in a consistent and systematic way, representing missing or
inapplicable information.
This includes proper treatment of nulls during comparisons, arithmetic operations, and
constraints.
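The systematic treatment Rule 3 demands is visible in SQL's three-valued logic: comparing anything with NULL yields unknown, so only IS NULL matches missing values. A small SQLite demonstration (illustrative data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (Name TEXT, Phone TEXT)")
conn.executemany("INSERT INTO Person VALUES (?, ?)",
                 [("Asha", "555-0101"), ("Ravi", None)])

# '= NULL' never matches: the comparison evaluates to unknown, not true.
eq_null = conn.execute("SELECT Name FROM Person WHERE Phone = NULL").fetchall()

# 'IS NULL' is the systematic way to find missing values.
is_null = conn.execute("SELECT Name FROM Person WHERE Phone IS NULL").fetchall()

print(eq_null, is_null)  # [] [('Ravi',)]
```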
Rule 4: Dynamic Online Catalog Based on the Relational Model
The database must include a data dictionary (catalog) that is accessible to authorized users
using SQL queries.
This means even metadata about tables and columns should be stored as tables, not hidden
behind proprietary systems.
Rule 5: Comprehensive Data Sublanguage Rule
The system must support at least one relational language that:
• Has a linear syntax
• Can be used for data definition, manipulation, constraints, and transactions
• Can be embedded in host programming languages
SQL is widely used to meet this requirement.
Rule 6: View Updating Rule
All views that are theoretically updatable should be updatable through the system.
Views (virtual tables) must not just be read-only—they should support insert, update, and
delete operations wherever logically possible.
Rule 7: High-Level Insert, Update, and Delete
The system must support high-level operations for sets of rows, not just row-by-row
processing.
This is essential for performance and simplicity—allowing batch operations on groups of
data.
Rule 8: Physical Data Independence
Changes in the physical storage of data should not require changes in the application or
schema.
For example, moving a table to a new disk or using a different storage format shouldn’t
impact how users interact with it.
Rule 9: Logical Data Independence
Changes in the logical structure (such as splitting a table) should not affect user views or
applications.
This is more difficult to achieve than physical independence but essential for scalability and
maintainability.
Rule 10: Integrity Independence
Integrity constraints (like primary keys or foreign keys) should be stored in the catalog and
not hardcoded into applications.
This ensures that the system, not just the developers, enforces data rules consistently.
Rule 11: Distribution Independence
A user should not need to know whether the data is distributed across multiple locations.
This is the principle behind distributed databases—the system must handle data location
transparently.
Rule 12: Nonsubversion Rule
If the system has low-level access mechanisms, they must not bypass integrity constraints or
relational rules.
In other words, there should be no backdoors—every operation must obey the same rules.
Summary of Codd’s Rules
Rule No. | Description
0        | Foundation Rule (must support relational model)
1        | All data stored in tables
2        | Guaranteed access via table, key, and column
3        | Proper treatment of null values
4        | Self-describing catalog
5        | Comprehensive relational language (e.g., SQL)
6        | Updatable views
7        | Set-level operations
8        | Physical data independence
9        | Logical data independence
10       | Integrity constraints stored in catalog
11       | Transparency in data distribution
12       | No subversion through low-level operations
Real-World Relevance
While no RDBMS fully complies with all 13 rules (including Rule 0), the rules offer a gold
standard for evaluating database systems. Oracle, MySQL, PostgreSQL, and SQL Server each
implement many of them to varying degrees.
Database Concept: Transaction Management, Properties of a Transaction, Commit and
Rollback, Concurrency, and Locking
1. Transaction Management
A transaction is a sequence of one or more database operations (like INSERT, UPDATE, or
DELETE) that are executed as a single unit of work. The purpose of a transaction is to ensure
that data remains accurate and consistent, even when multiple users are working with the
same database at the same time.
Think of a transaction as a complete business task. For example, in a banking system, when
money is transferred from Account A to Account B, both the debit from Account A and the
credit to Account B must happen together. If only one occurs, the database becomes
inconsistent.
Example of a transaction:
BEGIN TRANSACTION;
UPDATE accounts SET balance = balance - 10000 WHERE acc_no = 'A123';
UPDATE accounts SET balance = balance + 10000 WHERE acc_no = 'B456';
COMMIT;
This represents a money transfer. If any step fails (e.g., Account B does not exist), the
transaction should not be completed at all.
2. Properties of a Transaction (ACID Properties)
To ensure safe and reliable database operations, every transaction must follow ACID
properties:
A – Atomicity
• All steps in a transaction must either complete entirely or not at all.
• If something fails midway, the transaction is rolled back.
• This prevents partial updates to the database.
Example: If money is deducted from one account but not added to another, atomicity
ensures the deduction is canceled.
C – Consistency
• The transaction must ensure the database remains valid before and after execution.
• Business rules, constraints, and relationships must be preserved.
Example: If a bank balance must never be negative, a transaction that violates this is
rejected.
I – Isolation
• Each transaction should act as if it’s the only one running.
• No transaction should interfere with another’s data.
Example: Two people withdrawing from the same ATM should not read and update the
same balance simultaneously.
D – Durability
• Once a transaction is committed, its changes are permanent, even if the system
crashes.
• Data is safely written to disk or non-volatile memory.
Example: Once money is transferred and confirmed, it will not be lost even during a power
failure.
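The ACID properties above can be observed directly in code. The following is a minimal sketch using Python's built-in sqlite3 module; the accounts table mirrors the transfer example earlier, with an added CHECK constraint (an assumption for this illustration) so that balances can never go negative:

```python
import sqlite3

# Atomicity sketch: a transfer where the credit succeeds but the debit fails.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (acc_no TEXT PRIMARY KEY, "
             "balance INTEGER CHECK (balance >= 0))")
conn.execute("INSERT INTO accounts VALUES ('A123', 50000), ('B456', 20000)")
conn.commit()

try:
    # Credit B456 first, then attempt a debit larger than A123's balance.
    conn.execute("UPDATE accounts SET balance = balance + 60000 WHERE acc_no = 'B456'")
    conn.execute("UPDATE accounts SET balance = balance - 60000 WHERE acc_no = 'A123'")
    conn.commit()
except sqlite3.IntegrityError:
    conn.rollback()  # atomicity: the successful credit is undone along with the failure

balances = dict(conn.execute("SELECT acc_no, balance FROM accounts"))
print(balances)  # {'A123': 50000, 'B456': 20000} — no partial update survives
```

Even though the credit to B456 had already been applied, the rollback removes it, so the database is exactly as it was before the transaction began.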
3. Commit and Rollback
These are commands used to finalize or cancel transactions.
COMMIT
• Saves the changes made during the transaction.
• After a commit, changes become permanent.
Example:
COMMIT;
Use this after successful completion of all steps.
ROLLBACK
• Cancels all changes made in the current transaction.
• The database returns to its previous consistent state.
Example:
ROLLBACK;
Used when an error or failure occurs.
Real-World Example:
Consider an online shopping system:
• A user adds items to a cart (inventory is updated),
• Proceeds to payment,
• Then confirms the order.
If payment fails, all changes must be rolled back. If successful, they must be committed.
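The shopping scenario above maps naturally onto a try/except pattern: commit on success, roll back on any failure. Here is a sketch with sqlite3; the inventory and orders tables and the payment check are illustrative assumptions, not part of the original example:

```python
import sqlite3

def place_order(conn, item, qty, payment_ok):
    """Commit the order only if every step, including payment, succeeds."""
    try:
        conn.execute("UPDATE inventory SET stock = stock - ? WHERE item = ?", (qty, item))
        conn.execute("INSERT INTO orders (item, qty) VALUES (?, ?)", (item, qty))
        if not payment_ok:          # stand-in for a real payment-gateway call
            raise RuntimeError("payment declined")
        conn.commit()               # COMMIT: the changes become permanent
        return "rolled back" if False else "committed"
    except Exception:
        conn.rollback()             # ROLLBACK: inventory and orders are restored
        return "rolled back"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (item TEXT PRIMARY KEY, stock INTEGER)")
conn.execute("CREATE TABLE orders (item TEXT, qty INTEGER)")
conn.execute("INSERT INTO inventory VALUES ('book', 10)")
conn.commit()

r1 = place_order(conn, 'book', 2, payment_ok=False)  # payment fails
r2 = place_order(conn, 'book', 2, payment_ok=True)   # payment succeeds
print(r1, r2)  # rolled back committed — stock drops only for the committed order
```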
4. Concurrency
Concurrency occurs when multiple users or systems access the same data at the same time.
While concurrency improves system performance and responsiveness, it can also cause data
conflicts if not managed correctly.
Problems Due to Concurrency:
1. Lost Updates:
o Two transactions update the same data.
o One update overwrites the other without knowing.
2. Dirty Read:
o A transaction reads data that is not yet committed.
o If the original transaction rolls back, the read data becomes invalid.
3. Non-Repeatable Read:
o A transaction reads the same data twice and gets different results because
another transaction changed it in between.
4. Phantom Read:
o A transaction executes a query, then re-executes it and sees new rows
inserted by another transaction.
Example of a Problem:
Two people trying to book the last seat on a flight:
• Both see it's available.
• Both try to book.
• Without proper control, both could be allowed to book, creating errors.
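The lost-update problem in the flight-booking example can be simulated step by step in plain Python (no real database, just two interleaved "transactions" sharing one value):

```python
# Lost update: both transactions read the seat count before either writes,
# so the second write silently overwrites the first.
seats = {"flight_101": 1}
bookings = 0

read_t1 = seats["flight_101"]   # T1 reads: 1 seat free
read_t2 = seats["flight_101"]   # T2 reads: 1 seat free (no lock prevents this)

if read_t1 > 0:                 # T1 books the seat
    seats["flight_101"] = read_t1 - 1
    bookings += 1
if read_t2 > 0:                 # T2 also books, using its stale read
    seats["flight_101"] = read_t2 - 1
    bookings += 1

print(bookings, seats["flight_101"])  # 2 bookings, yet the count fell by only 1
```

Two passengers were allowed to book a single seat, which is exactly the error that locking (covered next) is designed to prevent.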
5. Locking
Locking is a mechanism used to manage concurrency by controlling access to data. It
ensures that two transactions do not modify the same data at the same time.
Types of Locks:
1. Shared Lock (Read Lock):
o Allows multiple users to read data.
o No one can modify the data while it's locked for reading.
2. Exclusive Lock (Write Lock):
o Only one transaction can read and write the data.
o Others must wait until the lock is released.
Two-Phase Locking (2PL):
A popular strategy where:
• In the growing phase, a transaction can acquire locks but not release them.
• In the shrinking phase, it can release locks but not acquire new ones.
This helps maintain serializability, meaning transactions behave as if executed one after
another.
Deadlocks:
A deadlock occurs when two transactions wait for each other’s resources forever.
Example:
• Transaction 1 locks Table A and waits for Table B.
• Transaction 2 locks Table B and waits for Table A.
The system must detect and resolve this by aborting one transaction.
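Deadlock detection is commonly done by building a wait-for graph and looking for a cycle. The following is a simplified sketch (a real DBMS's detector is more elaborate): an edge T1 → T2 means "T1 is waiting for a lock held by T2", and a depth-first search reports any cycle:

```python
# Deadlock detection sketch: find a cycle in the wait-for graph with DFS.
def has_deadlock(wait_for):
    visited, on_stack = set(), set()

    def dfs(txn):
        visited.add(txn)
        on_stack.add(txn)
        for holder in wait_for.get(txn, []):
            # A back edge to a transaction still on the stack means a cycle.
            if holder in on_stack or (holder not in visited and dfs(holder)):
                return True
        on_stack.discard(txn)
        return False

    return any(dfs(t) for t in wait_for if t not in visited)

# Transaction 1 waits for Transaction 2, which waits for Transaction 1: deadlock.
print(has_deadlock({"T1": ["T2"], "T2": ["T1"]}))  # True
print(has_deadlock({"T1": ["T2"], "T2": []}))      # False — T2 can finish first
```

Once a cycle is found, the system picks a victim transaction and aborts it, releasing its locks so the others can proceed.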
UNIT IV – Advanced Concepts in Database Management Systems
1. Introduction
In the previous units, we explored the foundational aspects of databases, including their
structures, models, and design principles. As we delve deeper into the complexities of
Database Management Systems (DBMS), it becomes imperative to understand advanced
concepts that ensure data remains accurate, secure, and efficiently managed.
This unit focuses on critical areas such as Data Integrity, which ensures the accuracy and
consistency of data; Auditing, which monitors and records database activities; Backup and
Recovery, essential for data preservation and restoration; and components like the Data
Dictionary and System Catalogue, which store metadata about the database.
Additionally, we'll introduce Distributed Databases, which allow data distribution across
multiple locations, and discuss Client-Server Architecture along with ODBC (Open Database
Connectivity), which facilitates interactions between clients and database servers.
2. Data Integrity
Data Integrity refers to the accuracy, consistency, and reliability of data stored in a database.
It ensures that the data remains unaltered during operations such as transfer, storage, or
retrieval. Maintaining data integrity is crucial for making informed business decisions and
ensuring the credibility of organizational data.
2.1 Types of Data Integrity
1. Entity Integrity: Ensures that each table has a unique identifier, known as the
Primary Key, and that this key is not null. This guarantees that each record within a
table is unique and identifiable.
2. Referential Integrity: Maintains the consistency of relationships between tables. It
ensures that a Foreign Key in one table corresponds to a Primary Key in another,
preventing orphan records and preserving the logical connections between data.
3. Domain Integrity: Enforces valid entries for a given column by restricting the type,
format, or range of values. For example, a column designated for ages should only
accept numerical values within a plausible range.
4. User-Defined Integrity: Pertains to rules and constraints defined by users to meet
specific business requirements. These are custom rules that ensure data adheres to
the unique needs of an organization.
2.2 Importance of Data Integrity
• Accuracy: Ensures that data reflects real-world scenarios correctly.
• Consistency: Maintains uniformity of data across the database.
• Reliability: Provides dependable data for analysis and decision-making.
• Security: Protects data from unauthorized access and malicious alterations.
3. Integrity Constraints
Integrity Constraints are rules applied to database tables to enforce data integrity. They act
as safeguards, ensuring that the data entered into the database adheres to predefined rules
and standards.
3.1 Types of Integrity Constraints
1. Primary Key Constraint: Ensures that a column (or a combination of columns)
uniquely identifies each row in a table. For example, a StudentID in a Students table
serves as a unique identifier for each student.
2. Foreign Key Constraint: Establishes a link between two tables by ensuring that a
column in one table corresponds to a primary key in another. This maintains
referential integrity between related tables.
3. Unique Constraint: Ensures that all values in a column are distinct across the table,
preventing duplicate entries.
4. Not Null Constraint: Specifies that a column cannot have a null value, ensuring that
essential data is always provided.
5. Check Constraint: Validates that the values in a column meet a specific condition. For
instance, a Check Constraint can ensure that the values in an Age column are greater
than zero.
3.2 Implementing Integrity Constraints
Implementing these constraints involves defining them during the creation or alteration of
tables using Structured Query Language (SQL). By enforcing these rules, databases can
prevent invalid data entry and maintain overall data quality.
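All five constraint types can be declared in a single table definition. The sketch below uses SQLite through Python; the Students/Departments schema is illustrative, and note that SQLite enforces foreign keys only when the `foreign_keys` pragma is enabled:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite: FK enforcement is opt-in
conn.executescript("""
CREATE TABLE departments (
    dept_id INTEGER PRIMARY KEY                       -- Primary Key constraint
);
CREATE TABLE students (
    student_id INTEGER PRIMARY KEY,                   -- Primary Key constraint
    email      TEXT UNIQUE,                           -- Unique constraint
    name       TEXT NOT NULL,                         -- Not Null constraint
    age        INTEGER CHECK (age > 0),               -- Check constraint
    dept_id    INTEGER REFERENCES departments(dept_id) -- Foreign Key constraint
);
""")
conn.execute("INSERT INTO departments VALUES (1)")
conn.execute("INSERT INTO students VALUES (101, 'a@x.com', 'Rohan', 21, 1)")

try:
    # Violates the Check constraint: age must be greater than zero.
    conn.execute("INSERT INTO students VALUES (102, 'b@x.com', 'Priya', -5, 1)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # the invalid row never enters the table
```

The DBMS, not the application, rejects the bad row, which is exactly the safeguard integrity constraints are meant to provide.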
4. Auditing in Databases
Auditing in the context of databases refers to the systematic tracking and recording of
database activities. This process is vital for monitoring user actions, ensuring compliance
with policies, and detecting unauthorized or malicious activities.
4.1 Objectives of Database Auditing
• Accountability: Tracks user activities to hold individuals responsible for their actions.
• Compliance: Ensures adherence to internal policies and external regulations.
• Security Monitoring: Detects unauthorized access or anomalies that may indicate
security breaches.
• Forensic Analysis: Provides a trail of activities that can be analyzed in the event of an
incident.
4.2 Auditing Mechanisms
1. Statement Auditing: Monitors specific SQL statements executed within the database.
2. Privilege Auditing: Tracks the use of system privileges and roles by users.
3. Object Auditing: Focuses on access and modifications to specific database objects
like tables or views.
4. User Auditing: Monitors activities performed by specific users or groups.
4.3 Implementing Database Auditing
Implementing auditing involves configuring the DBMS to record specified activities. This can
be achieved through built-in auditing features provided by the DBMS or by integrating
external auditing tools. It's essential to balance the depth of auditing with system
performance considerations.
5. Backup and Recovery
Backup and Recovery are critical processes that ensure data availability and resilience in the
face of hardware failures, software issues, or unforeseen disasters.
5.1 Backup Strategies
1. Full Backup: Captures the entire database, providing a complete copy. It is typically
performed during off-peak hours due to the time it takes.
2. Incremental Backup:
Only backs up data that has changed since the last backup (either full or
incremental). This approach saves time and storage space but requires more time to
restore, as it must apply changes in sequence.
3. Differential Backup:
Backs up all changes made since the last full backup. It strikes a balance between full
and incremental backups, offering quicker restores than incremental.
4. Logical Backup:
Extracts schema and data using SQL statements like CREATE, INSERT, etc. It's usually
used for smaller databases or when migrating between systems.
5. Physical Backup:
Copies raw database files directly from disk. This method is faster for large systems
but requires consistent file-level snapshots to avoid corruption.
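A logical backup can be sketched with SQLite's `iterdump()`, which emits the CREATE and INSERT statements needed to recreate the database; replaying them into a fresh connection acts as the restore. This is only a small illustration of the idea, not a production backup procedure:

```python
import sqlite3

# Logical backup: dump the schema and data as plain SQL text.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
src.execute("INSERT INTO products VALUES (1, 'Laptop')")
src.commit()

dump_sql = "\n".join(src.iterdump())   # the backup is just SQL statements

# Restore: replay the statements into a new, empty database.
restored = sqlite3.connect(":memory:")
restored.executescript(dump_sql)
rows = restored.execute("SELECT name FROM products").fetchall()
print(rows)  # [('Laptop',)]
```

Because the backup is SQL text, it is portable across systems, which is why logical backups are often used for migrations, as the section notes.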
5.2 Recovery Techniques
1. Rollforward (Redo Logs):
Used when committed transactions need to be re-applied to the data after a system
failure. This brings the database up to the state it was in before the crash.
2. Rollback (Undo Logs):
Used to undo uncommitted changes made by incomplete transactions during a crash,
restoring the data to its last consistent state.
3. Point-in-Time Recovery (PITR):
Restores the database to a specific point in time, useful in scenarios such as
accidental deletion or malicious changes.
4. Cold Backup & Hot Backup:
• Cold backup is taken when the database is shut down.
• Hot backup occurs while the database is running, and is suitable for 24/7 systems.
5.3 Importance of Backup and Recovery
• Business Continuity: Prevents data loss that could halt operations.
• Disaster Recovery: Protects against cyberattacks, hardware failures, or natural
disasters.
• Regulatory Compliance: Ensures organizations meet data retention laws and
standards.
• User Error Correction: Allows recovery from unintended deletions or changes.
6. Data Dictionary and System Catalogue
Understanding the structure and rules of the database is essential for both users and the
DBMS. This is made possible by the Data Dictionary and the System Catalogue.
6.1 Data Dictionary
The Data Dictionary is a centralized repository that holds metadata—information about the
database itself. It defines how data is organized, the relationships among various data
elements, and what constraints apply.
What it Contains:
• Names and descriptions of all tables, views, columns
• Data types and sizes
• Integrity constraints
• User access privileges
• Indexes and relationships
Why It's Important:
• Acts as a reference for developers and DBAs
• Supports query processing and validation
• Helps with troubleshooting and optimization
6.2 System Catalogue
The System Catalogue is a collection of tables and views maintained by the DBMS. It is the
DBMS’s internal database that stores metadata and system-level data.
It Includes:
• User and role definitions
• Storage allocation details
• Query execution statistics
• Audit logs (in some systems)
The system catalogue is essential for efficient query planning, resource management, and
system health monitoring.
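A concrete (if small) example of a system catalogue is SQLite's `sqlite_master` table, which records every table, index, and view, while `PRAGMA table_info` serves as a minimal data-dictionary lookup for column metadata:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("CREATE VIEW emp_names AS SELECT name FROM employees")

# The catalogue is itself queryable with ordinary SQL (Codd's Rule 4).
rows = conn.execute("SELECT type, name FROM sqlite_master ORDER BY name").fetchall()
print(rows)

# Column-level metadata: name of every column in the employees table.
cols = [row[1] for row in conn.execute("PRAGMA table_info(employees)")]
print(cols)  # ['id', 'name']
```

Larger systems expose the same idea through richer catalogues (for example, `information_schema` views in MySQL, PostgreSQL, and SQL Server).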
7. Introduction to Distributed Databases
In today's interconnected world, organizations often operate from multiple locations. To
manage data efficiently in such setups, Distributed Databases are used.
7.1 What is a Distributed Database?
A Distributed Database is a collection of logically related databases distributed over a
computer network. Although the data is spread across multiple physical locations, it appears
to users as a single unified database.
7.2 Features of Distributed Databases
• Data Distribution: Data is stored at multiple sites.
• Transparency: Users don’t need to know the location of the data.
• Concurrency Control: Supports multiple users accessing distributed data.
• Fault Tolerance: If one node fails, the rest can still function.
7.3 Types of Data Distribution
1. Horizontal Fragmentation:
Data rows are distributed across different sites. For example, sales data for different
regions stored in separate servers.
2. Vertical Fragmentation:
Different columns of a table are stored in different locations. For example, customer
name in one location, contact info in another.
3. Replication:
Copies of the same data are stored in multiple locations to improve availability and
performance.
7.4 Advantages of Distributed Databases
• Improved Reliability: A failure in one site doesn’t stop operations elsewhere.
• Faster Local Access: Users can access nearby servers instead of remote ones.
• Scalability: Easy to add new nodes as data grows.
• Cost Efficiency: Utilizes local computing power and reduces network traffic.
7.5 Challenges of Distributed Databases
• Complexity: Difficult to design and maintain.
• Data Synchronization: Keeping all data copies consistent.
• Security: Needs consistent security policies across all locations.
• Query Optimization: More challenging than in centralized systems.
8. Introduction to Client–Server and ODBC Connectivity
Modern applications do not directly connect to the database files. Instead, they follow a
client-server model, and in many systems, ODBC (Open Database Connectivity) is used to
communicate with the database server.
8.1 What is Client–Server Architecture?
Client-server is a network-based computing model where:
• The client (user-side application) sends requests.
• The server (hosted DBMS) processes requests and sends back responses.
Components:
• Client: Front-end interface such as a web browser, CRM system, or mobile app.
• Server: Back-end database engine like Oracle, MySQL, or PostgreSQL.
8.2 Benefits of Client–Server Model
• Centralized Control: All data and rules are stored and enforced on the server.
• Efficiency: Reduces client-side processing.
• Scalability: Easily supports hundreds of users.
• Security: Server handles access control and permissions centrally.
8.3 What is ODBC (Open Database Connectivity)?
ODBC is a standard API that allows different software applications to connect to various
databases without depending on database-specific interfaces. An ODBC driver translates the
application's standard calls into the commands a particular DBMS understands, so the same
application code can work against different database products.
8.4 Features of ODBC Connectivity
• Uniform Access: Enables business applications to connect to multiple types of
databases using a standard interface.
• Platform Independence: Works across different operating systems and hardware.
• Database Independence: One application can communicate with Oracle, MySQL, or
SQL Server without needing separate interfaces.
• Ease of Integration: Helps connect ERP, CRM, or BI tools to databases smoothly.
8.5 Why ODBC is Important for Business Applications
• Reduces the cost and time to build systems that support multiple databases.
• Supports enterprise-level systems with multiple data sources.
• Makes reporting, analytics, and data migration easier and more reliable.
UNIT V – Advanced Database Concepts and Data Analysis Techniques
1. Introduction to SQL
Structured Query Language (SQL) is the standard language used to communicate with
relational database management systems (RDBMS). It enables users to create, manipulate,
and retrieve data stored in databases. SQL is essential for managing structured data and is
widely used in various applications, from small-scale systems to large enterprise solutions.
1.1 Components of SQL
SQL can be broadly categorized into several components:
• Data Definition Language (DDL): Commands that define the structure of the
database, such as CREATE, ALTER, and DROP.
• Data Manipulation Language (DML): Commands that manage data within the
database, including SELECT, INSERT, UPDATE, and DELETE.
• Data Control Language (DCL): Commands that control access to data, such as GRANT
and REVOKE.
• Transaction Control Language (TCL): Commands that manage transactions, including
COMMIT, ROLLBACK, and SAVEPOINT.
2. Data Manipulation Language (DML) Commands
DML commands are used to interact with data stored in the database. They allow users to
retrieve, insert, modify, and delete data.
2.1 SELECT
The SELECT statement retrieves data from one or more tables. It can be used to filter, sort,
and display data based on specific criteria.
Example:
SELECT first_name, last_name, department
FROM employees
WHERE department = 'Sales'
ORDER BY last_name;
This query retrieves the first and last names of employees in the Sales department, ordered
by their last names.
2.2 INSERT
The INSERT statement adds new records to a table.
Example:
INSERT INTO employees (first_name, last_name, department, hire_date)
VALUES ('John', 'Doe', 'Marketing', '2025-04-02');
This command adds a new employee named John Doe to the Marketing department with a
hire date of April 2, 2025.
2.3 UPDATE
The UPDATE statement modifies existing records in a table.
Example:
UPDATE employees
SET department = 'Human Resources'
WHERE employee_id = 101;
This command updates the department of the employee with ID 101 to Human Resources.
2.4 DELETE
The DELETE statement removes records from a table.
Example:
DELETE FROM employees
WHERE department = 'Temporary';
This command deletes all employees who are in the Temporary department.
3. Relational Algebra and SQL
Relational algebra is a formal system for manipulating relations (tables) in a database. It
provides a set of operations that take one or more relations as input and produce a new
relation as output. SQL is based on the principles of relational algebra, translating its
operations into practical commands for data retrieval and manipulation.
3.1 Basic Operations in Relational Algebra
• Selection (σ): Retrieves rows from a relation that satisfy a given condition.
• Projection (π): Retrieves specific columns from a relation.
• Union (∪): Combines the results of two relations, removing duplicates.
• Set Difference (-): Returns rows that are in one relation but not in another.
• Cartesian Product (×): Combines all rows from two relations.
• Join (⨝): Combines rows from two relations based on a related attribute.
3.2 Translating Relational Algebra to SQL
SQL implements the operations of relational algebra through its commands. For example:
• Selection: Implemented using the WHERE clause in SQL.
• Projection: Implemented by specifying columns in the SELECT statement.
• Join: Implemented using JOIN clauses in SQL.
Example:
SELECT first_name, last_name
FROM employees
WHERE department = 'Finance';
This SQL query performs both selection (employees in the Finance department) and
projection (retrieving only first and last names).
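To make the correspondence concrete, selection (σ) and projection (π) can be sketched directly over Python lists of dictionaries, mirroring what the WHERE clause and the SELECT column list do in SQL. The employee data here is illustrative:

```python
# Relational algebra over plain Python rows: σ as a filter, π as a column pick.
employees = [
    {"first_name": "Asha",  "last_name": "Rao",   "department": "Finance"},
    {"first_name": "Rohan", "last_name": "Mehta", "department": "Sales"},
]

def select(relation, predicate):        # σ: keep rows satisfying the condition
    return [row for row in relation if predicate(row)]

def project(relation, columns):         # π: keep only the named columns
    return [{c: row[c] for c in columns} for row in relation]

# Equivalent of: SELECT first_name, last_name FROM employees WHERE department = 'Finance'
result = project(select(employees, lambda r: r["department"] == "Finance"),
                 ["first_name", "last_name"])
print(result)  # [{'first_name': 'Asha', 'last_name': 'Rao'}]
```

Reading the composed call from the inside out (select first, then project) matches how the SQL query is logically evaluated.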
4. Security and Integrity Violations
Ensuring the security and integrity of data is paramount in database management.
Violations can lead to unauthorized access, data corruption, and loss of trust.
4.1 Common Security Violations
• Unauthorized Access: When individuals gain access to data without proper
authorization.
• Data Breaches: Exposure of sensitive data to unauthorized parties.
• SQL Injection: A code injection technique that can destroy or manipulate databases.
4.2 Common Integrity Violations
• Data Redundancy: Unnecessary duplication of data leading to inconsistencies.
• Data Inconsistency: Conflicting information stored in different parts of the database.
• Referential Integrity Violations: When foreign keys do not correctly reference
primary keys, leading to orphan records.
5. Authorization and Granting of Privileges
In any database system, managing who can access what is a critical task. This is done
through authorization and privilege management. It ensures that only legitimate users can
view or manipulate the data they are allowed to.
5.1 What is Authorization?
Authorization refers to the process of assigning permissions to users for accessing specific
database resources like tables, views, or procedures. It is part of database security and
access control mechanisms.
For example:
• A data entry clerk may only have permission to insert and update data.
• A manager may have access to view, edit, and delete records.
• An auditor might only be allowed to read but not change any data.
5.2 Granting Privileges
SQL provides the GRANT command to give permissions to users.
Example:
GRANT SELECT, INSERT ON employees TO user1;
This command allows the user named user1 to select (read) and insert (add) data in the
employees table.
5.3 Revoking Privileges
To remove access rights from a user, SQL uses the REVOKE command.
Example:
REVOKE INSERT ON employees FROM user1;
This command removes the INSERT privilege, but user1 can still SELECT from the table unless
that is also revoked.
5.4 Roles and Privilege Groups
For larger systems, managing individual user rights can be complex. Hence, roles (groups of
privileges) are created. Users are then assigned roles instead of individual permissions.
Example:
• Role: HR_Manager
• Privileges: SELECT, UPDATE, DELETE on employee_records
• Assigning role to user:
GRANT HR_Manager TO user2;
6. Security Specification in SQL
Database systems must implement a range of security measures to protect sensitive data.
These include:
6.1 Access Control
Access control ensures that only authorized users can perform permitted operations.
• Discretionary Access Control (DAC): Based on user identity and their privileges.
• Mandatory Access Control (MAC): Based on predefined security labels for both users
and data.
6.2 User Authentication
Users must log in with valid credentials before accessing the system. Authentication can
include:
• Username and password
• Token-based access
• Biometric authentication (in high-security systems)
6.3 View-Level Security
SQL provides views to limit user access to sensitive columns or rows.
Example:
CREATE VIEW view_employee AS
SELECT name, department
FROM employees
WHERE department = 'Finance';
Now, a user with access to view_employee will only see employees from the Finance
department.
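The same view can be exercised end to end with SQLite; the sample table below adds a salary column (an assumption for this illustration) to show that the view hides it along with the non-Finance rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [("Asha", "Finance", 90000), ("Rohan", "Sales", 70000)])
conn.execute("""CREATE VIEW view_employee AS
                SELECT name, department FROM employees
                WHERE department = 'Finance'""")

rows = conn.execute("SELECT * FROM view_employee").fetchall()
print(rows)  # [('Asha', 'Finance')] — Sales rows and the salary column are hidden
```

Granting a user access to `view_employee` instead of `employees` is how view-level security restricts both which rows and which columns that user can ever see.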
6.4 Auditing and Logs
Database systems maintain logs of all activities, including user login times, query histories,
and changes made. This helps in tracking misuse, investigating issues, and ensuring
compliance with legal regulations.
7. Introduction to Data Warehousing
As organizations generate large volumes of data from multiple sources, it becomes essential
to consolidate, store, and analyze this information for strategic decision-making. This leads
to the concept of a Data Warehouse.
7.1 What is a Data Warehouse?
A data warehouse is a central repository where data from various sources (like CRM, ERP,
sales, HR systems) is collected, cleaned, and stored. It is designed specifically for querying
and analysis, not for transaction processing.
7.2 Characteristics of a Data Warehouse
• Subject-Oriented: Organized around key business areas (e.g., sales, finance).
• Integrated: Data from multiple sources is combined into a unified format.
• Non-Volatile: Once entered, data is not usually updated or deleted.
• Time-Variant: Keeps historical data for trend analysis.
7.3 Data Warehouse vs. Database
Feature     | Database              | Data Warehouse
Purpose     | Daily operations      | Analytical decision-making
Data Type   | Current, real-time    | Historical, summarized
Design      | Transactional (OLTP)  | Analytical (OLAP)
Data Volume | Moderate              | Very large
Users       | Clerical staff        | Managers, analysts
8. Multidimensional Data Models
To support analysis in data warehouses, data is often represented in a multidimensional
format—just like an Excel pivot table.
8.1 Facts and Dimensions
• Fact Table: Contains numeric values (facts) such as sales amount, revenue, or units
sold.
• Dimension Table: Contains descriptive data such as customer name, region, time, or
product category.
8.2 Star Schema
The most common model is the Star Schema:
• Central fact table linked to multiple dimension tables.
• Easy to understand and supports fast queries.
Example:
A fact table sales may connect to dimensions like product, time, and region.
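The star schema from the example can be built and queried in a few lines; table and column names here are illustrative. A typical analytical query aggregates the fact table's measures while labelling them with dimension attributes:

```python
import sqlite3

# Star schema sketch: one fact table joined to two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_region  (region_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales  (product_id INTEGER, region_id INTEGER, amount INTEGER);
INSERT INTO dim_product VALUES (1, 'Laptop'), (2, 'Phone');
INSERT INTO dim_region  VALUES (1, 'North'), (2, 'South');
INSERT INTO fact_sales  VALUES (1, 1, 500), (1, 2, 300), (2, 1, 200);
""")

# Total sales per product across all regions: a classic star-schema query.
rows = conn.execute("""
    SELECT p.name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
print(rows)  # [('Laptop', 800), ('Phone', 200)]
```

Because every dimension joins directly to the central fact table, such queries stay simple, which is the main appeal of the star schema.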
8.3 Snowflake Schema
An extension of star schema where dimension tables are further normalized into sub-
dimensions. It saves space but makes queries more complex.
9. Data Warehouse Architecture
A standard data warehouse architecture includes:
9.1 Data Sources
Data comes from:
• Internal systems: ERP, CRM, HRMS
• External sources: Web, cloud, suppliers
9.2 ETL Process (Extract, Transform, Load)
• Extract: Collect data from source systems
• Transform: Clean, validate, and convert data to a standard format
• Load: Store it in the warehouse
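The three ETL steps can be sketched as a toy pipeline; the source rows, cleanup rules, and warehouse table here are illustrative assumptions:

```python
import sqlite3

# Extract: pull raw rows from a source system (simulated as a list).
source_rows = [("rohan mehta", "1200.50"), ("priya shah", "980.00")]

# Transform: standardise the name casing and convert amounts to integers.
transformed = [(name.title(), round(float(amount)))
               for name, amount in source_rows]

# Load: insert the cleaned rows into the warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (customer TEXT, amount INTEGER)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?)", transformed)

rows = warehouse.execute("SELECT * FROM sales").fetchall()
print(rows)  # [('Rohan Mehta', 1200), ('Priya Shah', 980)]
```

Real ETL tools add scheduling, error handling, and validation around the same three stages.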
9.3 Staging Area
A temporary storage where raw data is processed before being loaded into the warehouse.
9.4 Data Storage and Access
• Data Marts: Subsections of the warehouse tailored for specific departments (e.g.,
marketing, finance).
• Metadata: Stores information about the structure and origin of the warehouse data.
10. ROLAP, MOLAP, HOLAP
Data can be analyzed using different OLAP (Online Analytical Processing) technologies:
10.1 ROLAP (Relational OLAP)
• Uses relational databases
• Good for large data volumes
• Slower because of complex queries
10.2 MOLAP (Multidimensional OLAP)
• Uses pre-aggregated data in cubes
• Very fast for querying
• Requires more storage
10.3 HOLAP (Hybrid OLAP)
• Combines the benefits of both
• Stores summary data in MOLAP and detailed data in ROLAP
11. OLAP vs. OLTP
Feature   | OLAP                          | OLTP
Full Form | Online Analytical Processing  | Online Transaction Processing
Use       | Data analysis                 | Day-to-day operations
Data Type | Historical, summarized        | Current, detailed
Queries   | Complex, long-running         | Simple, short
Examples  | BI Reports, Forecasting       | ATM, POS Systems
12. Data Mining
12.1 What is Data Mining?
Data mining is the process of discovering patterns, trends, and relationships from large
datasets using techniques from machine learning, statistics, and database systems.
12.2 Applications
• Retail: Market basket analysis (e.g., people who buy bread also buy butter)
• Banking: Fraud detection and credit scoring
• Healthcare: Predict disease trends
• Marketing: Customer segmentation
13. Data Preprocessing
Before mining, data must be cleaned and formatted. This is known as Data Preprocessing.
13.1 Steps in Data Preprocessing
• Data Cleaning: Remove duplicates, fill missing values
• Data Integration: Merge data from various sources
• Data Transformation: Normalize or standardize data
• Data Reduction: Remove irrelevant or redundant data
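Three of the steps above (cleaning duplicates, filling missing values, and normalising) can be sketched on a tiny record set; the data and the mean-fill and min-max rules are illustrative choices:

```python
# Preprocessing sketch: de-duplicate, fill a missing value, then normalise.
raw = [{"id": 1, "age": 20}, {"id": 1, "age": 20},   # exact duplicate
       {"id": 2, "age": None}, {"id": 3, "age": 40}]

# Data Cleaning (1): drop exact duplicate records.
seen, cleaned = set(), []
for row in raw:
    key = tuple(sorted(row.items()))
    if key not in seen:
        seen.add(key)
        cleaned.append(dict(row))

# Data Cleaning (2): fill missing ages with the mean of the known ages.
known = [r["age"] for r in cleaned if r["age"] is not None]
mean_age = sum(known) / len(known)
for r in cleaned:
    if r["age"] is None:
        r["age"] = mean_age

# Data Transformation: min-max normalisation onto the [0, 1] range.
lo = min(r["age"] for r in cleaned)
hi = max(r["age"] for r in cleaned)
for r in cleaned:
    r["age_norm"] = (r["age"] - lo) / (hi - lo)

print(cleaned)
```

After this pass the data is duplicate-free, complete, and on a common scale, which is the state mining algorithms expect as input.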
14. Data Marts
14.1 What is a Data Mart?
A Data Mart is a smaller, more focused version of a data warehouse, built for specific
business units like marketing, HR, or sales.
14.2 Types of Data Marts
• Dependent: Gets data from the central warehouse
• Independent: Created directly from source systems
15. Cluster Analysis
Cluster analysis is a technique used in data mining to group similar data points into clusters.
15.1 Applications
• Customer segmentation based on behavior
• Market research to group similar products
• Anomaly detection in finance or security
15.2 Types of Clustering
• Hierarchical Clustering
• K-means Clustering
• Density-Based Clustering
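K-means is the easiest of these to sketch. The toy one-dimensional version below (k = 2, fixed iteration count) shows the two alternating steps of the algorithm; production work would use a library such as scikit-learn rather than this hand-rolled loop:

```python
# Minimal 1-D k-means sketch: assign points to the nearest centroid, then
# move each centroid to the mean of its cluster, and repeat.
def kmeans_1d(points, centroids, iterations=10):
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        for p in points:                                   # assignment step
            nearest = min((0, 1), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        for i in (0, 1):                                   # update step
            if clusters[i]:
                centroids[i] = sum(clusters[i]) / len(clusters[i])
    return centroids, clusters

centroids, clusters = kmeans_1d([1, 2, 3, 10, 11, 12], [0.0, 5.0])
print(sorted(centroids))  # [2.0, 11.0] — one centroid settles on each natural group
```

The same assign-then-update loop generalises to many dimensions by replacing the absolute difference with a distance function such as Euclidean distance.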
16. Decision Making with Data
Finally, the entire purpose of databases, warehousing, and mining is to help organizations
make better decisions.
16.1 Business Intelligence (BI)
BI tools convert data into visual dashboards, graphs, and reports, helping managers:
• Forecast future trends
• Understand past performance
• Improve efficiency
16.2 Data-Driven Strategy
Using data:
• Improves product offerings
• Enhances customer satisfaction
• Identifies cost-saving opportunities