DBMS Notes
1. Entity
o An entity is a real-world object or concept that can be uniquely identified and stored in a database.
2. Entity Type
o An entity type is a collection of entities that share the same attributes.
o Example: Student is an entity type, where individual students like John, Jane, etc., are
entities.
3. Attributes
o Attributes are the properties that describe an entity.
o Example: For the Student entity type, attributes could be Name, Roll No, Age, Address.
4. Key Attributes
o A key attribute uniquely identifies each entity within an entity type (e.g., Roll No for Student).
5. Types of Entity:
o Strong Entity
▪ Has its own primary key and exists independently.
o Weak Entity
▪ Does not have a primary key; instead, uses a foreign key combined with a partial
key.
6. Entity Set
o An entity set is a collection of entities of the same entity type at a given point in time.
o Entities in one set can participate in relationships with entities in other sets.
o Example: A Student entity can have a relationship with a Course entity via the Enrollment relationship.
Entity Set in DBMS
1. Definition
o An entity set is a collection of entities of the same type that share the same attributes.
o Example: All students in a university form a "Student" entity set, where each student is
an individual entity.
o Each entity in the set has attributes that define its properties.
o Example: In the "Student" entity set, attributes might include Name, Roll No, Age, and
Department.
3. Key Characteristics
▪ Instances: The actual data or objects (e.g., John, Roll No. 101).
4. Types of Entity Sets
▪ Strong Entity Set: Has a primary key of its own.
▪ Weak Entity Set: Does not have a primary key; uses a foreign key and a discriminator (partial key).
5. Representation in ER Diagram
Attributes in DBMS
1. Definition
o An attribute is a property or characteristic that describes an entity.
o Example: For the Student entity, attributes include Name, Roll Number, and Age.
2. Types of Attributes
o Simple Attribute
▪ Cannot be divided further (e.g., Age).
o Composite Attribute
▪ Can be divided into smaller sub-parts.
▪ Example: Full Name can be divided into First Name and Last Name.
o Derived Attribute
▪ Its value can be derived from other attributes (e.g., Age from Date of Birth).
o Single-Valued Attribute
▪ Holds a single value for each entity.
o Multi-Valued Attribute
▪ Can hold multiple values for an entity (e.g., Phone Numbers).
o Null Attribute
▪ May have no applicable or known value for some entities.
3. Representation in ER Diagrams
1. Definition
o A key is an attribute (or a set of attributes) used to uniquely identify a tuple (row) in a
relation (table).
2. Types of Keys
o Super Key
▪ Any set of one or more attributes that uniquely identifies a tuple.
o Candidate Key
▪ A minimal super key (no proper subset can uniquely identify tuples).
o Primary Key
▪ The candidate key chosen to identify tuples in the relation; it cannot be null.
o Alternate Key
▪ A candidate key that was not chosen as the primary key.
▪ Example: If Roll Number is the primary key, Email could be an alternate key.
o Foreign Key
▪ An attribute in one table that refers to the primary key in another table.
o Composite Key
▪ A key made up of two or more attributes that together uniquely identify a tuple.
o Unique Key
▪ Ensures all values in a column are unique but allows null values.
Relationships in DBMS
1. Definition
o A relationship is an association among two or more entities (e.g., a Student enrolls in a Course).
2. Representation in ER Diagrams
o A relationship is represented by a diamond connected to the participating entities.
3. Components of a Relationship
o The participating entities, the relationship set itself, and any attributes of the relationship.
4. Degree of a Relationship
o The number of entity types that participate in the relationship (unary, binary, ternary, etc.).
1. Based on Cardinality
o One-to-One (1:1)
▪ Each entity in one set is related to at most one entity in another set.
o One-to-Many (1:N)
▪ One entity in the first set is related to multiple entities in the second set.
o Many-to-Many (M:N)
▪ Many entities in one set are related to many entities in another set.
▪ Example: Students enroll in multiple Courses, and each Course has multiple
Students.
2. Based on Participation
o Total Participation
▪ Every entity in the entity set must participate in the relationship.
o Partial Participation
▪ Some entities in the entity set may not participate in the relationship.
o Identifying Relationship
▪ A relationship where a weak entity depends on a strong entity for its identity.
o Non-Identifying Relationship
▪ A relationship in which the participating entities can be identified independently of each other.
4. Recursive Relationship
o A relationship in which the same entity type participates more than once in different roles (e.g., an Employee supervises another Employee).
Example (relationship implementation): "Student enrolls in Course" becomes a table with columns like Student_ID and Course_ID.
Roles in DBMS
1. Definition
o A role is the function that a participating entity plays in a relationship (e.g., in Works_For, Employee plays the role of worker and Department the role of employer).
2. Types of Roles
o Explicit Roles
▪ Role names are explicitly labelled on the ER diagram.
o Implicit Roles
▪ Role names are implied by the entity types; explicit role names matter most in recursive relationships, e.g., Employee supervises Employee:
▪ Role 1: Supervisor.
▪ Role 2: Subordinate.
Structural Constraints in DBMS
o Structural constraints define the rules or restrictions on how entities can participate in a relationship.
o The two main structural constraints are:
▪ Cardinality
▪ Participation
1. Cardinality
• Specifies the number of instances of one entity that can or must be associated with instances of
another entity.
• Types of Cardinality:
o One-to-One (1:1)
o One-to-Many (1:N)
o Many-to-Many (M:N)
▪ Example: Students enroll in multiple Courses, and each Course has multiple
Students.
2. Participation
• Types of Participation:
o Total Participation
o Partial Participation
Example
o Entities: Employee and Department.
o Relationship: Works_For.
o Roles: Employee (worker), Department (employer).
o Structural Constraints: many-to-one cardinality (N:1) with total participation of Employee (every employee must work for a department).
Key Differences
• Cardinality limits how many entities can be associated across the relationship, while participation specifies whether every entity must take part.
• Example: Teacher and Course in a Teaches relationship with total participation.
Definition
• A weak entity is an entity type that cannot be uniquely identified by its attributes alone.
• Example: A Dependent entity (e.g., child) relies on an Employee entity (e.g., parent).
o They are identified using a partial key along with a foreign key from a related strong
entity.
3. Existence Dependency
o A weak entity is existence-dependent: it cannot exist in the database without the strong (owner) entity it is associated with.
4. Participation Constraint
o Weak entities have total participation in the relationship with the strong entity.
o This means every weak entity must be linked to at least one strong entity.
5. Partial Key (Discriminator)
o An attribute or set of attributes in a weak entity that uniquely identifies it within the
context of a strong entity.
Representation in ER Diagram
• Weak Entity: Drawn as a double rectangle.
• Identifying Relationship: Drawn as a double diamond connecting the weak entity to its owner (strong) entity.
• Primary Key:
o Combination of the partial key and the primary key of the strong entity.
1. Scenario:
2. Attributes:
3. Relationship:
Comparison with Strong Entities
• Primary Key: A strong entity has a unique primary key; a weak entity does not have a primary key of its own.
• Participation: A strong entity can have partial or total participation; a weak entity always has total participation.
The Enhanced ER Model (EER) extends the basic Entity-Relationship (ER) model by incorporating more
advanced concepts to represent complex data relationships.
1. Generalization and Specialization
o Generalization: Combines two or more lower-level entity types into a higher-level entity type based on their common features (bottom-up).
o Specialization: Splits a higher-level entity type into more specialized sub-entities (top-down).
o Representation: Shown with a triangle/ISA symbol connecting the superclass to its subclasses.
2. Inheritance
o Example: A Student inherits Name and Address from the Person entity.
4. Aggregation
o Treats a relationship set together with its participating entities as a higher-level entity, so that it can itself take part in other relationships.
o Example: Loan is a relationship between Bank and Customer, and it can participate in
another relationship, Approval, with an Employee.
5. Constraints
o The EER model also allows constraints on specialization (disjoint/overlapping, total/partial) and on cardinality ratios (1:1, 1:N, M:N) in relationships.
Object modeling in DBMS is based on the Object-Oriented Data Model, which combines the principles
of object-oriented programming with database design.
1. Objects
o Example: A Student object has attributes (Name, Roll No) and methods
(CalculateGrade()).
2. Classes
o Example: Class Student can have attributes (Name, Roll No) and methods (Enroll(),
Drop()).
3. Inheritance
o Allows one class (subclass) to inherit properties and methods from another (superclass).
4. Encapsulation
o Combines attributes and methods within a single unit (object), hiding implementation
details from the user.
5. Polymorphism
o Enables the same method to perform different operations based on the object calling it.
o Example: A method CalculateSalary() can behave differently for Manager and Employee
objects.
Definition
1. Superclass:
o A generalized entity that represents common attributes and relationships shared by its
subclasses.
2. Subclass:
o A specialized entity that inherits attributes and relationships from a superclass while
introducing additional attributes or relationships.
o Generalization: The bottom-up process of combining entity types that share common features into a superclass.
o Specialization: The top-down process of defining subclasses of a superclass based on distinguishing characteristics.
2. Inheritance
o Example: If Person has attributes Name and Address, then Student (a subclass) also has
these attributes.
3. Attributes in Subclasses
o Example: Student may have Roll Number, while Employee may have Employee ID.
1. Disjointness Constraints
o Disjoint:
▪ An entity instance of the superclass can belong to at most one subclass.
o Overlapping:
▪ An entity instance of the superclass can belong to more than one subclass.
2. Completeness Constraints
o Total Specialization:
▪ Every instance of the superclass must belong to at least one subclass.
o Partial Specialization:
▪ Some instances of the superclass may not belong to any subclass.
• Constraints:
1. Scenario
o Subclasses:
2. Specialization Constraints
3. ER Diagram Representation
Applications
• In Databases:
o Allows more efficient organization of data by representing hierarchical relationships.
o Example: Separating customers into Corporate and Individual types in a CRM database.
• In Object-Oriented Systems:
Definition
Indexed Sequential Access Method (ISAM) is a file organization technique that combines the advantages
of both sequential access and indexing to provide efficient access to data in a file. In ISAM, the data is
stored sequentially, and an index is maintained to provide faster access to records.
1. Sequential Storage
o Data records are stored sequentially in a file according to a specific key attribute.
o For example, records in an employee file might be sorted based on Employee ID.
2. Indexing
o An index file is created, which contains keys and pointers to the actual data records.
o The index allows for direct access to the data without needing to scan through the entire
file sequentially.
3. Access Methods
o Sequential access is used for operations like scanning the entire file.
o Indexed access allows for quick retrieval of records without scanning the entire file,
improving performance for searches.
Structure of ISAM
1. Data File
o Contains the actual records, which are stored in sorted order based on a key field.
2. Index File
o Contains an ordered list of key values and corresponding pointers to the data file.
o Primary Index: Directly indexes the records using the key field.
3. Overflow Area
o In ISAM, an overflow area is used when new records cannot fit into the predefined slots
of the index or data file due to insertions or deletions.
o This area stores records that don’t fit into the sequential structure.
1. Insertion
o When a new record is inserted, it is added in the appropriate sorted order in the data
file.
o If there is no space in the index, it might require a reorganization of the index and data
files.
2. Search
o To search for a record, the system looks up the index for the key value and retrieves the
corresponding pointer to the data file.
o The pointer directs the system to the exact location of the record in the data file.
3. Deletion and Update
o When a record is deleted or updated, it might require the reorganization of the file to
maintain the sequential order and integrity of the index.
4. Overflow Handling
o When the data file becomes full, the new records that don't fit are stored in an overflow
area.
o Overflow records are linked to the main file or index, and a pointer is used to access
them.
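The index-lookup step used by the search operation above can be sketched in Python. This is only an illustration of the idea (not a real ISAM engine): it assumes a sparse index held as a sorted list of (first key in block, block number) pairs and uses binary search to pick the block before scanning it.

import bisect

def isam_search(index, blocks, key):
    # index: sorted list of (first_key_in_block, block_no)
    # blocks: dict mapping block_no -> sorted list of (record_key, record) pairs
    pos = bisect.bisect_right([k for k, _ in index], key) - 1   # rightmost block whose first key <= key
    if pos < 0:
        return None                       # key is smaller than every indexed key
    _, block_no = index[pos]
    for record_key, record in blocks[block_no]:
        if record_key == key:
            return record                 # direct hit inside the selected block
    return None                           # not found (a real system would also check the overflow area)

# Usage: two blocks of employee records indexed by Employee ID
blocks = {0: [(101, "Alice"), (105, "Bob")], 1: [(110, "Carol"), (120, "Dave")]}
index = [(101, 0), (110, 1)]
print(isam_search(index, blocks, 110))    # prints "Carol"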
Advantages of ISAM
1. Fast Access
o Provides faster retrieval of records compared to purely sequential files by using indexes.
2. Efficient for Range Queries
o Because the records are sorted and indexed, it is efficient for range-based queries,
where you need to find records that fall within a specific range of key values.
3. Combines Sequential and Direct Access
o It allows for sequential scanning of the file and direct access to individual records via
the index, combining the advantages of both methods.
Disadvantages of ISAM
1. Reorganization Overhead
o The need to reorganize the file (rebuilding the index and data file) when records are
inserted or deleted can be costly and time-consuming, especially when the file grows
large.
2. Fixed-Size Structure
o ISAM files are typically designed with fixed-size blocks and indexes. This can lead to
overflow issues when the number of records exceeds the capacity of the file or index
structure.
3. Inefficient with Frequent Insertions and Deletions
o If there are frequent insertions and deletions, ISAM may become inefficient due to the
overhead of reorganization and handling overflow areas.
(Comparison table fragment: access speed for search and range queries across different file organizations; indexed organizations such as ISAM are fast for such queries, unindexed ones are slow.)
Applications of ISAM
• Transaction Processing Systems: ISAM is well-suited for systems where there is a mix of
sequential processing and direct access, such as bank account records or inventory systems.
• Data Warehousing: When efficient retrieval and range queries are required, ISAM is often used
for storing and accessing large volumes of data.
In a Database Management System (DBMS), both B-trees and B+ trees are commonly used for indexing
because they offer efficient search, insertion, and deletion operations. These trees are balanced,
meaning they maintain a structure that ensures logarithmic access time for all operations, which is
crucial for database indexing.
B-Tree
A B-tree is a self-balancing search tree data structure that maintains sorted data and allows for efficient
insertion, deletion, and search operations. It is generalized to allow more than two children per node.
Properties of B-Tree
1. Balanced:
o All leaf nodes appear at the same level, so every search path from the root has the same length.
2. Nodes:
o Each node contains multiple keys, and each key is associated with a pointer to a child
node.
o The number of keys in each node is between a defined minimum and maximum number
of keys (based on the order of the tree).
3. Order:
o The order of a B-tree defines the maximum number of children a node can have. For
example, a B-tree of order m can have at most m-1 keys and m children in each node.
4. Key Property:
o Keys within each node are kept in sorted order, and every key in a child subtree lies between the keys that bracket its pointer in the parent.
5. Searching:
o Searching in a B-tree involves traversing the tree from the root to the leaf, comparing
keys at each node to determine which child pointer to follow.
B-Tree Operations
1. Search:
o Start from the root node and search through the keys in each node.
o If the key is less than the first key, follow the left child; if greater than the first key but
less than the second key, follow the middle child, and so on.
2. Insertion:
o Insert the key into the appropriate leaf node, keeping the keys sorted.
o If the node is full, split the node into two, and push the middle key to the parent node.
3. Deletion:
o If the node has fewer than the minimum required keys after deletion, borrow a key from
a sibling or merge nodes.
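As a rough illustration of the search traversal described above, here is a small Python function. It is a sketch that assumes nodes shaped like the BTreeNode class used in the implementation section below (a sorted keys list, a children list, and an is_leaf flag).

def btree_search(node, key):
    # Return True if key is present in the subtree rooted at node.
    i = 0
    while i < len(node.keys) and key > node.keys[i]:
        i += 1                                    # move right past smaller keys
    if i < len(node.keys) and node.keys[i] == key:
        return True                               # key found in this node
    if node.is_leaf:
        return False                              # nowhere left to descend
    return btree_search(node.children[i], key)    # descend into the bracketing child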
• A B-tree of order 3 can have a maximum of 2 keys and 3 children per node.
[17, 28]
/            |            \
(keys < 17)   (keys between 17 and 28)   (keys > 28)
B+ Tree
A B+ Tree is a variation of the B-tree where all leaf nodes are linked in a linked list to facilitate range
queries. It maintains the structure of the B-tree but has some differences in how it stores data.
Properties of B+ Tree
1. Data in Leaf Nodes:
o In a B+ Tree, only the leaf nodes contain the actual data records (or pointers to the data). Internal nodes only store keys to guide the search.
2. Linked Leaves:
o Leaf nodes are linked in a doubly linked list, allowing efficient range queries.
3. Sorted Data:
o Just like in B-trees, the keys in both the internal and leaf nodes are sorted.
4. Search Efficiency:
o The search operation is similar to that of a B-tree, but in B+ trees, since the leaf nodes
are linked, range queries can traverse the leaf nodes directly.
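A range-scan sketch in Python, assuming leaf nodes hold sorted keys and a next pointer to the following leaf (as in the BPlusTreeNode sketch later in these notes); a real B+ tree would first descend from the root to the leaf containing the lower bound instead of starting from the leftmost leaf.

def bplus_range(start_leaf, low, high):
    # Collect all keys in [low, high] by walking the linked leaf level.
    result, leaf = [], start_leaf
    while leaf is not None:
        for key in leaf.keys:
            if key > high:
                return result             # leaves are sorted, so we can stop early
            if key >= low:
                result.append(key)
        leaf = leaf.next                  # follow the leaf-level link
    return result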
B+ Tree Operations
1. Search:
o The search in a B+ tree starts like a B-tree but follows the internal nodes to find the
appropriate leaf node.
2. Insertion:
o Similar to B-trees, insert the key into the appropriate leaf node. If the node is full, it
splits and the middle key is pushed up to the parent node.
3. Deletion:
o Deletion follows the same procedure as B-trees. If a node falls below the minimum
number of keys, borrowing or merging happens.
B+ Tree Example
Consider a B+ Tree of order 3 with a maximum of 2 keys per internal node and 2 keys per leaf node. The
structure might look like:
Root (internal): [10]
Leaf level: [1, 2] → [8] → [11, 13]
• The internal node [10] only guides the search: keys below 10 go to the left subtree, keys of 10 and above go to the right.
• Leaf nodes contain data or references to data: [1, 2], [8], [11, 13], and they are linked left to right for range scans.
Implementation of B-Tree and B+ Tree
Below is a simple implementation outline for both B-tree and B+ tree operations.
B-Tree Implementation (Python)
class BTreeNode:
    def __init__(self, is_leaf):
        self.is_leaf = is_leaf
        self.keys = []
        self.children = []

class BTree:
    def __init__(self, order):
        self.order = order                       # maximum number of children per node
        self.root = BTreeNode(True)

    def insert(self, key):
        root = self.root
        if len(root.keys) >= self.order - 1:     # root is full: grow the tree in height
            new_node = BTreeNode(False)
            new_node.children.append(self.root)
            self.split(new_node, 0)
            self.root = new_node
        self.insert_non_full(self.root, key)

    def insert_non_full(self, node, key):
        if node.is_leaf:
            node.keys.append(key)
            node.keys.sort()
        else:
            i = len(node.keys) - 1
            while i >= 0 and key < node.keys[i]:
                i -= 1
            i += 1
            if len(node.children[i].keys) >= self.order - 1:
                self.split(node, i)              # proactively split a full child before descending
                if key > node.keys[i]:
                    i += 1
            self.insert_non_full(node.children[i], key)

    def split(self, parent, index):
        full_node = parent.children[index]
        new_node = BTreeNode(full_node.is_leaf)
        mid = len(full_node.keys) // 2
        parent.keys.insert(index, full_node.keys[mid])    # middle key moves up to the parent
        parent.children.insert(index + 1, new_node)
        new_node.keys = full_node.keys[mid + 1:]          # right half of the keys goes to the new node
        full_node.keys = full_node.keys[:mid]
        if not full_node.is_leaf:
            new_node.children = full_node.children[mid + 1:]
            full_node.children = full_node.children[:mid + 1]
B+ Tree Implementation
class BPlusTreeNode:
    def __init__(self, is_leaf):
        self.is_leaf = is_leaf
        self.keys = []
        self.children = []
        self.next = None                         # link to the next leaf (used only by leaf nodes)

class BPlusTree:
    def __init__(self, order):
        self.order = order
        self.root = BPlusTreeNode(True)

    def insert(self, key):
        root = self.root
        if len(root.keys) >= self.order - 1:     # root is full: create a new root above it
            new_node = BPlusTreeNode(False)
            new_node.children.append(self.root)
            self.split(new_node, 0)
            self.root = new_node
        self.insert_non_full(self.root, key)

    def insert_non_full(self, node, key):
        if node.is_leaf:
            node.keys.append(key)
            node.keys.sort()
        else:
            i = len(node.keys) - 1
            while i >= 0 and key < node.keys[i]:
                i -= 1
            if len(node.children[i + 1].keys) >= self.order - 1:
                self.split(node, i + 1)          # proactively split a full child before descending
                if key >= node.keys[i + 1]:
                    i += 1
            self.insert_non_full(node.children[i + 1], key)

    def split(self, parent, index):
        full_node = parent.children[index]
        new_node = BPlusTreeNode(full_node.is_leaf)
        mid = len(full_node.keys) // 2
        parent.keys.insert(index, full_node.keys[mid])
        parent.children.insert(index + 1, new_node)
        if full_node.is_leaf:
            new_node.keys = full_node.keys[mid:]           # the separator is copied up, so it stays in the right leaf
            new_node.next = full_node.next                 # maintain the linked list of leaves
            full_node.next = new_node
        else:
            new_node.keys = full_node.keys[mid + 1:]       # in internal nodes the separator moves up only
            new_node.children = full_node.children[mid + 1:]
            full_node.children = full_node.children[:mid + 1]
        full_node.keys = full_node.keys[:mid]
Conclusion
• B-trees are widely used for general-purpose indexing because of their balanced structure and
efficient search, insertion, and deletion.
• B+ trees are often preferred when range queries are important, as they store all data in the leaf
nodes, which are linked together.
These trees ensure that the database can scale efficiently, providing fast query performance even with
large datasets.
What is Hashing?
Hashing is a technique used in databases and other applications to quickly access data in a large set by
using a hash function. The goal of hashing is to map data of arbitrary size (like a string or a record) to a
fixed-size value (called a hash value or hash code). Hashing is used to facilitate constant-time retrieval of
data, typically in hash tables.
In DBMS, hashing is primarily used in indexing methods and file organization to ensure that data
retrieval operations are fast, especially when the dataset is large.
Hash Function
A hash function is a mathematical algorithm that takes an input (or key) and converts it into a fixed-size
output called the hash value or hash code. The hash function's goal is to distribute the keys uniformly
across the hash table, minimizing the number of collisions.
Properties of a Good Hash Function
1. Deterministic:
o For the same input, a hash function must always return the same hash value.
2. Uniform Distribution:
o A good hash function ensures that keys are distributed uniformly across the hash table
to minimize collisions.
3. Efficiency:
o The hash function should be computationally efficient, meaning it should quickly map
the input to the hash value.
4. Minimize Collisions:
o A collision occurs when two different keys produce the same hash value. The hash
function should minimize the likelihood of collisions.
5. Fixed Output Size:
o The output of the hash function should always be of a fixed size, regardless of the size of
the input.
There are two common methods used for hashing in DBMS: Static Hashing and Dynamic Hashing.
1. Static Hashing
In static hashing, the size of the hash table is fixed, and the hash function directly maps a key to a bucket
in the table.
1. Hash function: The hash function h(k) maps the key k to an integer value, which determines the
bucket number where the record is stored.
2. Buckets: A bucket is a data structure (often a linked list or an array) where records with the same
hash value are stored.
3. Overflow Handling:
o Overflow occurs when a bucket is full. One common approach to handle overflow is
chaining, where each bucket holds a linked list of records that hash to the same value.
o Open addressing is another method for resolving collisions, where the next available
bucket is used.
Suppose we have a hash table of size 10, and we want to insert a key k. The hash function h(k) = k % 10
determines the bucket location.
Limitation of Static Hashing:
• The fixed size of the hash table means that if there are too many records or if keys are not
uniformly distributed, the table will suffer from overflows or high load factors, making
operations slower.
2. Dynamic Hashing
Dynamic hashing is an extension of static hashing that allows the hash table to grow or shrink as
needed, maintaining efficient performance as the database size changes. This technique addresses the
limitations of static hashing by resizing the hash table dynamically.
1. Extendible Hashing:
This method involves a directory of hash buckets, which dynamically adjusts its size as more
buckets are needed. The hash function is applied to the key, and a global depth is used to decide
the number of bits used for the hash value.
o The directory holds pointers to buckets that contain the actual data records. The
directory size grows as needed.
o Each bucket has a local depth, and the global depth indicates the number of bits from
the hash value used for indexing.
• Suppose a hash table with a global depth of 2 (2 bits are used from the hash value) and 4
buckets. The hash function is h(k) = k % 4 initially.
• When the table grows, the global depth increases, and more bits are considered for the hash
function, expanding the number of available buckets.
Since it's not always possible to have a perfect hash function, collisions (when two keys map to the same
bucket) are inevitable. Common methods for handling collisions are:
1. Chaining:
o Each bucket is implemented as a linked list, and all records that hash to the same bucket
are stored in the list. The linked list helps to store multiple records in the same bucket.
o Pros: Easy to implement, and no need to worry about the size of the table.
2. Open Addressing:
o When a collision occurs, the hash table looks for the next available slot (bucket).
Common techniques for open addressing include:
▪ Linear Probing: If a collision occurs, check the next available bucket in a linear
fashion.
▪ Quadratic Probing: Similar to linear probing but with a quadratic step size (e.g.,
check 1, 4, 9, ... positions).
▪ Double Hashing: Uses a second hash function to compute the next available
position.
o Cons: Can suffer from clustering (when adjacent buckets fill up).
Applications of Hashing in DBMS
1. Indexing:
Hashing is used to build hash indexes, where a hash function is applied to the key to quickly
locate records in the database.
2. Partitioning:
Hashing is used for data partitioning in distributed databases. The hash function maps records
to different partitions or nodes, allowing parallel processing and storage.
3. Caching:
Hashing is used in caching to store frequently accessed data in a hash table, ensuring fast
retrieval.
4. Encryption:
Hash functions are widely used in cryptographic systems for creating secure hash values (e.g.,
MD5, SHA-256).
Collision resolution is a critical aspect of hashing, especially in situations where multiple keys hash to the
same index (bucket) in a hash table. This can happen because the hash function does not guarantee
unique hash codes for each possible key, so multiple keys may map to the same hash value (bucket).
In DBMS and other applications that use hash tables, resolving collisions efficiently is key to maintaining
the performance of hash-based operations like insert, search, and delete.
There are two primary techniques for handling collisions: Open Addressing and Chaining.
1. Chaining
In chaining, each bucket in the hash table stores a linked list (or another data structure) of all the keys
that hash to the same index. Instead of storing just one key per bucket, each bucket stores a linked list of
keys (or even complete records).
• If two or more keys hash to the same bucket, they are inserted into the linked list at that bucket.
• The linked list allows the bucket to hold multiple keys, which is useful in cases where collisions
are frequent.
Pros:
• Easy to implement.
Cons:
• Performance can degrade if too many keys hash to the same bucket (linked lists become long),
leading to slower operations.
Example of Chaining:
Suppose we have a hash table of size 5 with the hash function h(k) = k % 5. Let's insert keys 10, 15, and
20:
Table:
Bucket 0: 10 → 15 → 20
Bucket 1: []
Bucket 2: []
Bucket 3: []
Bucket 4: []
In this case, h(10) = 0, h(15) = 0, and h(20) = 0. All three keys end up in the same bucket (bucket 0), and
they are stored as a linked list.
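A minimal Python sketch of chaining, using plain Python lists to stand in for the per-bucket linked lists and a fixed table size of 5:

class ChainedHashTable:
    def __init__(self, size=5):
        self.size = size
        self.buckets = [[] for _ in range(size)]    # one chain per bucket

    def insert(self, key):
        self.buckets[key % self.size].append(key)   # h(k) = k % size

    def search(self, key):
        return key in self.buckets[key % self.size]

t = ChainedHashTable()
for k in (10, 15, 20):
    t.insert(k)
print(t.buckets)    # [[10, 15, 20], [], [], [], []]  (all three keys collide in bucket 0)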
2. Open Addressing
In open addressing, instead of using linked lists to store colliding elements, the hash table itself is used
to store all keys. When a collision occurs, the algorithm searches for another open slot within the hash
table, using a probing strategy to find the next available bucket.
Open addressing is typically used with hash tables that have fixed size.
1. Linear Probing
In linear probing, when a collision occurs at the hash index i, the algorithm checks the next slot i + 1. If
that slot is also occupied, it checks i + 2, and so on, until an empty slot is found.
Formula:
h(k, i) = (h(k) + i) % table_size
Pros:
o Easy to implement.
Cons:
o Clustering: This is when consecutive slots fill up, leading to primary clustering (adjacent
keys clustering in the same part of the table) and reducing the efficiency of lookups.
Example of Linear Probing:
For a table of size 5 and a hash function h(k) = k % 5, let's insert keys 10, 15, and 20:
o h(10) = 0, and bucket 0 is empty, so 10 goes to bucket 0.
o h(15) = 0, but bucket 0 is occupied, so linear probing finds the next available bucket (bucket 1).
o h(20) = 0, but buckets 0 and 1 are occupied, so linear probing finds the next available bucket (bucket 2).
Table:
Bucket 0: 10
Bucket 1: 15
Bucket 2: 20
Bucket 3: []
Bucket 4: []
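The same three inserts can be sketched with linear probing in Python; the table is a fixed-size list in which None marks an empty slot (an assumption of this sketch, not part of the formula above):

def linear_probe_insert(table, key):
    size = len(table)
    for i in range(size):                 # probe h(k), h(k)+1, h(k)+2, ... modulo the table size
        slot = (key + i) % size           # equivalent to h(k, i) = (h(k) + i) % table_size
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")

table = [None] * 5
for k in (10, 15, 20):
    linear_probe_insert(table, k)
print(table)    # [10, 15, 20, None, None]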
2. Quadratic Probing
In quadratic probing, when a collision occurs, instead of checking the next slot sequentially, the algorithm probes slots at quadratically increasing offsets: on the i-th probe it examines the slot i^2 positions away from the original hash position.
Formula:
h(k, i) = (h(k) + i^2) % table_size
This approach reduces primary clustering compared to linear probing, but it can still suffer from
secondary clustering (keys that hash to the same index follow the same probing pattern).
Example of Quadratic Probing:
For a table of size 5 and a hash function h(k) = k % 5, insert keys 10, 15, and 20:
o h(10) = 0, and bucket 0 is empty, so 10 goes to bucket 0.
o h(15) = 0, but bucket 0 is occupied. The next probe is h(15, 1) = (0 + 1^2) % 5 = 1, so 15 goes to bucket 1.
o h(20) = 0, but buckets 0 and 1 are occupied. The next probe is h(20, 1) = (0 + 1^2) % 5 = 1 (already occupied), so the next probe is h(20, 2) = (0 + 2^2) % 5 = 4, and 20 goes to bucket 4.
Table:
Bucket 0: 10
Bucket 1: 15
Bucket 2: []
Bucket 3: []
Bucket 4: 20
3. Double Hashing
Double hashing uses a second hash function to determine the step size for probing. This method
minimizes clustering by ensuring that the probe sequence is less predictable.
Formula:
h(k, i) = (h1(k) + i * h2(k)) % table_size
Where h1(k) is the primary hash function and h2(k) is the secondary hash function.
Example of Double Hashing:
For a table of size 5, let the primary hash function be h1(k) = k % 5 and the secondary hash function be h2(k) = 1 + (k % 4). Insert keys 10, 15, and 20.
o h1(10) = 0, and bucket 0 is empty, so 10 goes to bucket 0.
o h1(15) = 0, but bucket 0 is occupied. The secondary hash function gives h2(15) = 1 + (15 % 4) = 4. The next probe is h(15, 1) = (0 + 1 * 4) % 5 = 4, so 15 goes to bucket 4.
o h1(20) = 0, but bucket 0 is occupied. The secondary hash function gives h2(20) = 1 + (20 % 4) = 1. The next probe is h(20, 1) = (0 + 1 * 1) % 5 = 1, so 20 goes to bucket 1.
Table:
Bucket 0: 10
Bucket 1: 20
Bucket 2: []
Bucket 3: []
Bucket 4: 15
Comparison of Collision Resolution Techniques
• Chaining. Advantages: simple to implement; handles overflow easily. Disadvantages: requires extra memory for linked lists. Best suited for: cases with frequent collisions or dynamic datasets.
• Linear Probing. Advantages: simple; no extra memory needed. Disadvantages: clustering can degrade performance. Best suited for: small datasets with a low load factor.
• Double Hashing. Advantages: reduces clustering; more uniform distribution. Disadvantages: the secondary hash function needs to be carefully chosen. Best suited for: larger datasets where clustering is a concern.
Dynamic hashing is an approach to handle the problems encountered with static hashing, particularly
when dealing with overflow and the fixed size of the hash table. It allows the hash table to grow or
shrink dynamically as needed, ensuring that the database can scale and maintain efficient performance
for insertions, deletions, and lookups.
In static hashing, the size of the hash table is fixed, and if the number of records exceeds the number of
available slots in the hash table, the performance degrades due to overflow or clustering. Dynamic
hashing overcomes this issue by adapting the hash table's size dynamically to the number of records,
thereby maintaining efficient access as the dataset grows.
Here, we'll focus on extendible hashing, which is widely used in DBMS systems.
Extendible Hashing
Extendible Hashing is a dynamic hashing technique that uses a directory of pointers to buckets, where
the directory size can grow or shrink dynamically. The idea is to adjust the global depth of the hash table
as needed, to ensure that the table can grow efficiently without causing a lot of collisions or overflow.
1. Directory:
The directory is an array of pointers to buckets; it is indexed using a fixed number of bits of the hash value and can grow or shrink as the table changes.
2. Global Depth:
The global depth indicates how many bits of the hash value are used to access the directory. The
more bits are used, the larger the directory and the number of buckets. As the number of
records increases, the global depth may increase, causing the directory to double in size.
3. Local Depth:
Each bucket has a local depth, which indicates how many bits of the hash value are actually
being used to decide the placement of records in that specific bucket. If a bucket overflows, it
can be split, and its local depth may increase, while the global depth remains the same.
4. Splitting Buckets:
When a bucket overflows, it is split into two and its local depth increases. If the bucket's local depth already equals the global depth, the global depth is incremented, which requires doubling the directory to accommodate the new pointers.
6. Bucket Management:
When the directory is doubled, some pointers may need to be adjusted. The records in the split
buckets are redistributed based on the hash value and the local and global depth.
1. Insertion Process
• Compute the hash value for the key using the hash function.
o The first part of the hash value, based on the global depth, selects the directory entry and hence the target bucket.
o The second part, based on the local depth of the bucket, determines the placement of the key within the bucket.
• If the target bucket has space, the key is inserted directly into the bucket.
• If the bucket overflows, it is split. The bucket’s local depth increases, and records are
redistributed between the old bucket and the new one.
2. Directory Doubling
• If a bucket that must split already has a local depth equal to the global depth, the directory must double in size to accommodate more pointers.
• The new directory has the same entries, but the addressing of buckets changes. The split bucket
may cause records to be redistributed, and some entries may now point to the newly created
buckets.
• The global depth is incremented, and the directory now points to twice as many buckets.
3. Deletion Process
• To delete a record, the bucket is checked to see if the key exists. If found, the key is removed.
• After deletion, if a bucket becomes empty (or underfull), it can be merged with its buddy bucket and its local depth decreased; once every bucket's local depth is less than the global depth, the directory itself can be halved.
Consider a hash table with a hash function h(k) = k % 8 and a global depth of 3 (3 bits of the hash value are used), giving a directory with pointers to 8 buckets. After inserting several keys, if a bucket overflows, the bucket will be split, and the global depth may increase.
Consider a hash table with 4 buckets and a global depth of 2 (i.e., 2 bits used from the hash value):
Directory (global depth 2): 00 → B0, 01 → B1, 10 → B2, 11 → B3
Assume the hash function is h(k) = k % 4 (which gives the last 2 bits of k):
1. Insert k = 3:
h(3) = 3 % 4 = 3 (binary 11) → insert into bucket B3.
2. Insert k = 7:
h(7) = 7 % 4 = 3 (binary 11) → bucket B3 overflows, so we split it.
o Split the bucket and increment the local depth of the bucket.
o Because B3's local depth already equalled the global depth, the global depth is increased to 3, and the directory is doubled to accommodate the new local depth.
Directory (global depth 3): doubled to 8 entries (000 to 111); the entries ending in 11 now point to the two buckets produced by splitting B3, while the remaining entries still share buckets B0, B1, and B2.
Now, the number of available buckets has increased, and the hash table can handle more records
efficiently.
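A compact Python sketch of extendible hashing that matches this behaviour; it assumes integer keys, a bucket capacity of 2 records, uses the low-order global-depth bits of the key as the directory index, and omits deletion/merging for brevity:

class Bucket:
    def __init__(self, local_depth, capacity=2):
        self.local_depth = local_depth
        self.capacity = capacity
        self.keys = []

class ExtendibleHashTable:
    def __init__(self, capacity=2):
        self.global_depth = 1
        self.directory = [Bucket(1, capacity), Bucket(1, capacity)]

    def _index(self, key):
        return key & ((1 << self.global_depth) - 1)   # low-order global_depth bits

    def search(self, key):
        return key in self.directory[self._index(key)].keys

    def insert(self, key):
        bucket = self.directory[self._index(key)]
        bucket.keys.append(key)
        if len(bucket.keys) > bucket.capacity:
            self._split(bucket)

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            self.directory += self.directory          # double the directory, duplicating pointers
            self.global_depth += 1
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth, bucket.capacity)
        distinguishing_bit = 1 << (bucket.local_depth - 1)
        for i, b in enumerate(self.directory):        # half of the old pointers move to the new bucket
            if b is bucket and (i & distinguishing_bit):
                self.directory[i] = new_bucket
        keys, bucket.keys = bucket.keys, []
        for k in keys:                                # redistribute the overflowing bucket's keys
            self.directory[self._index(k)].keys.append(k)
        for b in (bucket, new_bucket):                # a skewed split may still overflow
            if len(b.keys) > b.capacity:
                self._split(b)

t = ExtendibleHashTable()
for k in (3, 7, 11, 15):      # all of these end in binary ...11, forcing repeated splits
    t.insert(k)
print(t.global_depth)         # grows as the colliding bucket keeps splitting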
Advantages of Dynamic Hashing
1. Dynamic Resizing:
o The directory and the hash table grow or shrink as needed, preventing overflow
problems in static hashing.
2. Efficient Lookup:
o Even as the table grows, the number of buckets and the directory size remain
manageable, ensuring that lookups remain efficient.
3. No Overflow:
o Buckets are split and merged dynamically, ensuring that overflow is handled gracefully.
4. Distributed Load:
o The dynamic adjustment of bucket sizes and directory depth leads to better distribution
of keys across the buckets, minimizing clustering.
Disadvantages of Dynamic Hashing
1. Space Overhead:
o The need for maintaining a directory and potentially doubling its size leads to extra
space requirements.
2. Complexity:
o The process of handling splits, directory resizing, and adjusting local/global depth can be
more complex than static hashing.
3. Directory Growth:
o As the global depth increases, the directory size may grow, potentially leading to higher
access time for locating the correct bucket.
The Relational Model is the most commonly used data model in Database Management Systems
(DBMS), introduced by E.F. Codd in 1970. It organizes data into tables (called relations) and describes
the relationships between the data in a logical and structured way.
In the relational model, data is stored in tables, where each table is a collection of rows and columns.
Each row represents a record or a tuple, and each column represents an attribute of the data. The
relational model provides a formal framework for defining, querying, and manipulating data.
1. Relation (Table)
o A relation is a set of tuples (rows) and a set of attributes (columns). It can be thought of
as a table in a database, where:
▪ Rows (also called tuples) represent individual records.
▪ Columns (also called attributes) represent the data fields for each record.
o Example: A Student relation could have attributes like StudentID, Name, Age, Major, etc.
2. Tuple
o A tuple (or row) is a single record in the table. It is a collection of attribute values that
describe a single entity.
o In the above example, (101, Alice, 20, Computer Science) is a tuple that represents a
student.
3. Attribute
o For instance, in the Student relation, StudentID, Name, Age, and Major are attributes of
the relation.
4. Domain
o The domain of an attribute is the set of all possible values that the attribute can take.
For example, the domain of Age might be {18, 19, 20, ..., 100} and the domain of Major
might be a set of valid department names such as Computer Science, Mathematics, etc.
5. Primary Key
o In the Student table, StudentID can be the primary key since each student will have a
unique StudentID.
6. Foreign Key
o A foreign key is an attribute (or a set of attributes) in one relation that refers to the
primary key of another relation. It is used to establish a relationship between two
relations.
o For example, in a Course table, StudentID could be a foreign key that references the
StudentID in the Student relation.
7. Relationship
o In the relational model, a relationship between two relations is established through a
foreign key. A relation can be linked to another relation using the primary key and
foreign key pair.
The relational model supports a variety of operations that allow data to be queried and manipulated.
These operations are defined by Relational Algebra and SQL (Structured Query Language). The key
operations include:
1. Select (σ):
The select operation is used to retrieve rows from a relation that satisfy a given condition.
SQL Query:
SELECT * FROM Student WHERE Age > 20;
2. Project (π):
The project operation is used to retrieve specific columns (attributes) from a relation.
SQL Query:
SELECT Name, Major FROM Student;
3. Union (∪):
The union operation is used to combine the results of two relations that have the same set of
attributes. It returns all distinct tuples from both relations.
SQL Query:
SELECT * FROM Student1
UNION
SELECT * FROM Student2;
4. Set Difference (−):
The set difference operation returns the tuples that appear in the first relation but not in the second.
SQL Query:
SELECT * FROM Student1
EXCEPT
SELECT * FROM Student2;
5. Intersection (∩):
The intersection operation returns the tuples that appear in both relations.
SQL Query:
SELECT * FROM Student1
INTERSECT
SELECT * FROM Student2;
6. Join (⨝):
The join operation is used to combine two relations based on a common attribute, typically the
foreign key. The most common type of join is the inner join.
SQL Query:
SELECT Student.Name, Course.CourseID
FROM Student
JOIN Course ON Student.StudentID = Course.StudentID;
1. Candidate Key:
A candidate key is any attribute (or set of attributes) that can uniquely identify a tuple in a
relation. There can be multiple candidate keys, but one of them is chosen as the primary key.
2. Superkey:
A superkey is a set of one or more attributes that can uniquely identify a tuple. Every primary
key is a superkey, but not every superkey is a primary key.
3. Composite Key:
A composite key is a primary key that consists of more than one attribute. For example, if a
student is identified by both StudentID and CourseID, the composite key would be the
combination of both.
4. Alternate Key:
An alternate key is a candidate key that was not chosen as the primary key.
Integrity constraints ensure the accuracy and consistency of data in a relational database. There are
several types of integrity constraints in the relational model:
1. Entity Integrity:
Ensures that each tuple in a relation has a unique identifier (the primary key) and that no
primary key value is null.
2. Referential Integrity:
Ensures that foreign keys correctly refer to primary keys in other relations. A foreign key in one
relation must either be null or match a primary key in the referenced relation.
3. Domain Integrity:
Ensures that the values of attributes are from a valid domain (i.e., they conform to the data type,
range, or other restrictions of the attribute).
4. User-Defined Integrity:
Refers to specific rules defined by the user to maintain consistency based on business logic (e.g.,
age should not be less than 18).
1. Simplicity:
The relational model is simple to understand and use, as it deals with tables, which are intuitive
and easy to conceptualize.
2. Flexibility:
The relational model is highly flexible and can handle a wide variety of data types and
relationships, including many-to-many and one-to-many relationships.
3. Data Independence:
The relational model provides a level of abstraction, allowing users to interact with data without
worrying about the underlying storage mechanisms.
4. Normalization:
The relational model allows data to be normalized, reducing redundancy and improving data
integrity by organizing data into smaller, related tables.
Relational constraints are rules that help ensure the accuracy and integrity of data in a relational
database. These constraints restrict the type of data that can be inserted into a relation (table) and
ensure that the database maintains consistency and correctness over time.
1. Domain Constraints
Domain integrity ensures that the data in a relation conforms to the domain of each attribute (i.e., the
allowable set of values). These constraints ensure that each column in a table contains values of the
correct type, range, and format.
• Example:
The domain of the Age attribute in a Person table might be restricted to positive integers
between 0 and 100.
• Enforced By:
Data types, constraints on values (e.g., constraints on valid values using CHECK), and ranges.
Example SQL:
CREATE TABLE Person (
    Name VARCHAR(100),
    Age INT CHECK (Age BETWEEN 0 AND 100)
);
2. Entity Integrity Constraints
Entity integrity ensures that each row in a relation is uniquely identifiable. This is typically done by
enforcing a primary key constraint.
• Primary Key Constraint:
A primary key is a set of one or more attributes that uniquely identify a tuple (row) in a relation.
The primary key cannot contain NULL values, ensuring every tuple is distinct.
• Enforced By:
The primary key column(s) cannot contain null values and must have unique values across all
tuples in the table.
Example SQL:
CREATE TABLE Employee (
    EmployeeID INT PRIMARY KEY,
    Name VARCHAR(100),
    Department VARCHAR(50)
);
• Example Explanation:
The EmployeeID is the primary key, meaning each EmployeeID must be unique and cannot be
NULL.
3. Referential Integrity Constraints
Referential integrity ensures that relationships between tables are maintained and that foreign key
values always point to valid primary key values in another table. A foreign key is an attribute (or a set of
attributes) in one table that refers to the primary key in another table.
• Enforced By:
The foreign key constraint ensures that any value in the foreign key column must exist in the
referenced primary key column of another table.
Example SQL:
CREATE TABLE Department (
    DepartmentID INT PRIMARY KEY,
    DepartmentName VARCHAR(100)
);
CREATE TABLE Employee (
    EmployeeID INT PRIMARY KEY,
    Name VARCHAR(100),
    DepartmentID INT,
    FOREIGN KEY (DepartmentID) REFERENCES Department(DepartmentID)
);
• Example Explanation:
The DepartmentID in the Employee table is a foreign key that refers to the DepartmentID in the
Department table. It ensures that any DepartmentID in the Employee table must correspond to
an existing DepartmentID in the Department table.
4. Key Constraints
Key constraints ensure that each tuple in a relation is uniquely identifiable by a set of attributes. The
primary key uniquely identifies a record in a table, and other possible unique identifiers are known as
candidate keys.
• Candidate Key:
A candidate key is a set of attributes that can uniquely identify a tuple in a table. There can be
multiple candidate keys in a table, but one of them is selected as the primary key.
• Unique Key:
A unique key is a set of attributes that must have unique values across all tuples in the table, but
it allows for a NULL value (unlike the primary key, which does not allow NULL).
• Enforced By:
The unique constraint ensures that all values in the column(s) are distinct, and that no two rows
have the same values in the specified columns.
Example SQL:
CREATE TABLE Employee (
    EmployeeID INT PRIMARY KEY,
    Email VARCHAR(100) UNIQUE,
    Name VARCHAR(100)
);
• Example Explanation:
The Email attribute has a unique key constraint, ensuring that no two employees can have the
same email address.
5. User-Defined Integrity Constraints
User-defined integrity constraints are custom rules created by the user to enforce business-specific logic
or rules that cannot be covered by domain, entity, referential, or key constraints. These constraints are
typically enforced through triggers, stored procedures, or check constraints.
• Enforced By:
Check Constraints or custom application logic.
Example SQL:
CREATE TABLE Employee (
    EmployeeID INT PRIMARY KEY,
    Name VARCHAR(100),
    Age INT,
    Salary DECIMAL(10, 2),
    CHECK (Age >= 18 AND Salary >= 30000) -- Ensures that employees are at least 18 years old and earn at least $30,000
);
• Example Explanation:
The check constraint ensures that the Age of an employee must be 18 or older, and the Salary
must be at least $30,000. This is an example of user-defined integrity.
6. Not Null Constraints
A not null constraint ensures that a particular attribute must always contain a value; it cannot be left
empty. This is useful for attributes that should always have meaningful data.
• Enforced By:
The NOT NULL constraint ensures that the attribute cannot have NULL values.
Example SQL:
CREATE TABLE Customer (
    CustomerID INT PRIMARY KEY,
    Name VARCHAR(100) NOT NULL,
    Address VARCHAR(200)
);
• Example Explanation:
The Name column has a NOT NULL constraint, which ensures that every customer must have a
name.
7. Check Constraints
The check constraint ensures that values in a column satisfy a specific condition or logical expression.
This is commonly used to enforce business rules.
• Enforced By:
The CHECK keyword is used to define conditions that values must meet.
Example SQL:
CREATE TABLE Employee (
    EmployeeID INT PRIMARY KEY,
    Name VARCHAR(100),
    Age INT CHECK (Age >= 18) -- Ensures Age is greater than or equal to 18
);
• Example Explanation:
The Age column has a check constraint that ensures the age is greater than or equal to 18.
Here are some SQL queries demonstrating common database operations and concepts like creating
tables, inserting data, updating, and querying data:
1. Create a Table
The CREATE TABLE statement is used to create a new table in the database.
SQL Query:
CREATE TABLE Employee (
    EmployeeID INT PRIMARY KEY,
    Name VARCHAR(100),
    Age INT,
    DepartmentID INT
);
2. Insert Data
The INSERT INTO statement is used to add new rows to a table.
SQL Query:
INSERT INTO Employee (EmployeeID, Name, Age, DepartmentID)
VALUES (1, 'Alice', 30, 10);
3. Select Data
The SELECT statement is used to retrieve data from one or more tables.
SQL Query:
SELECT * FROM Employee;
4. Update Data
The UPDATE statement is used to modify existing records in a table.
SQL Query:
UPDATE Employee
SET Age = 31
WHERE EmployeeID = 1;
5. Delete Data
The DELETE statement is used to remove rows from a table.
SQL Query:
DELETE FROM Employee WHERE EmployeeID = 1;
6. Join Tables
The JOIN operation is used to combine rows from two or more tables based on a related column
between them.
• Inner Join (only rows with matching values in both tables):
SQL Query:
SELECT Employee.Name, Department.DepartmentName
FROM Employee
JOIN Department ON Employee.DepartmentID = Department.DepartmentID;
• Left Join (all rows from the left table, and matching rows from the right table):
SQL Query:
SELECT Employee.Name, Department.DepartmentName
FROM Employee
LEFT JOIN Department ON Employee.DepartmentID = Department.DepartmentID;
7. Group By
The GROUP BY clause is used to arrange identical data into groups. It is often used with aggregate functions like COUNT(), SUM(), AVG(), etc.
SQL Query:
SELECT DepartmentID, COUNT(*) AS EmployeeCount
FROM Employee
GROUP BY DepartmentID;
8. Sorting Results
The ORDER BY statement is used to sort the result set by one or more columns.
SQL Query:
SELECT * FROM Employee
ORDER BY Age ASC;
SQL Query:
SELECT * FROM Employee
ORDER BY Age DESC;
9. Create Index
The CREATE INDEX statement is used to create an index on one or more columns to speed up data retrieval.
SQL Query:
CREATE INDEX idx_employee_name ON Employee(Name);
10. Alter Table
The ALTER TABLE statement is used to modify an existing table structure, like adding or deleting columns.
SQL Query:
ALTER TABLE Employee ADD Email VARCHAR(100);
SQL Query:
ALTER TABLE Employee DROP COLUMN Email;
11. Rename a Table or Column
You can rename an existing table or column using ALTER in combination with RENAME.
• Rename a table:
SQL Query:
ALTER TABLE Employee
RENAME TO Staff;
• Rename a column:
SQL Query:
ALTER TABLE Employee RENAME COLUMN Name TO FullName;
12. Create a View
A view is a virtual table based on the result set of a query. It does not store data but presents it in a structured way.
SQL Query:
CREATE VIEW EmployeeNames AS
SELECT Name, DepartmentID
FROM Employee;
13. Subqueries
A subquery is a query nested inside another query. It can be used to perform complex queries.
SQL Query:
SELECT Name FROM Employee
WHERE DepartmentID IN (SELECT DepartmentID FROM Department);
14. Drop a Table
The DROP TABLE statement is used to remove a table from the database permanently.
SQL Query:
DROP TABLE Employee;
15. Drop a Column
The DROP COLUMN statement is used to remove a column from an existing table.
SQL Query:
ALTER TABLE Employee DROP COLUMN Age;
1. Mapping of Regular (Strong) Entity Types
A regular entity type represents a real-world object that has attributes and can exist independently.
• Mapping:
For each strong entity (regular entity) in the EER model, create a relation (table) where:
o The attributes of the entity type become the attributes (columns) of the table.
Example:
An entity Student with attributes StudentID, Name, and DOB.
• EER:
Student (StudentID, Name, DOB)
• Relational Mapping:
CREATE TABLE Student (
    StudentID INT PRIMARY KEY,
    Name VARCHAR(100),
    DOB DATE
);
2. Mapping of Weak Entity Types
• Mapping:
For each weak entity, create a relation that includes:
o The primary key of the strong entity that owns the weak entity.
o The partial key from the weak entity, combined with the primary key of the strong entity
to form a composite primary key for the weak entity.
Example:
A weak entity OrderItem with attributes Quantity, and Price, dependent on a strong entity Order.
• EER:
OrderItem (OrderID, ProductID, Quantity, Price)
Where OrderID and ProductID are part of the composite key.
• Relational Mapping:
CREATE TABLE OrderItem (
    OrderID INT,
    ProductID INT,
    Quantity INT,
    Price DECIMAL(10, 2),
    PRIMARY KEY (OrderID, ProductID),
    FOREIGN KEY (OrderID) REFERENCES Orders(OrderID)
);
3. Mapping of Specialization/Generalization
A specialization is a process of creating sub-entities from a general entity, and generalization is the
reverse. The subtypes inherit attributes and relationships of the parent (superclass).
There are two primary ways to represent this relationship in the relational model:
• Mapping:
A table is created that contains all attributes of both the superclass and the subclasses. A
discriminator attribute is added to indicate which subclass an entity belongs to.
Example:
Superclass Employee with subclasses FullTimeEmployee and PartTimeEmployee.
• EER:
Employee (EmpID, EmpName, EmpType)
FullTimeEmployee (Salary)
PartTimeEmployee (HourlyRate)
• Relational Mapping:
CREATE TABLE Employee (
    EmpID INT PRIMARY KEY,
    EmpName VARCHAR(100),
    EmpType VARCHAR(50),      -- discriminator: 'FullTime' or 'PartTime'
    Salary DECIMAL(10, 2),    -- NULL for part-time employees
    HourlyRate DECIMAL(10, 2) -- NULL for full-time employees
);
Each subclass is mapped to its own table, which includes the attributes of both the superclass and the
subclass. The superclass is also mapped to its own table, and each subclass table has a foreign key to the
superclass.
• Mapping:
Create separate tables for the superclass and each subclass. Use a foreign key in each subclass
table that references the primary key of the superclass table.
Example:
Employee superclass with FullTimeEmployee and PartTimeEmployee subclasses.
• EER:
Employee (EmpID, EmpName)
FullTimeEmployee (EmpID, Salary)
PartTimeEmployee (EmpID, HourlyRate)
• Relational Mapping:
CREATE TABLE Employee (
    EmpID INT PRIMARY KEY,
    EmpName VARCHAR(100)
);
CREATE TABLE FullTimeEmployee (
    EmpID INT PRIMARY KEY REFERENCES Employee(EmpID),
    Salary DECIMAL(10, 2)
);
CREATE TABLE PartTimeEmployee (
    EmpID INT PRIMARY KEY REFERENCES Employee(EmpID),
    HourlyRate DECIMAL(10, 2)
);
4. Mapping Aggregation
Aggregation is used when we need to model a relationship between an entity and a relationship set (not
just between two entities). The relationship is treated as a higher-level entity.
• Mapping:
Convert the relationship into an entity, create a table for that entity, and link the original entities
to it via foreign keys.
Example:
If a ProjectAssignment is an aggregation of the Employee and Project entities:
• EER:
ProjectAssignment (EmployeeID, ProjectID, AssignmentDate)
• Relational Mapping:
CREATE TABLE ProjectAssignment (
    EmployeeID INT,
    ProjectID INT,
    AssignmentDate DATE,
    PRIMARY KEY (EmployeeID, ProjectID),
    FOREIGN KEY (EmployeeID) REFERENCES Employee(EmployeeID),
    FOREIGN KEY (ProjectID) REFERENCES Project(ProjectID)
);
5. Mapping of Union Types (Categories)
In union types, an entity instance can belong to one or more subclasses. The subclasses are not
necessarily mutually exclusive.
• Mapping:
Create a table for the union type that contains the common attributes and a discriminator
column to identify the different types (subtypes).
Example:
Person is a union type with Employee and Customer subtypes.
• EER:
Person (ID, Name, Type)
Employee (ID, Salary)
Customer (ID, PurchaseHistory)
• Relational Mapping:
CREATE TABLE Person (
    ID INT PRIMARY KEY,
    Name VARCHAR(100),
    Type VARCHAR(50) -- 'Employee' or 'Customer'
);
CREATE TABLE Employee (
    ID INT PRIMARY KEY REFERENCES Person(ID),
    Salary DECIMAL(10, 2)
);
CREATE TABLE Customer (
    ID INT PRIMARY KEY REFERENCES Person(ID),
    PurchaseHistory TEXT
);
Summary: EER Construct → Relational Mapping
• Strong Entity: Create a table with the entity attributes and primary key.
• Weak Entity: Create a table with the weak entity's attributes, including the primary key of the owner entity.
• Superclass-Subclass (Single Table): Create one table for the superclass and subclasses, including a discriminator column.
• Superclass-Subclass (Multiple Tables): Create separate tables for the superclass and each subclass, with foreign keys linking to the superclass.
• Aggregation: Create a table for the aggregated relationship and link to the original entities via foreign keys.
• Union Types: Create a table with a discriminator column and separate tables for each subtype.
Data Normalization in DBMS
Data Normalization is the process of organizing the attributes and tables of a relational database to
minimize redundancy and dependency. The goal is to ensure that data is stored efficiently and that
relationships between data are well defined to prevent data anomalies such as insertion, update, and
deletion anomalies.
Normalization is typically achieved by dividing large tables into smaller, manageable ones, and defining
relationships between them. This process ensures that the data is logically structured and avoids
redundancy.
There are several normal forms (NF) in database normalization, each with specific rules to improve the
structure of the database.
First Normal Form (1NF)
A relation is in 1NF if every attribute contains only atomic (indivisible) values and there are no repeating groups.
Example:
Consider a table where a single column contains multiple values (not atomic).
OrderID Products
1 Apple, Banana
2 Orange, Mango
To bring this into 1NF, we separate the products into individual rows.
OrderID Product
1 Apple
1 Banana
2 Orange
2 Mango
Second Normal Form (2NF)
A relation is in 2NF if:
• It is in 1NF.
• It does not contain any partial dependency, i.e., every non-prime attribute is fully functionally dependent on the entire primary key.
A partial dependency occurs when a non-prime attribute is dependent on part of a composite primary
key (rather than the entire key).
Example: Consider a table where OrderID and ProductID together form the composite primary key, and
ProductName depends only on ProductID.
OrderID ProductID ProductName Quantity
1 101 Apple 5
1 102 Banana 3
2 101 Apple 2
Here, ProductName depends only on ProductID, not on OrderID. This is a partial dependency, so the
table is not in 2NF.
To bring this into 2NF, decompose the table:
CREATE TABLE Products (
    ProductID INT PRIMARY KEY,
    ProductName VARCHAR(100)
);
CREATE TABLE Orders (
    OrderID INT,
    ProductID INT,
    Quantity INT,
    PRIMARY KEY (OrderID, ProductID),
    FOREIGN KEY (ProductID) REFERENCES Products(ProductID)
);
Third Normal Form (3NF)
A relation is in 3NF if:
• It is in 2NF.
• It does not contain any transitive dependency, i.e., no non-prime attribute depends on another non-prime attribute.
A transitive dependency occurs when a non-prime attribute depends on another non-prime attribute,
rather than directly on the primary key.
Example:
Consider a table where OrderID is the primary key, and CustomerCity depends on CustomerID, and
CustomerID depends on OrderID:
To bring this into 3NF, decompose the table:
CREATE TABLE Customers (
    CustomerID INT PRIMARY KEY,
    CustomerName VARCHAR(100),
    CustomerCity VARCHAR(100)
);
CREATE TABLE Orders (
    OrderID INT PRIMARY KEY,
    CustomerID INT,
    FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID)
);
Boyce-Codd Normal Form (BCNF)
A relation is in BCNF if:
• It is in 3NF.
• For every non-trivial functional dependency, the left-hand side must be a superkey (a key that uniquely identifies each record).
Example:
Consider a table where both StudentID and CourseID form the composite primary key, but
InstructorName depends only on CourseID:
Here, InstructorName depends only on CourseID, which is not a superkey. This violates BCNF.
To bring this into BCNF, decompose the table:
CREATE TABLE Courses (
    CourseID VARCHAR(10) PRIMARY KEY,
    InstructorName VARCHAR(100)
);
CREATE TABLE Enrollments (
    StudentID INT,
    CourseID VARCHAR(10),
    PRIMARY KEY (StudentID, CourseID),
    FOREIGN KEY (CourseID) REFERENCES Courses(CourseID)
);
Fourth Normal Form (4NF)
A relation is in 4NF if:
• It is in BCNF.
• It does not contain any multi-valued dependency, i.e., no attribute set in the table is independent of another.
Example:
Consider a table where a student can have multiple phone numbers and multiple email addresses, but
the phone numbers and emails are independent of each other:
This violates 4NF because PhoneNumber and EmailAddress are multi-valued attributes.
1. StudentPhones Table:
CREATE TABLE StudentPhones (
    StudentID INT,
    PhoneNumber VARCHAR(15),
    PRIMARY KEY (StudentID, PhoneNumber)
);
2. StudentEmails Table:
CREATE TABLE StudentEmails (
    StudentID INT,
    EmailAddress VARCHAR(100),
    PRIMARY KEY (StudentID, EmailAddress)
);
Fifth Normal Form (5NF)
A relation is in 5NF if:
• It is in 4NF.
• It does not contain any join dependency, i.e., it cannot be decomposed into multiple relations without losing information.
In most cases, 5NF is used to deal with complex relationships and ensure that the database can be
decomposed into smaller, more manageable pieces without loss of data integrity.
Summary of Normal Forms and Criteria
• 1NF: Eliminate repeating groups; each attribute must contain atomic values.
• 2NF: Be in 1NF and eliminate partial dependencies on a composite primary key.
• 3NF: Be in 2NF and eliminate transitive dependencies between non-prime attributes.
• BCNF: Eliminate all non-trivial functional dependencies where the left-hand side is not a superkey.
• 4NF: Be in BCNF and eliminate multi-valued dependencies.
• 5NF: Be in 4NF and eliminate join dependencies.
Concurrency Control in Database Management Systems (DBMS) refers to the mechanisms that ensure
that database transactions are executed in a way that preserves the integrity of the database when
multiple transactions are occurring simultaneously. It aims to prevent anomalies and maintain
consistency in a multi-user environment where several users may access and modify the same data
concurrently.
Concurrency control is critical in ensuring that transactions do not interfere with each other in ways that
could lead to errors, such as lost updates, temporary inconsistency, or uncommitted data being read.
The main concurrency control techniques are:
1. Lock-based Protocols
2. Timestamp-based Protocols
3. Optimistic Concurrency Control (OCC)
4. Multiversion Concurrency Control (MVCC)
1. Lock-based Protocols
Locking is one of the most widely used mechanisms to ensure that only one transaction can access a
particular piece of data at any given time. When a transaction wants to read or write data, it locks the
data item to prevent other transactions from accessing it until the lock is released.
Types of Locks:
• Exclusive Lock (X-lock): This type of lock is used when a transaction intends to write (modify) a
data item. It prevents all other transactions from reading or writing to the same data item.
• Shared Lock (S-lock): This lock is used when a transaction wants to read a data item. Multiple
transactions can hold a shared lock on the same data item, but no transaction can write to the
item while there are active shared locks.
Locking Protocols:
• Two-Phase Locking (2PL): This protocol ensures that transactions acquire all the necessary locks
before any data is modified (growing phase), and release the locks only after the transaction is
completed (shrinking phase). It guarantees serializability but may cause deadlocks (when two or
more transactions wait for each other indefinitely).
• Strict 2PL: This is a stricter version of 2PL, where transactions hold all locks until they commit,
and only release them when they commit or abort. It guarantees serializability and also prevents
cascading rollbacks.
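A toy lock table in Python illustrating the shared/exclusive compatibility rules and the strict-2PL idea of releasing everything only at commit or abort; it is a sketch only (requests are simply refused on conflict, with no waiting queue or deadlock detection):

class LockTable:
    # Maps each data item to a pair (mode, set of holding transaction ids).
    def __init__(self):
        self.locks = {}

    def acquire(self, txn, item, mode):            # mode is 'S' (shared) or 'X' (exclusive)
        held = self.locks.get(item)
        if held is None:
            self.locks[item] = (mode, {txn})
            return True
        held_mode, holders = held
        if mode == 'S' and held_mode == 'S':       # shared locks are compatible with each other
            holders.add(txn)
            return True
        if holders == {txn}:                       # the sole holder may re-request or upgrade
            self.locks[item] = ('X' if mode == 'X' else held_mode, holders)
            return True
        return False                               # conflict: caller must wait or abort

    def release_all(self, txn):                    # strict 2PL: release everything at commit/abort
        for item in list(self.locks):
            mode, holders = self.locks[item]
            holders.discard(txn)
            if not holders:
                del self.locks[item]

lt = LockTable()
print(lt.acquire("T1", "A", "S"))   # True
print(lt.acquire("T2", "A", "S"))   # True  (S is compatible with S)
print(lt.acquire("T2", "A", "X"))   # False (T1 also holds a shared lock on A)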
Deadlock in Locking: A deadlock occurs when two or more transactions are blocked forever because
each one is waiting for the other to release a resource (lock). To handle deadlocks:
• Deadlock Detection: Periodically check for deadlocks and resolve them, usually by aborting one
of the transactions involved.
• Deadlock Prevention: Avoid situations where deadlocks can occur, often by ensuring
transactions acquire locks in a predefined order.
2. Timestamp-based Protocols
Timestamp-based protocols use timestamps to decide the serializability of transactions without using
locks. Each transaction is assigned a timestamp when it starts, and this timestamp is used to decide the
order of transactions.
• When a transaction requests to access a data item, its timestamp is compared to the last
timestamp of any transaction that has already accessed the item.
• If the transaction's timestamp is greater than the last access timestamp, it is allowed to access
the data. Otherwise, the transaction is aborted or rolled back.
Key Rules:
• Read Rule: A transaction can read a data item if no write has been performed after its
timestamp.
• Write Rule: A transaction can write a data item only if no other transaction has read or written
the item after its timestamp.
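These two rules can be written down directly in Python. The sketch below assumes per-item read and write timestamps (RTS/WTS) and simply signals an abort instead of actually rolling the transaction back:

class TimestampManager:
    def __init__(self):
        self.rts = {}   # item -> largest timestamp that has read it
        self.wts = {}   # item -> largest timestamp that has written it

    def read(self, ts, item):
        if ts < self.wts.get(item, 0):
            return "abort"                             # a younger transaction already wrote this item
        self.rts[item] = max(self.rts.get(item, 0), ts)
        return "ok"

    def write(self, ts, item):
        if ts < self.rts.get(item, 0) or ts < self.wts.get(item, 0):
            return "abort"                             # a younger transaction already read or wrote it
        self.wts[item] = ts
        return "ok"

tm = TimestampManager()
print(tm.read(5, "X"))    # ok
print(tm.write(3, "X"))   # abort: X was already read at timestamp 5 > 3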
Advantages:
• No locks are used, so deadlocks cannot occur; conflicts are resolved by aborting and restarting transactions.
Disadvantages:
• It may lead to unnecessary transaction aborts or rollbacks if the timestamps do not allow for
transaction progression.
3. Optimistic Concurrency Control (OCC)
Optimistic Concurrency Control assumes that conflicts between transactions are rare, so transactions
are allowed to execute without locks. However, before committing a transaction, the system checks for
conflicts and ensures that the transaction can safely commit.
OCC Phases:
1. Read Phase: During this phase, the transaction reads the data and makes local copies of the
items it accesses. It does not make any changes to the database.
2. Validation Phase: Before committing, the system checks whether there were any conflicts during
the transaction execution. A conflict happens if another transaction has modified any data that
the current transaction has accessed.
3. Write Phase: If validation passes, the transaction writes its changes to the database and
commits. If validation fails, the transaction is rolled back and restarted.
Advantages:
• Suitable for systems with low contention and high read operations.
Disadvantages:
• If conflicts are frequent, many transactions fail validation and must be rolled back and re-executed, wasting the work already done.
4. Multiversion Concurrency Control (MVCC)
Multiversion Concurrency Control (MVCC) is a concurrency control method that allows multiple versions
of a data item to exist concurrently. Each transaction accesses a snapshot of the database at a particular
point in time, ensuring that transactions can work independently without locking the data.
• When a transaction modifies a data item, it does not overwrite the existing value but instead
creates a new version of that data item.
• Each version of the data item is tagged with a timestamp or transaction identifier to track when
it was created.
• Transactions access the most appropriate version of the data based on the snapshot they are
working with.
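A small Python sketch of version chains and snapshot reads; it assumes each committed write appends a (commit timestamp, value) pair and that a reader sees the newest version committed at or before its snapshot timestamp:

class MVCCStore:
    def __init__(self):
        self.versions = {}    # item -> list of (commit_ts, value), ordered by commit timestamp

    def write(self, item, value, commit_ts):
        chain = self.versions.setdefault(item, [])
        chain.append((commit_ts, value))              # a write never overwrites: it adds a new version
        chain.sort(key=lambda pair: pair[0])

    def read(self, item, snapshot_ts):
        visible = [v for ts, v in self.versions.get(item, []) if ts <= snapshot_ts]
        return visible[-1] if visible else None       # newest version visible to this snapshot

store = MVCCStore()
store.write("X", "v1", commit_ts=10)
store.write("X", "v2", commit_ts=20)
print(store.read("X", snapshot_ts=15))   # v1  (the version committed at timestamp 10)
print(store.read("X", snapshot_ts=25))   # v2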
Advantages:
• Provides a high degree of concurrency, as transactions can work with different versions of data.
Disadvantages:
• Maintaining and garbage-collecting multiple versions adds storage and bookkeeping overhead.
• Serializable Isolation Level: This is the highest level of isolation, in which concurrent execution of transactions produces the same result as running them serially, one after another. It prevents all anomalies but can be performance-intensive.
• Isolation Levels: SQL databases support four standard isolation levels to balance consistency and concurrency:
1. Read Uncommitted: Transactions may read uncommitted changes made by other transactions (dirty reads), giving the least consistency but the most concurrency.
2. Read Committed: Transactions only read data that has already been committed, preventing dirty reads.
3. Repeatable Read: Transactions can read the same data multiple times and are guaranteed that no other transaction modifies it in the meantime.
4. Serializable: Transactions are fully isolated from each other, with no interference, ensuring strict consistency.
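For example, a transaction can request the Serializable level explicitly. The sketch below assumes a PostgreSQL database reachable with the connection string shown and the psycopg2 driver installed; the accounts table is hypothetical:

```python
# Illustrative only: the DSN, credentials, and table are assumptions, not real settings.
import psycopg2

conn = psycopg2.connect("dbname=shop user=app password=secret host=localhost")
with conn:                              # commits on success, rolls back on error
    with conn.cursor() as cur:
        # Ask for the strictest isolation level for this transaction only.
        cur.execute("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE")
        cur.execute("SELECT balance FROM accounts WHERE id = %s", (42,))
        print(cur.fetchone())
conn.close()
```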
Database Security in DBMS
Database Security refers to the protection of databases from unauthorized access, misuse, or
corruption. Since databases often contain sensitive and critical data, ensuring their security is vital for
maintaining confidentiality, integrity, and availability. Database security is designed to protect the
database from various threats, including unauthorized access, data breaches, data manipulation, and
other types of cyber-attacks.
Key Objectives of Database Security:
1. Confidentiality: Ensuring that only authorized users can access sensitive data.
2. Integrity: Ensuring that the data remains accurate and consistent, preventing unauthorized
changes.
3. Availability: Ensuring that data is available for authorized users when needed, and preventing
denial of service.
4. Accountability: Keeping track of who accessed the database and what actions were taken,
ensuring audit trails and logs are available.
Common Threats to Database Security:
• SQL Injection: A type of attack where malicious SQL statements are injected into a query to
manipulate the database.
• Data Breaches: Unauthorized access to sensitive or confidential data, leading to data leakage.
• Denial of Service (DoS): An attacker may overload the system, preventing legitimate users from
accessing the database.
• Insider Threats: Security breaches caused by users who have legitimate access to the database
but misuse it.
• Backup and Recovery Failures: Failure to properly back up and restore data in the event of a
failure.
There are various techniques and measures used to protect databases. These can be broadly classified
into the following categories:
1. Authentication
Authentication ensures that only authorized users can access the database. Common authentication
mechanisms include:
• Username and Password: The most basic form of authentication, where the user is identified by
a unique username and verified using a password.
• Multi-factor Authentication (MFA): Requires multiple forms of identification (e.g., password and
fingerprint, password and OTP) for access.
• Biometric Authentication: Uses physical characteristics, like fingerprints, retina scans, or facial
recognition, to authenticate users.
• Single Sign-On (SSO): A mechanism that allows users to authenticate once and gain access to
multiple systems or databases without needing to log in multiple times.
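A minimal sketch of password verification using a salted, slow hash (PBKDF2 from Python's standard library); how the salt and digest are stored per user is assumed, not shown:

```python
# Minimal sketch of password-based authentication with PBKDF2.
import hashlib, hmac, os

def hash_password(password, salt=None):
    salt = salt or os.urandom(16)                    # random per-user salt
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
    return salt, digest                              # store both for the user

def verify_password(password, salt, stored_digest):
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, stored_digest)   # constant-time comparison

salt, digest = hash_password("s3cret!")
print(verify_password("s3cret!", salt, digest))   # True
print(verify_password("wrong", salt, digest))     # False
```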
2. Authorization
Authorization ensures that authenticated users can only access resources that they are permitted to use,
based on their roles and privileges. It controls the permissions assigned to users or roles, such as:
• Role-Based Access Control (RBAC): Users are assigned roles, and each role has predefined
permissions (e.g., read, write, delete) associated with it. This simplifies access control and
improves security.
• Discretionary Access Control (DAC): The owner of a resource (e.g., a table or view) decides who
can access it and what operations they can perform.
• Mandatory Access Control (MAC): Security policies define how access rights are granted, often
based on levels of classification (e.g., confidential, secret) and labels assigned to data and users.
• Attribute-Based Access Control (ABAC): Access control decisions are based on the attributes of
users, resources, or the environment (e.g., time of access, location, or job role).
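A minimal sketch of an RBAC check, assuming roles map to (table, action) permissions and users map to roles; the roles, users, and table names are illustrative:

```python
# Minimal sketch of role-based access control: check permissions before any operation.
ROLE_PERMISSIONS = {
    "analyst":   {("students", "read")},
    "registrar": {("students", "read"), ("students", "write")},
    "dba":       {("students", "read"), ("students", "write"), ("students", "delete")},
}

USER_ROLES = {"alice": {"registrar"}, "bob": {"analyst"}}

def is_allowed(user, table, action):
    granted = set()
    for role in USER_ROLES.get(user, ()):
        granted |= ROLE_PERMISSIONS.get(role, set())
    return (table, action) in granted

print(is_allowed("alice", "students", "write"))  # True
print(is_allowed("bob", "students", "write"))    # False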
3. Encryption
Encryption ensures that the data is unreadable to unauthorized users, even if it is intercepted or
accessed improperly.
• Data-at-Rest Encryption: Encrypting data stored in the database, ensuring that it remains
protected even if the physical storage media is compromised.
• Data-in-Transit Encryption: Encrypting data while it is being transmitted over the network (e.g.,
using SSL/TLS protocols) to prevent eavesdropping and man-in-the-middle attacks.
• Column-Level Encryption: Encrypting specific sensitive columns in the database (e.g., credit card
numbers, social security numbers) rather than the entire database.
• Transparent Data Encryption (TDE): A feature supported by many DBMSs that automatically
encrypts and decrypts data at the storage level without requiring changes to the application.
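A minimal sketch of column-level encryption, assuming the third-party cryptography package (Fernet) is installed and that the key is loaded from a key-management system rather than hard-coded; the row and column names are illustrative:

```python
# Illustrative column-level encryption: only the sensitive column is encrypted.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # in practice: load from a KMS / vault, never from code
cipher = Fernet(key)

row = {"name": "John", "card_number": "4111111111111111"}
row["card_number"] = cipher.encrypt(row["card_number"].encode())   # store only ciphertext

# Later, an authorized reader decrypts the column:
print(cipher.decrypt(row["card_number"]).decode())
```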
4. Auditing and Monitoring
Auditing and Monitoring involve tracking database activity and maintaining logs to detect and prevent
unauthorized actions, such as:
• Audit Trails: Detailed logs of all database actions, including who performed them, when they
were performed, and what data was accessed or modified. These logs help detect suspicious
activities and investigate security incidents.
• Access Logs: Logs that record login attempts, successful logins, failed logins, and the duration of
the session. This helps detect brute-force attacks or unauthorized access attempts.
• Alerting Systems: Automated alerts triggered by suspicious activity, such as unauthorized access,
database changes, or failed login attempts.
5. Backup and Recovery
Backup and recovery mechanisms are essential to ensure that the database can be restored to a secure
and consistent state after a failure, attack, or corruption.
• Regular Backups: Ensuring periodic backups of the entire database, including all tables, indexes,
and metadata, so data can be restored if needed.
• Point-in-Time Recovery: This feature allows a database to be restored to a specific point in time
(e.g., before a data breach or corruption occurred).
• Offsite Backups: Storing backups in a secure offsite location (e.g., cloud storage) ensures that the
data can be recovered even in the event of physical damage to the primary data storage system.
• Encryption of Backups: Ensuring that backup data is encrypted to prevent unauthorized access
in case the backup media is lost or stolen.
6. SQL Injection Prevention
SQL Injection is one of the most common database security vulnerabilities: an attacker inserts or manipulates SQL queries to execute malicious commands on the database. Common countermeasures include:
• Prepared Statements (Parameterized Queries): Ensure that SQL queries are structured properly
and that user input is treated as data, not executable code.
• Stored Procedures: Use stored procedures to encapsulate SQL logic and reduce the possibility of
injection by not allowing direct user input in SQL queries.
• Input Validation: Ensure that user inputs are sanitized and validated to prevent malicious input,
such as checking for special characters like semicolons (;) and single quotes (').
• Escaping User Input: Properly escape special characters in SQL queries to ensure that user inputs
are treated as literals, not code.
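A minimal sketch contrasting string concatenation with a parameterized query, using Python's built-in sqlite3 module; the table and the injection payload are illustrative:

```python
# Parameterized query vs. string concatenation.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES ('admin', 'pw')")

user_input = "' OR '1'='1"   # classic injection payload

# UNSAFE: the payload becomes part of the SQL text and matches every row.
unsafe = conn.execute(
    "SELECT * FROM users WHERE name = '" + user_input + "'").fetchall()

# SAFE: the placeholder keeps the payload as plain data, so nothing matches.
safe = conn.execute(
    "SELECT * FROM users WHERE name = ?", (user_input,)).fetchall()

print(unsafe)   # [('admin', 'pw')]  -- injection succeeded
print(safe)     # []                 -- injection neutralized
```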
7. Database Firewalls
A Database Firewall is a specialized security tool designed to monitor and filter database traffic to
prevent malicious or unauthorized access.
• Traffic Filtering: Database firewalls can monitor incoming queries and block suspicious requests,
such as those attempting SQL injection or unauthorized access.
• Behavior Analysis: The firewall can learn the typical query patterns of authorized users and flag
any queries that deviate from this pattern.
• Granular Access Control: Enforcing access control rules to ensure that only authorized users and
applications can access certain tables, columns, or database features.
8. Database Masking
Data Masking involves obfuscating sensitive data to protect it from unauthorized access while
maintaining its utility for testing and development.
• Static Data Masking: Replaces sensitive data with fake data in non-production environments. For
example, credit card numbers may be replaced with dummy values like XXXXXXXXXXXX1234.
• Dynamic Data Masking: Dynamically masks sensitive data when it is accessed, ensuring that only
authorized users can see the full data.
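A minimal sketch of dynamic masking, assuming the caller already knows whether the user is authorized; the column and the masking rule are illustrative:

```python
# Minimal sketch of dynamic data masking: unauthorized readers see only the last four digits.
def mask_card_number(value):
    return "X" * (len(value) - 4) + value[-4:]

def read_card_number(value, user_is_authorized):
    return value if user_is_authorized else mask_card_number(value)

print(read_card_number("4111111111111111", user_is_authorized=False))  # XXXXXXXXXXXX1111
print(read_card_number("4111111111111111", user_is_authorized=True))   # 4111111111111111
```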
9. Physical Security
Physical Security involves protecting the physical hardware where the database resides, ensuring that
unauthorized individuals cannot gain access to the database servers.
• Access Controls: Limiting physical access to the database servers to authorized personnel only.
• Surveillance Systems: Installing cameras and security systems to monitor access to the database
hardware.
• Environmental Controls: Ensuring that data centers are equipped with proper fire suppression
systems, backup power supplies, and climate control to protect the hardware from physical
threats.
Database Recovery in DBMS
Database Recovery refers to the process of restoring a database to a consistent and correct state after a
failure or crash, ensuring that no data is lost, and integrity is maintained. A failure can occur due to
various reasons, including hardware malfunctions, software bugs, user errors, or power outages. The
goal of database recovery is to handle these failures in a way that minimizes downtime and ensures that
the database is returned to a reliable and usable state.
Key Concepts in Database Recovery
1. Transaction: A transaction is a logical unit of work; recovery reasons about which transactions were committed, active, or aborted at the moment of failure.
2. Log: The transaction log is a crucial component of database recovery. It records all the changes
made to the database, including both the before and after values of data items. The log helps in
identifying the transactions that were in progress at the time of a failure and provides the
necessary information for rolling back or rolling forward transactions.
3. Checkpoint: A checkpoint is a mechanism that periodically forces the current state of the database and the log to disk. It marks a point from which the system can safely recover, reducing the need to replay the entire transaction log during recovery.
4. Undo and Redo: These are two fundamental operations used during recovery:
o Undo: Reverts changes made by a transaction that was not completed (rolled back).
o Redo: Re-applies changes made by a transaction that was committed before the failure.
Types of Failures
1. Transaction Failures: A transaction may fail due to logical errors or violations of the integrity
constraints (e.g., attempting to divide by zero). In this case, only the affected transaction needs
to be rolled back.
2. System Failures: A system crash, such as a power failure or operating system crash, can occur.
This type of failure may leave some transactions in an inconsistent state, and the database needs
to be recovered to ensure consistency.
3. Media Failures: These failures occur when the storage medium (hard disk, SSD, etc.) experiences
physical damage, resulting in the loss of data. Recovery from this type of failure is often more
complex and may require restoration from backups.
1. Write-Ahead Logging (WAL)
The Write-Ahead Log (WAL) protocol is the foundation of many database recovery techniques.
According to this protocol:
• Before any changes are made to the database, the log record of the transaction (which includes
the before and after images of the data) must be written to disk.
• Only after the log is written can the changes be applied to the actual database.
This protocol ensures that the database can always be restored to a consistent state, even in the event of
a crash, by replaying the transaction logs.
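A minimal sketch of the write-ahead rule, assuming a simple append-only log file and an in-memory dictionary standing in for the data pages; the file and key names are illustrative:

```python
# Minimal sketch of the write-ahead rule: the log record (with before/after images)
# is forced to disk before the in-place update is applied.
import json, os

LOG_FILE = "wal.log"
database = {"A": 100}            # stands in for the on-disk data pages

def wal_write(txn_id, key, new_value):
    record = {"txn": txn_id, "key": key,
              "before": database.get(key), "after": new_value}
    with open(LOG_FILE, "a") as log:
        log.write(json.dumps(record) + "\n")
        log.flush()
        os.fsync(log.fileno())   # the log reaches stable storage first...
    database[key] = new_value    # ...only then is the database changed

wal_write("T1", "A", 150)
print(database)                  # {'A': 150}, and wal.log now holds the record
```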
2. Log-Based Recovery
The Transaction Log is a crucial part of the recovery process. It records each transaction's operation,
which can be used to:
• Redo operations that were committed but not yet written to the database before the failure.
• Undo operations for transactions that were not committed at the time of the failure.
Two log-based update strategies are common:
• Deferred Update: Changes are first written to the log, and then the database is updated. If a
transaction fails, no changes are made to the database.
• Immediate Update: Changes are applied to the database immediately, but the transaction log is
still used for recovery. If a failure occurs, the database uses the log to undo or redo changes.
3. Checkpointing
A Checkpoint is a point in time where the database and log are synchronized, meaning that all
transactions before the checkpoint are either committed or rolled back. After a checkpoint, only a subset
of the transaction log needs to be replayed for recovery.
Checkpointing reduces the time required for recovery after a failure because it minimizes the number of
transactions that must be reprocessed from the log. Typically, a checkpoint includes:
• Forcing all modified (dirty) pages in the buffer to disk.
• Writing a checkpoint record to the log that lists the transactions active at that moment.
• Forcing the log itself to stable storage.
The recovery process can then begin from the last checkpoint, avoiding the need to scan the entire log.
4. Recovery Using Redo and Undo
1. Redo Phase: Re-applies all transactions that were committed before the failure but whose
changes were not written to disk. This ensures that no committed transaction is lost.
2. Undo Phase: Rolls back any transactions that were active at the time of the failure and had not
been committed. This ensures that the database does not contain any partial transactions that
could lead to inconsistency.
• Step 1: The system starts by scanning the transaction log from the last checkpoint.
• Step 2: It first re-applies the operations (redo) for transactions that were committed but had not
yet been written to disk at the time of the failure.
• Step 3: Then, it undoes the operations for transactions that were active or had failed before the
crash.
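A minimal sketch of the redo-then-undo pass over a log, assuming each update record carries before/after images and committed transactions are marked by a commit record; the log contents are illustrative:

```python
# Minimal sketch of log-driven recovery: redo committed transactions, then undo the rest.
log = [
    {"txn": "T1", "key": "A", "before": 100, "after": 150},
    {"txn": "T1", "op": "commit"},
    {"txn": "T2", "key": "B", "before": 7, "after": 9},
    # crash: T2 never wrote a commit record
]

def recover(database, log):
    committed = {r["txn"] for r in log if r.get("op") == "commit"}
    # Redo phase: reapply every update of a committed transaction.
    for r in log:
        if "key" in r and r["txn"] in committed:
            database[r["key"]] = r["after"]
    # Undo phase: roll back updates of uncommitted transactions, scanning backwards.
    for r in reversed(log):
        if "key" in r and r["txn"] not in committed:
            database[r["key"]] = r["before"]
    return database

print(recover({"A": 100, "B": 9}, log))   # {'A': 150, 'B': 7}
```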
5. Shadow Paging
Shadow Paging is another technique used for database recovery, where two pages of data are
maintained: the current page and the shadow page (a backup page). During normal operations:
• When a transaction makes a change, the change is written to a new current page, while the shadow page keeps the previous version of the data unchanged.
• If a failure occurs, the system can recover by using the shadow page (which remains unchanged)
to restore the previous consistent state.
This approach reduces the need for extensive logging but may require more space since two copies of
the data are maintained at all times.
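A minimal sketch of shadow paging, assuming the page table is a dictionary and "commit" is the single pointer swap from the shadow copy to the updated copy; the page contents are illustrative:

```python
# Minimal sketch of shadow paging: the shadow pages are never modified in place;
# only the final swap at commit makes the new pages visible.
import copy

shadow_pages = {1: "old contents of page 1", 2: "old contents of page 2"}

def run_transaction(updates, fail=False):
    global shadow_pages
    current_pages = copy.deepcopy(shadow_pages)   # work on fresh copies of the pages
    current_pages.update(updates)
    if fail:
        return shadow_pages          # crash/abort: the shadow copy is still intact
    shadow_pages = current_pages     # commit: atomically switch to the new page table
    return shadow_pages

print(run_transaction({1: "new contents"}, fail=True))    # unchanged
print(run_transaction({1: "new contents"}, fail=False))   # page 1 updated
```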
6. ARIES Recovery Algorithm
ARIES (Algorithm for Recovery and Isolation Exploiting Semantics) is a widely used recovery algorithm that combines WAL and checkpointing with the concept of
"log-based" recovery. It follows these key principles:
• Write-Ahead Logging (WAL): All log records are written to stable storage before changes are
applied to the database.
• Cache Management: The ARIES algorithm ensures that recovery can take place even if pages are
cached in memory and not written to disk.
• Redo and Undo Operations: ARIES performs both redo (to reapply committed transactions) and
undo (to roll back incomplete transactions).
• Analysis Phase: Scans the log to identify which transactions were active and which pages need
to be redone or undone.
7. Backup and Restore
While transaction logs and recovery mechanisms can restore a database to its most recent state after a
failure, backups are essential for more catastrophic failures, such as media crashes.
• Full Backup: A complete snapshot of the entire database at a specific point in time.
• Incremental Backup: A backup that contains only the changes made since the last backup, which
reduces storage requirements.
• Differential Backup: A backup that includes changes made since the last full backup.
• Point-in-Time Recovery: In case of a failure, the system can restore from backups and apply
transaction logs to bring the database to a specific point in time before the failure.
Database Recovery Techniques
Database recovery techniques are critical in ensuring that a database can be restored to a consistent state after a failure. Different failure types (transaction failures, system crashes, media failures) require different strategies and techniques. The recovery process ensures that no committed data is lost and that no uncommitted data remains in the database after recovery.
Types of Failures
• Transaction Failures: Occurs when a transaction cannot complete due to internal errors or
violates integrity constraints.
• System Failures: System crashes, such as a power failure or operating system crash.
• Media Failures: Hardware failures or damage to storage devices that affect the database’s data.
The recovery process involves using transaction logs, backups, and specific algorithms that allow the
system to restore the database state.
1. Write-Ahead Logging (WAL)
Write-Ahead Logging is one of the most widely used recovery techniques. It ensures that logs of the
transaction are written before any changes are made to the database. This guarantees that if the system
crashes, the log can be used to redo committed transactions or undo incomplete ones.
• Log records: Before any changes are made to the database, the transaction's log record
(including before and after images) is written to disk.
• Transaction commit: Once the log record is written, the changes can then be applied to the
database.
• Recovery: In case of a crash, WAL ensures that the database can be restored by using the log to
either redo or undo transactions.
2. Log-Based Recovery (Transaction Log)
The Transaction Log is a crucial part of recovery. It records the changes made by transactions and is used
to determine which transactions were committed or rolled back.
• Redo: When recovering, transactions that were committed before the failure, but whose
changes were not fully written to the database, are reapplied to the database.
• Undo: For transactions that were in progress at the time of the failure, any changes they made
need to be undone.
• Deferred Update: In this approach, changes are not written to the database until the transaction
commits. If the transaction fails before committing, no changes are made to the database,
ensuring consistency.
• Immediate Update: In this approach, changes are written to the database immediately, but the
log is still used to ensure recovery. If the transaction fails before committing, the log can be used
to undo changes.
3. Checkpointing
A checkpoint is a mechanism to reduce recovery time by periodically saving the current state of the
database and its log. During a checkpoint:
• All modified (dirty) pages in memory are written to disk.
• The transaction log is updated to reflect that all changes up to that point are safely saved.
In case of a failure, the system can begin recovery from the most recent checkpoint, reducing the need
to process the entire log.
• Benefits: Checkpointing helps to minimize the amount of work required during recovery, as the
system only needs to replay the log from the last checkpoint.
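A minimal sketch of taking a checkpoint on top of a write-ahead log, assuming dirty pages, on-disk pages, and the log are simple in-memory structures; all names are illustrative:

```python
# Minimal sketch of checkpointing: flush dirty pages, then append a checkpoint record
# naming the active transactions, so recovery can start scanning from that record.
log = []                 # append-only log records
dirty_pages = {}         # page id -> contents modified since the last flush
disk_pages = {}          # stands in for the on-disk database
active_txns = set()

def take_checkpoint():
    disk_pages.update(dirty_pages)        # force modified pages to disk
    dirty_pages.clear()
    log.append({"type": "checkpoint", "active": sorted(active_txns)})

def last_checkpoint_index():
    for i in range(len(log) - 1, -1, -1):
        if log[i].get("type") == "checkpoint":
            return i                      # recovery only replays log[i:]
    return 0
```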
4. Shadow Paging
Shadow Paging is an alternative recovery technique in which two copies of the database pages are
maintained:
• Current Pages: The pages that the transaction modifies during execution.
• Shadow Pages: A backup of the data pages before any changes are made.
In this method:
• When a transaction makes a change, a new page is created in memory (instead of directly
modifying the original page), and the shadow page remains intact.
• If a failure occurs, the system can discard the changes made and restore the database to the
state of the shadow pages, effectively "undoing" all uncommitted changes.
• Advantages: No need for a complex log; however, it can require a larger storage space since two
copies of pages are maintained.
5. ARIES Recovery Algorithm
ARIES is a sophisticated recovery algorithm used in many modern DBMSs. It combines WAL, checkpointing, and transaction logs to provide a robust recovery mechanism, and proceeds in three phases:
1. Analysis Phase: Scans the log from the last checkpoint to determine which transactions were active at the time of the crash and which pages were dirty (and may need to be redone).
2. Redo Phase: Re-applies all changes made by transactions that were committed before the
failure. This ensures that committed changes are not lost.
3. Undo Phase: Rolls back changes made by transactions that were not committed at the time of
the crash.
ARIES is known for its ability to handle large databases and high concurrency with minimal overhead.
6. Backup and Restore
Backup and Restore is a fundamental part of database recovery, especially in cases of hardware or
media failure. Regular backups (full, incremental, or differential) help ensure that data can be restored to
a consistent state after catastrophic failures.
• Full Backup: A complete copy of the entire database at a specific point in time.
• Incremental Backup: Only the changes made since the last backup are saved.
• Differential Backup: Saves the changes made since the last full backup.
During recovery, the database is restored from the latest full backup and then the incremental or
differential backups are applied in sequence to bring the database to the state it was in at the time of
failure.
7. Point-in-Time Recovery
Point-in-Time Recovery (PITR) allows the database to be restored to a specific point in time, typically just
before a failure or undesirable event (e.g., accidental deletion of data).
• Process: The database is first restored from the most recent backup, and then transaction logs
are applied up to the desired recovery point.
• Use case: PITR is useful for undoing specific changes, such as recovering from accidental data
deletion or corruption.
8. Deferred and Write-Behind Updates
These are techniques used to delay the actual writing of changes to the database until after the
transaction commits.
• Write-Behind Update: Updates are first recorded in the log (and held in memory buffers) and written to the database later; as long as the log reaches stable storage before commit, no committed change is lost in a crash.
• Deferred Update: Changes are made to the transaction log but not applied to the database until
after the transaction commits. This approach ensures that no updates are made to the database
if a failure occurs before the commit.
9. Quorum-Based Replication
Quorum-based replication involves maintaining multiple copies of the database across different servers.
The system requires a majority (quorum) of the database replicas to be available for operations to
proceed.
• Recovery: In the event of a failure, the database can be restored using the available replicas, and
once the failed system recovers, it can be synchronized with the primary database.
This technique helps in achieving high availability and data consistency in distributed systems.
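A minimal sketch of quorum reads and writes over N replicas, assuming a majority quorum so that read and write quorums always overlap; the replica objects and values are illustrative:

```python
# Minimal sketch of majority-quorum reads and writes.
N = 3
QUORUM = N // 2 + 1                     # majority: 2 of 3

replicas = [{"value": None, "version": 0} for _ in range(N)]

def quorum_write(value, available):
    if len(available) < QUORUM:
        raise RuntimeError("not enough replicas for a write quorum")
    version = max(r["version"] for r in replicas) + 1
    for r in available[:QUORUM]:        # the write lands on at least a majority
        r.update(value=value, version=version)

def quorum_read(available):
    if len(available) < QUORUM:
        raise RuntimeError("not enough replicas for a read quorum")
    # Overlapping quorums guarantee at least one replica has the latest version.
    return max(available[:QUORUM], key=lambda r: r["version"])["value"]

quorum_write("balance=500", available=replicas)   # all 3 replicas reachable
print(quorum_read(available=replicas[1:]))        # one replica down, still reads the latest value
```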
10. Hybrid Recovery Approaches
Many modern DBMSs use a combination of different recovery techniques to maximize both performance
and reliability. These hybrid approaches combine transaction logging, checkpoints, backup systems, and
distributed replication to achieve fast and reliable recovery.
For example:
• Write-Ahead Logging (WAL) may be used for transaction-level recovery, while shadow paging or
ARIES may be used to enhance performance or handle specific types of failures more efficiently.