Unit 6: File Indexing and Transaction Processing
16.6 Files of Unordered Records (Heap Files)
Overview
Heap File Definition:
The simplest type of file organization where records are placed in the
order they are inserted.
New records are added at the end of the file.
Also known as a heap or pile file.
Often used with additional access paths like secondary indexes for
efficient searching and retrieval.
Insertion of Records
Efficiency:
Very efficient due to the simplicity of appending records.
The last disk block of the file is copied into a buffer, the new record is
appended, and the block is rewritten back to the disk.
The address of the last file block is stored in the file header.
Searching for Records
Search Procedure:
Involves a linear search through the file block by block, making it an
expensive operation.
On average, if exactly one record satisfies the search condition, half of the
file blocks must be read and searched before it is found.
For a file with \( b \) blocks, this requires searching \( b/2 \) blocks on
average (a sketch of heap-file insertion and linear search appears after this list).
If no records or multiple records satisfy the search condition, all \( b \)
blocks must be searched.
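As a minimal sketch of these two operations (not code from the text), the following Python snippet appends records to the last block and performs a block-by-block linear search; the HeapFile class, the BLOCK_CAPACITY value, and the record layout are illustrative assumptions.

```python
# A minimal sketch of heap-file insertion and linear search.
# BLOCK_CAPACITY and the record layout are illustrative assumptions.

BLOCK_CAPACITY = 4  # records per block (the blocking factor)

class HeapFile:
    def __init__(self):
        self.blocks = [[]]  # each "block" is a list of records (dicts)

    def insert(self, record):
        """Append the record to the last block; allocate a new block if it is full."""
        last = self.blocks[-1]
        if len(last) == BLOCK_CAPACITY:
            self.blocks.append([])
            last = self.blocks[-1]
        last.append(record)

    def linear_search(self, field, value):
        """Scan block by block; on average half the blocks are read
        when exactly one record matches."""
        blocks_read = 0
        for block in self.blocks:
            blocks_read += 1
            for record in block:
                if record.get(field) == value:
                    return record, blocks_read
        return None, blocks_read

heap = HeapFile()
for i in range(10):
    heap.insert({"id": i, "name": f"rec{i}"})
print(heap.linear_search("id", 6))   # -> ({'id': 6, 'name': 'rec6'}, 2)
```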
Deleting Records
Procedure:
Locate the block containing the record, copy the block into a buffer,
delete the record from the buffer, and rewrite the block back to the disk.
This leaves unused space in the disk block, leading to wasted storage if
many records are deleted.
Deletion Marker:
An alternative method involves using an extra byte or bit (deletion
marker) stored with each record.
A specific value of the marker indicates a deleted record.
Search programs only consider records without the deletion marker.
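A standalone sketch of logical deletion with a deletion marker; the boolean deleted flag standing in for the deletion bit, and the list-of-blocks layout, are assumptions for illustration.

```python
# Sketch of logical deletion using a deletion marker (illustrative only).

def delete(blocks, field, value):
    """Mark matching records as deleted instead of physically removing them."""
    for block in blocks:
        for record in block:
            if record.get(field) == value:
                record["deleted"] = True

def search_live(blocks, field, value):
    """Search programs skip records whose deletion marker is set."""
    for block in blocks:
        for record in block:
            if not record.get("deleted", False) and record.get(field) == value:
                return record
    return None

blocks = [[{"id": 1}, {"id": 2}], [{"id": 3}]]
delete(blocks, "id", 2)
print(search_live(blocks, "id", 2))   # -> None (the record still occupies space)
```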
Reorganization
Purpose:
Periodic reorganization is needed to reclaim unused space from deleted
records.
During reorganization, file blocks are accessed sequentially, and
records are packed by removing deleted records.
Blocks are filled to capacity after reorganization.
Alternative Use of Deleted Space:
Space of deleted records can be reused for new records, requiring
additional bookkeeping to track empty locations.
Record Structure
Spanned vs. Unspanned Organization:
Unordered files can use spanned (records can span multiple blocks) or
unspanned (records are confined to a single block) organization.
Applicable to both fixed-length and variable-length records.
Modification of Records:
Modifying a variable-length record may necessitate deletion and
reinsertion if the modified record does not fit in the original space.
Reading Records in Sorted Order
Creating a Sorted Copy:
Reading records in order of some field values requires creating a sorted
copy of the file.
Sorting large disk files is expensive and requires special external
sorting techniques.
Direct Access
Unordered Fixed-Length Records:
Using unspanned blocks and contiguous allocation, direct access by
record position is straightforward.
If the records are numbered 0, 1, ..., r − 1 and the records in each block are
numbered 0, 1, ..., bfr − 1 (where bfr is the blocking factor), then the i-th
record of the file is located in block \( \lfloor i / bfr \rfloor \) and is the
\( (i \bmod bfr) \)-th record in that block.
Such files are often called relative or direct files due to direct access by
relative positions.
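A tiny worked example of the block-and-offset calculation (the blocking factor bfr = 4 and the record number are assumed values):

```python
# Relative (direct) access for unspanned, fixed-length records:
# record i lives in block floor(i / bfr) at offset i mod bfr.
bfr = 4                       # blocking factor (records per block) -- assumed
i = 6                         # record number to locate -- assumed
block_number = i // bfr       # -> 1
offset_in_block = i % bfr     # -> 2
print(block_number, offset_in_block)
```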
Access Paths:
Direct access by position facilitates constructing access paths such as
indexes, improving search and retrieval efficiency.
16.7 Files of Ordered Records (Sorted Files)
Overview
Ordered (Sequential) File Definition:
Records are physically ordered on disk based on the values of an
ordering field.
If the ordering field is a key field (unique value in each record), it is
called the ordering key.
Often referred to as a sequential file when ordered by an ordering
key.
Advantages
Efficient Reading:
Reading records in order of the ordering key is efficient because no
sorting is required.
Facilitates efficient sequential access and search conditions like key
= value or range conditions such as value1 < key < value2 .
Next Record Access:
Finding the next record in order usually requires no additional block
accesses unless the current record is the last in the block.
Binary Search:
Binary search can be used on blocks rather than records, improving
search efficiency.
For a file with \( b \) blocks, a binary search accesses roughly \( \lceil \log_2 b \rceil \)
blocks, which is better than the linear search requiring \( b/2 \) blocks on
average (see the sketch below).
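A minimal sketch of binary search over blocks rather than individual records, assuming an in-memory list of blocks sorted on a field named key (no overflow records):

```python
# Sketch of binary search over the *blocks* of an ordered file.
# sorted_blocks is a list of blocks, each a list of records sorted on "key".

def binary_search_blocks(sorted_blocks, key_value):
    lo, hi = 0, len(sorted_blocks) - 1
    while lo <= hi:
        mid = (lo + hi) // 2                 # one block access per iteration
        block = sorted_blocks[mid]
        if key_value < block[0]["key"]:      # smaller than the block's first key
            hi = mid - 1
        elif key_value > block[-1]["key"]:   # larger than the block's last key
            lo = mid + 1
        else:                                # if the key exists, it is in this block
            for record in block:
                if record["key"] == key_value:
                    return record
            return None
    return None

blocks = [[{"key": k} for k in range(start, start + 4)] for start in range(0, 40, 4)]
print(binary_search_blocks(blocks, 17))      # -> {'key': 17}
```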
Search Efficiency
Range Search:
Efficient due to the contiguous nature of records that satisfy the
search condition.
Random Access:
Provides no advantages for random or ordered access based on
nonordering fields.
Requires a linear search for random access.
Insertion and Deletion
Insertion:
Expensive as records must remain ordered.
Requires finding the correct position and shifting records, which is
time-consuming for large files.
On average, half the records must be moved, involving reading and
rewriting blocks.
Deletion:
Less severe if deletion markers and periodic reorganization are
used.
Efficiency Improvement:
Keep unused space in each block for new records.
Use an overflow or transaction file for new records, periodically
merging it with the main file to maintain order.
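A rough sketch of the overflow-file idea, with flat Python lists of (key, record) pairs standing in for the sorted main file and the unordered overflow file:

```python
# New records go to an unordered overflow file; periodic reorganization
# merges it back into the sorted main file. Structures are illustrative.

main_file = [(1, "a"), (3, "b"), (7, "c")]   # kept sorted on the ordering key
overflow = []

def insert(key, data):
    overflow.append((key, data))             # cheap: no shifting of the main file

def reorganize():
    """Merge the overflow file into the main file and re-sort (done periodically)."""
    global main_file, overflow
    main_file = sorted(main_file + overflow)
    overflow = []

insert(5, "d")
insert(2, "e")
reorganize()
print(main_file)   # [(1, 'a'), (2, 'e'), (3, 'b'), (5, 'd'), (7, 'c')]
```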
Record Modification
Nonordering Field:
Modified by changing and rewriting the record in its original
location.
Ordering Field:
Modification requires deletion of the old record and insertion of the
modified record due to position change.
Reading and Reorganization
Sequential Reading:
Efficient if ignoring overflow records.
Double buffering can be used to read blocks consecutively.
Including Overflow:
Requires merging overflow records into the correct positions.
Periodic reorganization sorts and merges the overflow file with the
master file, removing records marked for deletion.
Use Cases and Limitations
Primary Index:
Ordered files are often used with an additional access path called a
primary index, resulting in an indexed-sequential file.
Improves random access time on the ordering key field.
Clustered File:
If the ordering attribute is not a key, the file is called a clustered file.
Summary
Access Time:
Table 16.3 summarizes the average access time, in block accesses, to find a
specific record in a file with \( b \) blocks under each organization.
Use in Databases:
Ordered files are rarely used in databases unless combined with an additional
access path, such as a primary index.
Key Points
Ordered records provide efficient sequential and range-based access.
Binary search on blocks enhances search efficiency.
Insertions and deletions are costly operations due to the need to
maintain order.
Combining with primary indexes (indexed-sequential files) enhances
performance in database applications.
17.1 Types of Single-Level Ordered Indexes
Introduction to Ordered Indexes
Concept: Similar to a textbook index which lists important terms in
alphabetical order with page numbers for easy lookup.
Purpose: Allows efficient access to records using indexing fields
without scanning the entire file linearly.
Structure: Index stores values of the index field along with pointers to
disk blocks containing those values, allowing binary search on the
index.
Key Fields and Indexing
Indexing Field: Single field used to create an index, storing values in an
ordered manner for efficient searching.
Binary Search: Performed on the index, which is smaller than the data
file, enhancing search efficiency.
Multilevel Indexes: Extend the idea of binary search by partitioning the
search space n ways (rather than 2 ways) at each level, improving performance.
Types of Ordered Indexes
1. Primary Index:
Defined on the ordering key field of an ordered file.
Unique values for the ordering key field ensure each record is
distinctly accessible.
The index entry consists of the ordering key and a pointer to the
disk block containing the first record with that key value.
2. Clustering Index:
Used when the ordering field is not a key field (multiple records can
have the same value).
The file is clustered based on the ordering field.
Only one clustering index per file is allowed, like the primary index.
3. Secondary Index:
Defined on non-ordering fields.
Allows multiple indexes on different fields to enhance access
efficiency.
Can exist alongside a primary or clustering index.
Primary Indexes
Structure:
Ordered file with fixed-length records, consisting of the primary key
and a pointer to a disk block.
Each entry corresponds to the first record in a block (anchor
record).
Index Entries:
<K(i), P(i)> where K(i) is the primary key and P(i) is the block
pointer.
Can point to the physical address, record address, or logical
address of the block or record.
Example:
Index entry: <K(1) = "Aaron, Ed", P(1) = address of block 1>
Sparse vs. Dense Indexes:
Dense Index: Index entry for every search key value.
Sparse Index: Index entry for some search key values (primary
index is typically sparse).
Efficiency:
Smaller index size compared to the data file.
Requires fewer block accesses for search operations.
An example calculation shows the number of block accesses dropping
from 13 to 6 when a primary index is used.
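A minimal sketch of a sparse primary index: one entry per data block whose key is the block's anchor (first) record, searched with binary search on the index followed by a single data-block scan. The in-memory lists, field layout, and names below are illustrative assumptions, not the textbook's code.

```python
# Sparse primary index: one <K(i), P(i)> entry per data block, where K(i) is the
# anchor record's key and P(i) is the block number.
import bisect

data_blocks = [
    [("Aaron, Ed",), ("Abbot, Kim",)],        # block 0
    [("Adams, John",), ("Akers, Jan",)],      # block 1
    [("Alexander, Ed",), ("Alfred, Bob",)],   # block 2
]

# Build the index: anchor key of each block -> block number.
index = [(block[0][0], p) for p, block in enumerate(data_blocks)]

def search(key):
    """Binary-search the (much smaller) index, then scan one data block."""
    keys = [k for k, _ in index]
    pos = bisect.bisect_right(keys, key) - 1     # last anchor key <= search key
    if pos < 0:
        return None
    block = data_blocks[index[pos][1]]           # one data-block access
    return next((rec for rec in block if rec[0] == key), None)

print(search("Akers, Jan"))   # -> ('Akers, Jan',)
```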
Insertion and Deletion Issues:
Insertion: Inserting a new record may require moving records and
updating index entries.
Overflow Handling: Use of unordered overflow files or linked lists
of overflow records to manage insertions and deletions.
Deletion: Handled using deletion markers.
Summary
Primary Index: Efficient for ordered files with unique key fields,
enhancing search performance but complex to manage for insertions
and deletions.
Clustering Index: Used for non-key ordered fields with potential for
multiple records sharing the same field value.
Secondary Index: Flexible access method for non-ordering fields,
supporting multiple indexes for varied query optimization.
Understanding the different types of single-level ordered indexes and their
applications helps in designing efficient database systems, balancing
search performance with manageability.
17.2 Multilevel Indexes
Purpose: Reduce search space in an index more efficiently than a
binary search.
Concept:
Fan-out (fo): The blocking factor for the index, reducing the search
space faster than binary search (n-ways instead of 2-ways).
Example: With a 4,096-byte block size and entries (4-byte block
pointer + 9-byte key), the fan-out can be up to 315.
Structure:
First Level (Base Level): Index file with distinct values.
Second Level: Primary index for the first level, using block anchors
(one entry per block of the first level).
Subsequent Levels: Each level indexes the previous one, reducing
entries by a factor of fo.
Calculation:
Number of Blocks per Level:
First Level: \( r_1 \) entries.
Second Level: \( r_2 = \lceil \frac{r_1}{fo} \rceil \).
Continue until the entries fit in a single block.
Number of Levels (t): \( t = \lceil \log_{fo} (r_1) \rceil \).
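A quick check of the fan-out and level-count formulas, using the assumed values from the example above (4,096-byte blocks, 9-byte keys, 4-byte pointers) and an assumed r1 of 30,000 first-level entries:

```python
# Worked check of fo = floor(B / entry size) and t = ceil(log_fo(r1)).
import math

B, key_size, ptr_size = 4096, 9, 4
fo = B // (key_size + ptr_size)          # fan-out = 315
r1 = 30000                               # first-level index entries (assumed)

levels = 1
blocks = math.ceil(r1 / fo)              # blocks at level 1 = entries at level 2
while blocks > 1:                        # another index level is needed
    levels += 1
    blocks = math.ceil(blocks / fo)

print(fo, levels, math.ceil(math.log(r1, fo)))   # -> 315 2 2
```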
Advantages:
Fewer block accesses compared to binary search.
Example: Searching a multilevel index might require 4 block
accesses compared to 12 in a single-level binary search.
File Organization:
Common in business data processing (e.g., IBM's ISAM).
Indexed sequential file: ordered file with multilevel primary index.
Search Algorithm:
Algorithm 17.1: Searches a nondense multilevel primary index.
Read the top-level block.
Iterate through index levels, updating pointer (p) to next block
based on key comparison.
Finally, read the data file block and search for the record.
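A sketch in the spirit of this search loop (not the textbook's literal Algorithm 17.1), where each index level is a list of (key, pointer) pairs and the bottom-level pointer leads to a data block:

```python
# Nondense multilevel index search: one block access per index level,
# plus one access to the data block. Structures are illustrative.
import bisect

def search_multilevel(top_block, key_value, t):
    """t = number of index levels."""
    node = top_block
    for _ in range(t):
        keys = [k for k, _ in node]
        pos = bisect.bisect_right(keys, key_value) - 1   # largest anchor key <= target
        if pos < 0:
            return None
        node = node[pos][1]                              # follow the block pointer p
    # node is now a data block: search it for the record.
    return next((rec for rec in node if rec["key"] == key_value), None)

# Two-level example: the top block points at level-1 blocks, which point at data blocks.
db = [[{"key": k} for k in (a, a + 2)] for a in (1, 5, 9, 13)]
level1 = [[(1, db[0]), (5, db[1])], [(9, db[2]), (13, db[3])]]
top = [(1, level1[0]), (9, level1[1])]
print(search_multilevel(top, 11, t=2))   # -> {'key': 11}
```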
Dynamic Multilevel Index:
To handle insertions/deletions, dynamic multilevel indexes (e.g., B-
trees, B+-trees) leave space in blocks and use specific algorithms
for growing/shrinking the index.
20.1 Introduction to Transaction Processing
20.1.1 Single-User versus Multiuser Systems
Single-User DBMS:
Only one user can use the system at a time.
Typically found in personal computer systems.
Multiuser DBMS:
Allows many users to access the system and database
concurrently.
Common in applications like airline reservations, banking, and
supermarkets.
Concurrency in Multiuser Systems:
Achieved through multiprogramming, allowing multiple processes to
execute simultaneously.
Processes are interleaved by the operating system to keep the CPU
busy.
Multiple hardware processors enable parallel processing.
Database Access in Multiuser Systems:
Users or programs constantly access and modify database items
concurrently.
20.1.2 Transactions, Database Items, Read and Write Operations, and
DBMS Buffers
Transaction:
Logical unit of database processing.
Includes read, write, insert, delete, or retrieval operations.
Database Items:
Represented as named data items.
Can be of varying granularity, from a whole disk block to a single
attribute value.
Read and Write Operations:
read_item(X): Reads database item X into a program variable (also named X, by convention).
write_item(X): Writes the value of program variable X into the database item X.
(A sketch of both operations working through a buffer appears at the end of this subsection.)
Database Buffer Management:
Buffers in main memory hold disk pages.
Buffer replacement policies determine which buffers are replaced
when new disk pages are needed.
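A toy sketch of read_item/write_item working through a buffer pool; the disk and buffer_pool dictionaries and the _block_of catalog lookup are illustrative assumptions, not a real DBMS API.

```python
# read_item/write_item against an in-memory "disk" and buffer pool (sketch only).

disk = {0: {"X": 100, "Y": 50}}           # block 0 holds items X and Y
buffer_pool = {}                          # block id -> in-memory copy of the block

def _block_of(item):                      # assumed catalog lookup: item -> block id
    return 0

def read_item(name, program_vars):
    block_id = _block_of(name)
    if block_id not in buffer_pool:                      # bring the block into a buffer
        buffer_pool[block_id] = dict(disk[block_id])
    program_vars[name] = buffer_pool[block_id][name]     # copy item into program variable

def write_item(name, program_vars):
    block_id = _block_of(name)
    if block_id not in buffer_pool:
        buffer_pool[block_id] = dict(disk[block_id])
    buffer_pool[block_id][name] = program_vars[name]     # update the buffer copy
    disk[block_id] = dict(buffer_pool[block_id])         # (eventually) write block to disk

pvars = {}
read_item("X", pvars); pvars["X"] -= 10; write_item("X", pvars)
print(disk[0]["X"])   # -> 90
```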
20.1.3 Why Concurrency Control Is Needed
Concurrency Problems:
Lost Update: Interleaved operations overwrite changes made by
another transaction.
Temporary Update (Dirty Read): Transaction reads uncommitted
changes made by another transaction.
Incorrect Summary: Aggregate functions may calculate incorrect
values due to interleaved updates.
Unrepeatable Read: A transaction reads the same item twice and
receives different values due to updates by another transaction.
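The lost-update problem can be reproduced in a few lines by hand-interleaving two transactions on a shared item X (the values used are assumed for illustration):

```python
# Lost update: T1 and T2 both read X before either writes, so T1's update is lost.
X = 100                      # shared database item

# A serial execution would give 100 - 10 + 20 = 110.
t1_local = X                 # T1: read_item(X)
t2_local = X                 # T2: read_item(X)  (interleaved before T1 writes)
t1_local -= 10               # T1: X := X - 10
X = t1_local                 # T1: write_item(X) -> 90
t2_local += 20               # T2: X := X + 20, based on the stale value 100
X = t2_local                 # T2: write_item(X) -> 120; T1's update is overwritten
print(X)                     # -> 120, not the correct 110
```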
20.1.4 Why Recovery Is Needed
Types of Failures:
Transaction Failures: Computer/system errors, transaction errors,
concurrency control enforcement.
Disk Failures: Read/write malfunctions, head crashes.
Physical Problems and Catastrophes: Power failure, fire, theft,
sabotage.
Importance of Recovery:
System must recover from failures to ensure database integrity and
consistency.
Recovery techniques are crucial for maintaining database reliability.
20.2 Transaction and System Concepts
20.2.1 Transaction States and Additional Operations
Transaction: An atomic unit of work, either fully completed or not done
at all.
Key Operations:
BEGIN_TRANSACTION: Marks the start of a transaction.
READ/WRITE: Operations on database items within the transaction.
END_TRANSACTION: Marks the end of transaction operations;
checks for commit or abort.
COMMIT_TRANSACTION: Successfully ends the transaction,
making changes permanent.
ROLLBACK/ABORT: Ends the transaction unsuccessfully, undoing
any changes.
Transaction States:
Active: Transaction starts execution and performs READ/WRITE
operations.
Partially Committed: Transaction has ended, but final checks (e.g.,
for serializability) are pending.
Committed: Transaction successfully completes, and changes are
made permanent.
Failed: Transaction encounters an error and must be rolled back.
Terminated: Transaction finishes, and system information is
removed. Failed transactions may restart.
State Transition Diagram:
States: Active, Partially Committed, Committed, Failed, Terminated.
Transitions: Begin, End, Commit, Abort, Read/Write operations.
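One way to picture the diagram is as a table of legal (state, operation) → next-state moves; the dictionary encoding below is an illustrative sketch using the state and operation names listed above, not part of any DBMS.

```python
# Transaction state transitions as a lookup table (sketch).
TRANSITIONS = {
    ("active", "read/write"): "active",
    ("active", "end"): "partially committed",
    ("active", "abort"): "failed",
    ("partially committed", "commit"): "committed",
    ("partially committed", "abort"): "failed",
    ("committed", "terminate"): "terminated",
    ("failed", "terminate"): "terminated",
}

def next_state(state, op):
    try:
        return TRANSITIONS[(state, op)]
    except KeyError:
        raise ValueError(f"illegal operation {op!r} in state {state!r}")

s = "active"
for op in ("read/write", "end", "commit", "terminate"):
    s = next_state(s, op)
print(s)   # -> terminated
```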
20.2.2 The System Log
Purpose: Track transaction operations and database changes for
recovery.
Log Characteristics:
Sequential, append-only file on disk.
Log buffers in main memory hold recent entries, written to disk
when filled or needed.
Periodic backups to archival storage.
Log Records:
[start_transaction, T]: Marks the start of transaction T.
[write_item, T, X, old_value, new_value]: Records change of item X
by transaction T.
[read_item, T, X]: Records read of item X by transaction T.
[commit, T]: Marks the successful completion of transaction T.
[abort, T]: Marks the abortion of transaction T.
Recovery:
Undo changes by tracing backward through log records.
Redo operations if a transaction’s updates are recorded but not yet
applied to the database.
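A minimal sketch of an append-only log and UNDO by a backward scan, using the record formats listed above; the in-memory log and db structures are illustrative assumptions.

```python
# Append-only log with [write_item, T, X, old_value, new_value] records,
# and UNDO of an uncommitted transaction by tracing the log backward.

log = []
db = {"X": 100, "Y": 50}

def write_item(T, name, new_value):
    log.append(["write_item", T, name, db[name], new_value])   # record old and new value
    db[name] = new_value

def undo(T):
    """Roll back transaction T by restoring old values, newest entry first."""
    for entry in reversed(log):
        if entry[0] == "write_item" and entry[1] == T:
            _, _, name, old_value, _ = entry
            db[name] = old_value
    log.append(["abort", T])

log.append(["start_transaction", "T1"])
write_item("T1", "X", 80)
write_item("T1", "Y", 70)
undo("T1")                 # T1 never committed, so its effects are undone
print(db)                  # -> {'X': 100, 'Y': 50}
```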
20.2.3 Commit Point of a Transaction
Commit Point: Transaction reaches this point when all operations are
successfully executed, and effects are logged.
Commit Process:
Write a commit record [commit, T] to the log.
Force-write log buffer to disk to ensure all log entries are saved.
Ensure transaction effects are permanently recorded in the
database.
Recovery from Failure:
Transactions without a commit record are rolled back.
Transactions with a commit record have their changes redone if
necessary.
20.2.4 DBMS-Specific Buffer Replacement Policies
Purpose: Efficiently manage disk pages in main memory buffers.
Policies:
Domain Separation (DS): Separate buffers for different disk page
types (index, data, log). Basic LRU used within domains.
Hot Set Method: Keeps frequently accessed pages in memory
during specific operations (e.g., nested-loop joins).
DBMIN Method: Uses Query Locality Set Model (QLSM) to predict
page references and allocate buffers accordingly.
Implementation:
DS achieves better performance on average, but its domain allocation is static.
Variations such as group LRU (GLRU) add dynamic load balancing.
DBMIN allocates buffers based on the estimated needs for each file
instance in a query.
20.3 Desirable Properties of Transactions
In database management, transactions should possess several properties
collectively known as the ACID properties. These properties ensure that
transactions are processed reliably and that the integrity of the database is
maintained. The ACID properties are:
1. Atomicity:
A transaction is an indivisible unit of processing. It should be
executed in its entirety or not executed at all.
Responsibility: Transaction recovery subsystem of a DBMS.
If a transaction fails (e.g., due to a system crash), the recovery
technique must undo any effects of the transaction on the database.
Conversely, write operations of a committed transaction must
eventually be written to disk.
2. Consistency Preservation:
A transaction should transform the database from one consistent
state to another, ensuring that integrity constraints are not violated.
Responsibility: Database programmers and the DBMS module
enforcing integrity constraints.
Definition: A consistent database state satisfies all specified
constraints and conditions. If the database is consistent before a
transaction starts, it should remain consistent after the transaction
completes, provided no interference from other transactions occurs.
3. Isolation:
Transactions should appear to execute in isolation from each other,
even though they may be executing concurrently.
Responsibility: Concurrency control subsystem of the DBMS.
Explanation: Isolation ensures that the execution of a transaction is
not affected by other concurrently executing transactions. One
method to enforce isolation is to prevent a transaction's updates
from being visible to others until it is committed, solving the
temporary update problem and eliminating cascading rollbacks.
4. Durability (Permanency):
Changes applied to the database by a committed transaction must
persist, meaning they must not be lost due to any failure.
Responsibility: Recovery subsystem of the DBMS.
Explanation: Durability ensures that once a transaction commits, its
changes are permanent and will survive subsequent system failures.
Levels of Isolation
Isolation levels define the degree to which a transaction must be isolated
from the data modifications made by other transactions. The isolation
levels, from least isolated to most isolated, include:
Level 0 Isolation:
The transaction does not overwrite dirty reads of higher-level
transactions.
Level 1 Isolation:
Prevents lost updates.
Level 2 Isolation:
Prevents lost updates and dirty reads.
Level 3 Isolation (True Isolation):
Ensures repeatable reads in addition to preventing lost updates and
dirty reads. This means that if a transaction reads a data item, it will
read the same value in subsequent reads within the same
transaction, ensuring that no other transaction can modify the data
item during this period.
Snapshot Isolation:
Based on the idea of providing each transaction with a "snapshot"
of the database at a particular point in time, ensuring that it
operates on a consistent set of data. This approach is discussed in
detail in Section 20.6 and again in Chapter 21, Section 21.4.
Summary
The ACID properties—Atomicity, Consistency, Isolation, and Durability—are
critical for ensuring the reliability and integrity of database transactions. By
enforcing these properties, the DBMS can handle concurrent transactions,
maintain consistency, and recover from failures while ensuring that the
database remains in a valid state. Different levels of isolation offer varying
degrees of transaction isolation, balancing between performance and the
strictness of isolation.