Chapter 12: Indexing and Hashing
Organization of Records in Files
Several of the possible ways of organizing records in files are:
Heap file organization.
Any record can be placed anywhere in the file where there is space for the record. There is no ordering of records. Typically, there is a single file for each relation.
Organization of Records in Files (Cont.)
Sequential file organization.
Records are stored in sequential order, according to the value of a search key of each record.
Hashing file organization.
A hash function is computed on some attribute of each record. The result of the hash function specifies in which block of the file the record should be placed.
Sequential File Organization
A sequential file is designed for efficient processing of records in sorted order based on some search key.
A search key is any attribute or set of attributes; it need not be the primary key, or even a superkey.
Clustering File Organization
Relational-database systems store each relation in a separate file.
A clustering file organization is a file organization that stores related records of two or more relations in each block.
Such a file organization allows us to read records that satisfy the join condition using one block read.
Data-Dictionary Storage
A relational-database system needs to maintain data about the
relations, such as the schema of the relations. This information is called the data dictionary, or system catalog.
The data dictionary contains:
Names of the relations
Names of the attributes of each relation
Domains and lengths of attributes
Integrity constraints (for example, key constraints)
In addition, many systems keep the following data on users of the system:
Names of authorized users
Accounting information about users
Passwords or other information used to authenticate users
Chapter 12: Indexing and Hashing
Basic Concepts
Ordered Indices
Multiple-Key Access
Static Hashing
Dynamic Hashing
Comparison of Ordered Indexing and Hashing
Basic Concepts
Indexing mechanisms are used to speed up access to desired data.
E.g., author catalog in a library
Search key: attribute or set of attributes used to look up records in a file.
An index file consists of records (called index entries) of the form
search-key | pointer
Index files are typically much smaller than the original file.
Two basic kinds of indices:
Ordered indices: search keys are stored in sorted order.
Hash indices: search keys are distributed uniformly across buckets; a hash function maps each search-key value to its bucket.
Index Evaluation Metrics
Each technique must be evaluated on the basis of these factors:
Access type: finding records with a specified attribute value, or finding records whose attribute values fall in a specified range.
Access time: the time it takes to find a particular data item.
Insertion time: the time it takes to insert a new data item, including the time to find the place to insert and the time to update the index structure.
Deletion time: the time it takes to delete a data item, including the time to find the item and the time to update the index structure.
Space overhead: the additional space occupied by an index structure.
Ordered Indices
In an ordered index, index entries are stored sorted on the
search key value. E.g., author catalog in library.
Primary index: in a sequentially ordered file, the index whose
search key specifies the sequential order of the file.
Also called clustering index
The search key of a primary index is usually but not necessarily the primary key.
Secondary index: an index whose search key specifies an order
different from the sequential order of the file. Also called non-clustering index.
Index-sequential file: ordered sequential file with a primary index.
Dense and Sparse Index
Dense index:
An index record appears for every search-key value in the file. The index record contains the search-key value and a pointer to the first data record with that search-key value.
Sparse index:
An index record is created only for some search-key values. Each index record contains a search-key value and a pointer to the first data record that contains that value.
Dense Index Files
Dense index: an index record appears for every search-key value in the file.
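As a minimal sketch (with hypothetical data, not any particular DBMS), a dense index on a sequential file can be modeled as a mapping from each search-key value to the position of the first record with that value:

```python
# Sequential file of (branch_name, account_no) records, sorted on branch_name.
records = [("Brighton", 217), ("Downtown", 101), ("Downtown", 110),
           ("Mianus", 215), ("Perryridge", 102)]

# Dense index: one entry per distinct search-key value,
# pointing to the FIRST record with that value.
dense_index = {}
for pos, (branch, _) in enumerate(records):
    dense_index.setdefault(branch, pos)

def lookup(key):
    """Return all records with the given search-key value."""
    start = dense_index.get(key)
    if start is None:
        return []
    out = []
    for rec in records[start:]:     # records are stored sequentially,
        if rec[0] != key:           # so matching records are contiguous
            break
        out.append(rec)
    return out
```

Following the pointer and scanning forward retrieves every record with the search-key value, since equal keys are contiguous in the sequential file.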
Sparse Index Files
Sparse index: contains index records for only some search-key values.
Applicable when records are sequentially ordered on the search key.
To locate a record with search-key value K we:
Find the index record with the largest search-key value ≤ K
Search the file sequentially starting at the record to which the index record points
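The two lookup steps can be sketched as follows, assuming the data file is sorted on the search key and the index holds one entry per block (the block contents here are hypothetical):

```python
import bisect

# Blocks of a sequential file, sorted on the search key.
BLOCKS = [["Brighton", "Clearview"], ["Downtown", "Mianus"],
          ["Perryridge", "Redwood"]]
# Sparse index: the least search-key value in each block.
sparse_keys = [blk[0] for blk in BLOCKS]

def find_block(k):
    """Index entry with the largest search-key value <= k."""
    i = bisect.bisect_right(sparse_keys, k) - 1
    return max(i, 0)

def lookup(k):
    """Scan the file sequentially from the block the index points to."""
    for blk in BLOCKS[find_block(k):]:
        for key in blk:
            if key == k:
                return True
            if key > k:     # passed where k would be: not present
                return False
    return False
```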
Sparse Index Files (Cont.)
Compared to dense indices:
Less space and less maintenance overhead for insertions and deletions.
Generally slower than dense index for locating records.
Good tradeoff: sparse index with an index entry for every block in the file, corresponding to the least search-key value in the block.
Multilevel Index
If primary index does not fit in memory, access becomes
expensive.
Solution: treat primary index kept on disk as a sequential file
and construct a sparse index on it.
outer index a sparse index of primary index inner index the primary index file
If even outer index is too large to fit in main memory, yet
another level of index can be created, and so on.
Indices at all levels must be updated on insertion or deletion
from the file.
Indices themselves may become too large for efficient
processing.
Example:
Consider a file with 100,000 records and 10 records per block. With a sparse index holding one index entry per block, we have about 10,000 index entries.
Assuming 100 index entries fit into a block, we need about 100 index blocks.
It is desirable to keep the index file in main memory. Problem: searching a large index file becomes expensive.
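The example's arithmetic can be checked directly (a sketch using the numbers above):

```python
import math

n_records = 100_000
records_per_block = 10
entries_per_index_block = 100

# One sparse index entry per data block.
data_blocks = n_records // records_per_block                      # 10,000
inner_entries = data_blocks
inner_blocks = math.ceil(inner_entries / entries_per_index_block) # 100
# Outer level: a sparse index on the inner index.
outer_entries = inner_blocks
outer_blocks = math.ceil(outer_entries / entries_per_index_block) # 1
```

With one more level, the outer index fits in a single block, which is why a two-level index is enough for this file.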
Multilevel Index (Cont.)
Index Update: Record Deletion
If the deleted record was the only record in the file with its particular search-key value, the search key is deleted from the index also.
Single-level index deletion:
Dense indices: deletion of the search key is similar to file record deletion.
Sparse indices:
If the deleted key value exists in the index, the value is replaced by the next search-key value in the file (in search-key order).
If the next search-key value already has an index entry, the entry is deleted instead of being replaced.
Index Update: Record Insertion
Single-level index insertion:
Perform a lookup using the search-key value of the inserted record.
Dense indices: if the search-key value does not appear in the index, insert it.
Sparse indices: if the index stores an entry for each block of the file, no change needs to be made to the index unless a new block is created.
If a new block is created, the first search-key value appearing in the new block is inserted into the index.
Multilevel insertion (as well as deletion) algorithms are simple
extensions of the single-level algorithms
Secondary Indices Example
Secondary index on balance field of account Index record points to a bucket that contains pointers to all the
actual records with that particular search-key value.
Secondary indices have to be dense
Hashing
Static Hashing
In a hash file organization, we obtain the address of the disk block containing a desired record directly by computing a function on the search-key value of the record.
A bucket is a unit of storage containing one or more records (a
bucket is typically a disk block).
In a hash file organization we obtain the bucket of a record directly
from its search-key value using a hash function.
Hash function h is a function from the set of all search-key values K
to the set of all bucket addresses B.
Hash function is used to locate records for access, insertion as well
as deletion.
Example of Hash File Organization
Hash file organization of account file, using branch_name as key.
There are 10 buckets.
The binary representation of the ith character is assumed to be the
integer i.
The hash function returns the sum of the binary representations of
the characters modulo 10
E.g., h(Perryridge) = 5, h(Round Hill) = 3, h(Brighton) = 3
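A sketch of this hash function, reading "the binary representation of the ith character is the integer i" as: each letter contributes its position in the alphabet, and non-letters (such as the space in "Round Hill") contribute nothing:

```python
def h(branch_name, n_buckets=10):
    """Sum of alphabet positions of the letters, modulo the bucket count."""
    total = 0
    for ch in branch_name.lower():
        if ch.isalpha():
            total += ord(ch) - ord('a') + 1   # 'a' -> 1, ..., 'z' -> 26
    return total % n_buckets
```

Under this reading, h("Perryridge") = 5, h("Round Hill") = 3, and h("Brighton") = 3, matching the slide; note that Round Hill and Brighton collide in bucket 3.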
Hash Functions
Worst hash function maps all search-key values to the same bucket;
this makes access time proportional to the number of search-key values in the file.
An ideal hash function is uniform, i.e., each bucket is assigned the
same number of search-key values from the set of all possible values.
Ideal hash function is random, so each bucket will have the same
number of records assigned to it irrespective of the actual distribution of search-key values in the file.
Handling of Bucket Overflows
If the bucket does not have enough space, a bucket overflow is said
to occur.
Bucket overflow can occur because of
Insufficient buckets
Skew in the distribution of records. Some buckets are assigned more records than others, so a bucket may overflow even when other buckets still have space.
Although the probability of bucket overflow can be reduced, it
cannot be eliminated; it is handled by using overflow buckets.
Handling of Bucket Overflows (Cont.)
Overflow chaining: the overflow buckets of a given bucket are chained together in a linked list.
Above scheme is called closed hashing.
An alternative, called open hashing, which does not use overflow buckets, is not suitable for database applications.
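Overflow chaining can be sketched as follows (a toy model: each bucket is a chain of fixed-capacity pages, with the bucket capacity and count chosen small to force overflow):

```python
CAPACITY = 2    # records per bucket (tiny, to demonstrate overflow)
N_BUCKETS = 4
# Each bucket is a chain of pages; extra pages are overflow buckets.
buckets = [[[]] for _ in range(N_BUCKETS)]

def insert(key):
    chain = buckets[hash(key) % N_BUCKETS]
    if len(chain[-1]) == CAPACITY:   # last page full:
        chain.append([])             # chain a new overflow bucket
    chain[-1].append(key)

def lookup(key):
    """Search the primary bucket and all its overflow buckets."""
    chain = buckets[hash(key) % N_BUCKETS]
    return any(key in page for page in chain)
```

Inserting the integer keys 0, 4, 8, 12 (which all hash to bucket 0 here) fills the primary bucket and chains one overflow bucket behind it; lookups must follow the whole chain.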
Hash Indices
Hashing can be used not only for file organization, but also for index-
structure creation.
A hash index organizes the search keys, with their associated record
pointers, into a hash file structure.
Strictly speaking, hash indices are always secondary indices:
If the file itself is organized using hashing, a separate primary hash index on it using the same search key is unnecessary.
However, we use the term hash index to refer to both secondary index structures and hash-organized files.
Deficiencies of Static Hashing
In static hashing, function h maps search-key values to a fixed set B of bucket addresses, but databases grow or shrink with time.
If the initial number of buckets is too small and the file grows, performance will degrade due to too many overflows.
If space is allocated for anticipated growth, a significant amount of space will be wasted initially (and buckets will be underfull).
If the database shrinks, again space will be wasted.
One solution: periodic re-organization of the file with a new hash
function
Expensive, disrupts normal operations
Better solution: allow the number of buckets to be modified dynamically.
Dynamic Hashing
Good for a database that grows and shrinks in size.
Allows the hash function to be modified dynamically.
1. Choose a hash function based on the current file size. This option will result in performance degradation as the database grows.
2. Choose a hash function based on the anticipated size of the file at
some point in the future. Although performance degradation is avoided, a significant amount of space may be wasted initially.
3. Periodically reorganize the hash structure in response to file growth.
Such a reorganization involves choosing a new hash function, recomputing the hash function on every record in the file, and generating new bucket assignments.
This reorganization is a massive, time-consuming operation.