Module - IV
• RAID is a set of physical disk drives viewed by the operating system as a single logical drive.
• Data are stored across multiple physical disk drives of an array for improved reliability.
• Redundant disk capacity is used to store parity information, which guarantees data recover-
ability in case of a disk failure.
• With multiple disks, data are distributed across multiple disks using a scheme known as data
striping for an improved data transfer rate.
RAID achieves improved reliability by introducing redundancy; that is, by storing extra
information that can be used in the event of a disk failure to rebuild the lost information.
The simplest (but most expensive) approach to introducing redundancy is to duplicate every
disk. This technique is called mirroring (or, sometimes, shadowing).
A logical disk then consists of two physical disks, and every write is carried out on both disks. If
one of the disks fails, the data can be read from the other. Data will be lost only if the second disk
fails before the first failed disk is repaired.
RAID also benefits from improved performance via parallelism. With disk mirroring, the rate
at which read requests can be handled is doubled, since read requests can be sent to either disk.
With multiple disks, we can improve the transfer rate as well (or instead) by striping data across
multiple disks. In its simplest form, data striping consists of splitting the bits of each byte across
multiple disks; such striping is called bit-level striping. That is, if there is an array of eight
disks, we write bit i of each byte to disk i. Block-level striping stripes blocks across multiple
disks. It treats the array of disks as a single large disk, and it gives blocks logical numbers. With an
array of n disks, block-level striping assigns logical block i of the disk array to disk (i mod n) + 1;
it uses the ⌊i/n⌋th physical block of the disk to store logical block i.
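The block-level striping mapping above can be sketched directly; the function name is illustrative, the formula is exactly the one just stated:

```python
def block_striping_location(i, n):
    """Map logical block i of an n-disk array under block-level striping.

    Returns (disk, physical_block), with disks numbered 1..n as in the text:
    disk = (i mod n) + 1, physical block = floor(i / n).
    """
    return (i % n) + 1, i // n
```

For example, with n = 4 disks, logical block 5 lands on disk 2 as physical block 1.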
RAID Levels
The RAID levels are described as follows:
RAID Level 0: RAID level 0 refers to disk arrays with striping at the level of blocks, but without
any redundancy (such as parity bits). Figure 21(a) shows an array of size 4.
Figure 21: RAID levels. Here P indicates error-correcting bits and C indicates a second copy of
the data.
RAID Level 1: RAID level 1 refers to disk mirroring with block striping. Figure 21(b) shows a
mirrored organization that holds four disks’ worth of data.
RAID Level 2: RAID level 2 is also known as memory-style error-correcting code (ECC)
organization. Each byte in a memory system may have a parity bit associated with it that
records whether the number of bits in the byte set to 1 is even (parity=0) or odd (parity=1).
The idea of ECC can be used directly in disk arrays via striping of bytes across disks. If
one of the bits in the byte gets damaged (either a 1 becomes a 0, or a 0 becomes a 1), the
parity of the byte changes and thus will not match the stored parity. Similarly, if the stored
parity bit gets damaged, it will not match the computed parity. Thus, all 1-bit errors will be
detected by the memory system.
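The parity computation described here can be sketched in a few lines (the helper name is illustrative):

```python
def even_parity_bit(byte):
    """Even-parity bit for one byte: 0 if the number of 1 bits is even,
    1 if it is odd, so that byte plus parity always has an even 1-count."""
    return bin(byte).count("1") % 2
```

If any single bit of the byte (or the stored parity bit itself) flips, the recomputed parity no longer matches the stored one, which is exactly the 1-bit error detection described above.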
RAID Level 4: RAID level 4 or block-interleaved parity organization uses block-level striping,
as in RAID 0, and in addition keeps a parity block on a separate disk for corresponding
blocks from N other disks. This scheme is shown in Figure 21(e). If one of the disks fails,
the parity block can be used with the corresponding blocks from the other disks to restore
the blocks of the failed disk.
A block read accesses only one disk, allowing other requests to be processed by the other
disks. Thus, the data-transfer rate for each access is slower, but multiple read accesses can
proceed in parallel, leading to a higher overall I/O rate. Small independent writes, however,
cannot proceed in parallel, since every write must also update the single parity disk.
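The parity block in RAID 4 is the bitwise XOR of the corresponding data blocks, which is what makes single-disk recovery possible: XOR-ing the parity block with the surviving blocks regenerates the missing one. A minimal sketch (function names are illustrative):

```python
from functools import reduce

def parity_block(blocks):
    # Byte-wise XOR across the corresponding blocks of the data disks.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def recover_block(surviving_blocks, parity):
    # XOR of the parity block with the surviving data blocks
    # yields the block of the failed disk.
    return parity_block(surviving_blocks + [parity])
```

This works because XOR is associative and self-inverse: x ^ x = 0, so XOR-ing all blocks plus the parity cancels everything except the missing block.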
RAID level 5: RAID level 5 or block-interleaved distributed parity is similar to level 4, but
level 5 spreads data and parity among all N + 1 disks, rather than storing data in N disks
and parity in one disk. For each block, one of the disks stores the parity, and the others store
data. By spreading the parity across all the disks in the set, RAID 5 avoids the potential
overuse of a single parity disk that can occur with RAID 4.
In level 5, all disks can participate in satisfying read requests, unlike RAID level 4, where the
parity disk cannot participate, so level 5 increases the total number of requests that can be
met in a given amount of time.
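One common way to spread the parity is round-robin: the parity block for the nth set of N data blocks is placed on disk (n mod (N + 1)) + 1. This is one convention among several used by real RAID 5 implementations, and the function name is illustrative:

```python
def parity_disk_for_set(n, N):
    """Disk (numbered 1..N+1) holding the parity block for the nth set of
    N data blocks, under a simple round-robin placement convention."""
    return (n % (N + 1)) + 1
```

With N = 4 data disks, successive sets place their parity on disks 1, 2, 3, 4, 5, 1, ..., so no single disk becomes a parity bottleneck.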
RAID Level 6: RAID level 6 (also called the P+Q redundancy scheme) is much like RAID level
5, but stores extra redundant information to guard against multiple disk failures. Instead of
using parity, error-correcting codes such as the Reed-Solomon codes are used.
Figure 22: Two simple approaches that can be employed for deleting fixed-length records: (a) first
approach, with record 3 deleted and all records moved; (b) second approach, with record 3 deleted
and the final record moved.
Another approach is to maintain a file header at the beginning of the file and store the address
of the first deleted record in the file header. Use this first record to store the address of the second
deleted record, and so on. These stored addresses can be used as pointers, since they point to the
location of a record. The deleted records thus form a linked list, which is often referred to as a free
list (See Figure 23).
Figure 23: Third approach with free list, after deletion of records 1, 4, and 6.
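The free-list technique above can be modeled in a short sketch; the class and field names are illustrative, not from any real system:

```python
class FixedLengthFile:
    """Toy in-memory model of the free-list approach: deleted slots form a
    linked list threaded through the file, headed by a pointer in the header."""

    def __init__(self, records):
        self.slots = list(records)   # record payloads; freed slots hold links
        self.free_head = None        # file header: first deleted slot, or None

    def delete(self, i):
        # The freed slot stores the previous head of the free list.
        self.slots[i] = ("FREE", self.free_head)
        self.free_head = i

    def insert(self, record):
        if self.free_head is None:          # no reusable slot: append
            self.slots.append(record)
            return len(self.slots) - 1
        i = self.free_head                  # reuse the first deleted slot
        self.free_head = self.slots[i][1]   # unlink it from the free list
        self.slots[i] = record
        return i
```

Insertions reuse freed slots in last-deleted-first order, and only fall back to extending the file when the free list is empty.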
Variable-Length Records
A variable-length record representation uses a different amount of storage for each record in a file.
In contrast to the fixed-length representation, it is more space-efficient. Variable-length records
arise in
database systems in several ways:
• Storage of multiple record types in a file.
• Record types that allow variable lengths for one or more fields.
3. An array whose entries contain the location and size of each record.
The actual records are allocated contiguously in the block, starting from the end of the block.
The free space in the block is contiguous, between the final entry in the header array, and the
first record. If a record is inserted, space is allocated for it at the end of free space, and an entry
containing its size and location is added to the header.
If a record is deleted, the space that it occupies is freed, its header entry is marked deleted, and
the records in the block before it are moved so that the free space remains contiguous. The
end-of-free-space pointer in the header is updated as well.
The slotted-page structure requires that there be no pointers that point directly to records.
Instead, pointers must point to the entry in the header that contains the actual location of the
record. This level of indirection allows records to be moved to prevent fragmentation of space inside
a block, while supporting indirect pointers to the record.
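The slotted-page mechanics above can be sketched as follows. This is a deliberately simplified in-memory model (real systems pack the header and records into one fixed-size disk block, and the names here are illustrative):

```python
class SlottedPage:
    """Minimal slotted-page sketch: a header array of (offset, size) entries
    grows from the front, records are allocated from the back, and deletion
    compacts the records so the free space in the middle stays contiguous."""

    def __init__(self, size=4096):
        self.data = bytearray(size)
        self.slots = []              # (offset, size); size == -1 marks deleted
        self.free_end = size         # records grow downward from here

    def insert(self, record):
        self.free_end -= len(record)
        self.data[self.free_end:self.free_end + len(record)] = record
        self.slots.append((self.free_end, len(record)))
        return len(self.slots) - 1   # the slot number is the stable reference

    def get(self, slot):
        off, size = self.slots[slot]
        return bytes(self.data[off:off + size])

    def delete(self, slot):
        off, size = self.slots[slot]
        # Slide the records stored below the deleted one up by `size` bytes,
        # keeping the free space contiguous.
        self.data[self.free_end + size:off + size] = self.data[self.free_end:off]
        self.slots[slot] = (0, -1)   # mark the header entry deleted
        self.slots = [(o + size, s) if s != -1 and o < off else (o, s)
                      for (o, s) in self.slots]
        self.free_end += size
```

Note that callers hold only slot numbers, never byte offsets: records can move during compaction while every slot number stays valid, which is the level of indirection described above.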
1. Locate the record in the file that comes before the record to be inserted in search-key order.
2. If there is a free record (that is, space left after a deletion) within the same block as this
record, insert the new record there. Otherwise, insert the new record in an overflow block.
In either case, adjust the pointers so as to chain together the records in search-key order.
Indexes can be classified along several dimensions: primary, clustering, or secondary; dense or
sparse; and, by implementation, B-tree or B+-tree.
Primary Index
A primary index is an ordered file whose records are of fixed length with two fields, and it acts
like an access structure to efficiently search for and access data records in a data file. The first
field is of the same data type as the ordering key field - called the primary key - of the data file,
and the second field is a pointer to a disk block. There is one index entry in the index file for each
block in the data file. Each index entry has the value of primary key field for the first record in a
block and a pointer to that block as its two field values. The total number of entries in the index
is the same as the number of disk blocks in the ordered data file.
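Since the primary index holds one entry per block, keyed by the block's first record, a lookup is a binary search for the largest anchor key not exceeding the search key. A minimal sketch (the index shape follows the description above; the function name is illustrative):

```python
import bisect

def primary_index_lookup(index, key):
    """`index` is a sorted list of (first_key_in_block, block_pointer) pairs,
    one per data-file block. Returns the pointer to the block that may
    contain `key`, or None if key precedes every anchor record."""
    keys = [k for k, _ in index]
    pos = bisect.bisect_right(keys, key) - 1
    if pos < 0:
        return None
    return index[pos][1]
```

For instance, with anchors 1, 10, 20 a search for key 15 returns the block whose first record has key 10; the block must then be scanned to see whether key 15 is actually present.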
Clustering Indexes
If file records are physically ordered on a non-key field — which does not have a distinct value
for each record — that field is called the clustering field and the data file is called a clustered
file. In this case, a different type of index, called a clustering index, ca be created to speed up
retrieval of all the records that have the same value for the clustering field. This differs from a
primary index, which requires that the ordering field of the data file have a distinct value for each
record.
A clustering index is also an ordered file with two fields; the first field is of the same type as
the clustering field of the data file, and the second field is a disk block pointer. There is one entry
in the clustering index for each distinct value of the clustering field, and it contains the value and
a pointer to the first block in the data file that has a record with that value for its clustering field.
An example clustered index file is shown in Figure 28.
Record insertion and deletion cause problems because the data records are physically ordered.
To alleviate the problem of insertion, it is common to reserve a whole block (or a cluster of con-
tiguous blocks) for each value of the clustering field; all records with that value are placed in the
block (or block cluster). This makes insertion and deletion relatively straightforward.
Secondary Indexes
A secondary index provides a secondary means of accessing a data file for which some primary
access already exists. The data file records could be ordered, unordered, or hashed. The secondary
index may be created on a field that is a candidate key and has a unique value in every record, or
on a non-key field with duplicate values. The index is again an ordered file with two fields. The
first field is of the same data type as some nonordering field of the data file that is an indexing
field. The second field is either a block pointer or a record pointer.
A secondary index on a candidate key looks just like a dense clustering index, except that the
records pointed to by successive values in the index are not stored sequentially. In contrast, if the
search key of a secondary index is not a candidate key, it is not enough to point to just the first
record with each search-key value. The remaining records with the same search-key value could be
anywhere in the file, since the records are ordered by the search key of the clustering index, rather
than by the search key of the secondary index. Therefore, a secondary index must contain pointers
to all the records. One such representation is shown in Figure 29.
An extra level of indirection is needed to implement secondary indices on search keys that are
not candidate keys. The pointers in such a secondary index do not point directly to the file.
Instead, each points to a bucket that contains pointers to the file. Secondary indices improve the
performance of queries that use keys other than the search key of the clustering index. However,
they impose a significant overhead on modification of the database.
Multi-level indexes
If an index is small enough to be kept entirely in main memory, the search time to find an entry is
low. However, if the index is so large that not all of it can be kept in memory, index blocks must
be fetched from disk when required. The search for an entry in the index then requires several
disk-block reads. In such a case, it is always desirable to create yet another level of index. Indeed,
this process can be repeated as many times as necessary. Indices with two or more levels are called
multilevel indices.
To create a multilevel index, treat the original index just as any other sequential file, and
construct a sparse outer index on the original index, now called the inner index, as shown in Figure
30. Note that the index entries are always in sorted order, allowing the outer index to be sparse.
To locate a record, first use a binary search on the outer index to find the entry for the largest
search-key value less than or equal to the one being sought. The pointer in that entry points to a
block of the inner index. Scan this block to find the entry with the largest search-key value less
than or equal to the desired value. The pointer in this entry points to the block of the file that
contains the record being sought.
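The two-level lookup just described can be sketched as follows. Both levels are plain sorted lists here; the container names are illustrative, and the sketch assumes the search key is at least the first key of the outer index:

```python
import bisect

def two_level_lookup(outer, inner_blocks, key):
    """`outer` is a sparse outer index of (first_key, inner_block_number)
    pairs; each entry of `inner_blocks` is a sorted inner-index block of
    (key, data_block_pointer) pairs. Returns the data-block pointer."""
    okeys = [k for k, _ in outer]
    blk = outer[bisect.bisect_right(okeys, key) - 1][1]   # outer search
    inner = inner_blocks[blk]
    ikeys = [k for k, _ in inner]
    return inner[bisect.bisect_right(ikeys, key) - 1][1]  # scan inner block
```

Only one outer-index probe and one inner-index block are touched per lookup, which is why adding levels keeps the number of disk-block reads small even for large indexes.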
Dense index: In a dense index, an index entry appears for every search-key value in the file. In
a dense clustering index, the index record contains the search-key value and a pointer to the
first data record with that search-key value. The rest of the records with the same search-
key value would be stored sequentially after the first record, since, because the index is a
clustering one, records are sorted on the same search key. In a dense nonclustering index, the
index must store a list of pointers to all records with the same search-key value.
Sparse index: In a sparse index, an index entry appears for only some of the search-key values.
Sparse indices can be used only if the relation is stored in sorted order of the search key.
Each index entry contains a search-key value and a pointer to the first data record with that
search-key value. To locate a record, an index entry with the largest search-key value that is
less than or equal to the search-key value must be determined first. Then, start at the record
pointed to by that index entry, and follow the pointers in the file until the desired record is
found.
Structure of a B+-Tree
A B+ -tree index is a multilevel index, but it has a structure that differs from that of the multilevel
index-sequential file. Figure 32 shows a typical node of a B+ -tree. It contains up to n−1 search-key
values K1 , K2 , ..., Kn−1 , and n pointers P1 , P2 , ..., Pn . The search-key values within a node are kept
in sorted order; thus, if i < j, then Ki < Kj .
In a B+ -tree, data pointers are stored only at the leaf nodes of the tree; hence, the structure
of leaf nodes differs from the structure of internal nodes. In a leaf node, a pointer Pi , for i =
1, 2, ..., n − 1, points to a file record with search-key value Ki . Since there is a linear order on the
leaves based on the search-key values that they contain, the pointer Pn is used to chain together
the leaf nodes in search-key order. This ordering allows for efficient sequential processing of the
file. This is shown in Figure 34.
Each leaf node can hold up to n − 1 values; we require each leaf to contain at least
⌈(n − 1)/2⌉ values. The ranges of values in each leaf do not overlap, except if there are duplicate
search-key values, in which case a value may be present in more than one leaf. Specifically, if Li
and Lj are leaf nodes and i < j, then every search-key value in Li is less than or equal to every
search-key value in Lj . If the B+ -tree index is used as a dense index every search-key value must
appear in some leaf node. A typical leaf node is shown in Figure 33.
The nonleaf nodes of the B+ -tree form a multilevel (sparse) index on the leaf nodes. The
structure of nonleaf nodes is the same as that for leaf nodes, except that all pointers are pointers
to tree nodes. A nonleaf node may hold up to n pointers, and must hold at least ⌈n/2⌉ pointers.
Nonleaf nodes are also referred to as internal nodes. Unlike other nonleaf nodes, the root node
can hold fewer than ⌈n/2⌉ pointers; however, it must hold at least two pointers, unless the tree
consists of only one node.
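The occupancy rules above can be summarized in one small helper (the function and key names are illustrative):

```python
import math

def bptree_occupancy(n):
    """(min, max) occupancy bounds for a B+-tree of fanout n, as stated
    above: leaves count search-key values, the others count pointers."""
    return {
        "leaf_values":   (math.ceil((n - 1) / 2), n - 1),
        "internal_ptrs": (math.ceil(n / 2), n),
        "root_ptrs":     (2, n),   # unless the tree is a single node
    }
```

For example, with fanout n = 4, leaves hold 2 to 3 values and internal nodes hold 2 to 4 pointers, so every node except possibly the root is always at least half full.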
All B+ -trees are balanced. That is, the length of every path from the root to a leaf node
is the same. This property is a requirement for a B+-tree. Indeed, the “B” in B+-tree stands
for “balanced.” It is the balance property of B+ -trees that ensures good performance for lookup,
insertion, and deletion.
The number of nodes accessed in a lookup in a B-tree depends on where the search key is
located. A lookup on a B+ -tree requires traversal of a path from the root of the tree to some leaf
node. In contrast, it is sometimes possible to find the desired value in a B-tree before reaching a
leaf node.
Static Hashing
In static hashing, when a search-key value is provided, the hash function always computes the
same address. For example, if the hash function i mod 4 is used, it generates only four values,
0 through 3. The output address is always the same for a given key, and the number of buckets
remains the same at all times.
Bucket overflow occurs when a bucket does not have enough space for a record that hashes
to it. This is a serious problem for any static hash function. Bucket overflow can occur because of:

• Insufficient buckets

• Skew: some buckets are assigned far more records than others
In static hashing bucket overflow is handled using a linked list called overflow chaining. In
overflow chaining, when buckets are full, a new bucket is allocated for the same hash result and
is linked after the previous one. This mechanism is called closed hashing. In an alternative
approach, called open hashing, the set of buckets is fixed, and there are no overflow chains.
Instead, if a bucket is full, the system inserts records in some other bucket. One policy is to use
the next bucket (in cyclic order) that has space; this policy is called linear probing. Other policies,
such as computing further hash functions, are also used.
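The linear-probing policy just described can be sketched with a toy bucket table (the function name and the fixed hash h(k) = k mod len(buckets) are illustrative):

```python
def linear_probe_insert(buckets, key, capacity=1):
    """Open hashing with linear probing: if bucket h(key) is full, try the
    next bucket in cyclic order. `buckets` is a list of lists."""
    n = len(buckets)
    start = key % n
    for step in range(n):
        idx = (start + step) % n
        if len(buckets[idx]) < capacity:
            buckets[idx].append(key)
            return idx
    raise RuntimeError("hash table full")
```

Note that this policy complicates deletion in a real system (a later probe sequence may depend on the displaced record), which is one reason databases generally prefer overflow chaining.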
An important drawback of static hashing is that the hash function is chosen when the system is
implemented and cannot be changed thereafter. Since the function h maps search-key values to a
fixed set B of bucket addresses, space is wasted if B is made too large; at the same time, if B is
too small, bucket overflows occur. As the file grows, performance suffers.
Dynamic Hashing
The static hashing technique fails for databases that grow or shrink in size over time. Dynamic
hashing techniques therefore allow the hash function to be modified dynamically to accommodate
the growth or shrinkage of the database. One such technique is known as extendable
hashing.
With extendable hashing, the hash function generates values over a relatively large range—namely,
b-bit binary integers. A typical value for b is 32. This scheme does not create a bucket for each
hash value. Instead, it uses only the first i bits of the hash value, where 0 ≤ i ≤ b, as an offset
into a table of bucket addresses. The value of i grows and shrinks with the size of the database.
Figure 37 shows a general extendable hash
structure.
The i appearing above the bucket address table in Figure 37 indicates that i bits of the
hash value h(K) are required to determine the correct bucket for K. This number changes as the
file grows. Multiple entries in the bucket address table may point to the same bucket. Each bucket
j stores a value ij ; all the entries that point to the same bucket have the same values for the first
ij bits. The number of bucket-address-table entries that point to bucket j is 2^(i−ij ) .
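The two quantities just described can be computed directly; the helper names are illustrative:

```python
def bucket_table_index(hash_value, i, b=32):
    """First i bits of a b-bit hash value: the offset into the bucket
    address table when the global depth is i."""
    return hash_value >> (b - i)

def entries_pointing_to_bucket(i, i_j):
    """Number of bucket-address-table entries sharing a bucket whose
    local depth is i_j, namely 2^(i - i_j)."""
    return 2 ** (i - i_j)
```

When a bucket with local depth i_j = i overflows, the global depth i must grow (doubling the table) before the bucket can be split; if i_j < i, the bucket can be split without touching the table size, since more than one table entry already points to it.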