0% found this document useful (0 votes)
23 views40 pages

Adbms (Bca) 2 1744958912050

Unit II discusses data storage and querying in DBMS, focusing on file organization, RAID, indexing, binary trees, and hashing techniques. It explains various file organization methods, RAID configurations for data redundancy and performance, and how indexing optimizes data retrieval. Additionally, it covers tree structures like B-Trees and B+ Trees for efficient data management and the hashing technique for quick record location.

Uploaded by

janhavik053
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
23 views40 pages

Adbms (Bca) 2 1744958912050

Unit II discusses data storage and querying in DBMS, focusing on file organization, RAID, indexing, binary trees, and hashing techniques. It explains various file organization methods, RAID configurations for data redundancy and performance, and how indexing optimizes data retrieval. Additionally, it covers tree structures like B-Trees and B+ Trees for efficient data management and the hashing technique for quick record location.

Uploaded by

janhavik053
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Unit II : Data storage and

Querying
File organization
File organization within a DBMS delineates the logical arrangement of these
records, determining how they're mapped onto disk blocks.
Optimal file organization is help in operations like insertion, deletion, and
updating of records, ensuring efficiency and preventing redundancy.
Memory is partitioned into memory blocks, with each record being assigned a
specific "Bucket address.“
At the high-level it presents databases as tables, the underlying reality
underscores files, records, and their strategic organization.
This mapping also shows the logical structures and physical storage mediums.
Depending upon the access and selection of records, various methods of file
organization in DBMS are available.
 Sequential file organization
• The simplest of all file organization methods.
• Inefficient for large databases.
• As the name suggests, this method simply stores the records in files in sequential
order, one after another in a series like a sequence of books on a bookshelf, and
to access these files, we have to search through the whole sequence until we
reach our desired file.
RAID (Redundant Arrays of Independent Disks)
 Is a technique that makes use of a combination of multiple disks for storing the
data instead of using a single disk for increased performance, data redundancy,
or to protect data in the case of a drive failure.
 The term was defined by David Patterson, Garth A. Gibson, and Randy Katz at
the University of California, Berkeley in 1987.
 Is like having backup copies of your important files stored in different places on
several hard drives . If one drive stops working, your data is still safe because
you have other copies stored on the other drives.
 The main purpose of RAID is to improve data reliability, availability, and
performance.
How RAID works?
 RAID splits your data across multiple drives (similar to dividing the book
into pieces).
 Depending on the RAID configuration, it may also create duplicates (like
making extra copies). If one drive fails, the remaining drives can help
reconstruct the lost data.
A RAID controller
 Is like a boss for your hard drives in a big storage system.
 It works between your computer’s operating system and the actual hard
drives, organizing them into groups to make them easier to manage.
 This helps speed up how fast your computer can read and write data, and it
also adds a layer of protection in case one of your hard drives breaks down.
Types of RAID Controller
There are three types of RAID controller:
 Hardware Based: In hardware-based RAID, there’s a physical controller that
manages the whole array. This controller can handle the whole group of hard drives
together. Sometimes, this controller is built right into the computer’s main board,
making it easier to set up and manage your RAID system. It’s like having a captain
for your team of hard drives, making sure they work together smoothly.
 Software Based: In software-based RAID, the controller doesn’t have its own
special hardware. So it use computer’s main processor and memory to do its job. It
perform the same function as a hardware-based RAID controller, like managing the
hard drives and keeping your data safe.
 Firmware Based: Firmware-based RAID controllers are like helpers built into
the computer’s main board. They work with the main processor, just like software-
based RAID. But they only implement when the computer starts up. Once the
operating system is running, a special driver takes over the RAID job.
Different RAID Levels
• RAID-0 (Stripping) splitting data into smaller “blocks” and spreading them across multiple disks. This
process is called “striping.
• RAID-1 (Mirroring)
• RAID-2 (Bit-Level Stripping with Dedicated Parity)
• RAID-3 (Byte-Level Stripping with Dedicated Parity)
• RAID-4 (Block-Level Stripping with Dedicated Parity)
• RAID-5 (Block-Level Stripping with Distributed Parity)
• RAID-6 (Block-Level Stripping with two Parity Bits)
Indexing

 It is the retrieval of records from a database file quickly is known as Indexing. When a
query is processed, Indexing is used to optimize the performance of the database by
minimizing the number of disk accesses required.
 Indexing is used to locate and access the data in the database quickly.
Structure of Index
 An index comprises a small table that has two columns.

Search key Data reference

 The first column is the search key (comprises the copy of the primary key or candidate
key of a table), while the second column stores the set of pointers holding the address of
the disk block (or reference to it) where that specific key value is stored.
 An index takes a search key as input and returns a collection of matching records
Company_ID UNIT UNIT_COST
10 12 1.15
12 12 1.05
14 18 1.31
18 18 1.34
11 24 1.15
16 12 1.31
10 12 1.15
12 24 1.3
18 6 1.34
18 12 1.35
14 12 1.95
21 18 1.36
12 12 1.05
20 6 1.31
18 18 1.34
SELECT company_id, units, unit_cost FROM index_test WHERE
company_id = 18
The database would have to search through all 17 rows in the order they appear in
the table, from top to bottom, one at a time. So to search for all of the potential
instances of the company_id number 18, the database must look through the entire
table for all appearances of 18 in the company_id column.
It is more and more time consuming as the size of the table increases. As the
sophistication of the data increases, what could eventually happen is that a table
with one billion rows is joined with another table with one billion rows; the query
now has to search through twice the amount of rows costing twice the amount of
time.
Tables increase in size and searching increases in execution time.
COMPANY_ID UNIT UNIT_COST
10 12 1.15

10 12 1.15

11 24 1.15

11 24 1.15

12 12 1.05

12 24 1.3

12 12 1.05

14 18 1.31

14 12 1.95

14 24 1.05

16 12 1.31

18 18 1.34

18 6 1.34

18 12 1.35

18 18 1.34

20 6 1.31

21 18 1.36
Database indexes will also store pointers which are simply reference information for the location of
the additional information in memory. Basically the index holds the company_id and that particular
row’s home address on the memory disk. The index will actually look like this
OMPANY_ID POINTER
10 _123
10 _129
11 _127
11 _138
12 _124
12 _130
12 _135
14 _125
14 _131
14 _133
16 _128
18 _126
18 _131
18 _132
18 _137
20 _136
into the table to find the specific row where that pointer lives. The query can then go into the table to retrieve the fields for the
columns requested for the rows that meet the conditions.
• Indexing adds a data structure with columns for the search conditions and a
pointer
• The pointer is the address on the memory disk of the row with the rest of the
information
• The index data structure is sorted to optimize query efficiency
• The query looks for the specific row in the index; the index refers to the
pointer which will find the rest of the information.
• The index reduces the number of rows the query has to search through from
17 to 4.
Ordered Indices
• In an ordered index, index entries are stored sorted on the
search key value. E.g., author catalog in library.
• Primary index: in a sequentially ordered file, the index
whose search key specifies the sequential order of the file.
– Also called clustering index
– The search key of a primary index is usually but not
necessarily the primary key.
• Secondary index: an index whose search key specifies an
order different from the sequential order of the file. Also
called non-clustering index.
• Index-sequential file: ordered sequential file with a primary
index.
 Dense index
- index record appears for every search-key value in the file.
Binary Tree
 A binary tree is used in many applications for storing data
 A binary tree is allocated memory in two ways.
1) Sequential Representation
 This is the simplest technique to store a tree data structure. An array is used to
store the tree nodes. The number of nodes in a tree defines the size of the array.
The root node of the tree is stored at the first index in the array.
 In general, if a node is stored at the ith location then it’s left and right child is
stored at 2i and 2i+1 location respectively.
Binary
Tree

Sequential
representation of
Binary Tree
 we see that the left and right child of each node is stored at locations
2*(node_location) and 2*(node_location)+1 respectively.
 The location of node 3 in the array is 3. So it’s left child will be placed at
2*3 = 6. Its right child will be at the location 2*3 +1 = 7. As we can see in
the array, children of 3 which are 6 and 7 are placed at location 6 and 7 in
the array.
 The sequential representation of the tree is inefficient as the array which
is used to store the tree nodes takes up lots of space in memory. As the
tree grows, this representation becomes inefficient and difficult to
manage.
 2) Linked-list Representation
 In this type of representation, a linked list is used to store the tree nodes.
Several nodes are scattered in the memory in non-contiguous locations and
the nodes are connected using the parent-child relationship like a tree.
Each linked list node has three components:
• Left pointer
• Data part
• Right pointer
The left pointer has a pointer to the left child of the node; the right pointer
has a pointer to the right child of the node whereas the data part contains the
actual data of the node. If there are no children for a given node (leaf node),
then the left and right pointers for that node are set to null .
B-Tree
 It is a self-balanced tree as well as a specialized m-way tree that is used for
disk access.
 When the amount of data to be stored is very high, we cannot store the
entire data in the main memory. Hence we store data in the disk. Data access
from the disk takes more time when compared to the main memory access.
 When the number of keys of the data stored in disks is very high, the data
is usually accessed in the form of blocks. The time to access these blocks is
directly proportional to the height of the tree
 The B-Tree is a flat tree i.e. the height of the B tree is kept to a minimum.
Instead, as many keys are put in each node of the B-tree. By keeping the
height of the B-tree to the minimum, the access is faster when compared to
other balanced trees
• All leaves of B-tree are at the same level.
• A B-tree of order m can have at most m-1 keys and m children.
• Every node in B-tree has at most m children.
• Root node must have at least two nodes.
• Every node except the root node and the leaf node contain m/2
children
Searching ,Inserting, Deleting

Searching
 n the above tree if we want to look(search) for item 3, then we will
proceed as follows:
• Compare 3 with root elements. Here, 3 < 6 and 3 <15. So we proceed to
the left subtree.
• Compare 3 with 4 in the left subtree. As 3<4, we proceed to the left
subtree of node 4.
• The next node has two keys, 2 and 3. 3 > 2 while 3=3. So we have
found the key at this node. Return to the found location.
 Searching in B tree depends on the height of the tree.
Inserting

Let us insert item 70 in the tree shown below.


The given tree is with minimum degree ‘m’ = 3. Hence,
each node can accommodate, 2*m -1 = 5 keys inside it.
Now we insert item 70. As we can have 5 keys in a node,
we compare element 70 with the root element 30. Since
70 > 30, we will insert item 70 in the right subtree.
In the right subtree, we have a node with keys 40, 50, 60.
As each node can have 5 keys in it, we will insert the item
70 in this node itself.
Deletion
Let's assume that we want to delete the key 60 from
the B tree. If we check the B tree, we can find that
the key to be deleted is present in the leaf node. We
delete the key 60 from this leaf node.
After deleting the key 60, the B tree leaf node
satisfies the required properties and we need not do
any more modifications to the B tree.
However, if we had to delete key 20, then only key
10 would have remained in the left leaf node. In this
case, we would have to shift root 30 to the leaf node
and move key 40 to the root.
B+ tree
 It is an extension of the B- tree. The difference in B+ tree and B- tree is that in B-
tree the keys and records can be stored as internal as well as leaf nodes whereas
in B+ trees, the records are stored as leaf nodes and the keys are stored only in
internal nodes.
 The records are linked to each other in a linked list fashion. This arrangement
makes the searches of B+ trees faster and efficient. Internal nodes of the B+ tree are
called index nodes.
 The B+ trees have two orders i.e. one for internal nodes and other for leaf or
external nodes.
 While inserting as well as deleting, we should maintain the basic
properties of B+ Trees intact. However, deletion operation in the B+
tree is comparatively easier as the data is stored only in the leaf nodes
and it will be deleted from the leaf nodes always.
 Advantages
• We can fetch records in an equal number of disk accesses.
• Compared to the B tree, the height of the B+ tree is less and remains
balanced.
• We use keys for indexing.
• Data in the B+ tree can be accessed sequentially or directly as the leaf
nodes are arranged in a linked list.
• Search is faster as data is stored in leaf nodes only and as a linked list.
Hashing

 Hashing is a technique to quickly locate a data record in a database irrespective of the size of the
database.
 For larger databases containing thousands and millions of records, the indexing data structure
technique becomes very inefficient because searching a specific record through indexing will
consume more time. So, to counter this problem hashing technique is used.
 The hashing technique utilizes an auxiliary hash table to store the data records using a hash
function. There are 2 key components in hashing:
• Hash Table: A hash table is an array or data structure and its size is determined by the total
volume of data records present in the database. Each memory location in a hash table is called a
'bucket' or hash indice and stores a data record's exact location and can be accessed through a hash
function.
• Bucket: A bucket is a memory location (index) in the hash table that stores the data record. These
buckets generally store a disk block which further stores multiple records. It is also known as the
hash index.
• Hash Function: A hash function is a mathematical equation or algorithm that takes one data
record's primary key as input and computes the hash index as output.
Hash Function
A hash function is a mathematical algorithm that computes the index or the location
where the current data record is to be stored in the hash table so that it can be accessed
efficiently later. This hash function is the most crucial component that determines the
speed of fetching data.
Working of Hash Function
The hash function generates a hash index through the primary key of the data record.
Now, there are 2 possibilities:
1. The hash index generated isn't already occupied by any other value. So, the address of
the data record will be stored here.
2. The hash index generated is already occupied by some other value. This is called
collision so to counter this, a collision resolution technique will be applied.
3. Now whenever we query a specific record, the hash function will be applied and
returns the data record comparatively faster than indexing because we can directly reach
the exact location of the data record through the hash function rather than searching
through indices one by one.
Types of Hashing in DBMS
There are two primary hashing techniques.
1. Static Hashing
In static hashing, the hash function always generates the same bucket's address. For
example, if we have a data record for employee_id = 107, the hash function is mod-5 which
is - H(x) % 5, where x = id. Then the operation will take place like this
 The primary key is used as the input to the hash function and the hash function generates
the output as the hash index (bucket's address) which contains the address of the actual
data record on the disk block.
 Static Hashing has the following Properties
• Data Buckets: The number of buckets in memory remains constant. The size of the hash
table is decided initially and it may also implement chaining that will allow handling
some collision issues though, it's only a slight optimization and may not prove worthy if
the database size keeps fluctuating.
• Hash function: It uses the simplest hash function to map the data records to its
appropriate bucket. It is generally modulo-hash function
• Efficient for known data size: It's very efficient in terms when we know the data size
and its distribution in the database.
• It is inefficient and inaccurate when the data size dynamically varies because we have
limited space and the hash function always generates the same value for every specific
input. When the data size fluctuates very often it's not at all useful because collision will
keep happening and it will result in problems like - bucket skew, insufficient buckets etc.
2. Dynamic Hashing
Dynamic hashing is also known as extendible hashing, used to handle database that frequently changes
data sets. This method offers us a way to add and remove data buckets on demand dynamically. This
way as the number of data records varies, the buckets will also grow and shrink in size periodically
whenever a change is made.
 Properties of Dynamic Hashing
• The buckets will vary in size dynamically periodically as changes are made offering more flexibility
in making any change.
• Dynamic Hashing aids in improving overall performance by minimizing or completely preventing
collisions.
• It has the following major components: Data bucket, Flexible hash function, and directories
• A flexible hash function means that it will generate more dynamic values and will keep changing
periodically asserting to the requirements of the database.
• Directories are containers that store the pointer to buckets. If bucket overflow or bucket skew-like
problems happen to occur, then bucket splitting is done to maintain efficient retrieval time of data
records. Each directory will have a directory id.
• Global Depth: It is defined as the number of bits in each directory id. The more the number of
records, the more bits are there.
 Working of Dynamic Hashing
 Example: If global depth: k = 2, the keys will be mapped accordingly to the hash
index. K bits starting from LSB will be taken to map a key to the buckets. That
leaves us with the following 4 possibilities: 00, 11, 10, 01.
 As we can see in the above image, the k bits from LSBs are taken in
the hash index to map to their appropriate buckets through directory
IDs.
 The hash indices point to the directories, and the k bits are taken
from the directories' IDs and then mapped to the buckets.
 Each bucket holds the value corresponding to the IDs converted in
binary.

You might also like