CSI2004 – Advanced Database Management Systems
MODULE - 2
Parallel Databases - Architecture
Parallel Systems
Parallel systems improve processing and I/O speeds by
using multiple processors and disks in parallel.
Parallel database systems consist of multiple processors
and multiple disks connected by a fast interconnection
network.
A coarse-grain parallel machine consists of a small
number of powerful processors.
A massively parallel or fine-grain parallel machine
utilizes thousands of smaller processors.
Two main performance measures:
Throughput --- the number of tasks that can be
completed in a given time interval
Response time --- the amount of time it takes to
complete a single task from the time it is submitted
Speed-Up and Scale-Up
Speedup: executing a given task in less time by increasing the degree of parallelism.
Scaleup: handling larger tasks in the same amount of time by increasing the degree of parallelism.
Shared Nothing
•Advantages:
•The number of processors is scalable; the design is flexible enough to add more computers.
•Unlike in the other two architectures (shared memory and shared disk), only those data requests that cannot be answered by local processors need to be forwarded through the interconnection network.
•Disadvantages:
•Non-local disk accesses are costly: if a server receives a request for data it does not hold, the request must be routed to the server where the data is available, which adds some complexity.
•Communication costs are incurred in transporting data among computers.
Hierarchical
Combines characteristics of shared-memory, shared-disk,
and shared-nothing architectures.
Top level is a shared-nothing architecture – nodes are
connected by an interconnection network and do not share
disks or memory with each other.
Each node of the system could be a shared-memory system
with a few processors (base level).
Alternatively, each node could be a shared-disk system
(middle level), and each of the systems sharing a set of
disks could itself be a shared-memory system.
Distributed virtual-memory architectures reduce the
complexity of programming such systems.
Since access speeds to local and remote memory differ, such
systems are also called non-uniform memory architecture
(NUMA) systems.
Data Partitioning Strategy – I/O Parallelism
Data partitioning strategy
In I/O parallelism, a relation is partitioned (horizontally)
among many disks such that each tuple resides on one disk.
Data partitioning strategy
•Horizontal Fragmentation – partitioning a table using
conditions specified through the WHERE clause of an SQL query,
so as to distribute subsets of tuples (records). For example,
consider the following schema:
STUDENT (Regno, SName, Address, Branch, Phone)
•In this table, the Branch column stores the academic
branch in which the student is admitted. If there are
various branches such as ‘BTech CIVIL’, ‘BTech MECH’,
‘BTech CSE’, and so on, then the following query can be
used to partition the table horizontally:
SELECT * FROM student WHERE Branch = branch name;
•In the result, there is no change to the schema of the
table; i.e., the structure of the table is unaltered. The ‘BTech
CIVIL’ students are fragmented into one partition, the ‘BTech CSE’
students into another, and so on.
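
A minimal Python sketch of this fragmentation (the rows, register numbers, and branch values below are made-up sample data, not from the slides):

from collections import defaultdict

# Hypothetical sample rows for STUDENT (Regno, SName, Address, Branch, Phone).
students = [
    ("21BCE001", "Asha",  "Chennai", "BTech CSE",   "9000000001"),
    ("21BME002", "Ravi",  "Vellore", "BTech MECH",  "9000000002"),
    ("21BCE003", "Meena", "Chennai", "BTech CSE",   "9000000003"),
    ("21BCL004", "John",  "Madurai", "BTech CIVIL", "9000000004"),
]

# One fragment per distinct Branch value; every fragment keeps the
# full schema of the original table.
fragments = defaultdict(list)
for row in students:
    fragments[row[3]].append(row)   # row[3] is the Branch column

for branch, rows in fragments.items():
    print(branch, "->", [r[0] for r in rows])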
Data partitioning strategy
Round-robin:
Send the i-th tuple inserted into the relation to disk
D(i mod n), where n is the number of disks.
The round-robin scheme ensures an even distribution of
tuples across disks
Example:
• If the number of available disks n is 10,
• then first record goes to D1 (1 mod 10 = 1),
• second record goes to D2 (2 mod 10 =2), and so on
• 10th record goes to D0 (10 mod 10 = 0),
• 11th record goes to D1 (11 mod 10 = 1)
• This scheme distributes data evenly across all the disks.
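
A minimal Python sketch of the scheme, reproducing the record-to-disk assignments above:

# Round-robin placement: the i-th inserted tuple goes to disk i mod n.
n = 10  # number of disks D0..D9, as in the example above

def round_robin_disk(i: int, n: int) -> int:
    """Disk index for the i-th tuple (1-based position, as on the slide)."""
    return i % n

for i in (1, 2, 10, 11):
    print(f"record {i} -> D{round_robin_disk(i, n)}")
# record 1 -> D1, record 2 -> D2, record 10 -> D0, record 11 -> D1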
Data Partitioning Strategy
Hash partitioning:
Choose one or more attributes as the partitioning attributes.
Choose hash function h with range 0…n - 1
If the hash function returns i, then the tuple is placed on disk Di.
•For example, consider the following table;
EMPLOYEE(ENo, EName, DeptNo, Salary, Age)
•If we choose DeptNo attribute as the partitioning attribute and
if we have 10 disks to distribute the data, then the following
would be a hash function;
h(DeptNo) = DeptNo mod 10
If we have 10 departments, then according to this hash function,
all the employees of department 1 go to disk 1, those of
department 2 to disk 2, and so on (department 10 maps to
disk 0, since 10 mod 10 = 0).
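
A minimal Python sketch of this hash function:

# Hash partitioning EMPLOYEE on DeptNo across n = 10 disks.
def h(dept_no: int, n: int = 10) -> int:
    return dept_no % n  # the hash function from the slide

# Every tuple of a given department hashes to the same disk.
for dept in (1, 2, 10):
    print(f"DeptNo {dept} -> D{h(dept)}")
# DeptNo 1 -> D1, DeptNo 2 -> D2, DeptNo 10 -> D0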
Data Partitioning Strategy
Example 2:
• If we choose the EName of the employee as the
partitioning attribute, then we could use the
following hash function:
h(EName) = (sum of the ASCII values of every
character in the name) mod n,
•where n is the number of disks/partitions needed.
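
A minimal Python sketch of this name-based hash, assuming plain ASCII names:

# Hashing on EName: sum of the character codes, modulo n partitions.
def h_name(ename: str, n: int) -> int:
    return sum(ord(c) for c in ename) % n

print(h_name("Smith", 10))  # 83+109+105+116+104 = 517; 517 mod 10 = 7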
Data Partitioning Strategy
Range partitioning:
This strategy distributes tuples by assigning
contiguous attribute-value ranges to each disk.
Choose an attribute as the partitioning attribute.
A partitioning vector [v0, v1, ..., vn−2] is chosen.
Consider a tuple t such that t [A] = x.
If x < v0, then t goes on disk D0.
If x ≥ vn−2, then t goes on disk Dn−1.
If vi ≤ x < vi+1, then t goes on disk Di+1.
•Disadvantage:
• Execution skew may occur – a query over a popular value range
may be served by only one or a few disks while the others stay idle.
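
A minimal Python sketch of these placement rules; bisect_right counts how many vector entries vi satisfy vi ≤ x, which is exactly the disk index defined above:

import bisect

# Range partitioning with vector [v0, v1, ..., v(n-2)], exactly as stated:
# x < v0 -> D0; vi <= x < vi+1 -> Di+1; x >= v(n-2) -> D(n-1).
def range_disk(x, vector):
    return bisect.bisect_right(vector, x)

vector = [2, 4]  # n = 3 disks
for x in (0, 2, 3, 5):
    print(f"x = {x} -> D{range_disk(x, vector)}")
# x = 0 -> D0, x = 2 -> D1, x = 3 -> D1, x = 5 -> D2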
Data Partitioning Strategy
[Figure: example distribution of tuples across disks D0–D2]
Data Partitioning Strategy
•Hash Partitioning:
•We use the GRADE attribute of Emp_table to explain hash
partitioning.
•Let us choose the following hash function:
h(GRADE) = GRADE mod n
•where GRADE is the value of the GRADE attribute of a record
and n is the number of partitions, which is 3 in our case.
•Applying hash partitioning on GRADE gives the following
partitions of Emp_table.
•For example:
•GRADE of ‘Smith’ is 1, and the hash function gives
partition 1 (1 mod 3 = 1).
•GRADE of ‘Blake’ is 4; (4 mod 3 = 1) also directs it to
partition 1.
•GRADE of ‘King’ is 5; (5 mod 3 = 2) directs it to partition 2.
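
A minimal Python check of the worked example:

# Re-checking the worked example: h(GRADE) = GRADE mod 3.
for name, grade in (("Smith", 1), ("Blake", 4), ("King", 5)):
    print(f"{name} (GRADE {grade}) -> partition {grade % 3}")
# Smith -> 1, Blake -> 1, King -> 2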
Data Partitioning Strategy
h(GRADE) = GRADE mod n
[Figure: Emp_table hash-partitioned on GRADE into disks D0, D1, D2]
Data Partitioning Strategy
•Range Partitioning:
•We use the GRADE attribute of Emp_table to illustrate
range partitioning.
•To apply range partitioning, we first need to identify the
partitioning vector [v0, v1, …, vn−2].
•Let us choose the following vector as the range-partitioning
vector for our case:
[2, 4]
•According to this vector (treating each entry as an inclusive
upper bound), records are placed by GRADE value as follows:
•0 to 2: Partition 0
•3 to 4: Partition 1
•Greater than 4: Partition 2
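
A minimal Python sketch reproducing these buckets; since each vector entry acts here as an inclusive upper bound, bisect_left gives the partition number. (Note that under the formal rule stated earlier, GRADE 2 would instead go to D1; this slide's buckets treat the boundaries as inclusive.)

import bisect

# Inclusive-upper-bound buckets: 0-2 -> P0, 3-4 -> P1, >4 -> P2.
vector = [2, 4]
for grade in (0, 2, 3, 4, 5):
    print(f"GRADE {grade} -> partition {bisect.bisect_left(vector, grade)}")
# GRADE 0 -> 0, 2 -> 0, 3 -> 1, 4 -> 1, 5 -> 2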
Data Partitioning Strategy
[Figure: Emp_table range-partitioned on GRADE – 0 to 2 into Partition 0 (D0), 3 to 4 into Partition 1 (D1), greater than 4 into Partition 2 (D2)]
Example 2
[Figure: Table 1 – records R1–R8 with salary values 10002, 23000, 22051, 30011, 9031, 29002, 4004, 21016]
Data Partitioning Strategy
•Round-Robin Technique
i mod n
•Here, the variable i represents the position (physical order) of the
record in the table, and n represents the number of target disks
across which the table is to be partitioned.
•For example, the records of Table 1, according to the function
i mod 4, are sent to the 4 disks as follows:
• Disk 0 : R4, R8 (records with salary values 30011 and 21016, respectively)
• Disk 1 : R1, R5 (records with salary values 10002 and 9031, respectively)
• Disk 2 : R2, R6 (records with salary values 23000 and 29002, respectively)
• Disk 3 : R3, R7 (records with salary values 22051 and 4004, respectively)
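
A minimal Python sketch re-deriving this assignment from the salary values of Table 1:

# Round-robin for Table 1: disk = i mod 4, i is the 1-based record position.
salaries = [10002, 23000, 22051, 30011, 9031, 29002, 4004, 21016]  # R1..R8
for i, salary in enumerate(salaries, start=1):
    print(f"R{i} (salary {salary}) -> Disk {i % 4}")
# Disk 0: R4, R8; Disk 1: R1, R5; Disk 2: R2, R6; Disk 3: R3, R7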
Data Partitioning Strategy
•Hash Partitioning
h(attribute) = attribute mod n
For example, for record R7:
h(salary) = salary mod 4 = 4004 mod 4 = 0, so R7 goes to Disk 0.
•With this function, the records are distributed as follows:
•Disk 0 : R2, R7, R8
•Disk 1 : (no records – hash partitioning can still leave a disk empty)
•Disk 2 : R1, R6
•Disk 3 : R3, R4, R5
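
A minimal Python sketch reproducing this distribution:

# Hash partitioning Table 1 on salary: h(salary) = salary mod 4.
salaries = [10002, 23000, 22051, 30011, 9031, 29002, 4004, 21016]  # R1..R8
disks = {d: [] for d in range(4)}
for i, salary in enumerate(salaries, start=1):
    disks[salary % 4].append(f"R{i}")
print(disks)
# {0: ['R2', 'R7', 'R8'], 1: [], 2: ['R1', 'R6'], 3: ['R3', 'R4', 'R5']}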
Data Partitioning Strategy
•Range Partitioning
•For range partitioning, we need a range vector.
•We have chosen the following range vector:
v = [5000, 10000, 25000]
•This vector represents 4 partitions: salaries of 5000 and less;
above 5000 up to 10000; above 10000 up to 25000; and salaries
above 25000.
•For this range vector, the data is distributed as follows:
•Disk 0 : R7 (only record R7 is ≤ 5000)
•Disk 1 : R5 (only record R5 is > 5000 and ≤ 10000)
•Disk 2 : R1, R2, R3, R8 (the range > 10000 and ≤ 25000; note that R1, with salary 10002, falls in this range)
•Disk 3 : R4, R6 (the range > 25000)
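
A minimal Python sketch reproducing this distribution; again each vector entry acts as an inclusive upper bound, so bisect_left gives the disk index:

import bisect

# Range partitioning Table 1 on salary with range vector [5000, 10000, 25000].
salaries = [10002, 23000, 22051, 30011, 9031, 29002, 4004, 21016]  # R1..R8
vector = [5000, 10000, 25000]
disks = {d: [] for d in range(4)}
for i, salary in enumerate(salaries, start=1):
    disks[bisect.bisect_left(vector, salary)].append(f"R{i}")
print(disks)
# {0: ['R7'], 1: ['R5'], 2: ['R1', 'R2', 'R3', 'R8'], 3: ['R4', 'R6']}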
Data Partitioning Strategy
Access to data can be of the following types:
1. Scanning the entire relation.
2. Point queries – locating all tuples for which a given attribute has a specified value (e.g., r.A = 25).
3. Range queries – locating all tuples for which a given attribute takes a value within a specified range (e.g., 10 ≤ r.A < 25).
Comparison of Partitioning Techniques
■ Round-Robin
■ Advantages
• Best suited for queries that scan the entire relation.
• All disks hold almost an equal number of tuples, so retrieval
work is well balanced across the disks.
■ Disadvantages
• Both point queries and range queries are difficult to process.
• No clustering – tuples are scattered across all disks.
Comparison of Partitioning Techniques
■ Hash Partitioning
■ Advantages
• Tuples are distributed evenly if the hash function is good and
the partitioning attributes form a key; well suited for scans of
the entire relation.
• Good for point queries on the partitioning attribute – only one
disk needs to be accessed.
■ Disadvantages
• No clustering by value, so range queries are difficult to process.
• Point queries on non-partitioning attributes must still search all disks.
Comparison of Partitioning Techniques
■ Range Partitioning
■ Advantages
• Provides data clustering by partitioning-attribute values.
• Good for point and range queries on the partitioning attribute –
only one or a few disks need to be accessed.
■ Disadvantages
• A badly chosen partitioning vector can lead to skew: a query over
a popular range may be answered by a single disk while the others
stay idle.
Handling of Skew
The distribution of tuples to disks may be skewed — that is,
some disks have many tuples, while others may have fewer
tuples.
Types of skew:
Attribute-value skew.
Some values appear in the partitioning attributes of many
tuples; all the tuples with the same value for the partitioning
attribute end up in the same partition, resulting in skew
Can occur with range-partitioning and hash-partitioning.
Partition skew.
Refers to the fact that there may be load imbalance in the
partitioning, even when there is no attribute skew.
With range partitioning, a badly chosen partitioning vector may
assign too many tuples to some partitions and too few to others.
Less likely with hash partitioning if a good hash function is chosen.
Handling of Skew
• Even a small skew can result in a significant decrease in
performance.
• Thus, the loss of speedup due to skew increases with the degree of
parallelism.
Handling of Skew
• A balanced partitioning vector can be constructed from a
histogram in a relatively straightforward fashion.
• Assume a uniform distribution of values within each range of the
histogram.
• The histogram can be constructed by scanning the relation, or by
sampling (blocks containing) tuples of the relation.
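
A minimal Python sketch, assuming the partitioning vector is derived from a sorted sample of attribute values (a simple stand-in for an equi-depth histogram); the function name balanced_vector is illustrative:

# Build an approximately balanced range-partitioning vector from a
# sorted sample of the partitioning attribute.
def balanced_vector(sample, n_partitions):
    sample = sorted(sample)
    step = len(sample) / n_partitions
    # Pick n-1 cut points at equally spaced ranks, so each partition
    # receives roughly the same number of tuples.
    return [sample[int(i * step)] for i in range(1, n_partitions)]

salaries = [10002, 23000, 22051, 30011, 9031, 29002, 4004, 21016]
print(balanced_vector(salaries, 4))  # -> [10002, 22051, 29002]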
Handling of Skew
• Handling Skew Using Virtual Processor Partitioning
• Create a large number of logical (virtual) partitions – many more
than the number of real processors – partition the relation across
the virtual partitions, and map virtual partitions to real
processors in round-robin fashion.
• A skewed range is then split across several real processors, so no
single processor carries the whole overloaded range.
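
A minimal Python sketch of the mapping, with illustrative counts of 16 virtual partitions and 4 real processors:

# Virtual-processor partitioning: many virtual partitions are mapped
# round-robin onto a few real processors, so a skewed (hot) range is
# spread across several real processors instead of one.
n_virtual, n_real = 16, 4  # illustrative counts
virtual_to_real = {v: v % n_real for v in range(n_virtual)}

# Virtual partitions 0..3 land on processors 0..3, 4..7 again on 0..3, ...
print(virtual_to_real)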