Parallel Databases
Motivation
Almost abandoned before, but nowadays:
Databases are growing increasingly large
PB data is collected and stored
transaction data (12306), SNS
multimedia data, increasingly stored
Byte、KB、MB、GB、TB、PB、EB、ZB、YB、DB、NB
Bank of Holland:20 DC, 7PB disks, 20PB
magnetic tape storage, 50~70% increase / year
Parallel machines, common and affordable
Prices dropped sharply: microprocessors, memory,
disks
2 Advanced Database System Principle
Motivation
Thus, large-scale parallel database systems is needed:
to store large volumes of data
to process time-consuming decision-support queries
to provide high throughput for transaction processing
3 Advanced Database System Principle
Parallel Systems
Parallel database systems consist of
multiple processors
multiple disks
connected by a fast interconnection network.
A coarse-grain parallel machine consists of a small number of
powerful processors
A massively parallel or fine grain parallel machine utilizes
thousands of smaller processors.
4 Advanced Database System Principle
Parallel Systems
Two main performance measures:
throughput --- the number of tasks that can be completed
within a given time
response time --- the amount of time it takes to complete a
single task from the time
5 Advanced Database System Principle
Speed-Up and Scale-Up
Speedup: a fixed-sized problem executing on a small
system, then is given to a system which is N-times larger.
Measured by:
speedup = small system elapsed time 小系统要更多时间
large system elapsed time 大系统完成更快
Speedup is linear if equation equals N.
6 Advanced Database System Principle
Speed-Up and Scale-Up
Scaleup: increase the size of both the problem and the
system
N-times larger system, perform N-times larger job
Measured by:
scaleup = small system small problem elapsed time
big system big problem elapsed time
Scale up is linear if equation equals 1.
7 Advanced Database System Principle
Transaction Scaleup
Transaction scaleup:
Numerous small queries submitted to a shared database
Problem: N-times (users submitting) requests, N-times
larger database, use an N-times larger computer
Well-suited to parallel execution.
8 Advanced Database System Principle
Factors Limiting Speedup and Scaleup
Speedup and scaleup are often sublinear due to:
Startup costs:
important when the degree of parallelism is high
Interference:
Different processors, etc.
Skew:
Overall execution time determined by the slowest one
9 Advanced Database System Principle
Parallel Database Architectures
Several major architectures:
Shared memory -- processors share common memory
Shared disk -- processors share common disk
Shared nothing -- processors share neither common memory
nor common disk
Hierarchical -- hybrid of the above architectures
10 Advanced Database System Principle
Parallel Database Architectures
11 Advanced Database System Principle
Shared Memory
Processors & disks access common memory
via a bus, or through an interconnection network.
Extremely efficient communication between processors
data in shared memory can be accessed by any processor
processor can send message to others thru main memory
Bottleneck
not scalable
<= 32 or 64 processors
Widely used for lower degrees of parallelism.
12 Advanced Database System Principle
Shared Disk
Access all disks via interconnection network
but processors have own memories
memory bus is not a bottleneck
+: fault-tolerance — once a processor fails, other
processors continue its tasks (database is accessible from
all processors).
Bottleneck: occurs at interconnection to the disk
Can scale up to a good number of processors
With slow communication
13 Advanced Database System Principle
Shared Nothing
Each node is complete one
Communicate thru interconnection network
Each node is treated as a Server
Examples:
Teradata, Tandem, Oracle-n CUBE
Shared-nothing can be scaled up to thousands of processors
Main drawback:
cost of communication
non-local disk access
sending data involves software interaction at both ends.
14 Advanced Database System Principle
Hierarchical
Combines :
shared-memory, shared-disk, shared-nothing
Top level is a shared-nothing architecture
Each node of the system, could be:
a shared-memory system
a shared-disk system
a shared-memory system
To reduce the complexity, use distributed virtual-memory
Logically, has a shared memory
Physically, has many independent main memory system
Also called non-uniform memory architecture
(NUMA)
15 Advanced Database System Principle
Content
I/O Parallelism
Interquery Parallelism
Intraquery Parallelism
Intraoperation Parallelism
Interoperation Parallelism
16 Advanced Database System Principle
Parallelism in Databases
Data partitioned across multiple disks, if parallel I/O
Individual relational operations is executed in parallel
E.g., sort, join, aggregation
each processor works independently
Queries can be run on different processors
expressed in high level language (SQL)
Different queries can be run in parallel
Concurrency control takes care of conflicts.
17 Advanced Database System Principle
I/O Parallelism
Partition a big relation to multiple disks
To Reduce time required to retrieve relations from disk:
Horizontal, vertical
Horizontal partitioning, row-wise:
tuples of a relation are divided to a number of disks
18 Advanced Database System Principle
I/O Parallelism
Partitioning techniques (number of disks = n):
Round-robin: (roulette)
Send the ith tuple of the relation to disk #=i mod n.
Hash partitioning:
Choose one or more attributes as the partitioning attributes.
Choose hash function h with range 0…n - 1
Let i denote result of hash function h
then send tuple to disk i.
19 Advanced Database System Principle
I/O Parallelism (Cont.)
Range partitioning:
Choose an attribute as the partitioning attribute
Choose a partitioning vector [vo, v1, ..., vn-2]
Let v be value of a tuple in the partitioning attribute
Tuples with v < v0 go to disk 0
Tuples such that vi vi+1 , go to disk i + 1
Tuples with v vn-2 go to disk n-1.
20 Advanced Database System Principle
I/O Parallelism (Cont.)
E.g., with a partitioning vector [5,11],
a tuple with partitioning attribute value of 2, go to disk 0,
a tuple with value 8, go to disk 1,
a tuple with value 20, go to disk2.
21 Advanced Database System Principle
Comparison of Partitioning Techniques
Round robin:
Advantages
Best suited for sequential scan
All disks have almost an equal number of tuples
retrieval work is thus well balanced between disks
Point, range queries are difficult to process
No clustering -- tuples are scattered across all disks
22 Advanced Database System Principle
Comparison of Partitioning Techniques(Cont.)
Hash partitioning:
Good for point queries
lookup single disk, leave others available for
answering other queries.
Can have index, making lookup & update efficient
Good for sequential scan
If hash function is good, and tuples equally
distributed over disks
Retrieval work is balanced
No clustering, so difficult to answer range queries
23 Advanced Database System Principle
Comparison of Partitioning Techniques (Cont.)
Range partitioning:
Provides data clustering, by partitioning attribute
Good for sequential access
Good for point queries: only one disk
For range queries, a few disks may be accessed
Remaining disks are available for other queries
Good if result tuples are distributed to a few blocks
Disadvantage:
If data skew, satisfying tuples, only distributed on
several blocks, cause low parallelism !
However, in Round robin, Hash Partitioning, many
blocks maybe fetched , which is a high parallelism
24 Advanced Database System Principle
Partitioning techniques need to consider
We need to consider:
Small relations:
a relation contains only a few tuples
assign the relation to a single disk.
Large relations are preferably partitioned across all
available disks.
For a relation of m disk blocks, given n disks, the
relation should be allocated min(m,n) disks.
25 Advanced Database System Principle
Handling of Skew
The tuples distribution may be skewed
some disks have many tuples,
while others have fewer tuples
Types of skew:
Attribute-value skew
all the tuples have the same value
occurs in range-partitioning & hash-partitioning
Partition skew
for range-partitioning, a bad partition vector
assign too many tuples to some partitions
Less likely happened in hash-partitioning
26 Advanced Database System Principle
Handling Skew in Range-Partitioning
To create a balanced partitioning vector:
Sort the relation on the partitioning attribute
Construct partition vector by scanning sorted relation:
After read every 1/nth of the relation
partitioning attribute value of next tuple, add to vector
Still have imbalanced partitions
if duplicates exist for the same value
Alternative technique based on histograms is used
27 Advanced Database System Principle
Handling Skew Using
Virtual Processor Partitioning
Handling skew using virtual processor partitioning:
create a large number of partitions (say 10 to 20 times
the number of processors)
Assign virtual processors to partitions either in:
round-robin fashion, or
based on estimated cost of processing each virtual partition
Basic idea:
the skew might spread over a number of virtual partitions
gets distributed evenly!
28 Advanced Database System Principle
To summarize
Partitioning Techniques
Round-robin
Hash partitioning
Range partitioning
Types of skew
Attribute-value skew
Partition skew
Handling Skew
Balanced partitioning vector
virtual processor partitioning
29 Advanced Database System Principle
Content
I/O Parallelism
Interquery Parallelism
Intraquery Parallelism
Intraoperation Parallelism
Interoperation Parallelism
30 Advanced Database System Principle
Interquery Parallelism
Interquery Parallelism:
Queries/transactions execute in parallel
Could increase transaction throughput;
Could scale up a transaction processing system
Could process a larger number of transactions per
second.
31 Advanced Database System Principle
Interquery Parallelism
Is the easiest form of parallelism in DB
particularly in a shared-memory parallel database
Is complicated, for shared-disk or shared-nothing:
Locking & logging must be coordinated by passing
messages;
Data in a local buffer may be updated by other
processor
Cache-coherency has to be maintained
reads & writes buffer must use the latest version
32 Advanced Database System Principle
Cache Coherency Protocol
Intuitive cache coherency protocol:
Before reading/writing, the page must be locked in
shared/exclusive mode
After lock, read the page from disk
Before unlock, write back the modified page
Release a lock
Many other more complex protocols
33 Advanced Database System Principle
Content
I/O Parallelism
Interquery Parallelism
Intraquery Parallelism
Intraoperation Parallelism
Interoperation Parallelism
34 Advanced Database System Principle
Intraquery Parallelism
Intraquery Parallelism:
execution of a single query in parallel on multiple processors;
important to speed up long-running queries.
Two complementary forms of intraquery parallelism :
Intraoperation Parallelism
parallelize each individual operation: sort, selection, join
Interoperation Parallelism
execute different operations in parallel
the first form is better when:
# of tuples >> # of operations
35 Advanced Database System Principle
Content
I/O Parallelism
Interquery Parallelism
Intraquery Parallelism
Intraoperation Parallelism
Interoperation Parallelism
36 Advanced Database System Principle
Parallel Sort
For intraoperation parallelism
Range-Partitioning Sort
Choose processors P0, ..., Pm, m n -1, then sorting
Then create range-partition vector with m entries
Step 1: Redistribute the relation using range partitioning
all tuples in the ith range, sent to Pi
Pi stores the tuples to disk Di
Step 2: each processor Pi sorts its partition locally
37 Advanced Database System Principle
Parallel Sort
Each processor executes same operation (sort) in parallel
Final merge operation is trivial:
for 1 ≤ j ≤ m, the key value in processor Pi ≤ that in Pj.
Piperforms merge on the data stream received, to get a
single sorted run
The sorted runs on processors P0,..., Pm-1 are
concatenated to get the final result.
38 Advanced Database System Principle
Parallel Join
The join operation test pairs of tuples, satisfying the join:
If yes, the pair is added to the join output.
Parallel join algorithms attempt to split the pairs to be
tested over several processors
Each processor computes part of the join locally
In the last step, the results from each processor is
collected to get the final results.
39 Advanced Database System Principle
Parallel Join
Types of parallel join
Partitioned join
Fragment and replicate
Partitioned Parallel Hash-Join
Parallel Nested-Loop Join
40 Advanced Database System Principle
Partitioned Join
41 Advanced Database System Principle
Partitioned Join (Cont.)
For equi-joins and natural joins:
Equi-join
SELECT * FROM employee INNER JOIN department ON
[Link] = [Link]
Natural join:
SELECT * FROM employee NATURAL JOIN department
42 Advanced Database System Principle
Partitioned Join (Cont.)
For equi-joins and natural joins:
Let r and s be the input relations, to compute r s
r.A=s.B
r , s is partitioned into n partitions: r0, r1, ..., rn-1 and s0, s1, ..., sn-1
Can use either range partitioning or hash partitioning.
r and s must be partitioned on the join attributes r.A and s.B,
use the same functions: range-partitioning vector, hash function
Partitions ri and si are sent to Pi,
Each Pi locally computes ri si ri.A=si.B
Any of the standard join methods can be used.
43 Advanced Database System Principle
Fragment-and-Replicate Join
Partitioned-join do not fit for all conditions
e.g., non-equijoin conditions, such as r.A > s.B
All tuples in r might participate join
In this case, parallelization is implemented by
fragment and replicate
44 Advanced Database System Principle
Depiction of Fragment-and-Replicate Joins
45 Advanced Database System Principle
Fragment-and-Replicate Join (Cont.)
Can reduce the sizes of the relations at each P
r is partitioned into n partitions: r0, r1, ..., r n-1
s is partitioned into m partitions: s0, s1, ..., sm-1
use any partitioning technique
there must be at least m * n processors
label the processors as
P0,0, P0,1, ..., P0,m-1, P1,0, ..., Pn-1m-1
ri is replicated to Pi,0, Pi,1, ..., Pi,m-1,
si is replicated to P0,i, P1,i, ..., Pn-1,i
computes the join of ri with sj
Pi,j
Any join technique can be used at each processor Pi,j
46 Advanced Database System Principle
Fragment-and-Replicate Join
Special case: asymmetric fragment-and-replicate:
One relation, r, is partitioned;
The other relation, s, is replicated across all the
processors;
Pi locally computes the join of ri with all s using any
join technique
47 Advanced Database System Principle
Fragment-and-Replicate Join (Cont.)
Both versions of fragment-and-replicate fit for any join
Usually has a higher cost than partitioning
At least one relation (for asymmetric version) is replicated
Sometimes asymmetric fragment-and-replicate is preferable
E.g., s is small, r is large, and is already partitioned.
It may be cheaper to replicate s to P, rather than repartition r
and s on the join attributes.
48 Advanced Database System Principle
Partitioned Parallel Hash-Join
Parallelizing partitioned hash join:
Assume s is smaller, then s is chosen as the build relation
hash function h1 maps tuples in s to one of the processors
Each Pi reads tuples of si and sends each tuple to the
processor decided by hash function h1.
On the processors, tuples si are further partitioned by hash
function h2, to compute the hash-join locally (Cont.)
49 Advanced Database System Principle
Partitioned Parallel Hash-Join (Cont.)
Then relation r is redistributed across m processors by h1
Let ri denote the tuples of relation r, sent to Pi.
After ri is received by Pi, repartition ri using h2
Each Pi executes the hash-join algorithm on partitions ri , si,
to produce part of the final results.
Note: Hash-join on each processor is independent
50 Advanced Database System Principle
Parallel Index Nested-Loop Join
Assume that
relation s << r, and r is stored in partitions
there is an index, on join attribute of relation r
Use asymmetric fragment-and-replicate
replicate s, use the existing partitioning of r
Pjreads tuples of sj, and replicates to every other Pi.
s is replicated at all sites, that store tuples of r
Piperforms an indexed nested-loop join of relation s with
the ith partition of relation r
51 Advanced Database System Principle
Other Relational Operations
Selection (r)
If takes the form ai = v
If r is partitioned on ai , the selection is performed at a single
processor.
If takes the form l <= ai <= u (range selection)
If the relation has been range-partitioned on ai
Selection is performed at each processor whose partition
overlaps with the specified range of values.
In all other cases: the selection is performed in parallel at
all processors.
52 Advanced Database System Principle
Other Relational Operations (Cont.)
Duplicate elimination
In any parallel sort technique
eliminate duplicates ASA found during sorting
Can also partition the tuples (range- or hash-
partitioning)
Then perform duplicate elimination locally
Projection
53 Advanced Database System Principle
Grouping/Aggregation
Step1: Partition the relation, on the grouping attributes,
Step2: Compute the aggregate values locally, at each P
Can reduce cost of transferring tuples
by partly computing aggregate values before partitioning
54 Advanced Database System Principle
Grouping/Aggregation
E.g., sum aggregation operation:
Perform sum at each Pi , on disk Di , stores tuples
Get tuples with partial sum value at each P
Then, local sum values are partitioned on grouping
attribute
and sum is performed again at each Pi to get final result
Fewer tuples need to be transferred
Save bandwidth as well as the transferring time
55 Advanced Database System Principle
Cost of Parallel Evaluation of Operations
If no skew: expected speed-up =1/n
If skew, the time can be estimated as
Tpart + Tasm + max (T0, T1, …, Tn-1)
Tpart is the time for partitioning the relations
Tasm is the time for assembling the results
Ti is the time taken for the operation at processor Pi
decided by the slowest one, need consider many factor
56 Advanced Database System Principle
Content
I/O Parallelism
Interquery Parallelism
Intraquery Parallelism
Intraoperation Parallelism
Interoperation Parallelism
57 Advanced Database System Principle
Interoperation Parallelism
Pipelined parallelism,Independent parallelism
Pipelined parallelism
Consider a join of four relations
r1 r2 r3 r4
Set up a pipeline that computes the three joins in parallel
Let P1 be assigned the computation of
temp1 = r1 r2
And P2 be assigned: temp2 = temp1 r3
And P3 be assigned: temp3 = temp2 r4
Each of these operations can execute in parallel,
sending results to the next operation
58 Advanced Database System Principle
Factors Limiting Utility of
Pipeline Parallelism
Pipeline parallelism is useful since it avoids writing
intermediate results to disk
Useful with a small number of processors, but does not
scale up well with more processors.
One reason: pipeline chains do not attain sufficient length.
Cannot pipeline operators which do not produce output until
all inputs have been accessed (e.g. aggregate and sort)
Little speedup when one operator's execution cost is much
higher than the others.
59 Advanced Database System Principle
Independent Parallelism
Independent parallelism
Consider a join of four relations
r1 r2 r3 r4
Let P1 be assigned the computation of
temp1 = r1 r2
And P2 be assigned: temp2 = r3 r4
And P3 be assigned: temp1 temp2
P1 and P2 can work independently in parallel
P3 has to wait for input from P1 and P2
Can pipeline output of P1 and P2 to P3, combining
independent parallelism and pipelined parallelism
Does not provide a high degree of parallelism
useful with a lower degree of parallelism.
less useful in a highly parallel system,
60 Advanced Database System Principle
Query Optimization
Query optimization in parallel databases
is much more complex than in conventional databases
Reason: cost models are more complicated, consider:
partitioning costs, skew, resource competition
When scheduling execution tree in parallel system, must
decide:
How to parallelize each operation,
how many processors,
What operations to pipeline,
what operations to execute independently in parallel,
What operations to execute sequentially.
61 Advanced Database System Principle
Query Optimization (Cont.)
Determining the amount of resources to allocate for each
operation is a problem.
E.g., allocating more processors than optimal can result
in high communication overhead.
Long pipelines should be avoided as the final operation
may wait a lot for inputs, while holding previous resources.
62 Advanced Database System Principle
Query Optimization (Cont.)
The number of parallel evaluation plans is much larger
than the number of sequential evaluation plans
therefore heuristics are needed while optimization
Two alternative heuristics for choosing parallel plans:
do not use pipelining
parallelize every operation across all processors
63 Advanced Database System Principle
Query Optimization (Cont.)
Two alternative heuristics for choosing parallel plans:
First, choose the most efficient sequential plan
Then, parallelize the operations in that plan
64 Advanced Database System Principle
End of Chapter