Distributed DB
Distributed DB
COP5711
Database 1
Database 3
Server 1
Communication Network
Server 3
Database 2
Server 2
Application
(front-end)
computer
Interface
Processor
Access
Processor
Access
Processor
Access
Processor
3. Improved Reliability/Availability:
Disadvantages of DDBSs
Cost: replication of effort (manpower).
Security: More difficult to control
Complexity:
Distributed DBMS
Architecture
NetworkTransparancy
The user should be protected from the
operational details of the network.
It is desirable to hide even the existence
of the network, if possible.
Location transparency: The command used is
independent of the system on which the data is
stored.
Naming transparency: a unique name is
provided for each object in the database.
Site A
Copy 1 of R2
Relation R
Fragment R1
Fragment R2
Copy 2 of R1
Site B
Fragment R3
Site C
Fragment R4
Copy 2 of R2
ANSI/SPARC Architecture
External Schema
External
view
External
view
Conceptual Schema
Conceptual
view
Internal Schema
Internal
view
External
view
Internal view: deals with the physical definition and organization of data.
Conceptual view: abstract definition of the database. It is the real
world view of the enterprise being modeled in the database.
Homogeneous
Federated
Loosely coupled
(interoperable DB
systems using
export schema)
Heterogeneous
(Multidatabase)
Unfederated
(no local users)
Tightly coupled
(/w global schema)
Global user
view n
A homogeneous
DDBMS resembles a
Global Schema
Fragmentation
Schema
Allocation
Schema
Local
conceptual
schema 1
Local
conceptual
schema n
Local
internal
schema 1
Local
internal
schema n
Local DB 1
Local DB n
across a number of
sites in a network.
D
E
Local
user
DBMS
Database 1
Homogeneous DDBMS
Global
No local users
user
Most systems do not have
local schemas (i.e., every user
Local
uses the same schema)
user
Multidatabase
Heterogeneous DDBMS
Management
There are both local and
system
global users
Multidatabase systems are
split into:
Tightly Coupled Systems:
DBMS DBMS DBMS
have a global schema
Loosely Coupled Systems:
do not have a global
Database 2
Database 3
Database 4
schema.
Global user
view n
Local user
view 1
Local user
view 2
An individual nodes
participation in the MDB
is defined by means of a
participation schema.
Local
Participation
Schema 1
Local
Participation
Schema 1
Auxiliary
Schema 1
Local
Conceptual
Schema 1
Local
Conceptual
Schema 1
Local user
view 1
Local
Internal
Schema 1
Local
Internal
Schema 1
Local user
view 2
Local DB 1
Local DB 1
Loosely-Coupled Systems
Local
user view 1
Local
user view 2
Global
user view 1
Global
user view 2
Global
user view 3
Local
Conceptual
schema 1
Local
Conceptual
Schema 2
Local
Conceptual
Schema n
Local
internal
schema 1
Local
internal
Schema 2
Local
internal
Schema n
Local DB 1
Local DB 2
Local DB n
Loosely-Coupled Systems
Global
user view 1
Export
schema 1
Local
user view 1
Local
user view 2
Global
user view 2
Export
schema 2
Export
Schema 3
Global
user view m
Export
Schema n
Local
Conceptual
schema 1
Local
Conceptual
Schema 2
Local
Conceptual
Schema n
Local
internal
schema 1
Local
internal
Schema 2
Local
internal
Schema n
Local DB 1
Local DB 2
Local DB n
System Requirements
(Objectives)
Conceptual
design
Global
conceptual
schema
Defining the
interfaces for
end users
View integration
View Design
Access
information
External Schema
Definitions
Distribution Design
Local Conceptual Schemas
Maps the local
conceptual
schemas to
physical storage
devices
Physical Design
Physical Schema
Fragmentation
& allocation
Fragmentation Alternatives
J
JNO
J1
J2
J3
J4
JNAME
BUDGET
Instrumental
Database Dev.
CAD/CAM
Maintenance
150,000
135,000
250,000
350,000
Montreal
New York
New York
Paris
Horizontal Partitioning
J1
JNO
J1
J2
J2
JNO
JNAME
Instrumental
Database Dev.
JNAME
J3 CAD/CAM
J4 Maintenance.
LOC
Vertical Partitioning
BUDGET
LOC
150,000
135,000
Montreal
New York
BUDGET
150,000Montreal
310,000 Paris
LOC
JNO
J1
J2
J3
J4
BUDGET
150,000
135,000
250,000
310,000
JNO JNAME
J1
J2
J3
J4
LOC
Instrumentation
Database Devl
CAD/CAM New
Maintenance
Montreal
New York
York
Paris
Disadvantages:
Vertical fragmentation may incur overhead.
Attributes participating in a dependency may be
allocated to different sites.
Degree of Fragmentation
Application views are usually subsets of
relations. Hence, it is only natural to
consider subsets of relations as
distribution units.
The appropriate degree of fragmentation
is dependent on the applications.
Correctness Rules
Vertical Partitioning
Lossless
decomposition
Dependency
preservation
Horizontal Partitioning
Disjoint fragments
Allocation Alternatives
Partitioning: No replication
Partial Replication: Some
fragments are replicated
Full Replication: Database
exists in its entirety at
each site
Notations
S
Title SAL
L1
L2
L3
Simple Predicates
Given a relation R(A1, A2, , An) where Ai has domain Di, a
simple predicate pj defined on R has the form
pj: Ai Value
where
{, , , , , } and Value Di
Example:
JNO
JNAME
J1 Instrumental
J2 Database Dev.
J3 CAD/CAM
J4 Maintenance
BUDGET
LOC
150,000 Montreal
135,000 New York
250,000 New York
350,000 Orlando
Simple predicates:
MINTERM PREDICATE
Given a set of simple predicates for relation R.
P = {p1, p2, , pm}
The set of minterm predicates
M = {m1, m2, , mn}
is defined as
M = {mi | mi =
where
p *j
p j P
TITLE
Elect. Eng.
SAL
40,000
Syst. Analy.
54,000
Mech. Eng.
32,000
Programmer
42,000
p *j p j or p *j p j
Some corresponding
minterm predicates:
m1 : TITLE " Elect .Eng." SAL 30,000
m 2 : TITLE " Elect .Eng" SAL 30,000
L2
L3
Owner(L3) = J
Horizontal Fragments
Thus, a horizontal fragment Ri of relation R
consists of all the tuples of R that satisfy a
minterm predicate mi.
There are as many horizontal fragments
(also called minterm fragments) as there are
minterm predicates.
Completeness (1)
A set of simple predicate Pr is said to be complete if and only
if there is an equal probability of access by every application
to any two tuples belonging to any minterm fragment that is
defined according to Pr.
Simple Predicates Minterm Fragments
A1 k1
A2 = k2
A3 k3
F1
F2
p1
p1
p3
Applications
A1
p3
A2
A3
F3
A4 =
k4 The fragments look homogeneous
Complete
A4
Completeness (2)
Simple Predicates Minterm Fragments
A1 k1
A2 = k2
p1
F1
F2
p1
p3
F3
= k4
A1
p3
A2
A3
A3 k3
A4
Applications
p4
p5
A4
Set of simple
predicates is
incomplete
Completeness (2)
Simple Predicates Minterm Fragments
A1 k1
A2 = k2
A3 k3
A4
= k4
A5 >
Additional
simple
k5predicate
p1
F1
F2
Applications
p1
p3
A1
p3
A2
A3
F31
F3
p4
p5
A4
F32
Now complete !
Completeness (4)
A set of simple predicate Pr is said to be complete if and only
if there is an equal probability of access by every application
to any two tuples belonging to any minterm fragment that is
defined according to Pr.
LOC=Montreal
J1
LOC=New York
J2
LOC=Orlando
J3
LOC=Montreal,
Pr = LOC=New York,
LOC=Orlando
is complete because each tuple of each
fragment has the same probability of
being accessed.
Completeness (5)
Example:
J1
J2
LOC=Montreal,
Pr = LOC=New York,
LOC=Orlando
J3
JNO
001
JNAME
Instrumental
JNO JNAME
004
GUI
007 CAD/CAM
BUDGET
150,000
LOC
Montreal
BUDGET
LOC
135,000 New York
250,000 New York
JNO JNAME
003 Database Dev.
BUDGET LOC
310,000 Orlando
Completeness (6)
BUDGET<=200,000
LOC=Montreal
J1
LOC=New York
J2
LOC=Orlando
J3
J11
J12
BUDGET>200,000
BUDGET<=200,000
J21
Small-budget applications
J22
BUDGET>200,000
BUDGET<=200,000
J31
J32
BUDGET>200,000
Note: Completeness is a
desirable property because a
complete set defines
fragments that are not only
logically uniform in that they
all satisfy the minterm
predicate, but statistically
homogeneous.
Redundant Fragmentation
Logically
uniform &
statistically
homogeneous
fragment
Fragment 1
Fragment 2
Minimality
Relevant:
Let mi and mj be two almost identical minterm predicates:
mi =
p 1 p 2 p3
fragment fi
mj =
p 1 p2 p 3
fragment fj
acc(m j )
acc(mi )
card ( f i ) card ( f j )
p1
f1
p3
f12
p2
fi
p2
fj
Access frequency
Cardinality
Prob1
Prob2
A
Prob1 Prob2
Minimality
Relevant:
Let mi and mj be two almost identical minterm predicates:
mi =
p 1 p 2 p3
mj =
p 1 p2 p 3
fragment fi
fragment fj
acc(m j )
acc(mi )
card ( f i ) card ( f j )
Access frequency
Cardinality
That is, there should be at least one application that accesses fi and fj
differently.
i.e., The simple predicate pi should be relevant in determining a
fragmentation.
Minimal:
minimal.
BUDGET<=200,000
LOC=Montreal
J1
LOC=New York
J2
LOC=Orlando
J3
JNAME = Instrument
J11
J121
J12
J122
BUDGET>200,000
JNAME! Instrument
BUDGET<=200,000
J21
[ JNAME = Instrument ]
is not relevant.
J22
BUDGET>200,000
BUDGET<=200,000
J31
J32
BUDGET>200,000
Relevant
Irrelevant
Application Information
Qualification Information
The fundamental qualification information
consists of the predicates used in user
queries (i.e., where clauses in SQL).
80/20 rule: 20% of user queries account
for 80% of the total data access.
Qualitative
information
guides the
fragmentation
activity
Quantitative Information
Minterm Selectivity sel(mi): number of
tuples that would be accessed by a query
specified according to a given minterm
predicate.
Access Freequency acc(qi): the access
frequency of queries in a given period.
Quantitative
information
guides the
allocation
activity
Implications:
i1 : ( SAL 30,000) ( SAL 30,000)
i 2 : ( SAL 30,000) ( SAL 30,000)
i 3 : ( SAL 30,000) ( SAL 30,000)
i 4 : ( SAL 30,000) ( SAL 30,000)
i1 m1 is contradictory
i 2 m 4 is contradictory
Therefore, we are left with
M = {m2, m3}
Invalid Implications
J
JNO
JNAME
J1 Instrumental
J2 Database Dev.
J3 CAD/CAM
J4 Maintenance
Simple predicates
p1: LOC = Montreal
p2: LOC = New York
p3: LOC = Orlando
p4: BUDGET 200,000
p5: BUDGET > 200,000
BUDGET
LOC
150,000
Montreal
135,000
New York
250,000
New York
350,000
Orlando
VALID Implications
i 1 : p 1 p 2 p 3
i 2 : p 2 p 1 p 3
i 3 : p 3 p 1 p 2
i 4 : p 4 p 5
i 5 : p 5 p 4
i 6 : p 4 p 5
i 7 : p 5 p 4
INVALID Implications
i 8 : LOC " Montreal" ( BUDGET 200,000)
i 9 : LOC " Orlando " ( BUDGET 200,000)
Implications should be
defined according to the
semantics of the database,
not according to the
current values.
Primary Fragmentation:
PAY 1 (TITLE "Assistant Professor")( PAY )
PAY 2 (TITLE " Associate Professor") ( PAY )
PAY1
EMP2
PAY2
EMP3
PAY3
Not using derived fragmentation: one can divide EMP into EMP1
and EMP2 based on TITLE and divide PAY into PAY1, PAY2, PAY3
based on SAL. To join EMP and PAY, we have the following
scenarios.
PAY1
EMP1
EMP2
PAY2
EMP3
PAY3
More communication
overhead !
Chain Relationships
R1 (R1PK, )
R2 (R2PK, R1FK, )
R3 (R3PK, R2FK, )
...
Derived Fragmentation
EMP (ENO, ENAME, TITLE)
Join might
be required
VERTICAL FRAGMENTATION
Purpose: Identify fragments Ri such that
many applications can be executed using
just one fragment.
Advantage: When many applications which
use R1 and many applications which use R2
are issued at different sites, fragmenting
R avoids communication overhead.
A1
A7
R2
R1
Site 1
Site 2
Correctness:
Each attribute of R belongs to at least one
fragment.
Each fragment includes either a key of R or a
tuple identifier.
EMP(ENUM,NAME,SAL,TAX,MGRNUM,DNUM)
Administrative Applications
at Site 1
Applications
at all sites
NAME is
relatively
stable
Split Approach
A1
A2
A3
use(qi,Aj) =
A4
1 if Aj is referenced by qi
0 otherwise
A 1 A2 A 3 A4
q1
q3 0 1 0 1
q2
1 0 1 0
0 1 1 0
q4 0 0 1 1
Attribute Usage Matrix
aff ( Ai, Aj )
ref (q ) acc (q )
s
k , use ( qk , Ai ) 1 use ( qk , A j ) 1 s
For each query qk that uses both Ai and Aj
Popularity
of using
Ai and Aj
together
Relation R
qk
Ak
Site n
qi
qi
ref s (qk )
Site m
Ai
Aj
qi
Site s
qk
qi
accs (qk )
k , use ( qk , Ai ) s use ( qk , A j ) s
ref (q ) acc (q )
s
A1
A4
A 1 A 2 A 3 A4
A2
A3
Attribute Affinity Matrix
q1
A1 A 2
A3
A4
1 0 1 0
0 1 1 0
A1 45 0 45 0
A2 0 80 5 75
q3 0 1 0 1
q4 0 0 1 1
A3 45 5 53 3
A4 0 75 3 78
q2
A 1 A2
A3
A4
A1
A2
A1 45 0 45 0
A2 0 80 5 75
A1
A3 45 5 53 3
A4 0 75 3 78
A3 45 5
A4 0 75
A2
A3 A 4
45 0
0 80
A0
A 1 A2
A3
A 1 A2 A 3
A0 A3 A 1
A4
A5
A1 45 0 45 0
A2 0 80 5 75
A3 45 5 53 3
A4 0 75 3 78
Attribute Affinity Matrix (AA)
A1 A3 A2
A1
A2
A1
45
A2
A3
0 80
45 5
A4
A0
A3 A 4
A5
75
Contribution
Cont(A1,A3,A2) = 10150
Cont(A2,A3,A4) = 1780
A 1 A2
A3
A4
A1
A3
A2 A 4
A1 45 0 45 0
A2 0 80 5 75
A1 45 45 0
A2 0 5 80
A3 45 5 53 3
A4 0 75 3 78
A3 45 53 5
A4 0 3 75
A 1 A2
A3
A4
A1
A3
A2
A4
A1 45 0 45 0
A2 0 80 5 75
A1 45 45 0 0
A2 0 5 80 75
A3 45 5 53 3
A4 0 75 3 78
A3 45 53 5 3
A4 0 3 75 78
A3
A2
A4
A1
A3
A2
A4
0
3
A1 45 45 0 0
A2 0 5 80 75
A1 45 45
A3 45 53
0
5
A3 45 53 5 3
A4 0 3 75 78
A2 0
A4 0
80 75
75 78
5
3
Partitioning
Find the sets of attributes
that are accessed, for the
most part, by distinct sets
of applications
We look for a good dividing
points along the diagnose
A1
A3
A2
A4
A1 45 45
A3 45 53
0
5
0
3
A2
80 75
75 78
A4
0
0
5
3
Cluster 1:
Cluster 2:
A 1 & A3
A2 & A4
A4 and A3
are
usually
not
accessed
together
A4 and A2
are often
accessed
together
MIXED FRAGMENTATION
Apply horizontal fragmentation to vertical fragments.
Apply vertical fragmentation to horizontal fragments.
Example: Applications about work at each department reference tuples
of employees in the departments located around the site with 80%
probability.
EMP(ENUM,NAME,SAL,TAX,MGRNUM,DNUM)
ENUM NAME TAX SAL
ED
T
LA K
E
R OR
T
W
NO
ENUM
TO
Vertical fragmentation
NAME
MGRNUM
Jacksonville
Orlando
Miami
R
O
W
DNUM
TE
A
L
RE
Horizontal
Fragmentation
(local work)
i:
fragment index
j:
site index
k:
application index
fkj:
the frequency of
application k at site j
rki:
uki:
ALLOCATION
Notations
Site j
Fragment i
rki
uki
Application k
/w freq. fkj
Bij f kj nki
k
All applications k
at Site j
Number of
Access by k
Frequency of
application k
Fragment i
Site j
Savings due to
retrieval
references
j ' j k
Cost of update
references from
other sites
Fi
(di)
(d i ) (1 21 d i ) F i
j ' j
Fi
di
A1
A3
Ri
Rs
Application type A1
at site PSr , that
accesses only Rs
B ist
At
A4
PSs
PSt
PS4
Applications
of type As
at PSs
As
As
f ks n ks
A2
At
f kt n kt
k
f kt n kt
A2
Rt
A3
PSr
...
2 f
A1
ki
An
As
PSn
PSs
A1
4 l n k
Rs
Al
f kl n ki
A2
Rt
A4
f ks n ks
n ki
A3
..
.
An
At
PS4
PSn
PSt
SUMMARY
Design of a distributed DB consists of four phases:
Vertical Fragmentation
Phase 3: Allocation
Database Integration
Bottom-up Design
Overview
The design process in
multidatabase systems is
bottomup.
The individual databases
actually exists
Designing the global
conceptual schema (GCS)
involves integrating these
local databases into a
multidatabase.
Database 1
Database 2
Database 3
Translator 1
Translator 2
Translator 3
InS1
InS2
Intermediate
schema in
canonical
representation
INTEGRATOR
GCS
InS3
EMPLOYEE
(member records)
Vu, K.
Work
Worksin
Dummy
Record Type
Engineer
Name
ENGINEER
Title
Project
No.
Responsibility
WORKS
IN
Project
Name
Budget
PROJECT
N
Salary
Location
CONTRACTED
BY
Duration
Contract
Date
CLIENT
Client
Name
Address
ENAME
N
E
N
ENAME
E
TITLE
SAL
JNO
JNAME
BUDGET
LOC
CNAME
S
TITLE
ENO
G
DUR
PAY
E & J have a many-tomany relationship
E & S have a 1-to-many
relationship
RESP
RESP
G
DUR
JNO
JNAME
SAL
Treat salary as
an attribute of
an engineer
entity
BUDGET
CNAME
LOC
WORK
EMPLOYEE
N
Works-in
Employs
WORK
EMPLOYS
Dummy
record type
DEPARTMENT
DEPARTMENT
N
EMPLOYS
M
WORKS-IN
1
EMPLOYEE
EMPLOYEE
After Translation
Engineer
No.
Database 3
Engineer
Name
N
ENGINEER
Title
Project
No.
Responsibility
Budget
WORKS
IN
PROJECT
N
Salary
ENO
ENAME
Contract
Date
E
TITLE
RESP
SAL
G
DUR
JNO
Location
CONTRACTED
BY
Duration
Database 1
Project
Name
CLIENT
JNAME
Client
Name
BUDGET
Address
LOC
CNAME
Database 2
DEPARTMENT
EMPLOYS
EMPLOYEE
Schema Integration
Schema integration follows the translation
process and generates the GCS by
integrating the intermediate schemas.
Identify the components of a database which
are related to one another.
Two components can be related as (1) equivalent, (2)
one contained in the other one, (3) overlapped, or (4)
disjoint.
Integration Methodologies
Integration
Process
Binary
Ladder
Balanced
N-ary
One-shot
Iterative
Integration Process
Schema integration occurs in a sequence of four
steps:
1) Preintegration: establish the rules of the integration
process before actual integration occurs.
2) Comparison: naming and structural conflicts are identified.
3) Conformation: resolve naming and structural conflicts
4) Merging and restructuring: all schemas must be merged
into a single database schema and then restructured to
create the best integrated schema.
Step 1: Preintegration
1. An integration method (binary or n-ary) must be
selected and the schema integration order defined.
The order implicitly defines priorities.
Preintegration
Example
InS1
InS2
InS3
Integration method
InS2: E# in EMPLOYEE
Dept-name in DEPARTMENT
InS3: Eno in E
Jno in J
InS3
E
Engineering No
Eno
Engineer Name
Ename
Salary
Sal
WORKSIN
Responsibility
Resp
Duration
Dur
PROJECT
Project No
Jno
Project Name
Jname
Location
Loc
InS1
In InS2, EMPLOYEE.Title
refers to the title of all
employees.
domain (EMPLOYEE.Title) >> domain (ENIGNEREER.Title)
Name
EMPLOYEE
Title
Address
IS-A relationship
EMPLOYS
Salary
DEPARTMENT
EMPLOYS
Dur
JNO
M
Jname
CONTRACTED
BY
Budget
J
Cname
InS3
Loc
Contract
Date
CLIENT
Client
Name
Address
PROJECT
InS1
Dependency conflicts:
occur when different
relationship modes are
used to represent the
same thing in different
schemas.
Eno
Engineer
No.
Title
ENGINEER
Title
ENGINEER
Sal
WORKS
IN
EMPLOYS
Dur
PROJECT
InS1
Salary
This is
many-to-many
Resp
Ename
Project
No.
Engineer
Name
InS3
Step 3: Conformation
Naming Conflicts
Naming conflicts are resolved simply by renaming
conflict ones.
Synonyms: rename the schema of InS3
to conform to the naming of InS1.
InS3
E
InS1
ENGINEER
Eno
Engineering No
Engineering No
Engineer Name
Sal
Salary
Salary
WORKSIN
Resp
Responsibility
Responsibility
Dur
Duration
Duration
PROJECT
Jno
Project No
Project No
Project Name
Loc
Location
Location
Homonyms:
Prefix each attribute
by the name of the
entity to which it
belong,
e.g., ENGINEER.Title
EMPLOYEE.Title
InS3
ENGINEER
Title
Salary
Engineer
No.
WORKS
IN
PROJECT
Salary
Location
Client
Name
Project
No.
Responsibility
Engineer
Name
Project
Name
Budget
Duration
ENGINEER
Title
Project
No.
Responsibility
Engineer
Name
WORKS
IN
Budget
PROJECT
N
C-P
Duration
Example:
Transform the attribute Client name in
InS3 to an entity C to make InS3
conform to the presentation of InS1.
Project
Name
M
Client
Name
Location
New
InS3
InS2
(Employees)
InS3
InS1
(Engineers) (Engineers)
InS1
Engineer
Name
ENGINEER
Title
Project
No.
Responsibility
WORKS
IN
Salary
ENGINEER
Title
Responsibility
Engineer
Name
Salary
WORKS
IN
PROJECT
N
CONTRACTED
BY
Duration
M
Client
Name
CLIENT
Client
Name
Project
Name
Address
Budget
Location
InS3
Budget
Location
CONTRACTED
BY
Contract
Date
Engineer
No.
PROJECT
N
Duration
Project
No.
Project
Name
InS3 is
more
general
Duration
Responsibility
WORKS
IN
Project
No.
Project
Name
Budget
PROJECT
Location
E#
Name
EMPLOYEE
N
EMPLOYS
InS2
Title
Address
Manager
CLIENT
Client
name
InS1/InS3
Address
SAL
DEPARTMENT
Budget
CONTRACTED
BY
Dept-name
Query Processing in
Multidatabase Systems
Schema Integration
Local Schema 1
Local Schema 2
Local Schema 3
Translator 1
Translator 2
Translator 3
InS1
InS2
Q1,1
Q1,2
InS3
Q1,3
INTEGRATOR
Q1
GCS
Schema Integration
Local Schema 1
Q1,1
Local Schema 2
Q1,3
Q1,2
Translator 1
Translator 2
InS1
InS2
Q1,1
Local Schema 3
Q1,2
Translator 3
InS3
Q1,3
INTEGRATOR
Q1
GCS
Final
answer
Combine
Schema Integration
Local Schema 1
Q1,1
Local Schema 2
Q1,3
Q1,2
Translator 1
Translator 2
InS1
InS2
Q1,1
Local Schema 3
Q1,2
Translator 3
InS3
Q1,3
INTEGRATOR
Q1
GCS
Global query is
decomposed into local
queries
Schema Integration
Local Schema 1
Local Schema 2
Local Schema 3
Translator 1
Translator 2
Translator 3
InS1
InS2
INTEGRATOR
GCS
InS3
Outline
Overview of major query processing
components in multidatabase systems:
Query Decomposition
Query Translation
Global Query Optimization
Query Decomposition
Query Decomposition
Overview
Global Query
SQ1
Query
translator 1
TQ1
DB1
SQ2
...
Query
translator 2
TQ2
...
DB2
SQn
Query
translator n
TQn
DBn
PQ1
PQn
Assumptions
We use the object-oriented data model to
present a query decomposition algorithm
To simplify the discussion, we assume that
there are only two export schemas:
ES1
Emp1: SSN
Name
Salary
Age
ES2
Emp2: SSN
Name
Salary
Rank
Definitions
type: Given a class C, the type
World
Type
Extension
A Class
Schema Integration
Integration through outerjoin
Integration through outerunion
(generalization)
Review: Outerjoin
The outerjoin of relation R1 and R2
(R1 o R2 ) is the union of three
components:
the join of R1 and R2,
dangling tuples of R1 padded with null
values, and
dangling tuples of R2 padded with null
values.
Outerjoin Example
EmpO = Emp1 o Emp2
Emp1
OID
SSN
Name
Salary
Age
OID
SSN
Name
Salary
Age
Rank
6789
Smith
90,000
40
2222
Ahad
98,000
null
S. Mgr.
4321
Chang
62,000
30
8642
Patel
75,000
35
7531
Wang
95,000
mull
S. Mgr.
6789
Smith
Inconsistent
40
Mgr.
4321
Chang
62,000
30
null
8642
Patel
75,000
35
null
Emp2
OID
SSN
Name
Salary
Rank
2222
Ahad
98,000
S. Mgr.
7531
Wang
95,000
S. Mgr.
6789
Smith
25,000
Mgr.
Dangling Tuple
Dangling Tuple
Outerunion
Emp1
OID
SSN
Name
Salary
Age
6789
Smith
90,000
40
4321
Chang
62,000
30
8642
Patel
75,000
35
Emp2
OID
SSN
Name
Salary
Age
Rank
2222
Ahad
98,000
null
S. Mgr.
7531
Wang
95,000
mull
S. Mgr.
6789
Smith
Conflict
null
Mgr.
6789
Smith
Conflict
40
null
OID
SSN
Name
Salary
Rank
4321
Chang
62,000
30
null
2222
Ahad
98,000
S. Mgr.
8642
Patel
75,000
35
null
7531
Wang
95,000
S. Mgr.
6789
Smith
25,000
Mgr.
Outer
union
Generalization Example
Emp1:
SSN
Name
Salary
Age
Emp2: SSN
Name
Salary
Rank
EmpG: SSN
Name
Salary
Generalization
More
specific
Rank Emp2
Inconsistency Resolution
The schema
integration techniques
work as long as there
is no data
inconsistency
If data inconsistency
occurs, aggregate
functions may be used
to resolve the
problem.
Integrated Schema
EmpG: SSN
EmpO:
Generalization
Name
or
Salary
Outer
join
Rank
(Emp1)
World (Emp2)
Query Decomposition
Assume
Outerjoin is
used for
schema
integration
world(Emp1)
2
world(Emp2)
= Sum(Emp1.Salary,Emp2.Salary), if
EmpO is in world(Emp1) world(Emp2)
Query Decomposition
world(Emp1)
2
world(Emp2)
Query Decomposition
2
world(Emp2)
2
world(Emp2)
We use Option 1 since it is the finest partition among all the partitions.
Query Decomposition
Another Example
Option 1:
Option 2:
world(Emp1)
world(Emp1)
world(Emp2)
world(Emp2)
2
world(Emp2)
Query Decomposition
world(Emp1)
2
world(Emp2)
Query Decomposition:
part. 3: Select Emp1.Name,
Obtain a query for each
Emp2.Rank
subset in the chosen
From Emp1, Emp2
partition.
EmpO.Age
= Emp1.Age, if EmpO is in world(Emp1)
= Null, if EmpO is in world(Emp2) world(Emp1)
Where Sum(Emp1.Salary,
>
Query Decomposition
partition.
Emp1.Salary
3
Emp1.Age Emp1.Salary +
Emp2.Salary
Emp2.Salary
Emp1.Age
Age = null
world(Emp1)
world(Emp2)
Query Decomposition
Select Emp1.Name
From Emp1
Where Emp1.Salary > 80,000
and
Emp1. Age > 35 and
Emp1.SSN NOT IN X
X
Insert INTO X
Select Emp2.SSN
From Emp2)
Query Decomposition
Step 4: Query Optimization
Query Translation
THEN
Translator
Local
Query
Language
Relation-to-OO Translation
OODB Schema:
Auto
OID
Color
Manufacturer
Company
OID
Name
Profit
Headquarter
President
People
OID
Name
Hometown
Automobile
Age
Foreign
key
Equivalent Relational Schema:
City
OID
Name
State
Relational-to-OO
Example
Global Query:
1
2
3
4
5
6
Select Auto1.*
Relational Predicate Graph:
From
Auto Auto1, Auto Auto2,
Company, People,
1) Company-OID
Auto1
Company
City City1, City City2
D
Where Auto1.Company-OID =
OI
2) People-OID
ity
C
Company.Company-OID AND
6)
n)
i
o
Company.People-OID =
(J
People
City2
People.People-OID AND
Age=52
D
People.Age = 52 AND
OI
3) Auto-OID
y
ame
it
People.Auto-OID =
C
5) N
4)
Auto2.Auto-OID AND
Auto2
Auto2.Color = red AND
City1
Color=red
People.City-OID =
City1.City-OID AND
City1.Name = City2.Name AND
Find all red cars own by a 52 year old
Company.City-OID =
who is the President 1+2+3of the car
City2.City-OID
Company-OID
City1
Relational Predicate Graph:
Auto1
1) Company-OID
6)
City2
ame
5) N
City1
O
yt
i
C
ID
(Jo
in)
Company
2) People-OID
People
Age=52
C
4)
I
-O
y
t
i
3) Auto-OID
Auto2
Color=red
Name
City2
Company
ID
r)
y-O arte
t
i
C
qu
ad
e
(H
D
OI n)
y
Cit etow
om
(H
People-OID
People
Age=52
Auto-OID
Auto2
Color=red
Company-OID
City1
Predicate 3
Name
City2
Company
ID
r)
y-O arte
t
i
C
qu
ad
e
(H
D
OI n)
y
Cit etow
om
(H
People-OID
People
Predicate 1
Age=52
Auto-OID
Auto2
Color=red
Predicate 2
OO Query:
1
Where Auto.Manufacturer.President.Age = 52 AND
Auto.Manufacturer.President.Automobile.Color = red
2
AND
Auto.Manufacturer.Headquarter.Name =
3
Auto.Manufacturer.President.Hometown.Name
Site 1
Emp1
Age > 35
Emp1
Site 2
Site 1
Emp2
Emp1
form
result
form
result
1+2
form
result
Emp2
Emp2
Site 3
Site 2
Site 2
Data Inconsistency
If C is integrated from C1 and C2 with no
data inconsistency on attribute A, then
A op a
(C) =
A op a (C1) A op a (C2)
EmpO
OID
SSN
Name
Salary
Age
Rank
2222
Ahad
98,000
null
S. Mgr.
7531
Wang
95,000
mull
S. Mgr.
6789
Smith
Inconsistent
40
Mgr.
4321
Chang
62,000
30
null
8642
Patel
75,000
35
null
Expensive operation
Attribute A is defined by
an aggregate function to
resolve inconsistency,
e.g., sum(Emp1.Salary, Emp2.Salary)
Example:
max(Emp1-C.Salary,
An aggregate
function
2. f(A1,A2) op a f(A1 op a, A2 op a) op a:
A op a(C1-C C2-C) = A op a(A1 op a(C1-C) A2 op a(C2-C))
Example:
sum(A1, A2)
Not
> < = in
in
4 4 2 2 3 4 4 4
avg(A1, A2)
4 4 2 2 3 4 4 4
max(A1, A2)
4 4 1 1 3 4 4 4
min(A1, A2)
1 1 4 4 3 4 4 4
No
improvement
possible
Techniques:
Examples:
A local database system may not allow update operations
For many nonrelational systems, the instances of one
entity set are more likely to be clustered with the
instances of other entity sets. Such clustering makes it
very expensive to extract data for one entity set.
Need more sophisticated decomposition algorithms.