0% found this document useful (0 votes)
97 views

Distributed DB

A distributed database system is a collection of databases distributed over different computers in a network. Each database has local processing capabilities and participates in global applications requiring data from multiple sites. Distributed databases provide benefits like improved reliability, availability, performance and allow for incremental growth but also introduce challenges around complexity, security and cost of replication. Their design involves decisions around data fragmentation, allocation, and schema integration to support both local and global access needs.
Copyright
© © All Rights Reserved
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
97 views

Distributed DB

A distributed database system is a collection of databases distributed over different computers in a network. Each database has local processing capabilities and participates in global applications requiring data from multiple sites. Distributed databases provide benefits like improved reliability, availability, performance and allow for incremental growth but also introduce challenges around complexity, security and cost of replication. Their design involves decisions around data fragmentation, allocation, and schema integration to support both local and global access needs.
Copyright
© © All Rights Reserved
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
You are on page 1/ 146

Distributed Database Systems

COP5711

What is a Distributed Database System ?


A distributed database is a collection of databases which are distributed
over different computers of a computer network.
Each site has autonomous processing capability and can perform local
applications.
Each site also participates in the execution of at least one global
application which requires accessing data at several sites.

Database 1
Database 3

Server 1

Communication Network
Server 3

Database 2

Server 2

Multiprocessor Database Computers


Cannot run an
application
by itself

Application
(front-end)
computer

Interface
Processor

Access
Processor
Access
Processor
Access
Processor

What we miss here is the existence of local


applications, in the sense that the integration of the
system has reached the point where no one of the
computers (i.e., IFPs & ACPs) is capable of executing
an application by itself.

Why Distributed Databases ?


1.

Local Autonomy: permits setting and enforcing local policies regarding


the use of local data (suitable for organization that are inherently
decentralized).

2. Improved Performance: The regularly used data is proximate to the


users and given the parallelism inherent in distributed systems.

3. Improved Reliability/Availability:

Data replication can be used to obtain higher reliability and


availability.
The autonomous processing capability of the different sites
ensures a graceful degradation property.

4. Incremental Growth: supports a smooth incremental growth with a


minimum degree of impact on the already existing sites.

5. Shareability: allows preexisting sites to share data.


6. Reduced Communication Overhead: The fact that many applications
are local clearly reduces the communication overhead with respect to
centralized databases.

Disadvantages of DDBSs
Cost: replication of effort (manpower).
Security: More difficult to control
Complexity:

The possible duplication is mainly due to reliability and


efficiency considerations. Data redundancy, however,
complicates update operations.

If some sites fail while an update is being executed, the


system must make sure that the effects will be
reflected on the data residing at the failing sites as
soon as the system can recover from the failure.

The synchronization of transactions on multiple sites is


considerably harder than for a centralized system.

Distributed DBMS
Architecture

NetworkTransparancy
The user should be protected from the
operational details of the network.
It is desirable to hide even the existence
of the network, if possible.
Location transparency: The command used is
independent of the system on which the data is
stored.
Naming transparency: a unique name is
provided for each object in the database.

Replication & Fragmentation


Transparancy
The user is unaware of the replication of
framents
Queries are specified on the relations
(rather than the fragments).
Copy 1 of R1

Site A

Copy 1 of R2

Relation R

Fragment R1
Fragment R2

Copy 2 of R1

Site B

Fragment R3

Site C

Fragment R4
Copy 2 of R2

ANSI/SPARC Architecture
External Schema

External
view

External
view

Conceptual Schema

Conceptual
view

Internal Schema

Internal
view

External
view

Internal view: deals with the physical definition and organization of data.
Conceptual view: abstract definition of the database. It is the real
world view of the enterprise being modeled in the database.

External view: individual users view of the database.

A Taxonomy of Distributed Data Systems


A distributed database
can be defined as
a logically integrated
collection of shared
data which is
physically distributed
across the nodes of a
computer network.

Distributed data systems

Homogeneous

Federated

Loosely coupled
(interoperable DB
systems using
export schema)

Heterogeneous
(Multidatabase)

Unfederated
(no local users)

Tightly coupled
(/w global schema)

Architecture of a Homogeneous DDBMS


Global user
view 1

Global user
view n

A homogeneous
DDBMS resembles a

Global Schema

centralized DB, but

Fragmentation
Schema

instead of storing all

Allocation
Schema

the data at one site,


the data is distributed

Local
conceptual
schema 1

Local
conceptual
schema n

Local
internal
schema 1

Local
internal
schema n

Local DB 1

Local DB n

across a number of
sites in a network.

Fragmentation Schema & Allocation Schema


Fragmentation Schema: describes how the global
relations are divided into fragments.
Allocation Schema: specifies at which sites each
fragment is stored.
Example: Fragmentation of global relation R.
A

D
E

To materialize R, the following


operations are required:
R = (A B) U ( C D) U E

Homogeneous vs. Heterogeneous

Local
user

DBMS

Database 1

Homogeneous DDBMS
Global
No local users
user
Most systems do not have
local schemas (i.e., every user
Local
uses the same schema)
user
Multidatabase
Heterogeneous DDBMS
Management
There are both local and
system
global users
Multidatabase systems are
split into:
Tightly Coupled Systems:
DBMS DBMS DBMS
have a global schema
Loosely Coupled Systems:
do not have a global
Database 2
Database 3
Database 4
schema.

Schema Architecture of a TightlyCoupled System


Global user
view 1

Global user
view n

Global Conceptual Schema


Auxiliary
Schema 1

Local user
view 1

Local user
view 2

An individual nodes
participation in the MDB
is defined by means of a
participation schema.

Local
Participation
Schema 1

Local
Participation
Schema 1

Auxiliary
Schema 1

Local
Conceptual
Schema 1

Local
Conceptual
Schema 1

Local user
view 1

Local
Internal
Schema 1

Local
Internal
Schema 1

Local user
view 2

Local DB 1

Local DB 1

Auxiliary Schema (1)


Auxiliary schema describes the rules which
govern the mappings between the local and
global levels.
Rules for unit conversion: may be required when
one site expresses distance in kilometers and
another in miles,
Rules for handling null values: may be necessary
where one site stores additional information which
is not stored at another site.
Example: One site stores the name, home address and
telephone number of its employees, whereas another just
stores names and addresses.

Auxiliary Schema (2)


Rules for naming conflicts: naming conflicts occur when:
semantically identical data items are named differently
DNAME Department name (at Site 1)
DEPTNAME Department name (at Site 2)

semantically different data items are named identically.


NAME Department name (at Site 1)
NAME Manager name (at Site 2)

Rules for handling data representation conflicts: Such


conflicts occur when semantically identical data items
are represented differently in different data source.
Example: Data represented as a character string in one
database may be represented as a real number in the other
database.

Auxiliary Schema (3)


Rules for handling data scaling conflicts: Such
conflicts occur when semantically identical
data items stored in different databases using
different units of measure.
Example: Large, New, Good, etc.

These problems are called


domain mismatch problems

Loosely-Coupled Systems

(Interoperable Database Systems)

Local
user view 1
Local
user view 2

Global
user view 1

Global
user view 2

Global
user view 3

Local
Conceptual
schema 1

Local
Conceptual
Schema 2

Local
Conceptual
Schema n

Local
internal
schema 1

Local
internal
Schema 2

Local
internal
Schema n

Local DB 1

Local DB 2

Local DB n

Loosely-Coupled Systems
Global
user view 1

Export
schema 1
Local
user view 1
Local
user view 2

Global
user view 2

Export
schema 2

Export
Schema 3

Global
user view m

Export
Schema n

Local
Conceptual
schema 1

Local
Conceptual
Schema 2

Local
Conceptual
Schema n

Local
internal
schema 1

Local
internal
Schema 2

Local
internal
Schema n

Local DB 1

Local DB 2

Local DB n

Integration of Heterogeneous Data Models


Provide bidirectional translators between all
pairs of models

Advantage: support multiple models at the global level.


No need to learn another data model and language
Disadvantage: requires n(n-1) translators, where n is
the number of different models.

Adopt a single model (called canonical model) at


the global level and map all the local models onto
this model

Advantage: requires only 2n translators


Disadvantage: translations must go through the global
model.
(The 2nd approach is more widely used)

Distributed Database Design


Top-Down Approach: The database system is
being designed from scratch.
Issues: fragmentation & allocation
Bottom-up Approach: Integrating existing
databases into one database
Issues: Design of the export and global
schemas.

TOP-DOWN DESIGN PROCESS


Requirements Analysis
Entity analysis +
functional
analysis

System Requirements
(Objectives)

Conceptual
design
Global
conceptual
schema

Defining the
interfaces for
end users

View integration

View Design

Access
information

External Schema
Definitions

Distribution Design
Local Conceptual Schemas
Maps the local
conceptual
schemas to
physical storage
devices

Physical Design
Physical Schema

Fragmentation
& allocation

Design Consideration (1)


The organization of distributed systems can be
investigated along three dimensions:
Level of sharing
1. No sharing: Each application and its data
execute at one site.
2. Data sharing: Programs are replicated at all
sites, but data files are not.
3. Data + Program Sharing: Both data and
programs may be shared.

Design Consideration (2)


Access Pattern
1. Static: Access patterns do not change.
2. Dynamic: Access patterns change over
time.
Level of Knowledge
3. No information
4. Partial information: Access patterns may
deviate from the predictions.
5. Complete information: Access patterns
can reasonably be predicted.

Fragmentation Alternatives
J

JNO
J1
J2
J3
J4

JNAME

BUDGET

Instrumental
Database Dev.
CAD/CAM
Maintenance

150,000
135,000
250,000
350,000

Montreal
New York
New York
Paris

Horizontal Partitioning
J1

JNO
J1
J2

J2

JNO

JNAME
Instrumental
Database Dev.
JNAME

J3 CAD/CAM
J4 Maintenance.

LOC

Vertical Partitioning

BUDGET

LOC

150,000
135,000

Montreal
New York

BUDGET
150,000Montreal
310,000 Paris

LOC

JNO
J1
J2
J3
J4

BUDGET
150,000
135,000
250,000
310,000

JNO JNAME
J1
J2
J3
J4

LOC

Instrumentation
Database Devl
CAD/CAM New
Maintenance

Montreal
New York
York
Paris

Why fragment at all?


Reasons:
Interquery concurrency
Intraquery concurrency

Disadvantages:
Vertical fragmentation may incur overhead.
Attributes participating in a dependency may be
allocated to different sites.

Integrity checking is more costly.

Degree of Fragmentation
Application views are usually subsets of
relations. Hence, it is only natural to
consider subsets of relations as
distribution units.
The appropriate degree of fragmentation
is dependent on the applications.

Correctness Rules
Vertical Partitioning
Lossless
decomposition
Dependency
preservation
Horizontal Partitioning
Disjoint fragments

Allocation Alternatives
Partitioning: No replication
Partial Replication: Some
fragments are replicated
Full Replication: Database
exists in its entirety at
each site

Notations
S

Title SAL

L1

ENO ENAME TITLE

JNO JNAME BUDGET LOC

L2

L3

ENO JNO RESP DUR

L1: 1-to-many relationship


S: Owner(L1), Source relation
E: Member(L1), Target relation

Simple Predicates
Given a relation R(A1, A2, , An) where Ai has domain Di, a
simple predicate pj defined on R has the form
pj: Ai Value
where

{, , , , , } and Value Di

Example:

JNO
JNAME
J1 Instrumental
J2 Database Dev.
J3 CAD/CAM
J4 Maintenance

BUDGET
LOC
150,000 Montreal
135,000 New York
250,000 New York
350,000 Orlando

Simple predicates:

p1: JNAME = Maintenance


P2: BUDGET < 200,000

Note: A simple predicate defines a data

MINTERM PREDICATE
Given a set of simple predicates for relation R.
P = {p1, p2, , pm}
The set of minterm predicates
M = {m1, m2, , mn}
is defined as

M = {mi | mi =
where

p *j

p j P

TITLE
Elect. Eng.

SAL
40,000

Syst. Analy.

54,000

Mech. Eng.

32,000

Programmer

42,000

p *j p j or p *j p j

Possible simple predicates:


P1: TITLE=Elect. Eng.
P2: TITLE=Syst. Analy
P3: TITLE=Mech. Eng.
P4: TITLE=Programmer
P5: SAL 35,000
P6: SAL > 35,000

Some corresponding
minterm predicates:
m1 : TITLE " Elect .Eng." SAL 30,000
m 2 : TITLE " Elect .Eng" SAL 30,000

A minterm predicate defines


a data fragment

Primary Horizontal Fragmentation


A primary horizontal fragmentation is defined by a selection
operation on the owner relations of a database schema.

ENO ENAME TITLE

JNO JNAME BUDGET LOC

L2

ENO JNO RESP DUR

L3

Owner(L3) = J

A possible fragmentation of J is defined as follows:


J1 BUDGET 200, 000 ( J )
J 2 BUDGET 200, 000 ( J )

Horizontal Fragments
Thus, a horizontal fragment Ri of relation R
consists of all the tuples of R that satisfy a
minterm predicate mi.
There are as many horizontal fragments
(also called minterm fragments) as there are
minterm predicates.

Completeness (1)
A set of simple predicate Pr is said to be complete if and only
if there is an equal probability of access by every application
to any two tuples belonging to any minterm fragment that is
defined according to Pr.
Simple Predicates Minterm Fragments
A1 k1

A2 = k2

A3 k3

F1
F2

p1
p1

p3

Applications

A1
p3

A2
A3

F3

A4 =
k4 The fragments look homogeneous
Complete

A4

Completeness (2)
Simple Predicates Minterm Fragments
A1 k1

A2 = k2

p1

F1
F2

p1
p3

F3
= k4

A1
p3

A2
A3

A3 k3

A4

Applications

p4
p5

A4
Set of simple
predicates is
incomplete

Completeness (2)
Simple Predicates Minterm Fragments
A1 k1

A2 = k2

A3 k3

A4

= k4

A5 >

Additional
simple
k5predicate

p1

F1
F2

Applications

p1
p3

A1
p3

A2
A3

F31
F3

p4
p5

A4

F32

Now complete !

Completeness (4)
A set of simple predicate Pr is said to be complete if and only
if there is an equal probability of access by every application
to any two tuples belonging to any minterm fragment that is
defined according to Pr.

J 1 LOC " MONTREAL " ( J )


J 2 LOC " NewYork " ( J )

Case 1: The only application that accesses


J wants to access the tuples according to
the location.

J 3 LOC " Orlando " ( J )

The set of simple predicates

LOC=Montreal

J1

LOC=New York

J2
LOC=Orlando

J3

LOC=Montreal,
Pr = LOC=New York,
LOC=Orlando
is complete because each tuple of each
fragment has the same probability of
being accessed.

Completeness (5)
Example:

J1

J2

LOC=Montreal,
Pr = LOC=New York,
LOC=Orlando

J3

JNO
001

JNAME
Instrumental

JNO JNAME
004
GUI
007 CAD/CAM

BUDGET
150,000

LOC
Montreal

BUDGET
LOC
135,000 New York
250,000 New York

JNO JNAME
003 Database Dev.

BUDGET LOC
310,000 Orlando

Case 2: There is a second application which accesses only those


project tuples where the budget is less than $200,000.
Since tuple 004 is accessed more frequently than tuple
007, Pr is not complete.
To make the the set complete, we need to add
(BUDGET< 200,000) to Pr.

Completeness (6)
BUDGET<=200,000

LOC=Montreal

J1

LOC=New York

J2
LOC=Orlando

J3

J11
J12
BUDGET>200,000
BUDGET<=200,000

J21

Small-budget applications

J22
BUDGET>200,000
BUDGET<=200,000

J31
J32
BUDGET>200,000

Note: Completeness is a
desirable property because a
complete set defines
fragments that are not only
logically uniform in that they
all satisfy the minterm
predicate, but statistically
homogeneous.

Redundant Fragmentation
Logically
uniform &
statistically
homogeneous
fragment

Fragment 1

Fragment 2

Fragments 1 and 2 have the same


characteristics
The fragmentation is unnecessary

Minimality
Relevant:
Let mi and mj be two almost identical minterm predicates:

mi =

p 1 p 2 p3

fragment fi

mj =

p 1 p2 p 3

fragment fj

p2 is relevant if and only if

acc(m j )
acc(mi )

card ( f i ) card ( f j )

p1

f1

p3

f12

p2

fi

p2

fj

Access frequency

Cardinality

Prob1
Prob2

A
Prob1 Prob2

Minimality
Relevant:
Let mi and mj be two almost identical minterm predicates:

mi =

p 1 p 2 p3

mj =

p 1 p2 p 3

fragment fi
fragment fj

p2 is relevant if and only if

acc(m j )
acc(mi )

card ( f i ) card ( f j )

Access frequency

Cardinality

That is, there should be at least one application that accesses fi and fj
differently.
i.e., The simple predicate pi should be relevant in determining a
fragmentation.

Minimal:

minimal.

If all the predicates of a set Pr are relevant, Pr is

A Complete and Minimal Example


Two applications:
1. One application accesses the tuples according
to location.
2. Another application accesses only those project
tuples where the budget is less than $200,000.
Case 1: Pr={Loc=Montreal, Loc=New York, Loc=Orlando,
BUDGET<=200,000,BUDGET>200,000} is

complete and minimal.

Case 2: If, however, we were to add the predicate


JNAME= Instrumentation to Pr, the resulting
set would not be minimal since the new predicate
is not relevant with respect to the applications.

BUDGET<=200,000

LOC=Montreal

J1

LOC=New York

J2
LOC=Orlando

J3

JNAME = Instrument

J11

J121

J12

J122

BUDGET>200,000

JNAME! Instrument

BUDGET<=200,000

J21

[ JNAME = Instrument ]
is not relevant.

J22
BUDGET>200,000
BUDGET<=200,000

J31
J32
BUDGET>200,000

Relevant

Irrelevant

Application Information
Qualification Information
The fundamental qualification information
consists of the predicates used in user
queries (i.e., where clauses in SQL).
80/20 rule: 20% of user queries account
for 80% of the total data access.

One should investigate the more


important queries.

Qualitative
information
guides the
fragmentation
activity

Quantitative Information
Minterm Selectivity sel(mi): number of
tuples that would be accessed by a query
specified according to a given minterm
predicate.
Access Freequency acc(qi): the access
frequency of queries in a given period.

Quantitative
information
guides the
allocation
activity

Determine the set of meaningful minterm predicates


Applications:
Take the salary and determine a raise accordingly.
The employee records are managed in two places, one handling the
records of those with salary less than or equal to $30,000 and the other
handling the records of those who earn more than $30,000.

Pr={p1: SAL<=30,000, p2: SAL>30,000} is complete and minimal.


The minterm predicates:
m1 : ( SAL 30,000) ( SAL 30,000)
m 2 : ( SAL 30,000) ( SAL 30,000)
m3 : ( SAL 30,000) ( SAL 30,000)
m 4 : ( SAL 30,000) ( SAL 30,000)

Implications:
i1 : ( SAL 30,000) ( SAL 30,000)
i 2 : ( SAL 30,000) ( SAL 30,000)
i 3 : ( SAL 30,000) ( SAL 30,000)
i 4 : ( SAL 30,000) ( SAL 30,000)

i1 m1 is contradictory
i 2 m 4 is contradictory
Therefore, we are left with
M = {m2, m3}

Invalid Implications
J

JNO
JNAME
J1 Instrumental
J2 Database Dev.
J3 CAD/CAM
J4 Maintenance

Simple predicates
p1: LOC = Montreal
p2: LOC = New York
p3: LOC = Orlando
p4: BUDGET 200,000
p5: BUDGET > 200,000

BUDGET
LOC
150,000
Montreal
135,000
New York
250,000
New York
350,000
Orlando

VALID Implications

i 1 : p 1 p 2 p 3
i 2 : p 2 p 1 p 3
i 3 : p 3 p 1 p 2
i 4 : p 4 p 5
i 5 : p 5 p 4
i 6 : p 4 p 5
i 7 : p 5 p 4

INVALID Implications
i 8 : LOC " Montreal" ( BUDGET 200,000)
i 9 : LOC " Orlando " ( BUDGET 200,000)

Implications should be
defined according to the
semantics of the database,
not according to the
current values.

Compute Complete & Minimal Set


Rule: a relation or fragment is partitioned into at least two parts which are
accessed differently by at least one application.
Relevant: a simple predicate which satisfies the above rule, is relevant.

Repeat until the predicate set is complete

Find a simple predicate pi that is relevant


Determine minterm fragments fi and fj according to pi
Accept pi , fi , and fj
Remove any pk and fk from acceptance list if pk becomes
irrelevant /* the list is minimal */

Determine the set of minterm predicates M (using


the acceptance list)
Determine the set of implications I (among the
acceptance list)
For each mi in M, remove mi if it is contradictory
according to I

Derived Horizontal Fragmentation


Derived fragmentation is used to facilitate the
join between fragments.
In some cases, the horizontal fragmentation of a
relation cannot be based on a property of its own
attributes, but is derived from the horizontal
fragmentation of another relation.

Benefits of Derived Fragmentation


PAY (TITLE, SAL)

Primary Fragmentation:
PAY 1 (TITLE "Assistant Professor")( PAY )
PAY 2 (TITLE " Associate Professor") ( PAY )

EMP (ENO, ENAME, TITLE)

PAY 3 ( TITLE " Full Professor")( PAY )

Using Derived Fragmentation:


EMP1

PAY1

EMP2

PAY2

EMP3

PAY3

EMP1 = EMP SJ PAY1


EMP2 = EMP SJ PAY2
EMP
SJ PAY
3 = EMP
3
EMPi
and PAYi
can be
allocated
to the same site.

Not using derived fragmentation: one can divide EMP into EMP1
and EMP2 based on TITLE and divide PAY into PAY1, PAY2, PAY3
based on SAL. To join EMP and PAY, we have the following
scenarios.
PAY1
EMP1

EMP2

PAY2

EMP3

PAY3

More communication
overhead !

Chain Relationships

R1 (R1PK, )
R2 (R2PK, R1FK, )
R3 (R3PK, R2FK, )
...

Design the primary


fragmenation for R1.
Derive the derived
fragmentation for Rk as
follows:
Rk = Rk SJRKFK=R(k-1)PK R(k-1)

for 2 k n in that order.

Derived Fragmentation
EMP (ENO, ENAME, TITLE)
Join might
be required

PROJ (PNO, PNAME, BUDGET)

EMP_PROJ (ENO, PNO, RESP, DUR)

How do we fragment EMP_PROJ ?


Semi-Join with EMP, or
Semi-Join with PROJ

Criterion: Suport the more-frequent join


operation

VERTICAL FRAGMENTATION
Purpose: Identify fragments Ri such that
many applications can be executed using
just one fragment.
Advantage: When many applications which
use R1 and many applications which use R2
are issued at different sites, fragmenting
R avoids communication overhead.

A1

A7

R2

R1

Site 1

Site 2

Vertical partitioning is more complicated than horizontal


partitioning:
Vertical Partitioning: The number of possible fragments is
equal to mm where m is the number of nonprimary key
attributes
Horizontal Partitioning: 2n possible minterm predicates can
be defined, where n is the number of simple predicates in the
complete and minimal set Pr.

Vertical Fragmentation Approaches


Greedy Heuristic Approaches:
Split Approach: Global relations are
progressively split into fragments.
Grouping Approach: Attributes are
progressively aggregated to constitute
fragments.

Correctness:
Each attribute of R belongs to at least one
fragment.
Each fragment includes either a key of R or a
tuple identifier.

Vertical Clustering - Replication


In evaluating the convenience of vertical
clustering, it is important that overlapping
attributes are not heavily updated.
Example:

EMP(ENUM,NAME,SAL,TAX,MGRNUM,DNUM)

Administrative Applications
at Site 1

Applications
at all sites

Bad Fragmentation: NAME not available in EMP2


1.
EMP1(ENUM,NAME,TAX,SAL)
2. EMP2(ENUM,MGRNUM,DNUM)
Good Fragmentation:
1.
EMP1(ENUM, NAME, TAX, SAL)
2. EMP2(ENUM, NAME, MGRNUM, DNUM)

NAME is
relatively
stable

Split Approach

Splitting is considered only for attributes that do


not participate in the primary key.

The split approach involves three steps:


1.

Obtain attribute affinity matrix.

2. Use a clustering algorithm to group some attributes


together based on the attribute affinity matrix. This
algorithm produces a clustered affinity matrix.
3. Use a partitioning algorithm to partition attributes
such that set of attributes are accessed solely or for
the most part by distinct set of applications.

Attribute Usage Matrix


PROJ

PNO PNAME BUDGET LOC

A1

q1: SELECT BUDGET


FROM PROJ
WHERE PNO=Value;

A2

A3

use(qi,Aj) =

A4

1 if Aj is referenced by qi
0 otherwise

A 1 A2 A 3 A4

q2: SELECT PNAME, BUDGET


FROM PROJ;

q1

q3: SELECT PNAME


FROM PROJ
WHERE LOC=Value;

q3 0 1 0 1

q4: SELECT SUM(BUDGET)


FROM PROJ
WHERE Loc=Value

q2

1 0 1 0
0 1 1 0

q4 0 0 1 1
Attribute Usage Matrix

Attribute Affinity Measure

aff ( Ai, Aj )

ref (q ) acc (q )
s

k , use ( qk , Ai ) 1 use ( qk , A j ) 1 s
For each query qk that uses both Ai and Aj
Popularity
of using
Ai and Aj
together

Relation R

qk

Ak

Site n

qi
qi

ref s (qk )

refs(qk) : Number of accesses to


attributes (Ai,Aj) for each
execution of qk at site s

Popularity of such Ai-Aj pair at


all sites

Site m

Ai

Aj

qi

Site s

qk

qi

accs (qk )

accs (qk) : Application access


frequency of qk at site s.

Attribute Affinity Matrix


aff ( Ai, Aj )

k , use ( qk , Ai ) s use ( qk , A j ) s

ref (q ) acc (q )
s

For each query qk that uses both Ai and Aj

Popularity of such Ai-Aj pair at


all sites

refs (qk): Number of accesses


to attributes (Ai,Aj)
for each execution
of qk at site s

A1

accs (qk): Application access


frequency of qk at
site s.

A4

A 1 A 2 A 3 A4

A2

aff ( A2, A3)

A3
Attribute Affinity Matrix

Attribute Affinity Matrix Example


A 1 A 2 A3 A 4

q1

A1 A 2

A3

A4

1 0 1 0
0 1 1 0

A1 45 0 45 0
A2 0 80 5 75

q3 0 1 0 1
q4 0 0 1 1

A3 45 5 53 3
A4 0 75 3 78

q2

Attribute Usage Matrix

Attribute Affinity Matrix (AA)

Next Step - Determine clustered affinity (CA) matrix

Clustered Affinity Matrix


Step 1: Initialize CA
Copy first 2 columns

A 1 A2

A3

A4

A1

A2

A1 45 0 45 0
A2 0 80 5 75

A1

A3 45 5 53 3
A4 0 75 3 78

A3 45 5
A4 0 75

Attribute Affinity Matrix (AA)

A2

A3 A 4

45 0
0 80

Clustered Affinity Matrix (CA)

Clustered Affinity Matrix


Step 2: Determine Location for A3
3 possible
positions
for A3

A0

A 1 A2

A3

A 1 A2 A 3
A0 A3 A 1

A4

A5

A1 45 0 45 0
A2 0 80 5 75
A3 45 5 53 3
A4 0 75 3 78
Attribute Affinity Matrix (AA)

A1 A3 A2

A1

A2

A1

45

A2
A3

0 80
45 5

A4

A0

A3 A 4

A5

75

Clustered Affinity Matrix (CA)

Clustered Affinity Matrix


Step 2: Determine the order for A3
n

Contribution

bond ( Ax , Ay ) aff ( Az , Ax ) aff ( Az , Ay )


z 1

cont ( Ai , Ak , A j ) 2 bond ( Ai , Ak ) 2 bond ( Ak , A j ) 2 bond ( Ai , A j )


Cont(A0,A3,A1) = 8820

Cont(A1,A3,A2) = 10150

Cont(A2,A3,A4) = 1780

Since Cont(A1,A3,A2) is the greatest, [A1,A3,A2] is the best order.

A 1 A2

A3

A4

A1

A3

A2 A 4

A1 45 0 45 0
A2 0 80 5 75

A1 45 45 0
A2 0 5 80

A3 45 5 53 3
A4 0 75 3 78

A3 45 53 5
A4 0 3 75

Attribute Affinity Matrix (AA)

Clustered Affinity Matrix (CA)

Note: aff(A0,Ai)=aff(Ai,A0)=aff(A5,Ai)=aff(Ai,A5)=0 by definition

Clustered Affinity Matrix


Step 2: Determine the order for A4
Since Cont(A3,A2,A4) is the biggest, [A3,A2,A4] is the best order.

A 1 A2

A3

A4

A1

A3

A2

A4

A1 45 0 45 0
A2 0 80 5 75

A1 45 45 0 0
A2 0 5 80 75

A3 45 5 53 3
A4 0 75 3 78

A3 45 53 5 3
A4 0 3 75 78

Attribute Affinity Matrix (AA)

Clustered Affinity Matrix (CA)

Clustered Affinity Matrix


Step 3: Re-order the Rows
The rows are organized in the same order as the columns.
A1

A3

A2

A4

A1

A3

A2

A4

0
3

A1 45 45 0 0
A2 0 5 80 75

A1 45 45
A3 45 53

0
5

A3 45 53 5 3
A4 0 3 75 78

A2 0
A4 0

80 75
75 78

Clustered Affinity Matrix (CA)

5
3

Clustered Affinity Matrix (CA)

Partitioning
Find the sets of attributes
that are accessed, for the
most part, by distinct sets
of applications
We look for a good dividing
points along the diagnose

Bad grouping since


A1 and A2 are never
accessed together

A1

A3

A2

A4

A1 45 45
A3 45 53

0
5

0
3

A2

80 75
75 78

A4

0
0

5
3

Clustered Affinity Matrix (CA)

Cluster 1:
Cluster 2:

A 1 & A3
A2 & A4

Two vertical fragments:


PROJ1(A1, A3) and PROJ2(A2, A4)

A4 and A3
are
usually
not
accessed
together
A4 and A2
are often
accessed
together

MIXED FRAGMENTATION
Apply horizontal fragmentation to vertical fragments.
Apply vertical fragmentation to horizontal fragments.
Example: Applications about work at each department reference tuples
of employees in the departments located around the site with 80%
probability.
EMP(ENUM,NAME,SAL,TAX,MGRNUM,DNUM)
ENUM NAME TAX SAL

ED
T
LA K
E
R OR
T
W
NO

ENUM

TO

Vertical fragmentation

NAME

MGRNUM

Jacksonville
Orlando
Miami

R
O
W

DNUM

TE
A
L
RE

Horizontal
Fragmentation
(local work)

i:

fragment index

j:

site index

k:

application index

fkj:

the frequency of
application k at site j

rki:

uki:

ALLOCATION
Notations

the number of retrieval


references of application k
to fragment i.
the number of update
references of application k
to fragment i.

nki = rki + uki

Site j
Fragment i

rki

uki

Application k
/w freq. fkj

Allocation of Horizontal Fragments (1)


No replication: Best Fit Strategy
The number of local references of Ri at site j is
Benefit to
Site j

Bij f kj nki
k

All applications k
at Site j

Number of
Access by k

Frequency of
application k

Ri is allocated at site j* such that Bij* is maximum.


Advantage: A fragment is allocated to a site that needs it most.
Disadvantage: It disregards the mutual effect of placing a
fragment at a given site if a related fragment is also at that
site.

Allocation of Horizontal Fragments (2)


All beneficial sites approach (replication)

Fragment i
Site j

Bij f kj rki c f kj 'uki


k

Savings due to
retrieval
references

j ' j k

Cost of update
references from
other sites

Ri is allocated at all sites j* such that Bij* > 0.


When all Bijs are negative, a single copy of Ri is
placed at the site such that Bij* is maximum.

Allocation of Horizontal Fragments (3)


Considering reliability and availability:
di

The degree of redundancy of Ri

Fi

The reliability and availability benefit of having R i fully replicated.

(di)

The reliability and availability benefit when the fragment has d i


copies.

(d i ) (1 21 d i ) F i

(1) 0, (2) F i , (3) 3 F i ,


2
4

The benefit of introducing a new copy of R i at site j :

Bij f kj rki c f kj 'u ki (d i )


k

j ' j

Same as All Beneficial


Sites approach

Fi

Also takes into


account the benefit
of availability

di

Allocation of Vertical Fragments


PSr

A1

A3

Ri

Rs

Application type A1
at site PSr , that
accesses only Rs

B ist

At

A4

PSs

PSt

PS4

Applications
of type As
at PSs

As

As

f ks n ks

A2

At
f kt n kt
k

f kt n kt

A2
Rt

A3

PSr

...

2 f

Should we allocate fragment Rs


to site PSs , and fragment Rt to
site PSt ?

A1

ki

An

As

PSn

PSs

A1

4 l n k

Rs

Al

f kl n ki

A2

Rt

A4

f ks n ks

n ki

A3

..
.

An

At

PS4

PSn

This formula can be used within an exhaustive splitting


algorithm by trying all possible combinations of sites s and t.

PSt

SUMMARY
Design of a distributed DB consists of four phases:

Phase 1: Global schema design (same as in centralized DB


design)
Phase 2: Fragmentation
Horizontal Fragmentation

Primary: Determent a complete and minimal set of predicates


Derived: Use semijoin

Vertical Fragmentation

Identify fragments such that many applications can be executed


using just one fragment.

Phase 3: Allocation

The primary goal is to minize the number of remote accesses.

Phase 4: Physical schema design (same as in centralized DB


design).

Database Integration
Bottom-up Design

Overview
The design process in
multidatabase systems is
bottomup.
The individual databases
actually exists
Designing the global
conceptual schema (GCS)
involves integrating these
local databases into a
multidatabase.

Database integration can


occur in two steps:
Schema Translation and
Schema Integration.

Database 1

Database 2

Database 3

Translator 1

Translator 2

Translator 3

InS1

InS2

Intermediate
schema in
canonical
representation

INTEGRATOR

GCS

InS3

Network Data Model (Review)


There are two basic data structures in the network
model: records and sets.
Record type: a group of records of the same type.
Set type: indicates a many-to-one relationship in the direction of the arrow.
DEPARTMENT (DEPT-NAME, BUDGET, MANAGER)
Employs

EMPLOYEE (E#, NAME, ADDRESS, TITLE, SALARY)

owner record type


set type

Implementation of set instances:


DEPARTMENT (owner record)
Database
Jones, L.
Patel, J.

EMPLOYEE
(member records)
Vu, K.

member record type

Example: Three Local Databases


Database 1 (Relational Model):
S (TITLE, SAL)
E (ENO, ENAME, TITLE)
J (JNO, JNAME, BUDGET, LOC, CNAME)
G (ENO, JNO, RESP, DUR)

Database 2 (Network Model):


DEPARTMENT (DEPT_NAME, BUDGET, MANAGER)
Employs

Work
Worksin

Dummy
Record Type

EMPLOYEE (E#, NAME, ADDRESS, TITLE, SALARY)

Example: Three Local Databases


Database 3 (ER Model):
Engineer
No.

Engineer
Name

ENGINEER
Title

Project
No.

Responsibility

WORKS
IN

Project
Name
Budget

PROJECT
N

Salary

Location

CONTRACTED
BY

Duration
Contract
Date

CLIENT
Client
Name

Address

Schema Translation: Relational to ER


S (TITLE, SAL)
ENO

E (ENO, ENAME, TITLE)


J (JNO, JNAME, BUDGET, LOC, CNAME)

ENAME
N

E
N

G (ENO, JNO, RESP, DUR)

ENAME

E
TITLE

SAL

JNO

JNAME

BUDGET

LOC

CNAME

S
TITLE

ENO

G
DUR

PAY
E & J have a many-tomany relationship
E & S have a 1-to-many
relationship

RESP

RESP

G
DUR

JNO

JNAME

SAL
Treat salary as
an attribute of
an engineer
entity

BUDGET
CNAME

LOC

Relationships may be identified from


the foreign keys defined for each
relation.

Schema Translation: Network to ER


DEPARTMENT

WORK

EMPLOYEE
N

Works-in

Employs

WORK

EMPLOYS

Dummy
record type

DEPARTMENT

DEPARTMENT
N

EMPLOYS

M
WORKS-IN
1

EMPLOYEE

EMPLOYEE

Map each record type in the network schema to an entity


and each set type to a relationship.
Network model uses dummy records in its representation of
many-to-many relationships that need to be recognized
during mapping.

After Translation
Engineer
No.

Database 3

Engineer
Name
N

ENGINEER
Title

Project
No.

Responsibility

Budget

WORKS
IN

PROJECT
N

Salary

ENO

ENAME

Contract
Date

E
TITLE

RESP

SAL

G
DUR

JNO

Location

CONTRACTED
BY

Duration

Database 1

Project
Name

CLIENT

JNAME

Client
Name

BUDGET

Address

LOC

CNAME

Database 2

DEPARTMENT

EMPLOYS

EMPLOYEE

Schema Integration
Schema integration follows the translation
process and generates the GCS by
integrating the intermediate schemas.
Identify the components of a database which
are related to one another.
Two components can be related as (1) equivalent, (2)
one contained in the other one, (3) overlapped, or (4)
disjoint.

Select the best representation for the GCS.


Integrate the components of each
intermediate schema.

Integration Methodologies
Integration
Process
Binary

Ladder

Balanced

N-ary

One-shot

Iterative

Binary: Decreases the


potential integration
complexity and lead toward
automation techniques.
One-shot: There is no
implied priority for
integration order of
schemas, and the trade-off
can be made among all
schemas rather than among
a few.

Integration Process
Schema integration occurs in a sequence of four
steps:
1) Preintegration: establish the rules of the integration
process before actual integration occurs.
2) Comparison: naming and structural conflicts are identified.
3) Conformation: resolve naming and structural conflicts
4) Merging and restructuring: all schemas must be merged
into a single database schema and then restructured to
create the best integrated schema.

Step 1: Preintegration
1. An integration method (binary or n-ary) must be
selected and the schema integration order defined.
The order implicitly defines priorities.

2. Candidate keys in each schema are identified to


enable the integrator to determine dependencies
implied by the schemas.
3. The mapping or transformation rules should be
described before integration begins.
e.g., mapping from degree Celsius in one schema to
degrees Fahrenheit in another.

Preintegration
Example

InS1

Keys & Integration Order


KEYS
InS1: Engineer No. in ENGINEER
Project No. in PROJECT
Client name in CLIENT
InS1

InS2

InS3

Integration method

InS2: E# in EMPLOYEE
Dept-name in DEPARTMENT
InS3: Eno in E
Jno in J

Step 2: Schema Comparison


Naming Conflict (1)
Synonyms: two identical entities that have
different names.
InS1
ENGINEER

InS3
E

Engineering No

Eno

Engineer Name

Ename

Salary

Sal

WORKSIN

Responsibility

Resp

Duration

Dur

PROJECT

Project No

Jno

Project Name

Jname

Location

Loc

Step 2: Schema Comparison


Naming Conflict (2)
Homonyms: Two different entities that have
identical names.
In InS1, ENGINEER.Title
refers to the title of
engineers.

InS1

In InS2, EMPLOYEE.Title
refers to the title of all
employees.
domain (EMPLOYEE.Title) >> domain (ENIGNEREER.Title)

Step 2: Schema Comparison


Relation between Schemas
Two schemas can be related in four
possible ways:

They can be identical to one another.


One can be a subset of the other.
Some components from one may occur in other
while retaining some unique features
They could be completely different with no
overlap.

An attribute in one schema may represent


the same information as an entity in
another one

Schema Comparison Example


E#

InS3 is a subset of InS2


ENGINEER

Name

EMPLOYEE

Title
Address

IS-A relationship

EMPLOYS

Salary

DEPARTMENT

Some parts of InS1 (about engineers) and InS3


(about engineers) occur in InS2 (about employees)

Schema Comparison Structural


Conflicts (1)
Type conflicts: occur when the same object is
represented by an attribute in one schema and by an
entity in another schema.
The client of a project is modeled as an entity in InS1,
however

the client is included as an


Resp

EMPLOYS

Dur

JNO
M

attribute of the J entity in InS3


N

Jname

CONTRACTED
BY

Budget

J
Cname

InS3
Loc

Contract
Date

CLIENT
Client
Name

Address

PROJECT

InS1

Schema Comparison Structural


Conflicts (2)
This is
1-to-many

Dependency conflicts:
occur when different
relationship modes are
used to represent the
same thing in different
schemas.
Eno

Engineer
No.

Title

ENGINEER
Title

ENGINEER

Sal

WORKS
IN

EMPLOYS

Dur

PROJECT

InS1

Salary

This is
many-to-many

Resp

Ename

Project
No.

Engineer
Name

InS3

Schema Comparison: Structural


Conflicts (3)
Key conflicts: occur when different candidate
keys are available and different primary keys are
selected in different schemas
Behavioral conflicts: are implied by the modeling
mechanism,
e.g., deletion of the last employee causes the dissolution
of the department.

Step 3: Conformation
Naming Conflicts
Naming conflicts are resolved simply by renaming
conflict ones.
Synonyms: rename the schema of InS3
to conform to the naming of InS1.
InS3
E

InS1
ENGINEER

Eno

Engineering No

Engineering No

Ename Engineering Name

Engineer Name

Sal

Salary

Salary

WORKSIN
Resp

Responsibility

Responsibility

Dur

Duration

Duration

PROJECT
Jno

Project No

Project No

Jname Project Name

Project Name

Loc

Location

Location

Homonyms:
Prefix each attribute
by the name of the
entity to which it
belong,
e.g., ENGINEER.Title
EMPLOYEE.Title

and prefix each entity


by the name of the
schema to which it
belongs.
e.g., InS1.ENGINEER
InS2.EMPLOYEE

Step 3: Resolving Structural Conflicts

Transforming entities/attributes/relationships among one another


Engineer
No.

InS3

ENGINEER
Title

Salary

Engineer
No.

WORKS
IN

PROJECT

Salary

Location

Client
Name

Project
No.

Responsibility

Engineer
Name

Project
Name
Budget

Duration

ENGINEER
Title

Project
No.

Responsibility

Engineer
Name

WORKS
IN

Budget

PROJECT
N

C-P

Duration

Example:
Transform the attribute Client name in
InS3 to an entity C to make InS3
conform to the presentation of InS1.

Project
Name

M
Client
Name

Location

New
InS3

Step 3: Schema Integration


Merging & Restructuring
Merging requires that the information contained in
the participating schemas be retained in the
integrated schema.
Merging using the IS-A
relationship

InS2
(Employees)

InS3

InS1

(Engineers) (Engineers)

Use InS3 as the final schema


since it is more general in
terms of the C-P relationship
(i.e., many-to-many)
(next page)

Integrate InS1 & InS3


Engineer
No.

InS1

Engineer
Name

ENGINEER
Title

Project
No.

Responsibility

WORKS
IN

Salary

ENGINEER
Title

Responsibility

Engineer
Name

Salary

WORKS
IN

PROJECT
N
CONTRACTED
BY

Duration

M
Client
Name

CLIENT
Client
Name

Project
Name

Address

Budget

Location

InS3

Budget

Location

CONTRACTED
BY

Contract
Date

Engineer
No.

PROJECT
N

Duration

Project
No.

Project
Name

InS3 is
more
general

Merging & Restructuring Example


Final Result:
ENGINEER

Duration

Responsibility

WORKS
IN

Project
No.

Project
Name
Budget

PROJECT
Location

E#
Name

EMPLOYEE
N
EMPLOYS

InS2

Title
Address

Manager

CLIENT
Client
name

InS1/InS3

Address

SAL

DEPARTMENT
Budget

CONTRACTED
BY

Dept-name

Unfortunately, Conformation and


restructuring stages are an art
rather then a science

Query Processing in
Multidatabase Systems

Query Processing in Three Steps


1. Global query is
decomposed into local
queries

Schema Integration
Local Schema 1

Local Schema 2

Local Schema 3

Translator 1

Translator 2

Translator 3

InS1

InS2

Q1,1

Q1,2

InS3

Q1,3

INTEGRATOR

Q1

GCS

Query Processing in Three Steps


2. Each local query is
translated into
queries over the
corresponding local
database system

Schema Integration
Local Schema 1

Q1,1

Local Schema 2

Q1,3

Q1,2

Translator 1

Translator 2

InS1

InS2

Q1,1

Local Schema 3

Q1,2

Translator 3
InS3

Q1,3

INTEGRATOR

Q1

GCS

Query Processing in Three Steps


3. Results of the local
queries are combined
into the answer

Final
answer

Combine

Schema Integration
Local Schema 1

Q1,1

Local Schema 2

Q1,3

Q1,2

Translator 1

Translator 2

InS1

InS2

Q1,1

Local Schema 3

Q1,2

Translator 3
InS3

Q1,3

INTEGRATOR

Q1

GCS

Query Processing in Three Steps


1.

Global query is
decomposed into local
queries

2. Each local query is


translated into
queries over the
corresponding local
database system
3. Results of the local
queries are combined
into the answer

Schema Integration
Local Schema 1

Local Schema 2

Local Schema 3

Translator 1

Translator 2

Translator 3

InS1

InS2

INTEGRATOR

GCS

InS3

Outline
Overview of major query processing
components in multidatabase systems:
Query Decomposition
Query Translation
Global Query Optimization

Techniques for each of the above


components

Query Decomposition

Query Decomposition
Overview

Global Query

Query decomposition &


global optimization

SQ1
Query
translator 1

TQ1
DB1

SQ2

...

Query
translator 2

TQ2

...

DB2

SQn

Query
translator n

TQn

DBn

PQ1

PQn

SQi export-schema subquery


in global query language
TQi target query (local
subquery) in local query
language
PQi postprocessing query
used to combine results
returned by subqueries
to form the answer

Assumptions
We use the object-oriented data model to
present a query decomposition algorithm
To simplify the discussion, we assume that
there are only two export schemas:
ES1
Emp1: SSN
Name
Salary
Age

ES2
Emp2: SSN
Name
Salary
Rank

Definitions
type: Given a class C, the type

of C denoted by type(C ), is the


set of attributes defined for C
and their corresponding
domains.

world: the world of C, denoted


by world(C ), is the set of realworld objects described by C.

extension: the extension of C,


denoted by extension(C ), is the
set of instances contained in C.

World

Type

Extension

A Class

Schema Integration
Integration through outerjoin
Integration through outerunion
(generalization)

Review: Outerjoin
The outerjoin of relation R1 and R2
(R1 o R2 ) is the union of three
components:
the join of R1 and R2,
dangling tuples of R1 padded with null
values, and
dangling tuples of R2 padded with null
values.

Outerjoin Example
EmpO = Emp1 o Emp2

Emp1
OID

SSN

Name

Salary

Age

OID

SSN

Name

Salary

Age

Rank

6789

Smith

90,000

40

2222

Ahad

98,000

null

S. Mgr.

4321

Chang

62,000

30

8642

Patel

75,000

35

7531

Wang

95,000

mull

S. Mgr.

6789

Smith

Inconsistent

40

Mgr.

4321

Chang

62,000

30

null

8642

Patel

75,000

35

null

Emp2
OID

SSN

Name

Salary

Rank

2222

Ahad

98,000

S. Mgr.

7531

Wang

95,000

S. Mgr.

6789

Smith

25,000

Mgr.

Dangling Tuple

Dangling Tuple

Outerunion
Emp1

EmpG = Emp1 Uo Emp2

OID

SSN

Name

Salary

Age

6789

Smith

90,000

40

4321

Chang

62,000

30

8642

Patel

75,000

35

Emp2

OID

SSN

Name

Salary

Age

Rank

2222

Ahad

98,000

null

S. Mgr.

7531

Wang

95,000

mull

S. Mgr.

6789

Smith

Conflict

null

Mgr.

6789

Smith

Conflict

40

null

OID

SSN

Name

Salary

Rank

4321

Chang

62,000

30

null

2222

Ahad

98,000

S. Mgr.

8642

Patel

75,000

35

null

7531

Wang

95,000

S. Mgr.

6789

Smith

25,000

Mgr.

Schema Integration Using Outerjoin


Two classes C1 and C2 can be integrated by
equi-outerjoining the two classes on the
OID to form a new class C.
extension(C ) = extension(C1 ) o
extension(C2 )
type(C ) = type(C1 ) type(C2 )
world(C ) = world(C1 ) world(C2 )

Schema Integration thru Generalization


Two classes C1 and C2 can be integrated by
generalizing the two classes to form the
superclass C.
Generalization

type(C ) = type(C1 ) type(C2 )

Outer
union

extension(C ) = type(C) [extension(C1 ) o


extension(C2 )]
world(C ) = world(C1 ) world(C2 )

Generalization Example
Emp1:
SSN
Name
Salary
Age

Emp2: SSN
Name
Salary
Rank

EmpG: SSN
Name
Salary

Generalization

Emp1 and Emp2 will also appear in the


global schema since not all information in
Emp1 and Emp2 is retained in EmpG
EmpG SSN
Name
Salary
Emp1 Age

More
specific

Rank Emp2

Inconsistency Resolution
The schema
integration techniques
work as long as there
is no data
inconsistency
If data inconsistency
occurs, aggregate
functions may be used
to resolve the
problem.

Inconsistency Resolution Example


Export Schemas
Emp1: SSN
Emp2: SSN
SSN
Name
Name
Name
Salary
Salary
Salary
Age
Rank
Age

Integrated Schema
EmpG: SSN
EmpO:
Generalization

Name

or

Salary

Outer
join

Rank

Aggregate Functions - Examples:


EmpG.Name = Emp1.Name, if EmpG is in world(Emp1)
= Emp2.Name, if EmpG is in world(Emp2) world(Emp1)
EmpG.Salary = Emp1.Salary, if EmpG is in world(Emp1) world(Emp2)
world(Emp2)
= Emp2.Salary, ifEmpG is in world(Emp2) world(Emp1) world(Emp1) world(Emp1)

= Sum(Emp1.Salary, Emp2.Salary), if EmpG is in world(Emp1)


world(Emp2)
world(Emp1)
world(Emp2)
world(Emp2)
EmpO.Age = Emp1.Age, if EmpO is in world(Emp1)
World
= Null, if EmpO is in world(Emp2) world(Emp1)

(Emp1)

World (Emp2)

Query Decomposition

Step 1: Determine Number of Subqueries


Global
Query

Select EmpO.Name, EmpO.Rank


From
EmpO
Where EmpO.Salary > 80,000 AND
EmpO.Age > 35

Assume
Outerjoin is
used for
schema
integration

Obtain a partition of world(EmpO) based on the aggregate


function used to resolve the data inconsistency.
Inconsistency Function:
Option 1 (based on Salary)
EmpO.Salary = Emp1.Salary, if
part. 1: world(Emp1) world(Emp2)
EmpO is in world(Emp1) world(Emp2)
part. 2: world(Emp2) world(Emp1)
= Emp2.Salary, if
part. 3: world(Emp1) world(Emp2)
EmpO is in world(Emp2) world(Emp1)

world(Emp1)

2
world(Emp2)

= Sum(Emp1.Salary,Emp2.Salary), if
EmpO is in world(Emp1) world(Emp2)

Query Decomposition

Step 1: Determine Number of Subqueries


Global
Query

Select EmpO.Name, EmpO.Rank


From
EmpO
Where EmpO.Salary > 80,000 AND
EmpO.Age > 35

Obtain a partition of world(EmpO) based on the aggregate


function used to resolve the data inconsistency.
Inconsistency Function:
EmpO.Age
= Emp1.Age, if EmpO is in world(Emp1)

Option 2 (based on Age)


part. 1: world(Emp1)
part. 2: world(Emp2) world(Emp1)

= Null, if EmpO is in world(Emp2) world(Emp1)

world(Emp1)

2
world(Emp2)

Query Decomposition

Step 1: Determine Number of Subqueries


Global
Query

Select EmpO.Name, EmpO.Rank


From
EmpO
Where EmpO.Salary > 80,000 AND
EmpO.Age > 35

Obtain a partition of world(EmpO) based on the aggregate


function used to resolve the data inconsistency.
Option 1 (based on Salary)
part. 1: world(Emp1) world(Emp2)
part. 2: world(Emp2) world(Emp1)
part. 3: world(Emp1) world(Emp2)
world(Emp1)

Option 2 (based on Age)


part. 1: world(Emp1)
part. 2: world(Emp2)
world(Emp1)
world(Emp1)

2
world(Emp2)

2
world(Emp2)

We use Option 1 since it is the finest partition among all the partitions.

Query Decomposition
Another Example
Option 1:

Option 2:

world(Emp1)

world(Emp1)

world(Emp2)

world(Emp2)

Use finer partition (Option 3):


world(Emp1)

2
world(Emp2)

Query Decomposition

Step 2: Query Decomposition


Global Query:
Select EmpO.Name, EmpO.Rank
From EmpO
Where EmpO.Salary > 80,000
AND
EmpO.Age > 35
Partition:

world(Emp1)

2
world(Emp2)

part. 1: Select Emp1.Name


From Emp1
Where Emp1.Salary > 80,000
AND
Emp1.Age > 35 AND
Emp1.SSN NOT IN
(Select
Emp2.SSN
From
Emp2)
part. 2: This subquery is discarded
because EmpO.Age is Null.

Query Decomposition:
part. 3: Select Emp1.Name,
Obtain a query for each
Emp2.Rank
subset in the chosen
From Emp1, Emp2
partition.
EmpO.Age
= Emp1.Age, if EmpO is in world(Emp1)
= Null, if EmpO is in world(Emp2) world(Emp1)

Where Sum(Emp1.Salary,

EmpO.Salary = Emp1.Salary, if EmpG is in world(Emp1) world(Emp2)


Emp2.Salary)
= Emp2.Salary, ifEmpG is in world(Emp2) world(Emp1)
80,000 AND
= Sum(Emp1.Salary, Emp2.Salary), if EmpG is in world(Emp1) world(Emp2)

>

Emp1.Age > 35 AND

Query Decomposition

Step 2: Query Decomposition


Global Query:
Select EmpO.Name, EmpO.Rank
From EmpO
Where EmpO.Salary > 80,000
AND
EmpO.Age > 35

part. 1: Select Emp1.Name


From Emp1
Where Emp1.Salary > 80,000
AND
Emp1.Age > 35 AND
Emp1.SSN NOT IN
n
tio
(Select
a
c
Query Decomposition:
i
if
Emp2.SSN
d
o
Obtain a query for each y M
From
r
e
subset in the chosenQu
Emp2)

partition.

Emp1.Salary
3
Emp1.Age Emp1.Salary +

Emp2.Salary
Emp2.Salary
Emp1.Age
Age = null
world(Emp1)

world(Emp2)

part. 2: This subquery is discarded


because EmpO.Age is Null.
part. 3: Select Emp1.Name,
Emp2.Rank
From Emp1, Emp2
Where Sum(Emp1.Salary,
Emp2.Salary) >
80,000 AND
Emp1.Age > 35 AND

Query Decomposition

Step 3: Further Decomposition


STEP 3: Some resulting query may still reference
data from more than one database. They need to be
further decomposed into subqueries and possibly also
postprocessing queries
Before STEP 3:
Select Emp1.Name
From Emp1
Where Emp1.Salary > 80,000
and
Emp1. Age > 35 and
Emp1.SSN NOT IN
(Select Emp2.SSN
From Emp2)

Select Emp1.Name
From Emp1
Where Emp1.Salary > 80,000
and
Emp1. Age > 35 and
Emp1.SSN NOT IN X

X
Insert INTO X
Select Emp2.SSN
From Emp2)

Query Decomposition
Step 4: Query Optimization

STEP 4: It may be desirable to reduce


the number of subqueries by
combining subqueries for the same
database.

Query Translation

Query Translation (1)


IF

THEN

Global Query Language


Local Query Language
Export
Schema
Subquery

Translator

Local
Query
Language

Query Translation (2)


IF the source query language has a higher
expressive power THEN EITHER
Some source queries cannot be translated; or
they must be translated using both

the syntax of the target query language, and


some facilities of a high-level programming language.

Example: A recursive OODB query may not be


translated into a relational query using SQL
alone.

Relation-to-OO Translation
OODB Schema:
Auto
OID
Color
Manufacturer

Company
OID
Name
Profit
Headquarter
President

People
OID
Name
Hometown
Automobile
Age

Foreign
key
Equivalent Relational Schema:

Auto (Auto-OID, Color, Company-OID)


Company (Company-OID, Name, Profit, City-OID,
People-OID)
People (People-OID, Name, Age, City-OID, Auto-OID)
City (City-OID, Name, State)

City
OID
Name
State

Relational-to-OO
Example
Global Query:

1
2
3
4
5
6

Auto (Auto-OID, Color, Company-OID)


Company (Company-OID, Name, Profit, CityOID, People-OID)
People (People-OID, Name, Age, City-OID,
Auto-OID)
City (City-OID, Name, State)

Select Auto1.*
Relational Predicate Graph:
From
Auto Auto1, Auto Auto2,
Company, People,
1) Company-OID
Auto1
Company
City City1, City City2
D
Where Auto1.Company-OID =
OI
2) People-OID
ity
C
Company.Company-OID AND
6)
n)
i
o
Company.People-OID =
(J
People
City2
People.People-OID AND
Age=52
D
People.Age = 52 AND
OI
3) Auto-OID
y
ame
it
People.Auto-OID =
C
5) N
4)
Auto2.Auto-OID AND
Auto2
Auto2.Color = red AND
City1
Color=red
People.City-OID =
City1.City-OID AND
City1.Name = City2.Name AND
Find all red cars own by a 52 year old
Company.City-OID =
who is the President 1+2+3of the car
City2.City-OID

manufacturer and lives in the same city


of the car manufacturer 4+5+6

Relational-to-OO Example (2)


OO Predicate Graph:
Auto1

Company-OID

City1
Relational Predicate Graph:
Auto1

1) Company-OID

6)

City2
ame
5) N

City1

O
yt
i
C

ID

(Jo

in)

Company
2) People-OID

People
Age=52

C
4)

I
-O
y
t
i

3) Auto-OID

Auto2

Color=red

Name

City2

Company

ID
r)
y-O arte
t
i
C
qu
ad
e
(H
D
OI n)
y
Cit etow
om
(H

People-OID

People
Age=52

Auto-OID

Auto2

Color=red

Relational-to-OO Example (3)


OO Predicate Graph:
Auto1

Company-OID

City1

Predicate 3

Name

City2

Company

ID
r)
y-O arte
t
i
C
qu
ad
e
(H
D
OI n)
y
Cit etow
om
(H

People-OID

People

Predicate 1

Age=52

Auto-OID

Auto2

Color=red

Predicate 2

OO Query:
1
Where Auto.Manufacturer.President.Age = 52 AND
Auto.Manufacturer.President.Automobile.Color = red
2
AND
Auto.Manufacturer.Headquarter.Name =
3

Auto.Manufacturer.President.Hometown.Name

Global Query Optimization

Query Optimization (1)


CASE 1: A single target query is generated
IF the target database system has a query
optimizer
THEN the query optimizer can be used
to optimize the translated query
ELSE the translator has to consider the
performance issues

Query Optimization (2)


CASE 2: A set of target queries is needed.
It might pay to have the minimum number of
queries

It minimizes the number of invocations of the target


system
It may also reduce the cost of combining the partial
results

It might pay for a set to contain target queries


that can be well coordinated
The results or intermediate results of the queries
processed earlier can be used to reduce the cost of
processing the remaining queries

Global Query Optimization (1)


A query obtained by the query modification
process may still reference data from more
than one database.
Example: part. 3 on page 123
Select Emp1.Name, Emp2.Rank
From
Emp1, Emp2
/* access two databases
Where sum(Emp1.Salary, Emp2.Salary) > 80,000 AND
Emp1.Age > 35 AND
Emp1.SSN = Emp2.SSN
Some global strategy is needed to process such queries

Global Query Optimization (2)


Select Emp1.Name, Emp2.Rank
From
Emp1, Emp2
/* access two databases
Where sum(Emp1.Salary, Emp2.Salary) > 80,000
AND
Emp1.Age > 35 AND
Emp1.SSN = Emp2.SSN
Some global
is needed to process such
Sitestrategy
1
queries

Site 1

Emp1

Age > 35

Emp1

Site 2

Site 1

Emp2

Emp1

form
result

form
result

1+2
form
result

Emp2

Emp2

Site 3
Site 2

Site 2

Data Inconsistency
If C is integrated from C1 and C2 with no
data inconsistency on attribute A, then
A op a

(C) =

A op a (C1) A op a (C2)

If A has data inconsistency, then the above


equality may no longer hold.
Example: Consider the select operation

EmpO
OID

SSN

Name

Salary

Age

Rank

2222

Ahad

98,000

null

S. Mgr.

7531

Wang

95,000

mull

S. Mgr.

6789

Smith

Inconsistent

40

Mgr.

4321

Chang

62,000

30

null

8642

Patel

75,000

35

null

EmpO.Salary > 100,000 (EmpO)

The correct answer should


have the record for Smith.
However, the above query
returns an empty set

Smith does have a combined salary greater than 100,000

Data Inconsistency - Optimization


Express an outerjoin (or a generalization) as
outer-unions as follows:
C1 o C2 = C1-O o C2-O o (C1-C OID C2-C)
C1-O: Those tuples of C1 that have no matching tuples
in C2 (private part)
C1-C: Those tuples of C1 that have matching tuples in
C2 (overlap part)

A op a (C1 o C2 ) = A op a (C1-O) o A op a (C2-O)


o A op a (C1-C C2-C)

Can we improve this term


by distributing

Distribution of Selections (1)


A op a (C1 o C2 ) = A op a (C1-O) o A op a (C2-O)
o A op a (C1-C C2-C)

When can we distribute


over ?

Expensive operation

Attribute A is defined by
an aggregate function to
resolve inconsistency,
e.g., sum(Emp1.Salary, Emp2.Salary)

Distribution of Selection (2)


Four cases were identified when all arguments of the aggregate
function (for resolving conflicts) are non-negative
1. f(A1,A2) op a A1 op a AND A2 op a:
Aggregate A op a (C1-C C2-C) = A op a (C1-C) A op a ( C2-C)
function

Example:

max(Emp1-C.Salary,

Emp2-C.Salary) < 30K

Emp1-C.Salary < 30K AND


Emp2-C.Salary < 30K

An aggregate
function

2. f(A1,A2) op a f(A1 op a, A2 op a) op a:
A op a(C1-C C2-C) = A op a(A1 op a(C1-C) A2 op a(C2-C))
Example:

sum(Emp1-C.Salary, Emp2-C.Salary) < 30K


sum(Emp1-C.Salary < 30K, Emp2-C.Salary < 30K) < 30K

Distribution of Selection (3)


3. f(A1,A2) op a f(A1 op a, A2 op a) op a:
A op a(C1-C C2-C) = A op a(A1 op a(C1-C)
A2 op a(C2-C))
Example: sum(Emp1-C.Salary, Emp2-C.Salary) = 30K
sum(Emp1-C.Salary 30K,
Emp2-C.Salary 30K) = 30K
4. No improvement is possible:
Example: sum(Emp1-C.Salary, Emp2-C.Salary) > 30K

Distribution Rules for over


A op a(C1-C C2-C)
op

sum(A1, A2)

Not
> < = in
in
4 4 2 2 3 4 4 4

avg(A1, A2)

4 4 2 2 3 4 4 4

max(A1, A2)

4 4 1 1 3 4 4 4

min(A1, A2)

1 1 4 4 3 4 4 4

No
improvement
possible

Problem in Global Query


Optimization (1)
Important information about local entity sets that is
needed to determine global query processing plans
may not be provided by the local database systems.
Example: cardinalities
availability of fast access paths

Techniques:

Sampling queries may be designed to collect statistics


about the local databases.
A monitoring system can be used to collect the
completion time for subqueries. This can be used to
better estimate subsequent subqueries.

Problems in Global Query Optimization (2)


Different query processing algorithms may have
been used in different local database systems.
Cooperation across different systems difficult
Examples: Semijoin may not be supported on some
local systems.

Data transmission between different local


database systems may not be fully supported.

Examples:
A local database system may not allow update operations
For many nonrelational systems, the instances of one
entity set are more likely to be clustered with the
instances of other entity sets. Such clustering makes it
very expensive to extract data for one entity set.
Need more sophisticated decomposition algorithms.

You might also like