Principles of Distributed Database
Systems
M. Tamer Özsu
Patrick Valduriez
© 2020, M.T. Özsu & P. Valduriez 1
Outline
Introduction
Distributed and Parallel Database Design
Distributed Data Control
Distributed Query Processing
Distributed Transaction Processing
Data Replication
Database Integration – Multidatabase Systems
Parallel Database Systems
Peer-to-Peer Data Management
Big Data Processing
NoSQL, NewSQL and Polystores
Web Data Management
© 2020, M.T. Özsu & P. Valduriez 2
Outline
Distributed Data Control
View management
Data security
Semantic integrity control
© 2020, M.T. Özsu & P. Valduriez 3
Semantic Data Control
Involves:
View management
Security control
Integrity control
Objective :
Ensure that authorized users perform correct operations on the
database, contributing to the maintenance of the database
integrity.
© 2020, M.T. Özsu & P. Valduriez 4
Outline
Distributed Data Control
View management
Data security
Semantic integrity control
© 2020, M.T. Özsu & P. Valduriez 5
View Management
View – virtual relation EMP
generated from base relation(s) by a ENO ENAME TITLE
query
E1 J. Doe Elect. Eng
not stored as base relations E2 M. Smith Syst. Anal.
E3 A. Lee Mech. Eng.
Example 3.1 : E4 J. Miller Programmer
CREATE VIEW SYSAN(ENO,ENAME) E5 B. Casey Syst. Anal.
E6 L. Chu Elect. Eng.
AS SELECT ENO,ENAME E7 R. Davis Mech. Eng.
E8 J. Jones Syst. Anal.
FROM EMP
WHERE TITLE= "Syst.
Anal."
© 2020, M.T. Özsu & P. Valduriez 6
View Management
Views can be manipulated as base relations
Example 3.2 :
SELECT ENAME, PNO, RESP
FROM SYSAN, ASG
WHERE SYSAN.ENO = ASG.ENO
© 2020, M.T. Özsu & P. Valduriez 7
Query Modification
Queries expressed on views
Queries expressed on base relations
Example 3.3 :
SELECT ENAME, PNO, RESP
FROM SYSAN, ASG
WHERE SYSAN.ENO = ASG.ENO
SELECT ENAME,PNO,RESP
FROM EMP, ASG
WHERE EMP.ENO = ASG.ENO
AND TITLE = "Syst. Anal."
© 2020, M.T. Özsu & P. Valduriez 8
View Management
To restrict access, Example 3.4
CREATE VIEW ESAME
AS SELECT *
FROM EMP E1, EMP E2
WHERE E1.TITLE = E2.TITLE
AND E1.ENO = USER
Query
SELECT *
FROM ESAME
© 2020, M.T. Özsu & P. Valduriez 9
View Updates
Updatable
CREATE VIEW SYSAN(ENO,ENAME)
AS SELECT ENO,ENAME
FROM EMP
WHERE TITLE="Syst. Anal.“
Example 3.5
Non-updatable
CREATE VIEW EG(ENAME,RESP)
AS SELECT ENAME,RESP
FROM EMP, ASG
WHERE EMP.ENO=ASG.ENO
© 2020, M.T. Özsu & P. Valduriez 10
View Management in Distributed DBMS
Views might be derived from fragments.
View definition storage should be treated as database
storage (stored in the catalog)
View definitions can be centralized at one site,
partially duplicated or fully duplicated.
Query modification results in a distributed query (be
processed by the distributed query processor => maps the
distributed query into a query on physical fragments )
View evaluations might be costly if base relations are
distributed (page 96)
Use materialized views
© 2020, M.T. Özsu & P. Valduriez 11
Materialized View
An alternative solution is to avoid view derivation by
maintaining actual versions of the views, called
materialized views
Origin: snapshot in the 1980’s
Static copy of the view, avoid view derivation for each query
But periodic recomputing of the view may be expensive
Actual version of a view
Stored as a database relation, possibly with indices
Used much in practice
DDBMS: No need to access remote, base relations
Data warehouse: to speed up OLAP
Use aggregate (SUM, COUNT, etc.) and GROUP BY
© 2020, M.T. Özsu & P. Valduriez 12
Materialized View
-- Example 3.6
CREATE VIEW PL(LOC, NBPROJ, TBUDGET)
WITH SCHEMABINDING AS
SELECT LOC, COUNT_BIG(*),SUM(isnull(BUDGET,0))
FROM dbo.PROJ
GROUP BY LOC
-- Create index for view (materialized views for sql server)
go
CREATE UNIQUE CLUSTERED INDEX CLI_PL ON
PL(LOC)
© 2020, M.T. Özsu & P. Valduriez 13
Materialized View Maintenance
View maintenance is process of updating (refreshing) the
view to reflect changes to base data
Resembles data replication but there are differences
View expressions typically more complex
in particular, for data warehousing, are typically more complex
than replica definitions and may include join, group by, and
aggregate operator.
Replication configurations more general
e.g., with multiple copies of the same base data at multiple
sites
View maintenance policy to specify:
When to refresh
How to refresh
© 2020, M.T. Özsu & P. Valduriez 14
When to Refresh a View
Immediate mode
As part of the updating transaction, e.g. through 2PC (two phase
commit)
The main advantages : View always consistent with base data
and fast queries
But increased transaction time to update base data
Deferred mode (preferred in practice)
Through separate refresh transactions
No penalty on the updating transactions
Triggered at different times with different trade-offs
Lazily: just before evaluating a query on the view
Periodically: every hour, every day, etc.
Forcedly: after a number of predefined updates
© 2020, M.T. Özsu & P. Valduriez 15
How to Refresh a View
Full computing from base data
Efficient if there has been many changes
Incremental computing by applying only the changes to
the view
Better if a small subset has been changed
Uses differential relations which reflect updated data only
© 2020, M.T. Özsu & P. Valduriez 16
Differential Relations
Given relation R and update u
R+ contains tuples inserted by u
R- contains tuples deleted by u
Type of u
insert R- empty
delete R+ empty
modifyR+ (R – R- )
Refreshing a view V is then done by computing
V+ (V – V- )
computing V+ and V- may require accessing base data
© 2020, M.T. Özsu & P. Valduriez 17
Example 3.7
EG = SELECT DISTINCT ENAME, RESP
FROM EMP, ASG
WHERE EMP.ENO=ASG.ENO
The changes to the view EG can be computed as:
EG+= (SELECT DISTINCT ENAME, RESP
FROM EMP, ASG+
WHERE EMP.ENO=ASG+.ENO) UNION
(SELECT DISTINCT ENAME, RESP
FROM EMP+, ASG
WHERE EMP+.ENO=ASG.ENO) UNION
(SELECT DISTINCT ENAME, RESP
FROM EMP+, ASG+
WHERE EMP+.ENO=ASG+.ENO)
© 2020, M.T. Özsu & P. Valduriez 18
Techniques for Incremental View
Maintenance
Different techniques depending on:
View expressiveness
Non recursive views: SPJ (select-project-join) with duplicate
elimination, union and aggregation
Views with outerjoin
Recursive views
Most frequent case is non recursive views
Problem: an individual tuple in the view may be derived from
several base tuples
Example: tuple M. Smith, Analyst in EG corresponding to
E2, M. Smith, … in EMP
E2,P1,Analyst,24 and E2,P2,Analyst,6 in ASG
Makes deletion difficult
Solution: Counting
© 2020, M.T. Özsu & P. Valduriez 19
Counting Algorithm (Example 3.8)
Basic idea
Maintain a count of the number of derivations for each tuple in
the view
Increment (resp. decrement) tuple counts based on insertions
(resp. deletions)
A tuple in the view whose count is zero can be deleted
Algorithm
1. Compute V+ and V- using V, base relations and diff. relations
2. Compute positive in V+ and negative counts in V-
3. Compute V+ (V – V- ), deleting each tuple in V with count=0
Optimal: computes exactly the view tuples that are
inserted or deleted
© 2020, M.T. Özsu & P. Valduriez 20
Counting Algorithm (Example 3.8)
Then only tuple (A. Lee,
Consultant) needs to be Assume now that tuples (E2, P1,
deleted from Analyst, 24) and (E3, P3,
EG Consultant, 10) are deleted
from ASG
© 2020, M.T. Özsu & P. Valduriez 21
Outline
Distributed Data Control
View management
Data security
Semantic integrity control
© 2020, M.T. Özsu & P. Valduriez 22
Data Security
Data protection
Prevents the physical content of data to be understood by
unauthorized users
Uses encryption/decryption techniques (Public key)
Access control
Only authorized users perform operations they are allowed to on
database objects
There are two main approaches to database access control
Discretionary access control (DAC)
Long been provided by DBMS with authorization rules
DAC defines access rights based on the users, the type of access
mandatory access control (MAC)
Increases security with security levels by restricting access
to classified data to cleared users.
© 2020, M.T. Özsu & P. Valduriez 23
Discretionary Access Control
Main actors
Subjects (users, groups of users) who execute operations
Operations (in queries or application programs)
Objects, on which operations are performed
Checking whether a subject may perform an op. (operation)
on an object
Authorization= (subject, op. type, object def.)
Defined using GRANT OR REVOKE
Centralized: one single user class (admin.) may grant or revoke
Decentralized, with op. type GRANT
More flexible but recursive revoking process which needs the hierarchy
of grants
© 2020, M.T. Özsu & P. Valduriez 24
Problem with DAC
A malicious user can access unauthorized data through
an authorized user
Example
User A has authorized access to R and S
User B has authorized access to S only
B somehow manages to modify an application program used by
A so it writes R data in S
Then B can read unauthorized data (in S) without violating
authorization rules
Solution: multilevel security based on the famous Bell
and Lapuda model for OS security
© 2020, M.T. Özsu & P. Valduriez 25
Multilevel Access Control
Different security levels (clearances)
Top Secret > Secret > Confidential > Unclassified
Access controlled by 2 rules:
No read up
subject S is allowed to read an object of level L only if level(S) ≥ L
Protect data from unauthorized disclosure, e.g. a subject with secret
clearance cannot read top secret data
No write down:
subject S is allowed to write an object of level L only if level(S) ≤ L
Protect data from unauthorized change, e.g. a subject with top secret
clearance can only write top secret data but not secret data (which could
then contain top secret data)
© 2020, M.T. Özsu & P. Valduriez 26
MAC in Relational DB
A relation can be classified at different levels:
Relation: all tuples have the same clearance
Tuple: every tuple has a clearance
Attribute: every attribute has a clearance
A classified relation is thus multilevel
Appears differently (with different data) to subjects with different
clearances
© 2020, M.T. Özsu & P. Valduriez 27
Example
PROJ*: classified at attribute level
PNO SL1 PNAME SL2 BUDGET SL3 LOC SL4
P1 C Instrumentatio C 150000 C Montreal C
P2 C n C 135000 S New York S
P3 S DB Develop. S 250000 S New York S
CAD/CAM
ROJ* as seen by a subject with confidential clearance
PNO SL1 PNAME SL2 BUDGET SL3 LOC SL4
P1 C Instrumentatio C 150000 C Montreal C
P2 C n C Null C Null C
DB Develop.
© 2020, M.T. Özsu & P. Valduriez 28
Distributed Access Control
Additional problems in a distributed environment
Remote user authentication
Typically using a directory service
Should be replicated at some sites for availability
Management of DAC rules
Problem if users’ group can span multiple sites
Rules stored at some directory based on user groups location
Accessing rules may incur remote queries
Covert channels in MAC
© 2020, M.T. Özsu & P. Valduriez 29
Covert Channels
Indirect means to access unauthorized data
Example
Consider a simple DDB with 2 sites: C (confidential) and S
(secret)
Following the “no write down” rule, an update from a subject with
secret clearance can only be sent to S
Following the “no read up” rule, a read query from the same
subject can be sent to both C and S
But the query may contain secret information (e.g. in a select
predicate), so is a potential covert channel
Solution: replicate part of the DB
So that a site at security level L contains all data that a subject at
level L can access (e.g. S above would replicate the confidential
data so it can entirely process secret queries)
© 2020, M.T. Özsu & P. Valduriez 30
Outline
Distributed Data Control
View management
Data security
Semantic integrity control
© 2020, M.T. Özsu & P. Valduriez 31
Semantic Integrity Control
Maintain database consistency by enforcing a set of
constraints defined on the database.
Structural constraints
Basic semantic properties inherent to a data model e.g., unique
key constraint in relational model
Behavioral constraints
Regulate application behavior, e.g., dependencies in the
relational model (example : between primary key and foreign key)
Two components
Integrity constraint specification
Integrity constraint enforcement
© 2020, M.T. Özsu & P. Valduriez 32
Semantic Integrity Control
Procedural
Control embedded in each application program
Declarative
Assertions in predicate calculus
Easy to define constraints
Definition of database consistency clear
But inefficient to check assertions for each update
Limit the search space
Decrease the number of data accesses/assertion
Preventive strategies
Checking at compile time
© 2020, M.T. Özsu & P. Valduriez 33
Constraint Specification Language
Predefined constraints
specify the more common constraints of the relational model
Not-null attribute
ENO NOT NULL IN EMP
Unique key
(ENO, PNO) UNIQUE IN ASG
Foreign key
A key in a relation R is a foreign key if it is a primary key of another
relation S and the existence of any of its values in R is dependent
upon the existence of the same value in S
PNO IN ASG REFERENCES PNO IN PROJ
Functional dependency
ENO IN EMP DETERMINES ENAME
© 2020, M.T. Özsu & P. Valduriez 34
Constraint Specification Language
Precompiled constraints
Express preconditions that must be satisfied by all tuples in a
relation for a given update type
(INSERT, DELETE, MODIFY)
NEW - ranges over new tuples to be inserted
OLD - ranges over old tuples to be deleted
General Form
CHECK ON <relation> [WHEN <update type>]
<qualification>
© 2020, M.T. Özsu & P. Valduriez 35
Constraint Specification Language
Precompiled constraints
Domain constraint
CHECK ON PROJ (BUDGET≥500000 AND BUDGET≤1000000)
Domain constraint on deletion
CHECK ON PROJ WHEN DELETE (BUDGET = 0)
Transition constraint
CHECK ON PROJ (NEW.BUDGET > OLD.BUDGET AND
NEW.PNO = OLD.PNO)
© 2020, M.T. Özsu & P. Valduriez 36
Constraint Specification Language
General constraints
Constraints that must always be true. Formulae of tuple
relational calculus where all variables are quantified.
General Form
CHECK ON <variable>:<relation>,(<qualification>)
Functional dependency
CHECK ON e1:EMP, e2:EMP
(e1.ENAME = e2.ENAME IF e1.ENO = e2.ENO)
Constraint with aggregate function
CHECK ON g:ASG, j:PROJ
(SUM(g.DUR WHERE g.PNO = j.PNO) < 100 IF
j.PNAME = "CAD/CAM")
© 2020, M.T. Özsu & P. Valduriez 37
Integrity Enforcement
Two methods
Detection : applied after having changed the
database state (posttests)
Execute update u: D Du
If Du is inconsistent then
if possible: compensate Du Du’
else
undo Du D
Preventive : are applied before the database
state is changed (pretests)
Execute u: D Du only if Du will be consistent
Determine valid programs
Determine valid states
© 2020, M.T. Özsu & P. Valduriez 38
Query Modification
Preventive
Add the assertion qualification to the update query
Only applicable to tuple calculus formulae with
universally quantified variables
UPDATE PROJ
SET BUDGET = BUDGET*1.1
WHERE PNAME = "CAD/CAM"
UPDATE PROJ
SET BUDGET = BUDGET*1.1
WHERE PNAME = "CAD/CAM"
AND NEW.BUDGET ≥ 500000
AND NEW.BUDGET ≤ 1000000
© 2020, M.T. Özsu & P. Valduriez 39
Compiled Assertions
Triple (R,T,C) where
R relation
T update type (insert, delete, modify)
C assertion on differential relations
Example: Foreign key assertion
g ASG, j PROJ : g.PNO = j.PNO
Compiled assertions:
(ASG, INSERT, C1), (PROJ, DELETE, C2), (PROJ, MODIFY, C3)
where
C1:NEW ASG+ j PROJ: NEW.PNO = j.PNO
C2:g ASG, OLD PROJ- : g.PNO ≠ OLD.PNO
C3:g ASG, OLD PROJ- NEW PROJ+:
g.PNO ≠OLD.PNO OR OLD.PNO = NEW.PNO
© 2020, M.T. Özsu & P. Valduriez 40
Differential Relations
Given relation R and update u
R+ contains tuples inserted by u
R- contains tuples deleted by u
Type of u
insert R- empty
delete R+ empty
modifyR+ (R – R-)
© 2020, M.T. Özsu & P. Valduriez 41
Differential Relations
Algorithm:
Input: Relation R, update u, compiled assertion Ci
Step 1: Generate differential relations R+ and R–
Step 2: Retrieve the tuples of R+ and R– which do not
satisfy Ci
Step 3: If retrieval is not successful, then the assertion is
valid.
Example :
u is delete on J. Enforcing (EMP, DELETE, C2) :
retrieve all tuples of EMP-
into RESULT
where not(C2)
If RESULT = {}, the assertion is verified
© 2020, M.T. Özsu & P. Valduriez 42
Distributed Integrity Control
Problems:
Definition of constraints
Consideration for fragments
Where to store
Replication
Non-replicated : fragments
Enforcement
Minimize costs
© 2020, M.T. Özsu & P. Valduriez 43
Types of Distributed Assertions
Individual assertions
Single relation, single variable
Domain constraint
Set oriented assertions
Single relation, multi-variable
functional dependency
Multi-relation, multi-variable
foreign key
Assertions involving aggregates
© 2020, M.T. Özsu & P. Valduriez 44
Distributed Integrity Control
Assertion Definition
Similar to the centralized techniques
Transform the assertions to compiled assertions
Assertion Storage
Individual assertions
One relation, only fragments
At each fragment site, check for compatibility
If compatible, store; otherwise reject
If all the sites reject, globally reject
Set-oriented assertions
Involves joins (between fragments or relations)
May be necessary to perform joins to check for compatibility
Store if compatible
© 2020, M.T. Özsu & P. Valduriez 45
Distributed Integrity Control
Assertion Enforcement
Where to enforce each assertion depends on
Type of assertion
Type of update and where update is issued
Individual Assertions
If update = insert
Enforce at the site where the update is issued
If update = qualified
Send the assertions to all the sites involved
Execute the qualification to obtain R+ and R-
Each site enforces its own assertion
Set-oriented Assertions
Single relation
Similar to individual assertions with qualified updates
Multi-relation
Move data to perform joins; then send the result to query master site
© 2020, M.T. Özsu & P. Valduriez 46
Conclusion
Solutions initially designed for centralized systems have
been significantly extended for distributed systems
Materialized views and group-based discretionary access control
Semantic integrity control has received less attention
and is generally not well supported by distributed DBMS
products
Full data control is more complex and costly in
distributed systems
Definition and storage of the rules (site selection)
Design of enforcement algorithms which minimize
communication costs
© 2020, M.T. Özsu & P. Valduriez 47