Distributed Database
Source:
1. Principles of Distributed Database Systems
By M. Tamer Özsu and Patrick Valduriez
2. Slides available
Survey of advanced topics in Database Systems
It was predicted that by 1998 centralized database management
systems (DBMSs) would be an “antique curiosity” and most
organizations would have moved toward distributed database
management. At the time, distribution was only slowly taking hold,
and “client/server” computing had just started.
These systems were generally multiple client/single server
systems in which the distribution was mostly in terms of
functionality, not data. If multiple servers were used, clients
were responsible for managing the connections to these servers.
Transparency of access was not widely supported, and each
client had to “know” the location of the required data. The
distribution of data among multiple servers was very primitive;
systems did not support fragmentation or replication of data.
Systems of the time were “homogeneous” in that each system
could manage only data that were stored in its own database,
with no linkage to other repositories.
Today’s client/server systems provide significant
transparency in accessing data from multiple servers, support
distributed transactions to facilitate transparency, and
execute queries over (horizontally) fragmented data.
Further, new systems implement both synchronous and
asynchronous replication protocols, and many vendors have
introduced gateways to access other databases.
Significant achievements have taken place in the development
and deployment of parallel database servers.
Object database managers have entered the marketplace and
have found a niche market in some classes of applications
which are inherently distributed.
Distributed Database System
Distributed database system (DDBS) technology is one of the
major recent developments in the database systems area.
DDBS technology is the union of what appear to be two
diametrically opposed approaches to data processing:
database system and computer network technologies.
One of the major motivations behind the use of database
systems is the desire to integrate the operational data of an
enterprise and to provide centralized, thus controlled access
to that data.
Fig. 1.2: Database Processing
The technology of computer networks promotes a mode of
work that goes against all centralization efforts.
The most important objective of the database technology is
integration, not centralization. It is possible to achieve
integration without centralization, and that is exactly what
the DDB technology attempts to achieve.
Distributed Data Processing
Distributed computing system: It is a number of autonomous
processing elements (not necessarily homogeneous) that are
interconnected by a computer network and that cooperate in
performing their assigned tasks. The “processing element” is
a computing device that can execute a program on its own.
What is being distributed?
- Processing logic: Processing logic or processing elements
are distributed.
- Function: Various functions of a computer system could be
delegated to various pieces of hardware or software.
- Data: Data used by a number of applications may be
distributed to a number of processing sites.
- Control: The control of the execution of various tasks might
be distributed instead of being performed by one computer
system.
Why do we distribute at all?
- Distributed processing better corresponds to the
organizational structure of today’s widely distributed
enterprises, and such a system is more reliable and more
responsive.
- Many of the current applications of computer technology
are inherently distributed. Electronic commerce over the
internet, multimedia applications such as news-on-demand,
manufacturing control systems are all examples of such
applications.
- The fundamental reason behind distributed processing is to
be better able to solve the big and complicated problems
simply by dividing them into smaller pieces (using a variation
of the divide-and-conquer rule) and assigning them to
different software groups.
Advantages
Distributed computing provides an economical method of
harnessing more computing power by employing multiple
processing elements optimally.
By attacking these problems in smaller groups working more or
less autonomously, it might be possible to discipline the cost of
software development.
Distributed Database System
Distributed database: It is a collection of multiple, logically
interrelated databases distributed over a computer network.
Distributed database management system (Distributed DBMS):
A distributed DBMS is defined as the software system that
permits the management of the DDBS and makes the
distribution transparent to the users.
Fig. 1.6: Central database on a network (Not a DDBS)
Fig. 1.7: DDBS environment
What is a Distributed Database System?
• A distributed database system is a collection of databases which
are distributed over different computers of a computer network.
• Each site has autonomous processing capability and can perform
local applications.
• Each site also participates in the execution of at least one global
application which requires accessing data at several sites.
[Figure: Server 1, Server 2 and Server 3, each with a local database
(Database 1, Database 2, Database 3), connected by a communication
network]
Promises of DDBSs
1) Transparent management of distributed and replicated data
Transparency refers to separation of the higher level semantics of
a system from lower level implementation issues. A transparent
system hides the implementation details from users.
Consider an engineering firm that has offices in Boston,
Edmonton, Paris, and San Francisco. They run projects at each of
these sites and would like to maintain a database of their
employees, the projects and other related data.
- Assuming that the database is relational, we need the following
relations:
EMP(ENO, ENAME, TITLE)
PROJ(PNO, PNAME, BUDGET)
PAY(TITLE, SAL)
ASG(ENO, PNO, DUR, RESP)
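The text gives only the attribute names. As a concrete sketch, the
schema could be created with Python’s built-in sqlite3 module; the
column types below are assumptions, not part of the original example.

```python
import sqlite3

# Build the example schema in an in-memory SQLite database.
# Column types are assumptions; the text lists only attribute names.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMP  (ENO TEXT PRIMARY KEY, ENAME TEXT, TITLE TEXT);
    CREATE TABLE PROJ (PNO TEXT PRIMARY KEY, PNAME TEXT, BUDGET REAL);
    CREATE TABLE PAY  (TITLE TEXT PRIMARY KEY, SAL REAL);
    CREATE TABLE ASG  (ENO TEXT, PNO TEXT, DUR INTEGER, RESP TEXT,
                       PRIMARY KEY (ENO, PNO));
""")
```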
Fig. 1.8: A distributed application
- In a DDBS, we would like to localize the data such that data
about the employees in the Edmonton office are stored in
Edmonton, those in the Boston office are stored in Boston, and
so forth. The same applies to the project and salary information.
We partition each of the relations and store each partition at a
different site. This is known as fragmentation. It may be
preferable to duplicate some of this data at other sites for
performance and reliability reasons. The result is a distributed
database which is fragmented and replicated.
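A minimal sketch of this idea in Python, assuming each EMP tuple
carries a hypothetical CITY tag naming the office that owns it (the
schema above records no location): rows are partitioned into one
horizontal fragment per site, and each fragment is additionally
copied to a backup site.

```python
from collections import defaultdict

def fragment(rows, site_of):
    """Partition rows into disjoint horizontal fragments, one per site."""
    fragments = defaultdict(list)
    for row in rows:
        fragments[site_of(row)].append(row)
    return fragments

def replicate(fragments, backup_site):
    """Keep a second copy of each fragment at one extra site for reliability."""
    return {site: {site, backup_site} for site in fragments}

# Hypothetical EMP tuples, each tagged with the office ("CITY") that owns it.
emp_rows = [
    {"ENO": "E1", "ENAME": "J. Doe",   "TITLE": "Elect. Eng.", "CITY": "Boston"},
    {"ENO": "E2", "ENAME": "M. Smith", "TITLE": "Analyst",     "CITY": "Paris"},
]
frags = fragment(emp_rows, site_of=lambda r: r["CITY"])
print(replicate(frags, backup_site="Edmonton"))
```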
2) Reliability through distributed transactions
Distributed DBMSs are intended to improve reliability since they
have replicated components and thereby eliminate single points of
failure.
The failure of a single site, or the failure of a communication
link that makes one or more sites unreachable, is not sufficient
to bring down the entire system.
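The section does not prescribe a particular protocol; the classical
way a distributed transaction reaches a single commit-or-abort
outcome across sites is two-phase commit. A minimal coordinator
sketch, with hypothetical participant objects and without the
logging and failure handling a real protocol needs:

```python
# Minimal two-phase commit coordinator; `participants` are hypothetical
# objects representing the sites involved in one distributed transaction.
def two_phase_commit(participants):
    # Phase 1 (voting): ask every participant whether it can commit.
    if all(p.vote() for p in participants):
        # Phase 2 (decision): unanimous "yes" -> commit at every site.
        for p in participants:
            p.commit()
        return "committed"
    # Any "no" vote (or failed site) -> abort at every site.
    for p in participants:
        p.abort()
    return "aborted"
```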
3) Improved Performance
1. A DDBMS fragments the conceptual database, enabling data
to be stored in close proximity to its points of use (data
localization).
• Since each site handles only a portion of the database,
contention for CPU and I/O services is not as severe as
for centralized databases
• Localization reduces remote access
2. The inherent parallelism of distributed systems may be
exploited for inter-query and intra-query parallelism.
• Inter-query parallelism results from the ability to
execute multiple queries at the same time.
• On the other hand, intra-query parallelism is achieved
by breaking up a single query into a number of
subqueries each of which is executed at a different site,
accessing a different part of the distributed database.
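As an illustration of intra-query parallelism, the sketch below
sends the same selection to every site holding a fragment, runs the
subqueries concurrently, and unions the partial results; the site
names and fragment contents are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical horizontal fragments of EMP, one per site.
FRAGMENTS = {
    "Boston":   [{"ENO": "E1", "TITLE": "Analyst"}],
    "Edmonton": [{"ENO": "E3", "TITLE": "Analyst"},
                 {"ENO": "E4", "TITLE": "Programmer"}],
}

def subquery(site, predicate):
    """Run the selection locally at one site (here: an in-memory scan)."""
    return [row for row in FRAGMENTS[site] if predicate(row)]

def parallel_select(predicate):
    """Send the same subquery to every site and union the partial results."""
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda site: subquery(site, predicate), FRAGMENTS)
        return [row for part in parts for row in part]

print(parallel_select(lambda r: r["TITLE"] == "Analyst"))
```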
Potential Problems
Complexity: Distributed database systems are more complex than
centralized ones.
Cost: Distributed systems require additional hardware
(communication mechanisms etc.) and thus have increased hardware
costs.
Distribution of control: Distribution creates problems of
synchronization and coordination.
Security: In a distributed database system, a network is involved
which is a medium that has its own security requirements. Thus,
the security problems in distributed database systems are by
nature more complicated than in centralized ones.
Problem Areas
Distributed database design
Distributed query processing
Distributed directory management
Distributed concurrency control
Distributed deadlock management
Reliability of distributed DBMS
Operating system support
Heterogeneous databases
Relationship among problems
Distributed Database Design
To place the database and applications across different sites,
there are two alternatives: i) partitioned (or non-replicated) and
ii) replicated.
◦ Partitioned scheme: Database is divided into a number of
disjoint partitions each of which is placed at a different site.
◦ Replicated scheme: It can be fully replicated where the entire
database is stored at each site, or partially replicated where
each partition of the database is stored at more than one site
but not at all the sites.
Two fundamental design issues are fragmentation, the separation of
the database into partitions called fragments, and distribution,
the optimum placement of these fragments across the sites.
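The placement alternatives above can be contrasted with a small
sketch; the fragment and site names used here are hypothetical.

```python
# Hypothetical fragments and sites, used only to illustrate the alternatives.
SITES = ["S1", "S2", "S3"]
FRAGMENTS = ["EMP1", "EMP2", "PROJ1"]

def partitioned(fragments, sites):
    """Non-replicated: each fragment is placed at exactly one site."""
    return {f: [sites[i % len(sites)]] for i, f in enumerate(fragments)}

def fully_replicated(fragments, sites):
    """The entire database is stored at every site."""
    return {f: list(sites) for f in fragments}

def partially_replicated(fragments, sites, copies=2):
    """Each fragment is stored at more than one site, but not at all of them."""
    return {f: [sites[(i + k) % len(sites)] for k in range(copies)]
            for i, f in enumerate(fragments)}

print(partially_replicated(FRAGMENTS, SITES))
```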
Distributed Query Processing
Query processing deals with designing algorithms that analyze
queries and convert them into a series of data manipulation
operations. The problem is how to decide on a strategy for
executing each query over the network in the most cost-effective
way.
The objective is to choose an execution strategy in which the
inherent parallelism is exploited to improve the performance of
executing the transaction.
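One simple ingredient of such a strategy is comparing the estimated
communication cost of the alternatives, for example deciding which
operand of a cross-site join to ship; the relation sizes and the
per-byte cost below are hypothetical.

```python
def shipping_cost(nbytes, cost_per_byte=1.0):
    """Estimated cost of moving `nbytes` of data over the network."""
    return nbytes * cost_per_byte

def choose_join_strategy(size_emp, size_asg):
    """Pick the strategy that ships the cheaper (smaller) operand."""
    if shipping_cost(size_emp) < shipping_cost(size_asg):
        return "ship EMP to the site of ASG"
    return "ship ASG to the site of EMP"

print(choose_join_strategy(size_emp=400_000, size_asg=2_000_000))
```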
Distributed Directory Management
A directory contains information (such as descriptions and
locations) about data items in the database.
A directory may be global to the entire DDBS or local to each site;
it can be centralized at one site or distributed over several sites.
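A global, centralized directory might be sketched as a mapping from
fragment names to a description and the sites holding a copy; the
entries below, including the owning-office predicates, are
hypothetical.

```python
# Hypothetical global directory: for each fragment, a description and the
# sites that store a copy of it.
DIRECTORY = {
    "EMP1": {"relation": "EMP", "predicate": "owning office = Boston",
             "sites": ["Boston"]},
    "EMP2": {"relation": "EMP", "predicate": "owning office = Edmonton",
             "sites": ["Edmonton", "Paris"]},
}

def lookup(fragment_name):
    """Return the sites storing a fragment, so a query can be routed there."""
    return DIRECTORY[fragment_name]["sites"]

print(lookup("EMP2"))
```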
Distributed Concurrency Control
Concurrency control involves the synchronization of accesses to
the distributed database, such that the integrity of the database is
maintained.
We have to worry about both the integrity of a single database,
and about the consistency of multiple copies of the database
(mutual consistency).
Two classes of solutions are pessimistic and optimistic.
Pessimistic approaches synchronize the execution of user requests
before the execution starts, whereas optimistic approaches execute
the requests and then check whether the execution has compromised
the consistency of the database.
Locking, which is based on the mutual exclusion of accesses to data
items, can be used in both classes. Timestamping ensures that
transactions execute according to a predefined order.
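A minimal sketch of locking-based (pessimistic) synchronization at a
single site, using one mutex per data item for mutual exclusion;
deadlock handling, shared locks, and the distributed aspects are
omitted.

```python
import threading
from collections import defaultdict

class LockManager:
    """One exclusive lock per data item (mutual exclusion); deadlock
    detection and shared/read locks are omitted for brevity."""

    def __init__(self):
        self._locks = defaultdict(threading.Lock)

    def lock(self, item):
        # Block until the caller holds the lock on `item`.
        self._locks[item].acquire()

    def unlock(self, item):
        self._locks[item].release()

lm = LockManager()
lm.lock("EMP:E1")    # a transaction gains exclusive access to the item
# ... read/write the item ...
lm.unlock("EMP:E1")  # release so other transactions can proceed
```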