Distributed Database System
Distributed database system (DDBS) technology is the union of what appear to be two
diametrically opposed approaches to data processing: database system and computer
network technologies.
What is being distributed:
1. Data
2. Processing logic
3. Functionality
4. Software
Why is distribution important:
1. Reliability
2. Most organizations are distributed
3. Most applications are distributed
4. Better data management
pull-only:
In the pull-only mode of data delivery, the transfer of data from servers to clients is
initiated by a client pull.
Push-only:
In the push-only mode of data delivery, the transfer of data from servers to clients is
initiated by a server push in the absence of any specific request from clients.
The main difficulty of the push-based approach is in deciding which data would be of
common interest, and when to send them to clients – alternatives are periodic, irregular,
or conditional. Thus, the usefulness of server push depends heavily upon the accuracy of
a server to predict the needs of clients.
Hybrid:
The hybrid mode of data delivery combines the client-pull and server-push mechanisms. The continuous (or continual) query approach is one example of this mode: an initial client request (pull) registers the query at the server, which then pushes updated results to the client as the underlying data changes.
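A minimal sketch of these three delivery modes, assuming a toy in-memory server (the Server/Client classes and method names below are purely illustrative, not part of any real DDBS API):

# Toy illustration of pull, push, and hybrid data delivery.
class Server:
    def __init__(self):
        self.data = {"stock_price": 100}
        self.subscribers = []            # clients registered for push/hybrid delivery

    def pull(self, key):                 # pull-only: the client initiates the transfer
        return self.data[key]

    def subscribe(self, client, key):    # hybrid: an initial pull also registers interest
        self.subscribers.append((client, key))
        return self.data[key]

    def update(self, key, value):        # push: the server sends data without any request
        self.data[key] = value
        for client, k in self.subscribers:
            if k == key:
                client.receive(k, value)

class Client:
    def receive(self, key, value):
        print(f"pushed update: {key} = {value}")

server, client = Server(), Client()
print(server.pull("stock_price"))               # pull-only access
print(server.subscribe(client, "stock_price"))  # hybrid: pull plus registration
server.update("stock_price", 101)               # server push to subscribed clients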
Promises of DDBSs:
There are four fundamental advantages, which may also be viewed as the promises of DDBS technology:
1. Transparent management of distributed and replicated data.
2. Reliable access to data through distributed transactions.
3. Improved performance.
4. Easier system expansion.
Transparent Management of Distributed and Replicated Data:
Distribution transparency is the property of distributed databases by virtue of
which the internal details of the distribution are hidden from the users. The DDBMS
designer may choose to fragment tables, replicate the fragments and store them at
different sites.
Data Independence:
Data independence is the type of data transparency that matters for a centralized
DBMS. It refers to the immunity of user applications to changes made in the definition
and organization of data. Application programs should not, ideally, be exposed to
details of data representation and storage.
Network Transparency:
Network transparency is the process of sending or accessing data over a network in
such a way that the information is not visible to users communicating with a local or
remote host, system, network or software. It can provide remote data and computing
resources to a local user without providing intermediate network information.
Replication Transparency:
Replication transparency is the ability to create multiple copies of objects without any
effect of the replication seen by applications that use the objects. It should not be
possible for an application to determine the number of replicas, or to be able to see the
identities of specific replica instances.
Fragmentation Transparency:
Fragmentation transparency enables users to query upon any table as if it were
unfragmented. Thus, it hides the fact that the table the user is querying on is actually a
fragment or union of some fragments. It also conceals the fact that the fragments are
located at diverse sites.
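A minimal sketch of what fragmentation transparency means in practice, assuming a toy mediator that hides two horizontal fragments of an EMP table behind a single logical name (the table, fragments, and data below are illustrative):

# Toy mediator: the user queries "EMP" as if it were unfragmented, while the
# data actually lives in two horizontal fragments stored at different sites.
SITE_A_EMP1 = [("E1", "J. Doe", "Elect. Eng."), ("E2", "M. Smith", "Syst. Anal.")]
SITE_B_EMP2 = [("E3", "A. Lee", "Mech. Eng."), ("E4", "J. Miller", "Programmer")]

CATALOG = {"EMP": [SITE_A_EMP1, SITE_B_EMP2]}   # logical table -> its fragments

def query(table, predicate):
    # The union of the fragments reconstructs the original relation,
    # so the user never sees the split or the sites.
    rows = []
    for fragment in CATALOG[table]:
        rows.extend(r for r in fragment if predicate(r))
    return rows

# The query is written against EMP, with no mention of EMP1/EMP2 or their sites.
print(query("EMP", lambda r: r[2] == "Programmer"))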
Who Should Provide Transparency?
A DDBMS that supports local mapping transparency requires that the user or
programmer specify both the name and location of data fragments stored at remote
locations. Having to specify data locations is undesirable, but this is complicated further
when data is moved from one location to another.
Reliability Through Distributed Transactions:
Distributed database systems are potentially more reliable since there are multiple
copies of each system component, which eliminates single points-of-failure, and data
may be replicated to ensure that access can be provided in case of system failures.
Improved Performance:
The case for the improved performance of distributed DBMSs is typically made based
on two points. First, a distributed DBMS fragments the conceptual database, enabling
data to be stored in close proximity to its points of use (also called data localization).
This has two potential advantages:
1. Since each site handles only a portion of the database, contention for CPU and
I/O services is not as severe as for centralized databases.
2. Localization reduces remote access delays that are usually involved in wide area
networks (for example, the minimum round-trip message propagation delay in
satellite-based systems is about 1 second).
Distributed Database Design:
Definition. Distributed database design refers to the following problem: given a
database and its workload, how should the database be split and allocated to sites so as
to optimize a certain objective function (e.g., to minimize the resource consumption in
processing the query workload).
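As a rough illustration of such an objective function, the sketch below assigns each fragment to the site that accesses it most often and counts the remote accesses that remain; the frequencies and the greedy rule are toy assumptions, not the actual design algorithm:

# Toy allocation: place each fragment at the site with the highest access
# frequency for it, so that most of the workload becomes local.
access_freq = {                          # access_freq[fragment][site] = accesses per period
    "EMP1": {"site1": 20, "site2": 5},
    "EMP2": {"site1": 2,  "site2": 30},
}

def allocate(freq):
    return {frag: max(sites, key=sites.get) for frag, sites in freq.items()}

def remote_accesses(freq, placement):
    return sum(f for frag, sites in freq.items()
                 for site, f in sites.items() if site != placement[frag])

placement = allocate(access_freq)
print(placement)                                # {'EMP1': 'site1', 'EMP2': 'site2'}
print(remote_accesses(access_freq, placement))  # 7 remote accesses remain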
What are the advantages of a distributed database?
Distributed databases essentially bring the advantages of distributed computing to the database management domain:
1. Management of data with different levels of transparency
2. Increased reliability and availability
3. Easier expansion
4. Improved performance
What is distributed database architecture?
A distributed database system allows applications to access data from local and
remote databases. In a homogeneous distributed database system, each database
is an Oracle Database. In a heterogeneous distributed database system, at least
one of the databases is not an Oracle Database.
Distributed Directory Management:
What is distributed directory management?
A distributed directory is a directory environment in which data is partitioned across
multiple directory servers. A distributed directory environment consists of a collection of machines, including Relational Database Management (RDBM) servers and proxy servers that manage the topology.
A distributed database is a database that consists of two or more files located in
different sites either on the same network or on entirely different networks. Portions of
the database are stored in multiple physical locations and processing is distributed
among multiple database nodes.
The distributed database can have the data arranged according to different levels of
transparency, i.e., data with different transparency levels can be stored at different
locations. The database can be stored according to the departmental information in an
organisation.
Advantages of DDBMS
Disadvantages of DDBMS
ANSI/SPARC Architecture:
What is the ANSI/SPARC database architecture?
The ANSI/SPARC architecture organizes a database system into three levels: the external level (individual user views), the conceptual level (the enterprise-wide logical schema), and the internal level (the physical storage structures). The separation of these levels is what provides data independence.
Peer-to-Peer Systems:
The relation schemas for this database can be defined as follows:
EMP(ENO, ENAME, TITLE, SAL, PNO, RESP, DUR)
PROJ(PNO, PNAME, BUDGET)
In relation scheme EMP there are seven attributes: ENO, ENAME, TITLE, SAL, PNO, RESP, DUR. The values of ENO are drawn from the domain of all valid employee numbers, the values of ENAME from the domain of all valid employee names, and so on.
Normalization:
Normalization is the process of organizing data in a database. This includes creating
tables and establishing relationships between those tables according to rules designed
both to protect the data and to make the database more flexible by eliminating
redundancy and inconsistent dependency.
The following four problems might exist in a relation scheme:
1. Repetition anomaly. Certain information may be repeated unnecessarily.
Consider, for example, the EMP relation in Figure 2.2. The name, title, and
salary of an employee are repeated for each project on which this person serves.
This is obviously a waste of storage and is contrary to the spirit of databases.
2. Update anomaly. As a consequence of the repetition of data, performing
updates may be troublesome. For example, if the salary of an employee changes,
multiple tuples have to be updated to reflect this change.
3. Insertion anomaly. It may not be possible to add new information to the
database. For example, when a new employee joins the company, we cannot add
personal information (name, title, salary) to the EMP relation unless an
appointment to a project is made. This is because the key of EMP includes the
attribute PNO, and null values cannot be part of the key.
4. Deletion anomaly. This is the converse of the insertion anomaly. If an
employee works on only one project, and that project is terminated, it is not
possible to delete the project information from the EMP relation. To do so would
result in deleting the only tuple about the employee, thereby resulting in the loss
of personal information we might want to retain.
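One possible way to remove these anomalies, sketched below for illustration (the ASG relation name and the exact split are assumptions based on the usual normalization of this example), is to store employee facts once and keep project assignments in a separate relation:

# Illustrative decomposition of EMP: employee data is stored once, and
# project assignments are kept in their own relation keyed by (ENO, PNO).
EMP = [  # (ENO, ENAME, TITLE, SAL) -- one tuple per employee, no repetition
    ("E1", "J. Doe", "Elect. Eng.", 40000),
    ("E2", "M. Smith", "Syst. Anal.", 34000),
]
ASG = [  # (ENO, PNO, RESP, DUR) -- one tuple per project assignment
    ("E1", "P1", "Manager", 12),
    ("E2", "P1", "Analyst", 24),
    ("E2", "P2", "Analyst", 6),
]

# Update anomaly gone: changing a salary touches exactly one EMP tuple.
EMP = [(e, n, t, 36000 if e == "E2" else s) for (e, n, t, s) in EMP]

# Insertion anomaly gone: a new employee can be added before any assignment exists.
EMP.append(("E5", "A. New", "Programmer", 28000))
print(EMP)
print(ASG)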
Relational Data Languages:
Data manipulation languages developed for the relational model (commonly called
query languages) fall into two fundamental groups: relational algebra languages and
relational calculus languages. The difference between them is based on how the user
query is formulated.
Relational Algebra:
Relational algebra consists of a set of operators that operate on relations. Each operator
takes one or two relations as operands and produces a result relation, which, in turn,
may be an operand to another operator. These operations permit the querying and
updating of a relational database.
Selection:
Selection produces a horizontal subset of a given relation. The subset consists of all the
tuples that satisfy a formula (condition). The selection from a relation R according to a formula F is denoted σF(R).
Projection:
Projection produces a vertical subset of a relation. The result relation contains only
those attributes of the original relation over which projection is performed. Thus the
degree of the result is less than or equal to the degree of the original relation. The
projection of relation R over attributes A and B is denoted ΠA,B(R).
Union:
The union of two relations R and S (denoted as R ∪ S) is the set of all tuples that are in
R, or in S, or in both. We should note that R and S should be union compatible. As in
the case of projection, the duplicate tuples are normally eliminated. Union may be used
to insert new tuples into an existing relation, where these tuples form one of the operand
relations.
Set Difference:
The set difference of two relations R and S (R− S) is the set of all tuples that are in R but
not in S. In this case, not only should R and S be union compatible, but the operation is
also asymmetric (i.e., R − S ≠ S − R). This operation allows the deletion of tuples from a relation. Together with the union operation, we can perform modification of tuples by deletion followed by insertion.
Cartesian Product:
The Cartesian product of two relations R of degree k1 and S of degree k2 is the set of
(k1 +k2)-tuples, where each result tuple is a concatenation of one tuple of R with one
tuple of S, for all tuples of R and S. The Cartesian product of R and S is denoted as
R×S.
Intersection:
Intersection of two relations R and S (R ∩ S) consists of the set of all tuples that are in
both R and S. In terms of the basic operators, it can be specified as follows: R ∩ S = R − (R − S).
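A quick check of this identity, using Python sets as stand-ins for relations (the tuples are illustrative):

# Intersection expressed through set difference: R ∩ S = R − (R − S).
R = {("E1", "J. Doe"), ("E2", "M. Smith"), ("E3", "A. Lee")}
S = {("E2", "M. Smith"), ("E3", "A. Lee"), ("E4", "J. Miller")}

assert R & S == R - (R - S)   # the identity holds
print(R - (R - S))            # {('E2', 'M. Smith'), ('E3', 'A. Lee')}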
θ-Join:
Join is a derivative of Cartesian product. There are various forms of join; the primary
classification is between inner join and outer join. We first discuss inner join and its
variants and then describe outer join. The most general type of inner join is the θ-join.
The θ-join of two relations R and S according to a formula F is denoted as R ⋈F S.
Semijoin:
The semijoin of relation R, defined over the set of attributes A, by relation S, defined
over the set of attributes B, is the subset of the tuples of R that participate in the join of
R with S. It is denoted as R ⋉F S (where F is a predicate as defined before).
Division:
The division of relation R of degree r with relation S of degree s (where r > s and s ≠ 0) is the set of (r − s)-tuples t such that for all s-tuples u in S, the tuple tu is in R. The
division operation is denoted as R÷S and can be specified in terms of the fundamental
operators.
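As an illustration of this definition (not taken from the text), here is a toy Python implementation that treats relations as sets of tuples and assumes the dividing attributes are the trailing ones:

# Division R ÷ S: the result is the set of (r − s)-tuples t such that for every
# s-tuple u in S, the concatenated tuple t + u appears in R.
def divide(R, S):
    r = len(next(iter(R)))
    s = len(next(iter(S)))
    candidates = {t[: r - s] for t in R}     # project R on its first r − s attributes
    return {t for t in candidates if all(t + u in R for u in S)}

# Which employees (ENO) are assigned to *every* project listed in S?
R = {("E1", "P1"), ("E1", "P2"), ("E2", "P1"), ("E3", "P1"), ("E3", "P2")}
S = {("P1",), ("P2",)}
print(divide(R, S))                          # {('E1',), ('E3',)}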
Relational Calculus:
In relational calculus-based languages, instead of specifying how to obtain the result,
one specifies what the result is by stating the relationship that is supposed to hold for
the result. Relational calculus languages fall into two groups: tuple relational calculus
and domain relational calculus.
Tuple relational calculus:
The primitive variable used in tuple relational calculus is a tuple variable which
specifies a tuple of a relation. In other words, it ranges over the tuples of a relation. In tuple relational calculus, queries are specified as {t|F(t)}, where t is a tuple variable and
F is a well-formed formula. The atomic formulas are of two forms:
1. Tuple-variable membership expressions. If t is a tuple variable ranging over the
tuples of relation R (predicate symbol), the expression “tuple t belongs to
relation R” is an atomic formula, which is usually specified as R.t or R(t).
2. Conditions. These can be defined as follows:
(a) s[A]θt[B], where s and t are tuple variables and A and B are
components of s and t, respectively. θ is one of the arithmetic comparison operators <, >, =, ≠, ≤, and ≥. This condition specifies
that component A of s stands in relation θ to the B component of t:
for example, s[SAL] > t[SAL].
(b) s[A]θc, where s, A, and θ are as defined above and c is a constant. For example, s[ENAME] = “Smith”.
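For instance (an illustrative query over the EMP schema given earlier), “retrieve the employees whose title is Programmer” can be written as {t | EMP(t) ∧ t[TITLE] = “Programmer”}.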
Communication Schemes:
In terms of the physical communication schemes employed, networks can be either
point-to-point (also called unicast) networks, or broadcast (sometimes also called multi-
point) networks.
Top-Down Design Process:
The top-down approach aims to structure the creation process of a distributed
database. By defining and separating the construction stages correctly, the database architects and other people involved in building a distributed database have a better chance of success in a given project.
The requirements document is input to two parallel activities:
(1) view design:
The view design activity deals with defining the interfaces for end users.
(2) conceptual design:
The conceptual design, on the other hand, is the process by which the enterprise is
examined to determine entity types and relationships among these entities. One can
possibly divide this process into two related activity groups: entity analysis and
functional analysis.
Entity analysis:
Entity analysis is concerned with determining the entities, their attributes, and the
relationships among them.
Functional analysis:
Functional analysis, on the other hand, is concerned with determining the fundamental
functions with which the modeled enterprise is involved. The results of these two steps
need to be cross-referenced to get a better understanding of which functions deal with
which entities.
Fragments:
Rather than distributing relations, it is quite common to divide them into subrelations,
called fragments, which are then distributed.
The distribution design activity consists of two steps:
(1) fragmentation: Fragmentation is a process of dividing the whole or full
database into various subtables or sub relations so that data can be stored in
different systems. The small pieces of sub relations or subtables are
called fragments. These fragments are called logical data units and are stored at
various sites. It must be made sure that the fragments are such that they can be
used to reconstruct the original relation (i.e., there is no loss of data).
(2) allocation: Allocation is the process of assigning each fragment (or each copy of a replicated fragment) to a site of the distributed system. The goal is to place the fragments so that most accesses are local, while keeping storage, communication, and update-propagation costs acceptable.
Advantages:
1. As the data is stored close to the usage site, the efficiency of the database system increases.
2. Local query optimization methods are sufficient for some queries, since the data is available locally.
3. Fragmentation helps maintain the security and privacy of the database system.
Disadvantages:
1. Access times may be very high when data from different fragments must be combined.
2. Recursive fragmentation can be very expensive.
Reconstruction:
It should be possible to define a relational operator g such that the original relation can be reconstructed from its fragments:
R = g(R1, R2, …, Rn)
Reconstruction ensures that the dependencies of the original relation are preserved by the fragmentation. The reconstruction operator is union for horizontal fragmentation and join for vertical fragmentation.
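A minimal sketch of both reconstruction rules, using Python collections as stand-in relations (the schemas and values are illustrative):

# Horizontal fragments are reconstructed by the union of their tuples.
EMP1 = {("E1", "J. Doe"), ("E2", "M. Smith")}
EMP2 = {("E3", "A. Lee")}
EMP_horizontal = EMP1 | EMP2                            # R = R1 ∪ R2

# Vertical fragments share the key and are reconstructed by joining on it.
EMP_A = {("E1", "J. Doe"), ("E2", "M. Smith")}          # (ENO, ENAME)
EMP_B = {("E1", "Elect. Eng."), ("E2", "Syst. Anal.")}  # (ENO, TITLE)
EMP_vertical = {(eno, name, title)
                for (eno, name) in EMP_A
                for (eno2, title) in EMP_B if eno == eno2}
print(EMP_horizontal)
print(EMP_vertical)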
Disjointness:
If a data item d appears in fragment Ri, then it should not appear in any other fragment Rj (j ≠ i). For vertical fragmentation, the primary key attributes are the exception, since they must be repeated in every fragment to allow reconstruction.
Information Requirements:
The information needed for distribution design can be divided into four categories:
database information, application information, communication network information,
and computer system information. The latter two categories are completely quantitative
in nature and are used in allocation models rather than in fragmentation algorithms.
We do not consider them in detail here. Instead, the detailed information requirements
of the fragmentation and allocation algorithms are discussed in their respective sections.
Horizontal Fragmentation:
As we explained earlier, horizontal fragmentation partitions a relation along its tuples.
Thus each fragment has a subset of the tuples of the relation.
In terms of quantitative information about user applications, we need to have two sets
of data:
1. Minterm selectivity: number of tuples of the relation that would be
accessed by a user query specified according to a given minterm
predicate. For example, the selectivity of m1 of Example 3.6 is 0 since
there are no tuples in PAY that satisfy the minterm predicate. The
selectivity of m2, on the other hand, is 0.25, since one of the four tuples in PAY satisfies m2. We denote the selectivity of a minterm mi as sel(mi).
2. Access frequency: frequency with which user applications access data.
If Q = {q1, q2, ..., qq} is a set of user queries, acc(qi) indicates the access
frequency of query qi in a given period. Note that minterm access
frequencies can be determined from the query frequencies. We refer to
the access frequency of a minterm mi as acc(mi).
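A toy illustration of these two quantities (the PAY tuples and the minterm predicates below are assumptions modeled loosely on the running example, not the book's exact figures):

# Toy computation of a minterm selectivity and an access-frequency table.
PAY = [("Elect. Eng.", 40000), ("Syst. Anal.", 34000),
       ("Mech. Eng.", 27000), ("Programmer", 24000)]

m1 = lambda title, sal: sal <= 30000 and sal > 40000   # contradictory: no tuple can satisfy it
m2 = lambda title, sal: sal > 34000                    # satisfied by one of the four tuples

def sel(m, relation):
    # Selectivity: fraction of the relation's tuples that satisfy minterm m.
    return sum(1 for t in relation if m(*t)) / len(relation)

print(sel(m1, PAY))   # 0.0
print(sel(m2, PAY))   # 0.25

# Access frequencies are simply given per query for the observation period.
acc = {"q1": 20, "q2": 5}   # e.g., query q1 runs 20 times per period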
PHORIZONTAL Algorithm:
Input: R: relation; Pr: set of simple predicates
Output: M: set of minterm fragments
begin
   Pr′ ← COM_MIN(R, Pr) ;
   determine the set M of minterm predicates ;
   determine the set I of implications among pi ∈ Pr′ ;
   foreach mi ∈ M do
      if mi is contradictory according to I then
         M ← M − mi
   end foreach
end
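A rough Python sketch of the same idea, with COM_MIN and the implication check heavily simplified (simple predicates are modeled as numeric ranges over a single attribute purely for illustration; this is not the full algorithm):

from itertools import product

# Sketch of PHORIZONTAL: form every minterm predicate (each simple predicate
# taken either positively or negated) and discard the contradictory ones.
def phorizontal(simple_predicates):
    minterms = []
    for signs in product([True, False], repeat=len(simple_predicates)):
        minterm = list(zip(simple_predicates, signs))
        if not contradictory(minterm):
            minterms.append(minterm)
    return minterms

def contradictory(minterm):
    # Very crude implication check: a minterm over one numeric attribute is
    # contradictory if the positively asserted ranges have an empty intersection.
    low, high = float("-inf"), float("inf")
    for (attr, lo, hi), positive in minterm:
        if positive:
            low, high = max(low, lo), min(high, hi)
        # negated predicates are ignored in this toy check
    return low > high

# Simple predicates on SAL: "SAL <= 30000" and "SAL > 40000" cannot both hold.
Pr = [("SAL", float("-inf"), 30000), ("SAL", 40001, float("inf"))]
for minterm in phorizontal(Pr):
    print(minterm)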