DDB-distribution Database Important.
UNIT - I
------------------------------------------------------------------------------------------------
A database is an ordered collection of related data that is built for a specific purpose. A
database may be organized as a collection of multiple tables, where a table represents a real
world element or entity. Each table has several different fields that represent the characteristic
features of the entity.
For example, a company database may include tables for projects, employees, departments,
products, and financial records. The fields in the Employee table may be Name,
Company_Id, Date_of_Joining, and so forth.
A database management system (DBMS) is a collection of programs that enables creation and
maintenance of a database. A DBMS is available as a software package that facilitates
definition, construction, manipulation and sharing of data in a database. Definition of a
database includes description of the structure of the database. Construction of a database
involves actually storing the data on a storage medium. Manipulation refers to
retrieving information from the database, updating the database and generating reports.
Sharing allows the data to be accessed by different users or programs.
Examples of DBMS Application Areas
Automatic Teller Machines
Train Reservation System
Employee Management System
Student Information System
Examples of DBMS Packages
MySQL
Oracle
SQL Server
dBASE
FoxPro
PostgreSQL, etc.
Distributed Database
A distributed database is a set of interconnected databases that is distributed over a
computer network or the internet. A Distributed Database Management System (DDBMS)
manages the distributed database and provides mechanisms to make the distribution transparent
to the users. In these systems, data is intentionally distributed among multiple nodes so that
all computing resources of the organization can be optimally used.
A distributed database is a collection of multiple interconnected databases, which are
spread physically across various locations that communicate via a computer network.
Features
Databases in the collection are logically interrelated with each other. Often,
they represent a single logical database.
Data is physically stored across multiple sites. Data in each site can be managed by
a DBMS independent of the other sites.
The processors in the sites are connected via a network. They do not have any
multiprocessor configuration.
A distributed database is not a loosely connected file system.
A distributed database incorporates transaction processing, but it is not synonymous
with a transaction processing system.
[Figure: a distributed database compared with a centralized database]
First, we introduce a framework for the design of distributed databases, by stressing what should be
designed. We also indicate the objectives of the design of data distribution, and we present a top-down and a
bottom-up approach. In the rest of the chapter, we will concentrate on the top-down approach.
The distribution of the database adds to the above problems two new ones:
3. Designing the fragmentation, i.e., determining how global relations are subdivided into horizontal,
vertical, or mixed fragments.
4. Designing the allocation of fragments, i.e., determining how fragments are mapped to physical images; in
this way, also the replication of fragments is determined.
These two problems fully characterize the design of data distribution.
In the design of a distributed database, sufficiently precise knowledge of application requirements is
needed; clearly, this knowledge is required only for the more "important" applications, i.e., those
which will be executed frequently or whose performance is critical. In the application requirements
we include:
1. The site from which the application is issued (also called site of origin of the application).
2. The frequency of activation of the application (i.e., the number of activation requests in the unit time); in
the general case of applications which can be issued at multiple sites, we need to know the frequency of
activation of each application at each site.
3. The number, type, and statistical distribution of accesses made by each application to each required
data "object."
Bottom-Up Approach
The design of fragmentation is the first problem that must be solved in the top-down design of data distribution. The
purpose of fragmentation design is to determine non-overlapping fragments which are "logical units of allocation," i.e., that
are appropriate starting points for the subsequent data allocation problem.
Horizontal Fragmentation
There are two types of horizontal fragmentation, called primary and derived.
Primary fragmentation: primary horizontal fragments are defined using selections on global relations; the correctness
of primary fragmentation requires that each tuple of the global relation be selected in one and only one fragment.
Let R be the global relation for which we want to produce a horizontal primary fragmentation. We introduce the following
definitions:
A simple predicate is a predicate of the type: Attribute comparison_operator value (e.g., RollNo = 1).
A minterm predicate y for a set P of simple predicates is the conjunction of all predicates appearing in P, either
taken in natural form or negated, provided that this expression is not a contradiction. Thus,
$y = \bigwedge_{p_i \in P} p_i^{*}$, where each $p_i^{*}$ is either $p_i$ or $\lnot p_i$, and $y \neq \text{false}$.
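The definition of minterm predicates can be illustrated with a small Python sketch. The relation R, its Dept attribute, and the two simple predicates are made-up sample data, not taken from the notes. The sketch enumerates every combination of natural and negated predicates; it does not detect contradictions symbolically, so a contradictory or unused minterm simply selects no tuple.

```python
from itertools import product

# Hypothetical sample tuples of a global relation R(RollNo, Dept).
R = [
    {"RollNo": 1, "Dept": "CS"},
    {"RollNo": 2, "Dept": "CS"},
    {"RollNo": 3, "Dept": "EE"},
]

# Set P of simple predicates, each given as (label, test function).
P = [
    ("RollNo = 1", lambda t: t["RollNo"] == 1),
    ("Dept = 'CS'", lambda t: t["Dept"] == "CS"),
]

def minterm_fragments(relation, predicates):
    """Map each minterm (as a text label) to the tuples it selects."""
    fragments = {}
    # Each predicate is taken in natural form (True) or negated (False).
    for signs in product([True, False], repeat=len(predicates)):
        label = " AND ".join(
            name if keep else f"NOT({name})"
            for (name, _), keep in zip(predicates, signs)
        )
        fragments[label] = [
            t for t in relation
            if all(test(t) == keep for (_, test), keep in zip(predicates, signs))
        ]
    return fragments

fragments = minterm_fragments(R, P)
for label, rows in fragments.items():
    print(label, "->", rows)
```

Because every tuple makes each simple predicate either true or false, it satisfies exactly one minterm, which is the correctness condition stated above for primary fragmentation.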
A good distribution design improves the performance of application processing in distributed database systems
by reducing the communication cost incurred during application execution.
Fragments are not properly modeled as individual files, because such a model
does not consider the fact that they have the same structure or behavior.
There are many more fragments than original global relations, and many analytic models cannot compute the
solution of problems involving too many variables.
Modeling application behavior in file systems is very simple, while in distributed databases applications can make a
sophisticated use of data.
The "additional replication" method can be used for replicated allocation. Here, di denotes the degree of redundancy of Ri
and Fi denotes the benefit of having Ri fully replicated at each site.
Vertical fragmentation:
Here we measure the benefit of vertically partitioning a fragment Ri, allocated at site r, into two
fragments Rs and Rt, allocated at sites s and t. By the effect of this partitioning:
1. There are two sets As and At of applications, issued at sites s or t, which use only attributes of Rs or Rt
and become local to sites s and t, respectively; these applications save one remote reference.
2. There is a set A1 of applications formerly local to r which use only attributes of Rs or Rt; these
applications now need to make an additional remote reference.
3. There is a set A2 of applications formerly local to r which reference attributes of both Rs and Rt; these
applications make two additional remote references.
4. There is a set A3 of applications at sites different from r, s, or t which reference attributes of both Rs and
Rt; these applications make one additional remote reference.
We evaluate the benefit of this partitioning, Bist, as the remote references saved by the applications in As and At
minus the additional remote references made by the applications in A1, A2, and A3.
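The accounting of saved and additional remote references in the cases above can be sketched in Python. This is only an illustration under stated assumptions: the activation frequencies are made up, each activation is assumed to make a single reference to the fragment, and the function name partition_benefit is hypothetical. It is not the exact expression used in the source text.

```python
# Hypothetical application groups. Each number is the activation frequency
# (requests per unit time) of one application in that group.
groups = {
    "As": [10, 5],  # become local to site s: save one remote reference
    "At": [8],      # become local to site t: save one remote reference
    "A1": [4],      # formerly local to r, use only Rs or Rt: one extra reference
    "A2": [2],      # formerly local to r, use both Rs and Rt: two extra references
    "A3": [3],      # issued elsewhere, use both Rs and Rt: one extra reference
}

# Remote references saved (+) or added (-) per activation in each group.
weights = {"As": +1, "At": +1, "A1": -1, "A2": -2, "A3": -1}

def partition_benefit(groups, weights):
    """Net remote references saved per unit time by the partitioning."""
    return sum(weights[name] * sum(freqs) for name, freqs in groups.items())

benefit = partition_benefit(groups, weights)
print(benefit)  # positive means the partitioning saves remote references
```

With these sample figures the groups that become local outweigh the groups that lose locality, so the partitioning would be considered beneficial.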
Vertical clustering:
We measure the benefit of the vertical clustering of a fragment Ri, allocated at site r, into two fragments Rs
and Rt, allocated at sites s and t, with overlapping attributes I. The clustering requires reconsidering the
groups of applications introduced for vertical partitioning:
1. As includes applications which are local to site s because they either:
• Read any attribute of Rs, or
• Update attributes of Rs which are not in the overlapping part I.
The same holds for At.
2. A2 includes update applications formerly local to r which make an update to attributes of I, since now
they need to access both Rs and Rt.
3. A3 includes the applications at sites different from r, s, or t which update attributes of I, which also need
to access both Rs and Rt.
We evaluate the benefit of this clustering using the above expression for Bist.
DESIGN ALTERNATIVES
The distribution design alternatives for the tables in a DDBMS are as follows −
Non-replicated and non-fragmented
Fully replicated
Partially replicated
Fragmented
Mixed
Non-replicated & Non-fragmented
In this design alternative, different tables are placed at different sites. Data is placed so that it
is in close proximity to the site where it is used most. It is most suitable for database
systems where the percentage of queries that need to join information in tables placed at
different sites is low. If an appropriate distribution strategy is adopted, then this design
alternative helps to reduce the communication cost during data processing.
Fully Replicated
In this design alternative, one copy of all the database tables is stored at each site. Since
each site has its own copy of the entire database, queries are very fast and require negligible
communication cost. On the other hand, the massive redundancy in data makes update
operations very costly. Hence, this is suitable for systems where a large number of
queries must be handled whereas the number of database updates is low.
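The trade-off between cheap queries and costly updates can be made concrete with a rough count of remote accesses. The sketch below rests on simplifying assumptions that are not in the notes: it considers one site that, in the non-replicated case, does not hold the table, and it counts one remote access per remote copy touched. The function name and the workload figures are hypothetical.

```python
def remote_accesses(num_sites, reads, updates, fully_replicated):
    """Count remote accesses issued by one site under the stated assumptions."""
    if fully_replicated:
        # Reads are served locally; each update must reach the other copies.
        return updates * (num_sites - 1)
    # Single copy stored at another site: every read and update goes remote.
    return reads + updates

read_heavy = dict(num_sites=5, reads=1000, updates=10)
write_heavy = dict(num_sites=5, reads=10, updates=1000)

for name, load in [("read-heavy", read_heavy), ("write-heavy", write_heavy)]:
    print(name,
          remote_accesses(**load, fully_replicated=True),
          remote_accesses(**load, fully_replicated=False))
```

Under these assumptions full replication wins clearly for the read-heavy workload and loses for the write-heavy one, which matches the suitability statement above.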
Partially Replicated
Copies of tables or portions of tables are stored at different sites. The distribution of the
tables is done in accordance with the frequency of access. This takes into consideration the
fact that the frequency of accessing the tables varies considerably from site to site. The
number of copies of the tables (or portions) depends on how frequently the access queries
execute and on the sites which generate the access queries.
Fragmented
In this design, a table is divided into two or more pieces referred to as fragments or partitions,
and each fragment can be stored at different sites. This considers the fact that it seldom
happens that all data stored in a table is required at a given site. Moreover, fragmentation
increases parallelism and provides better disaster recovery. Here, there is only one copy of
each fragment in the system, i.e. no redundant data.
The three fragmentation techniques are −
Vertical fragmentation
Horizontal fragmentation
Hybrid fragmentation
Mixed Distribution
This is a combination of fragmentation and partial replication. Here, the
tables are initially fragmented in any form (horizontal or vertical), and then these fragments
are partially replicated across the different sites according to the frequency of accessing the
fragments.
Design Strategies
In the last chapter, we introduced different design alternatives. In this chapter, we will
study the strategies that aid in adopting the designs. The strategies can be broadly divided
into replication and fragmentation. However, in most cases, a combination of the two is
used.
Data Replication
Data replication is the process of storing separate copies of the database at two or more
sites. It is a popular fault tolerance technique of distributed databases.
Advantages of Data Replication
Reliability − In case of failure of any site, the database system continues to work
since a copy is available at another site(s).
Reduction in Network Load − Since local copies of data are available, query
processing can be done with reduced network usage, particularly during prime hours.
Data updating can be done at non-prime hours.
Quicker Response − Availability of local copies of data ensures quick query
processing and consequently quick response time.
Simpler Transactions − Transactions require fewer joins of tables located at
different sites and minimal coordination across the network. Thus, they become
simpler in nature.
Disadvantages of Fragmentation
1. Applications whose views are defined on more than one fragment may suffer
performance degradation if the applications have conflicting requirements.
2. Simple tasks, like checking for dependencies, would result in chasing after data in a
number of sites.
3. When data from different fragments are required, the access time may be very
high.
4. In case of recursive fragmentations, the job of reconstruction will need expensive
techniques.
5. Lack of back-up copies of data at different sites may render the database ineffective in
case of failure of a site.
For example, let us consider that a University database keeps records of all registered students in a Student
table having the following schema.
STUDENT
Regd_No   Name   Course   Address   Semester   Fees   Marks
Now, the fees details are maintained in the accounts section. In this case, the designer will fragment
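One way such a fragmentation might look is sketched below in Python. The sample rows and the fragment names STD_FEES and STD_INFO are assumptions made for illustration. The fees fragment keeps only Regd_No and Fees, and the key Regd_No is repeated in both fragments so that the original table can be rebuilt by a join.

```python
# Hypothetical rows of the STUDENT table (all values are made up).
STUDENT = [
    {"Regd_No": 101, "Name": "Asha", "Course": "BCA", "Address": "Pune",
     "Semester": 3, "Fees": 25000, "Marks": 78},
    {"Regd_No": 102, "Name": "Ravi", "Course": "MCA", "Address": "Delhi",
     "Semester": 1, "Fees": 40000, "Marks": 85},
]

def project(rows, columns):
    """Keep only the given columns of every row (relational projection)."""
    return [{c: row[c] for c in columns} for row in rows]

# Vertical fragments; the key Regd_No appears in both.
STD_FEES = project(STUDENT, ["Regd_No", "Fees"])
STD_INFO = project(STUDENT, ["Regd_No", "Name", "Course", "Address",
                             "Semester", "Marks"])

def join_on_key(left, right, key):
    """Rebuild full rows by matching the two fragments on the key."""
    index = {row[key]: row for row in right}
    return [{**left_row, **index[left_row[key]]} for left_row in left]

rebuilt = join_on_key(STD_INFO, STD_FEES, "Regd_No")
print(rebuilt == STUDENT)
```

The accounts section would then work only with STD_FEES, while the join shows that no information is lost by the split.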
Horizontal Fragmentation
Horizontal fragmentation groups the tuples of a table in accordance with the values of one or more
fields. Horizontal fragmentation should also conform to the rule of reconstructiveness. Each
horizontal fragment must have all columns of the original base table.
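A minimal Python sketch of horizontal fragmentation follows. The rows, the choice of Course as the fragmenting field, and the fragment names BCA_STD and OTHER_STD are assumptions for illustration. Each fragment keeps every column, and the union of the fragments gives back the base table.

```python
# Hypothetical rows of a STUDENT table (values are made up).
STUDENT = [
    {"Regd_No": 101, "Name": "Asha", "Course": "BCA"},
    {"Regd_No": 102, "Name": "Ravi", "Course": "MCA"},
    {"Regd_No": 103, "Name": "Meena", "Course": "BCA"},
]

def select(rows, predicate):
    """Keep only the rows that satisfy the predicate (relational selection)."""
    return [row for row in rows if predicate(row)]

# Two horizontal fragments defined on the value of the Course field.
BCA_STD = select(STUDENT, lambda r: r["Course"] == "BCA")
OTHER_STD = select(STUDENT, lambda r: r["Course"] != "BCA")

# Reconstruction: the union of the fragments, sorted here only to compare.
rebuilt = sorted(BCA_STD + OTHER_STD, key=lambda r: r["Regd_No"])
print(rebuilt == STUDENT)
```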
The link between the owner and the member relations is defined as an equi-join.
Given a link L where owner(L) = S and member(L) = R, the derived horizontal
fragments of R are defined as
$R_i = R \ltimes S_i, \quad 1 \le i \le w$
where
$S_i = \sigma_{F_i}(S)$,
w is the maximum number of fragments that will be defined on R, and
$F_i$ is the formula by which the primary horizontal fragment $S_i$ is defined.
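The definition of derived horizontal fragments can be sketched in Python as a selection on the owner followed by a semi-join of the member. The relations, the attribute names DeptNo, City and EmpNo, and the two formulas are made-up examples.

```python
# Owner relation S and member relation R, linked by an equi-join on DeptNo.
S = [
    {"DeptNo": 1, "City": "Delhi"},
    {"DeptNo": 2, "City": "Mumbai"},
    {"DeptNo": 3, "City": "Delhi"},
]
R = [
    {"EmpNo": 10, "DeptNo": 1},
    {"EmpNo": 11, "DeptNo": 2},
    {"EmpNo": 12, "DeptNo": 3},
    {"EmpNo": 13, "DeptNo": 2},
]

def select(rows, predicate):
    """Relational selection: rows of the relation that satisfy the formula."""
    return [row for row in rows if predicate(row)]

def semijoin(member, owner, key):
    """Rows of member whose key value also appears in owner."""
    keys = {row[key] for row in owner}
    return [row for row in member if row[key] in keys]

# Formulas F_i that define the primary horizontal fragments S_i of the owner.
formulas = [lambda s: s["City"] == "Delhi", lambda s: s["City"] == "Mumbai"]

S_frags = [select(S, F) for F in formulas]                  # S_i = sigma_Fi(S)
R_frags = [semijoin(R, S_i, "DeptNo") for S_i in S_frags]   # R_i = R semijoin S_i

for i, frag in enumerate(R_frags, start=1):
    print(f"R{i}:", frag)
```

Here the member rows follow their owner rows, so employees end up in the fragment of the city where their department is located.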
Hybrid Fragmentation
In hybrid fragmentation, a combination of horizontal and vertical fragmentation techniques
is used. This is the most flexible fragmentation technique since it generates fragments with
minimal extraneous information. However, reconstruction of the original table is often an
expensive task.
Hybrid fragmentation can be done in two alternative ways −
At first, generate a set of horizontal fragments; then generate vertical fragments from one or
more of the horizontal fragments.
At first, generate a set of vertical fragments; then generate horizontal fragments from one or
more of the vertical fragments.
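The first of these two ways can be sketched in Python as follows. The rows and the fragment names are assumptions for illustration: a horizontal fragment is selected first, and that fragment is then split vertically, with the key kept in both pieces.

```python
# Hypothetical rows of a STUDENT table (values are made up).
STUDENT = [
    {"Regd_No": 101, "Name": "Asha", "Course": "BCA", "Fees": 25000},
    {"Regd_No": 102, "Name": "Ravi", "Course": "MCA", "Fees": 40000},
    {"Regd_No": 103, "Name": "Meena", "Course": "BCA", "Fees": 25000},
]

# Step 1: horizontal fragment containing only the BCA students.
bca = [row for row in STUDENT if row["Course"] == "BCA"]

# Step 2: vertical fragments of that horizontal fragment.
# The key Regd_No stays in both so the fragment can be rebuilt by a join.
bca_fees = [{"Regd_No": row["Regd_No"], "Fees": row["Fees"]} for row in bca]
bca_info = [{k: v for k, v in row.items() if k != "Fees"} for row in bca]

print(bca_fees)
print(bca_info)
```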
Transparency
Transparency in a DBMS stands for the separation of the high-level semantics of the system from
low-level implementation issues. High-level semantics concern the end user, while
low-level implementation concerns the hardware details of how the data is
stored in the database. Transparency can be implemented in a DBMS by using data independence
in the various layers of the database.
Distribution transparency is the property of distributed databases by virtue of which the
internal details of the distribution are hidden from the users. The DDBMS designer may
choose to fragment tables, replicate the fragments and store them at different sites.
However, since users are oblivious to these details, they find the distributed database as easy to
use as any centralized database.
Unlike a normal DBMS, a DDBMS deals with a communication network, replicas and fragments
of data. Thus, transparency also involves these three factors.
Following are three types of transparency:
1. Location transparency
2. Fragmentation transparency
3. Replication transparency
Location Transparency
Location transparency ensures that the user can query any table(s) or fragment(s) of a
table as if they were stored locally at the user's site. The fact that the table or its fragments
are stored at a remote site in the distributed database system should be completely hidden from
the end user. The address of the remote site(s) and the access mechanisms are completely
hidden. In order to provide location transparency, the DDBMS should have access to an updated
and accurate data dictionary and a DDBMS directory which contains the details of the locations
of data.
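The role of the directory can be shown with a small Python sketch. The site names, table contents, and the query function are hypothetical stand-ins; a real DDBMS would send the request over the network rather than read from a dictionary.

```python
# Hypothetical sites, each holding some tables (in-memory stand-ins).
SITES = {
    "site_a": {"EMPLOYEE": [{"Name": "Asha"}]},
    "site_b": {"PROJECT": [{"Title": "Payroll"}]},
}

# Directory that records which site stores each table.
DIRECTORY = {"EMPLOYEE": "site_a", "PROJECT": "site_b"}

def query(table):
    """Return all rows of a table; the caller never names a site."""
    site = DIRECTORY[table]      # location is looked up internally
    return SITES[site][table]    # stands in for a remote access

print(query("PROJECT"))
```

The caller writes the same request whether the table is local or remote; only the directory knows the difference.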
Fragmentation Transparency
Fragmentation transparency enables users to query any table as if it were unfragmented.
Thus, it hides the fact that the table the user is querying is actually a fragment or a union of
some fragments. It also conceals the fact that the fragments are located at diverse sites. This is
somewhat similar to users of SQL views, where the user may not know that they are using a
view of a table instead of the table itself.
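A minimal Python sketch of this idea is given below, assuming a table that is stored as two horizontal fragments. The fragment contents and the scan function are made up for illustration; the function plays the part of a view that presents the union of the fragments as one table.

```python
# Hypothetical horizontal fragments of a STUDENT table, one per site.
FRAGMENTS = {
    "STUDENT": [
        [{"Regd_No": 101, "Course": "BCA"}],  # fragment stored at one site
        [{"Regd_No": 102, "Course": "MCA"}],  # fragment stored at another site
    ]
}

def scan(table):
    """Yield the rows of the global table as the union of its fragments."""
    for fragment in FRAGMENTS[table]:
        yield from fragment

# The user queries STUDENT without knowing that it is fragmented.
regd_numbers = [row["Regd_No"] for row in scan("STUDENT")]
print(regd_numbers)
```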
Replication Transparency
Replication transparency ensures that the replication of databases is hidden from the users. It
enables users to query a table as if only a single copy of the table exists. Replication
transparency is associated with concurrency transparency and failure transparency. Whenever
a user updates a data item, the update is reflected in all the copies of the table. However, this
operation should not be known to the user. This is concurrency transparency. Also, in case of
failure of a site, the user can still proceed with their queries using replicated copies without
any knowledge of the failure. This is failure transparency.
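The behaviour described above can be sketched in Python with a small class. The class name ReplicatedTable and its methods are hypothetical, and the sketch makes two simplifying assumptions: every update reaches all copies, and a site fails only after the update has been applied.

```python
class ReplicatedTable:
    """A table kept as several copies, presented to the user as one."""

    def __init__(self, num_copies):
        self.copies = [dict() for _ in range(num_copies)]
        self.up = [True] * num_copies  # whether each site is available

    def update(self, key, value):
        # The user issues one update; it is applied to every copy.
        for copy in self.copies:
            copy[key] = value

    def read(self, key):
        # Read from the first available copy; failed sites are skipped.
        for copy, alive in zip(self.copies, self.up):
            if alive:
                return copy[key]
        raise RuntimeError("no replica available")

table = ReplicatedTable(num_copies=3)
table.update("Regd_No 101", "Asha")
table.up[0] = False               # simulate failure of the first site
print(table.read("Regd_No 101"))  # still answered from another copy
```

The user calls update and read once each and never sees how many copies exist or which one answered.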