Distributed Database System

Distributed database systems allow data to be stored across multiple sites connected by a network. This provides benefits like increased reliability if data is replicated across sites, improved performance through data localization near sites of use, and easier expansion by adding new sites. Key challenges include providing various types of transparency to hide the distributed nature from users, managing distributed transactions reliably, and processing distributed queries efficiently.



Distributed database system (DDBS) technology is the union of what appear to be two
diametrically opposed approaches to data processing: database system and computer
network technologies.
What is being distributed?
1. Data
2. Processing logic
3. Functionality
4. Software
Why is distribution important?
1. Reliability
2. Most organizations are distributed
3. Most applications are distributed
4. Data management requirements

Distributed Data Processing:


The term distributed processing (or distributed computing) is hard to define precisely.
Obviously, some degree of distributed processing goes on in any computer system, even
on single-processor computers where the central processing unit (CPU) and
input/output (I/O) functions are separated and overlapped.
What is a Distributed Database System?
We define a distributed database as a collection of multiple, logically interrelated
databases distributed over a computer network.
A distributed database management system (distributed DBMS) is then defined as the
software system that permits the management of the distributed database and makes
the distribution transparent to the users. Sometimes “distributed database system”
(DDBS) is used to refer jointly to the distributed database and the distributed DBMS.
The two important terms in these definitions are “logically interrelated” and
“distributed over a computer network.” A DDBS is not a “collection of files” that can be
individually stored at each node of a computer network.
Data Delivery Alternatives:
In distributed databases, data are “delivered” from the sites where they are stored to
where the query is posed. We characterize the data delivery alternatives along three
orthogonal dimensions: delivery modes, frequency and communication methods. The
combinations of alternatives along each of these dimensions (that we discuss next)
provide a rich design space.
The alternative delivery modes are pull-only, push-only, and hybrid:

Pull-only:
In the pull-only mode of data delivery, the transfer of data from servers to clients is initiated by a client pull.
Push-only:
In the push-only mode of data delivery, the transfer of data from servers to clients is initiated by a server push in the absence of any specific request from clients. The main difficulty of the push-based approach is in deciding which data would be of common interest, and when to send them to clients – alternatives are periodic, irregular, or conditional. Thus, the usefulness of server push depends heavily upon the accuracy of a server to predict the needs of clients.
Hybrid:
The hybrid mode of data delivery combines the client-pull and server-push mechanisms: in the continuous (or continual) query approach, for example, the transfer of information is first initiated by a client pull (when the query is posed), and the subsequent transfer of updated results is initiated by a server push.
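To make the three modes concrete, here is a minimal, self-contained Python sketch (the Server/Client classes and the stock-quote data are invented for illustration and are not part of any particular system): a client pull, a server push to subscribers, and their combination in a hybrid, continuous-query style.

class Server:
    """Holds the data and knows its subscribers (for push delivery)."""
    def __init__(self):
        self.data = {"stock:ABC": 42}
        self.subscribers = []

    def handle_pull(self, key):
        # pull-only: the transfer is initiated by an explicit client request
        return self.data.get(key)

    def push_update(self, key, value):
        # push-only: the server decides when to send, with no client request
        self.data[key] = value
        for client in self.subscribers:
            client.receive(key, value)

class Client:
    def __init__(self, server):
        self.server = server
        self.cache = {}

    def pull(self, key):                  # client-initiated transfer
        self.cache[key] = self.server.handle_pull(key)
        return self.cache[key]

    def subscribe(self):                  # register interest so the server can push
        self.server.subscribers.append(self)

    def receive(self, key, value):        # called by the server on a push
        self.cache[key] = value

# Hybrid (continuous-query style): the client pulls once and registers,
# after which updates are pushed without further requests.
server = Server()
client = Client(server)
client.pull("stock:ABC")
client.subscribe()
server.push_update("stock:ABC", 43)
print(client.cache)                       # {'stock:ABC': 43}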
Promises of DDBSs:
There are four fundamentals which may also be viewed as promises of DDBS
technology:
 Transparent management of distributed and replicated data.
 Reliable access to data through distributed transactions.
 Improved performance.
 Easier system expansion.
Transparent Management of Distributed and Replicated Data:
Distribution transparency is the property of distributed databases by virtue of
which the internal details of the distribution are hidden from the users. The DDBMS
designer may choose to fragment tables, replicate the fragments and store them at
different sites.
Data Independence:
Data independence is the type of data transparency that matters for a centralized
DBMS. It refers to the immunity of user applications to changes made in the definition
and organization of data. Application programs should not, ideally, be exposed to
details of data representation and storage.

Network Transparency:
Network transparency is the process of sending or accessing data over a network in
such a way that the information is not visible to users communicating with a local or
remote host, system, network or software. It can provide remote data and computing
resources to a local user without providing intermediate network information.
Replication Transparency:

Replication transparency is the ability to create multiple copies of objects without any
effect of the replication seen by applications that use the objects. It should not be
possible for an application to determine the number of replicas, or to be able to see the
identities of specific replica instances.
Fragmentation Transparency:
Fragmentation transparency enables users to query upon any table as if it were
unfragmented. Thus, it hides the fact that the table the user is querying on is actually a
fragment or union of some fragments. It also conceals the fact that the fragments are
located at diverse sites.
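As a small illustration (the fragment contents, names, and the routing function below are hypothetical, not taken from these notes), fragmentation transparency means the user writes one query against EMP while a routing layer visits the fragments and unions the results:

# Two horizontal fragments of EMP, standing in for data stored at two sites.
EMP_FRAGMENTS = {
    "EMP1": [{"ENO": "E1", "ENAME": "J. Doe",   "TITLE": "Elect. Eng."}],
    "EMP2": [{"ENO": "E2", "ENAME": "M. Smith", "TITLE": "Analyst"}],
}

def query_emp(predicate):
    """The user calls this as if EMP were a single, unfragmented relation."""
    result = []
    for fragment in EMP_FRAGMENTS.values():   # the fragments stay hidden from the caller
        result.extend(t for t in fragment if predicate(t))
    return result

print(query_emp(lambda t: t["TITLE"] == "Analyst"))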
Who Should Provide Transparency?
A DDBMS that supports local mapping transparency requires that the user or
programmer specify both the name and location of data fragments stored at remote
locations. Having to specify data locations is undesirable, but this is complicated further
when data is moved from one location to another.
Reliability Through Distributed Transactions:
Distributed database systems are potentially more reliable since there are multiple
copies of each system component, which eliminates single points-of-failure, and data
may be replicated to ensure that access can be provided in case of system failures.
Improved Performance:
The case for the improved performance of distributed DBMSs is typically made based
on two points. First, a distributed DBMS fragments the conceptual database, enabling
data to be stored in close proximity to its points of use (also called data localization).
This has two potential advantages:
1. Since each site handles only a portion of the database, contention for CPU and
I/O services is not as severe as for centralized databases.
2. Localization reduces remote access delays that are usually involved in wide area
networks (for example, the minimum round-trip message propagation delay in
satellite-based systems is about 1 second).
Distributed Database Design:
Definition. Distributed database design refers to the following problem: given a
database and its workload, how should the database be split and allocated to sites so as
to optimize a certain objective function (e.g., to minimize the resource consumption in
processing the query workload).
What are the advantages of a distributed database?
 Distributed databases basically provide us the advantages of distributed computing to the database management domain.
 Management of data with different levels of transparency.
 Increased reliability and availability.
 Easier expansion.
 Improved performance.
Distributed Directory Management:
What is distributed directory management?
A distributed directory is a directory environment in which data is partitioned across
multiple directory servers. A distributed directory must have a collection of machines
including Relational Database Management (RDBM) servers, and proxy servers that
manage the topology.

Distributed Query Processing:


Distributed query processing is the procedure of answering queries (which means
mainly read operations on large data sets) in a distributed environment where data is
managed at multiple sites in a computer network.
Distributed Concurrency Control:
Distributed concurrency control provides concepts and technologies to synchronize
distributed transactions in a way that their interleaved execution does not violate the
ACID properties. Distributed transactions are executed in a distributed
database environment, where a set of connected data servers host related data.
Distributed Deadlock Management:
The two main deadlock handling concerns in a distributed database system that are not
present in a centralized system are transaction location and transaction control. Once
these concerns are addressed, deadlocks are handled through any of deadlock
prevention, deadlock avoidance or deadlock detection and removal.
Distributed DBMS Architecture:
A distributed database system allows applications to access data from local and remote
databases. In a homogenous distributed database system, each database is an Oracle
Database. In a heterogeneous distributed database system, at least one of the databases
is not an Oracle Database.
What are the three schemas in the reference architecture of a distributed DBMS?
Reference Architecture of Distributed DBMSs:
 A set of global external schemas.
 A global conceptual schema.
 A fragmentation schema and allocation schema.

What is a distributed database in a DBMS?

A distributed database is a database that consists of two or more files located in
different sites either on the same network or on entirely different networks. Portions of
the database are stored in multiple physical locations and processing is distributed
among multiple database nodes.

What are the features of distributed DBMS?


A distributed database represents multiple interconnected databases spread out across
several sites connected by a network. Since the databases are all connected, they appear
as a single database to the users. Distributed databases utilize multiple nodes, scale horizontally, and form a distributed system.

What advantages does a distributed DBMS architecture have?

Advantages of DDBMS

The distributed database can have the data arranged according to different levels of transparency, i.e. data with different transparency levels can be stored at different locations. The database can be stored according to the departmental information in an organisation.

What is distributed database design?


Definition. Distributed database design refers to the following problem: given a
database and its workload, how should the database be split and allocated to sites so as
to optimize a certain objective function (e.g., to minimize the resource consumption in
processing the query workload).

Why is a distributed DBMS used?


It is used to create, retrieve, update and delete distributed databases. It synchronizes the
database periodically and provides access mechanisms by virtue of which the
distribution becomes transparent to the users. It ensures that the data modified at any
site is universally updated.

What is a schema in a DBMS?

A database schema defines how data is organized within a relational database; this is inclusive of logical constraints such as table names, fields, data types, and the relationships between these entities.

Advantages of DDBMS

 The database is easier to expand as it is already spread across multiple systems and it is not too complicated to add a system.
 The distributed database can have the data arranged according to different levels of transparency, i.e. data with different transparency levels can be stored at different locations.
 The database can be stored according to the departmental information in an organisation. In that case, it is easier for an organisational hierarchical access.
 If there were a natural catastrophe such as a fire or an earthquake, all the data would not be destroyed, as it is stored at different locations.
 It is cheaper to create a network of systems containing a part of the database. This database can also be easily increased or decreased.
 Even if some of the data nodes go offline, the rest of the database can continue its normal functions.

Disadvantages of DDBMS

 The distributed database is quite complex and it is difficult to make sure that a user gets a uniform view of the database because it is spread across multiple locations.
 This database is more expensive as it is complex and hence, difficult to
maintain.
 It is difficult to provide security in a distributed database as the database
needs to be secured at all the locations it is stored. Moreover, the
infrastructure connecting all the nodes in a distributed database also
needs to be secured.
 It is difficult to maintain data integrity in the distributed database because
of its nature. There can also be data redundancy in the database as it is
stored at multiple locations.
 The distributed database is complicated and it is difficult to find people
with the necessary experience who can manage and maintain it.

ANSI/SPARC Architecture:
What is ANSI-SPARC database architecture?

The ANSI-SPARC Architecture (American National Standards Institute, Standards Planning And Requirements Committee) is an abstract design standard for a database management system (DBMS), first proposed in 1975.

What are the 3 types of DBMS architecture?


To understand the structure of DBMS, they are classified into three types based on their build architecture:
 One-Tier Architecture
 Two-Tier Architecture
 Three-Tier Architecture
(The hierarchical, network, and relational models are data models rather than architecture tiers.)
Why is the ANSI-SPARC model important?
It separates the users' external views of the data from the way the data is physically stored. Its internal level provides the internal view of the physical storage of data: it deals with complex low-level data structures, file structures and access methods in detail, and also with data compression and encryption techniques, if used.

What is a 2-tier architecture?


Two-tier architecture. In a two-tier architecture, the client is on the first tier. The
database server and web application server reside on the same server machine, which is
the second tier. This second tier serves the data and executes the business logic for the
web application.
What is a 3-tier architecture?
Three-tier architecture is a well-established software application architecture that
organizes applications into three logical and physical computing tiers: the presentation
tier, or user interface; the application tier, where data is processed; and the data tier, where the data associated with the application is stored and managed.
Autonomy:
Autonomy, in this context, refers to the distribution of control, not of data. It indicates
the degree to which individual DBMSs can operate independently. Autonomy is a
function of a number of factors such as whether the component systems (i.e., individual
DBMSs) exchange information, whether they can independently execute transactions,
and whether one is allowed to modify them.
Distribution:
A distributed database (DDB) is an integrated collection of databases that is physically
distributed across sites in a computer network. A distributed database management
system (DDBMS) is the software system that manages a distributed database such that
the distribution aspects are transparent to the users.
Heterogeneity:
What is heterogeneity in a DDBMS?
Heterogeneous. A heterogeneous distributed database uses different schemas, operating
systems, DDBMS, and different data models. In the case of a heterogeneous distributed
database, a particular site can be completely unaware of other sites causing limited
cooperation in processing user requests.

Peer-to-Peer Systems:

If the term “client/server” is loaded with different interpretations, “peer-to-peer” is even worse as its meaning has changed and evolved over the years. As noted earlier, the early works on distributed DBMSs all focused on peer-to-peer architectures where there was no differentiation between the functionality of each site in the system.
One component handles the interaction with users, and another deals with the storage.
The first major component, which we call the user processor, consists of four elements:
1. The user interface handler is responsible for interpreting user commands
as they come in, and formatting the result data as it is sent to the user.
2. The semantic data controller uses the integrity constraints and
authorizations that are defined as part of the global conceptual schema to
check if the user query can be processed. This component, which is
studied in detail in Chapter 5, is also responsible for authorization and
other functions.
3. The global query optimizer and decomposer determines an execution
strategy to minimize a cost function, and translates the global queries into
local ones using the global and local conceptual schemas as well as the
global directory. The global query optimizer is responsible, among other
things, for generating the best strategy to execute distributed join
operations.
4. The distributed execution monitor coordinates the distributed
execution of the user request. The execution monitor is also called the
distributed transaction manager. In executing queries in a distributed
fashion, the execution monitors at various sites may, and usually do,
communicate with one another.
The second major component of a distributed DBMS is the data processor and consists
of three elements:
1. The local query optimizer, which actually acts as the access path selector, is responsible for choosing the best access path to access any data item.
2. The local recovery manager is responsible for making sure that the local database remains consistent even when failures occur.
3. The run-time support processor physically accesses the database according to
the physical commands in the schedule generated by the query optimizer. The
run-time support processor is the interface to the operating system and contains
the database buffer (or cache) manager, which is responsible for maintaining the
main memory buffers and managing the data accesses.
CHAPTER 2

Relational Database Concepts:


In a relational database, all data is held in tables, which are made up of rows and
columns. Each table has one or more columns, and each column is assigned a specific
datatype, such as an integer number, a sequence of characters (for text), or a date. Each
row in the table has a value for each column.

The relation schemas for this database can be defined as follows:
EMP(ENO, ENAME, TITLE, SAL, PNO, RESP, DUR)
PROJ(PNO, PNAME, BUDGET)
In relation scheme EMP, there are seven attributes: ENO, ENAME, TITLE, SAL, PNO, RESP and DUR. The values of ENO come from the domain of all valid employee numbers.
Normalization:
Normalization is the process of organizing data in a database. This includes creating
tables and establishing relationships between those tables according to rules designed
both to protect the data and to make the database more flexible by eliminating
redundancy and inconsistent dependency.
The following four problems might exist in a relation scheme:
1. Repetition anomaly. Certain information may be repeated unnecessarily. Consider, for example, the EMP relation in Figure 2.2. The name, title, and salary of an employee are repeated for each project on which this person serves. This is obviously a waste of storage and is contrary to the spirit of databases.
2. Update anomaly. As a consequence of the repetition of data, performing updates may be troublesome. For example, if the salary of an employee changes, multiple tuples have to be updated to reflect this change.
3. Insertion anomaly. It may not be possible to add new information to the database. For example, when a new employee joins the company, we cannot add personal information (name, title, salary) to the EMP relation unless an appointment to a project is made. This is because the key of EMP includes the attribute PNO, and null values cannot be part of the key.
4. Deletion anomaly. This is the converse of the insertion anomaly. If an employee works on only one project, and that project is terminated, it is not possible to delete the project information from the EMP relation. To do so would result in deleting the only tuple about the employee, thereby resulting in the loss of personal information we might want to retain.
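A brief illustration (this decomposition is assumed here; the notes themselves do not spell it out): the four anomalies above disappear if EMP(ENO, ENAME, TITLE, SAL, PNO, RESP, DUR) is split into EMP(ENO, ENAME, TITLE, SAL) and ASG(ENO, PNO, RESP, DUR). Personal data is then stored once per employee, it can be inserted before any project assignment exists, and it survives the deletion of the employee's last assignment.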
Relational Data Languages:
Data manipulation languages developed for the relational model (commonly called
query languages) fall into two fundamental groups: relational algebra languages and
relational calculus languages. The difference between them is based on how the user
query is formulated.
Relational Algebra:
Relational algebra consists of a set of operators that operate on relations. Each operator
takes one or two relations as operands and produces a result relation, which, in turn,
may be an operand to another operator. These operations permit the querying and
updating of a relational database.
Selection:
Selection produces a horizontal subset of a given relation. The subset consists of all the tuples that satisfy a formula (condition) F. The selection from a relation R according to formula F is denoted as σF(R).

Projection:
Projection produces a vertical subset of a relation. The result relation contains only those attributes of the original relation over which projection is performed. Thus the degree of the result is less than or equal to the degree of the original relation. The projection of relation R over attributes A and B is denoted as ΠA,B(R).
Union:
The union of two relations R and S (denoted as R ∪ S) is the set of all tuples that are in
R, or in S, or in both. We should note that R and S should be union compatible. As in
the case of projection, the duplicate tuples are normally eliminated. Union may be used
to insert new tuples into an existing relation, where these tuples form one of the operand
relations.
Set Difference:
The set difference of two relations R and S (R − S) is the set of all tuples that are in R but not in S. In this case, not only should R and S be union compatible, but the operation is also asymmetric (i.e., R − S ≠ S − R). This operation allows the deletion of tuples from a relation. Together with the union operation, we can perform modification of tuples by deletion followed by insertion.
Cartesian Product:
The Cartesian product of two relations R of degree k1 and S of degree k2 is the set of (k1 + k2)-tuples, where each result tuple is a concatenation of one tuple of R with one tuple of S, for all tuples of R and S. The Cartesian product of R and S is denoted as R × S.
Intersection:
Intersection of two relations R and S (R ∩ S) consists of the set of all tuples that are in
both R and S. In terms of the basic operators, it can be specified as follows: R ∩ S = R − (R − S).
θ-Join:
Join is a derivative of the Cartesian product. There are various forms of join; the primary classification is between inner join and outer join. We first discuss inner join and its variants and then describe outer join. The most general type of inner join is the θ-join. The θ-join of two relations R and S is denoted as R ⋈F S, where F is a formula specifying the join predicate.
Semijoin:
The semijoin of relation R, defined over the set of attributes A, by relation S, defined over the set of attributes B, is the subset of the tuples of R that participate in the join of R with S. It is denoted as R ⋉F S (where F is a predicate as defined before).
Division:
The division of relation R of degree r with relation S of degree s (where r > s and s ≠ 0) is the set of (r − s)-tuples t such that for all s-tuples u in S, the tuple tu is in R. The division operation is denoted as R ÷ S and can be specified in terms of the fundamental operators.
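The following Python sketch (an illustrative toy, not a real DBMS implementation) implements most of the operators above on relations represented as lists of dictionaries; attribute names are assumed to be disjoint across the two operands of the product and joins, and division is omitted for brevity.

def selection(R, F):                 # horizontal subset: tuples satisfying formula F
    return [t for t in R if F(t)]

def projection(R, attrs):            # vertical subset over the given attributes
    seen, result = set(), []
    for t in R:
        row = tuple(t[a] for a in attrs)
        if row not in seen:          # duplicate tuples are eliminated
            seen.add(row)
            result.append(dict(zip(attrs, row)))
    return result

def union(R, S):                     # R and S must be union compatible
    return R + [t for t in S if t not in R]

def difference(R, S):                # tuples in R but not in S (asymmetric)
    return [t for t in R if t not in S]

def cartesian_product(R, S):         # every R tuple concatenated with every S tuple
    return [{**r, **s} for r in R for s in S]

def intersection(R, S):              # R ∩ S = R − (R − S)
    return difference(R, difference(R, S))

def theta_join(R, S, F):             # θ-join: Cartesian product filtered by F
    return selection(cartesian_product(R, S), F)

def semijoin(R, S, F):               # tuples of R that participate in the join with S
    return [r for r in R if any(F({**r, **s}) for s in S)]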
Relational Calculus:
In relational calculus-based languages, instead of specifying how to obtain the result,
one specifies what the result is by stating the relationship that is supposed to hold for
the result. Relational calculus languages fall into two groups: tuple relational calculus
and domain relational calculus.
Tuple relational calculus:
The primitive variable used in tuple relational calculus is a tuple variable, which specifies a tuple of a relation. In other words, it ranges over the tuples of a relation. In tuple relational calculus, queries are specified as {t | F(t)}, where t is a tuple variable and F is a well-formed formula. The atomic formulas are of two forms:
1. Tuple-variable membership expressions. If t is a tuple variable ranging over the tuples of relation R (predicate symbol), the expression “tuple t belongs to relation R” is an atomic formula, which is usually specified as R.t or R(t).
2. Conditions. These can be defined as follows:
(a) s[A] θ t[B], where s and t are tuple variables and A and B are components of s and t, respectively. θ is one of the arithmetic comparison operators <, >, =, ≠, ≤, and ≥. This condition specifies that component A of s stands in relation θ to the B component of t: for example, s[SAL] > t[SAL].
(b) s[A] θ c, where s, A, and θ are as defined above and c is a constant. For example, s[ENAME] = “Smith”.
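As a small worked example (using the EMP schema from earlier; the salary threshold is invented): the query {t | EMP(t) ∧ t[SAL] > 30000} retrieves the EMP tuples whose salary exceeds 30000. Note that it only states the condition the result must satisfy, not how to compute it, which is the essential difference from relational algebra.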
Communication Schemes:
The communication scheme described in [1,2] enables the overlapped execution of
computations and data transfers to reduce waiting times based on a parallel matrix
pattern analysis and, subsequently, a block rearranging of the matrix data.
In terms of the physical communication schemes employed, networks can be either
point-to-point (also called unicast) networks, or broadcast (sometimes also called multi-
point) networks.

Data Communication Concepts:


Concept of Communication:
Data communication refers to the exchange of data between two or more networked or connected devices. These devices must be capable of sending and receiving data over a communication medium. Examples of such devices include personal computers, mobile phones, laptops, etc.
Distributed Database Design:
A distributed database is a database that consists of two or more files located in
different sites either on the same network or on entirely different networks. Portions of
the database are stored in multiple physical locations and processing is distributed
among multiple database nodes.
Top-Down Design Process:
The top-down approach aims to structure the creation process of a distributed
database. By defining and separating the construction stages in a correct manner, the database architects and other people involved in the construction of a distributed database will have a better chance of achieving success in a given project.
The requirements document is input to two parallel activities:
(1) view design:
The view design activity deals with defining the interfaces for end users.
(2) conceptual design:
The conceptual design, on the other hand, is the process by which the enterprise is
examined to determine entity types and relationships among these entities. One can
possibly divide this process into two related activity groups: entity analysis and functional analysis.
Entity analysis:
Entity analysis is concerned with determining the entities, their attributes, and the
relationships among them.
Functional analysis:
Functional analysis, on the other hand, is concerned with determining the fundamental
functions with which the modeled enterprise is involved. The results of these two steps
need to be cross-referenced to get a better understanding of which functions deal with
which entities.
Fragments:
Rather than distributing relations, it is quite common to divide them into subrelations,
called fragments, which are then distributed.
The distribution design activity consists of two steps (a small sketch of both steps follows below):
(1) fragmentation: Fragmentation is the process of dividing the whole database into various subtables or subrelations so that data can be stored in different systems. The small pieces of subrelations or subtables are called fragments. These fragments are the logical data units of distribution and are stored at various sites. It must be made sure that the fragments are such that they can be used to reconstruct the original relation (i.e., there is no loss of data).
(2) allocation: Allocation is the process of assigning the fragments (and possibly replicas of them) to the sites of the network in such a way that the chosen objective function, for example the cost of processing the application workload, is optimized.
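A minimal Python sketch of the two steps, using invented EMP data, a hypothetical predicate on PNO, and made-up site names:

EMP = [
    {"ENO": "E1", "ENAME": "J. Doe",   "PNO": "P1"},
    {"ENO": "E2", "ENAME": "M. Smith", "PNO": "P2"},
    {"ENO": "E3", "ENAME": "A. Lee",   "PNO": "P1"},
]

# Step 1 -- fragmentation: split EMP horizontally by a simple predicate.
EMP1 = [t for t in EMP if t["PNO"] == "P1"]
EMP2 = [t for t in EMP if t["PNO"] != "P1"]

# Step 2 -- allocation: decide which site stores which fragment.
allocation = {"site_A": ["EMP1"], "site_B": ["EMP2"]}

# Sanity check: the union of the fragments reconstructs EMP (no loss of data).
assert all(t in EMP1 + EMP2 for t in EMP)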

Advantages:
 As the data is stored close to the usage site, the efficiency of the database system will increase.
 Local query optimization methods are sufficient for some queries as the data is available locally.
 In order to maintain the security and privacy of the database system, fragmentation is advantageous.
Disadvantages:
 Access speeds may be very low if data from different fragments are needed.
 If we are using recursive fragmentation, then reconstruction will be very expensive.

Distribution Design Issues:


In the preceding section we indicated that the relations in a database schema are usually
decomposed into smaller fragments, but we did not offer any justification or details for
this process. The objective of this section is to fill in these details.
Why fragment at all?
Advantages: As the data is stored close to the usage site, the efficiency of the database system will increase. Local query optimization methods are sufficient for some queries as the data is available locally. In order to maintain the security and privacy of the database system, fragmentation is advantageous.
How should we allocate?
The technique used to break up the database into logical units is known as fragmentation; the resulting fragments are then assigned (allocated) for storage at different sites, possibly with replication, according to how the applications access them.
Correctness Rules of Fragmentation:
There are three correctness rules:
1. Completeness
2. Reconstruction
3. Disjointness
Completeness
If relation R is decomposed into fragments R1, R2, ..., Rn, then each data item that can be found in R must appear in at least one fragment.

Reconstruction
It should be possible to define a relational operator such that the original relation can be reconstructed from its fragments:
R = g(R1, R2, ..., Rn)
Reconstruction ensures that the dependencies of the original relation are preserved by the fragmentation. The reconstruction operator is union for horizontal fragmentation and join for vertical fragmentation.

Disjointness
If data item d1 appears in fragment R1, then it should not appear in any other fragment. The exception is vertical fragmentation, where primary key attributes must be repeated to allow reconstruction. For horizontal fragmentation the data item is a tuple; for vertical fragmentation it is an attribute.
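The three rules can be checked mechanically for a horizontal fragmentation, where the data item is a tuple. The following Python sketch (relations as lists of dictionaries, example data invented) is one way to do it:

def check_horizontal_fragmentation(R, fragments):
    union = [t for f in fragments for t in f]
    completeness   = all(t in union for t in R)                    # every tuple of R is in some fragment
    reconstruction = completeness and all(t in R for t in union)   # union of the fragments equals R
    disjointness   = all(sum(t in f for f in fragments) <= 1 for t in R)  # no tuple in two fragments
    return completeness, reconstruction, disjointness

R  = [{"ENO": "E1"}, {"ENO": "E2"}, {"ENO": "E3"}]
R1 = [{"ENO": "E1"}, {"ENO": "E3"}]
R2 = [{"ENO": "E2"}]
print(check_horizontal_fragmentation(R, [R1, R2]))     # (True, True, True)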

Information Requirements:
The information needed for distribution design can be divided into four categories:
database information, application information, communication network information,
and computer system information. The latter two categories are completely quantitative
in nature and are used in allocation models rather than in fragmentation algorithms.
We do not consider them in detail here. Instead, the detailed information requirements
of the fragmentation and allocation algorithms are discussed in their respective sections.

Fragmentation:
In this section we present the various fragmentation strategies and algorithms. As mentioned previously, there are two fundamental fragmentation strategies: horizontal and vertical. Furthermore, there is a possibility of nesting fragments in a hybrid fashion.

Horizontal Fragmentation:
As we explained earlier, horizontal fragmentation partitions a relation along its tuples.
Thus each fragment has a subset of the tuples of the relation.

There are two versions of horizontal partitioning: primary and derived:


Primary horizontal fragmentation:
Primary horizontal fragmentation of a relation is performed using predicates that are
defined on that relation.
Derived horizontal fragmentation:
Derived horizontal fragmentation, on the other hand, is the partitioning of a relation
that results from predicates being defined on another relation.
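A small Python sketch of the difference, reusing the EMP/PROJ schemas from Chapter 2 (the budget predicate, the data values, and the semijoin helper are assumptions for illustration):

PROJ = [
    {"PNO": "P1", "PNAME": "Instrumentation", "BUDGET": 150000},
    {"PNO": "P2", "PNAME": "CAD/CAM",         "BUDGET": 250000},
]
EMP = [
    {"ENO": "E1", "ENAME": "J. Doe",   "PNO": "P1"},
    {"ENO": "E2", "ENAME": "M. Smith", "PNO": "P2"},
]

# Primary horizontal fragmentation: predicates are defined on PROJ itself.
PROJ1 = [t for t in PROJ if t["BUDGET"] <= 200000]
PROJ2 = [t for t in PROJ if t["BUDGET"] >  200000]

# Derived horizontal fragmentation: EMP is partitioned according to the
# fragments of PROJ (predicates defined on another relation), here via a
# semijoin on the common attribute PNO.
def semijoin(R, S, attr):
    keys = {s[attr] for s in S}
    return [r for r in R if r[attr] in keys]

EMP1 = semijoin(EMP, PROJ1, "PNO")   # employees of low-budget projects
EMP2 = semijoin(EMP, PROJ2, "PNO")   # employees of high-budget projects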

In terms of quantitative information about user applications, we need to have two sets of data:
1. Minterm selectivity: the number of tuples of the relation that would be accessed by a user query specified according to a given minterm predicate. For example, the selectivity of m1 of Example 3.6 is 0, since there are no tuples in PAY that satisfy the minterm predicate. The selectivity of m2, on the other hand, is 0.25, since one of the four tuples in PAY satisfies m2. We denote the selectivity of a minterm mi as sel(mi).
2. Access frequency: the frequency with which user applications access data. If Q = {q1, q2, ..., qq} is a set of user queries, acc(qi) indicates the access frequency of query qi in a given period. Note that minterm access frequencies can be determined from the query frequencies. We refer to the access frequency of a minterm mi as acc(mi).

Algorithm 3.1: COM_MIN Algorithm
Input: R: relation; Pr: set of simple predicates
Output: Pr′: set of simple predicates
Declare: F: set of minterm fragments
begin
    find a pi ∈ Pr such that pi partitions R according to Rule 1;
    Pr′ ← pi;
    Pr ← Pr − pi;
    F ← fi   {fi is the minterm fragment according to pi};
    repeat
        find a pj ∈ Pr such that pj partitions some fk of Pr′ according to Rule 1;
        Pr′ ← Pr′ ∪ pj;
        Pr ← Pr − pj;
        F ← F ∪ fj;
        if ∃ pk ∈ Pr′ which is not relevant then
            Pr′ ← Pr′ − pk;
            F ← F − fk;
    until Pr′ is complete;
end
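The following Python sketch loosely mirrors COM_MIN under a simplifying assumption of my own (not the algorithm's formal definition): a predicate "partitions" a fragment, in the sense of Rule 1, whenever it splits the fragment into two non-empty parts, and the relevance test is folded into that same check. Relations are lists of dictionaries and predicates are boolean functions.

def com_min(R, Pr):
    """Return a complete-and-minimal predicate set Pr' and the fragments it induces."""
    remaining = list(Pr)
    pr_prime, fragments = [], [R]
    progress = True
    while progress:                      # the "repeat ... until complete" loop
        progress = False
        for p in list(remaining):
            # keep p only if it further partitions some current fragment (Rule 1, simplified)
            if any([t for t in f if p(t)] and [t for t in f if not p(t)] for f in fragments):
                pr_prime.append(p)
                remaining.remove(p)
                refined = []
                for f in fragments:      # refine the current fragments with p
                    yes = [t for t in f if p(t)]
                    no  = [t for t in f if not p(t)]
                    refined += [g for g in (yes, no) if g]
                fragments = refined
                progress = True
    return pr_prime, fragments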

PHORIZONTAL Algorithm:
Input: R: relation; Pr: set of simple predicates
Output: M: set of minterm fragments
begin
    Pr′ ← COM_MIN(R, Pr);
    determine the set M of minterm predicates;
    determine the set I of implications among pi ∈ Pr′;
    foreach mi ∈ M do
        if mi is contradictory according to I then
            M ← M − mi;
end
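The core of PHORIZONTAL, namely building the minterm predicates (every conjunction of each simple predicate or its negation) and discarding the contradictory ones, can be sketched in Python as follows. Here contradiction is detected empirically by an empty fragment rather than by the set I of implications, which is a simplification of my own; the PROJ data and the budget predicate are invented.

from itertools import product

def minterm_fragments(R, simple_preds):
    """Return the non-empty fragments induced by the minterm predicates of simple_preds."""
    fragments = []
    for signs in product([True, False], repeat=len(simple_preds)):
        def minterm(t, signs=signs):
            return all(p(t) == keep for p, keep in zip(simple_preds, signs))
        frag = [t for t in R if minterm(t)]
        if frag:                          # an empty fragment marks a contradictory minterm here
            fragments.append(frag)
    return fragments

PROJ = [
    {"PNO": "P1", "PNAME": "Instrumentation", "BUDGET": 150000},
    {"PNO": "P2", "PNAME": "DB Develop",      "BUDGET": 135000},
    {"PNO": "P3", "PNAME": "CAD/CAM",         "BUDGET": 250000},
]
print(minterm_fragments(PROJ, [lambda t: t["BUDGET"] <= 200000]))
# -> two fragments: projects with BUDGET <= 200000 and those with BUDGET > 200000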

Checking for Correctness:


We should now check the fragmentation algorithms discussed so far with respect to the three correctness criteria presented earlier.

Reconstruction: Reconstruction of a global relation from its fragments is performed by the union operator in both the primary and the derived horizontal fragmentation. Thus, for a relation R with fragmentation FR = {R1, R2, ..., Rw}:
R = ∪ Ri, ∀Ri ∈ FR
