
4. Parallel and Distributed Database


Parallel Database System: A parallel DBMS is a DBMS that runs across multiple processors or
CPUs and is designed mainly to execute query operations in parallel wherever possible. A
parallel DBMS links a number of smaller machines to achieve the same throughput as expected
from a single large machine.

There are three main architectural designs for parallel DBMSs. They are as follows:

1. Shared Memory Architecture
2. Shared Disk Architecture
3. Shared Nothing Architecture

Let’s discuss them one by one:

1. Shared Memory Architecture- In Shared Memory Architecture, multiple CPUs are attached
to an interconnection network and share a single, global main memory and common disk arrays.
Note that in this architecture a single copy of a multithreaded operating system and a
multithreaded DBMS can support these multiple CPUs. Shared memory is a tightly coupled
architecture in which multiple CPUs share their memory; it is also known as Symmetric
Multiprocessing (SMP). This architecture covers a wide range of systems, from personal
workstations that support a few microprocessors in parallel to large RISC-based
multiprocessors.

[Figure: Shared Memory Architecture]


Advantages:

1. It provides high-speed data access for a limited number of processors.
2. Communication among the processors is efficient.

Disadvantages:

1. It cannot scale beyond roughly 80 to 100 CPUs in parallel.
2. The bus or interconnection network becomes a bottleneck as the number of CPUs grows large.

2. Shared Disk Architecture:


In Shared Disk Architecture, various CPUs are attached to an interconnection network. Each
CPU has its own memory, and all of them have access to the same disks. Since memory is not
shared among the CPUs, each node has its own copy of the operating system and DBMS.
Shared disk is a loosely coupled architecture optimized for applications that are inherently
centralized. Such systems are also known as clusters.

[Figure: Shared Disk Architecture]


Advantages:

1. The interconnection network is no longer a bottleneck, since each CPU has its own memory.
2. Load balancing is easier in shared disk architecture.
3. There is better fault tolerance.

Disadvantages:

1. As the number of CPUs increases, the problems of interference and memory contention
also increase.
2. There is also a scalability problem.

3. Shared Nothing Architecture:


Shared Nothing Architecture is a multiprocessor architecture in which each processor has its
own memory and disk storage. Multiple CPUs are attached to an interconnection network
through a node, and no two CPUs can access the same disk area. In this architecture, no sharing
of memory or disk resources takes place. It is also known as Massively Parallel Processing
(MPP).
[Figure: Shared Nothing Architecture]

Advantages:

1. It has better scalability, as no resources are shared.
2. Multiple CPUs can be added easily.

Disadvantages:

1. The cost of communication is higher, as it involves sending data and software interaction
at both ends.
2. The cost of non-local disk access is higher than in shared disk architectures.

Note that this technology is typically used for very large databases, on the order of 10^12
bytes (i.e., a terabyte), or for systems that process thousands of transactions per second.

4. Hierarchical Architecture:
This architecture is a combination of the shared disk, shared memory, and shared nothing
architectures. It is scalable, owing to the availability of more memory and many processors,
but it is costlier than the other architectures.

Distributed Database Architecture in DBMS


Distributed Database System:
A Distributed Database System is a database that is stored at, or divided across, more than one
location, which means it is not limited to a single computer system. It is divided over a network
of various systems and is physically present on different systems in different locations. This can
be necessary when different users from all over the world need to access a specific database. To
the user, it should be handled in such a way that it appears to be a single database.

Parameters of Distributed Database Systems:


 Distribution:

It describes how the data is physically distributed among the various sites.

 Autonomy:

It reveals the division of power inside the Database System and the degree of autonomy enjoyed
by each individual DBMS.

 Heterogeneity:

It speaks of the similarity or differences between the databases, system parts, and data models.

Common Architecture Models of Distributed Database Systems:
 Client-Server Architecture of DDBMS:

This is a two-level architecture in which the main functionality is divided between the clients
and the servers. The server provides functions such as transaction management, data
management, query processing, and query optimization.

 Peer-to-peer Architecture of DDBMS:


In this architecture, each node or peer acts as both a server and a client, and it provides
database services in both roles. The peers coordinate their efforts and share their resources
with one another.

 Multi DBMS Architecture of DDBMS:

This is an amalgam of two or more independent database systems that function together as a
single integrated database system.

Types of Distributed Database Systems:


 Homogeneous Database System:

In a Homogeneous Database System, each site stores the same database. Since every site has
the same database, the data management schemes, operating system, and data structures are the
same across all sites, which makes such systems simple to handle.

 Heterogeneous Database System:

In this type of database system, different sites store the data and relational tables differently,
which makes it difficult for database administrators to run transactions and queries across the
database. Additionally, one site might not even be aware of the existence of the other sites.
Different computers may use different operating systems and database applications. Since each
system may have its own data model for storing the data, translation schemes are required to
establish connections between the different sites and transfer the data.

Distributed Data Storage:


There are two methods by which we can store the data on different sites:

 Replication:

This method involves redundantly storing the full relation at two or more sites. Since a
complete copy of the database can be accessed from each site, it becomes a redundant database.
Systems preserve copies of the data as a result of replication.

This has advantages because it makes more data accessible at many locations. Additionally,
query requests can now be handled in parallel.

However, there are some drawbacks as well. Data must be updated frequently: any change
performed at one site must be recorded at every site where that relation is stored, in order to
avoid inconsistent results. This introduces considerable overhead. Additionally, since
concurrent access must now be coordinated across several sites, concurrency control becomes
far more complicated. A sketch of this update propagation follows.
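To make the replication overhead concrete, here is a minimal Python sketch, in which an update to a replicated relation must be applied at every site holding a copy before it is acknowledged. The Site class and replicate_update function are illustrative names, not any particular DBMS's API:

# Minimal sketch of synchronous update propagation across replicas.
# All names (Site, replicate_update) are illustrative assumptions.

class Site:
    def __init__(self, name):
        self.name = name
        self.data = {}          # key -> value: the local copy of the relation

    def apply(self, key, value):
        self.data[key] = value  # apply the change to the local copy

def replicate_update(sites, key, value):
    # Every site holding a copy must record the change; otherwise
    # later reads at the stale sites would return inconsistent results.
    for site in sites:
        site.apply(key, value)

replicas = [Site("S1"), Site("S2"), Site("S3")]
replicate_update(replicas, "account_42", 900)
assert all(s.data["account_42"] == 900 for s in replicas)

Every write now costs one message per replica, which is exactly the overhead and concurrency-control burden described above.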
 Fragmentation:

According to this method, the relations are divided (i.e., broken up into smaller pieces), and
each fragment is stored at the sites where it is needed. To ensure there is no data loss, the
fragments must be created in a way that allows the original relation to be reconstructed.

Since Fragmentation doesn't result in duplicate data, consistency is not a concern.

Ways of fragmentation:

 Horizontal Fragmentation:

In Horizontal Fragmentation, the relational table or schema is broken down into groups of one
or more rows, and each group of rows forms one fragment of the schema. It is also called
splitting by rows.

 Vertical Fragmentation:

In this fragmentation, a relational table or schema is divided into several smaller schemas. A
common candidate key must be present in each fragment in order to guarantee a lossless join.
This is also called splitting by columns. A sketch of both schemes follows.
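The two fragmentation schemes can be illustrated with a small Python sketch; the employees relation and the split criteria are invented for illustration. Horizontal fragmentation selects subsets of rows, while vertical fragmentation projects subsets of columns, keeping the candidate key "id" in every fragment so the original relation can be rebuilt with a lossless join:

# Illustrative relation: each dict is a row, "id" is the candidate key.
employees = [
    {"id": 1, "name": "Asha",  "dept": "Sales", "city": "Pune"},
    {"id": 2, "name": "Ravi",  "dept": "HR",    "city": "Mumbai"},
    {"id": 3, "name": "Meena", "dept": "Sales", "city": "Pune"},
]

# Horizontal fragmentation: split by rows using a predicate per site.
frag_pune   = [r for r in employees if r["city"] == "Pune"]
frag_mumbai = [r for r in employees if r["city"] == "Mumbai"]
assert sorted(frag_pune + frag_mumbai, key=lambda r: r["id"]) == employees

# Vertical fragmentation: split by columns; every fragment keeps "id"
# so a lossless join can reconstruct the original relation.
frag_personal = [{"id": r["id"], "name": r["name"]} for r in employees]
frag_work = [{"id": r["id"], "dept": r["dept"], "city": r["city"]}
             for r in employees]
rebuilt = [{**p, **w} for p in frag_personal
           for w in frag_work if p["id"] == w["id"]]
assert rebuilt == employees

The asserts check the reconstruction condition stated above: a union of the horizontal fragments, or a join of the vertical fragments on the shared key, recovers the original relation with no data loss.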

Note: Most of the time, a hybrid approach of replication and fragmentation is used.

Applications of Distributed Database Systems:

 It is used in multimedia applications.
 It is used in manufacturing control systems.
 It is used by corporate management for information systems.
 It is used in hotel chains, military command systems, etc.

What Is Distributed Data Processing?

Distributed data processing refers to the approach of handling and analyzing data across multiple
interconnected devices or nodes. In contrast to centralized data processing, where all data
operations occur on a single, powerful system, distributed processing decentralizes these tasks
across a network of computers. This method leverages the collective computing power of
interconnected devices, enabling parallel processing and faster data analysis.

Distributed processing means that a specific task can be broken up into functions, and the
functions are dispersed across two or more interconnected processors. A distributed application
is an application for which the component application programs are distributed between two or
more interconnected processors. Distributed data is data that is dispersed across two or more
interconnected systems.
When your application is located on one system and the data you need is located on one or more
other systems, you must decide whether you should write an application to access the distributed
data or write a distributed application.

PROMISES OF DDBMS

1. Transparency Management for Distributed and Replicated Data

➢A transparent system "hides" the implementation details from the end user, and this concept helps in building complex applications.

➢Storing each partition of a relation at a different site is known as fragmentation.

➢Furthermore, it may be preferable to duplicate some of these fragmented data items at different sites for performance and reliability reasons.

➢Fully transparent access means that the user can still pose the query without paying any attention to the fragmentation, location, or replication of the data.

2. RELIABILITY THROUGH DISTRIBUTED TRANSACTIONS

➢Distributed DBMSs are intended to improve reliability, since their replicated components eliminate single points of failure.

➢A distributed transaction is executed at a number of sites, at each of which it accesses the local database.

➢User applications do not need to be concerned with coordinating their accesses to individual local databases, nor do they need to worry about the possibility of site or communication link failures during the execution of their transactions.

3. Improved Performance

Improved performance for distributed DBMSs is based on two points.

1. Data localization- Data is stored in close proximity to its points of use. This has two potential
advantages:

(i) Since each site handles only a portion of the database, contention for CPU and I/O services
is not as severe as for centralized databases.

(ii) Localization reduces the remote access delays that are usually involved in wide area networks.

2. Inherent parallelism of distributed systems-

(i) Inter-query parallelism- the ability to execute multiple queries at the same time.

(ii) Intra-query parallelism- breaking up a single query into a number of subqueries, each
of which is executed at a different site, accessing a different part of the distributed database
(see the sketch below).
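As a sketch of intra-query parallelism, a single aggregate query can be split into subqueries, each executed against a different site's fragment concurrently, with the partial results combined at the end. The site fragments and the query are invented for illustration:

from concurrent.futures import ThreadPoolExecutor

# Invented fragments of one relation, each held at a different site.
site_fragments = {
    "S1": [120, 340, 75],
    "S2": [210, 90],
    "S3": [55, 400, 310],
}

def subquery_sum(fragment):
    # Each site evaluates the same subquery on its own fragment.
    return sum(fragment)

# Intra-query parallelism: run the subqueries concurrently, then
# combine the partial sums into the final answer.
with ThreadPoolExecutor() as pool:
    partial_sums = list(pool.map(subquery_sum, site_fragments.values()))
total = sum(partial_sums)
assert total == sum(sum(f) for f in site_fragments.values())

Here threads stand in for sites; in a real DDBMS the subqueries would run on separate machines, so the speedup comes from genuinely parallel hardware rather than concurrency on one node.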

4. Easier System Expansion


➢In a distributed environment, it is much easier to accommodate increasing database sizes.

➢Expansion can usually be handled by adding processing and storage power to the network.

➢One aspect of easier system expansion is economics. It normally costs much less to put together a
system of “smaller” computers with the equivalent power of a single big machine.

Problem Areas:

1. Complex nature:
Distributed databases are a network of many computers present at different locations, and
they provide an outstanding level of performance, availability, and reliability. The nature of
a distributed DBMS is therefore comparatively more complex than that of a centralized
DBMS, and complex software is required. Managing data replication without introducing
inconsistency adds even more complexity.

2. Overall Cost:
Various costs, such as maintenance cost, procurement cost, hardware cost,
network/communication costs, labor costs, etc., add up to the overall cost and make it
higher than that of a normal centralized DBMS.

3. Security issues:
In a distributed database, along with controlling data redundancy, the security of both the
data and the network is a prime concern, since a network can easily be attacked for data
theft and misuse.

4. Integrity Control:
In a vast distributed database system, maintaining data consistency is important: all
changes made to data at one site must be reflected at all the other sites. The communication
and processing cost of enforcing data integrity is therefore high in a distributed DBMS.

5. Lacking Standards:
Although it provides effective communication and data sharing, there are still no standard
rules and protocols for converting a centralized DBMS into a large distributed DBMS. This
lack of standards decreases the potential of distributed DBMSs.
6. Lack of Professional Support:
Due to the lack of adequate communication standards, it may not be possible to link
equipment produced by different vendors into a smoothly functioning network. Thus
several good resources may not be available to the users of the network.
7. Complex data design:
Designing a distributed database is more complex than designing a centralized database.

Distributed data storage (Fragmentation, Replication & Transparency):


Distribution transparency is the property of distributed databases by virtue of which the
internal details of the distribution are hidden from the users. The DDBMS designer may choose
to fragment tables, replicate the fragments, and store them at different sites. Since users are
oblivious to these details, they find the distributed database as easy to use as any centralized
database.

The three dimensions of distribution transparency are −

 Location transparency
 Fragmentation transparency
 Replication transparency

Location Transparency
Location transparency ensures that the user can query any table(s) or fragment(s) of a table as
if they were stored locally at the user's site. The fact that the table or its fragments are stored at
a remote site in the distributed database system should be completely invisible to the end user.
The address of the remote site(s) and the access mechanisms are completely hidden.

In order to incorporate location transparency, the DDBMS should have access to an updated
and accurate data dictionary and DDBMS directory, which contains the details of the locations of data.
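A minimal sketch of the role this directory plays follows; the catalog structure, site names, and run_query function are assumptions for illustration, not a real system's API. The user names only the table, and the system consults the directory to resolve its location:

# Illustrative data dictionary: logical table name -> site holding it.
catalog = {
    "EMPLOYEE": "site_A",
    "PROJECT":  "site_B",
}

def run_query(table, predicate):
    # The user never names a site; the DDBMS resolves the location.
    site = catalog[table]  # directory lookup
    return f"executing SELECT * FROM {table} WHERE {predicate} at {site}"

print(run_query("EMPLOYEE", "dept = 'Sales'"))
# The query text is identical whether EMPLOYEE is local or remote,
# which is exactly what location transparency requires.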

Fragmentation Transparency
Fragmentation transparency enables users to query upon any table as if it were unfragmented.
Thus, it hides the fact that the table the user is querying on is actually a fragment or union of
some fragments. It also conceals the fact that the fragments are located at diverse sites.

This is somewhat similar to users of SQL views, where the user may not know that they are
using a view of a table instead of the table itself.

Replication Transparency
Replication transparency ensures that the replication of databases is hidden from the users. It
enables users to query a table as if only a single copy of the table exists.

Replication transparency is associated with concurrency transparency and failure transparency.


Whenever a user updates a data item, the update is reflected in all the copies of the table.
However, this operation should not be visible to the user. This is concurrency transparency.
Also, in case of failure of a site, the user can still proceed with their queries using replicated
copies, without any knowledge of the failure. This is failure transparency.
Query Processing: Objectives, Query decomposition; Localization of distributed data:

Query processing in a distributed database management system requires the transmission of data
between the computers in a network. A distribution strategy for a query is the ordering of data
transmissions and local data processing in a database system. Generally, a query in a distributed
DBMS requires data from multiple sites, and this need for data from different sites means data
must be transmitted, which incurs communication costs. Query processing in a distributed DBMS
therefore differs from query processing in a centralized DBMS because of the communication cost
of data transfer over the network. The transmission cost is low when sites are connected through
high-speed networks and is quite significant in other networks.

The process used to retrieve data from a database is called query processing. Several steps are
involved in query processing to retrieve data from the database. The aspects considered here are:

 Costs (Transfer of data) of Distributed Query processing

 Using Semi join in Distributed Query processing

Costs (Transfer of Data) of Distributed Query Processing


In distributed query processing, the data transfer cost is the cost of transferring intermediate
files to other sites for processing, together with the cost of transferring the final result files to
the site where the result is required. Say a user sends a query to site S1 that requires data from
S1's own site and also from another site S2. There are three strategies to process this query,
given below:

1. We can transfer the data from S2 to S1 and then process the query at S1.

2. We can transfer the data from S1 to S2 and then process the query at S2.

3. We can transfer the data from both S1 and S2 to a third site S3 and then process the query there.

The choice depends on various factors, such as the sizes of the relations and of the result, the
communication cost between the different sites, and the site at which the result will be used.

Commonly, the data transfer cost is calculated in terms of the size of the messages. We can
calculate the data transfer cost with the following formula:

Data transfer cost = C * Size

where C refers to the cost per byte of transferring data and Size is the number of bytes transmitted.
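A minimal numeric sketch of this cost model follows; the cost-per-byte figure and the relation sizes are invented for illustration. Given the relation sizes at S1 and S2 and the fact that the result is needed at S1, the three strategies above can be compared directly:

C = 0.001            # assumed cost per byte transferred (invented)
size_r1 = 1_000_000  # bytes of the relation at S1 (invented)
size_r2 = 3_500_000  # bytes of the relation at S2 (invented)
size_result = 500_000  # bytes of the final result (invented)

# Strategy 1: ship S2's relation to S1 and process there
# (the result is already at S1, so nothing more to ship).
cost1 = C * size_r2
# Strategy 2: ship S1's relation to S2, process there, ship result to S1.
cost2 = C * (size_r1 + size_result)
# Strategy 3: ship both relations to S3, process, ship result to S1.
cost3 = C * (size_r1 + size_r2 + size_result)

best = min(("S1", cost1), ("S2", cost2), ("S3", cost3), key=lambda t: t[1])
print(f"cheapest processing site: {best[0]} at cost {best[1]:.0f}")

With these numbers strategy 2 wins (1500 versus 3500 and 5000), which shows why the choice depends on the relative sizes of the relations and the result rather than being fixed in advance.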

Query decomposition:
Query decomposition maps a distributed calculus query into an algebraic query on
global relations. The techniques used at this layer are those of a centralized DBMS,
since relation distribution is not yet considered at this point. The resultant algebraic
query is "good" in the sense that even if the subsequent layers apply a straightforward
algorithm, the worst executions will be avoided. However, the subsequent layers
usually perform important optimizations, as they add increasing detail about the
processing environment to the query.

Data localization:
Data localization takes as input the decomposed query on global relations and applies
data distribution information to the query in order to localize its data. To increase the
locality of reference and/or parallel execution, relations are fragmented and then stored
in disjoint subsets, called fragments, each being placed at a different site. Data
localization determines which fragments are involved in the query and thereby
transforms the distributed query into a fragment query. As with the decomposition
layer, the final fragment query is generally far from optimal, because quantitative
information about the fragments is not exploited at this point.

Transaction Management & Concurrency Control in DDBMS, Commit Protocols (2-PC, 3-PC)

COMMIT Protocol in DBMS


This section discusses distributed commit protocols in detail, along with the basics of
transactions and transaction states. Let us start the discussion by understanding transactions
and transaction states:

What is a Transaction?
According to generic computation theory (as in an operating system), the level of atomicity is a
single instruction, meaning that an instruction is either performed completely or not at all; a
program may therefore be left partially executed.

However, from the perspective of a database management system (DBMS), a user performs a
logical task (operation) that must be atomic in nature, meaning that there is no such thing as
partial execution.

Let us assume that we have to transfer 10000 rupees from the bank account of X to Y. The
following instructions are needed to complete the task:

1. Read X
2. X = X - 10000
3. Write X
4. Read Y
5. Y = Y + 10000
6. Write Y

If a failure occurs after executing the third instruction, the system's final state will be
inconsistent, as 10000 rupees will have been deducted from X's account but not credited to
Y's account.

For the system to be consistent, the total before the transaction must equal the total after it:
X + Y == X' + Y'.

To solve this partial-execution problem, we raise the level of atomicity and group all the
instructions of a logical operation into a unit referred to as a transaction. A transaction is
therefore formally defined as "A Set of Logically Related Instructions to Perform a
Logical Unit of Work."

TRANSACTION STATES

There are a total of five states a transaction can be in at any one time (a small sketch encoding
these states follows the list):

 ACTIVE - The initial state. A transaction is in the active state while it is executing its
instructions.
 PARTIALLY COMMITTED - After the final instruction of a transaction has been processed, if the
actual output is still temporarily stored in main memory rather than on disk, the transaction
remains in the partially committed state, because it is still possible that the transaction will
need to be aborted (due to some problem).
 FAILED - Whenever it is detected that the transaction cannot proceed due to a hardware or
software problem, the transaction enters the failed state.
 ABORTED - When a transaction has been rolled back and the database has been restored to
how it was before execution began, the transaction is said to be in the aborted state.
 COMMITTED - After a transaction successfully completes the execution of all its instructions
and the database has undergone its final update, it enters the committed state.

Now let us start with our main agenda.

DISTRIBUTED DBMS - COMMIT PROTOCOLS


In a local database system, the transaction manager only needs to inform the recovery manager
of its decision to commit a transaction. In a distributed system, however, the transaction
manager must enforce the commit decision consistently and communicate it to all the servers
at the various sites where the transaction is being executed. Each site's processing ends when
it reaches the partially committed transaction state, where it waits for all the other sites to
reach their partially committed states. Once it receives the signal that all the sites are ready,
it begins to commit. In a distributed system, either every site commits or none of them does.

To guarantee atomicity, the final outcome of the execution must be accepted at every site where
transaction T was executed: T must either commit at every site or abort at every site. The
transaction coordinator of T must carry out a commit protocol in order to guarantee this
property.

There are a total of three different distributed DBMS commit protocols:

 One-phase Commit,
 Two-phase Commit, and
 Three-phase Commit.

1. One-Phase Commit

The distributed one-phase commit is the most straightforward commit protocol. Consider the
scenario where the transaction is being carried out at a controlling site and several slave sites.
These are the steps followed in the one-phase distributed commit protocol:

 Each slave sends a "DONE" notification to the controlling site once it has successfully finished its
transaction locally.
 The slaves then wait for the controlling site's "Commit" or "Abort" message. This waiting period is
known as the window of vulnerability.
 After receiving the "DONE" message from every slave, the controlling site decides whether to
commit or abort. This is the commit point. It then broadcasts its decision to every slave.
 On receiving this message, each slave commits or aborts accordingly and sends an
acknowledgement back to the controlling site.

2. Two-Phase Commit

The two-phase commit protocol (2PC), explained in this section, is one of the most
straightforward and popular commit protocols. Distributed two-phase commit reduces the
vulnerability of the one-phase commit protocol. Consider a transaction T that executes at site Si
and is coordinated by the transaction coordinator Ci. When T has completed, that is, when all
the sites at which T has executed inform Ci that T has finished, Ci initiates the 2PC protocol.
The following actions are taken in the two phases:

Phase 1 (voting phase): Ci sends a prepare T message to all the sites at which T executed. On
receiving the message, each site decides whether it is ready to commit; it forces the relevant
log records to stable storage and replies ready T, or replies abort T if it cannot commit.

Phase 2 (decision phase): Ci commits T only if it receives ready T from every site; if any site
replies abort T (or fails to respond within a timeout), Ci decides to abort. Ci then sends its
commit T or abort T decision to all the sites, and each site applies that decision.
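A minimal sketch of the coordinator's side of 2PC follows; message passing is simulated with direct method calls, and the Participant class and two_phase_commit function are illustrative names. Phase one collects the votes, phase two broadcasts the global decision:

class Participant:
    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit
        self.decision = None

    def prepare(self):
        # Phase 1: force log records, then vote "ready" or "abort".
        return "ready" if self.will_commit else "abort"

    def finish(self, decision):
        # Phase 2: obey the coordinator's global decision.
        self.decision = decision

def two_phase_commit(participants):
    # Phase 1 (voting): ask every site whether it can commit T.
    votes = [p.prepare() for p in participants]
    # Phase 2 (decision): commit only if every site voted "ready".
    decision = "commit" if all(v == "ready" for v in votes) else "abort"
    for p in participants:
        p.finish(decision)
    return decision

sites = [Participant("S1"), Participant("S2"), Participant("S3", will_commit=False)]
print(two_phase_commit(sites))  # "abort", because S3 voted abort

The blocking problem mentioned next arises between the two phases: a site that has voted "ready" must wait for the coordinator's decision, and if the coordinator fails at that moment the site can neither commit nor abort on its own.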

3. Three-Phase Commit
The two-phase commit protocol can be extended to overcome its blocking problem using the
three-phase commit (3PC) protocol, under particular assumptions.
Specifically, it is assumed that no network partition occurs and that no more than k sites fail,
where k is a predetermined number. Under these assumptions, the protocol avoids blocking by
adding a third phase, which involves multiple sites in the commit decision.
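For orientation, the three phases can be outlined in a short sketch, reusing the illustrative Participant class from the 2PC sketch above; the phase names canCommit/preCommit/doCommit are conventional labels, not a specific system's API, and acknowledgements and timeouts are omitted. The extra pre-commit round is what lets a surviving site learn the intended decision when the coordinator fails:

def three_phase_commit(participants):
    # Phase 1 (canCommit?): collect votes, as in 2PC's voting phase.
    if not all(p.prepare() == "ready" for p in participants):
        for p in participants:
            p.finish("abort")
        return "abort"
    # Phase 2 (preCommit): record the intent to commit at multiple sites
    # (at least k of them), so the decision survives a coordinator failure.
    for p in participants:
        p.finish("pre-commit")
    # Phase 3 (doCommit): only now make the commit final everywhere.
    for p in participants:
        p.finish("commit")
    return "commit"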

Before recording the decision to commit in its persistent storage, the coordinator first makes
certain that at least k other sites know that it intended to commit the transaction. If the
coordinator fails, the surviving sites first elect a replacement.

The new coordinator checks the status of the protocol with the remaining sites. If the old
coordinator had decided to commit, at least one of the k other sites it had notified would be
up and would ensure that the commit decision is honored. If some site learned that the previous
coordinator intended to complete the transaction, the new coordinator restarts the third phase
of the protocol. Otherwise, the new coordinator aborts the transaction.
The 3PC protocol has the advantage of not blocking unless more than k sites fail, but it has the
disadvantage that a network partition may be mistaken for more than k sites failing, which
would result in blocking. In addition, the protocol must be carefully designed to prevent
inconsistent results, such as a transaction being committed in one partition but aborted in
another, in the event of a network partition (or of more than k sites failing). Because of its
overhead, the 3PC protocol is not frequently used.

You might also like