CH.4
In parallel databases, there are mainly four architectural designs for a parallel DBMS. They are as follows:
1. Shared Memory Architecture - In shared memory architecture, multiple CPUs are attached to an interconnection network and share a single global main memory and common disk arrays. It is to be noted that, in this architecture, a single copy of a multi-threaded operating system and a multi-threaded DBMS can support these multiple CPUs. Also, shared memory is a tightly coupled architecture in which multiple CPUs share their memory. It is also known as symmetric multiprocessing (SMP). This architecture covers a very wide range, starting from personal workstations that support a few microprocessors in parallel via RISC.
Disadvantages:
1. As the number of CPUs increases, the problems of interference and memory contention also increase.
2. There also exists a scalability problem.
2. Shared Disk Architecture - In shared disk architecture, each CPU has its own private memory, but all CPUs share common disk arrays through the interconnection network.
Advantages:
1. The interconnection network to memory is no longer a bottleneck, since each CPU has its own memory.
2. Load balancing is easier in shared disk architecture.
3. There is better fault tolerance.
3. Shared Nothing Architecture - In shared nothing architecture, each CPU has its own main memory and its own disk storage, and the sites communicate only by passing messages over the interconnection network.
Note that this technology is typically used for very large databases of around 10^12 bytes (1 TB) in size, or for systems that process thousands of transactions per second.
4. Hierarchical Architecture:
This architecture is a combination of the shared disk, shared memory, and shared nothing architectures. It is scalable due to the availability of more memory and many processors, but it is costlier than the other architectures.
A distributed DBMS can also be characterized along dimensions such as autonomy and heterogeneity:
Autonomy:
It reveals the division of power inside the database system and the degree of autonomy enjoyed by each individual DBMS.
Heterogeneity:
It refers to the similarities or differences among the databases, the system components, and the data models.
Client-Server Architecture:
This is a two-level architecture in which the main functionality is divided between clients and servers. The server provides various functions, such as managing transactions, managing data, processing queries, and optimization.
Multi-DBMS Architecture:
This is an amalgam of two or more independent database systems that functions as a single integrated database system.
Homogeneous Distributed Database:
Each site stores the same database in a homogeneous distributed database. Since each site has the same database stored, the data management schemes, operating system, and data structures will be the same across all sites. Such systems are, therefore, simple to handle.
Heterogeneous Distributed Database:
In this type of database system, different sites store the data and relational tables differently, which makes it difficult for database administrators to run transactions and queries against the database. Additionally, one site might not even be aware of the existence of the other sites. Different computers may use different operating systems and database applications. Since each system has its own data model for storing the data, translation schemes are required to establish connections between different sites and transfer the data.
Replication:
This method involves redundantly storing the full relation at two or more sites. Since the complete database can be accessed from each site, it becomes a redundant database. Systems preserve copies of the data as a result of replication.
This has advantages because it makes more data accessible at many locations. Additionally,
query requests can now be handled in parallel.
However, there are some drawbacks as well. Data must be updated frequently: any change performed at one site must be recorded at every site where that relation is stored in order to avoid inconsistent results, which incurs a great deal of overhead. Additionally, concurrency control becomes far more complicated, since concurrent access must now be monitored across several sites.
Fragmentation:
According to this method, the relations are divided (i.e., broken up into smaller pieces), and each fragment is stored at the various sites where it is needed. The fragments must be created in a way that allows the original relation to be reconstructed, so that no data is lost.
Ways of fragmentation:
Horizontal Fragmentation:
In horizontal fragmentation, the relational table or schema is broken down into groups of one or more rows, and each group of rows forms one fragment of the schema. It is also called splitting by rows.
Vertical Fragmentation:
In this fragmentation, a relational table or schema is divided into smaller schemas, each containing a subset of the columns. A common candidate key must be present in each fragment in order to guarantee a lossless join. This is also called splitting by columns.
Note: Most of the time, a hybrid approach of replication and fragmentation is used.
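To make the two fragmentation styles concrete, here is a minimal Python sketch, assuming a relation represented as a list of dictionaries; the employee table, its columns, and the region-based predicate are all hypothetical. It checks that each style permits lossless reconstruction of the original relation.

# Hypothetical relation: a list of rows keyed by column name.
employee = [
    {"emp_id": 1, "name": "Asha",  "region": "North", "salary": 50000},
    {"emp_id": 2, "name": "Bilal", "region": "South", "salary": 60000},
]

# Horizontal fragmentation: split by rows using a predicate.
frag_north = [row for row in employee if row["region"] == "North"]
frag_south = [row for row in employee if row["region"] == "South"]
# Lossless reconstruction: the union of the fragments restores the relation.
assert sorted(frag_north + frag_south, key=lambda r: r["emp_id"]) == employee

# Vertical fragmentation: split by columns; both fragments keep the
# candidate key emp_id so that a join can rebuild the original rows.
frag_personal = [{"emp_id": r["emp_id"], "name": r["name"]} for r in employee]
frag_payroll = [{"emp_id": r["emp_id"], "region": r["region"],
                 "salary": r["salary"]} for r in employee]
# Lossless reconstruction: join the vertical fragments on emp_id.
rebuilt = [{**p, **q} for p in frag_personal for q in frag_payroll
           if p["emp_id"] == q["emp_id"]]
assert sorted(rebuilt, key=lambda r: r["emp_id"]) == employee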
Distributed data processing refers to the approach of handling and analyzing data across multiple
interconnected devices or nodes. In contrast to centralized data processing, where all data
operations occur on a single, powerful system, distributed processing decentralizes these tasks
across a network of computers. This method leverages the collective computing power of
interconnected devices, enabling parallel processing and faster data analysis.
Distributed processing means that a specific task can be broken up into functions, and the
functions are dispersed across two or more interconnected processors. A distributed application
is an application for which the component application programs are distributed between two or
more interconnected processors. Distributed data is data that is dispersed across two or more
interconnected systems.
When your application is located on one system and the data you need is located on one or more
other systems, you must decide whether you should write an application to access the distributed
data or write a distributed application.
PROMISES OF DDBMS
1. Transparency Management for Distributed and Replicated Data ➢A transparent system "hides" the implementation details from the end user, and this concept helps in building complex applications.
➢Storing each partition of a relation at different sites is known as fragmentation.
➢Furthermore, it may be preferable to duplicate some of these fragmented data items at different sites
for performance and reliability reasons.
➢Fully transparent access means that the user can still pose the query, without paying any attention to
the fragmentation, location or replication of the data.
2. Reliability Through Distributed Transactions ➢Distributed DBMSs are intended to improve reliability, since they have replicated components and thereby eliminate single points of failure.
➢Distributed transactions are executed at a number of sites, each of which accesses its local database.
➢User applications do not need to be concerned with coordinating their accesses to individual local databases, nor do they need to worry about the possibility of site or communication link failures during the execution of their transactions.
3. Improved Performance Improved performance for distributed DBMSs is based on two points:
1. Data localization - data is stored in close proximity to its points of use. This has two potential advantages:
(i) Since each site handles only a portion of the database, contention for CPU and I/O services is not as severe as for centralized databases.
(ii) Localization reduces the remote access delays that are usually involved in wide area networks.
2. Inherent parallelism of distributed systems:
(i) Inter-query parallelism - the ability to execute multiple queries at the same time.
(ii) Intra-query parallelism - breaking up a single query into a number of subqueries, each of which is executed at a different site, accessing a different part of the distributed database (see the sketch after this list).
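As a rough illustration of intra-query parallelism, the following Python sketch runs one subquery per fragment concurrently and merges the partial results; the fragments, their contents, and the salary predicate are hypothetical, and a real system would ship the subqueries to remote sites rather than to local threads.

from concurrent.futures import ThreadPoolExecutor

# Hypothetical fragments of one relation, held "at" two different sites.
fragment_site1 = [{"id": 1, "salary": 50000}, {"id": 2, "salary": 70000}]
fragment_site2 = [{"id": 3, "salary": 90000}]

def subquery(fragment):
    # Subquery executed against one fragment: select the well-paid rows.
    return [row for row in fragment if row["salary"] > 60000]

with ThreadPoolExecutor() as pool:
    # The single query is broken into one subquery per fragment
    # (intra-query parallelism); running unrelated queries in other
    # threads at the same time would be inter-query parallelism.
    partials = list(pool.map(subquery, [fragment_site1, fragment_site2]))

result = [row for part in partials for row in part]
print(result)   # rows with id 2 and 3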
4. Easier System Expansion ➢Expansion can usually be handled by adding processing and storage power to the network.
➢One aspect of easier system expansion is economics: it normally costs much less to put together a system of "smaller" computers with power equivalent to that of a single big machine.
Problem Areas:
1. Complex nature:
Distributed databases are a network of many computers present at different locations, and they provide an outstanding level of performance, availability, and, of course, reliability. Therefore, the nature of a distributed DBMS is comparatively more complex than that of a centralized DBMS, and complex software is required. Also, managing data replication adds even more complexity.
2. Overall Cost:
Various costs, such as maintenance cost, procurement cost, hardware cost, network/communication costs, labor costs, etc., add up to the overall cost, making it costlier than a normal DBMS.
3. Security issues:
In a distributed database, along with controlling data redundancy, the security of the data as well as of the network is a prime concern. A network can be easily attacked for data theft and misuse.
4. Integrity Control:
In a vast distributed database system, maintaining data consistency is important. All changes made to data at one site must be reflected at all the sites. The communication and processing cost of enforcing data integrity is high in a distributed DBMS.
5. Lacking Standards:
Although a distributed DBMS provides effective communication and data sharing, there are still no standard rules and protocols for converting a centralized DBMS into a large distributed DBMS. This lack of standards decreases the potential of distributed DBMSs.
6. Lack of Professional Support:
Due to a lack of adequate communication standards, it is not possible to link different
equipment produced by different vendors into a smoothly functioning network. Thus
several good resources may not be available to the users of the network.
7. Data design complex:
Designing a distributed database is more complex compared to a centralized database.
A distributed DBMS provides three main types of transparency:
Location transparency
Fragmentation transparency
Replication transparency
Location Transparency
Location transparency ensures that the user can query any table(s) or fragment(s) of a table as if they were stored locally at the user's site. The fact that the table or its fragments are stored at a remote site in the distributed database system should be completely hidden from the end user. The address of the remote site(s) and the access mechanisms are completely hidden.
In order to provide location transparency, the DDBMS should have access to an up-to-date and accurate data dictionary and DDBMS directory that contains the details of the locations of data.
Fragmentation Transparency
Fragmentation transparency enables users to query upon any table as if it were unfragmented.
Thus, it hides the fact that the table the user is querying on is actually a fragment or union of
some fragments. It also conceals the fact that the fragments are located at diverse sites.
This is somewhat similar to users of SQL views, where the user may not know that they are
using a view of a table instead of the table itself.
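The view analogy can be made concrete with a small Python/sqlite3 sketch (the employee fragments and their contents are hypothetical): a UNION ALL view lets the user query one logical table while the fragmentation stays hidden.

import sqlite3

conn = sqlite3.connect(":memory:")
# Two hypothetical horizontal fragments of an employee table.
conn.execute("CREATE TABLE employee_north (emp_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE employee_south (emp_id INTEGER, name TEXT)")
conn.execute("INSERT INTO employee_north VALUES (1, 'Asha')")
conn.execute("INSERT INTO employee_south VALUES (2, 'Bilal')")

# The view hides the fragmentation: queries name 'employee', not fragments.
conn.execute("""CREATE VIEW employee AS
                SELECT * FROM employee_north
                UNION ALL
                SELECT * FROM employee_south""")

print(conn.execute("SELECT * FROM employee ORDER BY emp_id").fetchall())
# [(1, 'Asha'), (2, 'Bilal')]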
Replication Transparency
Replication transparency ensures that the replication of databases is hidden from the users. It enables users to query a table as if only a single copy of the table exists.
Query processing in a distributed database management system requires the transmission of data between the computers in a network. A distribution strategy for a query is the ordering of data transmissions and local data processing in a database system. Generally, a query in a distributed DBMS requires data from multiple sites, and this need for data from different sites forces the transmission of data, which incurs communication costs. Query processing in a distributed DBMS thus differs from query processing in a centralized DBMS due to the communication cost of transferring data over the network. The transmission cost is low when sites are connected through high-speed networks and is quite significant in other networks.
The process used to retrieve data from a database is called query processing. Several steps are involved in query processing. For example, suppose a query posed at site S3 requires data stored at sites S1 and S2. The possible actions are:
1. We can transfer the data from S2 to S1 and then process the query there.
2. We can transfer the data from S1 to S2 and then process the query there.
3. We can transfer the data from S1 and S2 to S3 and then process the query there.
The choice depends on various factors, such as the sizes of the relations and of the result, the communication cost between different sites, and the site at which the result will be utilized.
Commonly, the data transfer cost is calculated in terms of the size of the messages. The data transfer cost can be calculated using the formula:
Data transfer cost = C * Size
where C refers to the cost per byte of transferring data and Size is the number of bytes transmitted.
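As a worked example of this formula, the following Python sketch compares the three transfer strategies listed above under hypothetical figures for C and for the relation and result sizes.

# Hypothetical figures: C = 0.001 cost units per byte; a join needs
# relation R (1,000,000 bytes, at S1) and relation S (200,000 bytes,
# at S2), with the result (50,000 bytes) needed at S3.
C = 0.001                      # cost per byte transferred

size_R, size_S, size_result = 1_000_000, 200_000, 50_000

# Strategy 1: ship S from S2 to S1, join there, ship the result to S3.
cost_1 = C * (size_S + size_result)
# Strategy 2: ship R from S1 to S2, join there, ship the result to S3.
cost_2 = C * (size_R + size_result)
# Strategy 3: ship both relations to S3 and join there.
cost_3 = C * (size_R + size_S)

print(cost_1, cost_2, cost_3)   # 250.0 1050.0 1200.0
# Strategy 1 wins here because only the smaller relation and the result travel.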
Query decomposition:
Query decomposition maps a distributed calculus query into an algebraic query on
global relations. The techniques used at this layer are those of the centralized DBMS
since relation distribution is not yet considered at this point. The resultant algebraic
query is “good” in the sense that even if the subsequent layers apply a straightforward
algorithm, the worst executions will be avoided. However, the subsequent layers usually perform important optimizations, as they add increasing detail about the processing environment to the query.
Data localization:
Data localization takes as input the decomposed query on global relations and applies data distribution information to the query in order to localize its data. In Chapter
3 we have seen that to increase the locality of reference and/or parallel execution,
relations are fragmented and then stored in disjoint subsets, called fragments, each
being placed at a different site. Data localization determines which fragments are
involved in the query and thereby transforms the distributed query into a fragment
query. Similar to the decomposition layer, the final fragment query is generally far
from optimal because quantitative information regarding fragments is not exploited
at this point.
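A minimal sketch of what data localization does, assuming query trees represented as nested tuples and a hypothetical fragment map: each global relation in the query is replaced by the union of its fragments, yielding a fragment query.

# Hypothetical fragmentation: EMP is horizontally split into three fragments.
fragment_map = {"EMP": ["EMP1", "EMP2", "EMP3"]}

def localize(query):
    # Query trees are nested tuples: ("select", predicate, subtree) or
    # ("relation", name). Each global relation becomes a union of fragments.
    op = query[0]
    if op == "relation":
        frags = fragment_map.get(query[1], [query[1]])
        tree = ("relation", frags[0])
        for f in frags[1:]:
            tree = ("union", tree, ("relation", f))
        return tree
    if op == "select":
        return ("select", query[1], localize(query[2]))
    return query

global_query = ("select", "salary > 60000", ("relation", "EMP"))
print(localize(global_query))
# ('select', 'salary > 60000', ('union', ('union', ('relation', 'EMP1'),
#  ('relation', 'EMP2')), ('relation', 'EMP3')))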
Transaction Management & Concurrency Control in DDBMS, Commit Protocols (2-PC, 3-PC)
What is a Transaction?
According to generic computation theory (as in operating systems), the level of atomicity is a single instruction: an instruction is either performed completely or not at all, so a program may be left partially executed.
However, from the perspective of a database management system (DBMS), a user performs a logical task (operation) that is always atomic in nature, meaning that there is no such thing as partial execution.
Let us assume that we have to transfer 10000 rupees from the bank account of X to Y. The
following instructions are needed to complete the following task:
1. Read X
2. X = X - 10000
3. Write X
4. Read Y
5. Y = Y + 10000
6. Write Y
The system's ultimate state will be inconsistent if a failure occurs after executing the third
instruction, as 10000 rupees will be deducted from X's account but will not be credited to Y's
account.
We raise the level of atomicity and group all the instructions for a logical operation into a unit
referred to as a transaction in order to solve this partial execution problem. Therefore, a
transaction is officially defined as "A Set of Logically Related Instructions to Perform a
Logical Unit of Work."
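As a minimal sketch of this idea, the following Python/sqlite3 snippet wraps the six instructions in one transaction; the accounts table and its contents are hypothetical. Either both updates become permanent, or a failure triggers a rollback that restores the database.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (holder TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('X', 50000), ('Y', 20000)")
conn.commit()

try:
    # The debit and the credit run inside one transaction: either both
    # updates are made durable, or neither is.
    conn.execute("UPDATE accounts SET balance = balance - 10000 WHERE holder = 'X'")
    conn.execute("UPDATE accounts SET balance = balance + 10000 WHERE holder = 'Y'")
    conn.commit()          # COMMITTED state: both updates become permanent
except sqlite3.Error:
    conn.rollback()        # ABORTED state: the database is restored

print(conn.execute("SELECT * FROM accounts").fetchall())
# [('X', 40000), ('Y', 30000)]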
TRANSACTION STATES
ACTIVE - It is the initial state. A transaction will be in active state while it is executing its
instructions.
PARTIALLY COMMITTED - After the final instruction of a transaction has been processed, if the actual output is still temporarily stored in main memory rather than on disk, the transaction remains in the partially committed state, because it is still possible that the transaction will need to be aborted (due to some problem).
FAILED - Whenever it is detected that the transaction has failed due to a hardware or software problem, the transaction enters the failed state.
ABORTED - When a transaction has been rolled back, and the database has been returned to
how it was before the execution began, the transaction is said to be in an Aborted state.
COMMITTED - After a transaction has successfully completed the execution of all its instructions and the database has undergone its final update, it enters the committed state.
The commit protocols used in distributed DBMSs are:
One-Phase Commit,
Two-Phase Commit, and
Three-Phase Commit.
1. One-Phase Commit
The distributed one-phase commit is the most straightforward commit protocol. Consider the
scenario where the transaction is being carried out at a controlling site and several slave sites.
These are the steps followed in the one-phase distributed commit protocol:
Each slave sends a "DONE" notification to the controlling site once it has successfully finished its
transaction locally.
The slaves await the controlling site's "Commit" or "Abort" message. This period of waiting is known as the window of vulnerability.
The controlling site decides whether to commit or abort after receiving the "DONE" message
from each slave. The commit point is where this happens. It then broadcasts this message to
every slave.
An acknowledgement message is sent to the controlling site by the slave once it commits or
aborts in response to this message.
2. Two-Phase Commit
The two-phase commit protocol (2PC), which is explained in this section, is one of the most straightforward and popular commit protocols. Distributed two-phase commit reduces the vulnerability of the one-phase commit protocol. When T has completed, i.e., when all the sites at which T has executed inform the coordinator Ci that T has completed, Ci initiates the 2PC protocol. The following actions are taken in the two phases (a sketch of the message flow follows below):
Phase 1 (voting phase): Ci asks every participating site to prepare to commit; each site replies "Ready" if its local subtransaction can be committed, or "Abort" otherwise.
Phase 2 (decision phase): if every site voted "Ready", Ci decides to commit and broadcasts "Commit" to all sites; otherwise it broadcasts "Abort". Each site carries out the decision locally and acknowledges it to Ci.
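Here is a minimal, in-process Python sketch of the 2PC message flow; the Participant class and the coordinator function are hypothetical stand-ins, not a real DDBMS API.

class Participant:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit = name, can_commit

    def prepare(self):
        # Phase 1: vote READY only if the local subtransaction can commit.
        return "READY" if self.can_commit else "ABORT"

    def decide(self, decision):
        # Phase 2: apply the coordinator's global decision locally.
        print(f"{self.name}: {decision}")

def two_phase_commit(participants):
    # Phase 1 (voting): the coordinator collects votes from every site.
    votes = [p.prepare() for p in participants]
    # Phase 2 (decision): commit only if every site voted READY.
    decision = "COMMIT" if all(v == "READY" for v in votes) else "ABORT"
    for p in participants:
        p.decide(decision)
    return decision

two_phase_commit([Participant("S1"), Participant("S2")])          # COMMIT
two_phase_commit([Participant("S1"), Participant("S2", False)])   # ABORT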
3. Three-Phase Commit
The two-phase commit protocol can be extended to overcome the blocking issue using the three-
phase commit (3PC) protocol, under particular assumptions.
It is specifically anticipated that there will be no network partitions and that there won't be any
more than k sites that fail, where k is a preset number. Under these presumptions, the protocol
prevents blocking by adding a third phase that involves several sites in the commit decision.
The coordinator first makes certain that at least k other sites know that it intended to commit the transaction before it records the decision to commit in its persistent storage. If the coordinator fails, the surviving sites first elect a replacement. The new coordinator checks the status of the protocol with the remaining sites: if the old coordinator had decided to commit, at least one of the other k sites it had informed would be up and would ensure that the commit decision is upheld. If some site knew that the old coordinator intended to commit the transaction, the new coordinator restarts the third phase of the protocol. Otherwise, the new coordinator aborts the transaction.
The 3PC protocol has the advantage of not blocking unless more than k sites fail, but it has the disadvantage that a network partitioning may appear to be the same as more than k sites failing, which would result in blocking. The protocol must also be carefully designed to prevent inconsistent results, such as the transaction being committed in one partition but aborted in another, in the event of a network partitioning (or of more than k sites failing). Because of its overhead, the 3PC protocol is not frequently used.
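For comparison with the 2PC sketch above, here is an equally simplified Python sketch of the three 3PC phases; the vote list and the acknowledgement counting are hypothetical simplifications of the real protocol.

def three_phase_commit(votes, k=1):
    # Phase 1 (voting): collect READY/ABORT votes from all sites.
    if not all(v == "READY" for v in votes):
        return "ABORT"
    # Phase 2 (pre-commit): tell the sites the intended decision; proceed
    # only once at least k acknowledgements have arrived, so that a new
    # coordinator can always learn the decision after a failure.
    acks = len(votes)          # here, every site acknowledges PRE-COMMIT
    if acks < k:
        return "ABORT"
    # Phase 3 (commit): record the decision and broadcast COMMIT.
    return "COMMIT"

print(three_phase_commit(["READY", "READY"]))   # COMMIT
print(three_phase_commit(["READY", "ABORT"]))   # ABORT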