CATHOLIC UNIVERSITY OF GHANA, FIAPRE-SUNYANI
FACULTY OF COMPUTING, ENGINEERING AND MATHEMATICAL SCIENCE
PROGRAM: BSc. COMPUTER SCIENCE
Topic:
Fault-Tolerant Parallel Algorithms for Large-Scale Systems
NAME
AMIDU KIKAH ZULKANANI
INDEX NUMBER
UGW0202320126
Table of Contents
1.1 Introduction
1.2 How Fault Tolerance Works
1.3 Types of Faults in Parallel Computing
1.3.1 Process Failures
1.3.2 Node Failures
1.3.3 Network Failures
1.4 Methods of Attaining Fault Tolerance
1.4.1 Checkpointing
1.4.1.1 Advantages of Checkpointing
1.4.2 Replication
1.4.2.1 Kinds of Replication
1.4.3 Redundancy
1.4.4 Error Correction
1.5 Fault-Tolerant Parallel Algorithms
1.5.1 Master-Worker Algorithm
1.5.1.1 How the Master-Worker Algorithm Handles Faults
1.5.2 Distributed Memory Algorithm
1.5.2.1 Fault Handling by Distributed Memory Algorithms
1.5.3 Parallel Task Queue Algorithm
1.5.3.1 Fault Handling
1.6 Issues in Designing Fault-Tolerant Algorithms
1.6.1 Scalability Issues
1.6.2 Overhead Issues
1.6.3 Complexity of Design
1.7 Applications of Fault-Tolerant Parallel Algorithms
1.7.1 High-Performance Computing (HPC)
1.7.2 Cloud Computing
1.7.3 Real-Time Systems
1.7.4 Internet of Things (IoT)
1.7.5 Edge Computing
References
1.1 Introduction
Fault-tolerant parallel algorithms are designed to keep executing correctly in the presence of failures while still offering performance and reliability in large computing systems such as clusters and clouds. These systems are typically beset by several kinds of failures, such as process, node, and network failures, which can bring execution to a critical standstill. The importance of fault tolerance cannot be overstressed because it allows the system to execute continuously, thus ensuring data integrity and providing availability of service (Kale & Krishnan, 2018).
1.2 How Fault Tolerance Works
At a high level, fault-tolerant parallel algorithms employ more than one approach to manage and recover from failures. Among the most widely used methods are:
Checkpointing: Taking periodic snapshots of the state of computation to aid in recovery
from separate points after failure (Gupta & Trivedi, 2016).
Replication: Creating duplicate copies of processes or data for the intent of providing
alternatives after a fault (Huang & Kintala, 2017).
Redundancy: Employing redundant resources that will be utilized in case primary ones
fail (Shooman, 2017).
Error Correction: Applying algorithms that are error-detecting and correcting during data
processing (Lin & Costello, 2017).
These techniques work in combination to detect failures and dynamically reorganize the system.
Applications are thus able to keep running with little interruption, hence enhancing the reliability
of cloud and high-performance computing systems.
1.3 Types of Faults in Parallel Computing
In parallel computing systems, several types of faults exist that may lead to system failure. These
faults need to be known so that efficient fault tolerant methods can be adopted. The most
prominent types of faults are process failures, node failures, and network failures (Kale &
Krishnan, 2018).
1.3.1 Process Failures
Process failures occur when a processing unit fails to execute as it should. This may be due to numerous possible causes, including software faults or resource unavailability. For example, if a data processing task running in a multi-threaded environment crashes because of a memory leak, the entire operation based on that task can fail.
Example
In a matrix multiplication program, if one process computing a row of the result matrix encounters an unhandled exception, it will terminate the whole computation unless fault-tolerant methods intervene.
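The scenario above can be sketched in Python. This is a minimal illustration only, using threads to stand in for parallel workers; the matrices and the deliberately bad value are invented for the example. Catching the worker's exception isolates the failed row instead of aborting the whole computation:

```python
from concurrent.futures import ThreadPoolExecutor

def multiply_row(row, matrix):
    """Compute one row of the matrix product; a bad entry raises an exception."""
    return [sum(a * b for a, b in zip(row, col)) for col in zip(*matrix)]

A = [[1, 2], [3, "x"]]   # the second row will fail: cannot sum int and str
B = [[5, 6], [7, 8]]

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(multiply_row, row, B) for row in A]
    results = []
    for f in futures:
        try:
            results.append(f.result())   # re-raises the worker's exception
        except TypeError:
            results.append(None)         # isolate the failed row, keep going

print(results)  # [[19, 22], None] -- one failed task no longer kills the run
```

Without the try/except, the exception raised by `f.result()` would propagate and end the whole computation, which is exactly the failure mode described above.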
1.3.2 Node Failures
Node failures are failures of an entire computing unit or machine in a distributed system.
Hardware failures can lead to this type of failure, such as power loss, overheating, or physical
damage to server components.
Example
Take the case of a cloud computing setup in which a virtual machine becomes unavailable
because of hardware failure. If the primary computation is on that node, other processes that are
dependent on it will also be affected, and operations such as failover to backup nodes or task
reallocation will be required.
1.3.3 Network Failures
Network failures are failures in the data channels between nodes. They may take the form of packet loss, very high latency, or a complete collapse of communication. Such failures can drastically affect coordination among processes that depend on consistent data exchange.
Example
For a distributed algorithm in which nodes exchange computational outcomes, a sudden network failure could induce inconsistent data states, leaving some nodes with stale information and leading to erroneous results or deadlocks.
1.4 Methods of Attaining Fault Tolerance
Fault tolerance of parallel computing relies on several crucial techniques that cooperate to enable
systems to recover from failure with minimal interruption. We address in-depth here four central
techniques: checkpointing, replication, redundancy, and error correction. Each one of these
techniques has unique mechanisms and applications to provide reliability for parallel systems.
1.4.1 Checkpointing
Checkpointing involves saving the state of a computation at regular intervals so that in the event
of failure, the system can resume the earlier saved state (Gupta & Trivedi, 2016).
Figure 1.1 Checkpointing
1.4.1.1 Advantages of Checkpointing
Recovery efficiency: by allowing programs to restart from their last checkpoint rather than from the beginning, overall computation time is reduced, especially when failures are frequent.
Tunable frequency: users may select the checkpoint interval based on performance demands and failure-rate analysis.
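As a concrete sketch, the Python fragment below periodically saves the loop state to disk so that a restarted run resumes from the last snapshot rather than from index 0. The file name and the summation task are invented for the illustration:

```python
import os
import pickle

CHECKPOINT = "sum_state.pkl"  # hypothetical checkpoint file name

def load_checkpoint():
    """Resume from the last saved state, or start fresh if none exists."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"next_index": 0, "partial_sum": 0}

def save_checkpoint(state):
    """Snapshot the computation state so a restart can resume here."""
    with open(CHECKPOINT, "wb") as f:
        pickle.dump(state, f)

def checkpointed_sum(data, interval=100):
    state = load_checkpoint()
    for i in range(state["next_index"], len(data)):
        state["partial_sum"] += data[i]
        state["next_index"] = i + 1
        if state["next_index"] % interval == 0:   # periodic snapshot
            save_checkpoint(state)
    if os.path.exists(CHECKPOINT):
        os.remove(CHECKPOINT)   # clean up after successful completion
    return state["partial_sum"]

print(checkpointed_sum(list(range(1000))))  # 499500
```

If the process dies mid-run, the next invocation reloads the last snapshot and skips the already-processed prefix, which is the trade-off the checkpoint interval controls: more frequent snapshots mean less recomputation but more I/O overhead.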
1.4.2 Replication
Replication means copying data or processes across multiple nodes so that if one fails, another can substitute for it without interruption (Huang & Kintala, 2017).
Figure 1.2 Replication
1.4.2.1 Kinds of Replication
Active Replication: All of the replicas take part in processing tasks simultaneously,
providing real-time failover support.
Passive Replication: One active process executes tasks and the rest remain passive until a
failure occurs.
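The passive variant can be sketched in a few lines of Python. The `ReplicatedStore` class is hypothetical, purely for illustration: writes go to a primary copy and are propagated to backups, and fail-over promotes a backup when the primary is lost:

```python
class ReplicatedStore:
    """Passive-replication sketch: one primary copy plus standby backups."""

    def __init__(self, n_replicas=2):
        self.primary = {}
        self.backups = [{} for _ in range(n_replicas)]

    def put(self, key, value):
        self.primary[key] = value
        for backup in self.backups:   # propagate every update to the backups
            backup[key] = value

    def fail_over(self):
        """Promote a backup to primary after the primary node is lost."""
        self.primary = self.backups.pop(0)

store = ReplicatedStore()
store.put("result", 42)
store.fail_over()                 # simulate losing the primary node
print(store.primary["result"])    # 42 -- the data survives the failure
```

In active replication, by contrast, every replica would execute the `put` itself rather than receiving a copied update, trading extra computation for instantaneous failover.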
1.4.3 Redundancy
Redundancy provides additional resources which might be invoked when the primary resources
are down. This might be in the form of hardware, software, or network configuration (Shooman,
2017).
Figure 1.3 Redundancy
Examples include
1. Keeping dormant servers or components that come into play when their operational peers fail.
2. Maintaining replica copies of valuable data on distinct nodes or at distinct locations.
1.4.4 Error Correction
Error correction are the routines used to find and correct errors during data processing (Lin &
Costello, 2017). They are of varying levels, including:
Forward Error Correction (FEC): Appending redundant information so that the errors can
be identified and fixed without needing the information to be resent.
Checksums: Simple methods wherein data integrity is verified by calculating checksums
before and after transmission.
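The checksum approach can be illustrated with Python's built-in CRC32, a common error-detecting code; the function names here are invented for the example. The sender transmits a checksum alongside the data, and the receiver recomputes it to detect corruption:

```python
import zlib

def send_with_checksum(payload: bytes):
    """Sender computes a CRC32 checksum and transmits it with the data."""
    return payload, zlib.crc32(payload)

def verify(payload: bytes, checksum: int) -> bool:
    """Receiver recomputes the checksum to detect corruption in transit."""
    return zlib.crc32(payload) == checksum

data, crc = send_with_checksum(b"partial result from node 3")
assert verify(data, crc)            # intact data passes the check
corrupted = b"partial result from node 9"
assert not verify(corrupted, crc)   # a flipped byte is detected
```

A checksum only detects errors, prompting retransmission; an FEC scheme such as a Hamming or Reed-Solomon code adds enough redundancy to correct errors without resending.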
1.5 Fault-Tolerant Parallel Algorithms
Fault-tolerant parallel algorithms encompass a number of specialized methodologies, each offering a way to manage faults without compromising efficiency and performance (Kale & Krishnan, 2018). Three popular algorithms, the Master-Worker, Distributed Memory, and Parallel Task Queue algorithms, are discussed here in terms of how they manage fault conditions in parallel computing environments.
1.5.1 Master-Worker Algorithm
The Master-Worker algorithm is structured around a core coordinator called the "master," which
dispatches work to multiple subordinate processes known as "workers." A worker executes its
given task independently and returns with the outcome.
Figure 1.4 Master-Worker
1.5.1.1 How the Master-Worker Algorithm Handles Faults
Redundancy: to cope with worker failure, the master can assign jobs to redundant workers. When a worker crashes, the master transfers the task to an alternate so that computation is not delayed by significant pauses.
Checkpointing: the master holds intermediate outcomes and the state of active tasks. If a worker crashes, these checkpoints allow lost computation to resume from a saved state.
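A minimal Python simulation of the reassignment idea follows; the failure rate and retry limit are arbitrary choices for the illustration. The master retries a failed task on another worker instead of aborting the whole run:

```python
import random

def flaky_worker(task, fail_rate=0.3):
    """A simulated worker that sometimes crashes mid-task."""
    if random.random() < fail_rate:
        raise RuntimeError(f"worker died on task {task}")
    return task * task   # the worker's actual computation

def master(tasks, max_retries=20):
    """Dispatch each task; on a crash, reassign it to another worker."""
    results = {}
    for task in tasks:
        for _ in range(max_retries):
            try:
                results[task] = flaky_worker(task)
                break                 # task completed, move on
            except RuntimeError:
                continue              # reassign the task to a fresh worker
        else:
            raise RuntimeError(f"task {task} failed {max_retries} times")
    return results

random.seed(42)
print(master(range(5)))  # {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
```

Even with a 30% simulated failure rate, every task completes because the master treats a worker crash as a reason to redispatch, not to abort.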
1.5.2 Distributed Memory Algorithm
In Distributed Memory architecture, each node in a system possesses individual dedicated
memory, and nodes share data via a network. Such a type of architecture is used in clusters
consisting of separate physical machines, raising scalability and usage of resources.
Figure 1.5 Distributed Memory
1.5.2.1 Fault Handling by Distributed Memory Algorithms
Replication: data can be replicated on multiple nodes to prevent loss. If a node containing critical data shuts down, other nodes holding replicated data can still run processes, keeping the system intact.
Error Correction: Distributed Memory algorithms generally incorporate sophisticated
error-checking techniques that actively monitor communication between nodes, allowing
detection and correction of communication errors.
1.5.3 Parallel Task Queue Algorithm
The Parallel Task Queue algorithm employs a common task queue to control the workload
among concurrently processing units in an efficient manner. Tasks are queued and dynamically
allocated to ready processors depending on the current load of the processors, maximizing
resource utilization.
Figure 1.6 Parallel Task Queue
1.5.3.1 Fault Handling
Dynamic Reassignment: If the processing unit crashes during task execution, the system
can rerun the task on another available unit from the queue, allowing uninterrupted
processing without the need for user intervention.
Traceability: The task queue maintains records of task status, enabling attempts at
recovery and avoiding duplication or loss of tasks.
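The dynamic-reassignment idea can be sketched with Python's `queue` module; the crashing worker is simulated, and the function names are invented. A task whose processing unit fails is simply put back on the shared queue for the next available unit:

```python
import queue

def run_with_requeue(tasks, workers):
    """Dispatch tasks from a shared queue; a failed task is requeued."""
    q = queue.Queue()
    for t in tasks:
        q.put(t)
    results = {}
    dispatch = 0
    while not q.empty():
        task = q.get()
        worker = workers[dispatch % len(workers)]   # round-robin dispatch
        dispatch += 1
        try:
            results[task] = worker(task)
        except Exception:
            q.put(task)   # dynamic reassignment: requeue for another unit
    return results

def crashed(task):
    raise RuntimeError("unit failure")   # a unit that always crashes

results = run_with_requeue(range(5), [crashed, lambda t: t * t])
print(results)   # all five tasks complete despite one crashing unit
```

Because the queue itself records which tasks are still outstanding, nothing is lost or duplicated: a crashed dispatch leaves the task in the queue until some unit finishes it.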
1.6 Issues in Designing Fault-Tolerant Algorithms
Fault-tolerant parallel algorithm design is in itself challenging due to the numerous issues that
must be addressed to ensure performance and reliability. Scalability, overhead, and complexity
are some of the most significant such issues (Chen & Lee, 2017).
1.6.1 Scalability Issues
Ensuring that fault-tolerant algorithms scale well becomes more challenging as systems get larger and more complex. Scalability challenges include:
Resource Management: effective resource allocation and scheduling across nodes might result in bottlenecks, particularly in the event of a failure.
Task Redistribution: redistributing tasks to maintain high system performance in the event of node failures is difficult.
Low scalability results in poor performance, since systems take longer to recover from failure, causing increased downtime or degradation of service.
1.6.2 Overhead Issues
Enhanced memory and computational overhead are the costs of fault tolerant schemes:
Continuous checkpointing Because it takes a lot of time to store states regularly,
overhead will reduce overall performance.
The expenses associated with replicating data or operations have the potential to
considerably raise the amount of storage space and resources needed.
1.6.3 Complexity of Design
Designing fault-tolerant algorithms involves a number of additional considerations beyond the base technical complexity:
Algorithm Complexity: to prevent errors and inefficiencies, the complex interconnections between components must be carefully managed.
Error Management: incorporating robust error handling into algorithms necessitates rigorous testing and careful planning.
1.7 Applications of Fault-Tolerant Parallel Algorithms
Fault-tolerant parallel algorithms are critical in many sectors, ensuring the validity and efficiency
of operation despite the adversity of potential failure in computing systems. Below, we discuss
the main applications of such algorithms in high performance computing, cloud computing, and
real-time systems.
1.7.1 High-Performance Computing (HPC)
In the domain of high-performance computing, where big simulations and complex computations are routine, fault-tolerant parallel algorithms help researchers and scientists deliver results with minimal risk of data loss (Dongarra et al., 2018).
Scientific Simulations: HPC applications that compute large datasets, such as astrophysical
simulations, molecular dynamics, and climate modeling, frequently use a lot of computer power.
Monte Carlo Techniques: Monte Carlo techniques, which are widely used in statistical analysis
and computational finance, typically require extensive computational simulations. By using fault
tolerant techniques, researchers may make sure that simulations can recover from process
failures without compromising the integrity of the results.
1.7.2 Cloud Computing
Cloud adoption has increased the need for robust fault-tolerance mechanisms. Cloud services, built on diverse physical resources distributed over different locations, rely on undisturbed access and performance (Atzori et al., 2017).
Data Storage and Management: Without replication of data across nodes, cloud systems
would be prone to data loss. Replication protocols and other fault tolerant approaches
guarantee automatic data recovery in the event of failures in addition to granting access.
Cloud-based databases, which guarantee availability and consistency by actively
replicating data between servers, are a perfect example.
Microservices Architecture: Microservices in cloud applications today typically rely on
multiple components which need to collaborate and function in unison smoothly. Fault-
tolerant algorithms support dynamic task rebalancing and load distribution so that even
when a service crashes, others can take up the load, guaranteeing continuity of the
service.
1.7.3 Real-Time Systems
Real-time systems, such as those used in financial trading platforms or healthcare monitoring, require immediate responsiveness and reliability. Fault-tolerant algorithms prevent unacceptably large delays or losses from occurring when a component fails (Chen & Lee, 2017).
Healthcare Monitoring Systems: these track vital signs and other important metrics. When one sensor fails, fault-tolerant handling ensures that data continues to be collected and processed without interruption, protecting patients' health.
Financial Trading Platforms: in trading systems, speed and dependability are essential because delay can result in losses. Fault-tolerant parallel algorithms, using techniques such as error correction and automatic recovery, allow transaction processing to continue without interruption.
1.7.4 Internet of Things (IoT)
With IoT devices permeating all areas of our lives, incorporating fault-tolerant techniques into distributed sensor networks is imperative for fault-resistant data collection and processing, particularly in mission-critical systems such as smart cities and healthcare (Hwang & Xu, 2018).
1.7.5 Edge Computing
With edge computing on the rise, where computations are done nearer the data sources, fault-tolerant approaches are necessary to ensure continuity of service across distributed nodes with unreliable interconnections (Li & Liu, 2019).
References
Kale, L. V., & Krishnan, S. (2018). Fault-Tolerant Parallel Computing. In Encyclopedia of
Parallel Computing (pp. 643-654). Springer.
Gupta, S., & Trivedi, K. S. (2016). Checkpointing and recovery in distributed systems. Journal of Systems and Software, 117, 345-355.
Huang, Y., & Kintala, C. (2017). Replication and fault tolerance in distributed systems. IEEE
Transactions on Parallel and Distributed Systems, 28(10), 2611-2624.
Shooman, M. L. (2017). Redundancy and fault tolerance in computer systems. IEEE
Transactions on Reliability, 66(3), 531-543.
Lin, S., & Costello, D. J. (2017). Error control coding: Fundamentals and
applications. Prentice Hall.
Atzori, L., et al. (2017). The Internet of Things: A survey. Computer Networks, 121, 125-145.
Dongarra, J., et al. (2018). High-performance computing: Clusters, constellations, MPPs, and
more. In Encyclopedia of Parallel Computing (pp. 755-766). Springer.
Chen, P. M., & Lee, E. K. (2017). Fault tolerance in distributed systems. Journal of Systems and
Software, 131, 245-256.