Distributed Computing - CS3551 - Notes
UNIT I
INTRODUCTION
Computation initially took place on a single processor. This uni-processor computing can be termed centralized computing.
A distributed system is a collection of independent computers, interconnected via a
network, capable of collaborating on a task. Distributed computing is computing
performed in a distributed system.
As shown in Fig 1.1, each computer has a memory-processing unit, and the computers are
connected by a communication network. Each system connected to the distributed network
hosts distributed software, a middleware technology that drives the Distributed System (DS)
while preserving its heterogeneity. A computation or run in a distributed system is the
execution of processes to achieve a common goal.
The interaction of the layers of the network with the operating system and
middleware is shown in Fig 1.2. The middleware contains important library functions for
facilitating the operations of DS.
The distributed system uses a layered architecture to break down the complexity of system
design. The middleware is the distributed software that drives the distributed system, while
providing transparency of heterogeneity at the platform level.
1.3 Motivation
The following are the key points that act as a driving force behind DS:
• Buffered: The standard option copies the data from the user buffer to the kernel
buffer. The data later gets copied from the kernel buffer onto the network. For the
Receive primitive, the buffered option is usually required because the data may
already have arrived when the primitive is invoked, and needs a storage place in
the kernel.
• Unbuffered: The data gets copied directly from the user buffer onto the network.
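As a rough illustration of the difference, here is a minimal Python sketch in which an in-process deque stands in for the kernel buffer and for the network link; these names are hypothetical illustrations, not real kernel interfaces.

```python
# Illustrative sketch (not a real kernel): contrasts buffered and unbuffered send.
from collections import deque

network = deque()        # stand-in for the network link
kernel_buffer = deque()  # stand-in for the kernel-side buffer

def buffered_send(user_buffer):
    # Copy the user data into the kernel buffer first; the user buffer can be
    # reused immediately after this copy.
    kernel_buffer.append(bytes(user_buffer))
    # The kernel later moves the data from its buffer onto the network.
    network.append(kernel_buffer.popleft())

def unbuffered_send(user_buffer):
    # Data goes straight from the user buffer onto the network; the user
    # buffer must not be reused until the transmission completes.
    network.append(bytes(user_buffer))
```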
Blocking primitives
• The primitive commands wait for the message to be delivered. The execution of
the processes is blocked.
• The sending process must wait after a send until an acknowledgement is made
by the receiver.
• The receiving process must wait for the expected message from the sending
process.
• A primitive is blocking if control returns to the invoking process after the
processing for the primitive completes.
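A minimal sketch of blocking Send/Receive semantics, assuming in-process queues stand in for the data and acknowledgement channels; it is an illustration, not a real network implementation.

```python
# Blocking Send waits for an acknowledgement; blocking Receive waits for the data.
import queue, threading

data_channel = queue.Queue()  # carries messages sender -> receiver
ack_channel = queue.Queue()   # carries acknowledgements receiver -> sender

def blocking_send(msg):
    data_channel.put(msg)
    ack_channel.get()          # block until the receiver acknowledges

def blocking_receive():
    msg = data_channel.get()   # block until the expected message arrives
    ack_channel.put("ACK")     # acknowledge so the sender can proceed
    return msg

threading.Thread(target=lambda: print("got:", blocking_receive())).start()
blocking_send("hello")         # returns only after the ACK is received
```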
Non-blocking primitives
• If send is nonblocking, it returns control to the caller immediately, before the
message is sent.
• The advantage of this scheme is that the sending process can continue computing
in parallel with the message transmission, instead of having the CPU go idle.
• This is a form of asynchronous communication.
• A primitive is non-blocking if control returns back to the invoking process
immediately after invocation, even though the operation has not completed.
• For a non-blocking Send, control returns to the process even before the data
is copied out of the user buffer.
• For a non-blocking Receive, control returns to the process even before the data may have
arrived from the sender.
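A minimal sketch of a non-blocking Send, assuming a thread pool simulates the transmission and a Future plays the role of the returned handle; the names are illustrative only.

```python
# Non-blocking Send: control returns immediately with a handle (a Future here)
# that the caller can poll or wait on later; transmission is simulated.
from concurrent.futures import ThreadPoolExecutor
import time

executor = ThreadPoolExecutor(max_workers=1)

def transmit(msg):
    time.sleep(0.1)            # simulated network transmission
    return f"sent {len(msg)} bytes"

def nonblocking_send(msg):
    return executor.submit(transmit, msg)   # returns at once with a handle

handle = nonblocking_send(b"hello")
print(handle.done())           # likely False: the send is still in progress
print(handle.result())         # wait on the handle until the send completes
```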
Synchronous
• A Send or a Receive primitive is synchronous if both the Send() and Receive()
handshake with each other.
• The processing for the Send primitive completes only after the invoking
processor learns that the corresponding Receive primitive has also been
invoked and that the Receive operation has been completed.
• The processing for the Receive primitive completes when the data to be
received is copied into the receiver’s user buffer.
Asynchronous
• A Send primitive is said to be asynchronous if control returns to the
invoking process after the data item to be sent has been copied out of the user-
specified buffer.
• For non-blocking primitives, a return parameter on the primitive call returns a
system-generated handle which can be later used to check the status of
completion of the call.
• The process can check for completion in two ways:
o by checking whether the handle has been flagged or posted, or
o by issuing a Wait with a list of handles as parameters; the Wait usually
blocks until one of the parameter handles is posted.
The send and receive primitives can be implemented in four modes:
• Blocking synchronous
• Non-blocking synchronous
• Blocking asynchronous
• Non-blocking asynchronous
Blocking Receive:
• The Receive call blocks until the expected data arrives and is written into the
specified user buffer. Then control is returned to the user process.
Non-blocking Receive:
• The Receive call will cause the kernel to register the call and return the handle
of a location that the user process can later check for the completion of the
non-blocking Receive operation.
• This location gets posted by the kernel after the expected data arrives and is
copied to the user-specified buffer. The user process can check for the
completion of the non-blocking Receive by invoking the Wait operation on the
returned handle.
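A minimal sketch of a non-blocking Receive with a Wait on a list of handles, assuming a background thread pool plays the role of the kernel and Futures serve as the handles; the names and setup are illustrative only.

```python
# Non-blocking Receive: the call returns a handle immediately; a Wait over one
# or more handles blocks until a handle is posted (its data has arrived).
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
import queue

incoming = queue.Queue()                     # stand-in for the incoming channel
kernel = ThreadPoolExecutor(max_workers=2)   # stand-in for the kernel

def nonblocking_receive():
    # Returns at once; the handle is "posted" (completed) once data has
    # arrived and been copied into the user-visible result.
    return kernel.submit(incoming.get)

h1 = nonblocking_receive()
h2 = nonblocking_receive()
incoming.put(b"first message")               # data arrives later
incoming.put(b"second message")
done, _ = wait([h1, h2], return_when=FIRST_COMPLETED)
print(next(iter(done)).result())             # the posted handle holds the data
```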
Processor Synchrony
Processor synchrony indicates that all the processors execute in lock-step with their clocks
synchronized.
This ensures that no processor begins executing the next step of code until all the processors
have completed executing the previous steps of code assigned to each of them.
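A minimal sketch of this lock-step behaviour, assuming threads stand in for processors and a barrier enforces the step boundary.

```python
# Processor synchrony sketch: no thread (standing in for a processor) starts
# step k+1 until every thread has finished step k.
import threading

NUM_PROCESSORS = 3
barrier = threading.Barrier(NUM_PROCESSORS)

def processor(pid):
    for step in range(2):
        print(f"processor {pid} executing step {step}")
        barrier.wait()        # block here until all processors finish this step

threads = [threading.Thread(target=processor, args=(i,)) for i in range(NUM_PROCESSORS)]
for t in threads: t.start()
for t in threads: t.join()
```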
Asynchronous Execution:
A communication among processes is considered asynchronous, when every
communicating process can have a different observation of the order of the messages being
exchanged. In an asynchronous execution:
• there is no processor synchrony and there is no bound on the drift rate of processor
clocks
• message delays are finite but unbounded
• there is no upper bound on the time taken by a process to execute a step
Synchronous Execution:
A communication among processes is considered synchronous when every process
observes the same order of messages within the system. In a synchronous execution:
• processors are synchronized and the clock drift rate between any two processors is
bounded
• message delivery times are such that they occur in one logical step or round
• there is an upper bound on the time taken by a process to execute a step.
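To make the round structure concrete, here is a minimal single-threaded sketch of the synchronous (round-based) model; the Counter process and its behaviour are made-up illustrations, not part of these notes.

```python
# Synchronous execution sketch: every message sent in round r is delivered in
# round r, and no process starts round r+1 before all processes finish round r.
class Counter:
    def __init__(self, pid, peers):
        self.pid, self.peers, self.total = pid, peers, 0
    def send(self, r):
        # send the current round number to every peer
        return [(dest, r) for dest in self.peers]
    def receive(self, r, inbox):
        self.total += sum(inbox)

def run_rounds(processes, rounds):
    for r in range(rounds):
        outbox = [m for p in processes for m in p.send(r)]          # send phase
        for p in processes:                                          # delivery phase
            p.receive(r, [msg for dest, msg in outbox if dest == p.pid])

procs = [Counter(0, [1]), Counter(1, [0])]
run_rounds(procs, 3)
print([p.total for p in procs])   # [3, 3]: each process received 0 + 1 + 2
```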
➢ Processes: The main challenges involved are: process and thread management at
both client and server environments, migration of code between systems, design of software
and mobile agents.
➢ Naming: Devising easy to use and robust schemes for names, identifiers, and
addresses is essential for locating resources and processes in a transparent and scalable
manner. The remote and highly varied geographical locations make this task difficult.
➢ Synchronization: Mutual exclusion, leader election, deploying physical clocks,
global state recording are some synchronization mechanisms.
➢ Data storage and access schemes: Designing file systems for easy and efficient data
storage with implicit accessing mechanisms is essential for distributed operation.
➢ Consistency and replication: Distributed systems go hand in hand with replication of
data to provide a high degree of scalability. The replicas should be handled with care, since
data consistency is a prime issue.
➢ Fault tolerance: This requires maintenance of fail proof links, nodes, and processes.
Some of the common fault tolerant techniques are resilience, reliable communication,
distributed commit, checkpointing and recovery, agreement and consensus, failure detection,
and self-stabilization.
➢ Security: Cryptography, secure channels, access control, key management –
generation and distribution, authorization, and secure group management are some of the
security measures imposed on distributed systems.
➢ Application Programming Interface (API) and transparency: User friendliness
and ease of use are very important for distributed services to be adopted by a wide
community. Transparency, which means hiding the inner implementation policy from users, is
of the following types:
▪ Migration transparency: allows relocating resources without changing names.
▪ Replication transparency: Makes the user unaware whether he is working on
original or replicated data.
▪ Concurrency transparency: Masks the concurrent use of shared resources for the
user.
▪ Failure transparency: Refers to the system being reliable and fault-tolerant.
➢ Scalability and modularity: The algorithms, data and services must be as distributed
as possible. Various techniques such as replication, caching and cache management, and
asynchronous processing help to achieve scalability.
1.7.2 Algorithmic challenges in distributed computing
➢ Designing useful execution models and frameworks
The interleaving model, partial order model, input/output automata model and the Temporal
Logic of Actions (TLA) are some examples of models that provide different degrees of
infrastructure.
➢ Dynamic distributed graph algorithms and distributed routing algorithms
• The distributed system is generally modeled as a distributed graph.
• Hence graph algorithms form the basis for a large number of higher-level
communication, data dissemination, object location, and object search functions.
• These algorithms must have the capacity to deal with highly dynamic graph
characteristics. They are expected to function like routing algorithms.
• The performance of these algorithms has a direct impact on user-perceived latency, data
traffic, and load in the network.
➢ Time and global state in a distributed system
✓ Termination detection: cooperation among the processes to detect the specific global
state of quiescence.
✓ Garbage collection: Detecting garbage requires coordination among the processes.
➢ Group communication, multicast, and ordered message delivery
• A group is a collection of processes that share a common context and collaborate on a
common task within an application domain. Group management protocols are needed for
group communication wherein processes can join and leave groups dynamically, or fail.
➢ Monitoring distributed events and predicates
• Predicates defined on program variables that are local to different processes are used
for specifying conditions on the global system state.
• On-line algorithms for monitoring such predicates are hence important.
• The specification of such predicates uses physical or logical time relationships.
➢ Distributed program design and verification tools
Methodically designed and verifiably correct programs can greatly reduce the overhead of
software design, debugging, and engineering. Designing these is a big challenge.
➢ Debugging distributed programs
Debugging distributed programs is much harder because of concurrency and replication.
Adequate debugging mechanisms and tools are the need of the hour.
➢ Data replication, consistency models, and caching
• Fast access to data and other resources is important in distributed systems.
Managing replicas and their updates faces concurrency problems.
• Placement of the replicas in the system is also a challenge because resources
usually cannot be freely replicated.
➢ World Wide Web design – caching, searching, scheduling
• WWW is a commonly known distributed system.
• The issues of object replication, caching, and prefetching of objects arise on the
WWW as well.
• Object search and navigation on the web are important functions in the operation of
the web.
➢ Distributed shared memory abstraction
• A shared memory abstraction is easier to program with, since it does not involve
explicitly managing the communication tasks.
• The communication is done by the middleware by message passing.
• The overhead of shared memory is to be dealt with by the middleware technology.
• Some of the methodologies that do the task of communication in shared memory
distributed systems are:
✓ Wait-free algorithms: A wait-free algorithm guarantees that a process can complete
its execution irrespective of the actions of other processes. Such algorithms control access to
shared resources in the shared memory abstraction, but they are expensive.
✓ Mutual exclusion: Concurrent access of processes to a shared resource or data is
executed in mutually exclusive manner. Only one process is allowed to execute the critical
section at any given time. In a distributed system, shared variables or a local kernel cannot
be used to implement mutual exclusion. Message passing is the sole means for implementing
distributed mutual exclusion.
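As a rough illustration of message-passing mutual exclusion, here is a simplified token-ring sketch in which a single token circulates on queues that stand in for channels; it is not one of the classical algorithms and it ignores failures.

```python
# Token-ring mutual exclusion sketch: only the process currently holding the
# token may enter its critical section; the token is passed as a message.
import threading, queue

N = 3
channels = [queue.Queue() for _ in range(N)]   # channels[i] delivers to process i

def process(pid, wants_cs):
    token = channels[pid].get()                # wait for the token message
    if wants_cs:
        print(f"process {pid} in critical section")   # exclusive access
    channels[(pid + 1) % N].put(token)         # pass the token to the successor

threads = [threading.Thread(target=process, args=(i, i % 2 == 0)) for i in range(N)]
for t in threads: t.start()
channels[0].put("TOKEN")                       # inject the token at process 0
for t in threads: t.join()
```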
✓ Consensus algorithms: Consensus algorithms allow correctly functioning processes
to reach agreement among themselves in spite of the existence of malicious processes. The
goal of the malicious processes is to prevent the correctly functioning processes from
reaching agreement. The malicious processes operate by sending messages with misleading
information, to confuse the correctly functioning processes.
✓ Replication and replica management: The Triple Modular Redundancy (TMR)
technique is used in software and hardware implementation. TMR is a fault-tolerant form of
N-modular redundancy, in which three systems perform a process and that result is
processed by a majority-voting system to produce a single output.
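A minimal sketch of the TMR voting step, assuming three hypothetical replica functions, one of which is faulty; the names are illustrative only.

```python
# TMR sketch: three replicas compute the same function and a majority vote
# masks a single faulty result.
from collections import Counter

def tmr(replicas, x):
    results = [f(x) for f in replicas]              # three independent computations
    value, votes = Counter(results).most_common(1)[0]
    if votes >= 2:
        return value                                # majority output
    raise RuntimeError("no majority: more than one replica failed")

square = lambda x: x * x
faulty = lambda x: x * x + 1                        # one faulty replica
print(tmr([square, square, faulty], 4))             # 16: the fault is masked
```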
✓ Voting and quorum systems: Providing redundancy in the active or passive
components in the system and then performing voting based on some quorum criterion is a
classical way of dealing with fault-tolerance. Designing efficient algorithms for this
purpose is the challenge.
✓ Distributed databases and distributed commit: The distributed databases should
also follow atomicity, consistency, isolation and durability (ACID) properties.
✓ Self-stabilizing systems: A self-stabilizing algorithm guarantees to take the system to
a good state even if a bad state arises due to some error. Self-stabilizing algorithms
require some built-in redundancy to track additional variables of the state and do extra work.
✓ Checkpointing and recovery algorithms: Checkpointing is the periodic recording of
the current state on secondary storage so that, in case of a failure, the entire computation is
not lost but can be recovered from one of the recently taken checkpoints. Checkpointing in a
distributed environment is difficult because, if the checkpoints at the different processes are
not coordinated, the local checkpoints may become useless, being inconsistent with
the checkpoints at other processes.
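A minimal sketch of local checkpointing and recovery for a single process, assuming a pickle file on secondary storage; the cross-process coordination discussed above is not shown, and the file name is a made-up example.

```python
# Local checkpointing sketch: the process state is periodically written to a
# file so the computation can restart from the latest checkpoint.
import pickle, os

CHECKPOINT_FILE = "checkpoint.pkl"             # hypothetical file name

def save_checkpoint(state):
    with open(CHECKPOINT_FILE, "wb") as f:
        pickle.dump(state, f)                  # record the current local state

def recover():
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE, "rb") as f:
            return pickle.load(f)              # resume from the last checkpoint
    return {"step": 0, "partial_sum": 0}       # otherwise start fresh

state = recover()
for step in range(state["step"], 10):
    state["partial_sum"] += step
    state["step"] = step + 1
    if step % 3 == 0:
        save_checkpoint(state)                 # periodic checkpoint
print(state["partial_sum"])
```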
✓ Failure detectors: Asynchronous distributed systems do not have a bound on the message
transmission time. This makes message passing very difficult, since the receiver does not
know how long to wait. Failure detectors probabilistically suspect another process as having
failed and then converge on a determination of the up/down status of the suspected process.
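A minimal sketch of a timeout-based failure detector, assuming a heartbeat scheme and an arbitrarily chosen timeout; the suspicion is only probabilistic and may later be revised.

```python
# Heartbeat failure detector sketch: a process is suspected if no heartbeat has
# been received from it within the chosen timeout.
import time

TIMEOUT = 2.0                                  # assumed suspicion threshold (seconds)
last_heartbeat = {}                            # pid -> time of last heartbeat

def on_heartbeat(pid):
    last_heartbeat[pid] = time.monotonic()     # refresh on every heartbeat

def suspected(pid):
    seen = last_heartbeat.get(pid)
    return seen is None or time.monotonic() - seen > TIMEOUT

on_heartbeat("P1")
print(suspected("P1"))                         # False: heard from P1 recently
print(suspected("P2"))                         # True: never heard from P2
```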
➢ Load balancing
The objective of load balancing is to gain higher throughput and reduce user-perceived
latency. Load balancing may be necessary because of a variety of factors, such as high
network traffic or a high request rate causing the network connection to be a
bottleneck, or high computational load. The following are some forms of load balancing:
✓ Data migration: The ability to move data around in the system, based on the access
pattern of the users.
✓ Computation migration: The ability to relocate processes in order to perform
a redistribution of the workload.
✓ Distributed scheduling: This achieves a better turnaround time for the users by
using idle processing power in the system more efficiently.
➢ Real-time scheduling
Real-time scheduling becomes more challenging when a global view of the system state is
absent with more frequent on-line or dynamic changes. The message propagation delays,
which are network-dependent, are hard to control or predict. This is a hindrance to meeting
the QoS requirements of the network.
➢ Performance
User-perceived latency in distributed systems must be reduced. The common issues in
performance are:
✓ Metrics: Appropriate metrics must be defined for measuring the performance of
theoretical distributed algorithms and their implementations.
✓ Measurement methods/tools: Since a distributed system is a complex entity,
appropriate methodologies and tools must be developed for measuring the performance
metrics.
➢ Sensor networks
o A sensor is a processor with an electro-mechanical interface that is capable of
sensing physical parameters.
o They are low-cost devices with limited computational power and battery
life. They are designed to handle streaming data and route it to external
computer networks and processes.
o They are susceptible to faults and have to reconfigure themselves.
o These features introduce a whole new set of challenges, such as position
estimation and time estimation, when designing a distributed system.
➢ Ubiquitous or pervasive computing
o In Ubiquitous systems the processors are embedded in the environment to
perform application functions in the background.
o Examples: Intelligent devices, smart homes etc.
o They are distributed systems with recent advancements operating in wireless
environments through actuator mechanisms.
o They can be self-organizing and network-centric with limited resources.
➢ Peer-to-peer computing
o Peer-to-peer (P2P) computing is computing over an application-layer
network where all interactions among the processors are at the same level.
o This is a form of symmetric computation, in contrast to the client-server paradigm.
o They are self-organizing with or without regular structure to the network.
o Some of the key challenges include: object storage mechanisms; efficient object lookup and
retrieval in a scalable manner; dynamic reconfiguration with nodes as well as objects joining
and leaving the network randomly; replication strategies to expedite object search; tradeoffs
between object size, latency, and table sizes; and anonymity, privacy, and security.
• An internal event changes the state of the process at which it occurs. A send event
changes the state of the process that sends the message and the state of the channel
on which the message is sent.
• A receive event changes the state of the process that receives the message and the
state of the channel on which the message is received.
Causal Precedence Relations
Causal message ordering is a partial ordering of messages in a distributed computing
environment. It is the delivery of messages to a process in the order in which they were
transmitted to that process.
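Assuming the standard happened-before definition due to Lamport, where e_i^x denotes the x-th event at process p_i, the causal precedence conditions referred to below can be sketched as:

```latex
% Sketch of the causal precedence (happened-before) relation, following
% Lamport's standard definition; e_i^x is the x-th event at process p_i.
e_i^x \rightarrow e_j^y \quad \text{holds if at least one of the following is true:}
\begin{cases}
  i = j \ \text{and}\ x < y & \text{(same process, earlier event)}\\
  e_i^x = \mathrm{send}(m) \ \text{and}\ e_j^y = \mathrm{rec}(m) & \text{(send of $m$ precedes its receipt)}\\
  \exists\, e_k^z \ \text{such that}\ e_i^x \rightarrow e_k^z \ \text{and}\ e_k^z \rightarrow e_j^y & \text{(transitivity)}
\end{cases}
```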
When any of the above conditions is satisfied, a→b holds and the two events are causally
related. Consider two events c and d: if neither c→d nor d→c holds, then c and d are not
causally related, and they are said to be concurrent events, denoted c||d.
A system that supports the causal ordering model satisfies the following property:
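A sketch of this causal ordering (CO) property in the commonly used notation, where m^{ij} denotes a message sent from process p_i to process p_j and send() and rec() denote its send and receive events (the notation is assumed, not taken from these notes):

```latex
% Causal ordering (CO) property, for any two messages m^{ij} and m^{kj}
% sent to the same destination process p_j:
\mathrm{send}(m^{ij}) \rightarrow \mathrm{send}(m^{kj})
\;\Longrightarrow\;
\mathrm{rec}(m^{ij}) \rightarrow \mathrm{rec}(m^{kj})
```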
GLOBAL STATE
A distributed snapshot represents a state in which the distributed system might have been. A snapshot
of the system is a single configuration of the system.
• The global state of a distributed system is a collection of the local states of its components, namely
the processes and the communication channels.
• The state of a process at any time is defined by the contents of processor registers, stacks, local
memory, etc., and depends on the local context of the distributed application.
• The state of a channel is given by the set of messages in transit in the channel.
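In the usual notation, with LS_i the local state of process p_i and SC_{ij} the state (messages in transit) of the channel from p_i to p_j, the global state can be sketched as:

```latex
% Global state as the collection of local process states and channel states.
GS = \Bigl\{\, \bigcup_{i} LS_i,\ \bigcup_{i,j} SC_{ij} \,\Bigr\}
```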