Sub: Distributed Computing.
Unit 6:
1. Explain the following in brief: i) Wearable devices, ii) PVM iii) JINI
i) Wearable Devices:
Definition: Wearable devices are electronic gadgets that can be worn on the body,
typically incorporating sensors and connectivity features.
Purpose: They collect data, track activities, and often sync with other devices to
share or process data.
Examples: Smartwatches (e.g., Apple Watch), fitness trackers (e.g., Fitbit), and smart
glasses (e.g., Google Glass).
Applications: Health monitoring (heart rate, steps), augmented reality, hands-free
operations, and real-time notifications.
Relevance in Distributed Systems: Wearable devices act as nodes within a
distributed network, collecting and transmitting data to central servers or cloud
systems for processing.
ii) PVM (Parallel Virtual Machine):
Definition: PVM is a software package that allows a heterogeneous network of
computers to be used as a single distributed parallel processor.
Purpose: To enable parallel computing across multiple systems, making them work
together as a unified processing unit.
Components: PVM daemon (pvmd), host file, and application programs.
Features:
o Supports message passing between nodes.
o Dynamic configuration: Machines can be added or removed at runtime.
o Fault tolerance and heterogeneous architecture support.
Applications: Scientific computing, simulations, and parallel data processing.
Relevance in Distributed Systems: PVM facilitates distributed computing by allowing
separate machines to work collaboratively, forming a virtual supercomputer.
iii) JINI (Java Intelligent Network Infrastructure):
Definition: JINI is a network architecture developed by Sun Microsystems, designed
to simplify the construction of distributed systems.
Purpose: To allow devices and services to discover, join, and communicate in a
dynamic, flexible network environment.
Components:
o Lookup Service: Acts as a directory for services.
o Discovery Protocol: Helps devices and services find each other.
o Leasing: Temporary resource allocation.
Features:
o Plug-and-play capability: Devices can dynamically join or leave.
o Java-based: Uses RMI (Remote Method Invocation) for communication.
Applications: Smart home systems, distributed device control, and service-oriented
architecture (SOA).
Relevance in Distributed Systems: JINI supports building adaptive and flexible
distributed systems through its service-oriented approach.
2. How do wearable devices work in distributed systems? Discuss the problems involved with wearable computing.
How Wearable Devices Work in Distributed Systems:
1. Data Collection: Wearable devices (like smartwatches or fitness trackers)
continuously collect data using embedded sensors (e.g., accelerometers, heart rate
monitors).
2. Data Transmission: The collected data is transmitted to a central server or cloud via
wireless communication (e.g., Bluetooth, Wi-Fi, or cellular networks).
3. Data Processing: Central servers process the data to generate insights (e.g., activity
tracking, health monitoring).
4. Feedback and Synchronization: Processed data is sent back to the wearable device
or other connected devices (like smartphones).
5. Distributed Coordination: Multiple wearable devices can communicate with each
other or with centralized systems, forming a network of interconnected devices.
Problems Involved with Wearable Computing:
1. Data Privacy and Security:
o Wearables collect sensitive data (e.g., health information) that can be
vulnerable to cyberattacks.
o Challenge: Ensuring secure transmission and storage of data.
2. Battery Life:
o Constant data collection and transmission drain the battery quickly.
o Challenge: Balancing functionality with low power consumption.
3. Connectivity Issues:
o Wearables often rely on Bluetooth or Wi-Fi, which may experience
interruptions or limited range.
o Challenge: Maintaining stable communication between devices.
4. Data Integration:
o Integrating data from multiple wearables and other devices can be complex.
o Challenge: Standardizing data formats and ensuring compatibility.
5. Limited Processing Power:
o Wearable devices have limited CPU and memory, affecting real-time
processing.
o Challenge: Offloading heavy computations to servers while minimizing
latency.
6. User Comfort and Usability:
o Devices should be lightweight, comfortable, and non-intrusive while being
functional.
o Challenge: Balancing ergonomics with technological capabilities.
3. Explain with an application (e.g., Travel Booking Service) various key components of
Service Oriented Architecture.
Service-Oriented Architecture (SOA)
SOA is an architectural model where services are provided to other components
through a communication protocol over a network. It enables interoperability and
loose coupling between services.
Key Components of SOA (with Travel Booking Service Example):
1. Service Provider:
o Publishes services to be consumed.
o Example: An airline company providing flight booking APIs.
2. Service Consumer:
o Uses services offered by the provider.
o Example: A travel booking website that accesses flight, hotel, and car rental
services.
3. Service Registry:
o Directory for discovering available services.
o Example: A central registry where booking services are listed.
4. Service Contract:
o Defines the rules and data format for using a service.
o Example: API documentation specifying booking details and data format.
5. Service Interface:
o The endpoint through which the service is accessed.
o Example: A REST API endpoint for booking a flight.
6. Service Implementation:
o The actual business logic behind the service.
o Example: Code that processes flight booking requests and updates the
database.
7. Service Bus (ESB):
o Middleware for message routing and integration between services.
o Example: Coordinates flight booking, payment, and confirmation services.
8. Service Orchestration:
o Combines multiple services into a composite application.
o Example: Booking a travel package including flight, hotel, and car rental.
9. Service Data:
o Information exchanged between services.
o Example: Passenger details and booking confirmation.
SOA in Action (Travel Booking Scenario):
The Travel Booking Application (Consumer) searches the Service Registry to find the
Flight Booking Service (Provider).
It retrieves the Service Contract to understand the API structure.
The application sends a booking request via the Service Interface.
The Service Implementation processes the request and stores the data.
The Service Bus (ESB) handles message routing between booking, payment, and
confirmation services.
Finally, the Service Orchestration ensures that flight, hotel, and car booking are
coordinated as a single package.
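To make the consumer-provider interaction concrete, here is a minimal Python sketch of the travel site calling a flight booking service over REST. The endpoint URL, payload fields, and response shape are hypothetical, invented for illustration only.

```python
import requests  # third-party HTTP client (pip install requests)

# Hypothetical flight-booking endpoint exposed by the Service Provider.
BOOKING_URL = "https://api.example-airline.com/v1/bookings"

def book_flight(passenger, flight_no, date):
    """Send a booking request through the Service Interface (REST API)."""
    payload = {"passenger": passenger, "flight": flight_no, "date": date}
    resp = requests.post(BOOKING_URL, json=payload, timeout=10)
    resp.raise_for_status()  # surface HTTP-level failures to the consumer
    return resp.json()       # e.g. {"bookingId": "...", "status": "CONFIRMED"}

# booking = book_flight("A. Traveler", "AI-202", "2025-01-15")
```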
Benefits:
Loose Coupling: Services operate independently.
Reusability: Services can be reused in different applications.
Interoperability: Different platforms can integrate seamlessly.
Scalability: Easy to add or update services.
4. Compare the following tools for Distributed System Monitoring : Prometheus, Zabbix,
Nagios.
| Criteria | Prometheus | Zabbix | Nagios |
|---|---|---|---|
| Purpose | Real-time monitoring and alerting | Enterprise-grade monitoring and alerting | Infrastructure and network monitoring |
| Data Collection | Pull-based; collects metrics via exporters | Agent-based; collects from various sources | Agent-based; collects logs and metrics |
| Data Storage | Time-series database (TSDB); stores metrics efficiently | Relational database (MySQL, PostgreSQL) | Flat files; uses RRD for performance data |
| Alerting | Integrated alert manager; rule-based alerts | Built-in alerting with flexible notifications | Customizable alerting with plugins |
| Visualization | Integrates with Grafana for dashboards | Built-in dashboard and visualization tools | Basic visualization; integrates with Grafana |
| Scalability | Highly scalable; supports multi-dimensional data | Scalable, but can get complex in large environments | Scales well with plugins and extensions |
| Community Support | Strong; maintained by CNCF, widely used in DevOps | Active community and commercial support available | Long-standing, large user base; community-driven |
| Best for | Cloud-native environments, container monitoring | Complex IT infrastructures, large networks | Traditional IT infrastructure and server monitoring |
| Weaknesses | Limited log monitoring; lacks a built-in dashboard | Steeper learning curve; requires extensive setup | Plugin dependency for advanced features |
| Example | Kubernetes monitoring | Network monitoring | Server monitoring |
5. Explain in brief the following microkernels: i) Mach ii) CHORUS. How are memory management techniques used to avoid physical copying of data in Mach and CHORUS?
Microkernels: Mach and CHORUS:
i) Mach Microkernel
Developed at Carnegie Mellon University.
Focuses on minimal kernel functions: interprocess communication (IPC), basic
scheduling, and memory management.
Most services like device drivers, file systems run in user space (outside kernel),
making it modular and extensible.
ii) CHORUS Microkernel
Developed in the 1980s for distributed and real-time systems.
Designed to support distributed OS services and real-time operations.
Implements message-passing for communication between distributed components,
with emphasis on flexibility and scalability.
Memory Management Techniques to Avoid Physical Copying of Data
Both Mach and CHORUS use techniques to avoid expensive physical copying of data
during interprocess communication:
Mach:
o Uses copy-on-write and virtual memory techniques.
o Data is shared via pointers to the same physical memory pages. Actual
copying happens only if one process modifies the data (copy-on-write).
o This avoids copying data between processes during IPC, improving efficiency.
CHORUS:
o Uses zero-copy message passing where messages refer to memory locations
instead of copying data.
o It passes memory descriptors or pointers, and physical copying is deferred or
avoided.
o This improves performance, especially in distributed environments.
6. How does SOA differ from traditional software architecture?
SOA builds applications from small, independent services that communicate over a network.
Traditional (monolithic) software consists of tightly coupled components that depend directly on one another.
As a result, SOA is more flexible and easier to change, scale, and reuse than traditional software.
Unit 1:
1. What is NTP? With the help of a diagram, describe how NTP works. [9]
What is NTP?
NTP (Network Time Protocol) is a protocol used to synchronize the clocks of
computers over a network. It ensures that all devices on a network agree on the
exact time, which is crucial for logging events, coordinating tasks, and security.
NTP was developed by David Mills in 1981 at the University of Delaware.
Working of NTP:
NTP is an application-layer protocol that uses a hierarchical system of time sources organized into strata. At the top, stratum 0 consists of highly accurate reference clocks, e.g. atomic or GPS clocks. These feed the stratum 1 NTP servers, which in turn serve stratum 2, stratum 3, and so on. Each level provides accurate date and time downward, so that communicating hosts stay synchronized with one another.
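At its core, synchronization relies on four timestamps exchanged between a client and a server. A minimal sketch of the standard NTP offset and round-trip delay calculation:

```python
def ntp_offset_delay(t1, t2, t3, t4):
    """Standard NTP calculation from the four timestamps:
    t1 = client send time, t2 = server receive time,
    t3 = server send time, t4 = client receive time."""
    offset = ((t2 - t1) + (t3 - t4)) / 2  # estimated clock offset
    delay = (t4 - t1) - (t3 - t2)         # round-trip network delay
    return offset, delay

# Client clock 5 ms behind the server, 2 ms transit each way:
# ntp_offset_delay(10.000, 10.007, 10.008, 10.005) -> (0.005, 0.004)
```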
Architecture of Network Time Protocol: (diagram: stratum 0 reference clocks at the top, with stratum 1, 2, 3, ... servers layered below)
Features of NTP:
NTP servers have access to highly precise atomic and GPS clocks.
It uses Coordinated Universal Time (UTC) to synchronize CPU clock time.
It is designed to tolerate variable network delay and to reduce vulnerabilities in the exchange of time information.
Provides consistent timekeeping for file servers.
Applications of NTP:
Used in production systems where live sound is recorded.
Used in the development of broadcasting infrastructure.
2. Why is the Berkeley algorithm used? Describe how it works using pseudocode.
Why is the Berkeley Algorithm Used?
The Berkeley Algorithm is used to synchronize clocks in a distributed system where
there is no external accurate time source (like GPS or atomic clocks). It is especially
useful when no single machine has the correct time, so all clocks adjust to a
common time agreed upon by the system.
How Berkeley Algorithm Works
One machine is chosen as the master (time server).
The master polls other machines (slaves) to get their clock times.
It calculates an average time (excluding any clocks that are too far off).
Then, it sends adjustment values to each machine so they can update their clocks to
this average time.
Pseudocode for Berkeley Algorithm:
1. Initialize:
- Each node has its own local clock time.
- Designate one node as the coordinator.
2. Coordinator collects time:
- Coordinator sends a request to all nodes asking for their local clock times.
- Each node responds with its current clock time.
3. Calculate average time:
- Coordinator calculates the average time:
Average Time = (Sum of all clock times + Coordinator's clock time) / (Number of
nodes + 1)
4. Calculate offsets:
- Coordinator computes the offset for each node:
Offset = Average Time - Node's Clock Time
5. Send adjustments:
- Coordinator sends the calculated offset to each node.
6. Adjust clocks:
- Each node adjusts its clock by adding the received offset.
7. Synchronization complete:
- All nodes are now synchronized to the average time.
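A minimal runnable Python sketch of the same steps, assuming the coordinator has already polled each node's clock reading (network communication omitted):

```python
def berkeley_sync(coordinator_time, node_times, max_skew=10.0):
    """Return the offset each clock (coordinator first) must apply."""
    clocks = [coordinator_time] + node_times
    # Exclude clocks too far from the coordinator (likely faulty readings).
    usable = [t for t in clocks if abs(t - coordinator_time) <= max_skew]
    average = sum(usable) / len(usable)
    # Offset = average time - node's clock time, sent to every node.
    return [average - t for t in clocks]

offsets = berkeley_sync(coordinator_time=100.0, node_times=[98.0, 103.0])
# average ≈ 100.33 -> offsets ≈ [+0.33, +2.33, -2.67]
```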
3. What is MPI? Describe the point-to-point communication in MPI with suitable
diagram. [9] Explain in brief, the following send operations in MPI : i) MPI_Ssend
ii) MPI_Bsend iii) MPI_Rsend iv) MPI_Isend
What is MPI?
MPI (Message Passing Interface) is a standardized and portable library used for
communication between processes in parallel computing. It allows multiple
processes running on one or more computers to exchange data efficiently.
Point-to-Point Communication in MPI
Point-to-point communication is the simplest form of communication in MPI, where
data is sent from one process to another.
Sender process sends a message using a send operation.
Receiver process receives the message using a receive operation.
Process A sends a message to Process B.
Process B waits and receives the message from Process A.
Basic MPI Functions:
MPI_Send() — Blocking send
MPI_Recv() — Blocking receive
MPI Send Operations Explained:
| Operation | Description |
|---|---|
| MPI_Ssend | Synchronous send. The sender waits until the receiver has started receiving before continuing. |
| MPI_Bsend | Buffered send. The message is copied into a buffer, so the sender can continue immediately. |
| MPI_Rsend | Ready send. The sender assumes a matching receive has already been posted at the receiver. |
| MPI_Isend | Non-blocking send. The send operation starts, and the sender continues without waiting for completion. |
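A minimal runnable point-to-point example, assuming the mpi4py package is installed on top of an MPI library (run with, e.g., mpirun -n 2 python point_to_point.py):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Sender process: blocking send of a Python object (MPI_Send underneath)
    comm.send({"reading": 42}, dest=1, tag=11)
elif rank == 1:
    # Receiver process: blocking receive (MPI_Recv underneath)
    data = comm.recv(source=0, tag=11)
    print("Process 1 received:", data)
```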
4. What is the need of mutual exclusion? Explain the permission-based centralized
distributed mutual exclusion algorithm.
What is the Need for Mutual Exclusion?
In distributed systems, multiple processes may want to access shared resources (like
files, databases, printers) at the same time.
To avoid conflicts, data inconsistency, or corruption, only one process should access
the shared resource at a time.
Mutual Exclusion ensures only one process enters the critical section (CS) at any
moment, preventing race conditions and ensuring data integrity.
Permission-Based Centralized Distributed Mutual Exclusion Algorithm
This algorithm uses a central coordinator (server) to control access to the critical
section.
Processes send requests to the coordinator to enter the critical section.
The coordinator grants permission (token) to one process at a time.
The process with permission enters the critical section.
After completing, the process releases the permission back to the coordinator.
Coordinator then grants permission to the next requesting process.
Steps of the Algorithm:
1. A process sends a request message to the coordinator when it wants to enter the
critical section.
2. The coordinator maintains a queue of requests.
3. If the coordinator is not currently granting permission, it immediately grants
permission to the requesting process.
4. If a permission is already granted to another process, the new request is queued.
5. After a process finishes in the critical section, it sends a release message to the
coordinator.
6. The coordinator then grants permission to the next process in the queue.
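The coordinator's core logic fits in a few lines. Below is a sketch keeping a FIFO queue and granting one permission at a time (message transport omitted; class and method names are illustrative):

```python
from collections import deque

class Coordinator:
    """Centralized mutual-exclusion coordinator (transport omitted)."""
    def __init__(self):
        self.queue = deque()   # waiting processes, FIFO order
        self.holder = None     # process currently in the critical section

    def request(self, pid):
        if self.holder is None:
            self.holder = pid
            return "GRANTED"           # reply immediately
        self.queue.append(pid)         # otherwise queue the request
        return "QUEUED"

    def release(self, pid):
        assert pid == self.holder
        # Grant permission to the next waiting process, if any.
        self.holder = self.queue.popleft() if self.queue else None
        return self.holder
```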
Advantages:
Simple to implement.
Only one coordinator handles requests, reducing complexity.
Disadvantages:
Coordinator is a single point of failure.
If coordinator crashes, mutual exclusion fails.
Can become a bottleneck when many requests arrive.
5. Explain external data representation (XDR), marshalling and unmarshalling. Why
is XDR required? Discuss in brief the three alternative approaches to external data
representation.
External Data Representation (XDR), Marshalling, and Unmarshalling
1. External Data Representation (XDR)
XDR is a standard way to encode data so that it can be understood across different
computer systems.
It converts data into a common format for transmission between machines with
different architectures (e.g., different byte order, data size).
Ensures platform-independent data exchange in distributed systems.
2. Marshalling
The process of packing data into a standardized format before transmission.
Converts complex data structures into a byte stream for sending over the network.
Includes encoding data types, sizes, and values.
3. Unmarshalling
The reverse process of marshalling.
The receiver unpacks the byte stream back into the original data structures.
Decodes the data to a usable form on the receiving machine.
Why is XDR Required?
Different computers use different data formats, byte ordering, and sizes.
Without a common representation, data sent from one machine may be
misinterpreted by another.
XDR ensures data portability and interoperability between heterogeneous systems in
distributed computing.
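A small sketch of marshalling and unmarshalling in the XDR spirit, using Python's struct module with network (big-endian) byte order so that both machines agree on the representation:

```python
import struct

def marshal(user_id, temperature):
    # '!' = network (big-endian) byte order, 'I' = 4-byte unsigned int,
    # 'f' = 4-byte IEEE float: a fixed, platform-independent layout.
    return struct.pack("!If", user_id, temperature)

def unmarshal(data):
    return struct.unpack("!If", data)

wire = marshal(7, 36.5)          # bytes safe to send over the network
uid, temp = unmarshal(wire)      # receiver decodes back to native types
```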
Three Alternative Approaches to External Data Representation
| Approach | Description | Example |
|---|---|---|
| 1. Standardized Data Formats | Use a fixed, agreed-upon data format for all systems. | XDR, ASN.1 |
| 2. Self-Describing Data | Data includes metadata describing its format. | XML, JSON |
| 3. Machine-Specific Formats | Each system uses its own format, and translators convert data during communication. | Custom converters |
6. Why do we need an election algorithm? Using appropriate diagrams, describe the bully and ring algorithms step-by-step.
Why Do We Need an Election Algorithm?
In a distributed system, many processes run on different machines (nodes).
Sometimes, one process must act as a coordinator or leader to manage tasks like
resource allocation or synchronization.
If the current coordinator fails or leaves, the system needs a way to elect a new
coordinator automatically.
An election algorithm helps nodes decide which process will become the new
coordinator without confusion or conflicts.
Bully Algorithm
How it Works:
Every process has a unique ID (usually a number). Higher ID means higher priority.
When a process notices the coordinator is down, it starts an election.
The process sends election messages to all processes with higher IDs.
If no one responds, it declares itself as coordinator and announces to others.
If any higher-ID process responds, that process takes over the election.
This continues until the highest-ID process becomes coordinator.
Step-by-Step Bully Algorithm (with diagram)
Processes: P1 (ID=1), P2 (ID=2), P3 (ID=3), P4 (ID=4)
Assume P4 (highest ID) is the coordinator.
1. Coordinator (P4) crashes.
2. P2 notices and starts election.
3. P2 sends election messages to P3 and P4.
4. P3 responds (P4 is down).
5. P3 starts its own election, sends to P4 (no response).
6. P3 declares itself coordinator.
7. P3 sends coordinator message to P1, P2.
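A simplified simulation of the bully election, modelling the alive processes as a set of IDs (a real implementation exchanges messages and uses timeouts):

```python
def bully_election(initiator, alive):
    """Return the new coordinator's ID, simulating the bully algorithm."""
    higher = [p for p in alive if p > initiator]
    if not higher:
        return initiator            # no higher ID answered: declare self
    # A responding higher process takes over; repeat until the top.
    return bully_election(min(higher), alive)

# P4 crashed; P2 starts the election among the survivors.
print(bully_election(2, alive={1, 2, 3}))   # -> 3 (P3 becomes coordinator)
```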
Ring Algorithm
How it Works:
Processes are arranged logically in a ring.
When a process detects coordinator failure, it sends an election message containing
its own ID to the next process in the ring.
Each process appends its ID to the message and passes it along.
When the message comes back to the initiator, it knows all IDs.
The initiator selects the process with the highest ID as the coordinator and sends a
coordinator message around the ring to inform all.
Step-by-Step Ring Algorithm (with diagram)
Processes in ring order: P1 → P2 → P3 → P4 → P1
Assume P4 was coordinator and crashed.
1. P2 detects failure and starts election by sending message: [2] to P3.
2. P3 receives [2], adds its ID: [2,3], passes to P4.
3. P4 down, no response; timeout triggers P3 to send to P1.
4. P1 receives [2,3], adds its ID: [2,3,1], sends to P2.
5. P2 receives message back, finds max ID = 3 (P3).
6. P2 sends coordinator message: "P3 is coordinator" around ring.
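The same scenario for the ring algorithm, modelled as a single pass that collects the IDs of alive processes; a second pass would announce the winner:

```python
def ring_election(ring, initiator, alive):
    """Collect alive IDs once around the ring, then pick the highest."""
    n = len(ring)
    start = ring.index(initiator)
    collected = []
    for i in range(n):                      # one full trip around the ring
        pid = ring[(start + i) % n]
        if pid in alive:                    # crashed nodes are skipped
            collected.append(pid)
    return max(collected)                   # announced in a second pass

# Ring P1 -> P2 -> P3 -> P4; P4 crashed; P2 initiates.
print(ring_election([1, 2, 3, 4], initiator=2, alive={1, 2, 3}))  # -> 3
```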
7. What are the general characteristics of inter-process communication? What are the various types of synchronous and asynchronous communication in IPC? Why does blocking receive have no disadvantages in Java?
General Characteristics of Inter-Process Communication (IPC)
1. Process Coordination: Enables processes to cooperate and synchronize their actions.
2. Data Exchange: Allows sharing of data or messages between processes.
3. Communication Mechanism: Can be message passing or shared memory.
4. Concurrency: Supports concurrent execution of processes.
5. Transparency: Often abstracts communication details from processes.
6. Synchronization: Helps coordinate access to shared resources.
7. Deadlock Handling: Requires mechanisms to avoid or handle deadlocks.
8. Reliability: Ensures messages are delivered accurately and in order.
Types of IPC Communication
| Type | Description | Example |
|---|---|---|
| Synchronous Communication | Sender and receiver must wait for each other to send/receive messages. | Blocking send & blocking receive |
| Asynchronous Communication | Sender sends a message without waiting for the receiver; the receiver can receive it later. | Non-blocking send & non-blocking receive |
Why Blocking Receive Has No Disadvantages in Java?
In Java, blocking receive is often implemented using wait/notify or synchronized
methods.
Java threads are managed efficiently by the JVM, so blocking a thread to wait for messages does not waste CPU cycles.
The thread is put in a waiting state and only resumes when the message arrives.
This avoids busy-waiting and resource wastage, making blocking receive efficient and
natural in Java’s thread model.
8. Describe Lamport’s clock algorithm and give an example of how to order events in
a distributed system. What are the algorithm’s limitations?
Purpose:
To provide a way to order events in a distributed system where there is no global clock,
ensuring causal ordering of events.
How Lamport’s Clock Works:
1. Logical Clock:
Each process maintains a logical clock (counter).
2. Rules for Updating Clocks:
o Rule 1: Before a process executes an event, it increments its clock by 1.
o Rule 2: When a process sends a message, it attaches its current clock value to
the message.
o Rule 3: When a process receives a message with timestamp T_msg, it updates
its clock as:
C = max(C, T_msg) + 1
3. Ordering Events:
Events are ordered by comparing their timestamps. If two events have the same
timestamp, break ties by process IDs.
Example of Lamport's Clock with two processes (P1 and P2):
Step-by-step:

| Event | Process | Action | Clock Update |
|---|---|---|---|
| E1 | P1 | Local event | Clock = 0 + 1 = 1 |
| E2 (send msg) | P1 | Send message to P2 | Clock = 1 + 1 = 2; send ts=2 |
| E3 (recv msg) | P2 | Receive message from P1 | Clock = max(0, 2) + 1 = 3 |
| E4 | P2 | Local event | Clock = 3 + 1 = 4 |

Event order by timestamps:
E1 (P1:1) → E2 (P1:2) → E3 (P2:3) → E4 (P2:4)
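A minimal Python implementation of the three rules, reproducing the table above:

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):           # Rule 1: increment before each event
        self.time += 1
        return self.time

    def send(self):                  # Rule 2: increment, attach timestamp
        self.time += 1
        return self.time

    def receive(self, ts_msg):       # Rule 3: C = max(C, T_msg) + 1
        self.time = max(self.time, ts_msg) + 1
        return self.time

p1, p2 = LamportClock(), LamportClock()
p1.local_event()                 # E1: P1 -> 1
ts = p1.send()                   # E2: P1 -> 2, message carries ts=2
p2.receive(ts)                   # E3: P2 -> max(0, 2) + 1 = 3
p2.local_event()                 # E4: P2 -> 4
```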
Limitations of Lamport’s Clock Algorithm
1. Only Partial Ordering:
Lamport clocks ensure causal order, but cannot detect concurrent events (events
that are independent).
2. No Total Ordering of Events:
Different processes may have the same timestamp for unrelated events.
3. Does Not Capture Causality Fully:
A smaller timestamp does not imply causal precedence: C(a) < C(b) does not mean that a happened before b, so the clocks cannot tell causally related events apart from concurrent ones.
4. Tie-breaking Needed:
Additional rules like process IDs are needed to break ties.
9. Why is computer clock synchronization necessary? Describe the design
requirements for a system to synchronize the clocks in a distributed system.
Why is Computer Clock Synchronization Necessary?
Consistency: To ensure all processes in a distributed system agree on the order of
events.
Coordination: Many distributed algorithms rely on synchronized time for
coordination (e.g., mutual exclusion, resource allocation).
Data Integrity: Correct timestamps prevent conflicts in databases and logs.
Fault Detection: Helps detect failures or anomalies by comparing timestamps.
Scheduling: Enables timed events and deadlines to be managed accurately.
Design Requirements for Clock Synchronization in Distributed Systems
1. Accuracy:
Clocks should be synchronized within an acceptable error margin suitable for the
application.
2. Scalability:
The synchronization method must work efficiently as the number of nodes increases.
3. Fault Tolerance:
The system should handle failures of nodes or communication links without losing
synchronization.
4. Low Overhead:
Communication and computation overhead for synchronization should be minimal to
avoid performance degradation.
5. Autonomy:
Each node should be able to maintain and adjust its clock independently but still
synchronize effectively with others.
6. Consistency:
The system should ensure a consistent global time view or causal ordering among all
processes.
7. Robustness to Network Delays:
The design must consider unpredictable network delays and asymmetries in message
transmission times.
8. Adaptability:
The synchronization system should adapt to changes in network topology or node
addition/removal.
Unit 4:
1. What are the key issues in Replica Management? Explain the following with
respect to content replication and placement with suitable diagram.[9] i)
Permanent Replicas ii) Server-Initiated Replicas iii) Client-Initiated Replicas
Key Issues in Replica Management
Consistency: Keeping all replicas up-to-date and synchronized.
Replica Placement: Deciding where to place replicas for better performance.
Replica Creation: When and how to create replicas.
Fault Tolerance: Ensuring system works even if some replicas fail.
Load Balancing: Distributing requests evenly among replicas.
Scalability: Handling many replicas without overhead.
Update Propagation: Efficiently updating all replicas when data changes.
Replica Types in Content Replication and Placement :
| Replica Type | Description | How It Works | Diagram Idea |
|---|---|---|---|
| i) Permanent Replicas | Replicas placed at fixed, chosen servers | Always available; placed by the admin or system for reliability | Server 1 and Server 2 always hold a replica of the data |
| ii) Server-Initiated Replicas | Server decides when and where to create replicas based on demand | Server monitors usage and creates replicas dynamically | Server creates a new replica when demand spikes |
| iii) Client-Initiated Replicas | Clients request and create replicas locally | Client copies data to its local cache for faster access | Client downloads and stores the data locally |
In short:
1. Permanent Replicas:
o Always stored on fixed servers for reliability.
2. Server-Initiated Replicas:
o Created by servers when demand increases.
3. Client-Initiated Replicas:
o Stored locally by clients for faster access.
2. What is a primary-based protocol in a consistency protocol?
Explain the working of replicated write protocols with active replication.
Explain the working of a primary-backup protocol with a suitable diagram.
Primary-Based Protocol in Consistency Protocol:
A primary-based protocol is a consistency protocol used in distributed systems where
one replica (called the primary) is responsible for managing write operations. All
updates are first performed on the primary, and then the changes are propagated to
backup replicas.
Replicated Write Protocols with Active Replication:
In active replication, each replica can process write requests simultaneously.
All replicas execute the same operation independently to maintain consistency.
This method is suitable when all replicas can perform identical operations
concurrently.
How It Works:
1. A client sends a write request to all replicas.
2. All replicas perform the same write operation independently.
3. If all replicas succeed, the client receives a confirmation.
4. If any replica fails, the operation might be retried or handled by a fault-tolerant
mechanism.
Primary-Backup Protocol:
The system designates one replica as the primary and others as backups.
Write operations are first performed on the primary, which then propagates the
update to the backups.
This protocol ensures strong consistency as all writes are handled in a centralized
way.
Working Steps:
1. Client Request: The client sends a write request to the primary.
2. Primary Update: The primary performs the write operation.
3. Propagation: The primary sends the update to all backup replicas.
4. Acknowledgment: Once backups are updated, the primary sends an
acknowledgment to the client.
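A condensed sketch of the primary-backup write path (networking and failure handling omitted; class and method names are illustrative):

```python
class Backup:
    def __init__(self):
        self.store = {}
    def apply(self, key, value):         # step 3: propagation target
        self.store[key] = value

class Primary:
    def __init__(self, backups):
        self.store = {}
        self.backups = backups
    def write(self, key, value):
        self.store[key] = value          # step 2: primary performs write
        for b in self.backups:           # step 3: propagate to backups
            b.apply(key, value)
        return "ACK"                     # step 4: acknowledge the client

primary = Primary([Backup(), Backup()])
primary.write("seat/12A", "booked")      # step 1: client request
```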
Advantages of Primary-Backup Protocol:
Strong Consistency: Ensures that updates are uniform across replicas.
Fault Tolerance: Backup replicas take over if the primary fails.
Disadvantages:
Single Point of Failure: If the primary crashes, system availability is affected until a
new primary is elected.
Performance Bottleneck: The primary may become a bottleneck under heavy load.
3. What is checkpointing in a distributed system? Explain the working of
Coordinated checkpointing recovery mechanism.
Checkpointing in a Distributed System
Checkpointing saves the current state (snapshot) of a system during execution.
The saved state is stored in stable storage.
It is used to recover the system after a failure.
Helps distributed systems recover all processes to a consistent state.
Minimizes work lost and avoids restarting from the beginning.
Coordinated Checkpointing Recovery Mechanism
In coordinated checkpointing, all processes in the distributed system synchronize to
take checkpoints at the same logical time to ensure consistency. This avoids problems
like the "domino effect," where recovery could roll back indefinitely.
Working Steps of Coordinated Checkpointing:
1. Checkpoint Initiation:
One process (called the coordinator) decides to initiate a checkpoint.
2. Checkpoint Request:
The coordinator sends a checkpoint request message to all other processes.
3. Checkpoint Taking:
Upon receiving the request, each process:
o Suspends normal operations briefly.
o Saves its current state (checkpoint) to stable storage.
4. Acknowledgment:
Each process sends an acknowledgment back to the coordinator once checkpointing
is done.
5. Completion:
When the coordinator receives acknowledgments from all processes, the
coordinated checkpoint is considered complete.
6. Normal Execution Resumes:
All processes continue normal execution until the next checkpoint.
Recovery Using Coordinated Checkpointing:
If a failure occurs, all processes roll back to the last coordinated checkpoint and
resume execution from there, ensuring a consistent global state.
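A toy coordinator-side sketch of these steps, assuming processes are local dictionaries and stable storage is a local file; in a real system each process pauses, snapshots its own state, and acknowledges over the network:

```python
import pickle

def take_coordinated_checkpoint(processes, path_prefix="ckpt"):
    """Ask every process for its state and write each snapshot
    to stable storage (here, a local pickle file per process)."""
    acks = 0
    for pid, state in processes.items():      # steps 2-3: request + save
        with open(f"{path_prefix}_{pid}.pkl", "wb") as f:
            pickle.dump(state, f)
        acks += 1                              # step 4: acknowledgment
    return acks == len(processes)              # step 5: checkpoint complete

processes = {"P1": {"balance": 100}, "P2": {"balance": 250}}
assert take_coordinated_checkpoint(processes)
```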
4. What are two primary reasons for replication? Explain the causal consistency
model with suitable example using distributed shared database.
Two Primary Reasons for Replication
1. Reliability and Fault Tolerance:
o Replication ensures that if one server or copy fails, others can take over
without service interruption.
2. Performance and Availability:
o Replicas closer to users reduce response time and balance the load,
improving system availability and speed.
Causal Consistency Model
Definition:
Causal consistency ensures that operations that are causally related are seen by all processes in the same order, while operations that are not causally related (concurrent operations) can be seen in different orders on different processes.
Key Points:
If operation A causally affects operation B, then all processes must see A before B.
Unrelated (concurrent) operations can be seen in different orders by different processes.
This model provides a balance between strong consistency and eventual consistency.
Example: Distributed Shared Database
Consider a distributed shared database with three processes: P1, P2, and P3.
1. P1 writes X = 5.
2. P2 reads X = 5 and writes Y = 10.
3. P3 reads X = 5 and writes Z = 20.
Causal Relationships:
P2's write to Y is causally dependent on P1's write to X.
P3's write to Z is independent of P1 and P2's operations
Causal Consistency Guarantees:
All processes must see X = 5 before Y = 10.
The order of Z = 20 relative to X = 5 and Y = 10 can vary across processes.
5. What is fault tolerance? Explain the transient, intermittent and permanent fault
classes. Explain in brief, various types of failures
What is Fault Tolerance?
Fault tolerance is the system's ability to keep working even when parts of it fail. For
example, if one server in a web service crashes, other servers keep the website
running without downtime.
Fault Classes
1. Transient Faults:
o Occur temporarily and disappear after a short time.
o Example: A momentary network glitch.
2. Intermittent Faults:
o Occur irregularly, appearing and disappearing repeatedly.
o Example: A loose connection causing occasional signal loss.
3. Permanent Faults:
o Persist until repaired or replaced.
o Example: A hardware component that has completely failed.
Types of Failures with Examples
1. Crash Failure:
o A process or device stops working suddenly.
o Example: A computer suddenly shuts down due to a power failure.
2. Omission Failure:
o Failure to send or receive messages.
o Example: A server fails to respond to a request due to packet loss.
3. Timing Failure:
o The system responds too early or too late.
o Example: A real-time sensor sends data after the deadline, causing the system
to miss the event.
4. Byzantine Failure:
o The system behaves incorrectly or maliciously.
o Example: A faulty sensor sends wrong temperature data inconsistent with
other sensors.
Fault Tolerance
Advantages:
1. Improves system reliability by handling failures gracefully.
2. Increases system availability, reducing downtime.
Disadvantages:
1. Adds cost due to extra hardware/software.
2. Makes system design more complex.
Applications:
1. Banking systems for safe transactions.
2. Cloud services to ensure continuous uptime.
6. What is the distributed commit problem? Discuss how this problem is solved using the two-phase commit protocol with a suitable diagram.
What is the Distributed Commit Problem?
In a distributed system, a transaction often involves multiple nodes or databases. The
Distributed Commit Problem is about ensuring all nodes agree to either commit or
abort the transaction so that the system remains consistent.
Problem:
If one node commits and others don’t, the system becomes inconsistent.
Network failures or crashes during the commit can leave some nodes uncertain
about the transaction's final state.
Two-Phase Commit Protocol (2PC)
The 2PC protocol ensures all nodes in a distributed system either commit or abort a
transaction atomically, using two phases coordinated by a coordinator node.
Phases of 2PC
1. Phase 1: Voting Phase (Prepare phase)
o Coordinator sends a prepare request to all participants asking if they can
commit.
o Participants execute the transaction up to the commit point and reply "Yes"
(ready to commit) or "No" (abort).
2. Phase 2: Commit Phase
o If all participants vote "Yes", the coordinator sends a commit request to all
participants.
o If any participant votes "No", the coordinator sends an abort request to all
participants.
o Participants either commit or abort based on the coordinator’s message and
send acknowledgment.
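The coordinator's decision rule is compact. Below is a sketch assuming each participant exposes prepare/commit/abort hooks (names are illustrative):

```python
class Participant:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
    def prepare(self):                 # Phase 1: vote
        return "YES" if self.can_commit else "NO"
    def commit(self): pass             # apply the transaction
    def abort(self): pass              # roll back

def two_phase_commit(participants):
    # Phase 1 (voting): ask every participant whether it can commit.
    votes = [p.prepare() for p in participants]
    decision = "COMMIT" if all(v == "YES" for v in votes) else "ABORT"
    # Phase 2 (decision): all participants apply the same outcome.
    for p in participants:
        p.commit() if decision == "COMMIT" else p.abort()
    return decision

print(two_phase_commit([Participant(), Participant()]))         # COMMIT
print(two_phase_commit([Participant(), Participant(False)]))    # ABORT
```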
Advantages of 2PC
1. Ensures all nodes commit or abort together.
2. Easy to understand and use.
Disadvantages of 2PC
1. Can get stuck if coordinator fails.
2. Takes more time because of many messages.
7. What are the requirements of dependable systems with respect to fault
tolerance? How RPC handles the communication failure in the presence of :
i) The client is unable to locate the server.
ii) The request message from the client to the server is lost.
Requirements of Dependable Systems for Fault Tolerance
1. Reliability: The system should continue to work correctly despite faults.
2. Availability: The system should be accessible and operational most of the time.
3. Fault Detection: The system must detect faults quickly.
4. Fault Recovery: The system should recover from faults automatically or with minimal
intervention.
5. Transparency: Failures should be hidden from users as much as possible.
i) Client Unable to Locate the Server
Cause: The client cannot find the server due to network issues or the server is down.
How RPC Handles It:
o The client’s RPC stub tries to resolve the server address.
o If the server is not found, the stub raises an error or times out.
o The client application can handle this by retrying after some delay or by
reporting the failure to the user.
o Some systems use name services or registries to track available servers to
improve locating them.
ii) Request Message from Client to Server is Lost
Cause: Network failure causes the client’s request message not to reach the server.
How RPC Handles It:
o The client waits for a reply, but after a timeout period, assumes the message
was lost.
o The client retransmits the request message.
o To avoid the server executing the same request multiple times (duplicate
execution), the server keeps track of request IDs.
o If the server receives a duplicate request, it sends the previous response
without re-executing the request, ensuring idempotency.
o This mechanism ensures reliable communication despite message loss.
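A server-side sketch of the duplicate-filtering idea: cache the reply per request ID so that a retransmitted request is answered without being re-executed (names are illustrative):

```python
class RPCServer:
    def __init__(self):
        self.replies = {}                      # request_id -> cached reply

    def handle(self, request_id, operation):
        if request_id in self.replies:         # duplicate after retransmit
            return self.replies[request_id]    # reply again, no re-execution
        result = operation()                   # execute exactly once
        self.replies[request_id] = result
        return result

server = RPCServer()
server.handle("req-1", lambda: "booked")   # executes the operation
server.handle("req-1", lambda: "booked")   # duplicate: cached reply returned
```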
Unit 5:
1. Describe the architecture of the Sun Network File System in detail.
Architecture of Sun Network File System (NFS)
NFS is a distributed file system protocol developed by Sun Microsystems that allows
users to access files over a network as if they were on their local machines.
Key Components of NFS Architecture
1. Client:
o The machine that requests access to files stored on remote servers.
o Runs an NFS client software that handles remote file operations.
2. Server:
o The machine that stores and shares files over the network.
o Runs NFS server software which exports directories (shares files) to clients.
3. Remote Procedure Call (RPC):
o NFS uses RPC to communicate between client and server.
o RPC allows a client to request services (like reading/writing files) from the
server transparently.
4. Mount Protocol:
o Clients use the mount protocol to request access to specific directories on the
NFS server.
o After mounting, the remote directories appear as part of the client’s local file
system.
5. File Handle:
o Unique identifier used by the server to identify each file or directory.
o Sent to the client and used in all file operations to locate the file on the
server.
How NFS Works (Step-by-step)
1. Exporting:
o The server exports one or more directories for sharing by clients.
2. Mounting:
o The client uses the mount protocol to mount the exported directory.
o After mounting, the client can access files in the directory as if they are local.
3. File Operations:
o When a client wants to perform file operations (read, write, open, close), it
sends RPC requests to the server using the file handle.
4. Server Processing:
o The server receives RPC calls, processes the file operations, and returns
results to the client.
5. Statelessness:
o NFS servers are mostly stateless, meaning they do not keep client states
between requests.
o This simplifies recovery from failures.
2. What is a Directory Service? What is the difference between DNS and X.500? Describe in detail the components of the X.500 service architecture.
What is a Directory Service?
A Directory Service is a specialized database system that stores, organizes, and
provides access to information in a networked environment. It helps users and
applications find resources like users, devices, services, or files in a distributed
system by name or attributes.
Difference Between DNS and X.500
| Feature | DNS (Domain Name System) | X.500 Directory Service |
|---|---|---|
| Purpose | Translates domain names to IP addresses | Provides a comprehensive directory for various network resources and entities |
| Data Structure | Hierarchical, tree-based (domain name hierarchy) | Hierarchical, object-oriented Directory Information Tree (DIT) |
| Scope | Mainly Internet addressing (hosts, domains) | Broader; stores information about users, organizations, devices, etc. |
| Standards | Defined by IETF RFCs | Defined by ITU-T X.500 standards |
| Query Mechanism | Simple name resolution | More complex queries based on attributes and relationships |
| Distribution | Yes; decentralized DNS servers | Distributed directory servers called Directory System Agents (DSAs) |
| Security | Limited security features | Stronger security and access control features |
Components of X.500 Service Architecture
X.500 provides a distributed directory service with the following components:
1. Directory User Agent (DUA):
o Client application that accesses directory services to query or update
directory information.
o It is the interface for users or applications.
2. Directory System Agent (DSA):
o Server component that stores part of the directory information.
o Responsible for managing and maintaining the directory database.
o DSAs communicate with each other to resolve queries across distributed
directory trees.
3. Directory Information Base (DIB):
o The database that holds directory entries organized in a hierarchical tree
called the Directory Information Tree (DIT).
o Each entry contains attributes related to objects like people, organizations, or
devices.
4. Directory Access Protocol (DAP):
o The protocol used by DUAs to communicate with DSAs.
o Allows users to query and update directory entries.
5. Directory System Protocol (DSP):
o Protocol used for communication between DSAs.
o Helps in distributing directory requests across servers.
3. What are web services? Describe with a suitable diagram the general
organization of the Apache web server
What are Web Services?
Web services are software systems designed to support machine-to-machine
interaction over a network. They allow applications to communicate with each other,
regardless of the platforms or languages they are built on.
Characteristics of Web Services:
1. Interoperability: Different applications can communicate.
2. Platform Independence: Works across various OS and programming languages.
3. Network Accessibility: Uses HTTP/HTTPS for communication.
4. Standard Protocols: Uses SOAP, REST, XML, or JSON.
General Organization of the Apache Web Server:
The Apache Web Server is a widely-used open-source web server software that
handles HTTP requests and serves web content to clients.
Components of Apache Web Server:
1. Client:
o The browser or application that sends HTTP requests to the server.
2. Apache Server (Main Components):
o HTTP Protocol Handler: Manages HTTP requests and responses.
o Request Processing Modules: Handle various content types (like PHP, Perl).
o Core Server: Controls configuration, logging, and server management.
o Modules (Dynamic Shared Objects): Extend the server’s functionality (e.g.,
mod_ssl for SSL support).
o Content Generators: Generate content dynamically (like CGI scripts).
o Access Control Modules: Manage authentication and authorization.
o Logging Module: Tracks requests, errors, and server activity.
3. File System:
o Stores HTML, CSS, JavaScript, images, and other web content.
How Apache Web Server Works (Step-by-Step):
1. Client Request: The client sends an HTTP request to the Apache server.
2. Request Handling: The HTTP protocol handler processes the request.
3. Module Processing: Appropriate modules handle dynamic content (like PHP).
4. Content Retrieval: The server retrieves the requested file from the file system.
5. Response to Client: The server sends the processed response back to the client.
What is Apache HTTP Server?
Apache HTTP Server is a free, open-source web server developed and maintained by the Apache Software Foundation (ASF).
4. Discuss the master-slave architecture of Hadoop Distributed File System along
with functions of its key components.
Master-Slave Architecture of Hadoop Distributed File System (HDFS)
HDFS follows a master-slave architecture to store and process vast amounts of data.
It is a core component of the Hadoop framework designed for distributed storage
and efficient data processing.
Key Components of HDFS:
The HDFS architecture mainly consists of two types of nodes:
1. NameNode (Master)
2. DataNode (Slave)
1. NameNode (Master)
Role: Manages the file system metadata and directory structure.
Functions:
o Stores metadata like filenames, file permissions, and block locations.
o Controls access to files and coordinates data replication.
o Maintains a namespace image and an edit log.
o Monitors the health of DataNodes through heartbeat signals.
2. DataNode (Slave)
Role: Stores actual data in the form of data blocks.
Functions:
o Reads and writes requests from clients as instructed by the NameNode.
o Periodically sends heartbeat signals to the NameNode.
o Sends block reports to the NameNode.
o Responsible for data replication as instructed.
How HDFS Works (Step-by-Step):
1. File Splitting:
o A file is split into fixed-size blocks (default 128 MB).
2. Data Storage:
o These blocks are stored on different DataNodes.
3. Metadata Management:
o The NameNode maintains metadata of file-to-block mapping.
4. Client Interaction:
o The client requests a file from the NameNode.
o The NameNode responds with the location of the data blocks.
o The client directly communicates with the DataNodes to access data.
5. Replication:
o Data is replicated across multiple DataNodes for fault tolerance (default
replication factor is 3).
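A toy sketch of the NameNode bookkeeping described above: split a file into 128 MB blocks and choose three DataNodes per block. The round-robin placement here is a simplification; real HDFS placement is rack-aware.

```python
BLOCK_SIZE = 128 * 1024 * 1024   # HDFS default block size (128 MB)
REPLICATION = 3                  # default replication factor

def plan_blocks(file_size, datanodes):
    """Split a file into blocks and pick REPLICATION DataNodes per block."""
    n_blocks = -(-file_size // BLOCK_SIZE)        # ceiling division
    plan = {}
    for b in range(n_blocks):
        plan[f"block_{b}"] = [datanodes[(b + r) % len(datanodes)]
                              for r in range(REPLICATION)]
    return plan

# A 300 MB file on four DataNodes -> 3 blocks, 3 replicas each.
print(plan_blocks(300 * 1024**2, ["dn1", "dn2", "dn3", "dn4"]))
```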
Advantages of Master-Slave Architecture:
1. Efficient Data Management: Centralized control through the NameNode.
2. High Availability: Data replication ensures fault tolerance.
3. Scalability: Easy to add more DataNodes as needed.
Disadvantages:
1. Single Point of Failure: If the NameNode fails, the entire system goes down.
2. Network Congestion: Heavy load on NameNode in large clusters.
5. Explain the Bandwidth, Latency and Loss rate parameters with respect to
multimedia stream. Explain the QoS negotiation procedure and admission control
scheme for distributed multimedia application.
Bandwidth, Latency, and Loss Rate in Multimedia Streaming:
In multimedia streaming, Bandwidth, Latency, and Loss Rate are crucial parameters
that impact the Quality of Service (QoS).
1. Bandwidth:
Definition: The amount of data transmitted per unit time (measured in Mbps or
Gbps).
Relevance:
o High bandwidth ensures smooth video and audio transmission.
o Insufficient bandwidth causes buffering and lag.
Example: A 4K video stream may require 25 Mbps of bandwidth.
2. Latency:
Definition: The time taken for data to travel from the source to the destination
(measured in milliseconds).
Relevance:
o Low latency is crucial for real-time multimedia applications like video calls and
online gaming.
o High latency results in delays, making live interactions frustrating.
Example: A latency below 50 ms is ideal for video conferencing.
3. Loss Rate:
Definition: The percentage of data packets lost during transmission.
Relevance:
o High loss rate leads to degraded video and audio quality.
o Causes pixelation in video and distortion in audio.
Example: A loss rate of more than 1% can significantly affect streaming quality.
Quality of Service (QoS) Negotiation Procedure:
QoS ensures that multimedia applications meet performance standards, particularly
when resources are shared among multiple users.
Steps of QoS Negotiation:
1. Resource Request:
o The client requests the desired QoS (like bandwidth, latency, and loss
tolerance) from the server.
2. Service Offer:
o The server checks the availability of resources and proposes a service level.
3. Negotiation:
o If the client’s request and server’s offer don’t match, they negotiate to find a
middle ground.
4. Agreement:
o Both parties agree on a QoS level that can be maintained.
5. Service Establishment:
o The service starts with the agreed parameters.
Admission Control Scheme for Distributed Multimedia Applications:
Admission control determines whether the network can accommodate a new
multimedia stream without degrading the quality of existing streams.
Procedure:
1. Request Evaluation:
o The system checks the current network load and available resources.
2. QoS Verification:
o Compares the requested QoS with the available capacity.
3. Decision Making:
o If resources are sufficient, the request is accepted.
o If not, it is either rejected or downgraded to a lower quality.
4. Resource Reservation:
o If accepted, the system reserves the necessary resources.
Example:
In a video conferencing application, admission control ensures that adding a new
user does not compromise the existing call quality.
If the network is already at capacity, new requests might be downgraded from HD to
SD quality.
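A toy admission-control check matching this example, assuming bandwidth is the only managed resource (real schemes also weigh latency and loss budgets):

```python
def admit_stream(requested_mbps, used_mbps, capacity_mbps, fallback_mbps=None):
    """Accept, downgrade, or reject a new multimedia stream."""
    free = capacity_mbps - used_mbps
    if requested_mbps <= free:
        return ("ACCEPT", requested_mbps)      # reserve full quality
    if fallback_mbps is not None and fallback_mbps <= free:
        return ("DOWNGRADE", fallback_mbps)    # e.g. HD -> SD
    return ("REJECT", 0.0)

print(admit_stream(25, used_mbps=80, capacity_mbps=100, fallback_mbps=8))
# -> ('DOWNGRADE', 8): not enough headroom for HD, SD still fits
```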
6. Why Quality of Service Management is important in Distributed Multimedia
Systems? Describe QoS manager responsibilities using suitable graphical
representation.
Why Quality of Service (QoS) Management is Important in Distributed Multimedia
Systems:
Distributed multimedia systems (like video conferencing, online streaming, and VoIP)
require consistent performance to deliver a good user experience. QoS management
ensures that multimedia content is transmitted smoothly, without delays,
interruptions, or degradation.
Importance of QoS Management:
1. Reliable Data Delivery: Maintains the quality of audio, video, and real-time data
transmission.
2. Minimal Latency: Reduces delay to support interactive applications.
3. Bandwidth Optimization: Efficiently uses available network resources.
4. Reduced Packet Loss: Prevents loss of important data, which can disrupt media
quality.
5. Service Assurance: Guarantees that users receive the expected level of service even
under network congestion.
Responsibilities of QoS Manager:
A QoS Manager is responsible for maintaining and optimizing the quality of
multimedia streams in distributed systems.
1. Resource Reservation:
Allocates necessary resources (like bandwidth) before starting the multimedia
service.
Uses protocols like RSVP (Resource Reservation Protocol) to reserve resources.
2. QoS Monitoring:
Continuously checks network performance metrics such as latency, jitter, and packet
loss.
Uses monitoring tools to detect performance issues.
3. Traffic Shaping:
Regulates data flow to prevent congestion by adjusting transmission rates.
Prioritizes multimedia traffic over less critical data.
4. Adaptive QoS Control:
Adjusts quality parameters dynamically based on network conditions.
Uses techniques like bitrate adaptation to maintain quality during fluctuations.
5. Fault Tolerance:
Detects and handles network failures or performance degradation.
Reroutes data to maintain service continuity.
Graphical Representation of QoS Management:
Example: Video Conferencing System:
Before Call: Reserves bandwidth.
During Call: Monitors latency; reduces video quality if needed.
Server Failure: Switches to a backup server to keep the call running.
7. What are the key design issues for distributed file systems? Describe the
requirements for distributed file systems
Key Design Issues for Distributed File Systems (DFS):
1. Transparency:
o The system should appear as a single unified file system, despite being
distributed.
o Types: Location, Access, Failure, and Replication Transparency.
2. Consistency:
o Ensures that all users see the same data, even with concurrent access.
o Techniques: Caching, Locking, Versioning.
3. Fault Tolerance:
o The system should continue to function despite hardware or network
failures.
o Methods: Data Replication and Checkpointing.
4. Scalability:
o The system should efficiently handle growth in the number of users and
data.
o Approaches: Decentralized architectures and Load Balancing.
5. Security:
o Protect data from unauthorized access and modifications.
o Techniques: Authentication, Encryption, and Access Control.
6. Performance:
o Ensure low latency and high throughput for file access and updates.
o Optimization: Efficient caching and data distribution.
7. Concurrency Control:
o Manages simultaneous data access by multiple users without conflicts.
o Techniques: Locking Mechanisms and Versioning.
Requirements for Distributed File Systems:
1. Data Accessibility:
o Users should access files from any location within the network.
2. Reliability:
o Continuous access even in the presence of server or network failures.
3. Performance Efficiency:
o Fast read/write operations through caching and data replication.
4. Consistency Maintenance:
o Ensure data consistency during concurrent file modifications.
5. Security:
o Data encryption, user authentication, and access control mechanisms.
6. Scalable Architecture:
o Ability to expand the system without performance degradation.
Some real-world examples of Distributed File Systems (DFS):
1. Hadoop Distributed File System (HDFS):
o Usage: Big Data processing (e.g., Apache Hadoop).
o Features: Fault tolerance, scalability, data replication.
o Example: Used by companies like Yahoo and Facebook for data analytics.
8. Explain, in brief, the two places where client-side web caching occurs. Explain cooperative caching with a suitable diagram.
Client-Side Web Caching:
Client-side web caching improves web performance by storing frequently accessed
data locally. There are two primary places where client-side caching occurs:
1. Browser Cache:
Location: Stored on the user’s device within the web browser.
Purpose: Stores web pages, images, CSS, and scripts to reduce load time during
future visits.
Example: Visiting a website multiple times loads faster as images and styles are
cached.
2. Proxy Cache:
Location: Stored on a proxy server between the client and the internet.
Purpose: Reduces bandwidth usage and speeds up access for multiple users by
storing shared content.
Example: An organization's proxy server caches common resources to reduce
network traffic.
Cooperative Caching:
Cooperative caching involves multiple caches (e.g., in a distributed network) working
together to enhance data availability and access speed.
Key Concept:
Caches on different client nodes cooperate to share cached content.
Reduces redundant data retrieval from the server.
Diagram: Cooperative Caching (Simple Representation)
Explanation:
1. Client A requests data not in its cache, so it checks Client B’s cache before accessing
the server.
2. Client B may retrieve data from its cache or forward the request to Client C.
3. If no client has the data, it is fetched from the Main Server and shared among the
clients.
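A minimal lookup chain matching this explanation: each client checks its own cache, then its peers' caches, and only then the main server:

```python
def cooperative_get(key, own_cache, peer_caches, server):
    if key in own_cache:                 # local hit
        return own_cache[key]
    for peer in peer_caches:             # steps 1-2: ask cooperating clients
        if key in peer:
            own_cache[key] = peer[key]   # keep a local copy for next time
            return peer[key]
    value = server[key]                  # step 3: last resort, main server
    own_cache[key] = value
    return value

server = {"/index.html": "<html>...</html>"}
client_a, client_b = {}, {"/index.html": "<html>...</html>"}
print(cooperative_get("/index.html", client_a, [client_b], server))
```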
End.