Unit 1 - Distributed System
• Resource Sharing: the ability to use any hardware, software, or data anywhere in the system.
• Openness: concerned with extensions and improvements to the system (i.e., how openly the software is developed and shared with others).
• Concurrency: naturally present in distributed systems, since the same activity or functionality can be carried out concurrently by separate users at remote locations. Every local system has its own independent operating system and resources.
• Scalability: the system can grow in scale, serving more users with more processors while maintaining (or improving) responsiveness.
• Fault tolerance: concerns the reliability of the system; if hardware or software fails, the system continues to operate properly without degraded performance.
• Transparency: hides the complexity of the distributed system from users and application programs, so that the system appears to them as a single coherent whole.
Properties:
• Scalability:
The ability to expand a system by adding more nodes to handle increased workload without
significant performance degradation.
• Fault Tolerance:
The system's capability to continue functioning even when individual components fail,
achieved through redundancy and error handling mechanisms.
• Availability:
The system remains accessible to users even during partial failures, ensuring continuous
service.
• Consistency:
Maintaining data integrity across multiple nodes, ensuring consistent access to information
regardless of which node is accessed.
• Concurrency:
Multiple operations can be executed simultaneously on different nodes within the system.
• Transparency:
Users should not be aware of the underlying distributed nature of the system, experiencing a
seamless interaction.
• Reliability:
The system is dependable and can be relied upon to perform its intended functions consistently.
GOALS
Just because it is possible to build distributed systems does not necessarily mean that it is a
good idea. After all, with current technology it is also possible to put four floppy disk drives
on a personal computer. It is just that doing so would be pointless. In this section we discuss
four important goals that should be met to make building a distributed system worth the effort.
A distributed system should make resources easily accessible; it should reasonably hide the
fact that resources are distributed across a network; it should be open; and it should be scalable.
1. Making Resources Accessible
The main goal of a distributed system is to make it easy for users to access remote resources, and to share them in a controlled and efficient way. Resources can be just about anything, but typical examples include printers, computers, storage facilities, data, files, Web pages, and networks, to name just a few. There are many reasons for wanting to share resources. One obvious reason is economics. For example, it is cheaper to let a printer be shared by several users in a small office than to buy and maintain a separate printer for each user. Likewise, it makes economic sense to share costly resources such as supercomputers, high-performance storage systems, imagesetters, and other expensive peripherals.
2. Distribution Transparency
An important goal of a distributed system is to hide the fact that its processes and resources are
physically distributed across multiple computers. A distributed system that is able to present
itself to users and applications as if it were only a single computer system is said to be
transparent.
Types of Transparency
The concept of transparency can be applied to several aspects of a distributed system, the most important of which are shown in Fig. 1-2.
3. Openness
Another important goal of distributed systems is openness. An open distributed system is a
system that offers services according to standard rules that describe the syntax and semantics
of those services. For example, in computer networks, standard rules govern the format,
contents, and meaning of messages sent and received. Such rules are formalized in protocols.
In distributed systems, services are generally specified through interfaces, which are often
described in an Interface Definition Language (IDL). Interface definitions written in an IDL
nearly always capture only the syntax of services. In other words, they specify precisely the
names of the functions that are available together with types of the parameters, return values,
possible exceptions that can be raised, and so on. The hard part is specifying precisely what
those services do, that is, the semantics of interfaces. In practice, such specifications are always
given in an informal way by means of natural language.
4. Scalability
Worldwide connectivity through the Internet is rapidly becoming as common as being able to
send a postcard to anyone anywhere around the world. With this in mind, scalability is one of
the most important design goals for developers of distributed systems.
Scalability Problems
When a system needs to scale, very different types of problems need to be solved. Let us first
consider scaling with respect to size. If more users or resources need to be supported, we are
often confronted with the limitations of centralized services, data, and algorithms. Only decentralized algorithms should be used. These algorithms generally have the following characteristics, which distinguish them from centralized algorithms:
1. No machine has complete information about the system state.
2. Machines make decisions based only on local information.
3. Failure of one machine does not ruin the algorithm.
4. There is no implicit assumption that a global clock exists.
Scaling Techniques
The approach of shipping code to clients is now widely supported by the Web in the form of Java applets and JavaScript.
Pitfalls
It should be clear by now that developing distributed systems can be a formidable task. As we
will see many times throughout this book, there are so many issues to consider at the same time
that it seems that only complexity can be the result. Nevertheless, by following a number of
design principles, distributed systems can be developed that strongly adhere to the goals we set
out in this chapter. Many principles follow the basic rules of decent software engineering and
will not be repeated here.
However, distributed systems differ from traditional software because components are
dispersed across a network. Not taking this dispersion into account during design time is what
makes so many systems needlessly complex and results in mistakes that need to be patched
later on. Peter Deutsch, then at Sun Microsystems, formulated these mistakes as the following
false assumptions that everyone makes when developing a distributed application for the first
time:
1. The network is reliable.
2. The network is secure.
3. The network is homogeneous.
4. The topology does not change.
5. Latency is zero.
6. Bandwidth is infinite.
7. Transport cost is zero.
8. There is one administrator.
1. Location Transparency
Location transparency refers to the ability to access distributed resources without knowing their
physical or network locations. It hides the details of where resources are located, providing a
uniform interface for accessing them.
• Importance: Enhances system flexibility and scalability by allowing resources to be
relocated or replicated without affecting applications.
• Examples:
o DNS (Domain Name System): Maps domain names to IP addresses, providing
location transparency for web services.
o Virtual Machines (VMs): Abstract hardware details, allowing applications to
run without knowledge of the underlying physical servers.
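As an illustration of location transparency, a DNS lookup can be sketched in a few lines of Python; the hostname is illustrative, and the resolved address may change if the service is relocated, without the caller noticing:

```python
import socket

# Resolve a logical name to whatever address currently serves it.
# The caller never learns, or cares, where the machine physically is.
ip = socket.gethostbyname("example.com")  # illustrative hostname
print(ip)  # the service can be moved or replicated behind the same name
```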
2. Access Transparency
Access transparency ensures that users and applications can access distributed resources
uniformly, regardless of the distribution of those resources across the network.
• Significance: Simplifies application development and maintenance by providing a
consistent method for accessing distributed services and data.
• Methods:
o Remote Procedure Call (RPC): Allows a program to call procedures located
on remote systems as if they were local.
o Message Queues: Enable asynchronous communication between distributed
components without exposing the underlying communication mechanism.
3. Concurrency Transparency
Concurrency transparency hides the complexities of concurrent access to shared resources in
distributed systems from the application developer. It ensures that concurrent operations do not
interfere with each other.
• Challenges: Managing synchronization, consistency, and deadlock avoidance in a
distributed environment where multiple processes or threads may access shared
resources simultaneously.
• Techniques:
o Locking Mechanisms: Ensure mutual exclusion to prevent simultaneous
access to critical sections of code or data.
o Transaction Management: Guarantees atomicity, consistency, isolation, and
durability (ACID properties) across distributed transactions.
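As a minimal sketch of such a locking mechanism, the example below uses a local thread lock standing in for a distributed lock manager; the shared balance and amounts are hypothetical:

```python
import threading

balance = 0
lock = threading.Lock()  # stands in for a distributed lock manager

def deposit(amount):
    global balance
    with lock:                      # mutual exclusion: one writer at a time
        current = balance           # read
        balance = current + amount  # modify and write back without interleaving

threads = [threading.Thread(target=deposit, args=(1,)) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(balance)  # always 100; without the lock, concurrent updates could be lost
```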
4. Replication Transparency
Replication transparency ensures that clients interact with a set of replicated resources as if
they were a single resource. It hides the presence of replicas and manages consistency among
them.
• Strategies: Maintaining consistency through techniques like primary-backup
replication, where one replica (primary) handles updates and others (backups) replicate
changes.
• Applications:
o Content Delivery Networks (CDNs): Replicate content across geographically
distributed servers to reduce latency and improve availability.
o Database Replication: Copies data across multiple database instances to
enhance fault tolerance and scalability.
5. Failure Transparency
Failure transparency ensures that the occurrence of failures in a distributed system does not
disrupt service availability or correctness. It involves mechanisms for fault detection, recovery,
and resilience.
• Approaches:
o Heartbeating: Periodically checks the availability of nodes or services to detect
failures.
o Replication and Redundancy: Uses redundant components or data replicas to
continue operation despite failures.
• Examples:
o Load Balancers: Distribute traffic across healthy servers and remove failed
ones from the pool automatically.
o Automatic Failover: Redirects requests to backup resources or nodes when
primary resources fail.
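A minimal sketch of heartbeat-based failure detection might look as follows; the node names and timeout are assumptions for illustration:

```python
import time

# Last time each node was heard from (hypothetical registry).
last_heartbeat = {"node-a": time.time(), "node-b": time.time() - 30.0}
TIMEOUT = 10.0  # seconds of silence before a node is presumed failed

def record_heartbeat(node):
    last_heartbeat[node] = time.time()  # called whenever a heartbeat arrives

def suspected_failures():
    now = time.time()
    return [n for n, seen in last_heartbeat.items() if now - seen > TIMEOUT]

print(suspected_failures())  # ['node-b'] -- failover could now be triggered
```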
6. Performance Transparency
Performance transparency ensures consistent performance levels across distributed nodes
despite variations in workload, network conditions, or hardware capabilities.
• Challenges: Optimizing resource allocation and workload distribution to maintain
predictable performance levels across distributed systems.
• Strategies:
o Load Balancing: Distributes incoming traffic evenly across multiple servers to
optimize resource utilization and response times.
o Caching: Stores frequently accessed data closer to clients or within nodes to
reduce latency and improve responsiveness.
7. Security Transparency
Security transparency ensures that security mechanisms and protocols are integrated into a
distributed system seamlessly, protecting data and resources from unauthorized access or
breaches.
• Importance: Ensures confidentiality, integrity, and availability of data and services in
distributed environments.
• Techniques:
o Encryption: Secures data at rest and in transit using cryptographic algorithms
to prevent eavesdropping or tampering.
o Access Control: Manages permissions and authentication to restrict access to
sensitive resources based on user roles and policies.
8. Management Transparency
Management transparency simplifies the monitoring, control, and administration of distributed
systems by providing unified visibility and control over distributed resources.
• Methods: Utilizes automation, monitoring tools, and centralized management
interfaces to streamline operations and reduce administrative overhead.
• Examples:
o Cloud Management Platforms (CMPs): Provide unified interfaces for
provisioning, monitoring, and managing cloud resources across multiple
providers.
o Configuration Management Tools: Automate deployment, configuration, and
updates of software and infrastructure components in distributed environments.
These types of transparency are essential for designing robust, scalable, and maintainable
distributed systems, ensuring seamless operation, optimal performance, and enhanced security
in cloud computing and other distributed computing environments.
2.1 ARCHITECTURAL STYLES
Architectural styles in distributed systems define how components interact and are structured to achieve scalability, reliability, and efficiency. This section explores key architectural styles—including layered, object-based, data-centered, event-based, and peer-to-peer—highlighting their concepts, advantages, and applications in building robust distributed systems.
Using components and connectors, we can come to various configurations, which, in turn, have
been classified into architectural styles. Several styles have by now been identified, of which the
most important ones for distributed systems are:
1. Layered architectures
2. Object-based architectures
3. Data-centered architectures
4. Event-based architectures
Layered Architecture in distributed systems organizes the system into hierarchical layers, each
with specific functions and responsibilities. This design pattern helps manage complexity and
promotes separation of concerns. Here’s a detailed explanation:
• In a layered architecture, the system is divided into distinct layers, where each layer
provides specific services and interacts only with adjacent layers.
• This separation helps in managing and scaling the system more effectively.
• Presentation Layer
o Function: Handles user interaction and presentation of data. It is responsible for
user interfaces and client-side interactions.
o Responsibilities: Rendering data, accepting user inputs, and sending requests to
the underlying layers.
• Application Layer
o Function: Contains the business logic and application-specific functionalities.
o Responsibilities: Processes requests from the presentation layer, executes
business rules, and provides responses back to the presentation layer.
Advantages of Layered Architecture in Distributed Systems
• Separation of Concerns: Each layer focuses on a specific aspect of the system, making it easier to develop, test, and maintain.
• Modularity: Changes in one layer do not necessarily affect others, allowing for more flexible updates and enhancements.
• Reusability: Layers can be reused across different applications or services within the same system.
• Scalability: Different layers can be scaled independently to handle increased load or performance requirements.
Disadvantages of Layered Architecture in Distributed Systems
• Performance Overhead: Each layer introduces additional overhead due to data passing and processing between layers.
• Complexity: Managing interactions between layers and ensuring proper integration can be complex, particularly in large-scale systems.
• Rigidity: The strict separation of concerns might lead to rigidity, where changes in the system’s requirements could require substantial modifications across multiple layers.
Peer-to-Peer (P2P) Architecture is a decentralized network design where each node, or “peer,”
acts as both a client and a server, contributing resources and services to the network. This
architecture contrasts with traditional client-server models, where nodes have distinct roles as
clients or servers.
• In a P2P architecture, all nodes (peers) are equal participants in the network, each capable
of initiating and receiving requests.
• Peers collaborate to share resources, such as files or computational power, without relying
on a central server.
Key Features of Peer-to-Peer (P2P) Architecture in Distributed Systems
• Decentralization
o Function: There is no central server or authority. Each peer operates
independently and communicates directly with other peers.
o Advantages: Reduces single points of failure and avoids central bottlenecks,
enhancing robustness and fault tolerance.
• Resource Sharing
o Function: Peers share resources such as processing power, storage space, or
data with other peers.
o Advantages: Increases resource availability and utilization across the network.
• Scalability
o Function: The network can scale easily by adding more peers. Each new peer
contributes additional resources and capacity.
o Advantages: The system can handle growth in demand without requiring
significant changes to the underlying infrastructure.
• Self-Organization
o Function: Peers organize themselves and manage network connections
dynamically, adapting to changes such as peer arrivals and departures.
o Advantages: Facilitates network management and resilience without central
coordination.
Advantages of Peer-to-Peer (P2P) Architecture in Distributed Systems
• Fault Tolerance: The decentralized nature ensures that the failure of one or several peers does not bring down the entire network.
• Cost Efficiency: Eliminates the need for expensive central servers and infrastructure by
leveraging existing resources of the peers.
• Scalability: Easily accommodates a growing number of peers, as each new peer enhances
the network’s capacity.
Disadvantages of Peer-to-Peer (P2P) Architecture in Distributed Systems
• Security: Decentralization can make it challenging to enforce security policies and manage
malicious activity, as there is no central authority to oversee or control the network.
• Performance Variability: The quality of services can vary depending on the peers’
resources and their availability, leading to inconsistent performance.
• Complexity: Managing connections, data consistency, and network coordination without
central control can be complex and may require sophisticated protocols.
Data-Centric Architecture is an architectural style that focuses on the central management and
utilization of data. In this approach, data is treated as a critical asset, and the system is
designed around data management, storage, and retrieval processes rather than just the
application logic or user interfaces.
• The core idea of Data-Centric Architecture is to design systems where data is the primary
concern, and various components or services are organized to support efficient data
management and manipulation.
• Data is centrally managed and accessed by multiple applications or services, ensuring
consistency and coherence across the system.
• Disadvantages:
o Single Point of Failure: Centralized data repositories can become a bottleneck
or single point of failure, potentially impacting system reliability.
o Performance Overhead: Managing large volumes of centralized data can
introduce performance overhead, requiring robust infrastructure and
optimization strategies.
o Complexity: Designing and managing a centralized data system can be
complex, especially when dealing with large and diverse datasets.
o Scalability Challenges: Scaling centralized data systems to accommodate
increasing data volumes and access demands can be challenging and may
require significant infrastructure investment.
Now that we have briefly discussed some common architectural styles, let us take a look at how many distributed systems are actually organized by considering where software components are placed. Deciding on software components, their interaction, and their placement leads to an instance of a software architecture, also called a system architecture (Bass et al., 2003). We will discuss centralized and decentralized organizations, as well as various hybrid forms.
In a centralized architecture, every node is connected to a central coordination system, and whatever information the nodes wish to exchange is shared through that system. A centralized architecture does not automatically require that all functions reside in a single place or circuit, but rather that most parts are grouped together and none are repeated elsewhere, as they would be in a distributed architecture.
It encompasses the following types of architecture:
• Client-server
• Application Layering
Client Server
Processes in a distributed system are split into two (potentially overlapping) groups in the
fundamental client-server architecture. A server is a program that provides a particular service,
such as a database service or a file system service. A client is a process that requests a service from a server by sending it a request and subsequently waiting for the server's reply. This client-server interaction, also known as request-reply behavior, is shown in the figure below:
When the underlying network is reasonably dependable, as it is in many local-area networks, communication between a client and a server can be implemented using a straightforward connectionless protocol. In these circumstances, when a client requests a service, it simply packages a message for the server, specifying the service it wants along with the relevant input data, and sends the message to the server. The server, in turn, always waits for an incoming request, processes it, and packages the results in a reply message that is then sent back to the client.
Efficiency is a clear benefit of using a connectionless protocol. The request/reply protocol just sketched works as long as messages do not get lost or damaged. Unfortunately, it is not easy to make the protocol robust against occasional transmission errors. When no reply message is received, our only option is to let the client resubmit the request. The problem, however, is that the client cannot determine whether the original request message was lost or whether the transmission of the reply failed.
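A minimal sketch of this connectionless request-reply behavior, using UDP sockets (the address, port, and timeout are illustrative), shows both the simplicity and the ambiguity when no reply arrives:

```python
import socket

SERVER = ("127.0.0.1", 9999)  # illustrative address and port

def server():
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)  # connectionless
    sock.bind(SERVER)
    while True:
        request, client = sock.recvfrom(1024)      # await an incoming request
        sock.sendto(b"echo: " + request, client)   # process it and reply

def client(message: bytes):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(2.0)                  # guard against lost messages
    sock.sendto(message, SERVER)          # bundle the request and send it
    try:
        reply, _ = sock.recvfrom(1024)
        return reply
    except socket.timeout:
        # Resubmit? We cannot tell whether the request or the reply was lost.
        return None
```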
A reliable connection-oriented protocol is used as an alternative by many client-server systems.
Due to its relatively poor performance, this method is not totally suitable for local-area
networks, but it is ideal for wide-area networks where communication is inherently unreliable.
Application Layering
However, many individuals have urged a distinction between the three levels below,
effectively adhering to the layered architectural approach we previously described, given that
many client-server applications are intended to provide user access to databases:
Consider an Internet search engine. The user interface of a search engine is very simple: a user
types in a string of keywords and is subsequently presented with a list of titles of Web pages.
The back end is formed by a huge database of Web pages that have been prefetched and
indexed. The core of the search engine is a program that transforms the user’s string of
keywords into one or more database queries. It subsequently ranks the results into a list and
transforms that list into a series of HTML pages. Within the client-server model, this
information retrieval part is typically placed at the processing level.
Multitiered Architectures
Multitiered architectures organize complex distributed systems into different layers or tiers to improve performance and manageability. Each tier has a specific role, such as handling user interactions, processing data, or storing information.
• By dividing tasks among these tiers, systems can run more efficiently, be more secure, and
handle more users at once.
• This architecture is widely used in modern applications like web services, where front-end
interfaces, business logic, and databases are separated to enhance functionality and
scalability.
1. Presentation Tier:
The presentation tier, also known as the user interface tier, is responsible for presenting
information to users and accepting user inputs. Its main purpose is to handle user interactions
and display data in a human-readable format. This tier provides the interface through which
users interact with the application.
Components:
• User Interfaces: This includes web browsers, mobile applications, desktop applications, or
any other means through which users interact with the system.
• UI Components: Components such as forms, buttons, menus, and other graphical elements
that enable user interaction.
• Presentation Logic: Code responsible for controlling the behavior and appearance of the
user interface.
Example: Consider an e-commerce website. The presentation tier would include the web
pages users see when browsing products, adding items to their cart, and completing their
purchases. It encompasses the visual design, layout, and interactive elements like buttons and
forms.
2. Application Tier:
The application tier, also referred to as the business logic tier or middle tier, contains the core
logic and functionality of the application. It processes user requests, implements business
rules, performs computations, and coordinates the application's overall behavior. This tier
acts as an intermediary between the presentation tier and the data tier.
Components:
• Application Servers: These servers execute the application's code and handle business
logic tasks.
• APIs (Application Programming Interfaces): Interfaces that allow different parts of the
application to communicate with each other or with external systems.
• Business Logic: Code responsible for implementing the application's rules, workflows, and
algorithms.
• Middleware: Software components that facilitate communication and integration between
different parts of the system.
Example: In an online banking system, the application tier would manage tasks such as
validating user credentials, processing transactions, checking account balances, and generating
reports. It ensures that business rules are enforced, transactions are secure, and data integrity is
maintained.
3. Data Tier:
The data tier, also known as the persistence tier or backend tier, is responsible for managing
data storage, retrieval, and manipulation. It stores the application's data in a structured
format, making it accessible to other tiers as needed. This tier ensures data integrity, security,
and efficiency in data operations.
Components:
• Databases: Systems such as relational databases, NoSQL databases, or file systems used to
store and organize data.
• Data Access Layer: Components responsible for interacting with the database, executing
queries, and handling data retrieval and manipulation.
• Data Models: Structures that represent the organization and relationships of the data stored
in the database.
Example: In a social media platform, the data tier would manage user profiles, posts,
comments, and other content. It would include databases to store user information,
relationships between users, posts, comments, media files, and any other relevant data. The
data tier ensures that data is stored securely, retrieved efficiently, and remains consistent across
the application.
Communication Between Tiers
Communication between tiers in a multitiered architecture is essential for the overall
functionality and performance of the system. Here's how communication typically occurs
between the presentation, application, and data tiers:
1. Presentation Tier to Application Tier:
• User Requests: Users interact with the presentation tier by submitting requests through the
user interface.
• HTTP Requests: In web-based applications, user requests are typically sent over HTTP
(Hypertext Transfer Protocol) to the application tier.
• API Calls: The presentation tier may make API (Application Programming Interface) calls
to the application tier to retrieve data, submit forms, or perform other actions.
• Data Transfer: Data is transferred between the presentation tier and the application tier in
a format such as JSON (JavaScript Object Notation) or XML (eXtensible Markup
Language).
2. Application Tier to Data Tier:
• Business Logic Execution: The application tier executes business logic and processes user
requests.
• Data Access: When data is required, the application tier interacts with the data tier to
retrieve or update information stored in the database.
• Database Queries: The application tier constructs and sends queries to the data tier to
fetch or modify data.
• Data Manipulation: Upon receiving data from the data tier, the application tier may
perform additional processing or transformation before sending it back to the presentation
tier.
3. Data Tier to Application Tier:
• Query Processing: The data tier processes incoming queries from the application tier, such
as SELECT, INSERT, UPDATE, or DELETE operations.
• Data Retrieval: If the query involves data retrieval, the data tier accesses the database,
retrieves the requested data, and prepares it for transmission.
• Data Serialization: Data is serialized into a format suitable for transmission, such as JSON
or XML, before being sent back to the application tier.
• Response Sending: The data tier sends the response containing the requested data or the
result of the operation back to the application tier.
4. Application Tier to Presentation Tier:
• Data Processing: The application tier processes the data received from the data tier or user
inputs.
• Response Generation: Based on the processed data and business logic, the application tier
generates a response to be sent back to the presentation tier.
• Rendering: In web applications, the application tier may generate HTML (Hypertext
Markup Language) or other markup language to be rendered by the browser.
• Response Sending: The application tier sends the response back to the presentation tier
over the network.
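The sketch below illustrates the application tier receiving a JSON request from the presentation tier, consulting a stubbed-out data tier, and generating a JSON response; the account data and field names are hypothetical:

```python
import json

# Data tier (stubbed): in a real system this would be a database query.
ACCOUNTS = {"42": 1337.50}

def fetch_balance(account_id):
    return ACCOUNTS.get(account_id)

# Application tier: parse the request, apply logic, serialize the response.
def handle_request(raw_request: str) -> str:
    request = json.loads(raw_request)   # data arrives as JSON
    balance = fetch_balance(request["account_id"])
    response = {"account_id": request["account_id"], "balance": balance}
    return json.dumps(response)         # JSON back to the presentation tier

print(handle_request('{"account_id": "42"}'))  # {"account_id": "42", "balance": 1337.5}
```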
Scalability and Load Balancing in Multitiered Architectures
Scalability and load balancing are crucial considerations in multitiered architectures to ensure
that systems can handle increasing user loads while maintaining performance and reliability.
Here's how these concepts are applied in such architectures:
1. Scalability in Multitiered Architectures:
Scalability refers to the ability of a system to handle growing amounts of work by adding
resources or scaling out horizontally without negatively impacting performance or user
experience.
Types of Scalability:
• Vertical Scalability: Involves adding more resources, such as CPU, memory, or storage, to
a single server or instance. However, there is a limit to how much a single server can scale
vertically.
• Horizontal Scalability: Involves adding more instances of servers or nodes to distribute
the workload across multiple machines. This approach allows for virtually unlimited
scalability by adding more servers as needed.
Scalability in Each Tier:
• Presentation Tier: Scalability can be achieved by deploying multiple instances of web
servers or load balancers to handle increasing user requests.
• Application Tier: Applications can be designed to scale horizontally by deploying
multiple instances of application servers or microservices and using technologies like
containerization and orchestration (e.g., Docker and Kubernetes).
• Data Tier: Database scalability can be achieved through techniques like database sharding,
replication, or using distributed database systems to distribute data across multiple nodes.
2. Load Balancing in Multitiered Architectures:
Load balancing involves distributing incoming network traffic across multiple servers or
resources to optimize resource utilization, maximize throughput, minimize response time, and
ensure high availability.
Types of Load Balancers:
• Hardware Load Balancers: Dedicated physical appliances designed to distribute traffic
across servers. They offer high performance and scalability but can be expensive.
• Software Load Balancers: Implemented as software solutions that run on standard server
hardware or virtual machines. They provide flexibility and can be deployed in cloud
environments.
• DNS Load Balancing: Distributes traffic by resolving domain names to multiple IP
addresses, allowing DNS servers to direct clients to different servers based on predefined
policies.
Load Balancing Strategies:
• Round Robin: Distributes incoming requests equally among servers in a cyclic manner.
• Least Connections: Routes new requests to the server with the fewest active connections,
aiming to distribute the load evenly.
• IP Hash: Assigns requests to servers based on the client's IP address, ensuring that requests
from the same client are consistently routed to the same server.
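The three strategies can be sketched as follows; the server pool is illustrative, and a production balancer would also track server health and connection completion:

```python
import itertools
import zlib

servers = ["app-1", "app-2", "app-3"]  # illustrative server pool

# Round robin: hand out servers in cyclic order.
_rr = itertools.cycle(servers)
def round_robin():
    return next(_rr)

# Least connections: pick the server with the fewest active connections.
active = {s: 0 for s in servers}
def least_connections():
    server = min(active, key=active.get)
    active[server] += 1  # the caller must decrement when the request completes
    return server

# IP hash: a stable hash keeps a client pinned to the same server.
def ip_hash(client_ip: str):
    return servers[zlib.crc32(client_ip.encode()) % len(servers)]

print(round_robin(), least_connections(), ip_hash("203.0.113.7"))
```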
Fault Tolerance and Reliability in Multitiered Architectures
Fault tolerance and reliability are critical aspects of multitiered architectures, ensuring that
systems remain available and responsive even in the face of failures or errors. Here's how these
concepts are applied in such architectures:
1. Fault Tolerance in Multitiered Architectures:
Fault tolerance is the ability of a system to continue operating properly in the event of the
failure of some of its components. It involves designing systems to anticipate and recover from
failures gracefully without causing a complete system outage.
Techniques for Fault Tolerance:
• Redundancy: Introducing redundancy by replicating critical components or data across
multiple servers or locations. This ensures that if one component fails, another can take
over seamlessly.
• Failover: Implementing mechanisms to detect failures automatically and redirect traffic or
operations to backup components or systems. This minimizes downtime and ensures
continuity of service.
• Isolation: Isolating components to contain the impact of failures and prevent them from
spreading to other parts of the system. This can be achieved through techniques like
containerization or microservices architecture.
• Graceful Degradation: Designing systems to gracefully degrade performance or
functionality in the event of failures, rather than crashing or becoming unavailable entirely.
This ensures that users can still access essential features even under degraded conditions.
2. Reliability in Multitiered Architectures:
Reliability refers to the ability of a system to consistently perform its intended functions
accurately and without failure over a specified period. It involves building systems that can
withstand various types of stresses and environmental conditions without experiencing
unexpected failures.
Factors Affecting Reliability:
• Robust Design: Designing systems with robust architecture, well-defined interfaces, and
clear error handling mechanisms to minimize the likelihood of failures.
• Redundancy and Backup Systems: Implementing redundant components, backup
systems, and data replication to ensure continuous operation even in the face of hardware
or software failures.
• Monitoring and Alerting: Deploying monitoring tools and systems to continuously
monitor the health and performance of the system, detect anomalies or failures, and trigger
alerts for timely intervention.
• Regular Testing and Maintenance: Conducting regular testing, maintenance, and updates
to identify and address potential points of failure before they can affect system reliability.
Technologies for Improving Fault Tolerance and Reliability:
• Clustering: Creating clusters of servers or nodes that work together to provide fault
tolerance and high availability by automatically redistributing workloads and resources in
the event of failures.
• Load Balancing: Distributing incoming traffic across multiple servers to prevent
overloading and ensure that no single server becomes a single point of failure.
• Data Replication: Replicating data across multiple servers or data centers to ensure data
availability and integrity even in the event of hardware failures or disasters.
• Automatic Failover: Implementing automated failover mechanisms to detect and respond
to failures quickly, minimizing downtime and ensuring uninterrupted service.
Use Cases of Multitiered Architectures in Distributed System
Multitiered architectures are widely used in distributed systems across various industries and
applications to achieve scalability, reliability, and maintainability. Here are some common use
cases:
1. E-commerce Platforms:
• Presentation Tier: Web interfaces or mobile apps where users browse products, add items
to carts, and make purchases.
• Application Tier: Handles business logic such as inventory management, order
processing, and user authentication.
• Data Tier: Manages product catalogs, customer profiles, order histories, and transaction
data in databases.
• Use Case: E-commerce platforms like Amazon or eBay employ multitiered architectures to
efficiently handle a large number of users, transactions, and product listings while ensuring
scalability and reliability.
2. Social Media Platforms:
• Presentation Tier: User interfaces for posting content, interacting with friends, and
exploring feeds.
• Application Tier: Manages user authentication, friend connections, content
recommendation algorithms, and privacy settings.
• Data Tier: Stores user profiles, posts, comments, media files, and social graphs in
databases or distributed storage systems.
• Use Case: Social media platforms like Facebook, Twitter, and Instagram utilize multitiered
architectures to handle millions of active users, interactions, and content updates while
delivering personalized experiences and maintaining data consistency.
3. Healthcare Information Systems:
• Presentation Tier: Interfaces for healthcare providers to access patient records, schedule
appointments, and view medical images.
• Application Tier: Manages patient data, medical histories, treatment plans, and regulatory
compliance.
• Data Tier: Stores electronic health records (EHRs), diagnostic reports, medication
histories, and medical imaging data in secure databases.
• Use Case: Healthcare organizations leverage multitiered architectures to ensure the
confidentiality, availability, and integrity of patient information while facilitating seamless
communication and collaboration among healthcare professionals.
Decentralized architecture in distributed systems means that control and data are distributed and not necessarily controlled from a central point. This architecture increases system dependability, growth potential, and error resilience, owing to the absence of critical single points and to balanced load. Decentralized designs are commonly found in P2P networks, blockchain technologies, and large-scale Internet services.
Decentralized architecture means a conceptual design of a system’s components as elements
that are mutually integrated and act according to the general principles of a system without a
specific coordinative center.
• In such systems, control and data are present in different nodes, and each node is independent and can make decisions on its own.
• This architecture improves the system's scalability, fault tolerance, and robustness by eliminating a single centralized point of failure and by having peers work together. Decentralization is common practice in blockchain networks, P2P systems, and the distributed ledger technology field.
• Hybrid structures are notably deployed in collaborative distributed systems. The main issue in many of these systems is how to get started in the first place, for which a traditional client-server scheme is often deployed. Once a node has joined the system, it can use a fully decentralized scheme for collaboration. To make matters concrete, let us consider the BitTorrent file-sharing system (Cohen, 2003). BitTorrent is a peer-to-peer file downloading system. Its principal working is shown in Fig. 2-14. The basic idea is that when an end user is looking for a file, he downloads chunks of the file from other users until the downloaded chunks can be assembled together, yielding the complete file.
• An important design goal was to ensure collaboration. In most file-sharing systems, a
significant fraction of participants merely download files but otherwise contribute close
to nothing (Adar and Huberman, 2000; Saroiu et al., 2003; and Yang et al., 2005). To this
end, a file can be downloaded only when the downloading client is providing content to
someone else. We will return to this "tit-for-tat" behavior shortly.
2.3 ARCHITECTURES VERSUS MIDDLEWARE
When considering the architectural issues we have discussed so far, a question that comes to
mind is where middleware fits in.
An important purpose is to provide a degree of distribution transparency, that is, to a certain extent hiding the distribution of data, processing, and control from applications. What is
commonly seen in practice is that middleware systems actually follow a specific architectural
style.
For example, many middleware solutions have adopted an object-based architectural style, such as CORBA (OMG, 2004a). Others, like TIB/Rendezvous (TIBCO, 2005), provide middleware that follows an event-based architectural style. In later chapters, we will come across more examples of architectural styles. Having middleware molded according to a specific architectural style has the benefit that designing applications may become simpler. However, an obvious drawback is that the middleware may no longer be optimal for what an application developer had in mind.
For example, CORBA initially offered only objects that could be invoked by remote clients.
Later, it was felt that having only this form of interaction was too restrictive, so that other
interaction patterns such as messaging were added. Obviously, adding new features can easily
lead to bloated middleware solutions. In addition, although middleware is meant to provide
distribution transparency, it is generally felt that specific solutions should be adaptable to
application requirements. One solution to this problem is to make several versions of a
middleware system, where each version is tailored to a specific class of applications. An
approach that is generally considered better is to make middleware systems such that they are
easy to configure, adapt, and customize as needed by an application. As a result, systems are now
being developed in which a stricter separation between policies and mechanisms is being made.
This has led to several mechanisms by which the behavior of middleware can be modified
(Sadjadi and McKinley, 2003). Let us take a look at some of the commonly followed approaches.
2.3.1 Interceptors
Conceptually, an interceptor is nothing but a software construct that will break the usual flow of
control and allow other (application specific) code to be executed. To make interceptors generic
may require a substantial implementation effort, as illustrated in Schmidt et al. (2000), and it is
unclear whether in such cases generality should be preferred over restricted applicability and
simplicity. Also, in many cases having only limited interception facilities will improve
management of the software and the distributed system as a whole. To make matters concrete, consider interception as supported in many object-based distributed systems. The basic idea is simple: an object A can call a method that belongs to an object B, while the latter resides on a different machine than A. As we explain in detail later in the book, such a remote-object invocation is carried out as a three-step approach:
1. Object A is offered a local interface that is exactly the same as the interface offered by object B. A simply calls the method available in that interface.
2. The call by A is transformed into a generic object invocation, made possible through a general
object-invocation interface offered by the middleware at the machine where A resides.
3. Finally, the generic object invocation is transformed into a message that is sent through the
transport-level network interface as offered by A's local operating system.
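A minimal sketch of an interceptor in Python (the class and hook names are hypothetical): a proxy breaks the usual flow of control, runs application-specific code, and then forwards the invocation, much as a request-level interceptor could around step 2 above:

```python
class Interceptor:
    """Wraps a target object and runs a hook before every method call."""
    def __init__(self, target, before):
        self._target = target   # the object whose invocations we intercept
        self._before = before   # application-specific code to run first

    def __getattr__(self, name):
        method = getattr(self._target, name)
        def wrapper(*args, **kwargs):
            self._before(name, args)        # break the usual flow of control
            return method(*args, **kwargs)  # then forward the invocation
        return wrapper

class RemoteObject:          # stand-in for object B's local interface
    def add(self, i, j):
        return i + j

proxy = Interceptor(RemoteObject(),
                    before=lambda m, a: print(f"intercepted {m}{a}"))
print(proxy.add(2, 3))  # prints "intercepted add(2, 3)" and then 5
```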
There are many different views on self-managing systems, but what most have in common (either explicitly or implicitly) is the assumption that adaptations take place by means of one or more feedback control loops. Accordingly, systems that are organized by means of such loops are referred to as feedback control systems. Feedback control has long been applied in various engineering fields, and its mathematical foundations are gradually also finding their way into computing systems (Hellerstein et al., 2004; and Diao et al., 2005). For self-managing systems, the architectural issues are initially the most interesting. The basic idea behind this organization is quite simple, as shown in Fig. 2-16.
The core of a feedback control system is formed by the components that need to be managed.
These components are assumed to be driven through controllable input parameters, but their
behavior may be influenced by all kinds of uncontrollable input, also known as disturbance or
noise input. Although disturbance will often come from the environment in which a distributed
system is executing, it may well be the case that unanticipated component interaction causes
unexpected behavior.
There are essentially three elements that form the feedback control loop. First, the system itself
needs to be monitored, which requires that various aspects of the system need to be measured. In
many cases, measuring behavior is easier said than done. For example, round-trip delays in the
Internet may vary wildly, and also depend on what exactly is being measured. In such cases,
accurately estimating a delay may be difficult indeed. Matters are further complicated when a node A needs to estimate the latency between two other completely different nodes B and C, without being able to intrude on either of those nodes. For reasons such as this, a feedback control loop generally contains a logical metric estimation component.
Our first model for communication in distributed systems is the remote procedure call (RPC). An RPC
aims at hiding most of the intricacies of message passing, and is ideal for client-server applications.
To make it easier to deal with the numerous levels and issues involved in communication, the
International Standards Organization (ISO) developed a reference model that clearly identifies the
various levels involved, gives them standard names, and points out which level should do which
job. This model is called the Open Systems Interconnection Reference Model (Day and
Zimmerman, 1983), usually abbreviated as ISO OSI or sometimes just the OSI model.
The OSI model is designed to allow open systems to communicate. An open system is one that is
prepared to communicate with any other open system by using standard rules that govern the
format, contents, and meaning of the messages sent and received. These rules are formalized in
what are called protocols.
In the OSI model, communication is divided up into seven levels or layers, as shown in Fig. 4-1.
Each layer deals with one specific aspect of the communication. In this way, the problem can be
divided up into manageable pieces, each of which can be solved independent of the others. Each
layer provides an interface to the one above it.
Lower-Level Protocols
We start with discussing the three lowest layers of the OSI protocol suite. Together, these layers
implement the basic functions that encompass a computer network.
The physical layer protocol deals with standardizing the electrical, mechanical, and signalling interfaces so that when one machine sends a 0 bit it is actually received as a 0 bit and not a 1 bit.
Many physical layer standards have been developed (for different media), for example, the RS-232-
C standard for serial communication lines.
Real communication networks are subject to errors, so some mechanism is needed to detect and correct them. This mechanism is the main task of the data link layer. What it does is group the bits into units, sometimes called frames, and see that each frame is correctly received. The data link layer does its work by putting a special bit pattern at the start and end of each frame to mark them, as well as computing a checksum by adding up all the bytes in the frame in a certain way. The data link layer appends the checksum to the frame. When the frame arrives, the receiver recomputes the checksum from the data and compares the result to the checksum following the frame. If the two agree, the frame is considered correct and is accepted. If they disagree, the receiver asks the sender to retransmit it. Frames are assigned sequence numbers (in the header), so everyone can tell which is which.
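A toy version of this framing-and-checksum scheme (an additive checksum for illustration; real data link layers use CRCs) could look like this:

```python
def checksum(data: bytes) -> int:
    # Add up all the bytes, modulo 2**16 (illustrative; real links use CRCs).
    return sum(data) % 65536

def make_frame(seq: int, payload: bytes) -> bytes:
    body = seq.to_bytes(2, "big") + payload          # sequence number in the header
    return body + checksum(body).to_bytes(2, "big")  # checksum appended at the end

def frame_ok(frame: bytes) -> bool:
    body, received = frame[:-2], int.from_bytes(frame[-2:], "big")
    return checksum(body) == received  # if False, ask the sender to retransmit

frame = make_frame(seq=7, payload=b"hello")
print(frame_ok(frame))                               # True: frame accepted
print(frame_ok(frame[:-3] + b"\x00" + frame[-2:]))   # False: corrupted in transit
```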
The question of how to choose the best path is called routing, and is essentially the primary task
of the network layer. The problem is complicated by the fact that the shortest route is not always
the best route. What really matters is the amount of delay on a given route, which, in turn, is
related to the amount of traffic and the number of messages queued up for transmission over the
various lines. The delay can thus change over the course of time. Some routing algorithms try to
adapt to changing loads, whereas others are content to make decisions based on long-term
averages. At present, the most widely used network protocol is the connectionless IP (Internet
Protocol), which is part of the Internet protocol suite. An IP packet (the technical term for a
message in the network layer) can be sent without any setup. Each IP packet is routed to its
destination independent of all others. No internal path is selected and remembered.
Transport Protocols
The transport layer forms the last part of what could be called a basic network protocol stack, in
the sense that it implements all those services that are not provided at the interface of the network
layer, but which are reasonably needed to build network applications. Packets can be lost on the
way from the sender to the receiver. Although some applications can handle their own error
recovery, others prefer a reliable connection. The job of the transport layer is to provide this
service. The idea is that the application layer should be able to deliver a message to the transport
layer with the expectation that it will be delivered without loss. Upon receiving a message from
the application layer, the transport layer breaks it into pieces small enough for transmission,
assigns each one a sequence number, and then sends them all.
The Internet transport protocol is called TCP (Transmission Control Protocol) and is described in detail in Comer (2006). The combination TCP/IP is now used as a de facto standard for network communication. The Internet protocol suite also supports a connectionless transport protocol called UDP (User Datagram Protocol), which is essentially just IP with some minor additions.
User programs that do not need a connection-oriented protocol normally use UDP. Additional
transport protocols are regularly proposed. For example, to support real-time data transfer, the
Real-time Transport Protocol (RTP) has been defined. RTP is a framework protocol in the sense
that it specifies packet formats for real-time data without providing the actual mechanisms for
guaranteeing data delivery.
The session layer is essentially an enhanced version of the transport layer. It provides dialog
control, to keep track of which party is currently talking, and it provides synchronization facilities.
The latter are useful to allow users to insert checkpoints into long transfers, so that in the event of
a crash, it is necessary to go back only to the last checkpoint, rather than all the way back to the
beginning.
In the presentation layer it is possible to define records containing fields like these and then have
the sender notify the receiver that a message contains a particular record in a certain format. This
makes it easier for machines with different internal representations to communicate with each
other.
The OSI application layer was originally intended to contain a collection of standard network
applications such as those for electronic mail, file transfer, and terminal emulation. By now, it has
become the container for all applications and protocols that in one way or the other do not fit into
one of the underlying layers. From the perspective of the OSI reference model, virtually all
distributed systems are just applications.
Middleware Protocols
Middleware is an application that logically lives (mostly) in the application layer, but which contains
many general-purpose protocols that warrant their own layers, independent of other, more
specific applications. There are numerous protocols to support a variety of middleware services.
Authentication protocols are not closely tied to any specific application, but instead, can be
integrated into a middleware system as a general service. Likewise, authorization protocols by
which authenticated users and processes are granted access only to those resources for which
they have authorization tend to have a general, application-independent nature. Middleware
communication protocols support high-level communication services.
1.1.2 Types of Communication
In contrast to persistent communication, with transient communication a message is stored by the communication system only as long as the sending and receiving applications are executing. More precisely, in terms of Fig. 4-4, if the middleware cannot deliver a message due to a transmission interrupt, or because the recipient is currently not active, the message will simply be discarded. Typically, all transport-level communication services offer only transient communication. In this case, the communication system consists of traditional store-and-forward routers. If a router cannot deliver a message to the next router or the destination host, it will simply drop the message.
Birrell and Nelson (1984) introduced Remote Procedure Call (RPC), a simpler way to handle
communication in distributed systems. Instead of sending explicit messages, RPC lets a program
on one machine call a procedure on another machine. The caller pauses, and the procedure runs
on the remote machine. Data is passed through parameters, and results are returned seamlessly,
hiding the complexity of message passing from the programmer.
However, RPC has challenges. The caller and remote procedure run in different memory spaces,
making parameter passing tricky, especially between different machines. Crashes on either
machine also create complications. Despite these issues, RPC is widely used in distributed systems
because it simplifies communication.
In a conventional (local) procedure call, the following happens:
- The caller pushes parameters onto the stack in reverse order (last parameter first).
- After the procedure runs, the return value is placed in a register, the return address is removed, and control goes back to the caller.
- The caller removes the parameters, restoring the stack to its original state.
With RPC, by contrast, there is no direct stack manipulation across machines; communication happens behind the scenes.
RPC aims to make remote procedure calls look like local ones, keeping the process transparent.
For example, when a program reads data from a file, the `read` call works the same whether it’s
local or remote. In a local system, `read` is a library procedure that interacts with the operating
system. With RPC, a client stub replaces the local `read`. It looks the same to the programmer but
works differently: instead of fetching data locally, it packs the parameters into a message, sends it
to the server, and waits for a reply. This hides the complexity of remote communication.
When the message reaches the server, the **server stub** takes over. It unpacks the parameters
and calls the server procedure as if it were a local call. The server performs its task (e.g., filling a
buffer with data) and returns the result to the server stub. The stub then packs the result into a
message and sends it back to the client. On the client side, the **client stub** receives the
message, unpacks the result, and passes it to the caller. To the client, the remote call feels like a
local procedure call, hiding all the complexity of message passing. This transparency is the key
strength of RPC.
1. The client procedure calls the client stub in the normal way.
2. The client stub builds a message and calls the local operating system.
3. The client's OS sends the message to the remote OS.
4. The remote OS gives the message to the server stub.
5. The server stub unpacks the parameters and calls the server.
6. The server does the work and returns the result to the stub.
7. The server stub packs it in a message and calls its local OS.
8. The server's OS sends the message to the client's OS.
9. The client's OS gives the message to the client stub.
10. The client stub unpacks the result and returns it to the client.
Packing parameters into a message is called parameter marshalling. For example, a remote
procedure `add(i, j)` takes two integers and returns their sum. The client stub packs the parameters
and procedure name into a message. When the message reaches the server, the server stub
identifies the procedure and makes the call. After the server computes the result, the server stub
packs it into a message and sends it back to the client stub, which unpacks it and returns the result
to the client. This works well if the client and server machines are identical, but problems arise if
they use different data representations (e.g., ASCII vs. EBCDIC, little-endian vs. big-endian). These
differences can cause errors in interpreting parameters, requiring additional steps to ensure
compatibility.
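A minimal sketch of marshalling for the `add(i, j)` example, using an explicit network byte order so that little- and big-endian machines interpret the message identically (the wire format itself is an assumption for illustration):

```python
import struct

# Client stub: marshal the procedure name and both integers into a message.
def marshal_add(i: int, j: int) -> bytes:
    name = b"add"
    # '!' forces network (big-endian) byte order on every machine.
    return struct.pack("!H", len(name)) + name + struct.pack("!ii", i, j)

# Server stub: unmarshal the message and dispatch to the local procedure.
def dispatch(message: bytes) -> bytes:
    (name_len,) = struct.unpack_from("!H", message, 0)
    name = message[2:2 + name_len].decode()
    i, j = struct.unpack_from("!ii", message, 2 + name_len)
    if name == "add":
        return struct.pack("!i", i + j)  # marshal the result for the reply
    raise ValueError("unknown procedure: " + name)

reply = dispatch(marshal_add(4, 7))
print(struct.unpack("!i", reply)[0])  # 11
```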
Passing pointers or references in RPC is challenging because pointers are only valid within the
address space of the process using them. For example, a buffer address on the client (like 1000)
won’t work on the server. One solution is to copy the data (e.g., an array) into the message and
send it to the server. The server stub uses a local pointer to access the data, and changes are sent
back to the client. This replaces call-by-reference with copy/restore, which works well for simple
cases like arrays. An optimization skips unnecessary copies if the stubs know whether the data is
input or output. However, handling complex data structures (e.g., graphs) remains difficult, as it
may require special code or additional client-server communication.
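A minimal sketch of the copy/restore idea, again with a hypothetical `fill_buffer` server procedure; the message transport is simulated, and the point is only that an array's contents, not its address, cross the machine boundary:

```python
import pickle

def fill_buffer(buf):                     # server procedure: mutates its argument
    for k in range(len(buf)):
        buf[k] = k * k

def fill_buffer_stub(buf):
    # Copy: the array's contents travel in the message, because the
    # client's pointer (memory address) is meaningless on the server.
    message = pickle.dumps(buf)
    server_copy = pickle.loads(message)   # the server stub's local copy
    fill_buffer(server_copy)              # server works via a local pointer
    reply = pickle.dumps(server_copy)
    # Restore: the modified contents are copied back into the caller's
    # buffer, imitating call-by-reference.
    buf[:] = pickle.loads(reply)

data = [0] * 5
fill_buffer_stub(data)
print(data)                               # [0, 1, 4, 9, 16]
```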
Remote procedure calls (RPC) and remote object invocations improve access transparency in
distributed systems by hiding communication details. However, they aren’t always suitable,
especially when the receiver isn’t active or when synchronous communication (blocking the client)
isn’t ideal. In such cases, messaging is used as an alternative. Messaging systems can work
synchronously (both parties active) or asynchronously (via message queues), allowing
communication even if the other party isn’t running at the time. This flexibility makes messaging
a powerful tool for distributed systems.
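The asynchronous, queue-based style can be sketched with an in-process queue (a real system would use persistent queues provided by messaging middleware); note that the sender deposits its messages before the receiver even starts:

```python
import queue
import threading

mailbox = queue.Queue()      # stands in for a persistent message queue

# The sender deposits messages without caring whether the receiver is
# running yet -- this is the asynchronous, queue-based style.
for n in range(3):
    mailbox.put(f"update-{n}")

def receiver():
    # The receiver may start long after the messages were sent.
    while True:
        msg = mailbox.get()
        print("received:", msg)
        mailbox.task_done()

threading.Thread(target=receiver, daemon=True).start()
mailbox.join()               # wait until all queued messages are handled
```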
Many distributed systems and applications use the basic messaging model provided by the
transport layer. To understand messaging in middleware, we first look at transport-level sockets,
like Berkeley Sockets, which simplify network programming. Sockets act as communication
endpoints, allowing applications to send and receive data over a network. They abstract the details
of the underlying transport protocol, making it easier to write portable applications. For example,
in TCP, servers typically use four key primitives:
1. Socket: Creates a new communication endpoint for a specific transport protocol.
2. Bind: Associates a local address (e.g., IP and port) with the socket.
3. Listen: Tells the operating system to reserve buffers for a maximum number of pending connection requests.
4. Accept: Blocks the caller until a connection request arrives, then creates a new socket for that connection.
These steps allow servers to receive messages on specific addresses and ports, while the operating
system manages the resources needed for communication. Sockets and similar interfaces, like XTI,
standardize network programming, making it easier to build distributed systems.
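For illustration, the four server-side primitives map directly onto Python's `socket` module (the address `127.0.0.1:9090` is an arbitrary choice for this sketch):

```python
import socket

# Socket: create a TCP communication endpoint.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Bind: associate a local address (IP and port) with the socket.
srv.bind(("127.0.0.1", 9090))

# Listen: ask the OS to queue incoming connection requests.
srv.listen(5)

print("waiting for a client on port 9090 ...")
# Accept: block until a connection request arrives.
conn, addr = srv.accept()
print("connected by", addr)
conn.sendall(b"hello from the server\n")
conn.close()
srv.close()
```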
Support for time-dependent information, like audio or video, is called **continuous media**.
It relies on correct timing between data items, such as playing audio samples or displaying
video frames at precise intervals. In contrast, **discrete media**, like text or still images, doesn't depend on timing for its interpretation. Continuous media thus requires maintaining temporal relationships (e.g., 44,100 audio samples per second, or a new video frame every 30-40 ms), whereas discrete media does not. This distinction is key to handling the two kinds of data in distributed systems.
Data Stream
Distributed systems use data streams to handle time-dependent information, which can be
discrete (like files or TCP connections) or continuous (like audio or video). Timing is critical for
continuous streams, leading to three transmission modes:
1. Asynchronous transmission mode: data items are transmitted one after the other, with no further timing constraints (suitable for discrete streams such as file transfer).
2. Synchronous transmission mode: a maximum end-to-end delay is defined for each data unit; data may arrive earlier, but not later.
3. Isochronous transmission mode: data units must be transferred on time, with both a maximum and a minimum end-to-end delay (bounded jitter), as needed for audio and video.
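As a sketch of what isochronous delivery demands from a receiver, a playout loop can pace frames at fixed deadlines (25 frames/sec is an arbitrary choice for this example):

```python
import time

FRAME_INTERVAL = 1 / 25        # isochronous: one video frame every 40 ms

def play(frames):
    # Display each frame on time: neither too late (max delay) nor
    # too early (min delay) -- i.e., with bounded jitter.
    next_deadline = time.monotonic()
    for frame in frames:
        next_deadline += FRAME_INTERVAL
        time.sleep(max(0.0, next_deadline - time.monotonic()))
        print("frame", frame)  # stand-in for actually rendering it

play(range(5))
```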
Streams can be simple (a single sequence of data units) or complex (multiple synchronized substreams, like stereo audio, or a movie with video, audio, and subtitles). Synchronization is vital for complex streams to ensure proper playback.
For streaming stored data (not live), systems use a client-server architecture, focusing on
compression to save storage and bandwidth, quality control, and synchronization. These
elements are key for effective multimedia streaming.
Timing (and other non-functional) requirements are generally expressed as Quality of Service
(QoS) requirements.
From an application's perspective, in many cases it boils down to specifying a few important properties, such as the required bit rate, the maximum end-to-end delay, and the maximum delay variance, or jitter (Halsall, 2001).
Synchronization between streams takes place in two forms:
1. Between a discrete and a continuous stream: for example, syncing audio with slides in a presentation.
2. Between continuous streams: like syncing video and audio in a movie (lip-sync), or the two channels of stereo audio (which must stay within 20 µsec of each other for clear sound).
Synchronization happens at the level of data units. For CD-quality audio, synchronization could in principle occur every 23 µsec (one sample at 44,100 samples/sec). For syncing video with audio, coarser units work well: one NTSC video frame every 33 msec, grouping the corresponding audio into chunks of 1,470 samples. In practice, even larger units of 40-80 msec are often acceptable.
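The arithmetic behind these synchronization units is easy to check:

```python
SAMPLE_RATE = 44_100          # CD-quality audio samples per second
FRAME_RATE = 30               # NTSC video frames per second (approx.)

sample_interval = 1 / SAMPLE_RATE          # ~22.7 microseconds
frame_interval = 1 / FRAME_RATE            # ~33.3 milliseconds
samples_per_frame = SAMPLE_RATE // FRAME_RATE

print(f"audio sample interval: {sample_interval * 1e6:.1f} µs")
print(f"video frame interval:  {frame_interval * 1e3:.1f} ms")
print(f"audio samples per video frame: {samples_per_frame}")   # 1470
```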
An important topic in communication in distributed systems is the support for sending data to
multiple receivers, also known as multicast communication.
Overlay Construction
From the high-level description given above, it should be clear that although building a tree
by itself is not that difficult once we have organized the nodes into an overlay, building an
efficient tree may be a different story. Note that in our description so far, the selection of nodes
that participate in the tree does not take into account any performance metrics: it is purely
based on the (logical) routing of messages through the overlay.
To understand the problem at hand, take a look at Fig. 4-31, which shows a small set of four
nodes that are organized in a simple overlay network, with node A forming the root of a
multicast tree. The costs for traversing a physical link are also shown. The quality of an
application-level multicast tree is generally measured by three different metrics: link stress,
stretch, and tree cost. Link stress is defined per link and counts how often a packet crosses the
same link.
The stretch or Relative Delay Penalty (RDP) measures the ratio between the delay two nodes experience in the overlay and the delay those two nodes would experience in the underlying network. Finally, the tree cost is a global metric, generally related to minimizing the aggregated link costs.
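A small sketch of how these three metrics could be computed for a hypothetical topology (the link costs below are made up and are not those of Fig. 4-31; hosts A, B, C hang off routers Ra and Rb):

```python
from collections import Counter

# Hypothetical physical topology with per-link costs (delays).
cost = {frozenset(e): c for e, c in [
    (("A", "Ra"), 1), (("B", "Ra"), 1),
    (("Ra", "Rb"), 5), (("C", "Rb"), 1)]}

def links(path):
    return [frozenset(hop) for hop in zip(path, path[1:])]

def delay(path):
    return sum(cost[l] for l in links(path))

# Multicast tree rooted at A: A sends to B, and B forwards to C.
overlay_routes = [["A", "Ra", "B"], ["B", "Ra", "Rb", "C"]]

# Link stress: how often each physical link carries the same packet.
stress = Counter(l for route in overlay_routes for l in links(route))
print(max(stress.values()))           # link B-Ra is crossed twice -> 2

# Stretch (RDP) for receiver C: overlay delay vs. direct network delay.
overlay_delay = delay(["A", "Ra", "B"]) + delay(["B", "Ra", "Rb", "C"])
network_delay = delay(["A", "Ra", "Rb", "C"])
print(overlay_delay / network_delay)  # 9 / 7 ~= 1.29

# Tree cost: aggregated cost of the distinct links the tree uses.
print(sum(cost[l] for l in stress))   # 1 + 1 + 5 + 1 = 8
```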
Epidemic Protocols
The main goal of epidemic protocols is to rapidly propagate information among a large collection of nodes using only local information. In other words, there is no central component by which information dissemination is coordinated.
Epidemic algorithms are inspired by how diseases spread, but instead of infections, they
spread information in distributed systems. The goal is to ensure that all nodes receive updates
quickly. A node holding new data is "infected," while one without it is "susceptible." A node that has received the data but will no longer share it is "removed."
A common model, anti-entropy, involves nodes randomly exchanging updates in three ways:
pushing updates, pulling updates, or both (push-pull). The push-only method is slow when
many nodes are already infected, as they struggle to find susceptible nodes. The pull-based
approach works better, as susceptible nodes actively seek updates.
Push-pull is the fastest and ensures updates spread in O(log(N)) rounds, making it scalable. A
variation, gossiping, mimics real-life rumor spreading. Nodes share updates randomly but may
stop if they find others already updated, just like people lose interest in repeating old gossip.
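A rough simulation of push-pull anti-entropy (a sketch, not a faithful synchronous-round model) shows how quickly an update reaches all N nodes compared with log2(N):

```python
import math
import random

def push_pull_rounds(n, seed=0):
    """Simulate push-pull anti-entropy until every node is infected."""
    random.seed(seed)
    infected = {0}                         # node 0 holds the new update
    rounds = 0
    while len(infected) < n:
        rounds += 1
        for node in range(n):
            peer = random.randrange(n)     # pick a random exchange partner
            # Push-pull: after the exchange, both nodes know the update
            # if either of them knew it before.
            if node in infected or peer in infected:
                infected.update((node, peer))
    return rounds

n = 10_000
print(push_pull_rounds(n), "rounds vs. log2(n) =", round(math.log2(n), 1))
```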
Removing Data
Epidemic algorithms are extremely good for spreading updates. However, they have a rather
strange side-effect: spreading the deletion of a data item is hard. The essence of the problem lies in the fact that deletion of a data item destroys all information on that item. Consequently, when a data item is simply removed from a node, that node will eventually receive old copies of the data item and interpret those as updates on something it did not have before. The trick is to record the deletion as just another update, keeping a record of it in the form of a death certificate, so that old copies arriving later are recognized as stale.
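A minimal sketch of the death-certificate idea, using version numbers and a tombstone value (the names here are hypothetical):

```python
# Each record carries a version number; a deletion is recorded as just
# another update whose value is a tombstone ("death certificate").
TOMBSTONE = object()

store = {"x": (3, 42)}                 # key -> (version, value)

def apply_update(key, version, value):
    # Accept the update only if it is newer than what we hold; a
    # tombstone therefore wins over stale copies that gossip back in.
    current = store.get(key)
    if current is None or version > current[0]:
        store[key] = (version, value)

apply_update("x", 4, TOMBSTONE)        # delete x at version 4
apply_update("x", 3, 42)               # a stale copy arrives later
print(store["x"][1] is TOMBSTONE)      # True: the deletion survives
```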
Applications
To finalize this presentation, let us take a look at some interesting applications of epidemic
protocols. Gossiping can be used to discover nodes that have a few outgoing wide-area links,
to subsequently apply directional gossiping. Another interesting application area is simply collecting, or actually aggregating, information.
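As an illustration of gossip-based aggregation, a classic sketch: if every pair of gossiping nodes replaces both of their values with the average, all values converge to the global mean without any central coordinator:

```python
import random

random.seed(1)
values = [random.uniform(0, 100) for _ in range(50)]   # one value per node
true_mean = sum(values) / len(values)

for _ in range(2000):                  # repeated random pairwise exchanges
    a, b = random.sample(range(len(values)), 2)
    values[a] = values[b] = (values[a] + values[b]) / 2

print(round(true_mean, 3), round(values[0], 3))   # both ~ the same mean
```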