Distributed Computing Overview
Agenda
• What is distributed computing
• Why distributed computing
• Common Architecture
• Best Practice
What is a Distributed System
• A distributed system is one in which
components located at networked computers
communicate and coordinate their actions
only by passing messages.
• Three Examples
– The internet
– An intranet which is a portion of the internet
managed by an organization
– Mobile and ubiquitous computing
The Internet
• The Internet is a very large distributed system.
• The implementation of the Internet and the services
that it supports has entailed the development of
practical solutions to many distributed system issues.
Intranets
• An intranet is a portion of the Internet that is
separately administered and has a boundary that can
be configured to enforce local security policies
• The main issues arising in the design of components
for use in intranets are: file services, firewalls, cost.
Mobile and ubiquitous computing
• The portability of devices such as laptop
computers, PDAs, mobile phones, and even
refrigerators, together with their ability to
connect conveniently to networks in different
places, makes mobile computing possible.
• Ubiquitous computing is the harnessing of
many small cheap computational devices that
are present in users’ physical environments,
including the home, office and elsewhere.
Significant Consequences of DS
• Concurrency
– The capacity of the system to handle shared
resources can be increased by adding more
resources to the network.
• No global clock
– The only communication is by sending messages
through a network.
• Independent failures
– The programs may not be able to detect whether
the network has failed or has become unusually
slow.
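Although there is no global clock, events can still be ordered. One standard technique (not covered in these slides, shown here for illustration) is Lamport's logical clock, sketched minimally below:

```python
import threading

class LamportClock:
    """Logical clock: orders events without any global physical clock."""
    def __init__(self):
        self.time = 0
        self._lock = threading.Lock()

    def tick(self):
        """Local event: advance the clock."""
        with self._lock:
            self.time += 1
            return self.time

    def send(self):
        """Return a timestamp to attach to an outgoing message."""
        return self.tick()

    def receive(self, msg_time):
        """On receipt, jump ahead of the message's timestamp."""
        with self._lock:
            self.time = max(self.time, msg_time) + 1
            return self.time
```

If process A sends a message stamped 1, the receiver's clock jumps to at least 2, so the receive event is always ordered after the send.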
Resource
• The term "resource" is a rather abstract one, but it
best characterizes the range of things that can
usefully be shared in a networked computer system.
• It extends from hardware components such as disks
and printers to software-defined entities such as files,
databases and data objects of all kinds.
Important Terms of Web
• Services
– A distinct part of a computer system that manages a
collection of related resources and presents their
functionality to users and applications.
– http, telnet, pop3...
• Server
– A running program (a process) on a networked computer
that accepts requests from programs running on other
computers to perform a service, and responds
appropriately.
– IIS, Apache...
• Client
– The requesting processes.
Challenges
• Heterogeneity
• Openness
• Security
• Scalability
• Failure handling
• Concurrency
• Transparency
Heterogeneity
• Different networks, hardware, operating
systems, programming languages,
developers.
• Protocols are set up to overcome this
heterogeneity.
• Middleware: a software layer that provides a
programming abstraction as well as masking
the heterogeneity.
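As a toy illustration of how middleware masks heterogeneity, the sketch below marshals an invocation into a language-neutral wire format. JSON is an assumption here, chosen only for brevity; real middleware uses formats such as CORBA's CDR or XDR:

```python
import json

def marshal(operation, args):
    """Encode an invocation into a neutral wire format, so a server
    written in another language on different hardware can decode it."""
    return json.dumps({"op": operation, "args": args}).encode("utf-8")

def unmarshal(wire_bytes):
    """Decode a received message back into native data structures."""
    msg = json.loads(wire_bytes.decode("utf-8"))
    return msg["op"], msg["args"]
```

Both sides depend only on the agreed wire format, not on each other's language, byte order, or operating system.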
Openness
• The openness of DS is determined primarily by
the degree to which new resource-sharing
services can be added and be made available
for use by a variety of client programs.
• Open systems are characterized by the fact that
their key interfaces are published.
• Open DS are based on the provision of a uniform
communication mechanism and published
interfaces for access to shared resources.
• Open DS can be constructed from
heterogeneous hardware and software.
Security
• Security for information resources has three
components:
– Confidentiality: protection against disclosure to
unauthorized individuals.
– Integrity: protection against alteration or
corruption.
– Availability: protection against interference with the
means to access the resources.
• Two new security challenges:
– Denial of service attacks (DoS).
– Security of mobile code.
Scalability
• A system is described as scalable if it
remains effective when there is a significant
increase in the number of resources and the
number of users.
• Challenges:
– Controlling the cost of resources or money.
– Controlling the performance loss.
– Preventing software resources from running out.
– Avoiding performance bottlenecks.
Scaling Techniques (1)
Figure 1.4: The difference between letting (a) a server or (b) a
client check forms as they are being filled.
Scaling Techniques (2)
Figure 1.5: An example of dividing the DNS name space into zones.
Failure handling
• When faults occur in hardware or software,
programs may produce incorrect results or
they may stop before they have completed
the intended computation.
• Techniques for dealing with failures:
– Detecting failures
– Masking failures
– Tolerating failures
– Recovering from failures
– Redundancy
Concurrency
• There is a possibility that several clients will attempt
to access a shared resource at the same time.
• Any object that represents a shared resource in a
distributed system must be responsible for ensuring
that it operates correctly in a concurrent environment.
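A minimal sketch of such a shared object in Python, using a lock so that concurrent accesses never lose updates (class and names are illustrative, not from the slides):

```python
import threading

class SharedCounter:
    """A shared resource that stays correct under concurrent access."""
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:          # serialize the read-modify-write
            self.value += 1

counter = SharedCounter()
threads = [
    threading.Thread(
        target=lambda: [counter.increment() for _ in range(1000)])
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter.value is now exactly 8000; without the lock, interleaved
# read-modify-write sequences could silently lose increments
```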
INTRODUCTION
Distributed computing involves computing performed among
multiple network-connected computers.
A distributed system is a collection of individual computers.
Distributed programming is the process of writing distributed
programs.
What is Distributed Computing/System?
• Distributed computing
– A field of computing science that
studies distributed systems.
– The use of distributed systems to
solve computational problems.
• Distributed system
• There are several autonomous
computational entities, each of which has its
own local memory.
• The entities communicate with each other by
message passing.
– Operating System Concept
• The processors communicate with one
another through various communication
lines, such as high-speed buses or
telephone lines.
• Each processor has its own local memory.
WHY DISTRIBUTED COMPUTING?
Economics - distributed systems allow the pooling
of resources.
Reliability - distributed systems allow replication of
resources.
The Internet has become a universal platform for distributed
computing.
Why Distributed Computing?
• The nature of application
• Performance
– Computing intensive
• The task could consume a lot of time on computing.
– Data intensive
• A task that deals with a large amount of data or very large files. For
example, Facebook or the LHC (Large Hadron Collider).
• Robustness
– No SPOF (Single Point Of Failure)
– Other nodes can execute the same task executed
on failed node.
Common properties
– Fault tolerance
• When one or more nodes fail, the whole system can still work,
apart from some loss of performance.
• Need to check the status of each node
– Each node plays a partial role
• Each computer has only a limited, incomplete view of the system. Each
computer may know only one part of the input.
– Resource sharing
• Each user can share the computing power and storage resource in the
system with other users
– Load Sharing
• Dispatching tasks across the nodes helps share the load over the
whole system.
– Easy to expand
• Adding nodes should take little effort, ideally none at all.
Centralized vs. Distributed
Computing
Early computing was performed on a single processor.
Uni-processor computing can be called centralized computing.
[Figure: terminals attached to a mainframe computer (centralized
computing) vs. workstations and network hosts connected by
network links (distributed computing)]
Centralized vs. Distributed Computing
A distributed system is a collection of independent computers,
interconnected via a network, capable of collaborating on a task.
Distributed computing is computing performed in a distributed
system.
Distributed computing has become increasingly common due to
advances that have made both machines and networks
cheaper and faster.
A typical portion of the Internet
[Figure: several intranets connected through ISPs and satellite
links to a backbone; key: desktop computer, server, network link]
Computers in a Distributed System
• Workstations: computers used by end-users to perform
computing
• Server machines: computers which provide resources
and services
• Personal digital assistants (PDAs): handheld computers
connected to the system via a wireless communication
link.
• …
DISTRIBUTED COMPUTING PARADIGMS
Paradigm means a pattern, example, or model.
Paradigms for distributed applications
Message passing: the appropriate paradigm for network
services.
Client-server: provides an efficient abstraction for the
delivery of network services.
Peer-to-peer: the participating processes play equal
roles, with equivalent capabilities and
responsibilities.
Message Passing
• Message passing is the paradigm of communication
where messages are sent from a sender to one or
more recipients.
• Forms of messages include
(remote) method invocation, signals, and data
packets.
• Message Passing: Synchronous and Asynchronous
Synchronous Message Passing
• Synchronous message passing systems require the
sender and receiver to wait for each other to transfer the
message. That is, the sender will not continue until the
receiver has received the message.
• Synchronous communication has two advantages. The
first advantage is that reasoning about the program can
be simplified in that there is a synchronisation point
between sender and receiver on message transfer. The
second advantage is that no buffering is required. The
message can always be stored on the receiving side,
because the sender will not continue until the receiver is
ready.
Asynchronous Message Passing
• Asynchronous message passing systems deliver a
message from sender to receiver, without waiting for
the receiver to be ready.
• The advantage of asynchronous communication is
that the sender and receiver can overlap their
computation because they do not wait for each other.
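The two styles can be sketched with a single queue: an asynchronous sender enqueues and moves on, while a synchronous sender blocks until the receiver has processed the message. This is a toy approximation using `queue.Queue.join`; in a real system the rendezvous happens over the network:

```python
import queue
import threading

inbox = queue.Queue()          # unbounded buffer: asynchronous by default
received = []

def receiver():
    while True:
        msg = inbox.get()
        if msg is None:        # sentinel: stop the receiver
            inbox.task_done()
            break
        received.append(msg)
        inbox.task_done()      # lets a waiting synchronous sender resume

t = threading.Thread(target=receiver)
t.start()

# Asynchronous send: enqueue and continue immediately.
inbox.put("async hello")

# Synchronous send: block until the receiver has processed everything.
inbox.put("sync hello")
inbox.join()                   # sender waits here for the receiver

inbox.put(None)
t.join()
```

Note the trade-off from the slides: the synchronous sender needs no buffering guarantees and knows its message was consumed, while the asynchronous sender overlaps its computation with the receiver's.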
Message Passing - Synchronous vs.
Asynchronous
Peer-peer Computing
• Peer-to-peer (P2P) computing or networking is a
distributed application architecture that partitions tasks or
workloads between peers.
• Peers are equally privileged, equipotent participants in the
application. They are said to form a peer-to-peer network
of nodes.
• Peers make a portion of their resources, such as
processing power, disk storage or network bandwidth,
directly available to other network participants, without the
need for central coordination by servers or stable hosts.
• Peers are both suppliers and consumers of resources, in
contrast to the traditional client–server model where only
servers supply, and clients consume.
Best Practice
• Data Intensive or Computing Intensive
– Data size and the amount of data
• The attributes of the data you consume
• Computing intensive
– We can move data to the nodes where we can execute jobs
• Data Intensive
– We can partition/replicate data across different nodes, then execute
our tasks on those nodes
– Reduce data replication when executing tasks
• Master nodes need to know data location
• No data loss when incidents happen
– SAN (Storage Area Network)
– Data replication on different nodes
• Synchronization
– When splitting tasks to different nodes, how can
we make sure these tasks are synchronized?
Best Practice
• Robustness
– Still safe when one or several nodes fail
– Should recover automatically when failed nodes come back
online, with little or no further action needed
• Condor – restart daemon
– Failure detection
• When any node fails, master nodes can detect the situation.
– Eg: Heartbeat detection
– App/Users don’t need to know if any partial failure
happens.
• Restart tasks on other nodes for users
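The heartbeat detection mentioned above can be sketched as a master-side monitor that suspects a node has failed once no heartbeat has arrived within a timeout. Class, method names, and the timeout value are hypothetical, and time is passed in explicitly to keep the example deterministic:

```python
class HeartbeatMonitor:
    """Toy master-side failure detector: a node is suspected to have
    failed if no heartbeat arrives within `timeout` seconds."""
    def __init__(self, timeout=10.0):
        self.timeout = timeout
        self.last_seen = {}    # node name -> time of last heartbeat

    def record_heartbeat(self, node, now):
        self.last_seen[node] = now

    def suspected_failed(self, now):
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)
```

For example, if node-1 keeps sending heartbeats but node-2 goes silent, only node-2 is reported, and the master can restart its tasks elsewhere. Real detectors must also tolerate message delay, which makes the timeout a trade-off between detection speed and false suspicion.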
Best Practice
• Network issue
– Bandwidth
• Bandwidth must be considered when copying files to nodes that do
not yet hold the data needed to execute a task.
• Scalability
– Easy to expand
• Hadoop – configuration modification and start daemon
• Optimization
– What can we do if the performance of some nodes is
not good?
• Monitoring the performance of each node
– Based on information exchanged, such as heartbeats or logs
• Resume the same task on another node
Best Practice
• App/User
– shouldn’t know how to communicate between
nodes
– User mobility - users can access the system from
any access point
• Grid – UI (User interface)
• Condor – submit machine
Components of Distributed Software
Systems
• Distributed systems
• Middleware
• Distributed applications
Middleware
Figure 1-1. The middleware layer extends over multiple machines,
and offers each application the same interface.
Middleware: Goals
• Middleware handles heterogeneity
• Higher-level support
– Make distributed nature of application transparent to
the user/programmer
• Remote Procedure Calls
• RPC + Object orientation = CORBA
• Higher-level support BUT expose remote
objects, partial failure, etc. to the programmer
– JINI, Javaspaces
• Scalability
Communication Patterns
• Client-server
• Group-oriented/Peer-to-Peer
– Applications that require reliability, scalability
Clients invoke individual servers
[Figure: a client sends an invocation to a server and receives a
result; the server may in turn invoke a second server.
Key: process, computer]
A service provided by multiple servers
[Figure: two clients access a service provided by three servers]
Web proxy server
[Figure: clients reach web servers through an intermediate proxy
server]
A distributed application based on peer
processes
[Figure: peers 1-4 each run the application and hold sharable
objects; peers 5 .... N continue the pattern]
Distributed applications
• Applications that consist of a set of processes
that are distributed across a network of
machines and work together as an ensemble
to solve a common problem
• In the past, mostly “client-server”
– Resource management centralized at the server
• “Peer to Peer” computing represents a
movement towards more “truly” distributed
applications
Transparency
• Transparency is defined as the concealment
from the user and the application programmer
of the separation of components in a
distributed system, so that the system is
perceived as a whole rather than as a
collection of independent components.
• Eight forms of transparency:
Forms of Transparency in a
Distributed System
Transparency   Description
Access         Hide differences in data representation and how a resource is accessed
Location       Hide where a resource is located
Migration      Hide that a resource may move to another location
Relocation     Hide that a resource may be moved to another location while in use
Replication    Hide that a resource is replicated
Concurrency    Hide that a resource may be shared by several competitive users
Failure        Hide the failure and recovery of a resource
Persistence    Hide whether a (software) resource is in memory or on disk
Distributed Computing Systems
Transparency in Distributed Systems
Access transparency: enables local and remote resources to be
accessed using identical operations.
Location transparency: enables resources to be accessed
without knowledge of their physical or network location (for
example, which building or IP address).
Concurrency transparency: enables several processes to
operate concurrently using shared resources without
interference between them.
Replication transparency: enables multiple instances of
resources to be used to increase reliability and performance
without knowledge of the replicas by users or application
programmers.
Transparency in Distributed Systems
Failure transparency: enables the concealment of faults,
allowing users and application programs to complete their tasks
despite the failure of hardware or software components.
Mobility transparency: allows the movement of resources and
clients within a system without affecting the operation of users
or programs.
Performance transparency: allows the system to be
reconfigured to improve performance as loads vary.
Scaling transparency: allows the system and applications to
expand in scale without change to the system structure or the
application algorithms.
Case study - Hadoop
• HDFS
– Namenode:
• manages the file system namespace and regulates access to files by
clients.
• determines the mapping of blocks to DataNodes.
– Data Node :
• manage storage attached to the nodes that they run on
• save CRC codes
• send heartbeat to namenode.
• Each file is split into chunks, and each chunk is stored on several
data nodes.
– Secondary Namenode
• responsible for merging fsImage and EditLog
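The chunking and replica mapping described above can be sketched as follows. The chunk size and round-robin placement are illustrative only; real HDFS uses large blocks (64/128 MB) and rack-aware placement:

```python
def split_into_chunks(data: bytes, chunk_size: int):
    """Split a file's bytes into fixed-size chunks (last may be short)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def place_replicas(num_chunks, datanodes, replication=3):
    """Namenode-style mapping: chunk index -> list of datanodes holding
    a replica, assigned round-robin across the available datanodes."""
    placement = {}
    for i in range(num_chunks):
        placement[i] = [datanodes[(i + r) % len(datanodes)]
                        for r in range(min(replication, len(datanodes)))]
    return placement
```

With three replicas per chunk, any single datanode failure still leaves two live copies of every chunk it held.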
Case study - Hadoop
Case study - Hadoop
• Map-reduce Framework
– JobTracker
• Responsible for dispatching jobs to each TaskTracker
• Job management, such as scheduling and removal.
– TaskTracker
• Responsible for executing jobs. Usually the TaskTracker launches a
separate JVM to execute each job.
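A minimal single-process sketch of the map-shuffle-reduce flow that the JobTracker and TaskTrackers distribute across nodes, using word count as the canonical example (this illustrates the dataflow only, not Hadoop's actual API):

```python
from collections import defaultdict

def map_phase(record):
    """Map task: emit (word, 1) for every word in one input record."""
    return [(word, 1) for word in record.split()]

def shuffle(pairs):
    """Shuffle: group all intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce task: combine each key's values into a final count."""
    return {key: sum(values) for key, values in groups.items()}

records = ["hello world", "hello hadoop"]
pairs = [kv for r in records for kv in map_phase(r)]
counts = reduce_phase(shuffle(pairs))
# counts == {"hello": 2, "world": 1, "hadoop": 1}
```

In Hadoop, each map and reduce call above would run as a task on a TaskTracker close to the data, with the framework performing the shuffle between nodes.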
Case study - Hadoop
From Hadoop - The Definitive Guide
Case study - Hadoop
• Data replication
– Data are replicated to different nodes
• Reduce the possibility of data loss
• Data locality. Job will be sent to the node where data are.
• Robustness
– One datanode fails
• We can get data from other nodes.
– One tasktracker fails
• We can start the same task on a different node
– Recovery
• Only need to restart the daemon when the failed nodes are online
Case study - Hadoop
• Resource sharing
– Each hadoop user can share computing power
and storage space with other hadoop users.
• Synchronization
– No synchronization
• Failure detection
– Namenode/Jobtracker can know when
datanode/tasktracker fails
• Based on heartbeat