
CSE 352

Distributed Systems
Design Principles and Patterns: An Overview

1
1. DISTRIBUTED SYSTEM TYPES

[Figure: a taxonomy of distributed system types along three dimensions: Control (fully distributed, autonomous fully cooperative, autonomous transaction-based, master-slave), Data (local data with local directory, not fully replicated with a master directory, fully replicated), and Processors (homogeneous or heterogeneous, special-purpose or general-purpose)]
1. WHAT IS A DISTRIBUTED SYSTEM?

Definition: A distributed system is one in which components located at networked computers communicate and coordinate their actions only by passing messages.

This definition leads to the following characteristics of distributed systems:

 Concurrency of components
 Lack of a global ‘clock’
 Independent failures of components


3
1.1 CENTRALIZED SYSTEM CHARACTERISTICS
 One component with non-autonomous parts

 Component shared by users all the time

 All resources accessible

 Software runs in a single process

 Single point of control

 Single point of failure

4
1.2 DISTRIBUTED SYSTEM CHARACTERISTICS

 Multiple autonomous components

 Components are not shared by all users

 Resources may not be accessible

 Software runs in concurrent processes on different processors

 Multiple points of control

 Multiple points of failure


5
2. EXAMPLES OF DISTRIBUTED SYSTEMS

 Local Area Network and Intranet

 Database Management System

 Automatic Teller Machine Network

 Internet/World-Wide Web

 Mobile and Ubiquitous Computing

6
2.1 LOCAL AREA NETWORK
[Figure: a local area network/intranet: desktop computers, email server, print and other servers, Web server, and file server on the local area network, with a router/firewall connecting it to the rest of the Internet]
2.2 DATABASE MANAGEMENT SYSTEM

8
2.3 AUTOMATIC TELLER MACHINE NETWORK

9
2.4 INTERNET

[Figure: the Internet: intranets connected through ISPs to backbones and satellite links; key shows desktop computers, servers, and network links]
2.4.1 WORLD-WIDE-WEB

11
2.4.2 WEB SERVERS AND WEB BROWSERS
[Figure: Web browsers and Web servers: browsers issue requests such as http://www.google.com/search?q=lyu and http://www.uu.se/ across the Internet to Web servers such as www.google.com, www.uu.se, and www.w3c.org; http://www.w3c.org/Protocols/Activity.html maps to the file Protocols/Activity.html in the file system of www.w3c.org]
2.5 MOBILE AND UBIQUITOUS COMPUTING

[Figure: mobile and ubiquitous computing: a host intranet with a wireless LAN gateway and a GSM/GPRS link to the Internet, plus a home intranet; portable devices (mobile phone, laptop, camera, printer) connect at the host site]
3. COMMON CHARACTERISTICS

 What are we trying to achieve when we construct a distributed system?
 Certain common characteristics can be used to assess distributed systems:
 Heterogeneity
 Openness
 Security
 Scalability
 Failure Handling
 Concurrency
 Transparency

14
3.1 HETEROGENEITY
 Variety and differences in
 Networks
 Computer hardware
 Operating systems
 Programming languages
 Implementations by different developers
 Middleware: software layers that provide a programming abstraction and mask the heterogeneity of the underlying networks, hardware, OS, and programming languages (e.g., CORBA).
 Mobile code: code that can be sent from one computer to another and run at the destination (e.g., Java applets and the Java virtual machine).
15
3.2 OPENNESS
 Openness is concerned with extensions and
improvements of distributed systems.
 Detailed interfaces of components need to be
published.
 New components have to be integrated with existing
components.
 Differences in data representation of interface types
on different processors (of different vendors) have to
be resolved.

16
3.3 SECURITY
 In a distributed system, clients send requests to access data and resources managed by servers in the network:
 Doctors requesting records from hospitals
 Users purchasing products through electronic commerce
 Security is required for:
 Concealing the contents of messages: security and privacy
 Identifying a remote user or other agent correctly
(authentication)
 New challenges:
 Denial of service attack
 Security of mobile code
17
3.4 SCALABILITY
 Adaptation of distributed systems to
 accommodate more users
 respond faster (this is the hard one)

 Usually done by adding more and/or faster processors.


 Components should not need to be changed when
scale of a system increases.
 Design components to be scalable!

18
3.5 FAILURE HANDLING (FAULT TOLERANCE)

 Hardware, software and networks fail!


 Distributed systems must maintain availability even
at low levels of hardware/software/network reliability.
 Fault tolerance is achieved by
 recovery
 redundancy

19
3.6 CONCURRENCY
 Components in distributed systems are executed in
concurrent processes.
 Components access and update shared resources (e.g.
variables, databases, device drivers).
 Integrity of the system may be violated if concurrent
updates are not coordinated.
 Lost updates
 Inconsistent analysis

20
3.7 TRANSPARENCY
 Distributed systems should be perceived by users and
application programmers as a whole rather than as a
collection of cooperating components.
 Transparency has different aspects.
 These represent various properties that distributed
systems should have.

21
4. BASIC DESIGN ISSUES
 General software engineering principles include
rigor and formality, separation of concerns,
modularity, abstraction, anticipation of
change, …
 Specific issues for distributed systems:
 Naming
 Communication
 Software structure
 System architecture
 Workload allocation
 Consistency maintenance
4.1 NAMING
 A name is resolved when translated into an interpretable
form for resource/object reference.
 Communication identifier (IP address + port number)
 Name resolution involves several translation steps

 Design considerations
 Choice of name space for each resource type
 Name service to resolve resource names to comm. id.

 Name services include naming context resolution, hierarchical structure, and resource protection

31
4.2 COMMUNICATION
 Separated components communicate with sending
processes and receiving processes for data transfer and
synchronization.
 Message passing: send and receive primitives
 synchronous or blocking
 asynchronous or non-blocking
 Abstractions defined: channels, sockets, ports.
 Communication patterns: client-server communication
(e.g., RPC, function shipping) and group multicast
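To make the send and receive primitives concrete, here is a minimal Python sketch (not from the slides) of a blocking receive and a non-blocking variant over a TCP socket; the 4-byte length-prefix framing and the function names are illustrative assumptions.

import select
import socket

def send_message(sock: socket.socket, payload: bytes) -> None:
    # send primitive: length-prefix the payload so the receiver can frame it
    sock.sendall(len(payload).to_bytes(4, "big") + payload)

def recv_exactly(sock: socket.socket, n: int) -> bytes:
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))   # blocks until data arrives
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf

def receive_blocking(sock: socket.socket) -> bytes:
    # synchronous / blocking receive: the caller waits for a complete message
    size = int.from_bytes(recv_exactly(sock, 4), "big")
    return recv_exactly(sock, size)

def try_receive(sock: socket.socket):
    # asynchronous / non-blocking flavour: poll first, return None if nothing is ready
    ready, _, _ = select.select([sock], [], [], 0)
    return receive_blocking(sock) if ready else None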

32
4.3 SOFTWARE STRUCTURE
 Layers in centralized computer systems:

Applications
Middleware

Operating system
Computer and Network Hardware

33
4.3 SOFTWARE STRUCTURE
 Layers and dependencies in distributed systems:

Applications
Open services / distributed programming support
Open system kernel services
Computer and network hardware
4.4 SYSTEM ARCHITECTURES
 Client-Server
 Peer-to-Peer

 Services provided by multiple servers

 Proxy servers and caches

 Mobile code and mobile agents

 Network computers

 Thin clients and mobile devices

35
4.4.1 CLIENTS INVOKE INDIVIDUAL SERVERS

[Figure: clients invoke individual servers: each client sends an invocation to a server and receives a result; a server may in turn invoke another server. Key: process, computer]
4.4.2 PEER-TO-PEER SYSTEMS
[Figure: peer-to-peer system: peers 1 .. N each run the application and coordinate access to sharable objects]
4.4.3 A SERVICE BY MULTIPLE SERVERS
[Figure: a service provided by multiple servers: clients invoke a service that is implemented by several cooperating server processes]
4.4.4 WEB PROXY SERVER

[Figure: Web proxy server: client requests pass through a proxy server on their way to the Web servers]
4.4.5 WEB APPLETS
[Figure: Web applets: a) a client request results in the downloading of applet code from the Web server; b) the client then interacts with the applet]
4.4.6 THIN CLIENTS AND COMPUTE SERVERS

[Figure: thin clients and compute servers: a network computer or PC runs a thin client process that connects over the network to an application process on a compute server]
Two Different System Models
❑ Synchronous Distributed System
❑ Each message is received within bounded time
❑ Each step in a process takes lb < time < ub
❑ (Each local clock’s drift has a known bound)
❑Asynchronous Distributed System
❑ No bounds on process execution
❑ No bounds on message transmission delays
❑ (The drift of a clock is arbitrary)
The Internet is an asynchronous distributed system
Failure Model
❖Process omission failure
❖ Crash-stop (fail-stop) – a process halts and does not
execute any further operations
❖ Crash-recovery – a process halts, but then recovers
(reboots) after a while
❖Crash-stop failures can be detected in
synchronous systems
❖Next: detecting crash-stop failures in
asynchronous systems
Network partition

[Figure: a crashed router partitions the network into parts that cannot communicate with each other]
What's a failure detector?

[Figure: process pi needs to know about the crash-stop failure of process pj (pj crashes, shown as X)]
I. Ping-Ack Protocol

[Figure: pi sends a ping to pj; pj replies with an ack]

- pi queries pj once every T time units; pj replies with an ack
- if pj does not respond within T time units, pi marks pj as failed
- If pj fails, pi will send it a ping within T time units and will time out within another T time units, so detection time = 2T
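A minimal Python sketch of the ping-ack loop on pi's side; the transport object and its send / received_ack / mark_failed helpers are hypothetical, only the timing logic follows the slide.

import time

T = 2.0  # query period and timeout, in seconds

def ping_ack_monitor(transport, pj):
    while True:
        transport.send(pj, "ping")               # pi queries pj once every T time units
        deadline = time.time() + T
        while time.time() < deadline:
            if transport.received_ack(pj):       # pj replied in time
                break
            time.sleep(0.05)
        else:
            transport.mark_failed(pj)            # no ack within T: mark pj as failed
            return                               # worst-case detection time is 2T
        time.sleep(max(0.0, deadline - time.time()))  # wait out the rest of the period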
II. Heart-beating Protocol

[Figure: pj sends periodic heartbeats to pi, which needs to know about pj's failure]

- pj maintains a sequence number
- pj sends pi a heartbeat with an incremented sequence number after every T' (= T) time units
- if pi has not received a new heartbeat for the past T time units, pi declares pj as failed
- If pj has sent x heartbeats until the time it fails, then pi will time out within (x+1)*T time units in the worst case and will detect pj as failed (in reality, the detection time is also T time units; why?)
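A minimal sketch of pi's side of heart-beating; the timeout check is the part taken from the slide, everything else (class name, clock source) is an illustrative choice.

import time

T = 2.0  # heartbeat period and timeout, in seconds

class HeartbeatDetector:
    def __init__(self):
        self.last_seq = -1
        self.last_heard = time.time()

    def on_heartbeat(self, seq: int) -> None:
        # called whenever pi receives a heartbeat from pj
        if seq > self.last_seq:            # only a new (higher) sequence number counts
            self.last_seq = seq
            self.last_heard = time.time()

    def pj_failed(self) -> bool:
        # declare pj failed if no new heartbeat arrived in the past T time units
        return time.time() - self.last_heard > T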
Failure Detector Properties
• Completeness = every process failure is eventually
detected (no misses)
• Accuracy = every detected failure corresponds to a
crashed process (no mistakes)
• Given a failure detector that satisfies both
Completeness and Accuracy
– One can show that Consensus is achievable
– FLP => one cannot design a failure detector (for an
asynchronous system) that guarantees both above
properties
FLP Theorem
Distributed consensus is the problem in which a number of nodes agree on a proposed value. A solution (or algorithm) to distributed consensus is characterized by 3 properties:

• Safety: the nodes agree on a valid value proposed by one of the nodes.
• Liveness: the nodes eventually reach agreement, i.e. the system must make progress.
• Fault-tolerance: the solution must work when there may be failures in the network.

• FLP roughly states that any solution to distributed consensus in the asynchronous model can have at most 2 of the 3 properties. In other words, no algorithm can be safe, live, and fault-tolerant at the same time.

FLP Theorem: In an asynchronous network, it is not possible to achieve both safety and liveness when there may be even one crash failure.
Centralized Heart-beating
[Figure: every process pj sends "pj, heartbeat seq++" messages to one central process pi]

Needs a separate dissemination component
Downside?
Ring Heart-beating
[Figure: processes arranged in a ring; each process pj sends "pj, heartbeat seq++" to its ring neighbour pi]

Needs a separate dissemination component
Downside?
Gossip-style Heartbeating

[Figure: process pi maintains an array of heartbeat sequence numbers for a subset of members; every tg time units (the gossip period) it sends an O(N) gossip message]
All-to-All Heart-beating
[Figure: every process pj sends "pj, heartbeat seq++" to all other processes, including pi]

Does not need a separate dissemination component
Downside?
Concurrency Control
 Modify concurrency control schemes for use in a distributed environment.
 We assume that each site participates in the execution of a commit protocol to ensure global transaction atomicity.
 We assume all replicas of any item are updated
 Will see how to relax this in case of site failures later

Single-Lock-Manager Approach
 System maintains a single lock manager that resides in a single
chosen site, say Si
 When a transaction needs to lock a data item, it sends a lock request
to Si and lock manager determines whether the lock can be granted
immediately
 If yes, lock manager sends a message to the site which initiated
the request
 If no, request is delayed until it can be granted, at which time a
message is sent to the initiating site
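A minimal sketch of the single lock manager's bookkeeping at site Si, for exclusive locks only; the messages back to the initiating site are abstracted away and the class and method names are illustrative.

from collections import defaultdict, deque

class SingleLockManager:
    def __init__(self):
        self.holder = {}                      # data item -> transaction currently holding it
        self.waiting = defaultdict(deque)     # data item -> queue of delayed requests

    def lock_request(self, txn, item) -> bool:
        # Handle a lock request arriving at Si; True means granted immediately.
        if item not in self.holder:
            self.holder[item] = txn           # grant: a message would be sent to the initiating site
            return True
        self.waiting[item].append(txn)        # delay until the lock can be granted
        return False

    def unlock(self, txn, item):
        # Release a lock and grant it to the next waiting transaction, if any.
        if self.holder.get(item) == txn:
            del self.holder[item]
            if self.waiting[item]:
                nxt = self.waiting[item].popleft()
                self.holder[item] = nxt       # the "granted" message is now sent to nxt's site
                return nxt
        return None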

Single-Lock-Manager Approach (Cont.)
 The transaction can read the data item from any one of the sites at which a
replica of the data item resides.
 Writes must be performed on all replicas of a data item
 Advantages of scheme:
 Simple implementation
 Simple deadlock handling
 Disadvantages of scheme are:
 Bottleneck: lock manager site becomes a bottleneck
 Vulnerability: system is vulnerable to lock manager site failure.

Distributed Lock Manager
 In this approach, functionality of locking is implemented by lock managers at
each site
 Lock managers control access to local data items
 But special protocols may be used for replicas
 Advantage: work is distributed and can be made robust to failures
 Disadvantage: deadlock detection is more complicated
 Lock managers cooperate for deadlock detection
 More on this later
 Several variants of this approach
 Primary copy
 Majority protocol
 Biased protocol
 Quorum consensus

Primary Copy
 Choose one replica of data item to be the primary copy.
 Site containing the replica is called the primary site for that data item
 Different data items can have different primary sites
 When a transaction needs to lock a data item Q, it requests a lock at the
primary site of Q.
 Implicitly gets lock on all replicas of the data item
 Benefit
 Concurrency control for replicated data handled similarly to unreplicated
data - simple implementation.
 Drawback
 If the primary site of Q fails, Q is inaccessible even though other sites
containing a replica may be accessible.

Majority Protocol
 Local lock manager at each site administers lock and unlock requests for
data items stored at that site.
 When a transaction wishes to lock an unreplicated data item Q residing at
site Si, a message is sent to Si ‘s lock manager.
 If Q is locked in an incompatible mode, then the request is delayed until
it can be granted.
 When the lock request can be granted, the lock manager sends a
message back to the initiator indicating that the lock request has been
granted.

Majority Protocol (Cont.)
 In case of replicated data
 If Q is replicated at n sites, then a lock request message must be sent to
more than half of the n sites in which Q is stored.
 The transaction does not operate on Q until it has obtained a lock on a
majority of the replicas of Q.
 When writing the data item, the transaction performs writes on all replicas.

 Benefit
 Can be used even when some sites are unavailable
 details on how handle writes in the presence of site failure later
 Drawback
 Requires 2(n/2 + 1) messages for handling lock requests, and (n/2 + 1)
messages for handling unlock requests.
 Potential for deadlock even with a single item - e.g., each of 3 transactions may hold locks on 1/3rd of the replicas of a data item.

Biased Protocol
 Local lock manager at each site as in the majority protocol; however, requests for shared locks are handled differently from requests for exclusive locks.
 Shared locks. When a transaction needs to lock data item Q, it simply
requests a lock on Q from the lock manager at one site containing a replica
of Q.
 Exclusive locks. When transaction needs to lock data item Q, it requests a
lock on Q from the lock manager at all sites containing a replica of Q.
 Advantage - imposes less overhead on read operations.
 Disadvantage - additional overhead on writes

Quorum Consensus Protocol
 A generalization of both majority and biased protocols
 Each site is assigned a weight.
 Let S be the total of all site weights
 Choose two values: read quorum Qr and write quorum Qw
 Such that Qr + Qw > S and 2 * Qw > S
 Quorums can be chosen (and S computed) separately for each item

 Each read must lock enough replicas that the sum of the site weights is >= Qr
 Each write must lock enough replicas that the sum of the site weights is >= Qw
 For now we assume all replicas are written
 Extensions to allow some sites to be unavailable described later
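A minimal sketch of the quorum-consensus arithmetic above; the site names and weights in the example are made up.

def valid_quorums(weights: dict, Qr: int, Qw: int) -> bool:
    S = sum(weights.values())
    # any two write quorums must overlap, and every read quorum must overlap every write quorum
    return (Qr + Qw > S) and (2 * Qw > S)

def is_read_quorum(weights, Qr, locked_sites) -> bool:
    return sum(weights[s] for s in locked_sites) >= Qr

def is_write_quorum(weights, Qw, locked_sites) -> bool:
    return sum(weights[s] for s in locked_sites) >= Qw

# Example: three sites of weight 1 with Qr = Qw = 2 behaves like the majority protocol.
w = {"s1": 1, "s2": 1, "s3": 1}
assert valid_quorums(w, Qr=2, Qw=2)
assert is_read_quorum(w, 2, {"s1", "s3"})
assert not is_write_quorum(w, 2, {"s2"})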

Timestamping
 Timestamp based concurrency-control protocols can be used in
distributed systems
 Each transaction must be given a unique timestamp
 Main problem: how to generate a timestamp in a distributed fashion
 Each site generates a unique local timestamp using either a logical
counter or the local clock.
 Globally unique timestamp is obtained by concatenating the unique local timestamp with the unique site identifier.

Timestamping (Cont.)
 A site with a slow clock will assign smaller timestamps
 Still logically correct: serializability not affected
 But: “disadvantages” transactions
 To fix this problem
 Define within each site Si a logical clock (LCi), which generates the
unique local timestamp
 Require that Si advance its logical clock whenever a request is received from a transaction Ti with timestamp <x,y> and x is greater than the current value of LCi.
 In this case, site Si advances its logical clock to the value x + 1.
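A minimal sketch of timestamp generation with the logical-clock fix; representing the global timestamp as a (counter, site_id) pair is an assumption consistent with the text.

class TimestampGenerator:
    def __init__(self, site_id: int):
        self.site_id = site_id
        self.LC = 0                          # logical clock LCi of site Si

    def new_timestamp(self):
        # globally unique timestamp = (unique local timestamp, unique site identifier)
        self.LC += 1
        return (self.LC, self.site_id)

    def on_request(self, ts):
        # a request arrives with timestamp <x, y>; a slow site catches up
        x, _y = ts
        if x > self.LC:
            self.LC = x + 1                  # advance the logical clock to x + 1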

Replication with Weak Consistency
 Many commercial databases support replication of data with weak degrees of consistency (i.e., without a guarantee of serializability)
 E.g.: master-slave replication: updates are performed at a single “master” site, and propagated to “slave” sites.
 Propagation is not part of the update transaction: it is decoupled
 May be immediately after transaction commits
 May be periodic
 Data may only be read at slave sites, not updated
 No need to obtain locks at any remote site
 Particularly useful for distributing information
 E.g. from central office to branch-office
 Also useful for running read-only queries offline from the main database

Replication with Weak Consistency (Cont.)

 Replicas should see a transaction-consistent snapshot of the database


 That is, a state of the database reflecting all effects of all transactions up
to some point in the serialization order, and no effects of any later
transactions.
 E.g. Oracle provides a create snapshot statement to create a snapshot of a
relation or a set of relations at a remote site
 snapshot refresh either by recomputation or by incremental update
 Automatic refresh (continuous or periodic) or manual refresh

Multimaster and Lazy Replication
 With multimaster replication (also called update-anywhere replication)
updates are permitted at any replica, and are automatically propagated to all
replicas
 Basic model in distributed databases, where transactions are unaware of
the details of replication, and database system propagates updates as
part of the same transaction
 Coupled with 2 phase commit
 Many systems support lazy propagation where updates are transmitted
after transaction commits
 Allows updates to occur even if some sites are disconnected from the
network, but at the cost of consistency

Trading Consistency for Availability

What is Consistency?
 Consistency in Databases (ACID):
 Database has a set of integrity constraints
 A consistent database state is one where all integrity constraints are
satisfied
 Each transaction run individually on a consistent database state must
leave the database in a consistent state
 Consistency in distributed systems with replication
 Strong consistency: a schedule with read and write operations on a
replicated object should give results and final state equivalent to some
schedule on a single copy of the object, with order of operations from a
single site preserved
 Weak consistency (several forms)

Availability
 Traditionally, availability of centralized server
 For distributed systems, availability of system to process requests
 For a large system, at almost any point in time there’s a good chance that
 a node is down, or even
 the network is partitioned
 Distributed consensus algorithms will block during partitions to ensure
consistency
 Many applications require continued operation even during a network
partition
 Even at cost of consistency

Brewer’s CAP Theorem
◼ Three properties of a system
❑ Consistency (all copies have same value)

❑ Availability (system can run even if parts have failed)

❑ Via replication
❑ Partitions (network can break into two or more parts, each with
active systems that can’t talk to other parts)
◼ Brewer’s CAP “Theorem”: You can have at most two of these three
properties for any system
◼ Very large systems will partition at some point
➔Choose one of consistency or availability
❑ Traditional databases choose consistency

❑ Most Web applications choose availability

◼ Except for specific parts such as order processing

Eventual Consistency

◼ When no updates occur for a long period of time, eventually all updates will propagate through the system and all the nodes will be consistent
◼ For a given accepted update and a given node, eventually either
the update reaches the node or the node is removed from service
◼ Known as BASE (Basically Available, Soft state, Eventual
consistency), as opposed to ACID
❑ Soft state: copies of a data item may be inconsistent
❑ Eventually Consistent – copies becomes consistent at
some later time if there are no more updates to that data
item

Availability vs Latency

 CAP theorem only matters when there is a partition


 Even if partitions are rare, applications may trade off
consistency for latency
 E.g. PNUTS allows inconsistent reads to reduce latency
– Critical for many applications
 But update protocol (via master) ensures consistency over
availability
 Thus there are two questions:
 If there is partitioning, how does the system trade off availability for consistency?
 Else, how does the system trade off latency for consistency?

Distributed Directory Systems

Directory Systems
 Typical kinds of directory information
 Employee information such as name, id, email, phone, office addr, ..
 Even personal information to be accessed from multiple places
 e.g. Web browser bookmarks
 White pages
 Entries organized by name or identifier
 Meant for forward lookup to find more about an entry
 Yellow pages
 Entries organized by properties
 For reverse lookup to find entries matching specific requirements
 When directories are to be accessed across an organization
 Alternative 1: Web interface. Not great for programs
 Alternative 2: Specialized directory access protocols
 Coupled with specialized user interfaces

Directory Access Protocols
 Most commonly used directory access protocol:
 LDAP (Lightweight Directory Access Protocol)
 Simplified from earlier X.500 protocol
 Question: Why not use database protocols like ODBC/JDBC?
 Answer:
 Simplified protocols for a limited type of data access, evolved parallel to
ODBC/JDBC
 Provide a nice hierarchical naming mechanism similar to file system
directories
 Data can be partitioned amongst multiple servers for different parts of
the hierarchy, yet give a single view to user
– E.g. different servers for Bell Labs Murray Hill and Bell Labs
Bangalore
 Directories may use databases as storage mechanism

LDAP Data Model
 LDAP directories store entries
 Entries are similar to objects
 Each entry must have unique distinguished name (DN)
 DN made up of a sequence of relative distinguished names (RDNs)
 E.g. of a DN
 cn=Silberschatz, ou=Bell Labs, o=Lucent, c=USA
 Standard RDNs (can be specified as part of schema)
 cn: common name ou: organizational unit
 o: organization c: country
 Similar to paths in a file system but written in reverse direction
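A purely illustrative sketch (not an LDAP client): splitting the example DN into its RDNs and showing how it reads in reverse compared with a file-system path.

def parse_dn(dn: str):
    rdns = []
    for part in dn.split(","):
        attr, _, value = part.strip().partition("=")
        rdns.append((attr, value))
    return rdns

dn = "cn=Silberschatz, ou=Bell Labs, o=Lucent, c=USA"
rdns = parse_dn(dn)
# A DN goes from most specific to most general, i.e. the reverse of a path:
path_like = "/" + "/".join(value for _attr, value in reversed(rdns))
print(rdns)       # [('cn', 'Silberschatz'), ('ou', 'Bell Labs'), ('o', 'Lucent'), ('c', 'USA')]
print(path_like)  # /USA/Lucent/Bell Labs/Silberschatz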

LDAP Data Model (Cont.)
 Entries can have attributes
 Attributes are multi-valued by default
 LDAP has several built-in types
 Binary, string, time types
 Tel: telephone number PostalAddress: postal address
 LDAP allows definition of object classes
 Object classes specify attribute names and types
 Can use inheritance to define object classes
 Entry can be specified to be of one or more object classes
 No need to have single most-specific type

LDAP Data Model (cont.)
 Entries organized into a directory information tree according to their DNs
 Leaf level usually represent specific objects
 Internal node entries represent objects such as organizational units,
organizations or countries
 Children of a node inherit the DN of the parent, and add on RDNs
 E.g. internal node with DN c=USA
– Children nodes have DN starting with c=USA and further RDNs
such as o or ou
 DN of an entry can be generated by traversing path from root
 Leaf level can be an alias pointing to another entry
 Entries can thus have more than one DN
– E.g. person in more than one organizational unit

What is Consensus?
Formal problem statement
• N processes
• Each process p has
  - input variable xp: initially either 0 or 1
  - output variable yp: initially undecided (⊥); can be changed only once
• Consensus problem: design a protocol so that at the end, either:
  1. All processes set their output variables to 0 (all-0’s)
  2. Or all processes set their output variables to 1 (all-1’s)
What is Consensus? (2)
• Every process contributes a value
• Goal is to have all processes decide same (some) value
– Decision once made can’t be changed
• There might be other constraints
– Validity = if everyone proposes same value, then that’s
what’s decided
– Integrity = decided value must have been proposed by
some process
– Non-triviality = there is at least one initial system state that leads to each of the all-0’s or all-1’s outcomes
Why is it Important?
• Many problems in distributed systems are equivalent to
(or harder than) consensus!
– Perfect Failure Detection
– Leader election (select exactly one leader, and every alive
process knows about it)
– Agreement (harder than consensus)

• So consensus is a very important problem, and solving it


would be really useful!

• Consensus is
– Possible to solve in synchronous systems
– Impossible to solve in asynchronous systems
84
Can’t we just solve Consensus?
• Yes, we can!
• (Whut?)

85
Yes we Can!
•Paxos algorithm
– Most popular “consensus-solving” algorithm
– Does not solve consensus problem (which
would be impossible, because we already
proved that)
– But provides safety and eventual liveness
– A lot of systems use it
• Zookeeper (Yahoo!), Google Chubby, and
many other companies

86
•Paxos invented by? (take a guess)
Yes we Can!
• Paxos invented by Leslie Lamport

• Paxos provides safety and eventual liveness


– Safety: Consensus is not violated
– Eventual Liveness: If things go well sometime in the future
(messages, failures, etc.), there is a good chance consensus
will be reached. But there is no guarantee.

• FLP result still applies: Paxos is not guaranteed to reach


Consensus (ever, or within any bounded time)
87
Political Science 101, i.e., Paxos Groked

• Paxos has rounds; each round has a unique ballot id


• Rounds are asynchronous
– Time synchronization not required
– If you’re in round j and hear a message from round j+1, abort everything and
move over to round j+1
– Use timeouts; may be pessimistic
• Each round itself broken into phases (which are also asynchronous)
– Phase 1: A leader is elected (Election)
– Phase 2: Leader proposes a value, processes ack (Bill)
– Phase 3: Leader multicasts final value (Law)
Phase 1 – election
• Potential leader chooses a unique ballot id, higher than anything seen so far
• Sends to all processes
• Processes wait, respond once to highest ballot id
– If potential leader sees a higher ballot id, it can’t be a leader
– Paxos tolerant to multiple leaders, but we’ll only discuss 1 leader case
– Processes also log received ballot ID on disk
• If a process has in a previous round decided on a value v’, it includes value v’ in its response
• If majority (i.e., quorum) respond OK then you are the leader
– If no one has majority, start new round
• (If things go right) A round cannot have two leaders (why?)
[Figure: Phase 1: potential leader sends “Please elect me!”; processes respond “OK!”]
Phase 2 – Proposal (Bill)
• Leader sends proposed value v to all
– use v=v’ if some process already decided in a previous
round and sent you its decided value v’
– If multiple such v’ received, use latest one
• Recipient logs on disk; responds OK
[Figure: Phase 2: leader asks “Value v ok?”; recipients respond “OK!”]
Phase 3 – Decision (Law)
• If leader hears a majority of OKs, it lets everyone know of the
decision
• Recipients receive decision, log it on disk

[Figure: Phase 3: leader multicasts the final decision “v!”]
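A highly simplified sketch of the acceptor side of the three phases above; class and function names are illustrative, message transport is omitted, and (as the slides note) a real acceptor must also log its promises and accepted values on disk.

class Acceptor:
    def __init__(self):
        self.promised_ballot = -1      # highest ballot id responded to (should be logged on disk)
        self.accepted_ballot = -1      # ballot of the last accepted value, if any
        self.accepted_value = None

    def on_prepare(self, ballot):
        # Phase 1 ("Please elect me!"): respond only to the highest ballot id seen so far
        if ballot > self.promised_ballot:
            self.promised_ballot = ballot
            # the OK response piggybacks any previously accepted value
            return ("ok", self.accepted_ballot, self.accepted_value)
        return ("reject", self.promised_ballot, None)

    def on_propose(self, ballot, value):
        # Phase 2 ("Value v ok?"): accept unless a higher ballot has been promised
        if ballot >= self.promised_ballot:
            self.promised_ballot = ballot
            self.accepted_ballot = ballot
            self.accepted_value = value
            return "ok"
        return "reject"

def choose_value(phase1_replies, my_value):
    # Leader rule: use the value accepted at the highest ballot, else its own value
    accepted = [(b, v) for status, b, v in phase1_replies if status == "ok" and v is not None]
    return max(accepted, key=lambda bv: bv[0])[1] if accepted else my_value

def is_majority(ok_count: int, n_processes: int) -> bool:
    return ok_count > n_processes // 2   # any two majorities (quorums) intersect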
Which is the point of No-Return?

• That is, when is consensus reached in the system

Which is the point of No-Return?
• If/when a majority of processes hear proposed value and
accept it (i.e., are about to/have respond(ed) with an OK!)
• Processes may not know it yet, but a decision has been made
for the group
– Even leader does not know it yet
• What if leader fails after that?
– Keep having rounds until some round completes
Safety
• If some round has a majority (i.e., quorum) hearing proposed value v’
and accepting it, then subsequently at each round either: 1) the round
chooses v’ as decision or 2) the round fails
• Proof:
– Potential leader waits for majority of OKs in Phase 1
– At least one will contain v’ (because two majorities or quorums always
intersect)
– It will choose to send out v’ in Phase 2
• Success requires a majority, and any two majority sets intersect
What could go Wrong?
• Process fails
– Majority does not include it
– When process restarts, it uses log to retrieve a past decision (if any) and past-seen ballot ids. Tries to know of
past decisions.
• Leader fails
– Start another round
• Messages dropped
– If too flaky, just start another round
• Note that anyone can start a round any time
• Protocol may never end – tough luck, buddy!
– Impossibility result not violated
– If things go well sometime in the future, consensus reached

What could go Wrong?
• A lot more!

• This is a highly simplified view of Paxos.


• See Lamport’s original paper:
http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf
Microservices – Design Patterns



Software Architecture Evolution



Core Guiding Goals

Scalability: ability to scale horizontally (cloning) and vertically (functional decomposition)

DevOps & Automation: auto-scale at the infra and service level; CI/CD for code deployments

Speed: faster time to market

Monitoring: distributed systems mandate monitoring and auto recovery

Cohesion (Independent & Autonomous): do one thing, and do it well

Resiliency: recover gracefully from errors, both during start-up and after crashes


Decomposition - deployment

Strangler (Brownfield Development)
- Transform, Coexist, Eliminate: new microservices are built behind a strangler facade (switch / load balancer) and coexist with the legacy system until it is eliminated
[Figure: strangler facade routing requests between the legacy system and the new microservices]

Sidecar (e.g. https://istio.io/, https://linkerd.io/)
- Infra layer within your architecture
1. Observability and tracing
2. Encryption and mutual TLS between services
3. Service discovery & load balancing
4. Controlling traffic flow by applying routing rules
5. Support for canary and blue/green deployment


Cross cutting concerns

Common patterns used in all Microservices:

01 External Configuration
Externalize all config and override it based on environment. Can use Spring Cloud Bus to dynamically update “@ConfigurationProperties” and “@RefreshScope” beans.

02 Service Discovery
Register & discover services; follows cloud-native architecture to isolate clients from tight coupling to a service implementation. E.g. Eureka, Consul, K8s Services, etc.

03 Circuit Breaker
Handle inter-service failures: when the number of consecutive failures crosses a threshold, the circuit breaker trips, and for the duration of a timeout period all attempts to invoke the remote service fail immediately. (A minimal sketch follows below.)

04 Security
API-first approach indicates that each endpoint should be protected based on the authorization schema. Use OAuth2 and JWT/headers to propagate the user authentication principal.
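A minimal circuit-breaker sketch matching the description above; the threshold, timeout, and call() interface are illustrative choices, not any particular library's API.

import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.consecutive_failures = 0
        self.opened_at = None                    # None means the circuit is closed

    def call(self, remote_fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")   # fail immediately
            self.opened_at = None                # timeout elapsed: allow a trial call
        try:
            result = remote_fn(*args, **kwargs)
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = time.time()     # threshold crossed: trip the breaker
            raise
        self.consecutive_failures = 0            # a success closes the breaker again
        return result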


Database patterns

Three patterns: Database per Service, Shared Database, and Event Sourcing with CQRS.

[Figure: Event Sourcing and CQRS for Users, Tasks, and Comments entities: the write API (the C, U, D of CRUD) appends events to an event store and event queue; an event handler consumes the events and updates the application store, which serves the read API (the R of CRUD)]


Integration – Chained Microservices

[Figure: three ways of chaining microservices for an order flow, each with experience services (Place Order, Order Summary), domain services (Products, Payments, Orders, Shipping), and system services (Authentication, Audit Trail):
1. Synchronous call state management: services invoke each other directly
2. Message-oriented state management: services exchange messages through a message queue
3. Event-driven state management: services publish events (orderPlaced, paymentMade, paymentDeclined, readyToShip, etc.) to an event store and react as events propagate]


Integration – API Gateway

Common front for all APIs

[Figure: browser clients, mobile app clients, and API clients (3rd-party systems) reach the platform APIs (Scheduler, Config Svc, AuthN/AuthZ, Reporting, Service Proxy /api/*), the experience APIs (Dashboard, Order Summary, Cart, Approval List: the experience layer, aggregators, orchestrators), and the domain/core APIs (Payments, Account Statement, Products, Orders, User: fine grained at the domain-entity level, some directly exposed) through an API Gateway]

1. Proxy service to route requests to the concerned microservice, abstracting out details of the actual data producer
2. Prevent access to services that are internal to the application
3. Can act as an aggregator: fan out requests to multiple services (in parallel or in sequence) and then aggregate the result back to the consumer
   • Either the API gateway can act as an aggregator
   • Or, depending on business complexity and the number of clients, we can have a dedicated aggregator service
4. Can also convert a protocol request (e.g. AMQP) to another protocol (e.g. HTTP) and vice versa so that the producer and consumer can handle it
5. Different options
   • Software gateways like Zuul & Spring Cloud Gateway
   • Tools like Apigee, IBM Cloud Gateway, Kong, etc.


Transaction Patterns - Saga

Handle transactions in distributed systems*

Choreography-based saga
[Figure: the order, inventory, payment, and shipping services coordinate by exchanging events such as Place Order, Order Pending, Inventory Reserved, Order Accepted / Order Rejected, Cancel Order, and Release Payment]

Orchestration-based saga
[Figure: an Order service acts as the saga orchestrator, invoking the inventory, payment, and shipping services in turn (Make Payment, Inventory Status) and driving the order from pending to accepted or rejected based on their success]

* As much as possible, model your bounded context to be self-contained and avoid distributed transactions
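A minimal sketch of an orchestration-based saga with compensating actions; the service objects and their reserve/charge/schedule/cancel methods are hypothetical stand-ins.

def run_order_saga(order, inventory, payment, shipping) -> str:
    # Each step is paired with a compensating action that undoes it.
    steps = [
        (inventory.reserve, inventory.release),   # reserve stock / release stock
        (payment.charge,    payment.refund),      # make payment / release payment
        (shipping.schedule, shipping.cancel),     # schedule shipment / cancel shipment
    ]
    done = []
    for do_step, compensate in steps:
        try:
            do_step(order)
            done.append(compensate)
        except Exception:
            for undo in reversed(done):           # run compensations in reverse order
                undo(order)
            return "Order Rejected"
    return "Order Accepted"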


Observation & Monitoring

Critical to troubleshoot issues with distributed systems

Log Aggregator: centralized logging service that aggregates logs from each service instance; can be used to search & analyze the logs and trigger alerts based on patterns. E.g. ELK / EFK stack, Amazon CloudWatch, Graylog, etc.

Perf Monitoring: monitor patterns so that application performance can be tracked & alerted on for issues. 1. APM tools like AppDynamics, New Relic, etc. have agents running with the microservice that PUSH data to the monitoring service. 2. Tools like Prometheus PULL metrics from the microservice.

Distributed Tracing: follow a single logical request (the trace) as it spans multiple services (each being a span) that get called; especially relevant for microservice chaining and aggregation. E.g. Zipkin, Jaeger, Appdash.

Health Check: provide a specific endpoint (such as /health) that reports whether the service is healthy or not. Critical when orchestrating services with a container orchestration platform like Kubernetes. E.g. Spring Actuator framework, socket-based ping checks, etc.
