CSE352 Lecture 9: Distributed Systems Design Intro
Distributed Systems Design Principles and Patterns: An Overview
1. DISTRIBUTED SYSTEM TYPES
[Figure: a taxonomy of distributed system types along three dimensions]
Control: from master-slave, through autonomous transaction-based, to fully distributed (autonomous, fully cooperative)
Data: from fully replicated, through not fully replicated (master directory), to local data (local directory)
Processors: homogeneous or heterogeneous; special purpose or general purpose
1. WHAT IS A DISTRIBUTED SYSTEM?
Concurrency of components
Lack of a global ‘clock’
1.2 DISTRIBUTED SYSTEM CHARACTERISTICS
Internet/World-Wide Web
2.1 LOCAL AREA NETWORK
[Figure: a typical local area network: desktop computers, an email server, a Web server, a file server, and print and other servers, connected by a local area network to the rest of the Internet through a router/firewall]
2.2 DATABASE MANAGEMENT SYSTEM
2.3 AUTOMATIC TELLER MACHINE NETWORK
2.4 INTERNET
[Figure: the Internet: intranets connected through ISPs to backbones, including a satellite link; key: desktop computer, server, network link]
2.4.1 WORLD-WIDE-WEB
2.4.2 WEB SERVERS AND WEB BROWSERS
[Figure: browsers issuing HTTP requests across the Internet to Web servers: e.g., http://www.google.com/search?q=lyu to www.google.com, http://www.uu.se/ to www.uu.se, and Activity.html on www.w3c.org]
2.5 MOBILE AND UBIQUITOUS COMPUTING
[Figure: a mobile phone, laptop, camera, and printer at a host site, connected to the Internet]
3. COMMON CHARACTERISTICS
3.1 HETEROGENEITY
Variety and differences in
Networks
Computer hardware
Operating systems
Programming languages
Implementations by different developers
Middleware: software layers that provide a programming abstraction and mask the heterogeneity of the underlying networks, hardware, operating systems, and programming languages (e.g., CORBA).
Mobile code: code that can be sent from one computer to another and run at the destination (e.g., Java applets and the Java virtual machine).
3.2 OPENNESS
Openness is concerned with extensions and
improvements of distributed systems.
Detailed interfaces of components need to be
published.
New components have to be integrated with existing
components.
Differences in data representation of interface types
on different processors (of different vendors) have to
be resolved.
3.3 SECURITY
In a distributed system, clients send requests to
access data managed by servers, resources in the
networks:
Doctors requesting records from hospitals
Users purchasing products through electronic commerce
Security is required for:
Concealing the contents of messages: security and privacy
Identifying a remote user or other agent correctly
(authentication)
New challenges:
Denial of service attack
Security of mobile code
3.4 SCALABILITY
Adaptation of distributed systems to
accommodate more users
respond faster (this is the hard one)
3.5 FAILURE HANDLING (FAULT TOLERANCE)
3.6 CONCURRENCY
Components in distributed systems are executed in
concurrent processes.
Components access and update shared resources (e.g.
variables, databases, device drivers).
Integrity of the system may be violated if concurrent
updates are not coordinated.
Lost updates
Inconsistent analysis
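To make the lost-update hazard concrete, here is a minimal sketch (Python; the shared counter and function names are illustrative, not from the slides). Two threads performing uncoordinated read-modify-write cycles can interleave so that one update overwrites the other; the lock restores integrity.

    import threading

    balance = 0
    lock = threading.Lock()

    def deposit_unsafe(amount, times):
        global balance
        for _ in range(times):
            current = balance            # read
            balance = current + amount   # write: a concurrent update made in between is lost

    def deposit_safe(amount, times):
        global balance
        for _ in range(times):
            with lock:                   # coordinated update: no lost updates
                balance += amount

    threads = [threading.Thread(target=deposit_safe, args=(1, 100_000)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(balance)  # always 200000 with deposit_safe; often less with deposit_unsafe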
3.7 TRANSPARENCY
Distributed systems should be perceived by users and
application programmers as a whole rather than as a
collection of cooperating components.
Transparency has different aspects.
These represent various properties that distributed
systems should have.
4. BASIC DESIGN ISSUES
General software engineering principles include
rigor and formality, separation of concerns,
modularity, abstraction, anticipation of
change, …
Specific issues for distributed systems:
Naming
Communication
Software structure
System architecture
Workload allocation
Consistency maintenance
4.1 NAMING
A name is resolved when it is translated into an interpretable form used to reference the resource/object.
Communication identifier (IP address + port number)
Name resolution involves several translation steps
Design considerations
Choice of name space for each resource type
Name service to resolve resource names to comm. id.
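As an illustrative sketch of the last step of name resolution (Python; the host and service names are arbitrary examples): the system resolver maps a human-readable name plus service to the communication identifier (IP address + port number) mentioned above.

    import socket

    # Resolve (hostname, service) to communication identifiers.
    for family, type_, proto, canonname, sockaddr in socket.getaddrinfo("www.example.com", "http"):
        print(sockaddr)  # e.g., ('93.184.216.34', 80): IP address + port number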
4.2 COMMUNICATION
Separated components communicate by passing messages between sending processes and receiving processes, for data transfer and synchronization.
Message passing: send and receive primitives
synchronous or blocking
asynchronous or non-blocking
Abstractions defined: channels, sockets, ports.
Communication patterns: client-server communication
(e.g., RPC, function shipping) and group multicast
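A small sketch of the two receive styles at the socket level (Python; the target host is an arbitrary example): recv() is synchronous/blocking by default, while a non-blocking socket returns control immediately and raises BlockingIOError when no data has arrived yet.

    import socket

    sock = socket.create_connection(("example.com", 80))
    sock.sendall(b"GET / HTTP/1.0\r\nHost: example.com\r\n\r\n")  # send primitive

    # Synchronous (blocking) receive: the caller waits until data arrives.
    data = sock.recv(4096)

    # Asynchronous (non-blocking) receive: returns immediately.
    sock.setblocking(False)
    try:
        more = sock.recv(4096)
    except BlockingIOError:
        more = b""  # nothing available yet; the caller can do other work
    sock.close()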
4.3 SOFTWARE STRUCTURE
Layers in centralized computer systems:
Applications
Middleware
Operating system
Computer and Network Hardware
4.3 SOFTWARE STRUCTURE
Layers and dependencies in distributed systems:
Applications
Open services
Distributed programming support
Network computers
4.4.1 CLIENTS INVOKE INDIVIDUAL SERVERS
[Figure: client processes sending invocations to a server process and receiving results; key: process, computer]
4.4.2 PEER-TO-PEER SYSTEMS
Peer 2
Peer 1
Application
Application
Sharable Peer 3
objects
Application
Peer 4
Application
Peers 5 .... N
37
4.4.3 A SERVICE BY MULTIPLE SERVERS
[Figure: clients invoking a service that is provided by multiple server processes]
4.4.4 WEB PROXY SERVER
[Figure: clients reaching Web servers through an intermediate proxy server]
4.4.5 WEB APPLETS
[Figure: a) a client request results in the downloading of applet code from a Web server; b) the client then interacts with the applet running locally]
4.4.6 THIN CLIENTS AND COMPUTE SERVERS
[Figure: a thin client (network computer or PC) invoking applications that run on a compute server]
Two Different System Models
❑ Synchronous Distributed System
❑ Each message is received within bounded time
❑ Each step in a process takes lb < time < ub
❑ (Each local clock’s drift has a known bound)
❑Asynchronous Distributed System
❑ No bounds on process execution
❑ No bounds on message transmission delays
❑ (The drift of a clock is arbitrary)
The Internet is an asynchronous distributed system
Failure Model
❖Process omission failure
❖ Crash-stop (fail-stop) – a process halts and does not
execute any further operations
❖ Crash-recovery – a process halts, but then recovers
(reboots) after a while
❖Crash-stop failures can be detected in
synchronous systems
❖Next: detecting crash-stop failures in
asynchronous systems
[Figure: a network partition caused by a crashed router]
What’s a failure detector?
[Figure: processes pi and pj; pj suffers a crash-stop failure (X), and pi needs to know about pj’s failure]
I. Ping-Ack Protocol
pi needs to know about pj’s failure. If pj fails, within T time units pi will send it a ping message, and will time out within another T time units. Detection time = 2T.
[Figure: pi sends ping to pj; pj replies with ack]
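A minimal sketch of the ping-ack logic (Python; send_ping and recv_ack are hypothetical transport helpers, not a real API): pi suspects pj if no ack arrives within T, giving the worst-case detection time of 2T described above.

    T = 1.0  # the protocol's timeout parameter, in seconds (illustrative value)

    def ping_ack_suspects(send_ping, recv_ack):
        """Return True if pj should be suspected of having crashed.

        send_ping() sends a ping to pj; recv_ack(timeout) is assumed to
        block for up to `timeout` seconds and return True iff an ack arrived.
        """
        send_ping()                  # pi -> pj: ping (sent within T of a failure)
        if recv_ack(timeout=T):      # pj -> pi: ack
            return False             # pj answered: not suspected
        return True                  # no ack within T: suspect pj (detection <= 2T)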
II. Heartbeating Protocol
[Figure: pj periodically sends heartbeat messages to pi]
• Safety: the nodes agree on a valid value proposed by one of the nodes.
• Liveness: the nodes eventually reach agreement, i.e., the system must make progress.
• Fault-tolerance: the solution must work even when there are failures in the network.
• The FLP result roughly states that any solution to distributed consensus in the asynchronous model can have at most 2 of these 3 properties. In other words, no algorithm can be safe, live, and fault-tolerant at the same time.
Centralized heart-beating: all heartbeats are sent to a single process pi.
Downside? Needs a separate dissemination component.
Ring Heart-beating
[Figure: each process pj sends heartbeats (pj, Heartbeat Seq. l++) around the ring to its neighbor pi]
Gossip-style heart-beating: each process pi maintains an array of (member, Heartbeat Seq. l) entries for a member subset; every tg units (= gossip period), it sends an O(N) gossip message.
All-to-All Heart-beating
[Figure: every process pj sends heartbeats (pj, Heartbeat Seq. l++) to all other processes, including pi]
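A sketch of the bookkeeping shared by the centralized, ring, all-to-all, and gossip-style variants (Python; names illustrative, not from the slides): each monitored member has an entry holding the highest heartbeat sequence number seen and when it last increased, and a member is suspected once its entry goes stale. In the gossip variant, a received array is merged entry by entry through on_heartbeat.

    import time

    class HeartbeatTable:
        def __init__(self, timeout):
            self.timeout = timeout   # seconds without a fresh heartbeat before suspicion
            self.entries = {}        # member -> (heartbeat seq, local time it last increased)

        def on_heartbeat(self, member, seq):
            old = self.entries.get(member)
            if old is None or seq > old[0]:          # only newer sequence numbers count
                self.entries[member] = (seq, time.monotonic())

        def suspects(self):
            now = time.monotonic()
            return [m for m, (_, t) in self.entries.items()
                    if now - t > self.timeout]       # entry went stale: suspect the member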
(The following slides are adapted from Database System Concepts, 6th Edition, ©Silberschatz, Korth and Sudarshan.)
Single-Lock-Manager Approach
System maintains a single lock manager that resides in a single
chosen site, say Si
When a transaction needs to lock a data item, it sends a lock request
to Si and lock manager determines whether the lock can be granted
immediately
If yes, lock manager sends a message to the site which initiated
the request
If no, request is delayed until it can be granted, at which time a
message is sent to the initiating site
Single-Lock-Manager Approach (Cont.)
The transaction can read the data item from any one of the sites at which a
replica of the data item resides.
Writes must be performed on all replicas of a data item
Advantages of scheme:
Simple implementation
Simple deadlock handling
Disadvantages of scheme are:
Bottleneck: lock manager site becomes a bottleneck
Vulnerability: system is vulnerable to lock manager site failure.
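A toy sketch of the single-lock-manager bookkeeping (Python; exclusive locks only, and the message transport between sites is omitted, so this is an illustration of the idea, not an implementation): a request that cannot be granted immediately is queued, and is granted when the holder unlocks.

    from collections import deque

    class SingleLockManager:
        """Runs at the chosen site Si; every site sends lock/unlock requests here."""
        def __init__(self):
            self.holder = {}    # data item -> site currently holding its lock
            self.waiting = {}   # data item -> queue of sites whose requests are delayed

        def lock(self, item, site):
            if item not in self.holder:
                self.holder[item] = site
                return True     # granted: message sent back to the initiating site
            self.waiting.setdefault(item, deque()).append(site)
            return False        # delayed until the lock can be granted

        def unlock(self, item, site):
            assert self.holder.get(item) == site
            queue = self.waiting.get(item)
            if queue:
                self.holder[item] = queue.popleft()  # grant to the next waiter
            else:
                del self.holder[item]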
Distributed Lock Manager
In this approach, functionality of locking is implemented by lock managers at
each site
Lock managers control access to local data items
But special protocols may be used for replicas
Advantage: work is distributed and can be made robust to failures
Disadvantage: deadlock detection is more complicated
Lock managers cooperate for deadlock detection
More on this later
Several variants of this approach
Primary copy
Majority protocol
Biased protocol
Quorum consensus
Primary Copy
Choose one replica of data item to be the primary copy.
Site containing the replica is called the primary site for that data item
Different data items can have different primary sites
When a transaction needs to lock a data item Q, it requests a lock at the
primary site of Q.
Implicitly gets lock on all replicas of the data item
Benefit
Concurrency control for replicated data handled similarly to unreplicated
data - simple implementation.
Drawback
If the primary site of Q fails, Q is inaccessible even though other sites
containing a replica may be accessible.
Majority Protocol
Local lock manager at each site administers lock and unlock requests for
data items stored at that site.
When a transaction wishes to lock an unreplicated data item Q residing at
site Si, a message is sent to Si's lock manager.
If Q is locked in an incompatible mode, then the request is delayed until
it can be granted.
When the lock request can be granted, the lock manager sends a
message back to the initiator indicating that the lock request has been
granted.
Majority Protocol (Cont.)
In case of replicated data
If Q is replicated at n sites, then a lock request message must be sent to
more than half of the n sites in which Q is stored.
The transaction does not operate on Q until it has obtained a lock on a
majority of the replicas of Q.
When writing the data item, transaction performs writes on all replicas.
Benefit
Can be used even when some sites are unavailable
details on how to handle writes in the presence of site failures later
Drawback
Requires 2(n/2 + 1) messages for handling lock requests, and (n/2 + 1) messages for handling unlock requests.
Potential for deadlock even with a single item: e.g., each of 3 transactions may hold locks on 1/3rd of the replicas of a data item.
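A sketch of the majority rule for one replicated item (Python; request_lock and release_lock are hypothetical per-site RPCs): the transaction may operate on Q only after more than half of the n replica sites grant the lock.

    def majority_lock(item, replica_sites, request_lock, release_lock):
        """Try to lock `item` at a majority of its replica sites."""
        needed = len(replica_sites) // 2 + 1   # more than half of the n sites
        granted = [s for s in replica_sites if request_lock(s, item)]
        if len(granted) >= needed:
            return granted                     # majority held: the transaction may proceed
        for s in granted:                      # no majority: release and retry later
            release_lock(s, item)              # (this is where the deadlock risk arises)
        return None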
Biased Protocol
Local lock manager at each site, as in the majority protocol; however, requests for shared locks are handled differently from requests for exclusive locks.
Shared locks. When a transaction needs to lock data item Q, it simply
requests a lock on Q from the lock manager at one site containing a replica
of Q.
Exclusive locks. When transaction needs to lock data item Q, it requests a
lock on Q from the lock manager at all sites containing a replica of Q.
Advantage - imposes less overhead on read operations.
Disadvantage - additional overhead on writes
Quorum Consensus Protocol
A generalization of both majority and biased protocols
Each site is assigned a weight.
Let S be the total of all site weights
Choose two values: read quorum Qr and write quorum Qw
Such that Qr + Qw > S and 2 * Qw > S
Quorums can be chosen (and S computed) separately for each item
Each read must lock enough replicas that the sum of the site weights is >=
Qr
Each write must lock enough replicas that the sum of the site weights is >=
Qw
For now we assume all replicas are written
Extensions to allow some sites to be unavailable described later
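A sketch of the quorum arithmetic (Python; illustrative): the two conditions guarantee that every read quorum intersects every write quorum, and that any two write quorums intersect.

    def valid_quorums(weights, q_r, q_w):
        """Check the quorum consensus conditions for one data item."""
        S = sum(weights.values())      # total of all site weights
        return q_r + q_w > S and 2 * q_w > S

    def gather_quorum(weights, quorum):
        """Pick replica sites until their weights sum to at least the quorum."""
        chosen, total = [], 0
        for site, w in sorted(weights.items(), key=lambda kv: -kv[1]):
            chosen.append(site)
            total += w
            if total >= quorum:
                return chosen
        return None                    # not enough weight available

    # Example: 3 sites of weight 1 with Qr = Qw = 2 behaves like the majority protocol.
    assert valid_quorums({"s1": 1, "s2": 1, "s3": 1}, 2, 2)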
Timestamping
Timestamp based concurrency-control protocols can be used in
distributed systems
Each transaction must be given a unique timestamp
Main problem: how to generate a timestamp in a distributed fashion
Each site generates a unique local timestamp using either a logical
counter or the local clock.
A globally unique timestamp is obtained by concatenating the unique local timestamp with the site's unique identifier.
Timestamping (Cont.)
A site with a slow clock will assign smaller timestamps
Still logically correct: serializability not affected
But: it “disadvantages” transactions from that site
To fix this problem
Define within each site Si a logical clock (LCi), which generates the
unique local timestamp
Require that Si advance its logical clock whenever a request is received from a transaction Ti with timestamp <x, y> and x is greater than the current value of LCi.
In this case, site Si advances its logical clock to the value x + 1.
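A sketch of the scheme (Python; illustrative names): a global timestamp is the pair <x, y> of local logical clock value and site identifier, compared lexicographically, and on_request implements the fix by advancing LCi past any larger x it sees.

    class TimestampSite:
        def __init__(self, site_id):
            self.site_id = site_id
            self.lc = 0                       # the logical clock LCi

        def new_timestamp(self):
            self.lc += 1
            return (self.lc, self.site_id)    # <x, y>, compared lexicographically

        def on_request(self, ts):
            x, _ = ts                         # timestamp of an incoming request
            if x > self.lc:                   # x exceeds the current value of LCi
                self.lc = x + 1               # advance the logical clock to x + 1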
Replication with Weak Consistency
Many commercial databases support replication of data with weak degrees of consistency (i.e., without a guarantee of serializability)
E.g.: master-slave replication: updates are performed at a single “master”
site, and propagated to “slave” sites.
Propagation is not part of the update transaction: it is decoupled
May be immediately after transaction commits
May be periodic
Data may only be read at slave sites, not updated
No need to obtain locks at any remote site
Particularly useful for distributing information
E.g. from central office to branch-office
Also useful for running read-only queries offline from the main database
Multimaster and Lazy Replication
With multimaster replication (also called update-anywhere replication)
updates are permitted at any replica, and are automatically propagated to all
replicas
Basic model in distributed databases, where transactions are unaware of
the details of replication, and database system propagates updates as
part of the same transaction
Coupled with two-phase commit
Many systems support lazy propagation where updates are transmitted
after transaction commits
Allows updates to occur even if some sites are disconnected from the
network, but at the cost of consistency
Trading Consistency for Availability
What is Consistency?
Consistency in Databases (ACID):
Database has a set of integrity constraints
A consistent database state is one where all integrity constraints are
satisfied
Each transaction run individually on a consistent database state must
leave the database in a consistent state
Consistency in distributed systems with replication
Strong consistency: a schedule with read and write operations on a
replicated object should give results and final state equivalent to some
schedule on a single copy of the object, with order of operations from a
single site preserved
Weak consistency (several forms)
Availability
Traditionally, availability of centralized server
For distributed systems, availability of system to process requests
For a large system, at almost any point in time there’s a good chance that a node is down, or even that the network has partitioned
Distributed consensus algorithms will block during partitions to ensure
consistency
Many applications require continued operation even during a network
partition
Even at cost of consistency
Brewer’s CAP Theorem
◼ Three properties of a system
❑ Consistency (all copies have same value)
❑ Availability (system can run even if parts have failed)
❑ Via replication
❑ Partitions (network can break into two or more parts, each with active systems that can’t talk to other parts)
◼ Brewer’s CAP “Theorem”: You can have at most two of these three
properties for any system
◼ Very large systems will partition at some point
➔Choose one of consistency or availability
❑ Traditional databases choose consistency
Eventual Consistency
Availability vs Latency
Distributed Directory Systems
Directory Systems
Typical kinds of directory information
Employee information such as name, id, email, phone, office addr, ..
Even personal information to be accessed from multiple places
e.g. Web browser bookmarks
White pages
Entries organized by name or identifier
Meant for forward lookup to find more about an entry
Yellow pages
Entries organized by properties
For reverse lookup to find entries matching specific requirements
When directories are to be accessed across an organization
Alternative 1: Web interface. Not great for programs
Alternative 2: Specialized directory access protocols
Coupled with specialized user interfaces
Directory Access Protocols
Most commonly used directory access protocol:
LDAP (Lightweight Directory Access Protocol)
Simplified from earlier X.500 protocol
Question: Why not use database protocols like ODBC/JDBC?
Answer:
Simplified protocols for a limited type of data access, evolved in parallel with ODBC/JDBC
Provide a nice hierarchical naming mechanism similar to file system
directories
Data can be partitioned amongst multiple servers for different parts of
the hierarchy, yet give a single view to user
– E.g. different servers for Bell Labs Murray Hill and Bell Labs
Bangalore
Directories may use databases as storage mechanism
LDAP Data Model
LDAP directories store entries
Entries are similar to objects
Each entry must have unique distinguished name (DN)
DN made up of a sequence of relative distinguished names (RDNs)
E.g. of a DN
cn=Silberschatz, ou=Bell Labs, o=Lucent, c=USA
Standard RDNs (can be specified as part of schema)
cn: common name ou: organizational unit
o: organization c: country
Similar to paths in a file system but written in reverse direction
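A small sketch (Python; illustrative, and it ignores escaped commas) that splits a DN into its RDNs and reverses them into the file-system-style path mentioned above.

    def parse_dn(dn):
        """Split a distinguished name into (type, value) RDN pairs."""
        rdns = []
        for part in dn.split(","):
            attr, _, value = part.strip().partition("=")
            rdns.append((attr, value))
        return rdns

    rdns = parse_dn("cn=Silberschatz, ou=Bell Labs, o=Lucent, c=USA")
    # Reverse the RDNs so the name reads like a path from the directory root:
    print("/".join(f"{a}={v}" for a, v in reversed(rdns)))
    # -> c=USA/o=Lucent/ou=Bell Labs/cn=Silberschatz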
LDAP Data Model (Cont.)
Entries can have attributes
Attributes are multi-valued by default
LDAP has several built-in types
Binary, string, time types
Tel: telephone number PostalAddress: postal address
LDAP allows definition of object classes
Object classes specify attribute names and types
Can use inheritance to define object classes
Entry can be specified to be of one or more object classes
No need to have single most-specific type
LDAP Data Model (cont.)
Entries organized into a directory information tree according to their DNs
Leaf-level entries usually represent specific objects
Internal node entries represent objects such as organizational units,
organizations or countries
Children of a node inherit the DN of the parent, and add on RDNs
E.g. internal node with DN c=USA
– Children nodes have DN starting with c=USA and further RDNs
such as o or ou
DN of an entry can be generated by traversing path from root
Leaf level can be an alias pointing to another entry
Entries can thus have more than one DN
– E.g. person in more than one organizational unit
What is Consensus?
Formal problem statement
• N processes
• Each process p has
   input variable xp : initially either 0 or 1
   output variable yp : initially b (b = undecided; can be changed only once)
• Consensus problem: design a protocol so that at the end, either:
   1. All processes set their output variables to 0 (all-0’s)
   2. Or all processes set their output variables to 1 (all-1’s)
What is Consensus? (2)
• Every process contributes a value
• Goal is to have all processes decide same (some) value
– Decision once made can’t be changed
• There might be other constraints
– Validity = if everyone proposes same value, then that’s
what’s decided
– Integrity = decided value must have been proposed by
some process
– Non-triviality = there is at least one initial system state
that leads to each of the all-0’s or all-1’s outcomes
Why is it Important?
• Many problems in distributed systems are equivalent to
(or harder than) consensus!
– Perfect Failure Detection
– Leader election (select exactly one leader, and every alive
process knows about it)
– Agreement (harder than consensus)
• Consensus is
– Possible to solve in synchronous systems
– Impossible to solve in asynchronous systems
Can’t we just solve Consensus?
• Yes, we can!
• (Whut?)
Yes we Can!
•Paxos algorithm
– Most popular “consensus-solving” algorithm
– Does not solve consensus problem (which
would be impossible, because we already
proved that)
– But provides safety and eventual liveness
– A lot of systems use it
• Zookeeper (Yahoo!), Google Chubby, and
many other companies
•Paxos invented by? (take a guess)
Yes we Can!
• Paxos invented by Leslie Lamport
Phase 2 – Proposal (Bill)
• Leader sends proposed value v to all
– use v=v’ if some process already decided in a previous
round and sent you its decided value v’
– If multiple such v’ received, use latest one
• Recipient logs on disk; responds OK
[Figure: message timeline: the leader sends “Please elect me!” and gets OKs (Phase 1, election); it then sends “Value v ok?” and gets OKs (Phase 2)]
Phase 3 – Decision (Law)
• If leader hears a majority of OKs, it lets everyone know of the
decision
• Recipients receive decision, log it on disk
[Figure: message timeline: “Please elect me!” with OKs, “Value v ok?” with OKs, then the decision “v!” announced to all]
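A much-simplified, single-machine sketch of one Paxos round as the three phases above describe it (Python; in-memory calls stand in for network messages and the disk log, so this is an illustration, not a faithful implementation): the leader gathers a majority of promises, proposes v (or the latest previously accepted v'), and the value is decided once a majority respond OK.

    class Acceptor:
        def __init__(self):
            self.promised = -1       # highest ballot this acceptor promised
            self.accepted = None     # (ballot, value) "logged on disk"

        def prepare(self, ballot):               # Phase 1: "Please elect me!"
            if ballot > self.promised:
                self.promised = ballot
                return ("OK", self.accepted)
            return ("NO", None)

        def propose(self, ballot, value):        # Phase 2: "Value v ok?"
            if ballot >= self.promised:
                self.promised = ballot
                self.accepted = (ballot, value)  # log on disk; respond OK
                return "OK"
            return "NO"

    def run_round(acceptors, ballot, v):
        # Phase 1 (Election): gather promises from a majority.
        replies = [a.prepare(ballot) for a in acceptors]
        promises = [acc for ok, acc in replies if ok == "OK"]
        if len(promises) <= len(acceptors) // 2:
            return None                          # no majority: this round fails
        # Use v' from the latest (highest-ballot) previous acceptance, if any.
        prior = [acc for acc in promises if acc is not None]
        if prior:
            v = max(prior)[1]
        # Phase 2 (Proposal): send v to all; count OKs.
        oks = sum(1 for a in acceptors if a.propose(ballot, v) == "OK")
        if oks > len(acceptors) // 2:
            return v                             # Phase 3 (Decision): announce v!
        return None

    print(run_round([Acceptor() for _ in range(5)], ballot=1, v="x"))  # -> x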
Which is the point of No-Return?
Which is the point of No-Return?
• If/when a majority of processes hear the proposed value and accept it (i.e., are about to respond, or have responded, with an OK!)
• Processes may not know it yet, but a decision has been made
for the group
– Even leader does not know it yet
• What if leader fails after that?
– Keep having rounds until some round completes
Safety
• If some round has a majority (i.e., quorum) hearing proposed value v’
and accepting it, then subsequently at each round either: 1) the round
chooses v’ as decision or 2) the round fails
• Proof:
– Potential leader waits for majority of OKs in Phase 1
– At least one will contain v’ (because two majorities or quorums always
intersect)
– It will choose to send out v’ in Phase 2
• Success requires a majority, and any two majority sets intersect
What could go Wrong?
• Process fails
– Majority does not include it
– When process restarts, it uses log to retrieve a past decision (if any) and past-seen ballot ids. Tries to know of
past decisions.
• Leader fails
– Start another round
• Messages dropped
– If too flaky, just start another round
• Note that anyone can start a round any time
• Protocol may never end – tough luck, buddy!
– Impossibility result not violated
– If things go well sometime in the future, consensus reached
What could go Wrong?
• A lot more!
Microservices – Design Patterns
Scalability
Ability to scale horizontally (cloning) and vertically (functional decomposition)
Resiliency
Recover gracefully from errors – both
during start-up and crashes
[Figure: microservices calling legacy and external services, annotated with the four concerns below: Configuration, Discovery, Circuit Breaker, Security]
01 External Configuration
Externalize all config – override based on environment. Can use Spring Cloud Bus to dynamically update “@ConfigurationProperties” and “@RefreshScope” beans.
02 Service Discovery
Register & discover services – follows cloud-native architecture to isolate clients from tight coupling to services. Implementations – Eureka, Consul, K8S services, etc.
03 Circuit Breaker
Handle inter-service failures – when the number of consecutive failures crosses a threshold, the circuit breaker trips, and for the duration of a timeout period, all attempts to invoke the remote service will fail immediately.
04 Security
API-first approach indicates that each endpoint should be protected based on the authorization schema. Use OAuth2 and JWT/Header to propagate the user authentication principal.
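A minimal sketch of the circuit-breaker behavior described in 03 (Python; the threshold and timeout values are illustrative defaults): after `threshold` consecutive failures the breaker trips open, and until the timeout period elapses every call fails immediately without contacting the remote service.

    import time

    class CircuitBreaker:
        def __init__(self, threshold=5, timeout=30.0):
            self.threshold = threshold   # consecutive failures before tripping
            self.timeout = timeout       # seconds the breaker stays open
            self.failures = 0
            self.opened_at = None

        def call(self, remote_fn, *args):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.timeout:
                    raise RuntimeError("circuit open: failing fast")
                self.opened_at = None    # timeout elapsed: allow a trial call
            try:
                result = remote_fn(*args)
            except Exception:
                self.failures += 1
                if self.failures >= self.threshold:
                    self.opened_at = time.monotonic()  # trip the breaker
                raise
            self.failures = 0            # a success resets the failure count
            return result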
[Figure: three approaches to managing state across services (Users, Tasks, Comments): synchronous call state management; message-oriented state management through an event queue and event handler; and event-driven state management through an event store and application store]
[Figure: a Place Order / Order Summary flow across experience- and domain-level services (Products, Payments, Orders, Shipping, Authentication System, Audit Trail), shown three ways: synchronous calls, a message queue, and event propagation through an event store with events such as orderPlaced, paymentMade, paymentDeclined, readyToShip]
Saga
[Figure: an order saga coordinating the Order, Payment, Shipping, and Inventory services: Make Payment, then check Inventory Status (pending / accepted / rejected); on success proceed to Shipping, otherwise compensate with Release Payment and Cancel Order]
* As much as possible, model your bounded contexts to be self-contained and avoid distributed transactions
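A toy orchestration sketch of the saga in the figure (Python; the step functions are hypothetical stand-ins for real service calls): every step is paired with a compensating action, and a failure runs the compensations of the completed steps in reverse order, e.g., Release Payment and Cancel Order.

    def run_saga(steps):
        """steps: list of (action, compensation) pairs of callables."""
        done = []
        try:
            for action, compensation in steps:
                action()
                done.append(compensation)
        except Exception:
            for compensation in reversed(done):  # undo completed steps, newest first
                compensation()
            raise

    # Hypothetical order saga, mirroring the figure:
    def make_payment():      print("payment made")
    def release_payment():   print("payment released")    # compensation
    def reserve_inventory(): print("inventory reserved")
    def release_inventory(): print("inventory released")  # compensation
    def ship_order():        print("order shipped")

    run_saga([
        (make_payment,      release_payment),
        (reserve_inventory, release_inventory),
        (ship_order,        lambda: print("order cancelled")),  # Cancel Order compensation
    ])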