Computer Science 425
Distributed Systems
CS 425 / ECE 428
Fall 2013
Indranil Gupta (Indy)
September 5, 2013
Lecture 4
Failure Detection
Reading: Section 15.1 and parts of 2.4.2
Your new datacenter
• You’ve been put in charge of a datacenter (think
of the Prineville Facebook DC), and your manager
has told you, “Oh no! We don’t have any failures
in our datacenter!”
• Do you believe him/her?
• What would be your first responsibility?
• Build a failure detector
• What are some things that could go wrong if you
didn’t do this?
Failures are the norm
… not the exception, in datacenters.
Say, the rate of failure of one machine
(OS/disk/motherboard/network, etc.) is once every
10 years (120 months) on average.
When you have 120 servers in the DC, the mean
time to failure (MTTF), i.e., the expected time until the
next machine fails, is 1 month.
When you have 12,000 servers in the DC, the MTTF
is only about 7.2 hours!
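To see the arithmetic (assuming failures are independent and occur at a constant average rate): N machines, each failing once per 120 months, together fail at a rate of N/120 failures per month, so the mean time to the next failure is 120/N months.
N = 120: 120/120 = 1 month
N = 12,000: 120/12,000 = 0.01 month ≈ 7.2 hours (0.01 x 30 days x 24 hours)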
To build a failure detector
• You have a few options
1. Hire 1000 people, each to monitor one machine in the
datacenter and report to you when it fails.
2. Write a failure detector program (distributed) that automatically
detects failures and reports to your workstation.
Which is preferable, and why?
Two Different System Models
Whenever someone gives you a distributed computing
problem, the first question you want to ask is, “What is the
system model under which I need to solve the problem?”
Synchronous Distributed System
Each message is received within bounded time
Each step in a process takes lb < time < ub
(Each local clock’s drift has a known bound)
Examples: Multiprocessor systems
Asynchronous Distributed System
No bounds on message transmission delays
No bounds on process execution
(The drift of a clock is arbitrary)
Examples: Internet, wireless networks, datacenters, most real
systems
Failure Model
Process omission failure
Crash-stop (fail-stop) – a process halts and does not
execute any further operations
Crash-recovery – a process halts, but then recovers
(reboots) after a while
Can be treated as a special case of the crash-stop model
(the recovered process rejoins under a new identifier)
We will focus on Crash-stop failures
They are easy to detect in synchronous
systems
Not so easy in asynchronous systems
What’s a failure detector?
[Figure, built up over three slides: processes pi and pj; pj crashes (a crash-stop failure, so pj is a failed process); pi, a non-faulty (alive) process, needs to know about pj’s failure]
There are two main flavors of Failure Detectors…
I. Ping-Ack Protocol
[Figure: pi sends a ping to pj; pj replies with an ack. pi needs to know about pj’s failure]
- pi queries pj with a ping once every T time units
- pj replies with an ack
- if pj does not respond within another T time units of being sent the ping, pi detects pj as failed
Worst case Detection time = 2T
(If pj fails, then within T time units pi will send it a ping message; pi will time out within another T time units.)
The waiting time ‘T’ can be parameterized.
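As a concrete illustration, here is a minimal sketch of pi’s side of ping-ack in Python. The transport stubs send_ping and await_ack are hypothetical placeholders, not part of the protocol as given:

    import time

    T = 5.0  # waiting period; a tunable parameter (seconds here)

    def monitor(pj, send_ping, await_ack):
        # pi's loop: ping pj once every T time units.
        while True:
            send_ping(pj)                       # pi queries pj
            if not await_ack(pj, timeout=T):    # no ack within T time units
                return "pj detected as failed"  # worst case: 2T after the crash
            time.sleep(T)                       # next ping T time units later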
II. Heartbeating Protocol
[Figure: pj sends heartbeats to pi. pi needs to know about pj’s failure]
- pj maintains a sequence number
- pj sends pi a heartbeat with incremented seq. number after every T time units
- if pi has not received a new heartbeat for the past, say, 3*T time units since it received the last heartbeat, then pi detects pj as failed
If T >> round trip time of messages, then worst case detection time ~ 3*T (why?)
The ‘3’ can be changed to any positive number since it is a parameter.
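A corresponding sketch of pi’s side of heartbeating, again in Python with a hypothetical recv_heartbeat stub (it returns the latest sequence number received, or None if nothing arrived within the short poll timeout):

    import time

    T = 5.0  # pj's heartbeat period
    K = 3    # timeout multiplier; the '3' above, itself a parameter

    def monitor(recv_heartbeat):
        last_seq, last_time = -1, time.time()
        while True:
            seq = recv_heartbeat(timeout=1.0)    # poll for a heartbeat
            now = time.time()
            if seq is not None and seq > last_seq:
                last_seq, last_time = seq, now   # new heartbeat resets the timer
            elif now - last_time > K * T:
                return "pj detected as failed"   # no new heartbeat for 3*T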
In a Synchronous System
• The Ping-ack and Heartbeat failure
detectors are always “correct”
– If a process pj fails, then pi will detect its failure
as long as pi itself is alive
• Why?
– Ping-ack: set waiting time ‘T’ to be > round-trip time
upper bound
» pi->pj latency + pj processing + pj->pi latency + pi
processing time
– Heartbeat: set waiting time ‘3*T’ to be > round-trip time
upper bound
Failure Detector Properties
• Completeness = every process failure is eventually
detected (no misses)
• Accuracy = every detected failure corresponds to a
crashed process (no mistakes)
• What is a protocol that is 100% complete?
• What is a protocol that is 100% accurate?
• Completeness and Accuracy
– Can both be guaranteed 100% in a synchronous distributed system
– Can never be guaranteed simultaneously in an asynchronous
distributed system
Why?
Satisfying both Completeness and Accuracy in Asynchronous Systems
• Impossible because of arbitrary message delays,
message losses
– If a heartbeat/ack is dropped (or several are dropped) from pj,
then pj will be mistakenly detected as failed => inaccurate
detection
– How large would the T waiting period in ping-ack, or the
3*T waiting period in heartbeating, need to be to obtain
100% accuracy?
– In asynchronous systems, delay/losses on a network link are
impossible to distinguish from a faulty process
• Heartbeating – satisfies completeness but not
accuracy (why?)
• Ping-Ack – satisfies completeness but not
accuracy (why?)
Completeness or Accuracy?
(in asynchronous system)
• Most failure detector implementations are willing
to tolerate some inaccuracy, but require 100%
Completeness
• Plenty of distributed apps designed assuming
100% completeness, e.g., p2p systems
– “Err on the side of caution”.
– Processes not “stuck” waiting for other processes
• But it’s ok to mistakenly detect once in a while,
since the victim process need only rejoin as a
new process and catch up
• Both Heartbeating and Ping-ack provide
– Probabilistic accuracy: for a process detected as failed, it has
actually crashed with some probability close to (but not equal
to) 1.0
Failure Detection in a Distributed System
• That was for one process pj being detected and
one process pi detecting failures
• Let’s extend it to an entire distributed system
• Difference from original failure detection is
– We want failure detection of not merely one process (pj), but of
all processes in the system
Centralized Heartbeating
[Figure: every other process pj sends heartbeats (pj, Heartbeat Seq. l++) to one central process pi]
Downside?
Ring Heartbeating
[Figure: processes arranged in a ring; each process pj sends heartbeats (pj, Heartbeat Seq. l++) to its neighbor(s) pi in the ring]
No SPOF (single point of failure)
Downside?
All-to-All Heartbeating
[Figure: every process pj sends heartbeats (pj, Heartbeat Seq. l++) to every other process pi]
Advantage: Everyone is able to keep track of everyone
Downside?
Efficiency of Failure Detector: Metrics
• Bandwidth: the number of messages sent in the
system during steady state (no failures)
– Small is good
• Detection Time
– Time between a process crash and its detection
– Small is good
• Scalability: How do bandwidth and detection
properties scale with N, the number of
processes?
• Accuracy
– High is good (low false positive rate)
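For example, all-to-all heartbeating costs N*(N-1) = O(N^2) messages per heartbeat period, whereas centralized heartbeating costs only N-1 = O(N) messages per period (at the price of a single point of failure).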
Accuracy metrics
• False Detection Rate/False Positive Rate
(inaccuracy)
– Multiple possible metrics
– 1. Average number of failures detected per second, when there
are in fact no failures
– 2. Fraction of failure detections that are false
• Tradeoffs: If you increase the T waiting period in
ping-ack, or the 3*T waiting period in heartbeating,
what happens to:
– Detection Time?
– False positive rate?
– Where would you set these waiting periods?
Membership Protocols
• Maintain a list of other alive (non-faulty)
processes at each process in the system
• Failure detector is a component in membership
protocol
– Failure of pj detected -> delete pj from membership list
– New machine joins -> pj sends message to everyone -> add pj
to membership list
• Flavors
– Strongly consistent: all membership lists identical at all times
(hard, may not scale)
– Weakly consistent: membership lists not identical at all times
– Eventually consistent: membership lists converge towards
becoming identical over time (scales well)
Gossip-style Membership
Good accuracy properties
[Figure: process pi holds an array of Heartbeat Seq. numbers, one per member of a subset of the membership]
Gossip-Style Failure Detection
[Figure: each node keeps a table of entries (Address, Heartbeat Counter, Time (local)); e.g., node 2, at local time 70 (clocks are asynchronous), merges an incoming list such as (1: 10118, 64), (2: 10110, 64), (3: 10090, 58), (4: 10111, 65) into its own, keeping the higher heartbeat counter for each address. Fig and animation by: Dongyun Jin and Thuy Nguyen]
Protocol
• Each process maintains a membership list
• Each process periodically increments its own heartbeat counter
• Each process periodically gossips its membership list
• On receipt, the heartbeats are merged, and local times are updated
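A minimal sketch of the merge step in Python (the dict-based table layout is an assumption; the slides specify only the merge rule):

    import time

    def merge(local, incoming):
        # Tables map address -> (heartbeat_counter, last_update_local_time).
        # Keep the higher heartbeat counter per member, and stamp any update
        # with the receiver's own local time, since clocks are asynchronous.
        now = time.time()
        for addr, (hb, _) in incoming.items():
            if addr not in local or hb > local[addr][0]:
                local[addr] = (hb, now)
        return local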
Gossip-Style Failure Detection
• Well-known result (you’ll see it in a few months)
– In a group of N processes, it takes O(log(N)) time for a
heartbeat update to propagate to everyone with high
probability
– Very robust against failures – even if a large number of
processes crash, most/all of the remaining processes still
receive all heartbeats
• Failure detection: If the heartbeat has not
increased for more than Tfail seconds,
the member is considered failed
– Tfail usually set to O(log(N)).
• But entry not deleted immediately: wait another
Tcleanup seconds (usually = Tfail)
• Why not delete it immediately after the Tfail
timeout?
Gossip-Style Failure Detection
• What if an entry pointing to a failed node is deleted
right after Tfail (=24) seconds?
[Figure: node 2 at local time 75; it has deleted node 1’s entry, but a gossip arriving from another node still carries node 1’s old entry (1: 10120, 66), so node 1 would be re-added to node 2’s list as if it were a fresh, alive member]
• Fix: remember the deleted entry for another Tfail seconds
Other Types of Failures
• Let’s discuss the other types of failures
• Failure detectors exist for them too (but we won’t
discuss those)
Processes and Channels
[Figure: process p performs send(m); the message passes through p’s outgoing message buffer, across the communication channel, into q’s incoming message buffer, where process q performs receive]
Other Failure Types
Communication omission failures
Send-omission: loss of messages between the sending
process and the outgoing message buffer (both inclusive)
What might cause this?
Channel omission: loss of message in the
communication channel
What might cause this?
Receive-omission: loss of messages between the
incoming message buffer and the receiving process (both
inclusive)
What might cause this?
Other Failure Types
Arbitrary failures
Arbitrary process failure: arbitrarily omits intended
processing steps or takes unintended processing
steps.
Arbitrary channel failures: messages may be
corrupted, duplicated, delivered out of order, incur
extremely large delays; or non-existent messages may
be delivered.
Above two are Byzantine failures, e.g., due to hackers,
man-in-the-middle attacks, viruses, worms, etc., and even
bugs in the code
A variety of Byzantine fault-tolerant protocols have been
designed in the literature!
Omission and Arbitrary Failures
Class of failure     Affects      Description
Fail-stop or         Process      Process halts and remains halted. Other processes
Crash-stop                        may detect this state.
Omission             Channel      A message inserted in an outgoing message buffer
                                  never arrives at the other end’s incoming message
                                  buffer.
Send-omission        Process      A process completes a send, but the message is not
                                  put in its outgoing message buffer.
Receive-omission     Process      A message is put in a process’s incoming message
                                  buffer, but that process does not receive it.
Arbitrary            Process or   Process/channel exhibits arbitrary behaviour: it may
(Byzantine)          channel      send/transmit arbitrary messages at arbitrary times,
                                  commit omissions; a process may stop or take an
                                  incorrect step.
Summary
• Failure Detectors
• Completeness and Accuracy
• Ping-ack and Heartbeating
• Gossip-style
• Suspicion, Membership
Next Week
• Reading for Next Two Lectures: Sections
14.1-14.5
– Time and Synchronization
– Global States and Snapshots
• HW1 already out, due Sep 19th
• MP1 already out, due 9/15: By now you
should
– Be in a group (send email to us at
[email protected] TODAY,
subject line: “425 MP group”), use Piazza to find partners
– Have a basic design.
– If you’ve already started coding, you’re doing well.
Optional - Suspicion
• Augment failure detection with suspicion count
• Ex: In all-to-all heartbeating, suspicion count =
number of machines that have timed out waiting
for heartbeats from a particular machine M
– When suspicion count crosses a threshold, declare M failed
– Issues: Who maintains this count? If distributed, need to
circulate the count
• Lowers mistaken detections (e.g., when a message is
dropped or an Internet path is bad); used in the Cassandra
key-value store
• Can also keep much longer-term failure counts,
and use this to blacklist and greylist machines,
e.g., in OpenCorral CDN
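A sketch of threshold-based suspicion in Python (the threshold value and the centralized bookkeeping are illustrative assumptions, not how Cassandra or any particular system implements it):

    THRESHOLD = 3  # independent timeouts required before declaring failure

    suspectors = {}  # machine -> set of processes that have timed out on it

    def report_timeout(reporter, target):
        # Called when `reporter` times out waiting for `target`'s heartbeat.
        suspectors.setdefault(target, set()).add(reporter)
        if len(suspectors[target]) >= THRESHOLD:
            return f"{target} declared failed"
        return f"{target} suspected by {len(suspectors[target])} process(es)"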