Checkpointing and Rollback Recovery Overview

Rollback Recovery Methods: a Quick Overview

CSEP590SG – University of Washington

Steve Gribble (gribble@[Link])

[This material is taken from the paper “A Survey of Rollback-Recovery Protocols in Message-Passing Systems”, by Elnozahy, Alvisi, Wang, and Johnson.]

Basic goal

• Fault tolerance of a long-running, distributed computation

– Ability to restart global computation to a “consistent” snapshot
– Coordinate local process states and (causal) dependencies
• Model: collection of processes, message-oriented
computation
– Fail-stop: a process simply stops (disappears) when it crashes
• No Byzantine failures (incorrect events are never generated)
• Goal: recovery is transparent to both programmer and
application

CSEP590SG, Winter 2004 ©2004, Steven D. Gribble

Basic model

• Finite number of processes in system

– Process “birth”: equivalent to the process not interacting with other
processes or the outside world until its “birthday”
– Process “death”: the process generates no events and receives no
input from the outside world after death
• Communication network
– Message-oriented [don’t worry about bytestreams]
– Arbitrary topology
– Unreliable message delivery [lose, duplicate, reorder messages]
• Some protocols assume reliable delivery, in which case system state
includes channel state [why?]

Picture of basic system

• Process execution modeled as sequence of state intervals

– Each interval is a deterministic computation started by a non-deterministic event
– Non-determinism: in the model, message reception
» what about message transmission?
– In reality: also reading the physical clock, input from the world, executing most
system calls (failure, variable return values), …
[Figure: space-time diagram of processes P0, P1, P2 exchanging messages m1, m2, m3, m4]

Bigger picture

• The “outside world” matters too

[Figure: the outside world and the message-passing system (processes P0, P1, P2 exchanging messages m1, m2, m3, m4); labels: system input, visible event]

A computation

• A “computation” represents the evolution of the system state over time
– System state means {process states}, possibly plus the state of the channels
– “Consistent system state”: one that may occur in a failure-free, correct execution
• Iff: whenever a process’s state reflects a message receipt, the state of the
corresponding sender reflects sending that message
– Is this the same as Lamport’s causal ordering?

• Goal of rollback recovery protocol:

– Bring system back into consistent state when inconsistencies occur
because of a failure.
• Reconstructed state may not be one that occurred before the failure. It is
sufficient that it “could” have occurred.
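
The consistency condition above can be checked mechanically. The following is a minimal sketch, not taken from the slides: it assumes each process's history is a numbered sequence of events and that a system state is a "cut" giving how many events of each process are included. The names `Message`, `is_consistent`, and the example messages are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    name: str
    sender: str
    send_index: int   # position of the send event in the sender's history
    receiver: str
    recv_index: int   # position of the receive event in the receiver's history


def is_consistent(cut, messages):
    """cut maps each process to the number of its events included in the state.

    The state is consistent if every message whose receipt is included
    also has its send included (no receipt without a matching send).
    """
    for m in messages:
        received = m.recv_index < cut[m.receiver]
        sent = m.send_index < cut[m.sender]
        if received and not sent:
            return False
    return True


# Hypothetical run: P0 sends m1 to P1, then P1 sends m2 to P2.
msgs = [Message("m1", "P0", 0, "P1", 0), Message("m2", "P1", 1, "P2", 0)]
print(is_consistent({"P0": 1, "P1": 2, "P2": 1}, msgs))  # True
print(is_consistent({"P0": 1, "P1": 1, "P2": 1}, msgs))  # False: m2 received but not sent
```

Note that a message that was sent but not yet received does not make the state inconsistent; it is simply in flight, which is where channel state comes in.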

Consistent vs. Inconsistent State

Drilling down on network channel state

• Two models:
– reliable communications substrate is underneath recovery
– or, reliability is implemented above recovery mechanisms

Checkpointing protocols

• Basic hammer: each process periodically saves its state on stable storage
– State contains enough information to restart process execution
• Goal is to construct a “consistent global checkpoint”
– Set of local checkpoints, one from each process, forming
consistent system state.
– Can restart system from any consistent global checkpoint after
failure
• generally want to use the most recent consistent global checkpoint
[called recovery line]

What makes this hard: Domino Effect

• Suppose P2 fails, and rolls back to checkpoint C

– Where is the recovery line?

Answer:

• Rollback “invalidates” sending of message m6, so P1 must roll back to B
to invalidate the receipt of that message
– Otherwise P1 becomes an “orphan process”
• But, rollback of P1 invalidates sending of m7, so P0
must roll back to A.
• Etc., until you get all the way back to the beginning.
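
The cascade described above can be simulated as a fixpoint computation. This is a rough sketch under assumptions that are not in the slides: events on each process are numbered from 0, a checkpoint at position c covers events 0..c-1, and every process has an implicit checkpoint at position 0 (its initial state). The two-process scenario is hypothetical, since the original figure is not reproduced here.

```python
def recovery_line(checkpoints, messages, current, failed):
    """Return the position each process must roll back to after `failed` crashes.

    checkpoints: process -> list of checkpoint positions (must include 0)
    messages:    list of (sender, send_idx, receiver, recv_idx)
    current:     process -> number of events executed so far
    """
    def latest_at_or_before(proc, limit):
        return max(c for c in checkpoints[proc] if c <= limit)

    line = dict(current)
    # The failed process loses its volatile state and restarts from a checkpoint.
    line[failed] = latest_at_or_before(failed, current[failed])

    changed = True
    while changed:
        changed = False
        for sender, s_idx, receiver, r_idx in messages:
            # Orphan: the receipt is still in the state, but the send was undone.
            if r_idx < line[receiver] and s_idx >= line[sender]:
                line[receiver] = latest_at_or_before(receiver, r_idx)
                changed = True
    return line


# Hypothetical scenario: P1 sends to P0, P0 checkpoints, P0 sends to P1, P1 checkpoints.
checkpoints = {"P0": [0, 1], "P1": [0, 2]}
messages = [("P1", 0, "P0", 0), ("P0", 1, "P1", 1)]
current = {"P0": 2, "P1": 2}

print(recovery_line(checkpoints, messages, current, failed="P0"))
# {'P0': 0, 'P1': 0}: both processes end up back at their initial states
```

In this example neither non-initial checkpoint is usable: each one records a receipt whose send the other process has to undo.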

Getting around the Domino effect

• Must be careful about coordinating checkpoints

– Simplest way: execute some sort of consensus process to
synchronously begin checkpoint at all processes
• E.g., 2-phase commit
• Very expensive!
• Another way: log events to supplement checkpoints
– Log non-deterministic events after checkpoint
– Checkpoint + log guarantees that a process computation
proceeds identically to prefailure computation
• Identical until first non-logged, non-deterministic event after the
last checkpoint
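
A minimal sketch of the checkpoint + log idea, under simplifying assumptions that are not in the slides: message receipt is the only non-deterministic event, the message itself serves as the determinant, and plain attributes stand in for stable storage. The class `LoggedProcess` and its methods are hypothetical names.

```python
class LoggedProcess:
    def __init__(self):
        self.state = 0        # volatile application state
        self.checkpoint = 0   # stand-in for state saved on stable storage
        self.log = []         # determinants logged since the last checkpoint

    def _apply(self, msg):
        # Deterministic step: same state + same message gives the same result.
        self.state = self.state * 31 + msg

    def deliver(self, msg):
        self.log.append(msg)  # log the non-deterministic event before acting on it
        self._apply(msg)

    def take_checkpoint(self):
        self.checkpoint = self.state
        self.log.clear()      # earlier determinants are no longer needed

    def recover(self):
        # Restore the checkpoint, then replay the logged events in order.
        self.state = self.checkpoint
        for msg in self.log:
            self._apply(msg)


p = LoggedProcess()
p.deliver(3)
p.take_checkpoint()
p.deliver(5)
p.deliver(7)
before_crash = p.state

p.state = None                # simulate losing volatile state in a crash
p.recover()
print(p.state == before_crash)  # True
```

If `deliver` applied a message without logging it, replay would diverge at that point, which is the "first non-logged, non-deterministic event" caveat in the bullet above.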

What about outside events?

• Input events:
– must log them, since not guaranteed that outside world is
recoverable

• Output events:
– this is the Lowell paper
• locally, must log before generating output event
• globally, must ensure consistent checkpoint before generating
output event
– expensive to handle, but necessary
• alternative is “compensation events”

Logging Protocols

• Non-deterministic events (incl. input) must be logged

– Alternative: a checkpoint must be taken before the process induces a
side effect after a non-deterministic event
– Logs depend on piecewise deterministic (PWD) assumption
• Ability for application to log a “determinant” of non-deterministic events
• Determinant contains all info necessary to replay event after failure
• Process state interval is recoverable if:
– enough information in checkpoints/logs to replay execution up to that state
interval, despite any future failures in system
• State interval is stable if:
– Determinant of non-deterministic event that started it is in the log

• Q: does recoverable interval ⇒ stable interval?

• Q: does stable interval ⇒ recoverable interval?

Pop quiz

[Figure: space-time diagram of processes P0, P1, P2 exchanging messages m0 through m7]

• What is the “maximum recoverable state”?

– (most recent recoverable consistent system state)

[Figure: the same diagram (P0, P1, P2, messages m0 through m7) with the maximum recoverable state marked]

Recap: 2 main strategies for recovery

• Checkpoint-based rollback recovery

– Depend only on sequence of checkpoints to recover system
• No logging of events
– Challenge: overcoming domino effect to find “recovery line”
• Log-based rollback recovery
– In addition to checkpoints, log non-deterministic events
• Essentially adds to checkpoint by logging non-deterministic
decisions since last checkpoint
– Challenge: overcoming cost of (synchronously) logging events

Uncoordinated Checkpointing

• Checkpoint-based recovery, but uncoordinated: maximum autonomy across processes
– Purely local policy dictates when to record a checkpoint
– Requires “dependency graphs” to calculate recovery line
• Dependency information piggybacked on messages
• Problems:
– domino effect
– “useless” checkpoints that will never be part of a recovery line
– need for global “garbage collection” to reclaim no-longer-
necessary checkpoints
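
One way to picture the piggybacked dependency information is sketched below. It assumes, beyond what the slides say, that each process numbers its checkpoint intervals and that every message carries the sender's name and current interval index; the class `Proc` and the message layout are hypothetical.

```python
class Proc:
    def __init__(self, name):
        self.name = name
        self.interval = 0   # index of the current checkpoint interval
        self.deps = set()   # edges: ((sender, interval), (receiver, interval))

    def checkpoint(self):
        # A purely local decision; it just starts a new interval.
        self.interval += 1

    def send(self, payload):
        # Piggyback the sender's identity and interval on the message.
        return {"payload": payload, "from": self.name, "interval": self.interval}

    def receive(self, msg):
        # Record that the current interval depends on the sender's interval.
        self.deps.add(((msg["from"], msg["interval"]), (self.name, self.interval)))
        return msg["payload"]


p0, p1 = Proc("P0"), Proc("P1")
p0.checkpoint()
m = p0.send("x")
p1.receive(m)
print(p1.deps)  # {(('P0', 1), ('P1', 0))}
```

At recovery time the recorded edges from all processes would be collected into a dependency graph, from which the recovery line is calculated.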

Coordinated checkpoint recovery

• Recovery line is constructed by cooperation

– Synchronous (blocking) checkpoints: two-phase commit, computation
ceases during checkpoint
– Asynchronous (nonblocking) checkpoints: Chandy and Lamport’s distributed snapshot
• Eliminate the need for FIFO channels by piggybacking the marker on all post-checkpoint messages
– the marker arrives with whichever message gets through first
– Synchronized physical clocks: at time T, each process takes
checkpoint, and then “freezes” to account for skew
• Freeze time = max clock error + max failure detection time
• Abort if detect failure
– Communication-induced checkpoints: hybrid approach (Lowell)
• Autonomous local checkpoints, but occasional forced checkpoints
– e.g., when receive message

Logging protocols

• Protocols phrased in terms of consistency conditions

– No-orphans: the set of processes that depend on a non-deterministic
event is a subset of those that have logged it
• Various flavors:
– Pessimistic: synchronously log all non-deterministic events
• Observable state of each process can always be recovered
– processes can output to world without a special protocol!
– processes can always restart from most recent checkpoint!
– process failure never affects other processes!
• Can relax this slightly by only logging an event when the process
is about to affect another process (e.g., output to world, or send
message to process)
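
The no-orphans condition, as phrased above, is a subset test per event. The sketch below checks it directly; the function name, the event names, and the two dictionaries are hypothetical, and it ignores refinements such as treating events already on stable storage separately.

```python
def no_orphans(depend, logged):
    """depend[e]: processes whose state depends on non-deterministic event e.
    logged[e]:  processes that hold a logged copy of e's determinant.

    The condition holds if, for every event, each dependent process
    is among those that have logged it.
    """
    return all(procs <= logged.get(e, set()) for e, procs in depend.items())


# Hypothetical events e1 and e2.
print(no_orphans({"e1": {"P0", "P1"}}, {"e1": {"P0", "P1", "P2"}}))  # True
print(no_orphans({"e2": {"P0", "P1"}}, {"e2": {"P0"}}))              # False
```

In the second call, P1 depends on e2 without having its determinant; if P0 failed and lost the determinant, P1 would be left as an orphan.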

Log-based recovery cont.

• More flavors:
– Optimistic: log non-deterministic events asynchronously
• “hope” that entry makes it to disk before failure
– those that don’t are lost on failure
– need to compute recovery line
• Recovery can be synchronous or asynchronous
• Orphans are possible, need to roll them back
– Causal: piggyback causal dependency on messages
• Non-deterministic event is either stable on log, or its determinant is
piggybacked on all messages sent from that process
– and transitively through “happens-before” relationship
• Non-failed process can “guide” recovery of others
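
A rough sketch of the causal flavor: every outgoing message carries the determinants the sender knows about that are not yet stable, so a surviving process can hand them back to a failed one. The class `CausalProc`, the message layout, and the use of the payload as the determinant are simplifying assumptions, and the sketch never prunes determinants once they become stable.

```python
class CausalProc:
    def __init__(self, name):
        self.name = name
        self.seq = 0
        # (process, receive sequence number) -> determinant, not yet known to be stable
        self.unstable = {}

    def receive(self, msg):
        # Inherit the determinants in the sender's causal past...
        self.unstable.update(msg["determinants"])
        # ...and record the determinant of this receipt itself.
        self.unstable[(self.name, self.seq)] = msg["payload"]
        self.seq += 1

    def send(self, payload):
        # Piggyback a copy of every determinant not yet known to be stable.
        return {"payload": payload, "determinants": dict(self.unstable)}

    def determinants_for(self, failed):
        # What this process can supply to guide the recovery of `failed`.
        return {k: v for k, v in self.unstable.items() if k[0] == failed}


p0, p1 = CausalProc("P0"), CausalProc("P1")
p0.receive({"payload": "input", "determinants": {}})
p1.receive(p0.send("hello"))

# If P0 now fails, P1 still holds the determinant of P0's first receipt.
print(p1.determinants_for("P0"))  # {('P0', 0): 'input'}
```

The price is larger messages; in exchange, no synchronous write to stable storage is needed before sending.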
