Systems That Never Stop (And Erlang) : Joe Armstrong

Making reliable distributed systems in the presence of software errors. Building reliable systems using Erlang and OTP.


Systems that never stop (and Erlang)

Joe Armstrong
How can we get 10 nines reliability?
SIX LAWS
ONE

ISOLATION
ISOLATION

• 10 nines = 99.99999999% availability
• P(fail) = 10^-10
• If P(fail | one computer) = 10^-3 then
  P(fail | four computers) = 10^-12
• Fixed
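The step from 10^-3 to 10^-12 holds if the four machines fail independently and the system is down only when all four are down at the same time. Both conditions are assumptions that the slide leaves implicit. Under them, the arithmetic is:

```latex
\[
P(\text{fail} \mid \text{four computers})
  = P(\text{fail} \mid \text{one computer})^{4}
  = \left(10^{-3}\right)^{4}
  = 10^{-12}
  < 10^{-10}
\]
```

Three such machines would give only 10^-9, which misses the ten-nines target, so four is the smallest count that clears it under these assumptions.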
TWO

CONCURRENCY
Concurrency

• World is concurrent
• Need at least TWO computers to make a non-stop system
• TWO computers are concurrent and distributed
“My first message is that
concurrency
is best regarded as a program
structuring principle”

Structured concurrent programming

– Tony Hoare
Redmond, July 2001
THREE

MUST
DETECT FAILURES
Failure detection
• If you can't detect a failure you can't fix it
• Must work across machine boundaries: the entire machine might fail
• Implies distributed error handling, no shared state, asynchronous messaging
FOUR

FAULT
IDENTIFICATION
Failure Identification

• Fault detection is not enough - you must know why the failure occurred
• Implies that you have sufficient information for post hoc debugging
FIVE

LIVE
CODE
UPGRADE
Live code upgrade

• Must upgrade software while it is running
• Want zero down time
SIX

STABLE
STORAGE
Stable storage

• Must store stuff forever
• No backup necessary - storage just works
• Implies multiple copies, distribution, ...
• Must keep crash reports
HISTORY

Those who cannot learn from history are doomed to repeat it.

George Santayana
GRAY
As with hardware, the key to software fault-tolerance is to
hierarchically decompose large systems into modules, each module being
a unit of service and a unit of failure. A failure of a module does
not propagate beyond the module.

...

The process achieves fault containment by sharing no state with
other processes; its only contact with other processes is via messages
carried by a kernel message system.

- Jim Gray
- Why do computers stop and what can be done about it
- Technical Report 85.7, Tandem Computers, 1985
SCHNEIDER
Halt on failure: in the event of an error, a processor
should halt instead of performing a possibly erroneous
operation.

Failure status property: when a processor fails,
other processors in the system must be informed. The
reason for failure must be communicated.

Stable storage property: the storage of a processor
should be partitioned into stable storage (which
survives a processor crash) and volatile storage (which
is lost if a processor crashes).
Schneider
ACM Computing Surveys 22(4):299-319, 1990
GRAY
• Fault containment through fail-fast software modules.
• Process-pairs to tolerate hardware and transient software faults.
• Transaction mechanisms to provide data and message integrity.
• Transaction mechanisms combined with process-pairs to ease
  exception handling and tolerate software faults.
• Software modularity through processes and messages.
KAY
Folks --

Just a gentle reminder that I took some pains at the last OOPSLA to
try to remind everyone that Smalltalk is not only NOT its syntax or
the class library, it is not even about classes. I'm sorry that I long ago
coined the term "objects" for this topic because it gets many people to
focus on the lesser idea.

The big idea is "messaging" -- that is what the kernal of Smalltalk/Squeak
is all about (and it's something that was never quite completed
in our Xerox PARC phase)....

http://lists.squeakfoundation.org/pipermail/squeak-dev/1998-October/017019.html
GRAY
Software modularity through processes
and messages. As with hardware, the key
to software fault-tolerance is to
hierarchically decompose large systems
into modules, each module being a unit of
service and a unit of failure. A failure of a
module does not propagate beyond the
module.
Fail Fast
The process approach to fault isolation advocates that the process
software be fail-fast, it should either function correctly or it
should detect the fault, signal failure and stop operating.

Processes are made fail-fast by defensive programming. They check
all their inputs, intermediate results and data structures as a matter
of course. If any error is detected, they signal a failure and stop. In
the terminology of [Cristian], fail-fast software has small fault
detection latency.

Gray
Why ...
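A minimal Erlang sketch of the fail-fast idea, added here for illustration and not taken from the talk. The module name `failfast` and the functions `area/1` and `checked_area/1` are hypothetical. The worker checks its input with guards and has no catch-all clause, so a bad input stops the process immediately, and a monitoring process learns the reason.

```erlang
-module(failfast).
-export([area/1, checked_area/1]).

%% Only well-formed shapes have a clause. Any other input raises
%% function_clause, so the calling process stops at the point of the
%% fault instead of carrying on with a made-up result.
area({square, Side}) when is_number(Side), Side >= 0 ->
    Side * Side;
area({rectangle, W, H}) when is_number(W), is_number(H), W >= 0, H >= 0 ->
    W * H.

%% Run area/1 in a separate, monitored process and report how that
%% process ended. The caller survives either way.
checked_area(Shape) ->
    {Pid, Ref} = spawn_monitor(fun() -> exit({ok, area(Shape)}) end),
    receive
        {'DOWN', Ref, process, Pid, {ok, Area}} -> {ok, Area};
        {'DOWN', Ref, process, Pid, Reason}     -> {failed, Reason}
    end.
```

Calling `failfast:checked_area({square, 3})` returns `{ok, 9}`, while an unknown shape such as `{circle, 1}` returns `{failed, {function_clause, Stack}}`, where `Stack` is the stack trace of the crashed worker.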
Fail Early
A fault in a software system can cause one or more
errors. The latency time which is the interval between
the existence of the fault and the occurrence of the
error can be very high, which complicates the
backwards analysis of an error ...

For an effective error handling we must detect errors and
failures as early as possible

Renzel -
Error Handling for Business Information Systems,
Software Design and Management, GmbH & Co. KG, München, 2003
ARMSTRONG
• Processes are the units of error encapsulation. Errors
  occurring in a process will not affect other processes in the
  system. We call this property strong isolation.
• Processes do what they are supposed to do or fail as soon
  as possible.
• Failure and the reason for failure can be detected by
  remote processes.
• Processes share no state, but communicate by message
  passing.

Armstrong
Making reliable distributed systems in the presence of software errors
PhD Thesis, KTH, 2003
COMMERCIAL
BREAK
Joe's 2nd theorem

• Whatever Joe starts talking about, he will end up talking about Erlang
Erlang was
designed
to program
fault-tolerant
systems
[Diagram: Erlang, surrounded by the labels Concurrent programming, Functional programming, Concurrency Oriented programming, Fault tolerance, Multicore]
Erlang
• Very light-weight processes
• Very fast message passing
• Total separation between processes
• Automatic marshalling/demarshalling
• Fast sequential code
• Strict functional code
• Dynamic typing
• Transparent distribution
• Compose sequential AND concurrent code
Properties
• No sharing
• Hot code replacement
• Pure message passing
• No locks
• Lots of computers (= fault tolerant scalable ...)
• Functional programming (no side effects)
What is COP?
[Diagram labels: Machine, Process, Message]

➡ Large numbers of processes
➡ Complete isolation between processes
➡ Location transparency
➡ No sharing of data
➡ Pure message passing systems

Thread Safety
Erlang programs are
automatically thread
safe if they don't use
an external resource.
Functional
If you call the
same function twice with
the same arguments
it should return the same value

“jolly good”
Joe Armstrong
No Mutable State
• Mutable state needs locks
• No mutable state = no locks = programmer's bliss
Multicore ready
The rise of the cores
• 2 cores won't hurt you
• 4 cores will hurt a little
• 8 cores will hurt a bit
• 16 will start hurting
• 32 cores will hurt a lot (2009)
• ...
• 1 M cores ouch (2019)
• (complete paradigm shift)

• 1997 1 Tflop = 850 kW
• 2007 1 Tflop = 24 W (factor 35,000)
• 2017 1 Tflop = ?
LAWS
ISOLATION
CONCURRENCY
Pid = spawn(.....)
Pid = spawn(Node, ....)

Pid ! Message

receive
    Pattern1 -> Actions1;
    Pattern2 -> Actions2;
    ...
end
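The three primitives above (spawn, send, receive) can be combined into a small request/reply exchange. This is an illustrative sketch, not code from the talk. The module name `echo`, its functions, the message shapes and the 1000 ms timeout are all assumptions.

```erlang
-module(echo).
-export([start/0, call/2]).

%% Spawn a process that runs the receive loop below.
start() ->
    spawn(fun loop/0).

%% The server: wait for a message, pattern match on it, act, repeat.
loop() ->
    receive
        {From, Ref, {echo, Term}} ->
            From ! {Ref, Term},      % reply to the sender
            loop();
        stop ->
            ok                       % returning ends the process
    end.

%% The client: send a tagged request and wait for the matching reply.
call(Pid, Term) ->
    Ref = make_ref(),
    Pid ! {self(), Ref, {echo, Term}},
    receive
        {Ref, Reply} -> {ok, Reply}
    after 1000 ->
        timeout
    end.
```

The unique reference from `make_ref()` lets the client pick out its own reply even if other messages are waiting in its mailbox.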
FAULT
IDENTIFICATION
process_flag(trap_exit, true),   % exit signals arrive as messages
link(Pid),
receive
    {'EXIT', Pid, Why} ->
        ...
end
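A self-contained sketch of the same pattern, added for illustration. The module name `watch` and the function `run/1` are hypothetical. The parent traps exits, starts a linked worker that terminates with a given reason, and receives that reason as an ordinary message.

```erlang
-module(watch).
-export([run/1]).

%% Spawn a linked worker that exits with Reason and return what the
%% parent observes. Trapping exits turns the exit signal from the
%% linked process into an {'EXIT', Pid, Why} message.
run(Reason) ->
    Old = process_flag(trap_exit, true),
    Pid = spawn_link(fun() -> exit(Reason) end),
    Result = receive
                 {'EXIT', Pid, Why} -> {worker_died, Why}
             after 1000 ->
                 timeout
             end,
    process_flag(trap_exit, Old),    % restore the caller's setting
    Result.
```

Calling `watch:run(boom)` returns `{worker_died, boom}`. The reason travels with the exit signal, which is what lets a remote process identify the fault as well as detect it.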
LIVE CODE
UPGRADE
• Can upgrade code while it's running
• Existing processes continue to use original code, new processes run new code - no mixups of namespaces
• Sophisticated roll-forward, roll-back, roll-back-on-error functions in OTP libraries
• Properly designed systems can be rolled-forward and back with no loss of service. Not easy, but possible
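One common way to let a long-running process move to new code is sketched below. It is an illustration, not the OTP release-handling machinery the slide refers to. The module name `counter` and the `upgrade` message are assumptions. The sketch relies on a documented property of Erlang: a local call stays in the module version the process is already running, while a fully qualified call (`Module:Function(...)`) always enters the most recently loaded version.

```erlang
-module(counter).
-export([start/0, bump/1, upgrade/1, loop/1]).

start() ->
    spawn(fun() -> loop(0) end).

%% Increment the counter held by Pid and return the new value.
bump(Pid) ->
    Ref = make_ref(),
    Pid ! {bump, self(), Ref},
    receive
        {Ref, N} -> N
    after 1000 ->
        timeout
    end.

%% Ask the process to switch to the newest loaded version of this module.
upgrade(Pid) ->
    Pid ! upgrade,
    ok.

loop(N) ->
    receive
        {bump, From, Ref} ->
            From ! {Ref, N + 1},
            loop(N + 1);          % local call: stays in the current version
        upgrade ->
            ?MODULE:loop(N)       % qualified call: enters the newest version
    end.
```

After a new version of the module has been compiled and loaded (for example with `c(counter).` in the Erlang shell), `counter:upgrade(Pid)` moves the running process onto it while keeping the count in `N`.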
STABLE STORAGE
• Performed in libraries

mnesia:transaction(
    fun() ->
        %% read takes {Table, Key} and returns a list of matching records
        [{Tab, Key, Val}] = mnesia:read({Tab, Key}),
        mnesia:write({Tab, Key, Val}),
        ...
    end)
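A runnable sketch of a Mnesia transaction, added for illustration. The module name `kv`, the record `kv` and the functions `init/0`, `store/2` and `fetch/1` are hypothetical. For brevity the table is created with Mnesia's defaults, which here means a RAM-only copy on the local node. That is an assumption of the sketch and does not provide the multiple disc-backed copies the stable-storage law calls for.

```erlang
-module(kv).
-export([init/0, store/2, fetch/1]).

-record(kv, {key, value}).

%% Start Mnesia and create the table if it does not exist yet.
init() ->
    ok = mnesia:start(),
    case mnesia:create_table(kv, [{attributes, record_info(fields, kv)}]) of
        {atomic, ok}                    -> ok;
        {aborted, {already_exists, kv}} -> ok
    end.

%% Write one record inside a transaction. Returns {atomic, ok} on success.
store(Key, Value) ->
    mnesia:transaction(
      fun() -> mnesia:write(#kv{key = Key, value = Value}) end).

%% Read inside a transaction and unwrap the result.
fetch(Key) ->
    case mnesia:transaction(fun() -> mnesia:read({kv, Key}) end) of
        {atomic, [#kv{value = Value}]} -> {ok, Value};
        {atomic, []}                   -> not_found;
        {aborted, Reason}              -> {error, Reason}
    end.
```

After `kv:init()`, calling `kv:store(colour, blue)` returns `{atomic, ok}` and `kv:fetch(colour)` returns `{ok, blue}`. Everything inside the fun passed to `mnesia:transaction/1` either commits as a whole or is aborted.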
Projects
• CouchDB
• Amazon SimpleDB
• MochiWeb (Facebook chat)
• Scalaris
• Nitrogen
• ejabberd (XMPP)
• RabbitMQ (AMQP)
• ....
Companies
• Ericsson
• Amazon
• Tail-f
• Kreditor
• Synapse
• ...
Books
THE END
