Scale-out and
redundancy
BITS Pilani
Pilani Campus
Redundancy
Reliability Availability Serviceability (RAS)
• When designing robust, highly available systems, three terms are often used
together: reliability, availability, and serviceability (RAS).
Reliability: measured by Mean Time To Failure (MTTF)
Serviceability: measured by Mean Time To Repair (MTTR)
Availability: MTTF / (MTTF + MTTR)
• The system alternates between operating and being repaired: the operate-repair cycle.
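As a quick sanity check, the availability formula can be computed directly; the MTTF/MTTR figures below are hypothetical, chosen only for illustration:

```python
# Availability over the operate-repair cycle (illustrative figures).
def availability(mttf_hours, mttr_hours):
    """Fraction of time the system is operational: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# e.g. a component with MTTF = 999 h and MTTR = 1 h:
print(availability(999, 1))  # 0.999, i.e. 99.9% available
```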
Availability
System Type                Availability (%)   Downtime per year*
Conventional workstation   99                 3.6 days
HA system                  99.9               8.5 hours
Fault-resilient system     99.99              1 hour
Fault-tolerant system      99.999             5 minutes
*May not include planned downtime (e.g. on weekends or at midnight), which may be significant.
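The downtime column can be reproduced (approximately) from the availability percentages; a small sketch:

```python
# Downtime per year implied by an availability figure.
HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_hours_per_year(availability_pct):
    """Hours of downtime per year for a given availability percentage."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR

for a in (99, 99.9, 99.99, 99.999):
    # 99% -> ~87.6 h (~3.65 days); 99.999% -> ~0.09 h (~5.3 minutes)
    print(a, round(downtime_hours_per_year(a), 2), "hours")
```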
Failures
• (Planned) shutdown vs. (unplanned) failure
A system may be taken offline for planned changes or maintenance.
How do you define MTTF and MTTR in this case? Use the number of planned downtimes and their durations.
• Transient vs. permanent failures
Transient: a rollback or restart is sufficient.
Permanent failures require component replacement, e.g. a hard-disk failure.
• Partial vs. total failures
A single point of failure (SPoF) leads to total failure.
A key approach to enhancing availability is to convert total failures into partial failures.
Increase redundancy to avoid SPoFs.
• Increasing MTTF vs. reducing MTTR
Partial vs Total Failures
If the bus fails, the entire system fails. If the Ethernet fails, the entire system fails.
If one node fails, the other nodes can still offer service. If one node fails, we can still use the data: use logs and checkpoints.
• What are the SPoFs here?
Availability
• Assume that the network and the RAID are 100% available.
• Each workstation is 99% available.
• If a node fails, the workload is switched over to the other node in zero time.
• What, then, is the availability of the cluster?
The probability that a node is unavailable is 0.01. Both nodes down at the
same time: 0.01 × 0.01 = 0.0001, or 0.01%. Availability is therefore 99.99%: a fault-resilient cluster.
• What is the availability if the cluster needs a planned downtime of 1 hour per
week?
52 hours / (365 × 24) = 0.0059, or 0.59%. Total unavailability = 0.01% + 0.59% = 0.6%,
i.e. 99.4% availability.
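The two calculations above can be reproduced directly:

```python
# Two-node cluster with zero-time failover: the cluster is down only
# when both nodes are down at the same time.
node_unavail = 0.01                      # each node is 99% available
cluster_unavail = node_unavail ** 2      # 0.0001 -> 99.99% available

# Adding 1 hour/week of planned downtime:
planned_unavail = 52 / (365 * 24)        # ~0.0059, i.e. ~0.59%
total_unavail = cluster_unavail + planned_unavail
print(round((1 - total_unavail) * 100, 1))  # 99.4
```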
Aspects of High Availability
• Redundancy
2N model, N+M model, N-way model
Active and passive standbys
• Replication
Copying the state of the system (replication)
Lock-step strategy (send the same operations to be executed by all
machines)
• Monitoring
Push vs. poll
ZooKeeper watches
• Failure detection
Ping, heartbeat
ZooKeeper watches
• Recovery
Redundancy Models
CSIS Dept. BITS Pilani, Pilani Campus
• 2N model
A passive standby for every active node.
This provides the highest level of fault tolerance: the system can
continue to function even if up to half of the nodes fail.
• N+M model
At most M passive standby nodes for N active nodes.
The N+M model is a more cost-effective alternative to the 2N model.
• N-way model
There can be several standbys, both active and passive.
• N-way active model
All standbys are active, and there can be many of them.
Two-Node Failover Configurations
Possible cluster configurations:
1. Active-passive clusters
2. Active-active clusters
Fault Tolerant Clusters – Hot Standby
1. Active-passive / hot standby
Primary (active) vs. standby (passive) node:
The primary node mirrors any data to shared storage,
which is accessible by the standby node.
The standby node monitors the health of the primary
but does not handle any workload.
Asymmetric configuration.
2. Active-active clusters
Active-Passive Configuration
Cost of an additional system.
Highly available, because the chance that the passive node also fails is small.
Due to the extra cost and administrative overhead, only critical
applications are run in an active-passive configuration.
Fault Tolerant Clusters – Active Takeover
1. Active-passive / hot-standby clusters
2. Active-takeover / active-active clusters
Symmetric: all nodes act as primaries, i.e. they handle normal
workload, but they also monitor each other.
If one node fails, the survivor steps in and handles double the
workload until the other node comes back.
Failover vs. failback
Active-active
1. Hot-standby clusters
2. Active-active clusters
Symmetry of nodes.
Failover
When a node fails, its applications fail over to a
designated node that is available.
The failover delay may result in increased
response time, data loss, etc.
Failback
When a node recovers from a failure, the failed-over
applications fail back to the recovered node.
Active-active configurations
Manageable costs.
Performance impact when a single node handles double the load.
The partner node may itself be subject to failures, because it is
also running critical applications.
Example
• The storage is shared. Assume that the network is
100% available. The MTTF and MTTR of the shared
storage are 200 days and 5 days respectively.
Assume that the availability of each node is 99%.
• Considering that the failure of any node brings down
the whole cluster, calculate the cluster availability.
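A sketch of this series model, assuming four nodes as in the later examples (the node count is not stated on this slide, so it is an assumption):

```python
# Series composition: the cluster is up only if the shared storage
# AND every node are up simultaneously.
storage_avail = 200 / (200 + 5)   # MTTF/(MTTF+MTTR) ≈ 0.9756
node_avail = 0.99
num_nodes = 4                     # assumed, matching the later examples

cluster_avail = storage_avail * node_avail ** num_nodes
print(round(cluster_avail * 100, 2))  # ≈ 93.72%
```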
Example
• The storage is shared. Assume that the network is
100% available. The MTTF and MTTR of the shared
storage are 200 days and 5 days respectively.
Assume that the availability of each node is 99%.
• Considering that Node1 is the primary server and all
other nodes are passive hot standbys, calculate the
cluster availability. Assume that failover takes zero time.
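A sketch of this hot-standby case, again assuming four nodes (an assumption, since the node count is not stated):

```python
# Hot-standby model: the node tier fails only if all nodes are down
# at once; the shared storage remains a single point of failure.
storage_avail = 200 / (200 + 5)
node_unavail = 0.01
num_nodes = 4                             # assumed

tier_avail = 1 - node_unavail ** num_nodes   # ≈ 0.99999999
cluster_avail = storage_avail * tier_avail
print(round(cluster_avail * 100, 2))         # ≈ 97.56% — storage dominates
```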
Example
• The storage is shared. Assume that the network is 100%
available. The MTTF and MTTR of the shared storage are
200 days and 5 days respectively. Assume that the
availability of each node is 99%.
• Consider that the web server is deployed on Node1 and
Node2, and the database server on Node3 and Node4.
{Node1, Node2} are configured as an active-active pair;
similarly, {Node3, Node4} are configured as an
active-active pair. The web application needs both the
web server and the database to be available. Calculate
the availability of the web application, assuming that
failover takes zero time.
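This is a series composition of two parallel pairs, plus the shared storage:

```python
# Each tier (web, DB) is an active-active pair: it fails only if both
# of its nodes fail. The application needs both tiers plus the storage.
storage_avail = 200 / (200 + 5)
pair_avail = 1 - 0.01 ** 2                 # 0.9999 per tier

app_avail = storage_avail * pair_avail ** 2
print(round(app_avail * 100, 2))           # ≈ 97.54%
```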
Large Cluster Configurations
• N-to-1 Clusters
• N+1 clusters
N-to-1 Clusters
• A designated node acts as the standby. This node must be able to access all disks.
• On a failure of any node, failover happens to the designated node.
• Failback is required when the failed node comes back, so that
the designated node is freed.
N+1 Configuration
• N-to-1 requires failback, which
affects availability.
• In N+1, all nodes have
access to the shared storage via a
SAN.
• Assume that initially Node6 is the
standby. If Node1 fails, it fails
over to Node6.
• When Node1 comes back,
Node1 becomes the standby.
• In this way, the cluster configuration
changes over time.
Example
• The storage is shared. Assume that the network is
100% available. The MTTF and MTTR of the shared
storage are 200 days and 5 days respectively.
Assume that the availability of each node is 99%.
• Assume that four nodes are configured as an N-to-1
cluster where Node4 is the designated passive hot
standby. The applications running on Node1, Node2
and Node3 must all be available simultaneously for
the cluster to be available. Failover and failback take
zero time. Calculate the cluster availability.
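One way to model this N-to-1 cluster: with zero-time failover and failback, the node tier is up exactly when at least three of the four identical nodes are up (either no failures, or one failure covered by the standby):

```python
# N-to-1 with 3 active nodes + 1 standby: tolerate at most one node failure.
p = 0.99                                   # per-node availability
tier_avail = p**4 + 4 * p**3 * (1 - p)     # P(0 failures) + P(exactly 1 failure)

storage_avail = 200 / (200 + 5)
cluster_avail = storage_avail * tier_avail
print(round(cluster_avail * 100, 2))       # ≈ 97.50%
```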
Failover
The migration of services from one node to another is called
failover.
Criteria:
Transparent to users
Quick
Minimal manual intervention
Guaranteed data access
Failover
Critical elements to be moved:
Network identity
The IP address the client uses should be transferred to
the takeover node.
Access to shared disks
A set of processes
The collection of these elements is called a "service group".
A service group is the unit that moves from one cluster
node to another.
A cluster may have multiple service groups.
Failover Requirements
Two nodes
Network connections
A pair of heartbeat networks
Public network
Admin network
Disks
Unshared disks for the OS and the
failover process
Shared disks for critical data
Application portability
Should have binary compatibility
No SPoF
Failover Management
Diagnosis
Detection of the failure and location of the failed component.
Heartbeat messages are a common way to detect failures.
Notification
Recovery
Forward recovery
Backward recovery
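The heartbeat-based diagnosis step can be sketched minimally (illustrative only; the `HeartbeatMonitor` name and the 3-second timeout are arbitrary choices, not a standard API):

```python
# Minimal heartbeat-based failure detection: each node records the time of
# its last heartbeat; a node is declared failed when no heartbeat has
# arrived within the timeout window.
import time

TIMEOUT = 3.0  # seconds without a heartbeat before declaring failure

class HeartbeatMonitor:
    def __init__(self):
        self.last_seen = {}

    def heartbeat(self, node):
        # Called whenever a heartbeat message arrives from a node.
        self.last_seen[node] = time.monotonic()

    def failed_nodes(self):
        # Diagnosis: detect and locate failed components.
        now = time.monotonic()
        return [n for n, t in self.last_seen.items() if now - t > TIMEOUT]

mon = HeartbeatMonitor()
mon.heartbeat("node1")
mon.heartbeat("node2")
print(mon.failed_nodes())  # [] — both nodes reported recently
```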
Component Monitoring
Hardware component monitoring
Application health monitoring
Process table: is the process running?
Is the process running properly? Are response times slow?
Query as an end user and check the response.
Component Monitoring
Split-brain syndrome
If the disks fail but the network stays connected, there is no problem.
If the network fails but the disks keep functioning:
The standby server thinks the primary has failed, so it tries to
take over. The primary server carries on with its job. The standby
also accepts connections, and both access the data disks
simultaneously.
This may lead to data corruption and unexpected responses.
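A common guard against split-brain is a quorum rule: a partition may act as primary only if it can see a strict majority of the cluster's nodes. A minimal sketch (the function name is hypothetical):

```python
# Quorum check to avoid split-brain: with a strict majority requirement,
# at most one network partition can ever hold quorum.
def has_quorum(visible_nodes, cluster_size):
    """True iff this partition sees a strict majority of the cluster."""
    return visible_nodes > cluster_size // 2

print(has_quorum(1, 2))  # False — a 2-node cluster cannot decide on its own;
                         # it needs a tiebreaker such as a quorum disk
```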
Checkpointing
Processes periodically save consistent state information
(a checkpoint) on stable storage.
Useful for process migration and fault tolerance (failover).
Single-process case:
The stack, heap, registers, pending signals, file descriptors and
their state, etc. are written to a file.
After restarting, the OS creates a process and the objects
associated with it, and sets the process to the state saved in the file.
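An application-level checkpointing sketch in Python, using `pickle` for the state file (the file name and state layout are illustrative assumptions):

```python
# Application-level checkpointing: the program periodically saves its own
# state to stable storage and resumes from the last checkpoint on restart.
import os, pickle

CKPT = "state.ckpt"

def save_checkpoint(state):
    # Write to a temp file, then atomically rename, so a crash mid-write
    # never leaves a corrupt checkpoint behind.
    with open(CKPT + ".tmp", "wb") as f:
        pickle.dump(state, f)
    os.replace(CKPT + ".tmp", CKPT)

def restore_checkpoint():
    # Resume from the last checkpoint if one exists, else start fresh.
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"i": 0, "total": 0}

state = restore_checkpoint()
while state["i"] < 10:
    state["total"] += state["i"]   # one unit of work
    state["i"] += 1
    save_checkpoint(state)         # rollback point for backward recovery
print(state["total"])              # 45 on a clean run (0 + 1 + ... + 9)

os.remove(CKPT)                    # cleanup for this demo
```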
08-Nov-23
Checkpointing – Levels of Implementation
OS kernel
The OS transparently checkpoints and (on failure) restarts processes.
All data structures are accessible, so any data can be saved to the checkpoint file.
Difficult to implement.
Library
A user-space checkpointing program.
Imposes restrictions on which system calls can be used (e.g. forbids IPC).
May not require source-code modifications, but the program has to link to the library.
Explicit checkpointing: the source code makes explicit calls to
checkpoint and restart, and is linked with the library.
Implicit checkpointing: the compiler inserts the checkpointing library calls.
Application
Highest efficiency, because the user knows exactly what to store.
Requires modification of the source code.
Implementations
Library-level
libckpt
Condor
libtckpt
System-level
VMADump
CRAK
libckpt
One of the first library implementations for UNIX.
Provides a number of special optimizations to reduce
the size of checkpoint files:
Memory exclusion (mark unused pages, or pages that
will not be modified)
Incremental checkpointing using mprotect()
Forked checkpointing
Synchronous checkpointing:
Allows the application to suggest to libckpt at what
times to checkpoint.
The application must be recompiled and statically linked
with libckpt.
Failure Recovery – Backward Recovery Scheme
Backward recovery
Checkpointing:
processes periodically save consistent state
information (a checkpoint) on stable storage.
Rollback:
after a failure,
the failed component is isolated,
the previous checkpoint is restored, and
normal operation is resumed.
Pros and cons:
Easy to implement and application-independent.
Rollback results in wasted execution.
Failure Recovery – Forward Recovery Scheme
Forward error recovery attempts to continue the current
computation by restoring the system to a consistent state,
compensating for the inconsistencies found in the current
state.
Fault masking using Triple Modular Redundancy (TMR) /
N-version programming.
Forward recovery is useful in systems where execution time
is critical, e.g. real-time systems, space and aero systems,
but it may be application-specific and may require additional
hardware.
Combining outputs during continuous execution may place
overhead on the OS; special hardware (a voting processor) is
required.
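TMR-style fault masking reduces to majority voting over the replica outputs; a minimal sketch (the voter function is hypothetical, and real systems vote in hardware):

```python
# Triple Modular Redundancy: three replicas compute the result
# independently and a voter emits the majority value, masking one fault.
from collections import Counter

def tmr_vote(a, b, c):
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        # More than one replica faulty: the fault can no longer be masked.
        raise RuntimeError("no majority among replica outputs")
    return value

print(tmr_vote(42, 42, 7))  # 42 — the faulty replica's output is masked
```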
Q&A
Thank You