Failure model and description:
Timing failure:
A component transmits a message earlier or later than the expected time interval.
Omission failure:
A message is never transmitted or never received. It can also be viewed as an
"infinitely late" timing failure. It takes two forms: send omission and receive omission.
Crash failure:
A component suffers an omission failure and then stops responding entirely.
Response failure:
A component delivers an incorrect response, either by returning a wrong value or by
reaching it through a wrong control flow.
Arbitrary failure:
A component generates arbitrary responses at arbitrary times. It is the worst failure
scenario, also known as a Byzantine failure, because the component's behavior is inconsistent.
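To make the difference between timing and omission failures concrete, the following is a minimal sketch of how an observer might label a single expected message. The names `Observation` and `classify_arrival` are hypothetical, and `None` is an assumed convention for "the message had not arrived when the observer gave up waiting"; in practice an observer cannot tell a lost message from one that is merely very late.

```python
from enum import Enum
from typing import Optional


class Observation(Enum):
    ON_TIME = "on time"
    TIMING_FAILURE = "timing failure"
    OMISSION_FAILURE = "omission failure"


def classify_arrival(
    arrival: Optional[float], window_start: float, window_end: float
) -> Observation:
    """Label one expected message by when (or whether) it arrived.

    arrival is the observed arrival time, or None if nothing arrived
    before the observer stopped waiting (assumed convention).
    """
    if arrival is None:
        # Never seen: the "infinitely late" case.
        return Observation.OMISSION_FAILURE
    if window_start <= arrival <= window_end:
        return Observation.ON_TIME
    # Arrived, but before or after the expected interval.
    return Observation.TIMING_FAILURE
```

A crash failure would then show up as an omission followed by omissions for every later message from the same component.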
How Failure Models Help in Designing Fault-Tolerant Systems:
Redundancy:
Implementing multiple copies of components or data allows the system to continue operating
even if one component fails.
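One way to use redundant copies is to query all of them and accept the answer that a strict majority returns. The sketch below assumes each replica is a plain zero-argument callable and that a crashed replica raises an exception; `read_with_redundancy` is a hypothetical helper, not a library API.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def read_with_redundancy(replicas: Sequence[Callable[[], Hashable]]) -> Hashable:
    """Query every replica and return the value a strict majority agrees on."""
    answers = []
    for replica in replicas:
        try:
            answers.append(replica())
        except Exception:
            # A crashed or unreachable replica simply contributes no answer.
            continue
    if not answers:
        raise RuntimeError("no replica answered")
    value, count = Counter(answers).most_common(1)[0]
    # Require more than half of ALL replicas, not just of those that answered.
    if count * 2 <= len(replicas):
        raise RuntimeError("no majority among replicas")
    return value
```

With three replicas, this masks one crashed replica or one replica that returns a wrong value, which corresponds to the crash and response failure models.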
Error detection and recovery:
Mechanisms such as checksums, timestamps, and retries can be used to detect errors
and recover from failures.
Consensus algorithms:
These algorithms allow multiple components to agree on a shared state, even in the presence of
failures.
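The failure model chosen determines how many components a consensus group needs. As a rough sketch, the commonly cited bounds are 2f + 1 replicas to tolerate f crash failures and 3f + 1 replicas to tolerate f Byzantine failures. The function names below are hypothetical, and the exact bound depends on the protocol and its timing assumptions.

```python
def min_replicas(f: int, byzantine: bool = False) -> int:
    """Smallest group size usually quoted for tolerating f faulty components."""
    if f < 0:
        raise ValueError("f must be non-negative")
    # Crash faults: 2f + 1. Byzantine (arbitrary) faults: 3f + 1.
    return 3 * f + 1 if byzantine else 2 * f + 1


def majority_quorum(n: int) -> int:
    """Number of replicas that must agree to form a strict majority of n."""
    if n <= 0:
        raise ValueError("n must be positive")
    return n // 2 + 1
```

For example, tolerating one crashed node takes three replicas with a quorum of two, while tolerating one arbitrary (Byzantine) node takes four.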
Isolation and protection:
Techniques like the bulkhead pattern can isolate components, preventing failures in one
component from cascading to others.
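A minimal sketch of the bulkhead idea, assuming a thread-based service: each downstream dependency gets its own fixed number of slots, and calls beyond that limit are rejected immediately instead of tying up more threads. `Bulkhead` and `BulkheadFullError` are illustrative names, not part of any library.

```python
import threading


class BulkheadFullError(RuntimeError):
    """Raised when every slot of a bulkhead is already in use."""


class Bulkhead:
    def __init__(self, max_concurrent: int):
        # One semaphore per protected dependency caps its concurrent calls.
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Fail fast rather than queueing, so a slow dependency cannot
        # absorb all of the caller's threads.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("bulkhead is full")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

Giving each dependency a separate `Bulkhead` instance means that exhausting the slots of one leaves the others unaffected.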
Failover mechanisms:
The system automatically switches to backup components or systems when a failure is detected.
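Failover can be sketched as trying an ordered list of backends and moving on when one fails. The example assumes that a failure surfaces as an exception and that backends are plain callables; `call_with_failover` is a hypothetical helper.

```python
from typing import Any, Callable, Sequence


def call_with_failover(backends: Sequence[Callable[[Any], Any]], request: Any) -> Any:
    """Send the request to the first backend that answers, in priority order."""
    errors = []
    for backend in backends:
        try:
            return backend(request)
        except Exception as exc:
            # Treat any exception as a detected failure and try the next backup.
            errors.append(exc)
    raise RuntimeError(f"all {len(backends)} backends failed: {errors!r}")
```

This only handles failures that are detectable as errors (crash or omission). A backend that returns a wrong value would need the voting or checksum techniques described above.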
Monitoring and logging:
Continuously monitoring system health and logging failures allows for proactive identification
and resolution of issues.
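As an illustration of monitoring and logging, the sketch below tracks heartbeats and logs a warning for every component that has been silent for longer than a timeout. The class name, the timeout parameter, and the injectable `clock` are assumptions made for the example; the logging calls use Python's standard `logging` module.

```python
import logging
import time
from typing import Callable, Dict, List

logger = logging.getLogger("health")


class HeartbeatMonitor:
    def __init__(self, timeout_s: float, clock: Callable[[], float] = time.monotonic):
        self._timeout = timeout_s
        self._clock = clock  # injectable so the monitor can be tested
        self._last_seen: Dict[str, float] = {}

    def beat(self, name: str) -> None:
        """Record that a heartbeat from the named component just arrived."""
        self._last_seen[name] = self._clock()

    def suspected(self) -> List[str]:
        """Return, and log, the components silent for longer than the timeout."""
        now = self._clock()
        late = sorted(
            name for name, seen in self._last_seen.items()
            if now - seen > self._timeout
        )
        for name in late:
            logger.warning("no heartbeat from %s for over %.1fs", name, self._timeout)
        return late
```

A missed heartbeat only makes a component suspected: the monitor cannot distinguish a crash from a timing failure on the heartbeat message itself.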