Fundamental Safety Engineering and Risk Management Concepts, 2012/2013
by M. J. Baker and H. Tan
CLASSICAL RELIABILITY THEORY
1 Introduction
Classical reliability theory describes the probability of a system completing its expected function
during an interval of time. It is the basis of reliability engineering, which is an area of study focused on
optimizing the reliability, or probability of successful functioning, of systems, such as airplanes, linear
accelerators, and any other product. It developed apart from the mainstream of probability and
statistics. It was originally a tool to help nineteenth century maritime insurance and life insurance
companies compute profitable rates to charge their customers.
The failure of mechanical devices such as ships, trains, and cars is similar in many ways to the life or
death of biological organisms. Death or failure is called an "event", and the goal of classical
reliability theory is to forecast the rate of events for a given population or the probability of
an event for an individual.
2 Reliability function
Assume that nominally identical components are put into service under conditions which are also
nominally identical and that the random time to failure of each component is observed. Furthermore,
assume that a probability distribution is fitted to the random time to failure data. Take $F_T(t)$ to be the
resulting distribution function and $f_T(t)$ to be the corresponding density function.
Then the reliability function, $R_T(t)$, is defined as
$$R_T(t) = 1 - F_T(t) = 1 - P(T \le t) = P(T > t) \qquad (1)$$
Hence,
$$R_T(t) = 1 - \int_0^t f_T(\tau)\, d\tau \qquad (2)$$
Eventually all components fail, so
$$\lim_{t \to \infty} R_T(t) = 0 \qquad (3)$$
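As an illustration of equation (1), the reliability function can be estimated from observed failure times as the fraction of components that survive beyond time $t$. The following is a minimal Python sketch; the function name `empirical_reliability` and the failure times are hypothetical and are not taken from these notes.

```python
def empirical_reliability(failure_times, t):
    """Estimate R_T(t) = P(T > t) as the fraction of components still working at time t."""
    if not failure_times:
        raise ValueError("at least one observed failure time is required")
    survivors = sum(1 for x in failure_times if x > t)
    return survivors / len(failure_times)

# Hypothetical failure times in hours (illustrative values only).
failure_times = [120.0, 340.0, 410.0, 560.0, 980.0]

# Three of the five components are still working after 400 hours.
r_400 = empirical_reliability(failure_times, 400.0)
```

With these values the estimate starts at 1 for $t = 0$ and falls to 0 once $t$ exceeds the largest observed failure time, in line with equation (3).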
3 Failure rate
Failure rate is the frequency with which an engineered system or component fails, expressed for
example in failures per hour. It is often denoted by the Greek letter $\lambda$ (lambda) and is important in
reliability engineering. The failure rate of a system usually depends on time, with the rate varying over
the life cycle of the system. For example, an automobile's failure rate in its fifth year of service may be
many times greater than its failure rate during its first year of service. One does not expect to replace an
exhaust pipe, overhaul the brakes, or have major transmission problems in a new vehicle.
In the following we approach the issue of failure rate from the starting point of the probability
distribution of a component's random time to failure. The average failure rate $\lambda(t)$ at time $t$ over a short
interval of time $\Delta t$, for an engineering component with a slowly varying failure rate, is defined from a
statistical sampling point of view as
$$\lambda(t) = \lim_{n \to \infty} \frac{n_f}{n\, \Delta t} \qquad (4)$$
where $n_f$ is the number of failures during a small time interval $\Delta t$ and $n$ is the number of surviving
components in the sample being monitored. If the time interval $\Delta t$ is now reduced to zero we obtain
what is known as the instantaneous failure rate at time $t$, also known as the hazard rate $h(t)$:
$$h(t) = \lambda(t) = \lim_{\substack{\Delta t \to 0 \\ n \to \infty}} \frac{n_f}{n\, \Delta t} \qquad (5)$$
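with units of failures per unit time. For a large sample, the fraction $n_f / n$ of surviving components that fail
during $\Delta t$ tends to $[R(t) - R(t + \Delta t)] / R(t)$. This gives
$$h(t) = \lambda(t) = \lim_{\Delta t \to 0} \frac{R(t) - R(t + \Delta t)}{\Delta t\, R(t)} = \frac{f_T(t)}{R(t)} \qquad (6)$$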
Note: $R(t)$ in the denominator of the above expression is a normalising factor which allows for the fact
that for a component to fail during the interval $[t, t + \Delta t]$ it must already have survived up to time $t$, i.e. $T > t$.
$$h(t) = \frac{\text{density function of the random time to failure at time } t}{\text{reliability function at time } t} \qquad (7)$$
Note that this use of the word hazard is a very specific and classical usage and should not be confused
with the general definition of hazard given in earlier lectures.
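The sampling definition in equation (5) can be illustrated with a small calculation. The following Python sketch assumes a short observation interval, so that the ratio $n_f / (n\, \Delta t)$ is a reasonable approximation of $h(t)$; the function name `hazard_from_counts` and the counts are hypothetical.

```python
def hazard_from_counts(n_failed, n_surviving, dt):
    """Approximate h(t) as n_f / (n * dt), following equation (5) for a small dt."""
    if n_surviving <= 0 or dt <= 0:
        raise ValueError("n_surviving and dt must be positive")
    return n_failed / (n_surviving * dt)

# Hypothetical observation: 1000 components are working at time t,
# and 5 of them fail during the next 10 hours.
h_estimate = hazard_from_counts(n_failed=5, n_surviving=1000, dt=10.0)
```

For these illustrative counts the estimate is 0.0005 failures per hour. Note that the denominator uses the number of components still working at time $t$, not the size of the original population, which is the normalisation discussed in the note above.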
Example: In practice it is of interest to see how the hazard rate $h(t)$ varies with time for different types
of distributions of the random time to failure $T$.
Let us consider the special case where $T$ is assumed to be exponentially distributed. Recall from earlier
lectures that for an exponential distribution,
$$f_T(t) = \begin{cases} \lambda e^{-\lambda t}, & t \ge 0 \\ 0, & t < 0 \end{cases} \qquad (8)$$
The distribution function is
$$F_T(t) = \int_0^t \lambda e^{-\lambda \tau}\, d\tau = 1 - e^{-\lambda t}.$$
This gives
$$h(t) = \frac{f_T(t)}{R_T(t)} = \frac{\lambda e^{-\lambda t}}{1 - (1 - e^{-\lambda t})} = \lambda \quad \text{(a constant)} \qquad (9)$$
Hence for an exponential distribution, the hazard function is equal to the single parameter $\lambda$ defining
the exponential distribution. Moreover, $h(t)$ is a constant (i.e. it does not change with time).
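The result in equation (9) can be checked numerically by evaluating $f_T(t) / R_T(t)$ at several times. The following Python sketch does this for an exponential distribution; the value of $\lambda$ and the evaluation times are arbitrary illustrative choices, and the function names are hypothetical.

```python
import math

def exp_density(t, lam):
    """Density f_T(t) of the exponential distribution, equation (8)."""
    return lam * math.exp(-lam * t) if t >= 0 else 0.0

def exp_reliability(t, lam):
    """Reliability R_T(t) = 1 - F_T(t) = exp(-lam * t) for t >= 0."""
    return math.exp(-lam * t) if t >= 0 else 1.0

def hazard(t, lam):
    """Hazard rate h(t) = f_T(t) / R_T(t), equation (6)."""
    return exp_density(t, lam) / exp_reliability(t, lam)

lam = 0.002  # illustrative failure rate, failures per hour
times = [0.0, 100.0, 1000.0, 5000.0]
hazards = [hazard(t, lam) for t in times]

# Every entry should equal lam up to floating-point rounding.
is_constant = all(math.isclose(h, lam) for h in hazards)
```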
Failure rates are important factors in the insurance, finance, commerce and regulatory industries and are
fundamental to the design of safe systems in a wide variety of applications. Failure rate data can be
obtained in several ways. The most common means are:
- Historical data about the device or system under consideration. Many organizations maintain
internal databases of failure information on the devices or systems that they produce, which can
be used to calculate failure rates for those devices or systems. For new devices or systems, the
historical data for similar devices or systems can serve as a useful estimate.
- Government and commercial failure rate data. Handbooks of failure rate data for various
components are available from government and commercial sources.
- Testing. The most accurate data come from testing samples of the actual devices or systems in
order to generate failure data. This is often prohibitively expensive or impractical, so the
previous data sources are often used instead.
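Where historical or test data are available and the failure rate is assumed to be constant (the exponential case above), a common simple estimate is the number of observed failures divided by the total accumulated operating time. The following Python sketch illustrates this under that assumption; the function name `estimate_failure_rate` and all test figures are hypothetical.

```python
def estimate_failure_rate(failure_times, n_survivors, test_duration):
    """Estimate a constant failure rate as failures / total operating time.

    failure_times: operating time of each unit that failed during the test
    n_survivors:   number of units still working when the test ended
    test_duration: length of the test, in the same time unit
    """
    total_time = sum(failure_times) + n_survivors * test_duration
    if total_time <= 0:
        raise ValueError("total operating time must be positive")
    return len(failure_times) / total_time

# Hypothetical test: 50 units run for 1000 hours, 4 of them fail.
failure_times = [200.0, 450.0, 700.0, 900.0]
rate = estimate_failure_rate(failure_times, n_survivors=46, test_duration=1000.0)

# For a constant failure rate the mean time to failure is 1 / rate.
mttf_estimate = 1.0 / rate
```

Units that have not failed by the end of the test still contribute their operating time to the denominator, which is why the survivors are counted separately.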
4 Bathtub curve
Over many years, and across a wide variety of mechanical and electronic components and systems,
people have calculated empirical population failure rates as units age and have repeatedly obtained
a graph like the one shown in Figure 1. Because of its shape, this failure rate curve has become widely
known as the "Bathtub" curve.
The initial region that begins at time zero when a customer first begins to use the product is
characterized by a high but rapidly decreasing failure rate. This region is known as the Early Failure
Period (also referred to as Infant Mortality Period, from the actuarial origins of the first bathtub curve
plots).
Then the failure rate levels off and remains roughly constant for (hopefully) the majority of the useful
life of the product. This long period of a level failure rate is known as the Intrinsic Failure Period (also
called the Stable Failure Period) and the constant failure rate level is called the Intrinsic Failure Rate.
Note that most systems spend most of their lifetimes operating in this flat portion of the bathtub curve.
Finally, if units from the population remain in use long enough, the failure rate begins to increase as
materials wear out and degradation failures occur at an ever-increasing rate. This is the Wearout Failure
Period.
Figure 1, The bathtub curve.
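The three regions described above can be mimicked by adding a decaying early-failure term, a constant intrinsic term and a growing wear-out term. The following Python sketch is purely illustrative: the functional form and every parameter value are assumptions chosen to reproduce the bathtub shape, not a model given in these notes.

```python
import math

def bathtub_hazard(t):
    """Illustrative bathtub-shaped hazard rate (failures per hour) at age t hours."""
    early = 0.01 * math.exp(-t / 500.0)       # early failures, rapidly decreasing
    intrinsic = 0.001                         # constant intrinsic failure rate
    wearout = 0.001 * (t / 20000.0) ** 4      # wear-out, increasing with age
    return early + intrinsic + wearout

h_new = bathtub_hazard(0.0)        # start of the early failure period
h_mid = bathtub_hazard(5000.0)     # intrinsic (flat) period
h_old = bathtub_hazard(30000.0)    # wear-out period
```

With these assumed parameters the hazard rate is high at time zero, falls close to the intrinsic level in mid-life and rises again at large ages.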
5 Concept of Expected Life or Mean Time To Failure (MTTF)
This answers the question: How long do we need to wait on average before a component fails?
If the distribution of the random time to failure of a component is known (i.e. $f_T(t)$) then by definition
the expected life or expected time before failure is given by:
$$\text{MTTF} = E(T) = \int_0^\infty \tau f_T(\tau)\, d\tau \qquad (10)$$
Example 4.1:
Prove that the MTTF can also be calculated from
$$\text{MTTF} = \int_0^\infty R_T(t)\, dt \qquad (11)$$
Solution:
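$$\begin{aligned}
\text{MTTF} &= \int_0^\infty \tau f_T(\tau)\, d\tau \\
&= \int_0^\infty \tau \frac{dF_T(\tau)}{d\tau}\, d\tau \\
&= -\int_0^\infty \tau \frac{dR_T(\tau)}{d\tau}\, d\tau \\
&= -\int_0^\infty \tau\, dR_T(\tau) \\
&= -\Big[ \tau R_T(\tau) \Big]_0^\infty + \int_0^\infty R_T(\tau)\, d\tau \\
&= \int_0^\infty R_T(\tau)\, d\tau
\end{aligned} \qquad (12)$$
The third line uses $R_T(\tau) = 1 - F_T(\tau)$ and the fifth line is integration by parts. The boundary term
vanishes because $\tau R_T(\tau)$ is zero at $\tau = 0$ and, provided the MTTF is finite, tends to zero as $\tau \to \infty$.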
Example 4.2:
If T is exponentially distributed then
$$\text{MTTF} = \int_0^\infty R_T(t)\, dt = \int_0^\infty e^{-\lambda t}\, dt = \frac{1}{\lambda}$$
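Equations (10) and (11) and the result of Example 4.2 can be checked numerically. The following Python sketch integrates both expressions with a simple trapezoidal rule for an exponential distribution; the value of $\lambda$, the truncation of the upper limit and the number of steps are illustrative assumptions, and the helper `trapezoid` is written here for the purpose.

```python
import math

def trapezoid(func, a, b, n):
    """Composite trapezoidal rule for the integral of func over [a, b] with n steps."""
    step = (b - a) / n
    total = 0.5 * (func(a) + func(b))
    for i in range(1, n):
        total += func(a + i * step)
    return total * step

lam = 0.002          # illustrative failure rate, failures per hour
upper = 50.0 / lam   # truncated upper limit; exp(-50) is negligible

# Equation (11): integral of the reliability function R_T(t) = exp(-lam * t).
mttf_from_reliability = trapezoid(lambda t: math.exp(-lam * t), 0.0, upper, 100000)

# Equation (10): integral of t * f_T(t) with f_T(t) = lam * exp(-lam * t).
mttf_from_density = trapezoid(lambda t: t * lam * math.exp(-lam * t), 0.0, upper, 100000)

mttf_exact = 1.0 / lam   # result of Example 4.2
```

Both numerical values agree with $1/\lambda$ = 500 hours to within the accuracy of the integration.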