Reliability Explained
Introduction:
The growing size of projects and designs has led to wider adoption of reliability and
availability analysis. These activities have been practiced for many years but were generally
confined to a few design areas. The increased interest in these methods has been driven by
several factors. If a design proves to be unreliable, a significant cost can be incurred to repair
the devices. As scientific detectors become larger and more complicated, repairing tens of
thousands of circuit boards would have a severe cost impact and could cause the
experiment to be canceled after a great deal of money and effort has been invested.
The following article is reprinted with permission from Prenscia HBK (Reliasoft).
HBMPrenscia.com
It does not represent an endorsement of the company or its products; rather, it is an
example of a methodology for reliability and availability calculation.
MIL-217, Bellcore/Telcordia and Other Reliability Prediction Methods for Electronic Products -
ReliaSoft
In today's competitive electronic products market, having higher reliability than competitors is
one of the key factors for success. To obtain high product reliability, consideration of reliability
issues should be integrated from the very beginning of the design phase. This leads to the
concept of reliability prediction. Historically, this term has been used to denote the process of
applying mathematical models and component data for the purpose of estimating the field
reliability of a system before failure data are available for the system. However, the objective of
reliability prediction is not limited to predicting whether reliability goals, such as MTBF, can be
reached. It can also be used for:
• Identifying potential design weaknesses
• Evaluating the feasibility of a design
• Comparing different designs and life-cycle costs
• Providing models for system reliability/availability analysis
• Establishing goals for reliability tests
• Aiding in business decisions such as budget allocation and scheduling
Once the prototype of a product is available, lab tests can be utilized to obtain more accurate
reliability predictions. Accurate prediction of the reliability of electronic products requires
knowledge of the components, the design, the manufacturing process and the expected operating
conditions. Several different approaches have been developed to achieve the reliability prediction
of electronic systems and components. Each approach has its unique advantages and
disadvantages. Among these approaches, three main categories are often used within government
and industry: empirical (standards based), physics of failure and life testing. In this article, we
will provide an overview of all three approaches.
First, we will discuss empirical prediction methods, which are based on the experiences of
engineers and on historical data. Standards, such as MIL-HDBK-217 and Bellcore/Telcordia, are
widely used for reliability prediction of electronic products. Next, we will discuss physics of
failure methods, which are based on root-cause analysis of failure mechanisms, failure modes
and stresses. This approach is based upon an understanding of the physical properties of the
materials, operation processes and technologies used in the design. Finally, we will discuss life
testing methods, which are used to determine reliability by testing a relatively large number of
samples at their specified operation stresses or higher stresses and using statistical models to
analyze the data.
Empirical prediction methods are based on models developed from statistical curve fitting of
historical failure data, which may have been collected in the field, in-house or from
manufacturers. These methods tend to present good estimates of reliability for similar or slightly
modified parts. Some parameters in the curve function can be modified by integrating
engineering knowledge. The assumption is made that system or equipment failure causes are
inherently linked to components whose failures are independent of each other. There are many
different empirical methods that have been created for specific applications. Some have gained
popularity within industry in the past three decades. The table below lists some of the available
prediction standards and the following sections describe two of the most commonly used
methods in a bit more detail.
MIL-HDBK-217 is very well known in military and commercial industries. It is probably the
most internationally recognized empirical prediction method, by far. The latest version is MIL-
HDBK-217F, which was released in 1991 and had two revisions: Notice 1 in 1992 and Notice 2
in 1995.
The MIL-HDBK-217 predictive method consists of two parts; one is known as the parts
count method and the other is called the part stress method [1]. The parts count method assumes
typical operating conditions of part complexity, ambient temperature, various electrical stresses,
operation mode and environment (called reference conditions). The failure rate for a part under
the reference conditions is calculated as:
[Equation and symbol definitions not reproduced in this copy.]
Since the parts may not operate under the reference conditions, the real operating conditions will
result in failure rates that are different from those given by the "parts count" method. Therefore,
the part stress method requires the specific part’s complexity, application stresses, environmental
factors, etc. (called Pi factors). For example, MIL-HDBK-217 provides many environmental
conditions (expressed as πE) ranging from "ground benign" to "cannon launch." The standard also
provides multi-level quality specifications (expressed as πQ). The failure rate for parts under
specific operating conditions can be calculated as:
[Equation and symbol definitions not reproduced in this copy.]
Figure 1 shows an example using the MIL-HDBK-217 method (in ReliaSoft Lambda
Predict software) to predict the failure rate of a ceramic capacitor. According to the handbook,
the failure rate of a commercial ceramic capacitor of 0.00068 μF capacitance with 80% operation
voltage, working under 30 degrees ambient temperature and "ground benign" environment is
0.0217 / 10^6 hours. The corresponding MTBF (mean time between failures) or MTTF (mean time to
failure) is estimated to be 4.6140 × 10^7 hours.
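The conversion from failure rate to MTTF in this example can be checked with a few lines of Python. This is a minimal sketch that assumes a constant failure rate, so that MTTF = 1/λ; the function name is illustrative and is not part of Lambda Predict.

```python
def mttf_hours(failures_per_million_hours: float) -> float:
    """Return the MTTF in hours for a constant failure rate given per 10^6 hours."""
    if failures_per_million_hours <= 0:
        raise ValueError("failure rate must be positive")
    return 1.0e6 / failures_per_million_hours


# Failure rate quoted for the ceramic capacitor in the example above.
capacitor_mttf = mttf_hours(0.0217)
print(f"MTTF = {capacitor_mttf:.3e} hours")
```

With the rounded rate of 0.0217 the result is close to 4.61 × 10^7 hours; the slightly different figure quoted in the text presumably reflects an unrounded failure rate.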
Bellcore was a telecommunications research and development company that provided joint R&D
and standards setting for AT&T and its co-owners. Because of dissatisfaction with military
handbook methods for their commercial products, Bellcore designed its own reliability
prediction standard for commercial telecommunication products. In 1997, the company was
acquired by Science Applications International Corporation (SAIC) and the company's name was
changed to Telcordia. Telcordia continues to revise and update the standard. The latest two
updates are SR-332 Issue 2 (September 2006) and SR-332 Issue 3 (January 2011), both called
"Reliability Prediction Procedure for Electronic Equipment."
The Bellcore/Telcordia standard assumes a serial model for electronic parts and it addresses
failure rates at the infant mortality stage and at the steady-state stage with Methods I, II and III
[2-3]. Method I is similar to the MIL-HDBK-217F parts count and part stress methods. The
standard provides the generic failure rates and three part stress factors: device quality factor (πQ),
electrical stress factor (πS) and temperature stress factor (πT). Method II is based on combining
Method I predictions with data from laboratory tests performed in accordance with specific SR-
332 criteria. Method III is a statistical prediction of failure rate based on field tracking data
collected in accordance with specific SR-332 criteria. In Method III, the predicted failure rate is
a weighted average of the generic steady-state failure rate and the field failure rate.
Figure 2 shows an example in Lambda Predict using SR-332 Issue 3 to predict the failure rate of
the same capacitor in the previous MIL-HDBK-217 example (shown in Figure 1). The failure
rate is 9.655 FITs, which is 9.655 / 10^9 hours. In order to compare the predicted results from
MIL-HDBK-217 and Bellcore SR-332, we must convert the failure rate to the same units. 9.655 FITs is
0.009655 / 10^6 hours. So the result of 0.0217 / 10^6 hours in MIL-HDBK-217 is about 2.2 times higher
than the result in Bellcore/Telcordia SR-332. There are reasons for this variation. First, MIL-
HDBK-217 is a standard used in the military so it is more conservative than the commercial
standard. Second, the underlying methods are different and more factors that may affect the
failure rate are considered in MIL-HDBK-217.
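The unit conversion behind this comparison can be sketched as follows. The only definition used is that 1 FIT is one failure per 10^9 hours; the function and variable names are illustrative.

```python
HOURS_PER_FIT_BASE = 1.0e9  # 1 FIT = 1 failure per 10^9 hours


def fit_to_per_million_hours(fits: float) -> float:
    """Convert a failure rate in FITs to failures per 10^6 hours."""
    return fits * 1.0e6 / HOURS_PER_FIT_BASE


telcordia_rate = fit_to_per_million_hours(9.655)  # SR-332 result for the capacitor
mil217_rate = 0.0217                              # MIL-HDBK-217 result, per 10^6 hours
ratio = mil217_rate / telcordia_rate

print(f"SR-332: {telcordia_rate:.6f} per 10^6 hours")
print(f"MIL-HDBK-217 / SR-332 ratio: {ratio:.2f}")
```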
Although empirical prediction standards have been used for many years, they should always be used
with caution. The advantages and disadvantages of empirical methods have been widely discussed
over the past three decades. A brief summary from the publications in industry, military and
academia is presented next [5-9].
In contrast to empirical reliability prediction methods, which are based on the statistical analysis
of historical failure data, a physics of failure approach is based on the understanding of the
failure mechanism and applying the physics of failure model to the data. Several popularly used
models are discussed next.
Arrhenius's Law
One of the earliest and most successful acceleration models predicts how the time-to-failure of a
system varies with temperature. This empirically based model is known as the Arrhenius
equation. Generally, chemical reactions can be accelerated by increasing the system temperature.
Since it is a chemical process, the aging of a capacitor (such as an electrolytic capacitor) is
accelerated by increasing the operating temperature. The model takes the following form:
$$L(T) = A \exp\left(\frac{E_a}{kT}\right)$$
where:
• L(T ) is the life characteristic related to temperature
• A is the scaling factor
• Ea is the activation energy
• k is the Boltzmann constant
• T is the temperature.
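A short Python sketch may help show how this relationship is used to compare life at two temperatures. It assumes the standard Arrhenius life form L(T) = A·exp(Ea/(kT)) with T in kelvin and Ea in eV; the temperatures and activation energy below are hypothetical values chosen only for illustration.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5  # Boltzmann constant in eV/K


def arrhenius_life(temp_k: float, scale_a: float, ea_ev: float) -> float:
    """Life characteristic L(T) = A * exp(Ea / (k * T))."""
    return scale_a * math.exp(ea_ev / (BOLTZMANN_EV_PER_K * temp_k))


def acceleration_factor(use_temp_k: float, stress_temp_k: float, ea_ev: float) -> float:
    """Ratio of life at the use temperature to life at the stress temperature."""
    # The scale factor A cancels in the ratio, so any positive value works here.
    return arrhenius_life(use_temp_k, 1.0, ea_ev) / arrhenius_life(stress_temp_k, 1.0, ea_ev)


# Hypothetical case: 55 C use, 85 C stress, Ea = 0.7 eV.
af = acceleration_factor(use_temp_k=328.15, stress_temp_k=358.15, ea_ev=0.7)
print(f"Acceleration factor: {af:.2f}")
```

Because the scaling factor cancels in the ratio, only the activation energy and the two temperatures are needed to relate test life to use life.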
While the Arrhenius model emphasizes the dependency of reactions on temperature, the Eyring
model is commonly used for demonstrating the dependency of reactions on stress factors other
than temperature, such as mechanical stress, humidity or voltage.
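The standard Eyring model takes the form:
$$L(T,S) = A\,T^{\alpha} \exp\left[\frac{E_a}{kT} + \left(B + \frac{C}{T}\right) S\right]$$
where: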
• L(T ,S) is the life characteristic related to temperature and another stress
• A, α, B and C are constants
• S is a stress factor other than temperature
• T is absolute temperature
According to different physics of failure mechanisms, one more term (i.e., stress) can be either
removed or added to the above standard Eyring model. Several models are similar to the standard
Eyring model. They are:
Electronic devices with aluminum or aluminum alloy with small percentages of copper and
silicon metallization are subject to corrosion failures and therefore can be described with the
following model [11]:
[Equation and symbol definitions not reproduced in this copy.]
Hot carrier injection describes the phenomena observed in MOSFETs by which the carrier gains
sufficient energy to be injected into the gate oxide, generate interface or bulk oxide defects and
degrade MOSFETs characteristics such as threshold voltage, transconductance, etc. [11]:
[Equations and symbol definitions not reproduced in this copy.]
Electromigration is a failure mechanism that results from the transfer of momentum from the
electrons, which move in the applied electric field, to the ions, which make up the lattice of the
interconnect material. The most common failure mode is "conductor open." As the feature
sizes of integrated circuits (ICs) shrink, the increased current density makes this failure mechanism
very important in IC reliability.
At the end of the 1960s, J. R. Black developed an empirical model to estimate the MTTF of a
wire, taking electromigration into consideration, which is now generally known as the Black
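model. The Black model employs external heating and increased current density and is given by:
$$MTTF = A\,J^{-N} \exp\left(\frac{E_a}{kT}\right)$$
where:
• A is a constant
• J is the current density
• N is the current density exponent
• Ea is the activation energy
• k is the Boltzmann constant
• T is the temperature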
The current density (J) and temperature (T) are factors in the design process that affect
electromigration. Numerous experiments with different stress conditions have been reported in
the literature, where the values have been reported in the range between 2 and 3.3 for N, and 0.5
to 1.1 eV for Ea. Usually, the lower the values, the more conservative the estimation.
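The effect of the parameter choice can be illustrated with a small Python sketch. It assumes the usual form of Black's equation, MTTF = A·J^(-N)·exp(Ea/(kT)); the current densities and temperatures are hypothetical, and only the N and Ea ranges come from the text above.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5  # Boltzmann constant in eV/K


def black_mttf(scale_a: float, current_density: float, exponent_n: float,
               ea_ev: float, temp_k: float) -> float:
    """Black's equation: MTTF = A * J**(-N) * exp(Ea / (k * T))."""
    return (scale_a * current_density ** (-exponent_n)
            * math.exp(ea_ev / (BOLTZMANN_EV_PER_K * temp_k)))


def black_acceleration(j_use: float, j_stress: float, t_use_k: float,
                       t_stress_k: float, exponent_n: float, ea_ev: float) -> float:
    """Ratio of MTTF at use conditions to MTTF at stress conditions."""
    return (black_mttf(1.0, j_use, exponent_n, ea_ev, t_use_k)
            / black_mttf(1.0, j_stress, exponent_n, ea_ev, t_stress_k))


# Hypothetical test: twice the use current density, 125 C stress vs 85 C use.
conditions = dict(j_use=1.0, j_stress=2.0, t_use_k=358.15, t_stress_k=398.15)
af_low = black_acceleration(exponent_n=2.0, ea_ev=0.5, **conditions)
af_high = black_acceleration(exponent_n=3.3, ea_ev=1.1, **conditions)
print(f"AF with N=2.0, Ea=0.5 eV: {af_low:.1f}")
print(f"AF with N=3.3, Ea=1.1 eV: {af_high:.1f}")
```

The lower parameter values give the smaller acceleration factor, so a test result extrapolates to a shorter life at use conditions, which is why they are regarded as the more conservative choice.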
Fatigue failures can occur in electronic devices due to temperature cycling and thermal shock.
Permanent damage accumulates each time the device experiences a normal power-up and power-
down cycle. These switch cycles can induce cyclical stress that tends to weaken materials and
may cause several different types of failures, such as dielectric/thin-film cracking, lifted bonds,
solder fatigue, etc. A model known as the (modified) Coffin-Manson model has been used
successfully to model crack growth in solder due to repeated temperature cycling as the device is
switched on and off. This model takes the form [9]:
[Equation and symbol definitions not reproduced in this copy.]
A given electronic component will have multiple failure modes and the component's failure rate
is equal to the sum of the failure rates of all modes (e.g., humidity, voltage, temperature, thermal
cycling and so on). The system's failure rate is equal to the sum of the failure rates of the
components involved. In using the above models, the model parameters can be determined from
the design specifications or operating conditions. If the parameters cannot be determined without
conducting a test, the failure data obtained from the test can be used to get the model parameters.
Software products such as ReliaSoft ALTA can help you analyze the failure data.
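The summation described above can be sketched in Python as follows. All component names and failure rates are hypothetical (in failures per 10^6 hours), and the sketch assumes independent modes with constant failure rates in a serial configuration.

```python
def component_failure_rate(mode_rates: dict) -> float:
    """Sum the failure rates of all failure modes of one component."""
    return sum(mode_rates.values())


def system_failure_rate(components: dict) -> float:
    """Sum the component failure rates of a serial system."""
    return sum(component_failure_rate(modes) for modes in components.values())


# Hypothetical failure rates per 10^6 hours, keyed by failure mode.
components = {
    "capacitor": {"temperature": 0.010, "voltage": 0.005},
    "ic": {"thermal_cycling": 0.020, "humidity": 0.015},
}

total_rate = system_failure_rate(components)
system_mttf_hours = 1.0e6 / total_rate
print(f"System failure rate: {total_rate:.3f} per 10^6 hours")
print(f"System MTTF: {system_mttf_hours:.3e} hours")
```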
We will give an example of using ALTA to analyze the Arrhenius model. For this example, the
life of an electronic component is considered to be affected by temperature. The component is
tested under temperatures of 406, 416 and 426 Kelvin. The usage temperature level is 400
Kelvin. The Arrhenius model and the Weibull distribution are used to analyze the failure data in
ALTA. Figure 4 shows the data and calculated parameters. Figure 5 shows the reliability plot
and the estimated B10 life at the usage temperature level.
Figure 4: Data and analysis results in ALTA with the Arrhenius-Weibull model
Figure 5: Reliability vs. Time plot and calculated B10 life
From Figure 4, we can see that the estimated activation energy in the Arrhenius model is 0.92.
Note that, in ALTA, the Arrhenius model is simplified to the form:
$$L(T) = C\,e^{B/T}$$
Using this equation, the parameters B and C calculated by ALTA can easily be transformed to
the parameters described above for the Arrhenius relationship.
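The parameter transformation can be sketched in Python. This assumes the simplified relationship has the form L(T) = C·exp(B/T), so that B = Ea/k and C plays the role of the scaling factor A, and it assumes the activation energy of 0.92 is expressed in eV; the function names are illustrative and are not ALTA functions.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5  # Boltzmann constant in eV/K


def b_from_ea(ea_ev: float) -> float:
    """Convert an activation energy in eV to the parameter B in kelvin."""
    return ea_ev / BOLTZMANN_EV_PER_K


def ea_from_b(b_kelvin: float) -> float:
    """Convert the parameter B in kelvin back to an activation energy in eV."""
    return b_kelvin * BOLTZMANN_EV_PER_K


b_param = b_from_ea(0.92)

# Acceleration factor between the highest test temperature (426 K)
# and the usage temperature (400 K) from the example.
af_426 = math.exp(b_param * (1.0 / 400.0 - 1.0 / 426.0))
print(f"B = {b_param:.0f} K, AF(426 K vs 400 K) = {af_426:.2f}")
```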
As mentioned above, time-to-failure data from life testing may be incorporated into some of the
empirical prediction standards (i.e., Bellcore/Telcordia Method II) and may also be necessary to
estimate the parameters for some of the physics of failure models. However, in this section of the
article, we are using the term life testing method to refer specifically to a third type of approach
for predicting the reliability of electronic products. With this method, a test is conducted on a
sufficiently large sample of units operating under normal usage conditions. Times-to-failure are
recorded and then analyzed with an appropriate statistical distribution in order to estimate
reliability metrics such as the B10 life. This type of analysis is often referred to as Life Data
Analysis or Weibull Analysis.
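For a 2-parameter Weibull distribution, the B10 life follows directly from the fitted parameters. The sketch below assumes the standard reliability function R(t) = exp(-(t/η)^β); the shape and scale values are hypothetical and do not come from the article's data set.

```python
import math


def weibull_reliability(t: float, beta: float, eta: float) -> float:
    """Reliability R(t) = exp(-(t / eta)**beta) for a 2-parameter Weibull."""
    return math.exp(-((t / eta) ** beta))


def weibull_bx_life(x_percent: float, beta: float, eta: float) -> float:
    """Time by which x percent of units are expected to have failed."""
    return eta * (-math.log(1.0 - x_percent / 100.0)) ** (1.0 / beta)


# Hypothetical fitted parameters: shape beta = 1.5, scale eta = 1000 hours.
b10 = weibull_bx_life(10.0, beta=1.5, eta=1000.0)
print(f"B10 life: {b10:.1f} hours")
print(f"Reliability at B10: {weibull_reliability(b10, 1.5, 1000.0):.3f}")
```

By definition the reliability evaluated at the B10 life is 0.90, which provides a quick consistency check on the two functions.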
ReliaSoft Weibull++ software is a tool for conducting life data analysis. As an example, suppose
that an IC board is tested in the lab and the failure data are recorded. Figure 6 shows the data
entered into Weibull++ and analyzed with the 2-parameter Weibull lifetime distribution, while
Figure 7 shows the Reliability vs. Time plot and the calculated B10 life for the analysis.
Figure 6: Data and analysis results in Weibull++ with the Weibull distribution
Figure 7: Reliability vs. Time plot and calculated B10 life for the analysis
The life testing method can provide more information about the product than the empirical
prediction standards. Therefore, the prediction is usually more accurate, given that enough
samples are used in the testing.
The life testing method may also be preferred over both the empirical and physics of failure
methods when it is necessary to obtain realistic predictions at the system (rather than component)
level. This is because the empirical and physics of failure methods calculate the system failure
rate based on the predictions for the components (e.g., using the sum of the component failure
rates if the system is considered to be a serial configuration). This assumes that there are no
interaction failures between the components but, in reality, due to the design or manufacturing,
components are not independent. (For example, if the fan is broken in your laptop, the CPU will
fail faster because of the high temperature.) Therefore, in order to consider the complexity of the
entire system, life tests can be conducted at the system level, treating the system as a "black
box," and the system reliability can be predicted based on the obtained failure data.
Conclusions
In this article, we discussed three approaches for electronic reliability prediction. The empirical
(or standards based) methods can be used in the design stage to quickly obtain a rough estimation
of product reliability. The physics of failure and life testing methods can be used in both design
and production stages. In physics of failure approaches, the model parameters can be determined
from design specs or from test data. On the other hand, with the life testing method, since the
failure data from your own particular products are obtained, the prediction results usually are
more accurate than those from a general standard or model.
References
[1] MIL-HDBK-217F, Reliability Prediction of Electronic Equipment, 1991. Notice 1 (1992) and
Notice 2 (1995).
[2] SR-332, Issue 1, Reliability Prediction Procedure for Electronic Equipment, Telcordia, May
2001.
[3] SR-332, Issue 2, Reliability Prediction Procedure for Electronic Equipment, Telcordia,
September 2006.
[4] ITEM Software and ReliaSoft, D490 Course Notes: Introduction to Standards Based
Reliability Prediction and Lambda Predict, 2015.
[5] B. Foucher, J. Boullie, B. Meslet and D. Das, "A Review of Reliability Prediction Methods
for Electronic Devices," Microelectron. Reliab., vol. 42, no. 8, August 2002, pp. 1155-1162.
[6] M. Pecht, D. Das and A. Ramakrishnan, "The IEEE Standards on Reliability Program and
Reliability Prediction Methods for Electronic Equipment," Microelectron. Reliab., vol. 42,
2002, pp. 1259-1266.
[7] M. Talmor and S. Arueti, "Reliability Prediction: The Turnover Point," 1997 Proc. Ann.
Reliability and Maintainability Symp., 1997, pp. 254-262.
[8] W. Denson, "The History of Reliability Prediction," IEEE Trans. On Reliability, vol. 47, no.
3-SP, September 1998.
[9] D. Hirschmann, D. Tissen, S. Schroder and R.W. de Doncker, "Reliability Prediction for
Inverters in Hybrid Electrical Vehicles," IEEE Trans. on Power Electronics, vol. 22, no. 6,
November 2007, pp. 2511-2517.
[11] Semiconductor Device Reliability Failure Models. [Online document] Available HTTP:
www.sematech.org/docubase/document/3955axfr.pdf
Reliability links:
A Guide to Reliability Prediction Standards & Failure Rate | Relyence
How to Perform Reliability Predictions Easily and Efficiently (relyence.com)
▪ FMEA (Failure Mode and Effects Analysis) identifies potential failures, provides a
way to assess the criticality of those failures, and then tracks ways to eliminate
or mitigate them.
▪ FRACAS (Failure, Reporting, Analysis and Corrective Action System) and its
related CAPA (Corrective and Preventive Action) enable you to effectively track
and manage your corrective action process.
▪ FTA (Fault Tree Analysis) assesses the risk of catastrophic events.
▪ Reliability Prediction computes MTBF metrics and provides a platform for
“designing-in” reliability.
▪ RBD (Reliability Block Diagram) offers full scale system modeling and analysis of
complex designs including those that use redundancy.
▪ Maintainability Prediction provides the ability to ensure repair and maintenance
procedures are effective and efficient.
▪ Weibull analysis is a versatile tool for predictive analytics using life data.
▪ ALT (Accelerated Life Testing) allows you to take accelerated life data and
extrapolate real world system performance.
FMEA, or Failure Mode and Effects Analysis, is an organized, systematic approach for
assessing potential system failures and the resulting consequences of those failures. The
objective of a FMEA is to evaluate the risk associated with the identified failure effects and
come up with a plan to detect, prevent, or mitigate those deemed most critical.
Fault Tree Analysis (FTA) uses a top-down deductive approach to assess the likelihood of
occurrence of an undesired, often catastrophic, event. FTA provides an important measured-
based approach for risk analysis.
RAMS analysis is a well-established approach for evaluating four critical factors related to
system performance: reliability, availability, maintainability, and safety. Widely used in
engineering disciplines, RAMS analysis ensures that systems meet operational requirements
throughout the lifecycle. The objective of RAMS analysis is to assess reliability, availability,
maintainability, and safety in an organized way, identify areas of concern, and facilitate
improvements to ensure that program goals are met.
Reliability is defined as the probability, or likelihood, that an item will perform a desired function
without failure under stated conditions for a stated period of time. In general, reliability is an
indicator of the likelihood a product will operate without failure.
Availability is defined as the probability that a repairable system is in a working state when it is
required to be operational.
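As an illustration, the widely used steady-state expression for availability is MTBF / (MTBF + MTTR). This formula is a standard textbook relation rather than something stated in the text above, and the numbers in the sketch are hypothetical.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time a repairable system is in a working state."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical system: fails every 1000 hours on average, 10 hours to repair.
availability = steady_state_availability(1000.0, 10.0)
print(f"Availability: {availability:.4f}")
```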
Maintainability is defined in MIL-STD-721 as “the measure of the ability of an item to be
retained in or restored to a specified condition when maintenance is performed by personnel
having specified skill levels, using prescribed procedures and resources, at each prescribed
level of maintenance and repair.”
Safety is a term with a much clearer definition! When used in reference to RAMS analysis,
safety analysis is performed in order to evaluate ways to prevent harm to people and the
environment.
A Reliability Analysis of the Mu2e Calorimeter Front End Amplifier Board
1 Fermi National Accelerator Laboratory, Batavia, IL, USA
2 INFN, Inst. Nazionale Di Fisica Nucleare, Frascati, Italy
Abstract
This note describes the estimation process and the calculation results in performing a
reliability analysis for the Mu2e Calorimeter Front End Amplifier Board. The analysis
is based upon the procedures set forth in the military handbook, “Reliability Prediction of
Electronic Equipment,” also known as MIL-HDBK-217F. The analysis shows that the value
of the Mean Time to Failure for this board is estimated to be 1.92 × 10^6 hours.
Estimates of the probability of failure and a prediction of the number of board failures as
a function of time are presented.
Table of Contents
1. Introduction
1.1. Scope
1.2. Description of the Board
1.3. Limits of Scope
2. Methodology
2.1. Overview of the Analysis Process
2.2. Resistors
2.3. Capacitors
2.4. Low Frequency Diodes
2.4.1. Temperature Factor for General Purpose Diodes
2.4.2. Temperature Factor for Voltage Regulator Diodes
2.4.3. Stress, Contact Construction, Quality, and Environmental Factors
2.5. Low-Frequency, Silicon MOSFETS
2.6. Low-Frequency Bipolar Transistors
2.7. Linear Integrated Circuits
2.7.1. Temperature Factor for Linear ICs
2.7.2. Environmental Factor for Linear ICs
2.7.3. Quality and Learning Factors for Linear ICs
2.8. Connectors
3. Analysis Results
3.1. Analysis Results 1 – No Harsh Environment Factors
3.2. Analysis Results 2 – With Harsh Environment Factors
3.3. Discussion
3.4. Interpretation of this Analysis
4. Appendix I – Overview of Reliability Analysis Methodology
5. References
1. Introduction
1.1. Scope
This note describes an analysis of the reliability of the Mu2e Calorimeter Front
End Electronics Board. The analysis is based upon the methodology described in the
military handbook, “Reliability Prediction of Electronic Equipment,” also known as MIL-
HDBK-217F [1] (hereafter referred to as “the handbook,”) which was developed by the
Dept. of Defense for analyzing the reliability of military and aerospace systems. The results
from this reliability analysis provide a prediction of the average failure rate for this board
in the Mu2e Calorimeter instrumentation system. The analysis is limited to the evaluation
of the components on the board, using the guidance set forth in the handbook, which uses
specific weighting or acceleration factors in the calculation of the failure rate for each
individual part on the board. These factors are functions of certain aspects of the
application and parts choices, including temperature, voltage, power, packaging,
complexity, fabrication technology, environment, and quality of manufacturing. Once the
failure rate for each part is calculated, they are then combined to obtain the overall failure
rate for the board. From this calculation, the “Mean Time to Failure,” or MTTF, can be
calculated. This is the standard quantity used in reliability analysis. From the MTTF,
estimates of the probability of failure and expected number of failures as functions of time
for the system can be obtained.
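The step from MTTF to failure probability and expected failure count can be sketched in Python. This is only an illustration under the assumption of a constant failure rate (exponential model); it uses the MTTF quoted in the abstract and the total of 2,696 Front End Boards given in the board description, and the function names are hypothetical.

```python
import math

MTTF_HOURS = 1.92e6   # MTTF estimate quoted in the abstract
BOARD_COUNT = 2696    # total Front End Boards in the system
HOURS_PER_YEAR = 8760.0


def failure_probability(t_hours: float, mttf_hours: float) -> float:
    """Probability that a single board has failed by time t: 1 - exp(-t / MTTF)."""
    return 1.0 - math.exp(-t_hours / mttf_hours)


def expected_failures(t_hours: float, mttf_hours: float, n_boards: int) -> float:
    """Expected number of failed boards by time t, without replacement."""
    return n_boards * failure_probability(t_hours, mttf_hours)


for years in (1, 3, 5):
    t = years * HOURS_PER_YEAR
    print(f"{years} yr: P(fail) = {failure_probability(t, MTTF_HOURS):.4f}, "
          f"expected failures = {expected_failures(t, MTTF_HOURS, BOARD_COUNT):.1f}")
```

The results of the actual analysis are presented in Section 3; this sketch only shows the form of the calculation.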
1.2. Description of the Board
The Mu2e Calorimeter [2-3] is comprised of two disks, each constructed as an array
of cesium iodide (CsI) crystals. A rendering of the detector is shown in Fig. 1.2.1. There
are a total of 1348 crystals in the detector, split evenly between the two disks. Each crystal
is configured with four sets of three silicon photo-multipliers (SiPMs) across the face of
the crystal as shown in Fig. 1.2.2. Each set of three SiPMs is connected in series. The
signals from two such groups are summed together and instrumented with a Front End
Board connected to the back side of the SiPM holder as shown. Thus, each crystal is read
out using two Front End Boards. This yields a total of 2,696 Front End Boards in the
system.
Fig. 1.2.1. (A) Configuration of two Calorimeter Disks in the Mu2e detector
(B) Configuration of CsI Crystals Looking into the Face of a Disk
Fig. 1.2.2. Configuration of SiPMs and Associated Front End Boards on a CsI Crystal
A diagram of the readout electronics is shown in Figure 1.2.3. The Front End
Boards process the charge signals from the SiPMs, and send analog voltages off-board
differentially to be digitized. The analog signals are passed through a Mezzanine Board,
and on to the waveform digitizer board, which is called the DIRAC. These boards reside a
short distance away from the Front End Boards in crates located on the outer ring of the
Calorimeter Disks, as shown in Fig. 1.2.1. Each DIRAC digitizes signals from 20 Front End
Boards, and sends the digitized data off-detector to the back-end data acquisition system
over optical data cables.
A block diagram of a Front End Board is shown in Fig. 1.2.4. The board has two
main functions. One is to process the charge signals from the SiPMs as described
previously. The second is to control and monitor the bias voltage that is needed by the
SiPMs. To achieve this, the board contains an analog-to-digital converter (ADC) for
digitizing the bias voltage, and a digital-to-analog converter (DAC) for producing an
analog voltage that controls the bias voltage value. The overall control of the bias circuit
is implemented using an ARM microprocessor that resides on the Mezzanine Board. The
microprocessor distributes the bias voltage reference values to the DACs on the Front End
Boards, and then reads back the digitized values from the ADCs, adjusting the DAC values
as needed to achieve the desired bias voltage. The bias control and monitor data are also
read out through the DIRAC and then passed to the back-end Detector Control System
(DCS) over the optical data cable. Groups of 20 Front End Boards are controlled by each
Mezzanine Board. The Front End Board also contains a regulator section for regulating
the voltages needed by the board, as shown in the figure. Design notes and performance
reports for the Front End Board can be found in [4-12].
Pictures of the Front End Board are shown in Fig. 1.2.5. The top side contains
circuitry for the amplifier section, while the back side contains circuitry for the bias control
and voltage regulators. There is a total of 86 different parts, some having multiple instances
that yield 176 parts total. All of the parts have surface mount packaging (SMT), including
resistors and capacitors with several different package sizes, diodes, discrete transistors,
integrated circuits (ICs), and connectors.
In the reliability analysis described in this note, the parts are categorized according
to the types defined in the handbook. The handbook then defines acceleration factors for
each part type and prescribes how to calculate them based upon operating conditions,
packaging, etc. The details of these calculations are presented in Section 2 for each part
type. Note that this analysis only pertains to the Front End Boards. The reliability of the
DIRAC and Mezzanine Boards will be covered separately.
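The general shape of this calculation can be sketched in Python: each part's base failure rate is multiplied by its factors, and the per-part rates are summed over all instances on the board. The part entries, base rates, and factor values below are hypothetical placeholders, not values from the handbook or from this analysis.

```python
from math import prod


def part_failure_rate(base_rate: float, pi_factors: dict) -> float:
    """Base failure rate multiplied by all applicable factors for one part."""
    return base_rate * prod(pi_factors.values())


def board_failure_rate(parts: list) -> float:
    """Sum of part failure rates over all instances on the board."""
    return sum(quantity * part_failure_rate(base_rate, factors)
               for quantity, base_rate, factors in parts)


# Hypothetical entries: (quantity, base rate per 10^6 hours, factors).
parts = [
    (10, 0.002, {"pi_T": 1.2, "pi_Q": 3.0, "pi_E": 1.0}),  # e.g. a resistor type
    (4, 0.005, {"pi_T": 1.5, "pi_Q": 3.0, "pi_E": 1.0}),   # e.g. a capacitor type
]

board_rate = board_failure_rate(parts)
board_mttf_hours = 1.0e6 / board_rate
print(f"Board failure rate: {board_rate:.3f} per 10^6 hours")
print(f"Board MTTF: {board_mttf_hours:.3e} hours")
```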
1.3. Limits of Scope
This analysis does not include consideration of the quality or reliability of the
printed circuit board itself, neither the fabrication nor the assembly. These aspects are
more difficult to assess and require intimate knowledge of the practices and materials used
by individual vendors, which may be a function of time. Fortunately, experience in
designing and supporting large High Energy Physics (HEP) detector instrumentation
systems [13-21] has shown that the dominant failure mode of electronics tends to be
component failures, provided that care is taken to select printed circuit board fabrication
and assembly vendors who have been qualified, for example through ISO 900x certification.
This analysis does not include consideration of radiation damage. Indeed, for HEP
applications, this can be a significant aspect in reliability analysis, and this is true for the
Mu2e experiment. Specifications for radiation tolerance have been developed for the
different subsystems in the experiment [22], and radiation tolerance measurement
campaigns are either in progress or have been completed. However, at the time of this
report, results for the Calorimeter Front End Board were not available in a form that lends
itself to the framework of this analysis, i.e. the multiplicative acceleration factors
that modify base failure rates for individual parts. Radiation damage aspects that affect the
reliability of these electronics will therefore not be addressed here, and instead will be
reported separately.
In lieu of formal consideration of radiation damage, the handbook does provide the
means for considering aspects that affect reliability related to the nature of the environment
that the equipment operates in. Not surprisingly, since the handbook was developed by the
Dept. of Defense, the environments defined in the handbook tend to be related to military
applications, such as ground-based military, naval, aerospace, missile launch, etc. One of
these environments, AUC, Airborne, Uninhabited Cargo, has similarities to a HEP
experiment, where human access is limited, and with somewhat harsh environmental
conditions. The analysis performed for the Calorimeter Front End Board includes
consideration of the impact on the MTTF if this environmental factor is applied. Even for
the military applications, these environmental factors are estimates or approximations,
providing a means to evaluate the trend in decreased reliability that various harsh
conditions can cause, albeit with factors that likely have large uncertainties. To the extent
that the AUC environment has similarities to an HEP experiment, this analysis serves to
illustrate how the reliability of this board can be degraded by extreme environmental
conditions, subject to the same caveat concerning uncertainties. This effect will be described in
the discussion of the results in Section 3.
This analysis also does not include any part-specific reliability information from
manufacturers or vendors. In general, it has proved to be very difficult to obtain reliability
information for non-mil-spec, commercial off-the-shelf (COTS) parts. If manufacturers
have this information at all, it is often not published in data sheets. Sometimes it can be
obtained through private inquiry, but this is rare, and it is often difficult to find the right
contact person. From a manufacturer perspective, performing reliability measurements and
adhering to advertised limits on a production line certainly adds cost, in a competitive
environment where cost is often weighted more highly than reliability. Indeed, COTS
parts are generally not manufactured with the same quality regimen as high-reliability
parts, so publishing reliability information can have a negative marketing effect. Lastly,
potential liability concerns also disfavor publishing reliability data. The
MIL-HDBK-217F handbook provides a means for electronics design teams to evaluate
system reliability as a function of different parts choices, quality level choices, testing
levels, etc., when information from the component vendors is not available. This has
inherent limitations, which are discussed further in Section 3.
2. Methodology
2.1. Overview of the Analysis Process
where λ is a constant, called the hazard rate or the average failure rate. The units of λ are
in “number of failures per unit time.” In the electronics industry, this is often expressed
as the number of failures in 1E9 hours of operation, and is called “Failures in Time,” or
FITs.
The quantity R(t), called the Reliability function or the Survival function, is defined
as the number of units that survive at time t, again normalized as a fraction of the total
number of a given component type. R(t) is related to F(t) as:
$R(t) = 1 - F(t), \quad 0 \le t \le \infty$ (2.1.2)
For the case where the probability of failure is exponential in nature, R(t) has the form:
$R(t) = 1 - F(t) = e^{-\lambda t}, \quad 0 \le t \le \infty$ (2.1.3)
For a printed circuit board containing M components, each with hazard rates λ1, λ2, …λM
respectively, the hazard rates are added together to give an overall hazard rate for the
board:
$\lambda_{BD} = \sum_{i=1}^{M} \lambda_i$ (2.1.4)
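As an illustration of equations (2.1.3) and (2.1.4), the following Python sketch sums per-part hazard rates into a board hazard rate and evaluates the reliability function. The function names and the example part values are hypothetical and are not taken from the parts list of this board. Hazard rates are assumed to be expressed in FITs.

```python
import math


def board_hazard_rate_fits(part_fits):
    """Sum per-part hazard rates (FITs) into a board hazard rate, as in eq. (2.1.4)."""
    return sum(part_fits)


def reliability(lambda_fits, hours):
    """R(t) = exp(-lambda * t), with lambda in FITs (failures per 1E9 hours)."""
    return math.exp(-lambda_fits * 1e-9 * hours)


# Hypothetical part hazard rates in FITs, for illustration only.
example_parts = [2.0, 3.8, 12.0]
lambda_bd = board_hazard_rate_fits(example_parts)

# Fraction of such boards expected to survive 8760 powered hours.
r_one_year = reliability(lambda_bd, 8760.0)
```

Because the exponential model has a constant hazard rate, adding the part rates is equivalent to multiplying the individual survival probabilities.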
Once the hazard rate for a board is known, the probability of having a failure in
the system at a time ti, can be calculated as:
For any given component on a board, there may be several factors that contribute
to the hazard rate. Examples include temperature, mechanical or electrical stress, overall
quality of the part, the environment, etc. One could model this as individual hazards or
failure mechanisms. The approach used in MIL-HDBK-217F is to define a base hazard
rate for each type of part, λb, and then define multiplicative factors that are functions of a
particular failure mechanism. These factors are called weighting factors or acceleration
factors, or sometimes “pi factors.” The resulting hazard rate for a part then has the form,
The base rate can be thought of as the failure rate under baseline conditions when all
acceleration factors equal 1.
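The multiplicative structure described above can be sketched in a few lines of Python. The helper name and the example factor values are assumptions for illustration only. The actual factors for each part type are those prescribed in the handbook.

```python
from math import prod


def part_hazard_rate(base_rate_fits, pi_factors):
    """Multiply a base hazard rate (FITs) by all of its acceleration (pi) factors."""
    return base_rate_fits * prod(pi_factors.values())


# Hypothetical example: base rate of 2.0 FITs with three acceleration factors.
example_factors = {"pi_T": 1.0, "pi_Q": 3.0, "pi_E": 1.0}
example_rate = part_hazard_rate(2.0, example_factors)
```

With every factor equal to 1 the function returns the base rate unchanged, which is the baseline condition described in the text.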
In the subsections that follow, a description of the failure mechanisms and hazard
rate factors is presented addressing specifically the components on this board, which
again is based upon the approach used in MIL-HDBK-217F.
2.2. Resistors
From MIL-HDBK-217F, Section 9.1, the hazard rate for resistors is specified as
the following:
λp = λb * πT * πP * πS * πQ * πE (2.2.1)
$\pi_T = \exp\left[ \frac{-E_a}{8.617 \times 10^{-5}} \left( \frac{1}{T_a + 273} - \frac{1}{T_{REF} + 273} \right) \right]$ (2.2.3)
where Ea is the activation energy, Ta is the ambient temperature, and TREF is a reference
temperature. For the type RM resistor, the temperature parameters correspond to column
2 in the handbook. The activation energy is specified as 0.08 eV. The reference
temperature is usually taken to be room temperature (25 °C). A plot of the temperature
factor as a function of operating temperature for a chip resistor is shown in Fig. 2.2.1. For
the type RTH thermistor, temperature is not a factor in the aging, and the temperature factor
is specified to be 1.
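The temperature factor has the Arrhenius form, which can be sketched as follows. This is a minimal example that assumes the activation energy is expressed in eV (consistent with the constant 8.617E-5 eV/K) and that temperatures are given in Celsius. The function name is hypothetical.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant in eV/K


def arrhenius_pi_t(ea_ev, t_celsius, t_ref_celsius=25.0):
    """Arrhenius temperature factor relative to a reference temperature."""
    inv_t = 1.0 / (t_celsius + 273.0)
    inv_t_ref = 1.0 / (t_ref_celsius + 273.0)
    return math.exp((-ea_ev / BOLTZMANN_EV_PER_K) * (inv_t - inv_t_ref))


# Chip resistor (activation energy 0.08 eV) at the reference temperature,
# at a cooler ambient, and at a warmer ambient.
pi_t_ref = arrhenius_pi_t(0.08, 25.0)
pi_t_cool = arrhenius_pi_t(0.08, 12.0)
pi_t_warm = arrhenius_pi_t(0.08, 85.0)
```

The factor equals 1 at the reference temperature, falls below 1 for cooler operation, and rises above 1 for warmer operation.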
$\pi_P = (\text{Power Dissipation})^{0.39}$ (2.2.4)
Fig. 2.2.2. Power Factor vs. Power Dissipation for Chip Resistors
equation (2.2.1). For the type RM resistor, column 1 of the stress table for resistors is used.
The stress is a function of the actual power compared to the rating. The relationship is
modeled by:
$\pi_S = 0.71 \, e^{1.1 S}$, where $S = \frac{\text{Actual Power}}{\text{Power Rating}}$ (2.2.5)
A plot of the stress factor is shown in Fig. 2.2.3. For the type RTH thermistor, stress is
not a factor in the aging, and the stress factor is specified to be 1.
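The power and stress factors for a chip resistor can be evaluated with the short sketch below. It assumes the forms πP = (power dissipation)^0.39 and πS = 0.71·exp(1.1·S), with power in watts. The function names and the example operating point are hypothetical.

```python
import math


def resistor_power_factor(power_dissipation_w):
    """Power factor: pi_P = (power dissipation in watts) ** 0.39."""
    return power_dissipation_w ** 0.39


def resistor_stress_factor(actual_power_w, rated_power_w):
    """Stress factor: pi_S = 0.71 * exp(1.1 * S), with S = actual / rated power."""
    stress_ratio = actual_power_w / rated_power_w
    return 0.71 * math.exp(1.1 * stress_ratio)


# Hypothetical operating point: 50 mW dissipated in a resistor rated for 100 mW.
pi_p = resistor_power_factor(0.05)
pi_s = resistor_stress_factor(0.05, 0.1)
```

Both factors increase monotonically, so derating a resistor (a lower stress ratio) lowers its predicted hazard rate.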
Fig. 2.2.3. Stress Factor vs. Stress Ratio for Chip Resistors
The quality factor, πQ, is an attribute of the quality level in manufacturing from
the vendor. The table from MIL-HDBK-217F for resistors is shown in Table 2.2.1.
Unless specifically called out in the Bill of Materials, it will be assumed that the quality
level is “Non-Established Reliability.”
Designation πQ
S 0.03
R 0.1
P 0.3
Q 1.0
Non-Established Reliability 3.0
Commercial or Unknown 10.0
The environmental factor, πE, is an attribute of the environment. The table from
MIL-HDBK-217F for resistors is shown in Table 2.2.2. As discussed earlier, this analysis
will use the “AUC, Airborne, Uninhabited Cargo” environment to consider harsh conditions in
HEP experiments that may accelerate failure. This will be discussed in Section 3.
Designation Meaning πE
2.3. Capacitors
From MIL-HDBK-217F, Section 10.1, the hazard rate for capacitors is given as the
following:
λp = λb * πT * πC * πV * πSR * πQ * πE (2.3.1)
There are several different types of capacitors listed in the handbook, most notably
differing by the type of dielectric, including: paper, metalized plastic, mica, ceramic, glass,
electrolytic, tantalum, etc. There are many different packaging options listed as well. Each
type tends to have the acceleration factors shown above, although the values may differ
from type to type. For this board, the capacitors are all surface mount with ceramic
dielectric, type CDR, Capacitor, Chip, Multiple Layer, Fixed, Ceramic Dielectric,
Established Reliability. The base rate for the CDR capacitor is specified as 2.0 failures
per 1E9 hours of operation.
$\pi_T = \exp\left[ \frac{-E_a}{8.617 \times 10^{-5}} \left( \frac{1}{T_a + 273} - \frac{1}{T_{REF} + 273} \right) \right]$ (2.3.3)
where Ea is the activation energy, Ta is the ambient temperature, and TREF is a reference
temperature. For the type CDR capacitor, the temperature parameters correspond to
column 2 in the handbook. The activation energy is specified as 0.35 eV. The reference
temperature is usually taken to be room temperature (25 °C). A plot of the temperature
factor as a function of operating temperature for a chip capacitor is shown in Fig. 2.3.1.
The acceleration factor for capacitance, πC, modifies the base rate as shown in
equation 2.3.1. Generally, the larger the capacitance, the higher the probability of failure.
For the type CDR capacitor, column 1 of the capacitance factor table is used. The data is
modeled by the equation:
$\pi_C = C^{0.09}$ (2.3.4)
A plot of the capacitance factor as a function of capacitance value for a chip capacitor is
shown in Fig. 2.3.2.
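A short sketch of the capacitance factor follows. It assumes the form πC = C^0.09 and, as a further assumption not stated in this note, that the capacitance is expressed in microfarads. The function name and example values are hypothetical.

```python
def capacitance_factor(capacitance_uf):
    """Capacitance factor: pi_C = C ** 0.09 (C assumed to be in microfarads)."""
    return capacitance_uf ** 0.09


# Hypothetical values: a 100 nF bypass capacitor and a 10 uF bulk capacitor.
pi_c_small = capacitance_factor(0.1)
pi_c_large = capacitance_factor(10.0)
```

The small exponent means the factor varies slowly: two decades of capacitance change it by well under a factor of two.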
The operating voltage applied to a capacitor also creates stress, which can lead to
accelerated failures. The acceleration factor for voltage stress, πV, also modifies the base
rate as shown in equation (2.3.1). For the type CDR capacitor, column 3 of the stress table
for capacitors is used. The stress is a function of the applied voltage compared to the
voltage rating. The relationship is modeled by:
A plot of the voltage factor as a function of applied voltage for a chip capacitor is shown
in Fig. 2.3.3.
Fig. 2.3.3. Voltage Stress Factor vs. Stress Ratio for Chip Capacitors
The quality factor, πQ, is an attribute of the quality level in manufacturing from
the vendor. The table from MIL-HDBK-217F for capacitors is shown in Table 2.3.1.
Unless specifically called out in the Bill of Materials, it will be assumed that the quality
level is “Non-Established Reliability.”
Designation πQ
D 0.001
C 0.01
S,B 0.03
R 0.1
P 0.3
M 1.0
L 1.5
Non-Established Reliability 3.0
Commercial or Unknown 10.0
The environmental factor, πE, is an attribute of the environment. The table from
MIL-HDBK-217F for capacitors is shown in Table 2.3.2. Consideration of the reliability
of this board under harsh conditions as characterized by the military designation AUC will
be discussed in Section 3.
Designation Meaning πE
From MIL-HDBK-217F, Section 6.1, the hazard rate for low frequency diodes is
specified as the following:
λp = λb * πT * πS * πC * πQ * πE (2.4.1)
There are several different types of low frequency diodes defined in the handbook.
The types that are included in this category are: general purpose analog, switching, fast
recovery, power rectifier, transient suppressor, current regulator, voltage regulator, and
voltage reference. Each type tends to have the failure factors shown above, although the
values may differ from type to type. In this design, there are two types of diodes used:
general purpose (GP), and voltage regulator (VR). For the general purpose diode, the
base failure rate is specified as 3.8 failures per 1E9 hours. For the voltage regulator diode,
the base failure rate is 2.0 failures per 1E9 hours.
$T_j = T_a + (\theta \cdot P)$ (2.4.3)
Once the junction temperature is known, the temperature factor, πT, can be
found. For the general purpose diode, the first set of the temperature tables is used.
The value of πT is modeled by the equation:
$\pi_T = \exp\left[ -3091 \left( \frac{1}{T_j + 273} - \frac{1}{T_a + 273} \right) \right]$ (2.4.5)
Fig. 2.4.1. Temperature Factor vs. Temperature for General Purpose Diodes
(Referenced to 25 Deg. C Ambient)
For the voltage regulator diode, the second set of the temperature tables is
used. The value of πT is modeled by the equation:
$\pi_T = \exp\left[ -1925 \left( \frac{1}{T_j + 273} - \frac{1}{T_a + 273} \right) \right]$ (2.4.6)
Fig. 2.4.2. Temperature Factor vs. Temperature for Voltage Regulator Diodes
(Referenced to 25 Deg. C Ambient)
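The junction temperature and the diode temperature factors can be sketched as below. The example assumes the coefficients 3091 (general purpose) and 1925 (voltage regulator), a 25 °C reference, and a junction temperature given by the ambient temperature plus thermal resistance times dissipated power. Function names, the thermal resistance, and the power value are hypothetical.

```python
import math


def junction_temperature(t_ambient_c, theta_c_per_w, power_w):
    """Junction temperature: ambient plus thermal resistance times dissipated power."""
    return t_ambient_c + theta_c_per_w * power_w


def junction_pi_t(coefficient_k, t_junction_c, t_ref_c=25.0):
    """Temperature factor exp(-coefficient * (1/(Tj+273) - 1/(Tref+273)))."""
    return math.exp(-coefficient_k * (1.0 / (t_junction_c + 273.0) - 1.0 / (t_ref_c + 273.0)))


# Hypothetical diode: 12 C ambient, 200 C/W thermal resistance, 50 mW dissipated.
t_j = junction_temperature(12.0, 200.0, 0.05)
pi_t_general_purpose = junction_pi_t(3091.0, t_j)
pi_t_voltage_regulator = junction_pi_t(1925.0, t_j)
```

The same helper applies to any part whose temperature factor has this form, with only the coefficient changing.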
Voltage stress can occur in diodes under reverse bias conditions. The
acceleration factor for stress, πS, also modifies the base rate as shown in equation
(2.4.1). For the voltage regulator diode, the stress factor is specified to be 1.0, since
voltage stress is not a factor for these devices. For all other low frequency diodes, the
stress is modeled by:
$\pi_S = 0.54 \, e^{2.43 V_S}$, for $0.3 < V_S \le 1$, (2.4.8)
where $V_S = \frac{\text{Actual Reverse Voltage}}{\text{Reverse Voltage Rating}}$.
Fig. 2.4.3. Stress Factor vs. Stress Ratio for Low Frequency Diodes under Reverse Bias
Designation πQ
JANTXV 0.7
JANTX 1.0
JAN 2.4
Lower 5.5
Plastic 8.0
Table 2.4.1. Quality Factors for Low Frequency Diodes as a Function of Quality Levels
Designation Meaning πE
Table 2.4.2. Environmental Factors for Low Frequency Diodes as a Function of Environment
There are several different types of Field Effect Transistors (FETs), including N-channel,
P-channel, enhancement mode, depletion mode, power, JFETs, GaAsFETs, etc.
They differ in construction depending on the application, such as small-signal, switching,
or power. FETs can come as discrete devices, or as part of a larger integrated circuit. Also, FETs
are fabricated in many different technologies and feature sizes. MIL-HDBK-217F
divides FETs into three main categories: low-frequency silicon MOSFETs (less
than or equal to 400 MHz), high-frequency silicon MOSFETs, and GaAsFETs. JFETs are
included in the low-frequency silicon MOSFETs. Integrated circuits are considered
separately. In this design, only low-frequency silicon MOSFETs are used, designated as
MOS, LF.
From MIL-HDBK-217F, Section 6.4, the hazard rate for low-frequency, silicon
MOSFETs is given as the following:
λp = λb * πT * πA * πQ * πE (2.5.1)
The base hazard rate, λb, is the hazard rate of a part under normal operation. For
the low-frequency MOSFETs, the base rate is specified as 12.0 failures per 1E9 hours of
operation.
$T_j = T_a + (\theta \cdot P)$ (2.5.3)
Once the junction temperature is known, the temperature factor, πT, can be found.
For the MOSFET, the value of πT is modeled by the equation:
$\pi_T = \exp\left[ -1925 \left( \frac{1}{T_j + 273} - \frac{1}{T_a + 273} \right) \right]$ (2.5.5)
where the temperatures are in Celsius. A plot of the temperature factor as a function of
the junction temperature for the low-frequency MOSFET is shown in Fig. 2.5.1.
Fig. 2.5.1. Temperature Factor vs. Temperature for the Low-frequency MOSFET
(Referenced to 25 Deg. C Ambient)
The application factor, πA, accounts for stress on the device that is
application-dependent. The handbook divides applications into three categories: linear,
switching, and power. The application factor table from MIL-HDBK-217F for
low-frequency silicon MOSFETs is shown in Table 2.5.1. For this design, the
application is linear, so the value for πA will be taken to be 1.5.
Application πA
Linear 1.5
Small Signal Switching 0.7
Power, 2W≤ P < 5W 2.0
Power, 5W≤ P < 50W 4.0
Power, 50W≤ P < 250W 8.0
Power, P ≥ 250W 10
Designation πQ
JANTXV 0.5
JANTX 1.0
JAN 2.0
Lower 5.0
Designation Meaning πE
There are three main types of junction transistors: bipolar junction transistors
(BJT), unijunction, and heterojunction. For bipolar junction transistors, there are two main
topologies: NPN and PNP. The types differ in fabrication and construction details
depending on the application, including small signal, high frequency, or power. Different
fabrication technologies can be used, including silicon, germanium, and gallium arsenide,
and can have different feature sizes. MIL-HDBK-217F divides junction
transistors into four main categories: low-frequency silicon bipolar (less than or equal to
200 MHz); low-noise, high-frequency silicon bipolar; high-frequency power bipolar; and
unijunction. In this design, only low-frequency silicon bipolar transistors are used,
designated as BJT, LF.
From MIL-HDBK-217F, Section 6.3, the hazard rate for low-frequency (< 200
MHz) silicon bipolar transistors is specified as the following:
λp = λb * πT * πA * πR * πS * πQ * πE (2.6.1)
The base hazard rate, λb, is the hazard rate of a part under normal operation. For
the low-frequency bipolar transistors, the base rate is specified as 0.74 failures per 1E9
hours of operation. It is the same for NPN and PNP devices.
$T_j = T_a + (\theta \cdot P)$ (2.6.3)
Once the junction temperature is known, the temperature factor, πT, can be found.
For the BJT, the value of πT is modeled by the equation:
$\pi_T = \exp\left[ -2114 \left( \frac{1}{T_j + 273} - \frac{1}{T_a + 273} \right) \right]$ (2.6.5)
where the temperatures are in Celsius. A plot of the temperature factor as a function of
the junction temperature for the low-frequency bipolar transistor is shown in Fig. 2.6.1.
Fig. 2.6.1. Temperature Factor vs. Junction Temperature for the Low-frequency Bipolar Transistor
(Referenced to 25 Deg. C Ambient)
The application factor, πA, accounts for stress on the device that is
application-dependent. The handbook divides applications into two categories: linear and
switching. The application factor table from MIL-HDBK-217F for low-
frequency bipolar transistors is shown in Table 2.6.1. For this design, the
application is linear, so the value for πA will be taken to be 1.5.
Application πA
Linear 1.5
Switching 0.7
$\pi_R = (P_r)^{0.37}$, for $P_r > 0.1$ Watt (2.6.7)
A plot of the power rating factor as a function of the power rating is shown in Fig.
2.6.2.
[Fig. 2.6.2. Power rating factor vs. Log(Power Rating)]
The applied voltage to a transistor also creates stress, which can lead to early
failure. The acceleration factor for voltage stress, πS, also modifies the base rate as shown
in equation (2.6.1). The stress is a function of the applied voltage compared to the voltage
rating. For the low-frequency BJT, the relationship is modeled by:
$\pi_S = 0.045 \, e^{3.1 V_S}$, where $V_S = \frac{V_{CE}}{V_{CEO}}$ (2.6.8)
A plot of the stress factor as a function of the ratio VS for a LF BJT is shown in Fig. 2.6.3.
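The voltage stress factor for a low-frequency BJT can be evaluated as in the sketch below, which assumes the form πS = 0.045·exp(3.1·VS) with VS taken as the applied collector-emitter voltage divided by the rated VCEO. The function name and the example voltages are hypothetical.

```python
import math


def bjt_voltage_stress_factor(v_ce_applied, v_ceo_rated):
    """Voltage stress factor: pi_S = 0.045 * exp(3.1 * V_S), V_S = applied / rated."""
    v_s = v_ce_applied / v_ceo_rated
    return 0.045 * math.exp(3.1 * v_s)


# Hypothetical transistor rated for 40 V, operated at 5 V and at 30 V.
pi_s_light = bjt_voltage_stress_factor(5.0, 40.0)
pi_s_heavy = bjt_voltage_stress_factor(30.0, 40.0)
```

Because the dependence is exponential in the stress ratio, operating well below the rated voltage reduces this factor substantially.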
Fig. 2.6.3. Voltage Stress Factor vs. Stress Ratio for LF BJTs
Designation πQ
JANTXV 0.7
JANTX 1.0
JAN 2.4
Lower 5.5
Plastic 8.0
Designation Meaning πE
Factors C2, πT, πE, πQ, and πL are defined in sections 5.8 – 5.11. Note that there is no
base hazard rate defined for this category.
The temperature factor includes the die complexity factor, C1, which
accounts for increasing failure rate with increasing complexity. Generally,
complexity is defined as being a function of the number of transistors in a device.
This is shown in Table 2.7.1. The handbook specifies the same table for both
bipolar and CMOS ICs, and makes no distinction on feature size. In general, unless
the IC design is custom, it is very difficult to ascertain how many devices are in an
IC design. For the purposes of this analysis, the guidance shown in the right-hand
column of Table 2.7.1, developed by the author, will be used.
The temperature factor, πT, modifies the base rate as shown in equation
(2.7.1). It is a function of the junction temperature, Tj, which is given by:
$T_j = T_a + (\theta \cdot P)$ (2.7.3)
Once the junction temperature is known, the temperature factor can be found.
It is modeled by the equation:
$\pi_T = 0.1 \exp\left[ \frac{-E_a}{8.617 \times 10^{-5}} \left( \frac{1}{T_j + 273} - \frac{1}{T_a + 273} \right) \right]$ (2.7.5)
where the temperatures are in Celsius. Ea is the activation energy, which for
linear ICs is taken to be 0.65 eV. A plot of the temperature factor as a function of the
junction temperature for a linear IC is shown in Fig. 2.7.1.
Fig. 2.7.1. Temperature Factor vs. Junction Temperature for Linear ICs
(Referenced to 25 Deg. C Ambient)
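For linear ICs the temperature factor carries an additional prefactor of 0.1, as sketched below. The example assumes an activation energy of 0.65 eV, a 25 °C reference, and the Boltzmann constant 8.617E-5 eV/K. The function name is hypothetical.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant in eV/K


def linear_ic_pi_t(t_junction_c, ea_ev=0.65, t_ref_c=25.0):
    """IC temperature factor: 0.1 * exp((-Ea/k) * (1/(Tj+273) - 1/(Tref+273)))."""
    delta = 1.0 / (t_junction_c + 273.0) - 1.0 / (t_ref_c + 273.0)
    return 0.1 * math.exp((-ea_ev / BOLTZMANN_EV_PER_K) * delta)


# Factor at the reference junction temperature and at a warmer junction.
pi_t_at_ref = linear_ic_pi_t(25.0)
pi_t_at_60c = linear_ic_pi_t(60.0)
```

With the larger activation energy, the IC factor is considerably more sensitive to junction temperature than the resistor or capacitor factors.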
The environmental factor includes the package failure rate factor, C2, which
is a function of the number of pins in a package, and also the package construction.
This is described in section 5.9 of the handbook. Package types considered include
hermetic DIPs and SMTs, cans, and DIPs with glass seals, and non-hermetic DIPs
and SMTs. Generally, the factors grow larger in the order in which the package types are listed. For
this analysis, all of the linear ICs are non-hermetic SMT. The relationship of the
factor C2 as a function of the number of pins in the package for the non-hermetic
SMT package is modeled by the equation:
$C_2 = 3.6 \times 10^{-4} \, (N_p)^{1.08}$ (2.7.6)
where Np is the number of pins. A plot of this relationship is shown in Fig. 2.7.2.
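The package factor can be computed directly from the pin count. The sketch below assumes the non-hermetic SMT form C2 = 3.6E-4 · Np^1.08. The function name and the pin counts are hypothetical examples.

```python
def package_factor_c2(num_pins):
    """Package failure rate factor for non-hermetic SMT: C2 = 3.6e-4 * Np ** 1.08."""
    return 3.6e-4 * num_pins ** 1.08


# Hypothetical packages with 8 and 64 pins.
c2_8_pin = package_factor_c2(8)
c2_64_pin = package_factor_c2(64)
```

The exponent is slightly greater than 1, so the factor grows a little faster than linearly with the number of pins.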
The environmental factor, πE, for linear ICs is defined in Section 5.10 of the
handbook. The table is shown in Table 2.7.2. Consideration of the reliability of
this board under harsh conditions as characterized by the military designation AUC
will be discussed in the analysis section.
Designation Meaning πE
$\pi_Q = 10$ (2.7.7)
The handbook defines a learning factor, πL, which takes into account the
number of years that a part has been in production. The general idea is that for
complex integrated circuits, there may be bugs in the design that are evident early
in the production, but become identified and are addressed as time goes on so that
later production cycles have a lower probability of having problems. After a couple
of years, the handbook projects that all potential bugs have been addressed. The
model as specified in Section 5.10 is given as:
Fig. 2.7.3. Learning Factor vs. Number of Years in production for Linear ICs
2.8. Connectors
From MIL-HDBK-217F, Section 15.1, the hazard rate for a mated pair of
connectors is specified as the following:
λp = λb * πT * πK * πQ * πE (2.8.1)
There are many different types of connectors identified in the handbook, including:
circular/cylindrical, card edge (PCB), hexagonal, rack and panel, rectangular, RF coaxial,
telephone, power, and triaxial. Each type tends to have the acceleration factors shown
above, although the values may differ from type to type. For this board, there are two types of
connectors used: a Rectangular Connector, RC, and mating pins for the SiPM
connection. There is no type defined in the handbook for the mating pins, so it will be
assumed that they are similar in reliability to Power Connectors, PC, since they are robust
and large gauge.
The base hazard rate, λb, is the hazard rate of a part under nominal operation. For
the type RC connector, the base rate is specified as 46 failures per 1E9 hours of operation.
For the type PC connector, the base rate is specified as 7 failures per 1E9 hours of
operation.
$\pi_T = \exp\left[ \frac{-0.14}{8.617 \times 10^{-5}} \left( \frac{1}{T_0 + 273} - \frac{1}{T_{REF} + 273} \right) \right]$ (2.8.3)
where T0 is the contact temperature, and TREF is a reference temperature. The contact
temperature has a provision to include self-heating due to current flow, having a general
form:
$T_0 = T_a + \Delta T = T_a + [K \cdot i^{1.85}]$ (2.8.4)
relationship is the same for all connector types. Since the current flowing through the
connectors is significantly less than 1 amp, the difference between the contact temperature
and the ambient temperature is on the order of a few degrees, and will be ignored since it is
small. A plot of the temperature factor as a function of operating temperature for
connectors is shown in Fig. 2.8.1.
Mating and un-mating a connector pair creates stress on the connector contacts, as
well as in the connections of the pins and sockets of the connector to the wires or cables.
The acceleration factor for mating/un-mating, πK, modifies the base rate as shown in
equation (2.8.1), and is a function of the frequency of mating/un-mating cycles, as shown in Table
2.8.1. A cycle includes both connect and disconnect. The values are the same for all
connector types. In normal operation, the plugging and unplugging of this board will be
done rarely, so the low-frequency value will be assumed.
Mating/Un-mating Cycles (per 1000 hours)    πK
0 to 0.05                                   1.0
0.05 to 0.5                                 1.5
0.5 to 5                                    2.0
5 to 50                                     3.0
> 50                                        4.0
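The mating/un-mating table can be expressed as a simple lookup, sketched below. The table does not say which band a boundary value such as 0.05 belongs to, so this example assumes that each upper bound is inclusive. The function name is hypothetical.

```python
def mating_factor(cycles_per_1000_hours):
    """Look up pi_K from the mating/un-mating cycle rate (upper bounds assumed inclusive)."""
    if cycles_per_1000_hours < 0:
        raise ValueError("cycle rate must be non-negative")
    bands = [(0.05, 1.0), (0.5, 1.5), (5.0, 2.0), (50.0, 3.0)]
    for upper_bound, pi_k in bands:
        if cycles_per_1000_hours <= upper_bound:
            return pi_k
    return 4.0  # more than 50 cycles per 1000 hours


# A board that is rarely unplugged falls in the lowest band.
pi_k_rare = mating_factor(0.01)
```

For a board serviced roughly once per year, the cycle rate is far below one per 1000 hours, so the choice of boundary convention has no practical effect here.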
The quality factor, πQ, is an attribute of the quality level in manufacturing from
the vendor. The table from MIL-HDBK-217F for connectors is shown in Table 2.8.2.
Unless specifically called out in the Bill of Materials, it will be assumed that the quality
level is “Lower.”
Designation πQ
Mil-Spec 1
Lower 2
The environmental factor, πE, is an attribute of the environment. The table from
MIL-HDBK-217F for connectors is shown in Table 2.8.3. As discussed earlier, this
analysis will use the “AUC, Airborne, Uninhabited Cargo” for consideration of harsh
conditions in HEP experiments that may accelerate failure. This will be discussed in
Section 3.
Designation Meaning πE
3. Analysis Results
The parts used on the board were identified and categorized according to the
methodology in MIL-HDBK-217F, as described in Section 2. The base hazard rates for
each part were defined. The various acceleration factors for each part were calculated,
based upon the expected operating conditions. For this stage of analysis, the environmental
factors πE for all components were set to 1.
The result of this analysis is shown in Table 3.1.1. The hazard rates for all
components were summed to give the overall hazard rate for the board, as described in
equation (2.1.4). This is shown at the top of Table 3.1.1.
The overall hazard rate for the board, λBD, is found to be:
Table 3.1.1a. Calorimeter Front End Board parts List with Calculated FITS Values, Nominal Environment
(Partial, 1 of 3)
Table 3.1.1b. Calorimeter Front End Board parts List with Calculated FITS Values, Nominal Environment
(Partial, 2 of 3)
Table 3.1.1c. Calorimeter Front End Board parts List with Calculated FITS Values, Nominal Environment
(Partial, 3 of 3)
$F(t_i) = 1 - e^{-\lambda_{BD} \times 10^{-9} \times t_i}$ (3.1.3)
For a system consisting of N boards, with no repairs being conducted, the expected
cumulative number of failures at time ti is given by:
$N_F(t_i) = N \cdot F(t_i) = N \left[ 1 - e^{-\lambda_{BD} \times 10^{-9} \times t_i} \right]$ (3.1.4)
Assuming that the Mu2e Calorimeter readout system has 2,696 Front End Boards in the
full readout system, and assuming 80% up-time each year (with the power to the electronics
turned off during the down-time), with no repairs performed, the associated probabilities
and expected number of failures in the system are summarized in Table 3.1.2.
Table 3.1.2. Predicted Probability of Failure and Numbers of Failures as a Function of Time, Nominal Environment
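The calculation behind this table can be sketched as follows, using the 2,696 boards and 80% up-time stated above together with the nominal MTTF of 1.92E6 hours quoted in Section 3.3. The function names and the use of 8760 hours per calendar year are assumptions of this example.

```python
import math

HOURS_PER_YEAR = 8760.0  # assumed calendar hours per year


def failure_probability(mttf_hours, years, uptime_fraction=0.8):
    """F(t) = 1 - exp(-t / MTTF), counting only powered hours."""
    powered_hours = years * HOURS_PER_YEAR * uptime_fraction
    return 1.0 - math.exp(-powered_hours / mttf_hours)


def expected_failures(num_boards, mttf_hours, years, uptime_fraction=0.8):
    """Expected cumulative number of failed boards with no repairs."""
    return num_boards * failure_probability(mttf_hours, years, uptime_fraction)


# Nominal environment, first year of operation.
p_year_1 = failure_probability(1.92e6, 1.0)
n_year_1 = expected_failures(2696, 1.92e6, 1.0)
```

Under these assumptions the first-year failure probability comes out at roughly 0.36%, consistent with the value quoted in Section 3.3.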
The overall hazard rate λBD_E with environmental factors included is given by:
This gives the Mean Time to Failure for the AUC environment, denoted as MTTFE to be:
$MTTF_E = \frac{1E9 \text{ Hours}}{\lambda_{BD\_E}} = 1.09E5$ Hours (3.2.2)
As in Section 3.1, the probability of failure and the expected number of failures at time ti
can be calculated.
Again assuming that the Mu2e Calorimeter readout system has 2,696 Front End
Boards in the full readout system, and assuming 80% up-time per year (with the power to
the electronics turned off during the down-time), with no repairs performed, the associated
probabilities and expected number of failures are summarized in Table 3.2.2.
Table 3.2.2. Predicted Probability of Failure and Numbers of Failures as a Function of Time, Harsh Environment
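The relationship between MTTF and hazard rate, and the size of the degradation in the harsh environment, can be checked with the short sketch below. It uses the MTTF of 1.09E5 hours from equation (3.2.2) and the nominal MTTF of 1.92E6 hours quoted in Section 3.3. The names are hypothetical.

```python
MTTF_NOMINAL_HOURS = 1.92e6  # nominal environment, from Section 3.3
MTTF_HARSH_HOURS = 1.09e5    # AUC environment, from eq. (3.2.2)


def hazard_rate_fits(mttf_hours):
    """Convert an MTTF in hours to a hazard rate in FITs (failures per 1E9 hours)."""
    return 1e9 / mttf_hours


lambda_nominal = hazard_rate_fits(MTTF_NOMINAL_HOURS)
lambda_harsh = hazard_rate_fits(MTTF_HARSH_HOURS)

# Ratio by which the MTTF decreases when the AUC factors are applied.
degradation_factor = MTTF_NOMINAL_HOURS / MTTF_HARSH_HOURS
```

The ratio of the two MTTF values reproduces the factor of about 17.6 discussed in Section 3.3.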
Table 3.2.1a. Calorimeter Front End Board parts List with Calculated FITS Values, Harsh Environment
(Partial, 1 of 3)
Table 3.2.1b. Calorimeter Front End Board parts List with Calculated FITS Values, Harsh Environment
(Partial, 2 of 3)
Table 3.2.1c. Calorimeter Front End Board parts List with Calculated FITS Values, Harsh Environment
(Partial, 3 of 3)
3.3. Discussion
With the environment factors set to 1, the MTTF was found to be 1.92E6 hours.
The resulting probability of failure was found to be 0.36% in the first year. The reliability
requirements for the Calorimeter, as specified in [35], state that the overall failure rate
should be “at the percent level” per year [24]. This includes all electronics in the readout
chain, plus the silicon photo-multipliers that comprise the active detector. Based upon this
analysis, the predicted failure rate is approximately 1/3 of the allocation per year,
which falls within the specification with margin.
One advantage that the Calorimeter Front End Board electronics has is the active
cooling system that will be in place, providing an operating environment estimated to have
a temperature of 12 C. The normal reference temperature for this analysis is 25 C. As
described throughout the document, higher ambient temperatures accelerate the aging of
electronics. Conversely, lowering the ambient temperature decelerates aging.
This can be seen in Table 3.3.1. The cooler environment for this system provides an
improvement of about 27% in the hazard rate, MTTF, and probability of failure. This also
translates to an improvement in failure rate, with approximately 25% fewer boards
predicted to fail in the first year.
The analysis can be used to identify the parts that have the poorest reliability. The
parts having the highest predicted failure rates for the board are shown in Table 3.3.2. The
table is expressed in FIT values (failures per 1E9 hours). Topping the list are the
connectors. These have a high base hazard rate, as prescribed from the handbook, although
the acceleration factors are modest. The next highest category of failures comes from the
FETs, which also have a relatively large base hazard rate, as prescribed by the handbook.
It is worth noting that in the early days of FET fabrication, the reliability was not as good
as it is today, so these values may not be truly representative of modern-day FET
performance. High voltage capacitors C24 and C28 round out the list. C28 is a bypass
capacitor on the 230V coming into the board for biasing the SiPMs. Likewise, C24 is a
bypass capacitor on the output side of the bias regulator, which nominally operates at 200V.
Both capacitors are rated for 250V. The fact that these capacitors operate so close to the
rating is the primary contribution to the high value of the weighted FIT.
Table 3.3.2. Parts on the Board having the Highest Predicted Failure Rates
As described in Section 3.2, the MTTF decreases by a factor of 17.6 when the
environment factors for harsh condition AUC are incorporated in the calculations. This is
significant, and if taken literally, would imply that the reliability of the board would miss
the performance goal.
The consideration of the effect on the MTTF under harsh environmental conditions
was introduced earlier. The handbook defines “AUC, Airborne Uninhabited Cargo” as
“Environmentally uncontrolled areas which cannot be inhabited by an aircrew during
flight. Environmental extremes of pressure, temperature and shock may be severe.
Examples include uninhabited areas of long mission aircraft.” There are some similarities
between this definition and the Mu2e experiment. Certainly, the front-end electronics will be
inaccessible during the running of the experiment, where access is expected approximately
once per year for maintenance. The experiment will have wide variation in air
pressure/vacuum. Temperature swings will be present, although likely not to the level
experienced in uninhabited aircraft. There should not be much mechanical vibration or
shock in the Mu2e detector, although the detector train will move, albeit with much smaller
forces and velocity. However, a significant environmental factor for the Mu2e front-end
electronics is radiation damage, which is not mentioned in the definition of AUC, even
though there is a radiation effect in aircraft. The thinning of the atmosphere results in a
higher flux of cosmic rays, resulting in a higher radiation dose, albeit small, compared to
ground-based operation. Even so, the most extreme dose and fluence levels in aircraft are
orders of magnitude smaller than in Mu2e. So, there are similarities between the two
environments, but differences as well. Again, given the uncertainties in the definition of
associated environmental acceleration factors for the different types of electronic
components, consideration of the AUC environment provides a sense of how the MTTF is
affected by harsh environmental conditions. That said, the results from the inclusion of
the environmental acceleration factors should be regarded as illustrative only, with
no firm expectation that they will match realistic operating conditions concerning reliability.
The exercise does serve though to underscore the importance of understanding the radiation
tolerance of the electronic components in a front-end design.
The handbook is not without caveats, limitations, and other issues, some of which
are identified below.
• The handbook is quite dated. The first edition was released in 1961, around the
advent of the marketing of transistors. The original basis for the handbook was
experience with vacuum tube electronics. There have been several releases since
that first version, attempting to include parts and developments in technology as
they evolved. The latest version, MIL-HDBK-217F, was released in 1991. Since
this time, there have obviously been additional major advances in integrated circuit
technology nodes (fabrication feature sizes). For example, in 1991, the CMOS 0.35
µm technology node began production. Today the smallest technology node in
production is 5 nm, a factor of 70 in feature size, or about 4900 in area, which matches
the predictions of Moore’s Law [23-24]. New IC fabrication companies have come
and gone in this time. The 1991 handbook makes virtually no reference to
technology nodes. In another example, the majority of electronic components
manufactured today are surface mount, in a variety of sizes, packages, and
materials, but in the 1991 handbook, surface mount parts tend to be lumped into
single categories, although some discrimination is provided for power ratings.
Many important differences in surface mount components and packages that are
known today to affect reliability have been neglected. While it appears that the
Dept. of Defense will not be issuing any more updates to the handbook, the VITA
Standards Organization (VSO) [25] has produced updates [26], which some
reliability analyses have been incorporating. These updates from VITA have not
been included in this analysis.
• The handbook methods are based upon elementary reliability concepts, which do
not take into account newer developments in reliability analysis. Some of these can
be found in [27-30]. In addition, a recent thrust has been in the study of “physics
of failure” [31]. The analysis described herein is based entirely on the methodology
set forth in the handbook, and does not include any of these modern aspects to
reliability theory. An overview of the foundation for this analysis is provided in
the Appendix.
1. Early failure, also known as the Infant Mortality period. This type of failure
happens early in the lifetime of a component, and is usually caused by
manufacturing defects.
2. Useful lifetime, also known as the period of constant failure rate. In this period,
failures are random, but occur with an overall constant rate.
3. End of life, also known as the Wear-Out period.
Taken together, these periods comprise the “bathtub curve,” as shown in Fig. 4.1.1 [36].
Generally speaking, this note is concerned with calculating the (constant) failure rate for
the useful lifetime period.
Fig. 4.1.1. Bathtub Curve, showing the three types of failures and their associated
periods. (Courtesy of Wikipedia [36].)
where λ is a constant.
The quantity F(t) is called the Cumulative Distribution Function (CDF). The CDF can be
interpreted as:
1. F(t) is the probability that a random component of a specific type and value in a
system fails by time t; or
2. F(t) is the fraction of all like components in a system that fail by time t.
Note that while the point at which F(t) = 1 effectively represents the failure of all
components of a particular type in the entire system, this is distinctly different from the
“wear out” failure rate shown in the bathtub curve of Fig. 4.1.1. Wear out comes from
fatigue from use, whereas the failures in the useful lifetime are considered to be random
events related to imperfection in the manufacturing process. Typically, parts wear out
much sooner than the random failures would deplete the population.
where λ is a constant, which will be discussed shortly. A plot of f(t) is shown in Fig. 4.1.3,
for the case where λ = 1E6. The quantity f(t) dt represents the fraction of failure times in
interval dt.
The Reliability function (also called the Survival function), is defined as:
1. R(t) is the probability that a random component of a specific type and value in a
system will still be operating after t hours; or
2. R(t) is the fraction of all like components in a system that will still be operating
after t hours.
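As a minimal sketch of the quantities described above, the snippet below evaluates F(t), f(t) and R(t) under the assumption that the CDF has the exponential form F(t) = 1 - exp(-λt), which is the form this note refers to. The function names and the numerical values of λ and t are illustrative assumptions, not values from the analysis.

```python
import math


def cdf(t, lam):
    """F(t): probability that a component has failed by time t (exponential model)."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when lam * t is very small.
    return -math.expm1(-lam * t)


def pdf(t, lam):
    """f(t) = dF/dt: density of failure times."""
    return lam * math.exp(-lam * t)


def reliability(t, lam):
    """R(t) = 1 - F(t): probability of still operating after time t."""
    return math.exp(-lam * t)


lam = 1e-6   # assumed failure rate in failures per hour (illustrative only)
t = 8760.0   # one year of continuous operation, in hours

print(f"F(t) = {cdf(t, lam):.6f}")
print(f"R(t) = {reliability(t, lam):.6f}")
print(f"f(t) * dt for dt = 1 h: {pdf(t, lam) * 1.0:.3e}")
```

Because R(t) = 1 - F(t), the two printed probabilities always sum to one, which is a quick sanity check on any implementation.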
At a given time τ, some number of failures will have occurred in the system. The
probability of failure in the next ∆τ of time is expressed as a conditional probability, the
probability of failure in the next ∆τ of time given the number of components that have
survived to time τ:
\[ P(\text{failure in the next } \Delta\tau \mid \text{survival to } \tau) = \frac{F(\tau + \Delta\tau) - F(\tau)}{R(\tau)} \tag{4.4} \]
Of interest in reliability analysis is the rate of failure, also known as the hazard rate
or instantaneous failure rate. This is denoted as h(t), and is defined as:
\[ h(t) = \lim_{\Delta t \to 0} \frac{F(t + \Delta t) - F(t)}{\Delta t \, R(t)} = \frac{f(t)}{R(t)} \tag{4.4} \]
For the case where f(t) is an exponential as represented in (4.2), the hazard rate reduces
to:
\[ h(t) = \lambda \tag{4.5} \]
The units of λ are in “number of failures per unit time,” which is a failure rate.
Electronic component manufacturers often express failure rates in terms of the number of
failures in 1E9 hours, which is called “failures in time” or FITs.
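The following sketch shows the FIT convention and checks numerically that the hazard rate f(t)/R(t) of the exponential model is the constant λ. The helper names and the 50 FIT example value are hypothetical and chosen only for illustration.

```python
import math

FIT_HOURS = 1e9  # one FIT corresponds to one failure per 1E9 device-hours


def fit_to_rate(fit):
    """Convert a FIT value to a failure rate in failures per hour."""
    return fit / FIT_HOURS


def rate_to_fit(lam):
    """Convert a failure rate in failures per hour to FITs."""
    return lam * FIT_HOURS


def hazard_rate(t, lam):
    """h(t) = f(t) / R(t), evaluated for the exponential model."""
    f = lam * math.exp(-lam * t)
    r = math.exp(-lam * t)
    return f / r


lam = fit_to_rate(50.0)  # hypothetical part rated at 50 FIT
for t in (0.0, 1.0e3, 1.0e6):
    # The ratio is the same at every time: the hazard rate is constant.
    print(f"t = {t:9.0f} h, h(t) = {hazard_rate(t, lam):.3e} per hour")
```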
In the general case, h(t) can vary as a function of time. Of interest is the Average
Failure Rate, or AFR. Over a time period t2 – t1, this is defined as:
\[ \mathrm{AFR}(t_2 - t_1) = \frac{\int_{t_1}^{t_2} h(t)\,dt}{t_2 - t_1} \tag{4.6} \]
Again, for the case where f(t) has an exponential form as shown in (4.2), the AFR is:
\[ \mathrm{AFR}(t_2 - t_1) = \lambda \tag{4.7} \]
Thus, assuming an exponential form of the Cumulative Distribution Function for the
failures of electronic components, the FITs value will be an indicator of the average
failure rate of the components in the system.
\[ \mathrm{MTTF} = \frac{1}{\lambda} \tag{4.8} \]
Note that MTTF is for the case where components fail in a system and are not replaced
(immediately). This would be the situation for a detector in which access to perform
repairs is infrequent. This should not be confused with the term Mean Time Between
Failures, or MTBF, in which components are replaced as they fail. These terms are often
used interchangeably, although their meaning is different.
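To make the non-repaired interpretation of MTTF concrete, the sketch below converts a failure rate to an MTTF via MTTF = 1/λ and evaluates the surviving fraction R(t) at t = MTTF. The 200 FIT figure and the function names are assumptions made for illustration only.

```python
import math

HOURS_PER_YEAR = 8760.0


def mttf_hours(lam):
    """MTTF = 1 / lambda for a constant failure rate lambda (failures per hour)."""
    return 1.0 / lam


def surviving_fraction(t, lam):
    """R(t): fraction of unrepaired components still operating after t hours."""
    return math.exp(-lam * t)


lam = 200.0 / 1e9        # hypothetical board with a total rate of 200 FIT
mttf = mttf_hours(lam)

print(f"MTTF = {mttf:.3e} h = {mttf / HOURS_PER_YEAR:.0f} years")
# With no replacement, only exp(-1) (about 37%) of the population survives to t = MTTF.
print(f"R(MTTF) = {surviving_fraction(mttf, lam):.3f}")
```

The second printed value illustrates why MTTF should not be read as a guaranteed lifetime: well over half of an unrepaired population has failed by that time.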
The goal of this analysis is to calculate the value of the hazard function for an
entire printed circuit board. A printed circuit board typically has many components on it,
each having their own hazard function. The probabilities of failure (or survival) of the
different components must be combined in order to get the overall probability of failure
(or survival) for the entire board. For two events, A and B, the probability of them
occurring simultaneously (intersection of event spaces) is given by conditional
probability:
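\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \tag{4.9} \]
Rearranging:
\[ P(A \cap B) = P(B) \, P(A \mid B) \tag{4.10} \]
If A and B are independent:
\[ P(A \mid B) = P(A) \tag{4.11} \]
Then:
\[ P(A \cap B) = P(B) \, P(A) \tag{4.12} \]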
For a simple printed circuit board that has two different components on it, A and B,
having probabilities of survival P(A) and P(B) respectively, the probability of them both
surviving as a function of time t is obtained by multiplying the two probabilities together
to get the overall probability. This assumes that their failures are independent of each
other, that the failure of one does not cause the failure of the other. For simple reliability
calculations, this is what is generally assumed. If component A has a hazard rate of λA,
and component B has a hazard rate of λB, then the combined probability for survival as a
function of time t is given by:
\[ P(A \cap B) = e^{-\lambda_A t} \, e^{-\lambda_B t} = e^{-(\lambda_A + \lambda_B) t} \]
Extrapolating to a printed circuit board containing M components, each with hazard rates
λ1, λ2, …λM respectively, the hazard rates are added together to give an overall hazard
rate for the board λBD:
\[ \lambda_{BD} = \sum_{i=1}^{M} \lambda_i \tag{4.13} \]
Once the overall hazard rate for a board is known, the probability of having a
board failure as a function of time is given by the Cumulative Distribution Function:
\[ F_{BD}(t) = 1 - e^{-\lambda_{BD} t} \tag{4.14} \]
For a system having N identical boards, the CDF can be used to calculate the
number of expected failures of boards in the system:
\[ N_{\mathrm{failures}}(t) = N \, F_{BD}(t) = N \left( 1 - e^{-\lambda_{BD} t} \right) \tag{4.15} \]
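The board-level relations (4.13) to (4.15) can be sketched in a few lines. The component FIT values, the number of boards and the operating time below are hypothetical placeholders rather than results of this analysis, and the function names are introduced only for this illustration.

```python
import math


def board_hazard_rate(component_rates):
    """Eq. (4.13): sum of the independent component hazard rates on one board."""
    return sum(component_rates)


def board_failure_probability(lam_bd, t_hours):
    """Eq. (4.14): probability that a board has failed by time t."""
    return -math.expm1(-lam_bd * t_hours)  # 1 - exp(-lam_bd * t)


def expected_board_failures(n_boards, lam_bd, t_hours):
    """Eq. (4.15): expected number of failed boards among N identical boards."""
    return n_boards * board_failure_probability(lam_bd, t_hours)


# Hypothetical per-component failure rates in FITs (failures per 1E9 hours).
component_fits = [2.0, 5.0, 10.0, 33.0]
lam_bd = board_hazard_rate(fit * 1e-9 for fit in component_fits)

t_hours = 8760.0   # assumed operating time: one year
n_boards = 1000    # assumed number of identical boards

print(f"Board hazard rate: {lam_bd:.3e} per hour")
print(f"P(board failure by t): {board_failure_probability(lam_bd, t_hours):.3e}")
print(f"Expected failed boards: {expected_board_failures(n_boards, lam_bd, t_hours):.3f}")
```

Summing the rates is equivalent to multiplying the individual survival probabilities, since the product of exp(-λi t) terms equals exp(-t Σ λi).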
For any given component on a board, there may be several factors that contribute
to the hazard rate. Examples include temperature, mechanical or electrical stress, overall
quality of the part, the environment, etc. One could model this as individual hazards or
failure mechanisms. Assuming that the failure mechanisms are independent, the overall
survival probability would be given by the product of the individual probabilities for each
failure mechanism, i.e., the net probability that all independent failure mechanisms will
not occur at time t. Rather than doing this, the approach used in MIL-HDBK-217F is to
define a base hazard rate, λb, and then define multiplicative factors that are functions of
the particular failure mechanism. The resulting hazard rate, λp, has the form,
5. References
∗ Corresponding author.
E-mail addresses: [email protected] (A. Saini), [email protected] (R. Prakash).
https://doi.org/10.1016/j.nima.2020.164874
Received 14 April 2020; Received in revised form 16 November 2020; Accepted 16 November 2020
Available online 19 November 2020
0168-9002/© 2020 Elsevier B.V. All rights reserved.
A. Saini, R. Prakash and J.D. Kellenberger Nuclear Inst. and Methods in Physics Research, A 988 (2021) 164874
Table 1
Design specifications for operational beam parameters of the PIP-II linac.
Parameter                     Magnitude   Units
Final beam energy             800         MeV
Beam pulse repetition rate    20          Hz
Beam pulse length             0.55        ms
Average CW beam current       2           mA
Final εz                      <0.4        mm-mrad
Final εt                      ≤0.3        mm-mrad
εz: normalized RMS longitudinal emittance; εt: normalized RMS transverse emittance.

Fig. 1. Block diagram representation of the PIP-II Linac. Red coloured blocks represent
the warm sections (RT) whereas the blue blocks represent superconducting sections (SC)
operating at 2 K. Normalized design velocity (β) of the cavity in each section is also
shown. (For interpretation of the references to colour in this figure legend, the reader
is referred to the web version of this article.)

(e.g. cryo-plant, RF system etc.) or considering a simple form of the
major beamline elements [9–13]. For this reason, the paper lays out
a methodology for the availability assessment of the complete particle
accelerator facility. It describes a comprehensive reliability model for
the availability assessment of the Proton Improvement Plan-II (PIP-II)
SRF accelerator facility [14] that includes not only the accelerator
components but also essential utility systems in terms of the water,
air, cryogenic and power systems. Furthermore, the model implements
the accelerator components in their detailed composition, meaning that each
accelerator component is described together with its essential auxiliary systems.
For instance, an accelerator cavity in the model is implemented with
its power coupler, frequency tuner and RF power source. Thereafter, the
paper discusses studies for the PIP-II facility that lead to finding the
most critical section determining the unavailability budget of the PIP-II
facility. Lastly, the paper presents an input-data sensitivity analysis
that assesses the impact of a spread in the reliability input data on the model
prediction, and validates the model methodology using a reference
model of an existing operational accelerator facility.
The paper is organized in seven sections. Section 2 provides an
overview of the PIP-II SRF linear accelerator whereas Section 3 introduces
key definitions and concepts of the availability analysis for
an accelerator system. Section 4 discusses preparation of the PIP-II
accelerator facility model and describes component selection criteria,
operational modes and the high-level functional block diagram of the
facility. Section 5 discusses results of the availability analyses while
Section 6 presents a sensitivity analysis and the model benchmarking
with an operational accelerating facility. The paper concludes with a
summary in Section 7.

Fermilab is planning to perform a systematic upgrade to its existing
accelerator complex to support a world leading neutrino program. A
comprehensive roadmap named "Proton Improvement Plan (PIP)" has
been established. The second stage of the Proton Improvement Plan
comprises construction of a new superconducting linear accelerator
(linac) capable of accelerating a 2 mA H− ion beam up to 800 MeV
in a continuous wave (CW) regime. However, the initial operational
goal is to deliver a 1.1% duty factor pulsed beam to the existing Booster
synchrotron [15]. The PIP-II accelerator facility aims at the operational
availability of 90% over a fiscal year [16]. Table 1 summarizes the most
relevant operational beam parameters of the PIP-II linac.
A schematic of the SRF linac’s architecture is shown in Fig. 1. It
is composed of a warm front-end and an SRF accelerating section. The
warm front-end comprises an H− ion source (IS) capable of delivering a
15 mA, 30 keV, DC or pulsed beam, a 2 m long Low Energy Beam Transport
(LEBT) line [17], a 162.5 MHz, CW Radio Frequency Quadrupole
(RFQ) [18] that accelerates the beam to 2.1 MeV and a 13 m long
Medium Energy Beam Transport (MEBT) line [19] that includes a variety
of diagnostic devices and a chopper system capable of generating
an arbitrary bunch pattern before the beam is injected into the SRF
section.
The MEBT is followed by the SRF linac that uses five families
of SRF cavities to accelerate the beam up to 800 MeV. Based on
these families, the linac is segmented into five SRF sections named
Half Wave Resonator (HWR) [20], Single Spoke Resonator (SSR)
1 & 2 [21,22], and Low Beta (LB650) and High Beta (HB650) [23].
Table 2 highlights the configuration of each section and includes details of
the number of cryomodules (CM), focusing magnets and cavities as well
as the operating frequency of cavities and their accelerating ranges. Note
that superconducting solenoid magnets are used in the HWR, SSR1 and
SSR2 sections whereas normal conducting (NC) quadrupole magnets
arranged in doublet configuration are utilized in the LB650 and HB650
sections for the transverse beam focusing.

Table 2
Optics elements and transition energy in each section of the PIP-II SRF linac.
Section   CM   Cav/Mag per CM   Operating frequency   Energy (MeV)
HWR       1    8/8              162.5 MHz             2.1–10
SSR1      2    8/4              325 MHz               10–32
SSR2      7    5/3              325 MHz               32–177
LB        9    4/1 (a)          650 MHz               177–516
HB        4    6/1 (a)          650 MHz               516–833
(a) Normal conducting quadrupole doublet.

The linac optics has been carefully designed to deliver a high-quality
beam at the Booster entrance. Fig. 2 shows the accelerating voltage and
output energy at each cavity along the linac for the baselined optics.
A detailed description of the linac architecture and its optics design has
been presented elsewhere [24].

There are many good textbooks [25,26] dedicated to reliability
engineering theory. For the comprehension of this article, this section
introduces the necessary theory and discusses how it is applicable in the
framework of accelerators.
The failure rate (λ) of a component through its life span usually
follows a bath-tub distribution as shown in Fig. 3. The initial portion of
the bath-tub curve is called the burn-in period, which has a high
failure rate due to infant mortality. Similar behaviour is observed
at the end of the curve due to deterioration of components. This period
is defined as the wear-out period. Between these two regions, a
system has a useful life period with a relatively lower
and constant failure rate. Assuming that accelerators also follow the bath-tub
curve analogy, the burn-in period corresponds to the commissioning
period when the accelerators are being actively tuned and tested to
deliver operational parameters. The wear-out period for an accelerator
is the period when an upgrade or replacement is needed to maintain
its operational performance. In this paper, the main emphasis is on the
useful period of an accelerator, which can be interpreted as its nominal
operational period. In subsequent sections, the availability model is
solved using the assumption of a constant failure rate of components.
Note that this assumption is not only consistent with the bath-tub analogy but
also permits solving the model analytically, which otherwise becomes
too cumbersome for large systems.
\[ \mathrm{MTBF} = \mathrm{MTTF} + \mathrm{MTTR}; \tag{1} \]
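As an illustration of how Eq. (1) feeds an availability estimate, the sketch below assumes the standard steady-state definition of inherent availability, A = MTTF / (MTTF + MTTR), and the usual product rule for independent components connected in series. Both are assumptions here, since the article's own availability formulae are not reproduced in this excerpt, and the component values are hypothetical.

```python
def mtbf(mttf_hours, mttr_hours):
    """Eq. (1): mean time between failures for a repairable component."""
    return mttf_hours + mttr_hours


def inherent_availability(mttf_hours, mttr_hours):
    """Assumed steady-state definition: fraction of time the component is operable."""
    return mttf_hours / mtbf(mttf_hours, mttr_hours)


def series_availability(availabilities):
    """Independent components in series: the system is up only if all are up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


# Hypothetical (MTTF, MTTR) pairs in hours; not data from the PIP-II model.
components = [(10_000.0, 10.0), (50_000.0, 100.0)]
availabilities = [inherent_availability(f, r) for f, r in components]

for (f, r), a in zip(components, availabilities):
    print(f"MTTF = {f:8.0f} h, MTTR = {r:5.0f} h, A = {a:.6f}")
print(f"Series availability: {series_availability(availabilities):.6f}")
```

A series chain is never more available than its weakest member, which is why long repair times on a single component can dominate the unavailability budget.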
Fig. 5. A reliability block diagram representing common system-component functional
relationship in a complex system.

a failure of the overall system until all parallel connected components
have failed, similar to the logical OR gate analogy). Note that series and
parallel connections are limiting cases of the "k out of n" system where
the availability of a system with n identical components is obtained as
follows:
\[ A(t)_{sys(r \le k)} = \sum_{r=0}^{k} \frac{n!}{r!\,(n-r)!} \, A(t)^{n-r} \, (1 - A(t))^{r} \tag{4} \]
where r is the number of failures, k is the maximum allowable number of failures and
A(t) is the availability of the component at time t. When k = 0, all
components are connected in series while for k = n − 1, all components
are connected in parallel.
In an accelerator, a variety of component-system functional relationships
such as series, parallel, standby, redundant connections etc.
might exist simultaneously. A list of formulae for the system availability
and reliability with such configurations is presented in
Appendix A.1.

4. Availability assessment model for PIP-II

A comprehensive availability assessment model of the PIP-II accelerator
facility in the form of a high-level functional block diagram is
developed to compute its availability. This section details preparation
of the model and delineates assumptions and guidelines used to build
the model.

4.1. Component selection

It is evident that an accelerator comprises numerous components
and dependent systems. Many of these components need additional
auxiliary elements to execute their nominal function. For instance, an
accelerating cavity assembly in the beamline comprises several
auxiliary elements such as a power coupler to feed RF power, a mechanical
tuner to tune its resonant frequency etc. This, in turn, adds
another layer of elements in the model. Consequently, the model of
an accelerator facility becomes very large and cumbersome. In order
to resolve this issue, a component-selection criterion was applied to
the PIP-II model. A component featuring any of the following characteristics
is included in as much detail as practically permissible when
preparing the model of the PIP-II facility.

• Components having moving parts such as vacuum pumps, cavity
tuners etc.
• Components operating in pulsed mode such as high voltage
switches, kicker system in the MEBT etc.
• Components that are involved in thermal cycling processes e.g.
heat exchangers for low conductivity water (LCW).
• Components containing a high stored energy, e.g. RF cavities and
magnets etc.

• As also mentioned earlier, each component in the model possesses
only binary states of operation, i.e. either operating nominally
or failed. The component can migrate to either state independent of
its operational history.
• A component exhibits a constant failure rate during its operation.
• Each component fails at a random time with an exponential
distribution determined by its MTBF. Two simultaneous failures
are prohibited in the model. Those uncorrelated component failures
are then represented by Markov chains [26] and solved
analytically to evaluate the system availability.
• When a component fails, it leads to the system failure (unless
fault-tolerances are specified) resulting in an unscheduled accelerator
shut-down. A temporary component failure such as one
resulting from quenching of an SRF cavity or magnet is not treated
as a failure in the model.
• The model assumes components meet their design specifications
and the system is maintained to its best operable condition. Thus,
the model does not incorporate manufacturing errors, human
errors and environmental errors. Additionally, implications of
drift failures or degradation in performance of components are
not included in the model.
• The model implements only corrective maintenance. This implies that
the fault detection time, logistic time at various stages of repair,
tuning etc. are excluded. As soon as a failure is detected, the
maintenance process is launched. After a repair, the component
is treated "as good as new". The resulting availability of the
system is called inherent availability. Note that the availability
in this paper always refers to the inherent availability.
• The model is further simplified with the assumption that the
facility transitions from a no-beam state after a failure to the nominal
beam state as soon as a repair is completed.
• A mission time of about a year, equivalent to eight thousand
operational hours, is assumed for the availability analysis of the
PIP-II accelerator facility.

4.3. Operational modes

A system may be required to operate in different modes. These operational
modes define the system-component functional relationship and,
therefore, the failure pattern of the system. Consequently, the system
operational availability may vary from one operational mode to another.
Thus, it is essential to establish operational modes of a system before
estimating its availability. In this article, the availability of the PIP-II
accelerator facility is evaluated for two operational modes named
the nominal operational mode and critical operational mode.

4.3.1. Nominal operational mode
In the nominal operational mode, the PIP-II facility delivers an
800 MeV beam to the Booster synchrotron with the design specifications
listed in Table 1. Note that the baseline configuration of the
SRF linac has been designed to accelerate the beam up to 833 MeV.
This additional energy provides a safety margin to achieve the nominal
operational mode. It has been shown elsewhere [28,29] that the SRF
linac optical design is sufficiently robust to tolerate a failure of optical
element in each SRF section without conceding the design specifica-
tions. Consequently, the nominal operational mode can be achieved in
two ways. The first nominal operational scenario, termed no-failure-permit
in this paper, has all optical elements operating with
their design parameters. In this configuration, any component failure
will produce a complete system failure. The second scenario is named
as the fail-tolerance operation that permits a faulty/malfunctioned
accelerating cavity in each SRF section (HWR, SSR1, SSR2, LB650 and
HB650). It implies that the facility would keep operating even after
a failure of the SRF cavity in each section. Note that, a repair or
replacement of an element in cryogenic environment requires relatively
a longer time in comparison to repair of a normal-conducting element.
Consequently, the fault-tolerances in the availability estimate have
been included only in SRF sections. This choice for the analysis does not
infer the fault-tolerance capability of the normal-conducting sections
and allows a conservative estimate of the availability. It is worth mentioning
that a conservative assessment is beneficial at the design phase,
where a number of factors (human errors and environmental impacts)
are relatively less known.

The reliability input data, MTTF and MTTR, for components are acquired
from various sources including educated guesses from subject
experts, operational experience with similar components at Fermilab
as well as existing accelerator facilities, and prototype tests. The
beam commissioning of the PIP-II front-end at the Proton Improvement
Plan-II Injector Test (PIP2IT) facility [30] also provided useful information
about the operational reliability of PIP-II components such
as the ion source, magnet power supplies etc. A few components were
commercially available and therefore corresponding data were readily
available. In addition, a few references [5–13] were also used to obtain
data that were unavailable otherwise.
Fig. 6 shows the most vulnerable components in the PIP-II accelerator
facility model. It can be noticed from Fig. 6 that components in the ion
source assembly possess the minimum MTTF, followed by the
compressor in the air utility system. Fig. 7 shows the most robust and
reliable components of the PIP-II facility model, which have the longest MTTF.
Note that a high MTTF implies less frequent failures of the component.

Fig. 6. Components with the minimum MTTFs in the PIP-II model. The colour of the
bar represents the component’s association with respective assembly or section. For
instance, red coloured bar shows the MTTF of components in the ion-source assembly.
(For interpretation of the references to colour in this figure legend, the reader is referred
to the web version of this article.)

Fig. 7. Components with the maximum MTTF in the PIP-II facility model. The colour
of the bar represents the component’s association with respective assembly or section.
For instance, red coloured bars represent components in a superconducting (SC) cavity
assembly. (For interpretation of the references to colour in this figure legend, the reader
is referred to the web version of this article.)

It can be noticed from Fig. 8 that the high voltage transformer in
the electrical power grid and the SRF cavities have the longest MTTR in
the model. Based on previous experience at Fermilab, experts suggest
that a repair/replacement of such a transformer could take up to two full
weeks. Considering an eight-hour work shift per day, the repair time is
then estimated at more than 1000 h (24 × 3 × 14 > 1000 h). Because of
this, the PIP-II facility envisions two power lines. The electric-power load
is swiftly shifted from one line to another in case of a failure. A repair is
then performed in parallel without a long interruption. Also note that
repair of an SRF cavity may require warming the cryomodule from a
cryogenic temperature to room temperature, taking the cryomodule out
of the accelerator tunnel and then dismantling it to replace/repair the
faulty cavity. It could result in a long unscheduled down time spanning
over several months. To minimize this time at the PIP-II facility, the
mitigation strategy involves replacing the faulty cryomodule with a
fully-functional spare cryomodule. Then, repair of the faulty element
in the cryomodule is carried out in parallel without affecting the
accelerator operational time. This strategy restricts the repair time of a
superconducting element to only about a month.

5.2. High-level functional diagram for the PIP-II facility

As a next step for the availability assessment, a high-level functional
block diagram model of the PIP-II facility was developed. The facility,
as shown in Fig. 9, was modelled in two main parts: utility systems and
linac systems.

5.2.1. Utilities systems
A utility system in the model indicates a central facility of the core
supply essential to operate an accelerator such as a cryo-plant to supply
the cryogen for the SRF cavities. The model incorporates four utility
systems that are subsequently discussed in detail.
• Electrical-Power System: The PIP-II accelerator facility envisions two
electrical-power substations where one of the substations is available
in the standby mode. In the event of a failure, the power load is swiftly
shifted to the standby substation. The model includes major electrical
components such as transformers, switchgears, fuses, circuit breakers
and cables. The most vulnerable component in the electrical system
is the Vacuum Circuit Breaker (VCB), which exhibits a higher failure
rate. Because of this, four out of every eight VCBs are redundant in the
electrical system. Note that the model does not incorporate the power
generating system but only the supply system.
• Cryo-plant System: A cryo-plant supplies the cryogen necessary to
maintain the cryogenic temperature of the superconducting cavities. The
main components of the cryo-plant included in the model are the cold
compressors, turbines, expanders, warm compressors and associated
control systems. The warm compressors are the most susceptible to
failures among the cryo-plant components.
• Low Conductivity Water (LCW) System: It delivers water to maintain
the operating temperature of normal conducting water-cooled elements
such as the RFQ. The LCW system includes circulating pumps, heat
exchangers, gauges, transducers, flow meters and valves. Among those
components, the circulating pumps are more often involved in
failures. Consequently, the LCW system of the PIP-II facility includes
a redundant unit per three circulating pumps.
• Compressed Air System: The air system supplies compressed air for
cooling of the radiation-cooled components, actuation and control of
pneumatic valves etc. The two main components of the air system are the
compressor and the dryer. Each of them has a redundant unit in the model.

Fig. 8. Components with the longest MTTR in the PIP-II facility model.

5.2.2. Linac system
The model includes a detailed description of the accelerator system.
Along with the SRF linac (described in Section 2), details of the Beam
Transfer Line (BTL) [31] were also included in the model. The BTL
line is used to transport the beam from the end of the SRF linac to
the Booster entrance. It is about 350 m long and mainly composed of
normal conducting quadrupole and dipole magnets.
As shown in Fig. 9, the utility systems are connected to the linac
in a series configuration. This implies that failure of any functional block
will shut down the complete facility. After establishing the component-system
functional relationship, the PIP-II accelerator facility model
was incorporated in a Python-based program. The program has been
developed at Fermilab to automate the availability assessment. It
computes the availability not only of the complete facility but also of each
individual section and component. This feature facilitates finding the
most vulnerable section determining overall availability of the facility.

5.3. Case study of availability assessment for HWR section

In order to illustrate how the availability assessment is performed,
this section discusses a detailed case study for the HWR section and
describes the methodology applied to evaluate the availability of the
complete PIP-II facility.
The HWR section is the first SRF section in the PIP-II linac. As
shown in Table 2, it consists of one cryomodule that comprises eight
solenoid magnets and the same number of HWR cavities. Each solenoid
magnet includes the steering magnets to correct the beam trajectories
in horizontal and vertical planes. Those beamline elements further
need auxiliary components to execute their nominal operation. Thus,
it is more appropriate to describe an essential element in terms of the
package including all supporting components. The cryomodule model
is then represented using six packages: cavity, RF control, magnet
assembly, steerer assembly, vacuum system and local cryogenic system
packages. Table 3 lists major components and their functions in
respective packages for the HWR cryomodule.

Table 3
Components and their functions in the respective packages in the HWR cryomodule.
Component                            Function
Cavity package
  Cavity                             Acceleration, longitudinal beam focusing
  Tuner                              Tune cavity resonant frequency
  Power coupler                      Feeding RF power to cavity
  Interlock sensors and electronics  Sensors and electronics
  Low Level RF                       RF control and instrumentation
  Solid state Amplifier (SSA)        RF power source
RF control package: SSA control and timing
  SSA controls                       RF controls to SSA
  SSA timing                         Timing to the SSA
Magnet assembly package: Solenoid magnets assembly
  Magnet power supply                Power supply to solenoids
  Magnet                             Transverse focusing of the beam
  Magnet instrumentation             Control system
Steering assembly package
  Steering Magnet                    Beam Trajectory Correction
  Steering Power Supply              Magnet power supply
Vacuum system package
  Vacuum Valves                      Maintain vacuum
  Vacuum Pump                        Creating vacuum in the beamline
  Vacuum pump power supply           Powering the vacuum pump
Local cryogenic system package
  Local cryogenic system             Cryogenic distribution, cryostat structure and control

Availability assessment for the HWR cryomodule is performed for
two operational modes: no-failure-permit and cavity-fail-tolerance.
In the no-failure-permit mode, failure of any component leads to failure
of the complete HWR cryomodule, whereas in the cavity-fail-tolerance
mode, the cryomodule keeps operating even after failure of any one
of the eight SRF cavities. Fig. 10 illustrates the functional block diagrams
of the HWR cryomodule describing logical connections among
element packages for the two operational modes. In the no-failure-permit
mode, all element packages are connected in the series configuration
(Fig. 10(a)). In the cavity-fail-tolerance mode (Fig. 10(b)), all element
packages are connected in series with the cavity packages, which are
configured in a seven out of eight arrangement.
After establishing the functional diagram for the HWR cryomodule,
the next step involves computing the availability of each individual component in
an element package using input data of MTTF and MTTR in Eq. (3).
Table 4 shows availabilities of components in the cavity and magnet
A. Saini, R. Prakash and J.D. Kellenberger Nuclear Inst. and Methods in Physics Research, A 988 (2021) 164874
Fig. 9. High level functional diagram for the PIP-II accelerator facility.
Fig. 10. Functional block diagram for the HWR cryomodule for two operational modes: (a) no-failure-permit and (b) cavity-fail-tolerance.
packages. Then, using the knowledge of components logical connections in an element package, availability of the package is evaluated. Since components are connected in series configuration in the packages, availability of a package is obtained using equation:
\[ A_p = \prod_{i=1}^{N} A_i \tag{5} \]
where $A_p$ is the element package availability, $A_i$ is the availability of $i$th component in the package and $N$ is total number of components in a package. Similarly, failure rate of the element package is computed as:
\[ \lambda_p = \sum_{i=1}^{N} \lambda_i \tag{6} \]
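The series relations of Eqs. (5) and (6), together with the seven-out-of-eight arrangement of the cavity-fail-tolerance mode, can be sketched in a few lines of Python. This is a minimal illustration and not the Fermilab program mentioned in the text. The function names and the component values are hypothetical. The binomial sum used for the fault-tolerant case is the standard k-out-of-n expression for identical components, which the appendix of the paper also lists.

```python
from math import comb, prod


def series_availability(availabilities):
    """Eq. (5): product of component availabilities in a series chain."""
    return prod(availabilities)


def series_failure_rate(failure_rates):
    """Eq. (6): sum of component failure rates in a series chain."""
    return sum(failure_rates)


def fault_tolerant_availability(a, n, k):
    """Availability of n identical packages (each with availability a)
    when up to k of them may fail without stopping operation."""
    return sum(comb(n, i) * a ** (n - i) * (1 - a) ** i for i in range(k + 1))


# Hypothetical component data for one package (illustrative values only).
component_availability = [0.9999, 0.9998, 0.99995]
component_failure_rate = [1.0e-6, 1.0e-5, 1.0e-7]  # per hour

a_package = series_availability(component_availability)
lambda_package = series_failure_rate(component_failure_rate)

# Eight identical packages: all eight required versus seven out of eight.
a_series_8 = fault_tolerant_availability(a_package, n=8, k=0)
a_7_of_8 = fault_tolerant_availability(a_package, n=8, k=1)

print(f"package availability      : {a_package:.6f}")
print(f"package failure rate (1/h): {lambda_package:.3e}")
print(f"8 in series               : {a_series_8:.6f}")
print(f"7 out of 8                : {a_7_of_8:.6f}")
```

With `k=0` the binomial sum reduces to `a**n`, which is the series result of Eq. (5) for identical packages, so the same helper covers both operational modes.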
Fig. 11. HWR cryomodule is modelled using six essential element packages that are connected in a series configuration with the cavity package. (a) Combined availability of each
essential package and (b) availability of the full HWR cryomodule for two operational modes i.e. no-failure-permit (Case 1) and a cavity-fail-tolerance (Case 2).
Table 4
Availability of the cavity package and the solenoid magnet assembly package in the HWR cryomodule.
Component            MTTF, T (h)   𝜆 (h⁻¹)     MTTR (h)   A_i (%)
Cavity package
Cavity               8.76E+08      1.14E−09    776        99.999
Tuner                1.00E+06      1.00E−06    216        99.978
Coupler              1.00E+07      1.00E−07    0.5 (a)    99.999
Interlock sensors    1.00E+05      1.00E−05    1          99.999
Case 1: No-failure-permit mode (8 cavity packages in series)
$MTBF_{CP} = \frac{1}{8\lambda_p} = 3623.19$ h
$A_{CP} = (A_p)^8 = 99.79$ %
(a) It is assumed that the coupler MTTR is the time needed to restore accelerator operation after detuning the cavity. Major coupler repairs are accounted in the cavity MTTR.
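The component availabilities listed in Table 4 can be checked with a short sketch. It assumes that Eq. (3) has the usual steady-state form A = MTTF / (MTTF + MTTR) and that the failure rate is the reciprocal of the MTTF; both are assumptions here, since Eq. (3) is not reproduced in this part of the text. Only the four rows shown above are used.

```python
# Component data as listed in Table 4: (MTTF in hours, MTTR in hours).
components = {
    "Cavity": (8.76e8, 776.0),
    "Tuner": (1.00e6, 216.0),
    "Coupler": (1.00e7, 0.5),
    "Interlock sensors": (1.00e5, 1.0),
}


def availability(mttf, mttr):
    # Assumed steady-state form of Eq. (3): A = MTTF / (MTTF + MTTR).
    return mttf / (mttf + mttr)


def failure_rate(mttf):
    # Assumed constant failure rate: lambda = 1 / MTTF.
    return 1.0 / mttf


for name, (mttf, mttr) in components.items():
    a_percent = 100.0 * availability(mttf, mttr)
    print(f"{name:18s} lambda = {failure_rate(mttf):.2e} 1/h   A_i = {a_percent:.4f} %")
```

Under these assumptions the tuner, with its 216 h repair time, is the least available of the four listed components. This agrees with the 99.978% entry in the table.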
Table 5
Availability of the functional blocks of the PIP-II facility for two nominal modes. The component with the least availability in respective section is also listed.
No.  Section                   Availability (%)                       Component with lowest availability in the section
                               No-failure-permit   Fail-tolerance     Component name                      Availability (%)
                               mode                mode
1    Electrical power system   98.79               98.79              Electric wire                       99.22
2    LCW central system        99.88               99.88              Pressure gauge                      99.91
3    Cryo-plant system         99.07               99.07              Warm compressors                    99.82
4    Compressed air system     99.99               99.99              Compressor                          99.99
5    Ion source                98.67               98.67              Individual ion source               89.08
6    LEBT                      99.93               99.93              High voltage switch                 99.95
7    RFQ                       99.58               99.58              LCW distribution (RFQ)              99.70
8    LCW distribution          99.89               99.89              Circulating pump                    99.91
9    MEBT                      99.57               99.57              Magnet power supply chain           99.80
10   HWR                       99.69               99.90              Solenoid magnet                     99.91
11   SSR 1                     99.40               99.90              Solenoid magnet                     99.91
12   SSR 2                     98.50               99.72              Solenoid magnet                     99.78
13   LB 650                    98.49               99.76              Quadrupole magnet package           99.85
14   HB 650                    98.89               99.89              Quadrupole magnet package           99.87
15   Transfer line             98.27               98.27              LCW distribution (Transfer line)    99.09
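Because the functional blocks are connected in series, the facility availability follows from multiplying the section availabilities of Table 5. The sketch below does this for both modes. It is an illustrative cross-check only, not the authors' program, and the variable and function names are assumptions.

```python
from math import prod

# Section availabilities (%) from Table 5:
# (no-failure-permit mode, fail-tolerance mode).
sections = {
    "Electrical power system": (98.79, 98.79),
    "LCW central system": (99.88, 99.88),
    "Cryo-plant system": (99.07, 99.07),
    "Compressed air system": (99.99, 99.99),
    "Ion source": (98.67, 98.67),
    "LEBT": (99.93, 99.93),
    "RFQ": (99.58, 99.58),
    "LCW distribution": (99.89, 99.89),
    "MEBT": (99.57, 99.57),
    "HWR": (99.69, 99.90),
    "SSR 1": (99.40, 99.90),
    "SSR 2": (98.50, 99.72),
    "LB 650": (98.49, 99.76),
    "HB 650": (98.89, 99.89),
    "Transfer line": (98.27, 98.27),
}


def facility_availability(values_percent):
    """Series combination: product of the section availabilities, in percent."""
    return 100.0 * prod(v / 100.0 for v in values_percent)


no_failure = facility_availability(v[0] for v in sections.values())
fail_tolerance = facility_availability(v[1] for v in sections.values())

print(f"No-failure-permit mode: {no_failure:.1f} %")
print(f"Fail-tolerance mode   : {fail_tolerance:.1f} %")
```

The two products come out at roughly 89.2% and 93.0%, in line with the facility-level values reported in Table 6. Small differences are expected because the tabulated section values are rounded.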
Table 6
Availability and MTBF allocation by category for two operational modes of the PIP-II linac facility.
                    No-failure-permit mode      Fail-tolerance mode
                    MTBF (h)     A (%)          MTBF (h)     A (%)
Utility system      1881.2       97.6           1881.2       97.6
NC Linac system     127.8        96.1           127.8        96.1
SRF Linac system    130.9        95.1           197.8        99.2
PIP-II facility     62.5         89.2           74.5         93
It results in the MTTR of 6.8 and 5.2 h for the no-failure and fail-tolerance modes respectively.
The operational statistics of the existing accelerator facilities corroborate that the target availability of 90% is well within reach of modern technology. The Spallation Neutron Source (SNS) accelerator facility at Oak Ridge [33] has been reporting an availability of 90% since 2011 [34,35]. The proposed ESS facility also targets a facility availability of at least 90% over a calendar year [7]. This, in turn, confirms the feasibility of the PIP-II availability target. It is apparent from Table 6 that the PIP-II accelerator facility can deliver the target availability of 90% over a fiscal year in both operational modes. However, the analysis also corroborates that an additional improvement in the availability can be achieved through gaining a capability of operation
Fig. 13. Distribution of the down-time hours by sections of the PIP-II facility operating in (left) no-failure-permit and (right) fail-tolerance modes. Note that, fail-tolerance of a
cavity per section was applied only to the SRF sections of the facility.
in a fail-tolerance mode. This is why the baseline design of the PIP-II linac [36] has adopted cavity fault-tolerance in every SRF section. In addition to a local energy correction, allocation of spare cavities per section enables optics tuning in case of malfunctioning elements, which is otherwise not possible if the spare cavities are located at the end of the linac.
At times it is more practical to describe the unavailability in terms of the down-time that can be estimated using following equation:
Table 7
Availability of the PIP-II facility for two cases of the critical operational mode.
Section   Total number of cavities   Spare cavities   A (%)
LB650     36                         11               93.35
HB650     24                         15               93.34
Table 8
Number of packages in MB and HB sections of the SNS
SRF linac.
MB HB
Cavity package 33 48
Magnet package 22 24
Steerer package 22 24
Cryo-package 11 12
Transmitter 4 8
Modulator 3 4
Fig. 14. Variation in availability of the total facility with the MTTF scaling factor in a fail-tolerance operational mode.
Table 9
Operational availability of the SRF linac system of the SNS accelerator facility in two operational modes.
              No-failure-permit mode   Fail-tolerance mode
              availability (%)         availability (%)
MB section    97.2                     99.4
HB section    96.1                     99.3
SRF linac     93.4                     98.8
7. Summary
The paper introduced a methodology to model the availability of a complete particle accelerator facility. A comprehensive reliability model of the proposed PIP-II accelerator facility was developed that included not only the accelerator systems but also essential supporting systems such as the central cryo-plant, electrical power systems etc. The availability assessment of the PIP-II facility reveals that the ion source is the most vulnerable system, with an availability of only 88%. Consequently, the baseline of the PIP-II facility adopted an additional ion source configured in the standby mode. This arrangement increases the ion source availability to 98.7%. The baseline design of the PIP-II SRF linac also incorporates cavity fault-tolerance in every SRF section, which enables the facility to operate in the fail-tolerance mode. Furthermore, the PIP-II integration and operation strategy plans for a fully functional spare cryomodule always available in inventory for each SRF section, to minimize the repair time of the superconducting elements and therefore the unscheduled down time of the facility.
The availability of the full PIP-II facility in the nominal operational mode was found to be 89%, which increased to 93% after introducing the fail-tolerance of a cavity in every SRF section in the model. This corroborates that the baseline design of the PIP-II accelerator facility is sufficiently robust to meet the target availability in both nominal operational modes. Moreover, availability of the PIP-II facility was computed for the critical operational mode featuring the facility operation at the minimum beam energy of 600 MeV. The availability of the PIP-II facility in this mode was obtained to be 93%. An input data sensitivity analysis and the model validation using a reference model of the SNS SRF linac generate an adequate level of confidence in the PIP-II availability assessment, which leads us further to initiate engineering design of the PIP-II facility.
CRediT authorship contribution statement
Arun Saini: Conceptualization, Methodology, Writing - original draft, Revising, Investigation, Data Curation, Supervision, Formal analysis, Visualization, Writing - review & editing. Ram Prakash: Methodology, Investigation, Formal analysis, Software. Joseph D. Kellenberger: Software.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
The authors are thankful to the large team of scientists, engineers and technical staff who provided key input data for this study. The authors would like to express gratitude on a more personal level to A. Klebaner, A. Martinez, J. Holzbauer, D. L. Newhart and J. E. Anderson Jr. for their constructive suggestions and discussions that helped the authors to enhance the quality of the paper. The authors also wish to acknowledge the efforts of L. Serio and C. Adolphsen, who reviewed this work and provided their invaluable feedback. The authors are also grateful to Barbara Merrill, Dr. Priyanka Saini and Dr. Vyacheslav Yakovlev for their invaluable time to proof-read the manuscript and for useful suggestions.
This manuscript has been authored by Fermi Research Alliance, LLC under Contract No. DE-AC02-07CH11359 with the U.S. Department of Energy, Office of Science, Office of High Energy Physics.
Appendix A
A.1.
For the comprehension of this article, this appendix lists standard Reliability Engineering textbook formulae. Several of those formulae were applied in this article.
For the $i$th component, let $r_i$ be the reliability at any time $t$, $a_i$ the availability, $\lambda_i$ the failure rate and $\mu_i$ the repair rate. The following formulae are then obtained.
1. Series Configuration of the components in a system:
a. Availability of the system $A = \prod_i a_i$.
b. Reliability $R = \prod_i r_i$.
c. $MTBF = \frac{1}{\sum_i \lambda_i}$.
d. Mean Time To Repair $= (1 - A) \cdot MTBF$.
2. Parallel Configuration of the components in a system:
a. $1 - A = \prod_i (1 - a_i)$.
b. $1 - R = \prod_i (1 - r_i)$.
c. $MTBF = \frac{1}{(1 - A) \sum_i \mu_i}$.
d. Mean Time To Repair $= (1 - A) \cdot MTBF$.
3. k out of n systems: Assume that all the components have the same failure rate ($\lambda$) and repair rate ($\mu$).
a. $A = \sum_{i=0}^{k} \binom{n}{i} a^{n-i} (1 - a)^{i}$, where $k$ is the maximum number of failures allowed in a system and $n$ is the total number of components.
b. $MTBF = \frac{1}{\lambda} \left( \frac{1}{n} + \frac{1}{n-1} + \cdots + \frac{1}{k} \right)$ for non-repairable systems.
4. Standby (Cold): A standby component starts operating as soon as another component fails. If two components in a system have the same failure rate ($\lambda$) and repair rate ($\mu$) and one of the components is kept in standby mode, then the reliability and MTBF of the system are expressed as below:
a. Reliability $R = (1 + \lambda t)\, e^{-\lambda t}$.
b. $MTBF = \frac{2}{\lambda} + \frac{2\mu^{2}}{\lambda^{2} (\lambda + 2\mu)}$.
In general, when the two components have different failure rates $\lambda_1$ and $\lambda_2$ and repair rates $\mu_1$ and $\mu_2$, the MTBF is expressed as below:
\[ MTBF = \frac{1}{\lambda_1} + \frac{1}{\lambda_2} + \frac{\mu_1}{\lambda_2} \left( \frac{1}{\lambda_2} - \frac{1}{\lambda_2 + \mu_1 + \frac{\lambda_2 \mu_2}{\lambda_1}} \right). \]
A.2.
See Fig. A.1.
A.3.
See Table A.1.
Appendix B. Supplementary data
Supplementary material related to this article can be found online at https://doi.org/10.1016/j.nima.2020.164874.
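Two of the textbook expressions listed in the appendix, the parallel-configuration availability and the cold-standby reliability for two identical non-repaired units, can be sketched as follows. The numerical values are hypothetical and serve only to show how the formulae behave. The single-unit reliability exp(-λt) is an assumed constant-failure-rate model used here for comparison.

```python
from math import exp, prod


def parallel_availability(availabilities):
    """Parallel configuration: 1 - A = prod(1 - a_i)."""
    return 1.0 - prod(1.0 - a for a in availabilities)


def cold_standby_reliability(lam, t):
    """Two identical units, one in cold standby: R(t) = (1 + lam*t) * exp(-lam*t)."""
    return (1.0 + lam * t) * exp(-lam * t)


def single_unit_reliability(lam, t):
    """Assumed constant-failure-rate model for one unit: R(t) = exp(-lam*t)."""
    return exp(-lam * t)


lam = 1.0e-4  # hypothetical failure rate (1/h)
t = 5000.0    # hypothetical operating time (h)

print(f"single unit R(t)   : {single_unit_reliability(lam, t):.4f}")
print(f"cold standby R(t)  : {cold_standby_reliability(lam, t):.4f}")
print(f"two units, parallel: A = {parallel_availability([0.9, 0.9]):.4f}")
```

In both cases the redundant arrangement scores higher than a single unit: reliability over the operating time for the standby pair, and availability for the parallel pair. The second case is the effect exploited by the standby ion source described in the summary.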
Fig. A.1. Availability of HWR cryomodule computed for no-failure-permit mode using BlockSim (blue coloured bars) and analytical model (saffron coloured bars). (For interpretation
of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Table A.1
A detailed view of the element packages in the SRF cryomodule of the SNS linac.
Packages                  Components                              MTTF (h)   MTTR (h)   A (%)
Magnet package            Magnet                                  1E+06      16
                          Power Supply                            4.6E+04    2
                          Magnet Instrumentation                  1E+05      2
                          Magnet package availability                                   99.99
Cavity package            SRF cavity                              8.7E+08    776
                          Tuner                                   1E+06      216
                          Coupler                                 1E+07      0.5
                          Interlock sensor                        1E+05      1
                          Klystron                                5E+04      4.5
                          Wave Guide                              1.5E+05    3
                          Circulator                              5E+04      3
                          Load                                    7.5E+04    3
                          LLRF                                    1E+05      2
                          Cavity package availability                                   99.93
Steering magnet package   Steerer
                          Power supply                            1E+06      2
                          Magnet instrumentation                  1E+06      2
                          Steerer instrumentation                 1E+05      2
                          Steering magnet package availability                          99.99
Cryo package              Vacuum valves                           1E+07      8
                          Ion pump                                1E+06      4
                          Ion pump power supply                   1E+05      1
                          Local cryogenic distribution            5E+05      2
                          Cryo-package availability                                     99.99
Additional components     Transmitter                             2.26E+04   4          99.98
                          Modulator                               5.6E+03    3          99.94
References
[1] J.N. Galayda, The LCLS-II: A high-power upgrade to the LCLS, in: Proceedings of IPAC2018, Vancouver, Canada, MOYGB2, pp. 18–23.
[2] S. Peggs, et al., ESS Technical Design Report, 2013, http://inspirehep.net/record/1704813?ln=en.
[3] A. Sharma, A.R. Jana, C.B. Patidar, M.K. Pal, N. Kulkarni, P.K. Hoyal, et al., Reference physics design for 1 GeV Injector Linac and accumulator ring for Indian spallation neutron source, arXiv:1609.04518 [physics.acc-ph].
[4] Zhihui Li, Peng Cheng, Huiping Geng, Zhen Guo, Yuan He, Cai Meng, Huafu Ouyang, Shilun Pei, Biao Sun, Jilei Sun, Jingyu Tang, Fang Yan, Yao Yang, Chuang Zhang, Zheng Yang, Phys. Rev. ST Accel. Beams 16 (2013) 080101.
[5] T. Himel, J. Nelson, N. Phinney, Availability and reliability issues for ILC, in: Proceedings of PAC07, Albuquerque, New Mexico, USA, pp. 1966–1969.
[6] L. Burgazzi, P. Pierini, Reliability studies of a high-power proton accelerator for accelerator-driven system applications for nuclear waste transmutation, Reliab. Eng. Syst. Saf. 92 (2007) 449–463, http://dx.doi.org/10.1016/j.ress.2005.12.008.
[7] E. Bargalló, R. Andersson, A. Nordt, A. De Isusi, E. Pitcher, K.H. Andersen, ESS availability and reliability approach, in: Proceedings of IPAC2015, Richmond, VA, USA (2015), MOPTY045, pp. 1033–1035.
[8] J. Knaster, P. Garin, H. Matsumoto, Y. Okumura, M. Sugimoto, F. Arbeiter, P. Cara, S. Chel, A. Facco, P. Favuzza, T. Furukawa, R. Heidinger, A. Ibarra, T. Kanemura, A. Kasugai, H. Kondo, V. Massaut, J. Molla, G. Micciche, S. O’hira, K. Sakamoto, T. Yokomine, E. Wakai, Overview of the IFMIF/EVEDA project, Nucl. Fusion 57 (10) (2017) 102016, http://dx.doi.org/10.1088/1741-4326/aa6a6a.
[9] R. Andersson, A. Nordt, E. Bargalló, Machine protection systems and their impact on beam availability and accelerator reliability, in: Proceedings of IPAC2015, Richmond, VA, USA, 2015, MOPTY044, pp. 1029–1032.
[10] P. Tallerico, D. Rees, D. Anderson, An availability model for the SNS Linac RF system, in: Proceedings of PAC2001, Chicago, IL, USA, 2001, MPPH112, pp. 1035–1037.
[11] E.S. Lessner, P.N. Ostroumov, Reliability and availability in the RIA driver linac, in: Proceedings of PAC2005, Knoxville, TN, USA, 2005, FOAC005, pp. 443–445.
[12] G.W. Dodson, Accelerator systems RAM analysis, Talk in Accelerator Reliability Workshop, 2002, http://www.esrf.eu/files/live/sites/www/files/events/conferences/2002/ARW/proceedings/MONPM/Dodson.pdf.
[13] M.J. Haire, Computation of Normal Conducting and Superconducting Linear Accelerator (Linac) Availabilities, ORNL, USA, Tech. Report, ORNL/TM-2000/93, 2000, https://www.osti.gov/biblio/885853-yUWMiH/.
[14] PIP-II Conceptual Design Report, 2017, http://pip2-docdb.fnal.gov/cgi-bin/ShowDocument?docid=113.
[15] E.L. Hubbard, Booster Synchrotron Report, 1973, https://lss.fnal.gov/archive/tm/TM-0405.pdf.
[16] L. Merminga, PIP-II Global Requirements Document, FNAL, USA, PIP-II Document 1166-v8, ED0001222, 2020, https://pip2-docdb.fnal.gov/cgi-bin/RetrieveFile?docid=1166&filename=ED0001222%20PIP-II%20Global%20Requirements%20Document%20GRD.pdf&version=8.
[17] A. Shemyakin, M. Alvarez, R. Andrews, J.-P. Carneiro, A. Chen, R. D’Arcy, B. Hanna, L. Prost, V. Scarpine, C. Wiesner, PIP-II injector test’s low energy beam transport: Commissioning and selected measurements, AIP Conf. Proc. 1869 (050003) (2017).
[18] S. Virostek, et al., Final design of a CW Radio Frequency Quadrupole (RFQ) for the Project X Injector Experiment (PXIE), in: Proc. NAPAC’13, Pasadena, CA, USA, 2013, WEPMA21, pp. 1025–1027.
[19] A. Saini, C.M. Baffes, A.Z. Chen, V.A. Lebedev, L. Prost, A. Shemyakin, Design of PIP-II medium energy beam transport beam, in: Proc. of IPAC 2018, Vancouver, Canada, 2018, TUPAF076, pp. 905–908.
[20] Z.A. Conway, et al., IOP Conf. Ser.: Mater. Sci. Eng. 101 (2015) 012019.
[21] M.H. Awida, et al., Development of low single-spoke resonators for the front end of the proton improvement plan-II at Fermilab, IEEE Trans. Nucl. Sci. 64 (9) (2017) 2450–2464.
[22] V. Roger, et al., Design Update of the SSR1 Cryomodule for PIP-II Project, in: Proceedings of IPAC2018, Vancouver, Canada, 2018, WEPML019, pp. 2721–2723.
[23] A. Rowe, SRF Technology for PIP-II and PIP-III, in: Proc. SRF2017, Lanzhau, China, 2017.
[24] A. Saini, Design considerations for the Fermilab PIP-II 800 MeV Superconducting Linac, in: Proc. of NA-PAC 2016, Chicago, USA, 2016, WEPOA60.
[25] D.J. Smith, Reliability, Maintainability and Risk, Elsevier Ltd, 2011, http://dx.doi.org/10.1016/C2010-0-66333-4.
[26] M. Rausand, A. Høyland, System Reliability Theory: Models, Statistical Methods, and Applications, second ed., John Wiley & Sons, Inc.
[27] J. Upadhyay, D. Im, J. Peshl, M. Bašović, S. Popović, A.M. Valente-Feliciano, et al., Apparatus and method for plasma processing of SRF cavities, Nucl. Instrum. Methods Phys. Res. A 818 (2016) 76–81, http://dx.doi.org/10.1016/j.nima.2016.02.049.
[28] A. Saini, J.-F. Ostiguy, N. Solyak, V.P. Yakovlev, Studies of fault scenarios in SC CW Project-X linac, in: Proceedings of NA-PAC2013, Pasadena, California, USA, 2013, MOPMA10, pp. 318–320.
[29] A. Saini, N. Solyak, V.P. Yakovlev, S. Mishra, K. Ranjan, Study of effects of failure of beamline elements and its compensation in CW superconducting linac, in: Proceedings of IPAC2012, New Orleans, Louisiana, USA, 2012, pp. 1173–1175.
[30] P.F. Derwent, J.-P. Carniero, J. Edelen, V. Lebedev, L. Prost, A. Saini, A. Shemyakin, J. Steimel, PIP-II Injector Test: challenges and status, in: Proc. of LINAC’16, East Lansing, MI, USA, September 25–30, 2016, WE1A01.
[31] A. Vivoli, J. Hunt, D.E. Johnson, V. Lebedev, Transfer Line Design for PIP-II Project, in: Proceedings of IPAC2015, Richmond, VA, USA, 2015, THPF119, pp. 3989–3991.
[32] https://www.reliasoft.com/products/blocksim-system-reliability-availability-maintainability-ram-analysis-software.
[33] S. Henderson, et al., The Spallation Neutron Source Beam Commissioning and Initial Operations, ORNL, USA, Tech. Report, ORNL/TM-2015/321, 2015, https://info.ornl.gov/sites/publications/files/Pub56465.pdf.
[34] S.-H. Kim, R. Afanador, W. Blokland, M. Champion, A. Coleman, M. Crofford, et al., The status of the superconducting linac and SRF activities at the SNS, in: Proceedings of the 16th International Conference on RF superconductivity, Paris, France, September 23–27, 2013, pp. 83–88, http://accelconf.web.cern.ch/AccelConf/SRF2013/papers/mop007.PDF.
[35] S.H. Kim, et al., Overview of ten-year operation of the superconducting linear accelerator at the Spallation Neutron Source, Nucl. Instrum. Methods Phys. Res. A (ISSN: 0168-9002) 852 (2017) 20–32, http://dx.doi.org/10.1016/j.nima.2017.02.009.
[36] M. Convery, et al., The PIP-II Preliminary Design Report, PIP-II Document 2261-v33, 2020, https://pip2-docdb.fnal.gov/cgi-bin/ShowDocument?docid=2261.
[37] T.P. Wangler, RF Linear Accelerator, second ed., Wiley-VCH Verlag GmbH & Co., 2008.
[38] A. Sukhanov, A. Lunin, V. Yakovlev, M. Awida, M. Champion, C. Ginsburg, I. Gonin, C. Grimm, T. Khabiboulline, T. Nicol, Yu. Orlov, A. Saini, D. Sergatskiv, N. Solyak, A. Vostrikov, Higher order modes in project-X Linac, Nucl. Instrum. Methods Phys. Res. A 734 (part A) (2014) 9–22, http://dx.doi.org/10.1016/j.nima.2013.06.113.