2013 International Conference on Field-Programmable Technology (FPT), 2013
While we reap the benefits of process scaling in terms of transistor density and switching speed, consideration must be given to the negative effects it causes: increased variation, degradation and fault susceptibility. Above device level, such phenomena and the faults they induce can lead to reduced yield, decreased system reliability and, in extreme cases, total failure after a period of successful operation. Although error detection and correction are almost always considered for highly sensitive and susceptible applications such as those in space, for other, more general-purpose applications they are often overlooked. In this paper, we present a parallel matrix multiplication accelerator running in hardware on the Xilinx Zynq system-on-chip platform, along with 'bolt-on' logic for detecting, locating and avoiding faults within its datapath. Designs of various sizes are compared with respect to resource overhead and performance impact. Our largest implemented fault-tolerant accelerator was found to consume 17.3% more area, run at a 3.95% lower frequency and incur an 18.8% execution time penalty over its equivalent fault-susceptible design during fault-free operation.
2016
As the threat of fault susceptibility caused by mechanisms including variation and degradation increases, engineers must give growing consideration to error detection and correction. While common fault tolerance strategies frequently incur significant overheads in area, performance and/or power consumption, options exist that buck these trends. In particular, algorithm-based fault tolerance (ABFT) embodies a proven family of low-overhead error mitigation techniques that can be built upon to create self-verifying circuitry. In this paper, we present our research into the application of ABFT in FPGA-implemented accelerators at reduced levels of precision. This allows for the introduction of a previously unexplored tradeoff: sacrificing the observability of faults associated with low-magnitude errors for gains in area, performance and efficiency by reducing the bit-widths of logic used for error detection. We describe the implementat...
2011 IEEE Ninth International Symposium on Parallel and Distributed Processing with Applications, 2011
Commercial graphics processing units (GPUs) have proven to be attractive, inexpensive options for high-performance scientific applications. However, recent research [1] conducted through Folding@home demonstrates that two-thirds of the GPUs tested on Folding@home exhibit a detectable, pattern-sensitive rate of memory soft errors under GPGPU workloads. Fault tolerance is therefore viewed as critical to the effective use of these GPUs. In this paper, we present an on-line GPU error detection, location, and correction method to incorporate fault tolerance into matrix multiplication. The main contribution of the paper is to extend traditional algorithm-based fault tolerance (ABFT) from offline to online operation and apply it to matrix multiplication on GPUs. The proposed on-line fault tolerance mechanism detects soft errors in the middle of the computation so that better reliability can be achieved by correcting corrupted computations in time. Experimental results demonstrate that the proposed method is highly efficient.
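To make the checksum relationship behind ABFT matrix multiplication concrete, the sketch below (Python/NumPy; the matrix sizes and the artificially injected error are illustrative and not taken from the paper) shows the classical Huang-Abraham encoding that online schemes of this kind build on: a column-checksum copy of A and a row-checksum copy of B are multiplied, and a single corrupted result element is detected, located at the intersection of the failing row and column checks, and corrected from the checksum difference.

```python
import numpy as np

def abft_matmul(A, B):
    """Multiply checksum-augmented matrices (classical Huang-Abraham ABFT encoding)."""
    Ac = np.vstack([A, A.sum(axis=0)])                # column-checksum A: extra row of column sums
    Br = np.hstack([B, B.sum(axis=1, keepdims=True)]) # row-checksum B: extra column of row sums
    return Ac @ Br                                    # full checksum matrix, shape (n+1) x (n+1)

def detect_locate_correct(Cf):
    """Verify the checksum relations; correct a single corrupted element in place."""
    data = Cf[:-1, :-1]
    row_err = np.flatnonzero(~np.isclose(data.sum(axis=1), Cf[:-1, -1]))
    col_err = np.flatnonzero(~np.isclose(data.sum(axis=0), Cf[-1, :-1]))
    if row_err.size == 0 and col_err.size == 0:
        return data, None                             # no error detected
    if row_err.size == 1 and col_err.size == 1:
        i, j = row_err[0], col_err[0]                 # intersection locates the faulty element
        data[i, j] += Cf[i, -1] - data[i, :].sum()    # correct via the checksum difference
        return data, (i, j)
    raise RuntimeError("multiple errors detected: re-run the affected computation")

# Illustrative run with one injected soft error.
rng = np.random.default_rng(0)
A, B = rng.random((4, 4)), rng.random((4, 4))
Cf = abft_matmul(A, B)
Cf[2, 1] += 0.5                                       # simulate a corrupted result element
C, loc = detect_locate_correct(Cf)
assert np.allclose(C, A @ B) and loc == (2, 1)
```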
22nd International Conference on Field Programmable Logic and Applications (FPL), 2012
Commercial SRAM-based, field-programmable gate arrays (FPGAs) have the capability to provide space applications with the necessary performance, energy-efficiency, and adaptability to meet next-generation mission requirements. However, mitigating an FPGA's susceptibility to radiation-induced faults is challenging. Triple-modular redundancy (TMR) techniques are traditionally used to mitigate radiation effects, but TMR incurs substantial overheads such as increased area and power requirements. In order to reduce these overheads while still providing sufficient radiation mitigation, we propose the use of algorithm-based fault tolerance (ABFT). We investigate the effectiveness of hardware-based ABFT logic in COTS FPGAs by developing multiple ABFT-enabled matrix multiplication designs, carefully analyzing resource usage and reliability tradeoffs, and proposing design modifications for higher reliability. We perform fault-injection testing on a Xilinx Virtex-5 platform to validate these ABFT designs, measure design vulnerability, and compare ABFT effectiveness to other fault-tolerance methods. Our hybrid ABFT design reduces total design vulnerability by 99% while only incurring 25% overhead over a baseline, non-protected design.
1991
Reliability of a computer architecture may be increased through fault tolerance. However, fault tolerance is achieved at the price of decreased throughput. The Fault Tolerant Parallel Processor at the Charles Stark Draper Laboratory maintains high levels of reliability and throughput by combining the technologies of fault tolerance and parallel processing. The architecture is based on a Network Element (NE), which performs the functions of fault tolerance and parallel processing. A design for two field-programmable gate arrays (FPGAs) is proposed herein which will replace much of the NE and perform the communication, synchronization, and redundancy management functions within the NE. This will yield increased reliability, reduced size, and reduced power dissipation. These FPGAs will be integrated with the next implementation of the Fault Tolerant Parallel Processor. (Thesis supervised by Thomas F. Knight, Associate Professor of Electrical Engineering.)
2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 2018
Machine Learning (ML) is making a strong resurgence in step with the massive generation of unstructured data, which in turn requires massive computational resources. Due to the inherently compute- and power-intensive structure of Neural Networks (NNs), hardware accelerators emerge as a promising solution. However, with technology nodes scaling below 10 nm, hardware accelerators become more susceptible to faults, which in turn can impact NN accuracy. In this paper, we study the resilience aspects of Register-Transfer Level (RTL) models of NN accelerators, in particular fault characterization and mitigation. Following a High Level Synthesis (HLS) approach, we first characterize the vulnerability of various components of the RTL NN. We observed that the severity of faults depends on both i) application-level specifications, i.e., NN data (inputs, weights, or intermediate values), NN layers, and NN activation functions, and ii) architectural-level specifications, i.e., the data representation model and the parallelism degree of the underlying accelerator. Second, motivated by the characterization results, we present a low-overhead fault mitigation technique that can efficiently correct bit flips, performing 47.3% better than state-of-the-art methods.
Defect and Fault Tolerance in VLSI Systems, 1989
This paper addresses the important fault-tolerance issue for arrays with a large number of processors. An array grid model based on single-track switches is adopted. Single-track switches require less hardware overhead and suffer less from possible switch faults. More significantly, we are able to establish a very useful necessary and sufficient condition for the reconfigurability of the array. This is the theoretical footing for two reconfiguration algorithms: one adopts global control for (fabrication-time) yield enhancement and the other is a distributed scheme for (run-time) reliability improvement. For the fabrication-time reconfiguration algorithm, the task can be reformulated as a maximum independent set problem. An existing algorithm from graph theory is adopted to solve this problem effectively. The simulations conducted indicate that the algorithm is computationally very efficient; therefore, it is also very suitable for compile-time fault tolerance. In contrast, for the real-time reconfiguration algorithm, a distributed method is more suitable for (asynchronous) array processors. The algorithm has several important features: (1) it is executed in a distributed fashion by the processing elements (PEs); (2) no global information is required by the individual PEs; (3) the time overhead for reconfiguration is independent of the array size; (4) transient faults are handled by retries or by deactivating/reactivating the temporarily failed PE. Based on simulations, the performance of the algorithms and the tradeoffs between fault-tolerance capability and hardware complexity for various kinds of spare-PE distributions are evaluated.
Applied Sciences
Deep learning technology has enabled the development of increasingly complex safety-related autonomous systems using high-performance computers, such as GPU, which provide the required high computing performance for the execution of parallel computing algorithms, such as matrix–matrix multiplications (a central computing element of deep learning software libraries). However, the safety certification of parallel computing software algorithms and GPU-based safety-related systems is a challenge to be addressed. For example, achieving the required fault-tolerance and diagnostic coverage for random hardware errors. This paper contributes with a safe matrix–matrix multiplication software implementation for GPUs with random hardware error-detection capabilities (permanent, transient) that can be used with different architectural patterns for fault-tolerance, and which serves as a foundation for the implementation of safe deep learning libraries for GPUs. The proposed contribution is comple...
Microprocessors and Microsystems, 2014
International Journal of Modern Education and Computer Science, 2017
This paper presents a new fault-tolerant architecture for floating-point multipliers in which the fault-tolerance capability is achieved at the cost of reduced output precision. In this approach, to obtain the fault-tolerant floating-point multiplier, the hardware cost of the primary design is reduced by lowering the output precision. Appropriate redundancy is then utilized to provide error detection/correction in such a way that the overall required hardware remains almost the same as that of the primary multiplier. The proposed multiplier can tolerate a variety of permanent and transient faults given the reduced precisions acceptable in many applications. The implementation results reveal that 17-bit and 14-bit mantissas are enough to obtain a floating-point multiplier with error detection or error correction, respectively, instead of the 23-bit mantissa in the IEEE-754 standard-based multiplier, with only a few percent area and power overheads.
VLSI Design, 1996
In order to reduce cost and achieve high speed, a new hardware accelerator for fault simulation has been designed. The architecture of the new accelerator is based on a reconfigurable mesh-type processing element (PE) array. Circuit elements at the same topological level are simulated concurrently, as in a pipelined process. A new parallel simulation algorithm expands all gates into two-input gates in order to limit the number of faults to two at each gate, so that the faults can be distributed uniformly throughout the PE array. The PE array reconfiguration operation provides a simulation speed advantage by maximizing the use of each PE cell.
IRJET, 2020
Parallel matrix processing is a common operation in many systems, and in particular matrix-vector multiplication (MVM) is one of the most widely used operations in modern digital signal processing and digital communication systems. This paper proposes a fault-tolerant scheme for integer parallel MVMs. The scheme combines ideas from error correction codes with the self-checking capability of MVM. Field-programmable gate array evaluation shows that the proposed scheme can significantly reduce the overheads compared to protecting each MVM on its own. Therefore, the proposed technique can be used to reduce the cost of providing fault tolerance in practical implementations.
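As a rough illustration of the self-checking property such a scheme builds on (a minimal sketch only, not the paper's combination with error-correcting codes; the matrix values and the injected fault below are made up), the consistency test for an integer MVM y = A x is that the sum of the outputs must equal (1^T A) x, where the checksum row 1^T A is computed once per matrix and reused across multiplications:

```python
import numpy as np

def protected_mvm(A, x, checksum_row=None):
    """Integer matrix-vector multiply with a single checksum-based consistency test."""
    if checksum_row is None:
        checksum_row = A.sum(axis=0)       # 1^T A, precomputed once per matrix
    y = A @ x
    # For a fault-free multiply, sum(y) == (1^T A) x holds exactly in integer arithmetic.
    ok = int(y.sum()) == int(checksum_row @ x)
    return y, ok

A = np.arange(16, dtype=np.int64).reshape(4, 4)
x = np.array([1, -2, 3, 4], dtype=np.int64)
y, ok = protected_mvm(A, x)
print(ok)                                              # True on a fault-free run
y_bad = y.copy(); y_bad[1] ^= 8                        # emulate a bit flip in one output word
print(int(y_bad.sum()) == int(A.sum(axis=0) @ x))      # False: error detected
```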
This paper presents a systematic approach for designing one class of fault-tolerant systolic arrays with orthogonal interconnects and unidirectional data flow (OUSA) for the multiplication of rectangular matrices. The method employs space-time redundancy to achieve fault tolerance. It consists of four steps. In the first step, the inner computation space of the basic systolic algorithm for matrix multiplication is expanded. In the second step, we derive a matrix multiplication algorithm which enables us to obtain OUSAs with data pipeline period λ = 3. During the third step, redundancy is introduced by deriving three equivalent algorithms with disjoint index spaces. In the last step, the obtained algorithm is mapped into a processor-time domain. In this way we obtain four different OUSAs. For the given matrix dimensions, two out of the four arrays have an optimal number of processing elements (PEs) and minimal execution time. For the square case, all arrays have an optimal number of PEs, Ω = n(n + 2), and a total execution time of T_tot = 6n. All of them can tolerate single transient errors and the majority of multiple error patterns with high probability. In addition, two of the arrays can tolerate permanent faults as well. The obtained arrays are suitable for implementation in VLSI technology. Compared to a hexagonal array of the same dimensions, the number of I/O pins is reduced by approximately 30%.
VLSI Design, 2013
This paper examines fault-tolerant adder designs implemented on FPGAs which are inspired by the methods of modular redundancy, roving, and gradual degradation. A parallel-prefix adder based upon the Kogge-Stone configuration is compared with the simple ripple-carry adder (RCA) design. The Kogge-Stone design utilizes a sparse carry tree complemented by several smaller RCAs. Additional RCAs are inserted into the design to allow fault tolerance to be achieved using the established methods of roving and gradual degradation. A triple modular redundant ripple-carry adder (TMR-RCA) is used as a point of reference. Simulation and experimental measurements on a Xilinx Spartan 3E FPGA platform are carried out. The TMR-RCA is found to have the best delay performance and most efficient resource utilization for an FPGA fault-tolerant implementation, due to the simplicity of the approach and the use of the fast-carry chain. However, the superior performance of the carry-tree adder over an RCA in a...
Nanotechnology, 2003
Both von Neumann's NAND multiplexing, based on a massive duplication of imperfect devices and randomized imperfect interconnects, and reconfigurable architectures have been investigated as solutions for the integration of highly unreliable nanometre-scale devices. In this paper, we review these two techniques and present a defect- and fault-tolerant architecture in which von Neumann's NAND multiplexing is combined with a massively reconfigurable architecture. The system performance of this architecture is evaluated by studying its reliability, i.e. the probability of system survival. Our evaluation shows that the suggested architecture can tolerate a device error rate of up to 10^-2 with multiple redundant components; the structure is robust against both permanent and transient faults for an ultra-large integration of highly unreliable nanometre-scale devices.
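As a simplified illustration of why massive duplication helps (a plain majority-redundancy estimate under an i.i.d. fault model, not von Neumann's full NAND-multiplexing analysis; the redundancy factors are arbitrary), one can compute the probability that a majority of redundant copies remain fault-free at the device error rate of 10^-2 mentioned above:

```python
from math import comb

def majority_survival(eps, n_copies):
    """P(a majority of n_copies redundant units are fault-free), each failing
    independently with probability eps. A simplified stand-in for the bundle-level
    reliability analysed in NAND multiplexing."""
    need = n_copies // 2 + 1
    return sum(comb(n_copies, k) * (1 - eps) ** k * eps ** (n_copies - k)
               for k in range(need, n_copies + 1))

# Illustrative redundancy factors at a device error rate of 1e-2.
for n in (3, 9, 33):
    print(n, round(majority_survival(1e-2, n), 10))
```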
IEEE Design & Test
Neural networks are a popular choice for accurately performing complex classification tasks. In edge applications, neural network inference is accelerated on embedded hardware platforms, which often utilise FPGA-based architectures due to their low power and flexible parallelism. A significant number of applications require hardware that is resilient against faults and compliant with safety standards. In this work, we present Selective TMR, an automated tool which analyses how sensitive the overall network accuracy is to individual computations within neural network inference. The tool then triplicates the most sensitive computations, which increases the functional safety of the neural network accelerator without resorting to full triple modular redundancy (TMR). As a result, this allows designers to explore the trade-off between accelerator reliability and hardware cost. In some cases, we see a 24% improvement in minimum accuracy under a single stuck-at hardware fault, while increasing the overall resource footprint by 56%.
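A minimal software analogue of the selective-triplication idea is sketched below (the sensitivity analysis performed by the tool is not reproduced; the operations, the sensitivity flags and the voting fallback are illustrative assumptions): only computations flagged as sensitive are run three times and resolved by majority vote, while the rest run once.

```python
from collections import Counter

def vote(values):
    """Bitwise-exact majority vote over three redundant results."""
    winner, count = Counter(values).most_common(1)[0]
    return winner if count >= 2 else values[0]   # fall back if all three disagree

def selective_tmr(ops, sensitive):
    """Run each operation once, or triplicated with voting if flagged as sensitive.
    `ops` is a list of zero-argument callables; `sensitive` is a parallel list of bools."""
    results = []
    for op, is_sensitive in zip(ops, sensitive):
        results.append(vote([op(), op(), op()]) if is_sensitive else op())
    return results

# Illustrative: only the first (sensitive) computation pays the 3x redundancy cost.
ops = [lambda: 2 * 3, lambda: 4 + 5]
print(selective_tmr(ops, sensitive=[True, False]))   # [6, 9]
```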
IEEE Design and Test of Computers, 2004
Editors' note: FPGAs have become prevalent in critical applications in which transient faults can seriously affect the circuit's operation. This article presents a fault-tolerance technique for transient and permanent faults in SRAM-based FPGAs. The technique combines duplication with comparison (DWC) and concurrent error detection (CED) to provide a highly reliable circuit while maintaining hardware, pin, and power overheads far lower than with classic triple-modular-redundancy techniques.

ICs are sensitive to upsets that occur in aerospace. More recently, ICs have also become sensitive to upsets at ground level because of the continual evolution of semiconductor fabrication technology. Drastic device shrinkage, power supply reduction, and increasing operating speeds significantly reduce noise margins and thus reliability because of the internal noise sources that very deep-submicron ICs face [1]. This trend is approaching a point at which it will be infeasible to produce ICs that are free from these effects. Consequently, fault tolerance is no longer a matter exclusively for aerospace designers; it's important for the designers of next-generation ground-level products as well. FPGAs are popular for design solutions because they improve logic density and performance for many applications. SRAM-based FPGAs, in particular, are highly flexible because they are reprogrammable, allowing on-site design changes. However, because the reprogrammability leads to a high logic density in terms of SRAM memory cells, SRAM-based FPGAs are also sensitive to radiation and require protection to work in harsh environments [2]. Our high-level fault tolerance technique combines time and hardware redundancy to cope with upsets in SRAM-based FPGAs. This technique reduces the number of I/O pads, and therefore power dissipation, in the interface compared to the well-known triple modular redundancy (TMR) solution. Our goal is to reduce the hardware overhead (which in TMR is three times the original area of the unprotected design) to close to twice the original area, maintaining the same reliability and consequently reducing power dissipation. We've evaluated our technique in two types of circuits: multipliers and digital filters.

Radiation effects on SRAM-based FPGAs: A radiation environment contains various charged particles, generated by solar activity, that interact with silicon atoms, exciting and ionizing the atomic electrons [3]. At ground level, neutrons are the most frequent causes of upsets [4]. When a single heavy ion strikes the silicon, it loses its energy through the production of free electron-hole pairs, resulting in a dense ionized track in the local region. Protons and neutrons can cause a nuclear reaction when passing through the material. The recoil also produces ionization, generating a transient current pulse that can cause an upset in the circuit. A single particle can hit either the combinational or the sequential logic in the silicon [5]. When a charged particle strikes a memory cell's sensitive nodes, such as the drain of an off-state transistor, it generates a transient current pulse that can mistakenly turn on the opposite transistor.
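A minimal software sketch of the duplication-with-comparison idea follows (the re-execution-based diagnosis here is an illustrative assumption standing in for the article's CED block, not necessarily how that block works): two copies of a module are compared, and on a mismatch a time-redundant recomputation is used to decide which copy to trust.

```python
def dwc_ced(copy_a, copy_b, x):
    """Duplication with comparison plus a simple time-redundancy diagnosis.
    copy_a and copy_b model two redundant hardware modules computing the same function."""
    ra, rb = copy_a(x), copy_b(x)
    if ra == rb:
        return ra                          # outputs agree: pass the result through
    # Mismatch: recompute both copies. A transient fault does not repeat, so the
    # copy whose result changed on re-execution is the one that was upset.
    ra2, rb2 = copy_a(x), copy_b(x)
    if ra == ra2 and rb != rb2:
        return ra                          # copy B was hit by a transient upset
    if rb == rb2 and ra != ra2:
        return rb                          # copy A was hit by a transient upset
    # Both copies are stable but disagree: consistent with a permanent fault,
    # which requires a stronger diagnosis step (e.g. the article's CED) or repair.
    raise RuntimeError("persistent mismatch: escalate to further diagnosis")
```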
Journal of Parallel and Distributed Computing, 2009
We present a new approach to fault tolerance for High Performance Computing systems. Our approach is based on a careful adaptation of the Algorithm-Based Fault Tolerance technique [K. Huang, J. Abraham, Algorithm-based fault tolerance for matrix operations, IEEE Transactions on Computers (Spec. Issue Reliable & Fault-Tolerant Comp.) 33 (1984) 518-528] to the needs of parallel distributed computation.
2020 IEEE International Electron Devices Meeting (IEDM)
Despite great promise shown in the laboratory environment, matrix multiplication accelerators based on memristor crossbars (non-volatile resistive analog memory) suffer from unexpected computing errors, limiting opportunities to replace mainstream digital systems. While many previously demonstrated applications, such as neural networks, are tolerant of small errors, they are challenged by any significant outliers, which must be detected and corrected. Herein, we experimentally demonstrate an analog Error Correcting Code (ECC) scheme that considerably reduces the chance of substantial errors by detecting and correcting them with minimum hardware overhead. Different from the well-known digital ECC used in communication and memory, this analog version can tolerate small errors while detecting and correcting those over a predefined threshold. With this scheme, we can experimentally recover the MNIST handwritten digit classification accuracy from 90.31% to 96.21% in the event an array builds up shorted devices, and from 73.12% to 97.36% when current noise is injected. For applications where high reliability and compute precision are demanded, such as in high-performance and scientific computing, we expect the schemes shown here to make analog computing more feasible.
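To illustrate the tolerate-small/correct-large policy in software (a purely illustrative model: the weighted-checksum encoding, threshold, noise level and injected outlier below are assumptions, not the paper's in-crossbar analog scheme), small read noise is left untouched while a single large outlier in an analog-style MVM is located and corrected:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
W, x = rng.random((n, n)), rng.random(n)

# Checksum references, computed once per programmed matrix.
w2 = np.arange(1, n + 1, dtype=float)            # weights used to locate an outlier
c1 = W.sum(axis=0) @ x                           # expected plain sum of the outputs
c2 = (w2 @ W) @ x                                # expected weighted sum of the outputs

# Analog read: small noise on every output plus one large outlier (e.g. a shorted device).
y = W @ x + rng.normal(0.0, 0.002, n)
y[5] += 0.8

tau = 0.05                                       # tolerance for benign analog noise
d1 = y.sum() - c1
d2 = w2 @ y - c2
if abs(d1) > tau:                                # only significant outliers trigger correction
    k = int(round(d2 / d1)) - 1                  # weighted/plain ratio locates the faulty output
    y[k] -= d1                                   # remove the estimated error magnitude
print(np.max(np.abs(y - W @ x)))                 # residual error is back near the noise floor
```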
2008
We present a new approach to fault tolerance for High Performance Computing systems. Our approach is based on a careful adaptation of the Algorithm-Based Fault Tolerance technique to the needs of parallel distributed computation. We obtain a strongly scalable mechanism for fault tolerance. We can also detect and correct errors (bit flips) on the fly during a computation. To assess the viability of our approach, we have developed a fault-tolerant matrix-matrix multiplication subroutine and we propose some models to predict its running time. Our parallel fault-tolerant matrix-matrix multiplication scores 1.4 TFLOPS on 484 processors (cluster jacquard.nersc.gov) and returns a correct result even when one process failure has occurred. This represents 65% of the machine's peak efficiency and less than 12% overhead with respect to the fastest failure-free implementation. We predict (and have observed) that, as we increase the processor count, the overhead of the fault tolerance drops significantly.
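As a toy serial model of the recovery step such a scheme enables (the block sizes, the single "lost" process and the NumPy formulation are illustrative; the actual subroutine distributes blocks over many processes), a checksum row block of C allows the result block held by a failed process to be rebuilt from the surviving blocks:

```python
import numpy as np

# Toy model: C = A @ B is partitioned into row blocks, one block per "process",
# plus a checksum block held by an extra process. If one process is lost, its
# block of C is rebuilt from the survivors and the checksum.
rng = np.random.default_rng(2)
p, blk, n = 4, 2, 8                                  # 4 processes, 2 rows of C each
A, B = rng.random((n, n)), rng.random((n, n))

C_blocks = [A[i*blk:(i+1)*blk] @ B for i in range(p)]       # per-process results
C_check = sum(A[i*blk:(i+1)*blk] for i in range(p)) @ B     # checksum process result

lost = 1                                              # process 1 fails mid-run
C_blocks[lost] = None

# Recovery: the checksum block equals the sum of all row blocks of C,
# so the missing block is the checksum minus the surviving blocks.
recovered = C_check - sum(b for i, b in enumerate(C_blocks) if i != lost)
assert np.allclose(recovered, A[lost*blk:(lost+1)*blk] @ B)
```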
2008 Fourth Advanced International Conference on Telecommunications, 2008
The approach described in this paper uses an array of Field Programmable Gate Array (FPGA) devices to implement a fault-tolerant hardware system that can be compared to running fault-tolerant software on a traditional processor. Fault tolerance is achieved by using FPGAs with an on-the-fly partial reprogrammability feature. Major considerations when mapping to the FPGA include the size of the area to be mapped and the communication issues between the mapped modules. Area size selection is compared to page-size selection in operating system design. Communication issues between modules are compared to the software engineering paradigms dealing with module coupling, fan-in, fan-out and cohesiveness. Finally, the overhead associated with downloading the reconfiguration files is discussed.