Papers by Peter Y K Cheung

As the threat of fault susceptibility caused by mechanisms including variation and degradation increases, engineers must give growing consideration to error detection and correction. While common fault tolerance strategies frequently incur significant overheads in area, performance and/or power consumption, options exist that buck these trends. In particular, algorithm-based fault tolerance embodies a proven family of low-overhead error mitigation techniques that can be built upon to create self-verifying circuitry. In this paper, we present our research into the application of algorithm-based fault tolerance (ABFT) in FPGA-implemented accelerators at reduced levels of precision. This allows for the introduction of a previously unexplored tradeoff: sacrificing the observability of faults associated with low-magnitude errors for gains in area, performance and efficiency by reducing the bit-widths of the logic used for error detection. We describe the implementat...
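As a rough illustration of the tradeoff described above, the sketch below models a column-checksum ABFT test for matrix multiplication in which the checker only observes error bits at or above a configurable threshold. Everything here (the NumPy behavioural model, the `detect_bits` parameter, the masking rule) is an illustrative assumption, not the paper's actual reduced-bit-width hardware.

```python
import numpy as np

def abft_check(A, B, C, detect_bits=0):
    """Column-checksum ABFT test for C = A @ B over integer matrices.

    Fault-free, (e^T A) B == e^T C exactly.  Comparing only the bits at or
    above `detect_bits` models a reduced-precision checker: faults whose
    effect on a checksum stays below 2**detect_bits are deliberately
    unobservable, trading coverage for cheaper detection logic.
    """
    lhs = A.sum(axis=0) @ B          # checksum row of A, propagated through B
    rhs = C.sum(axis=0)              # column sums of the computed product
    return bool(np.all((np.abs(lhs - rhs) >> detect_bits) == 0))

rng = np.random.default_rng(0)
A = rng.integers(0, 256, (8, 8))
B = rng.integers(0, 256, (8, 8))
C = A @ B
C[3, 5] += 7                         # low-magnitude fault: masked at 4 bits
assert abft_check(A, B, C, detect_bits=4)
C[3, 5] += 1 << 16                   # high-magnitude fault: detected
assert not abft_check(A, B, C, detect_bits=4)
```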

Proceedings of the 5th International Workshop on Embedded and Mobile Deep Learning, 2021
The ever-growing computational demands of increasingly complex machine learning models frequently necessitate the use of powerful cloud-based infrastructure for their training. Binary neural networks are known to be promising candidates for on-device inference due to their extreme compute and memory savings over higher-precision alternatives. However, their existing training methods require the concurrent storage of high-precision activations for all layers, generally making learning on memory-constrained devices infeasible. In this paper, we demonstrate that the backward propagation operations needed for binary neural network training are strongly robust to quantization, thereby making on-the-edge learning with modern models a practical proposition. We introduce a low-cost binary neural network training strategy exhibiting sizable memory footprint and energy reductions while inducing little to no accuracy loss vs Courbariaux & Bengio's standard approach. These resource decreases are primarily enabled through the retention of activations exclusively in binary format. Against the latter algorithm, our drop-in replacement sees coincident memory requirement and energy consumption drops of 2-6×, while reaching similar test accuracy in comparable time, across a range of small-scale models trained to classify popular datasets. We also demonstrate from-scratch ImageNet training of binarized ResNet-18, achieving a 3.12× memory reduction. Such savings will allow for unnecessary cloud offloading to be avoided, reducing latency, increasing energy efficiency and safeguarding privacy.
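Where the memory goes can be seen in the standard binarizing activation. Below is a minimal PyTorch sketch of the Courbariaux & Bengio-style sign activation with its straight-through estimator (STE); the comment marks the full-precision tensor whose storage the paper's method reduces to binary. The class name and setup are illustrative, not the authors' code.

```python
import torch

class BinarySign(torch.autograd.Function):
    """Binarizing activation with a straight-through estimator (STE).

    Standard BNN training keeps the full-precision input x around for the
    backward clip; the abstract's key observation is that backward
    propagation tolerates aggressive quantization, so storing activations
    in binary form suffices.  This sketch shows the standard baseline.
    """
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)        # 32-bit save: what the paper shrinks to ~1 bit
        return torch.sign(x)

    @staticmethod
    def backward(ctx, g):
        (x,) = ctx.saved_tensors
        return g * (x.abs() <= 1).to(g.dtype)   # STE: zero gradient where |x| > 1

x = torch.randn(4, 8, requires_grad=True)
BinarySign.apply(x).sum().backward()
print(x.grad.shape)
```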
International Conference on Field Programmable Logic and Applications, 2005.
This paper presents a revised model for the yield analysis of FPGA interconnect layers. Based on proven yield models, this work improves the predictions and assumptions of previously reported analyses. The model is then applied to three well-known yield improvement schemes to quantify the enhancement offered by each.

2013 International Conference on Field-Programmable Technology (FPT), 2013
While we reap the benefits of process scaling in terms of transistor density and switching speed, consideration must be given to the negative effects it causes: increased variation, degradation and fault susceptibility. Above the device level, such phenomena and the faults they induce can lead to reduced yield, decreased system reliability and, in extreme cases, total failure after a period of successful operation. Although error detection and correction are almost always considered for highly sensitive and susceptible applications such as those in space, for other, more general-purpose applications they are often overlooked. In this paper, we present a parallel matrix multiplication accelerator running in hardware on the Xilinx Zynq system-on-chip platform, along with 'bolt-on' logic for detecting, locating and avoiding faults within its datapath. Designs of various sizes are compared with respect to resource overhead and performance impact. Our largest implemented fault-tolerant accelerator was found to consume 17.3% more area, run at a 3.95% lower frequency and incur an 18.8% execution time penalty over its equivalent fault-susceptible design during fault-free operation.
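A behavioural sketch of the detect-and-locate step, assuming the 'bolt-on' logic uses classic row- and column-checksum ABFT (the abstract does not spell out the scheme, so treat the details as an assumption):

```python
import numpy as np

def locate_fault(A, B, C):
    """Row- and column-checksum ABFT: detect and locate a single faulty
    element of C = A @ B.  Returns None if the checks pass, else (row, col).
    A behavioural model of the 'bolt-on' checker, not the RTL itself.
    """
    col_err = A.sum(axis=0) @ B - C.sum(axis=0)   # (e^T A) B vs e^T C
    row_err = A @ B.sum(axis=1) - C.sum(axis=1)   # A (B e) vs C e
    bad_cols = np.flatnonzero(col_err)
    bad_rows = np.flatnonzero(row_err)
    if bad_cols.size == 0 and bad_rows.size == 0:
        return None
    return int(bad_rows[0]), int(bad_cols[0])     # single-fault assumption

rng = np.random.default_rng(1)
A = rng.integers(0, 100, (6, 6))
B = rng.integers(0, 100, (6, 6))
C = A @ B
C[2, 4] += 13                                     # inject a datapath fault
assert locate_fault(A, B, C) == (2, 4)
```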
2008 International Conference on Field-Programmable Technology, 2008
As integrated circuits are scaled down, it becomes difficult to maintain uniformity in process parameters across each individual die. To avoid significant performance loss through pessimistic over-design, new design strategies are required that are cognisant of within-die performance variability. This paper examines the effect of process variability on the clock resources in FPGA devices. A model of variation in clock skew in FPGA clock networks is presented. Techniques for reducing the impact of variations on the performance of implemented designs are proposed and analysed, demonstrating that skew variation can be reduced by 70% or more through a combination of phase adjustment and clock rerouting. Measurements on a Virtex-5 FPGA validate the feasibility and benefits of the proposed compensation strategies.
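A toy numerical model of the phase-adjustment idea, assuming each clock region can be given an independent, quantised phase offset; the arrival-time statistics and the 0.02 ns phase step are made up for illustration. The residual skew after quantised compensation is bounded by the step size, consistent with the large reductions reported.

```python
import numpy as np

rng = np.random.default_rng(0)
arrival = rng.normal(2.0, 0.05, 64)   # per-region clock arrival times in ns (synthetic)
step = 0.02                            # assumed phase-shift granularity in ns

# Subtract the closest available quantised phase offset from each region;
# the residual skew is then bounded by the step size.
offset = np.round((arrival - arrival.min()) / step) * step
compensated = arrival - offset
print(f"skew before: {np.ptp(arrival):.3f} ns, after: {np.ptp(compensated):.3f} ns")
```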
15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM 2007), 2007

Lecture Notes in Computer Science
Recent technological advances in the imaging industry have led to the production of imaging systems with high-density pixel sensors. However, their long exposure times limit their application to static images due to motion blur. This work presents a system that reduces motion blur using a time-variant image sensor. This sensor can combine several pixels to form a larger pixel when necessary. Larger pixels require shorter exposure times and produce high-frame-rate samples with reduced motion blur. An FPGA is employed to enhance the spatial resolution of these samples using Super Resolution (SR) techniques in real time. This work focuses on the spatial resolution enhancement block and presents an FPGA implementation of the Iterative Back Projection (IBP) SR algorithm. The proposed architecture achieves 25 fps for VGA input and can serve as a general-purpose real-time resolution enhancement system.
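For reference, a single-frame NumPy/SciPy sketch of the Iterative Back Projection loop. The paper's hardware implements a multi-frame, real-time variant; the step size `lam`, the iteration count, and the bilinear resampling choice below are assumptions.

```python
import numpy as np
from scipy.ndimage import zoom

def ibp_super_resolve(low, scale=2, iters=10, lam=0.5):
    """Iterative Back Projection: upsample an initial guess, then repeatedly
    simulate the imaging model (downsample), back-project the low-resolution
    error and correct the estimate.  Bilinear resampling (order=1) throughout.
    """
    low = np.asarray(low, dtype=float)
    high = zoom(low, scale, order=1)                      # initial HR guess
    for _ in range(iters):
        simulated = zoom(high, 1.0 / scale, order=1)      # forward imaging model
        high += lam * zoom(low - simulated, scale, order=1)   # back-project error
    return high

frame = np.random.rand(240, 320)      # stand-in low-resolution frame
print(ibp_super_resolve(frame).shape) # (480, 640)
```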

IEICE Electronics Express, 2014
This paper proposes a 2-stage variation-aware placement method that benefits from the optimality of full chipwise (chip-by-chip) placement to alleviate the impact of process variation. By classifying FPGAs into a small number of classes based on their variation maps and performing placement optimisation for each class instead of each chip, two-stage placement greatly reduces execution time while achieving timing improvement similar to full chipwise optimal placement. Our proposed method is implemented in a modified version of VPR 5.0 and verified using variation maps measured from 129 DE0 boards equipped with Cyclone III FPGAs. The results are compared with variation-blind, statistical static timing analysis (SSTA) and full chipwise placement. A timing gain of 7.5% is observed across 20 MCNC benchmarks with 16 classes at 95% timing yield, while execution time is reduced by a factor of 8 compared to full chipwise placement.
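The classification step might look like the sketch below, which uses k-means over flattened variation maps purely as an illustrative stand-in; the paper does not necessarily use k-means, and the synthetic data merely mirrors the 129-board, 16-class setup.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for per-site delay-variation maps measured on 129 boards;
# synthetic here so the sketch is self-contained.
maps = rng.normal(1.0, 0.05, size=(129, 32 * 32))

# Stage 1: group boards into 16 classes by variation-map similarity.
km = KMeans(n_clusters=16, n_init=10, random_state=0).fit(maps)

# Stage 2 (not shown): run one variation-aware placement per class and give
# each board its class's placement -- far fewer CAD runs than chip-by-chip.
print(np.bincount(km.labels_))
```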
IEICE Electronics Express, 2014
In this paper, the FPGA routing process is explored to mitigate and take advantage of delay variability due to process variation. A new method called partial rerouting is proposed to improve timing performance in the presence of process variation and to reduce execution time. By rerouting only a small number of critical and near-critical paths, a timing improvement of about 6.3% can be achieved. At the same time, partial rerouting speeds up the routing process by 9 times compared with full chipwise routing across 100 target FPGAs (variation maps). Moreover, partial rerouting enables a trade-off between product yield and routing speed.
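A minimal sketch of the path-selection idea, assuming "near-critical" means within some margin of the worst path delay; the margin and the example data are hypothetical, and the paper defines its own selection criterion.

```python
def paths_to_reroute(path_delays, margin=0.10):
    """Pick the critical and near-critical paths for partial rerouting:
    everything within `margin` of the worst path delay."""
    worst = max(path_delays.values())
    return [p for p, d in path_delays.items() if d >= (1 - margin) * worst]

delays = {"p0": 4.1, "p1": 3.9, "p2": 2.2, "p3": 4.0}   # ns, hypothetical
print(paths_to_reroute(delays))                          # ['p0', 'p1', 'p3']
```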

IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2013
The key aspects of a good on-chip timing measurement platform are high measurement resolution, accuracy and low area overhead. A measurement method based on transition probability (TP) has shown promising characteristics in all these areas. In this paper, the TP measurement method is examined through simulation to understand its effectiveness and accuracy in measuring complex circuits. Timing uncertainties and logic glitch activities are considered in detail, and the effect of varying the input vectors' probability distributions is analyzed to enable further accuracy improvements. Using a field-programmable gate array, the method is implemented and demonstrated as a modular on-chip test platform for testing complex arbitrary circuits. Practical circuits found in typical modular designs, including fixed/floating-point arithmetic and filter circuits, are chosen to evaluate the test platform. The resolution of the timing measurements ranges from 0.3 to 8.0 ps, and the measurement errors against reference measurements are found to be within 3.6%. The test platform can be applied to VLSI designs with minor area overhead, and provides designers with precise and accurate physical timing information.

IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2005
This paper explores the problem of architectural synthesis (scheduling, allocation, and binding) for multiple word-length systems. It is demonstrated that the resource allocation and binding problem, and the interaction between scheduling, allocation, and binding, are complicated by the existence of multiple word-length operators. Both optimum and heuristic approaches to the combined problem are formulated. The optimum solution involves modeling as an integer linear program, while the heuristic solution considers intertwined scheduling, binding, and resource word-length selection. Techniques are introduced to perform scheduling with incomplete word-length information, to combine binding and word-length selection, and to refine word-length information based on critical path analysis. Results are presented for several benchmark and artificial examples, demonstrating that resource savings of up to 46% are possible by considering these problems within the proposed unified framework.
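To make the combined binding/word-length problem concrete, here is a deliberately simple greedy heuristic; the paper's ILP and intertwined heuristic are far more sophisticated, and the width**2 area model and schedule format are assumptions for illustration.

```python
def bind_resources(schedule):
    """Greedy binding + word-length selection.  `schedule` maps a time step
    to the word-lengths of the operations issued in that step; a resource
    serves at most one operation per step and must be at least as wide as
    any operation bound to it.  Multiplier area is modelled as width**2.
    """
    widths = []                                # resource word-lengths, sorted descending
    for _, ops in sorted(schedule.items()):
        for i, w in enumerate(sorted(ops, reverse=True)):
            if i == len(widths):
                widths.append(w)               # this step needs another resource
            elif widths[i] < w:
                widths[i] = w                  # widen an existing resource
        widths.sort(reverse=True)
    return widths, sum(w * w for w in widths)

sched = {0: [16, 8], 1: [12, 12, 4], 2: [8]}   # step -> operation word-lengths
print(bind_resources(sched))                   # ([16, 12, 4], 416)
```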

Proceedings of the 2003 International Symposium on Circuits and Systems, 2003. ISCAS '03., May 25, 2003
Reconfigurable computers based on a combination of conventional microprocessors and Field Programmable Gate Arrays (FPGAs) present new challenges to designers. Debugging such hardware/software cohabiting systems can be a nightmare. This paper presents SONICmole, a debugging environment designed for the UltraSONIC reconfigurable computer, which is built specifically for real-time video applications. The window-based integrated debugging environment includes a hardware debug module (the "mole") that performs the function of a logic analyzer embedded within the FPGA design, and an easy-to-use software package that supports debugging of such a hardware/software system. The resource overhead of the hardware module is only 4% of a Virtex XVC1000 FPGA. The approach reported here is not limited to the UltraSONIC architecture and can easily be adapted for other reconfigurable computers.
Proceedings of the IEEE 1991 Custom Integrated Circuits Conference
The authors describe an efficient method for obtaining the parameters of behavioural models of analogue cells from either measurements of fabricated devices or circuit simulation. No assumption about the analytical mapping between the performance space and the behavioural parameter space is required. By combining this method with standard tolerance analysis techniques based on the Monte Carlo method, one can...
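Combined with Monte Carlo tolerance analysis, the extracted behavioural parameters can drive a yield estimate along these lines; all names, statistics and specifications below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Behavioural-parameter statistics as they might be extracted from measured
# dies or from simulation (values hypothetical): [dc_gain_db, gbw_hz].
mean = np.array([60.0, 1.0e6])
cov = np.array([[4.0, 1e4],
                [1e4, 1e9]])

# Monte Carlo tolerance analysis: draw parameter sets, apply the spec,
# and estimate the parametric yield.
samples = rng.multivariate_normal(mean, cov, size=10_000)
passes = (samples[:, 0] > 55.0) & (samples[:, 1] > 0.9e6)
print(f"estimated yield: {passes.mean():.1%}")
```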
Proceedings. 10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines
This paper is concerned with the image registration problem as applied to video sequences that have been subjected to geometric distortions. This work involves the development of a computationally efficient algorithm to restore the video sequence using image registration techniques. An approach based on motion vectors is proposed and is found to be successful in restoring the video sequence for any affine-transform-based distortion. The algorithm is implemented in FPGA hardware targeted at a reconfigurable computing platform called SONIC. It is shown that the algorithm can efficiently restore the video data in real time.
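The core estimation step can be sketched as a least-squares affine fit to point correspondences derived from motion vectors; this is a software model only, and the correspondences and solver choice are assumptions.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine transform from motion-vector correspondences:
    each motion vector gives a point pair src -> dst.  Solves
    dst ~ M @ [x, y, 1]^T for the 2x3 affine matrix M.
    """
    src = np.asarray(src, float)
    dst = np.asarray(dst, float)
    X = np.hstack([src, np.ones((len(src), 1))])   # (n, 3) homogeneous sources
    M, *_ = np.linalg.lstsq(X, dst, rcond=None)    # (3, 2) solution
    return M.T                                     # 2x3 affine matrix [A | t]

src = [(0, 0), (1, 0), (0, 1), (1, 1)]
dst = [(2, 3), (3, 3), (2, 4), (3, 4)]   # pure translation by (2, 3)
print(fit_affine(src, dst))              # ~[[1, 0, 2], [0, 1, 3]]
```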
1996 IEEE International Symposium on Circuits and Systems. Circuits and Systems Connecting the World. ISCAS 96
By means of new tools that can extract realistic faults from the PCB layout, generate test vectors and simulate fault diagnosis, we evaluated the applicability of structural diagnosis methods for interconnect faults. The results show that, compared to traditional behavioural approaches, structural diagnosis can reduce ambiguity in diagnosis and allows more compact test sets to be employed.
International Conference on Field Programmable Logic and Applications, 2005.
The purpose of this paper is to detail a high-level analytical modelling and optimisation environment for mixed-granularity Field Programmable Gate Arrays (FPGAs). The work involves the creation of an analytical framework that can be used to optimise the design of a reconfigurable device for a set of benchmarks. The strengths of this approach are simultaneous placement, module selection and architecture generation. The problem is cast as a formal optimisation and may be solved using existing optimisation tools. In addition, the approach is adapted into a heuristic for larger benchmark sets. The design space is explored by examining the tradeoffs between area, speed and flexibility, and some comparisons to commercial architectures are drawn.
Proceedings of the Second International Conference on Automatic Face and Gesture Recognition
Automatic facial feature detection is typically solved by using manually segmented images to train a feature detector. In this paper, we investigate whether it is possible to improve the detection performance of such a feature detector by using additional unsegmented images. We propose a new adaptive automatic facial feature segmentation algorithm which aims to do this. The experimental results demonstrate that it is possible to improve the detection performance obtained from a small segmented training set by using a larger number of additional unsegmented images.
2000 IEEE International Symposium on Circuits and Systems. Emerging Technologies for the 21st Century. Proceedings (IEEE Cat No.00CH36353)

Proceedings. 10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines
This paper presents a tabu search (TS) method with an intensification strategy for hardware-software partitioning. The algorithm operates on functional blocks for designs represented as directed acyclic graphs (DAGs), with the objective of minimising processing time under various hardware area constraints. Results are compared with two other heuristic search algorithms: the genetic algorithm (GA) and simulated annealing (SA). The comparison uses a scheduling model based on list scheduling to calculate the processing time used as the system cost, assuming that shared resource conflicts do not occur. The results show that TS, which rarely appears in the literature for this kind of problem, is superior to SA and GA in terms of both search time and solution quality. In addition, we have implemented an intensification strategy in TS called penalty reward, which can further improve the quality of the results.
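A minimal tabu search for this kind of partitioning problem might look like the sketch below. The purely additive timing model ignores scheduling and resource conflicts, and the move and aspiration rules are textbook defaults rather than the paper's exact formulation, which adds DAG-based list scheduling and the penalty-reward intensification.

```python
def tabu_partition(costs, hw_areas, area_limit, iters=200, tenure=7):
    """Greedy-neighbourhood tabu search for HW/SW partitioning.

    costs[i] = (sw_time, hw_time) for block i; a move flips one block
    between software (0) and hardware (1).  Recently flipped blocks are
    tabu unless the move beats the best solution found (aspiration).
    """
    n = len(costs)

    def time(s):
        return sum(costs[i][s[i]] for i in range(n))   # additive, no overlap

    def area(s):
        return sum(hw_areas[i] for i in range(n) if s[i])

    sol = [0] * n                      # start all-software
    best, best_t = sol[:], time(sol)
    tabu = {}                          # block -> iteration it becomes legal again
    for it in range(iters):
        cand = None
        for i in range(n):
            s = sol[:]
            s[i] ^= 1
            if area(s) > area_limit:
                continue               # violates the hardware area constraint
            t = time(s)
            if tabu.get(i, 0) > it and t >= best_t:
                continue               # tabu move that does not aspirate
            if cand is None or t < cand[1]:
                cand = (i, t, s)
        if cand is None:
            break                      # no admissible move left
        i, t, sol = cand
        tabu[i] = it + tenure
        if t < best_t:
            best, best_t = sol[:], t
    return best, best_t

costs = [(5, 2), (4, 1), (3, 3), (6, 2)]           # (sw_time, hw_time), hypothetical
areas = [3, 2, 4, 5]                               # hardware area per block, hypothetical
print(tabu_partition(costs, areas, area_limit=7))  # -> ([0, 1, 0, 1], 11)
```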