Papers by Silviu-Ioan Filip

Large-scale deep neural networks (DNNs) have achieved remarkable success in many application scenarios. However, high computational complexity and energy costs of modern DNNs make their deployment on edge devices challenging. Model quantization is a common approach to deal with deployment constraints, but searching for optimized bit-widths can be challenging. In this work, we present Adaptive Bit-Width Quantization Aware Training (AdaQAT), a learning-based method that automatically optimizes weight and activation signal bit-widths during training for more efficient DNN inference. We use relaxed real-valued bit-widths that are updated using a gradient descent rule, but are otherwise discretized for all quantization operations. The result is a simple and flexible QAT approach for mixed-precision uniform quantization problems. Compared to other methods that are generally designed to be run on a pretrained network, AdaQAT works well in both training-from-scratch and fine-tuning scenarios. Initial results on the CIFAR-10 and ImageNet datasets using ResNet20 and ResNet18 models, respectively, indicate that our method is competitive with other state-of-the-art mixed-precision quantization approaches.
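A minimal sketch of the core idea described above (a relaxed, real-valued bit-width trained by gradient descent but discretized whenever quantization is applied) is shown below in plain PyTorch. The class name, the per-tensor scaling, and the straight-through gradient path are illustrative assumptions, not the paper's exact AdaQAT update rule.

```python
import torch

def ste_round(t):
    # Round in the forward pass, identity gradient in the backward pass.
    return t + (torch.round(t) - t).detach()

class LearnableBitwidthQuantizer(torch.nn.Module):
    """Uniform quantizer whose bit-width is a relaxed real-valued parameter."""
    def __init__(self, init_bits=8.0):
        super().__init__()
        self.bits = torch.nn.Parameter(torch.tensor(init_bits))

    def forward(self, x):
        b = ste_round(self.bits)                       # discretized for the quantization op
        qmax = 2.0 ** (b - 1.0) - 1.0                  # symmetric signed integer range
        scale = x.detach().abs().max().clamp_min(1e-8) / qmax
        xq = ste_round(x / scale)
        xq = torch.maximum(torch.minimum(xq, qmax), -qmax)
        return xq * scale                              # gradients reach both x and self.bits
```

In a full setup, separate quantizers of this kind would be attached to weights and activations, with their bit-width parameters added to the optimizer.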
Towards Machine-Efficient Rational L∞-Approximations of Mathematical Functions

Training Deep Neural Networks (DNNs) can be computationally demanding, particularly when dealing with large models. Recent work has aimed to mitigate this computational challenge by introducing 8-bit floating-point (FP8) formats for multiplication. However, accumulations are still done in either half (16-bit) or single (32-bit) precision arithmetic. In this paper, we investigate lowering accumulator word length while maintaining the same model accuracy. We present a multiply-accumulate (MAC) unit with FP8 multiplier inputs and FP12 accumulations, which leverages an optimized stochastic rounding (SR) implementation to mitigate swamping errors that commonly arise during low-precision accumulations. We investigate the hardware implications and accuracy impact associated with varying the number of random bits used for rounding operations. We additionally attempt to reduce MAC area and power by proposing a new scheme to support SR in floating-point MAC units and by removing support for subnormal values. Our optimized eager SR unit significantly reduces delay and area when compared to a classic lazy SR design. Moreover, when compared to MACs utilizing single- or half-precision adders, our design showcases notable savings in all metrics. Furthermore, our approach consistently maintains near-baseline accuracy across a diverse range of computer vision tasks, making it a promising alternative for low-precision DNN training.
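The effect of stochastic rounding with a limited number of random bits can be illustrated with a short software model. The function below is a hypothetical NumPy sketch, not the paper's eager SR hardware unit, and the accumulator format is only mimicked by a reduced mantissa width chosen for illustration.

```python
import numpy as np

def stochastic_round(x, mantissa_bits, rng, random_bits=8):
    """Round x to `mantissa_bits` mantissa bits with stochastic rounding,
    using a random threshold quantized to `random_bits` bits."""
    x = np.asarray(x, dtype=np.float64)
    m, e = np.frexp(x)                                   # x = m * 2**e with 0.5 <= |m| < 1
    scaled = m * 2.0 ** mantissa_bits
    low = np.floor(scaled)
    frac = scaled - low                                  # fractional part in [0, 1)
    r = np.floor(rng.random(x.shape) * 2 ** random_bits) / 2 ** random_bits
    return np.ldexp((low + (frac > r)) / 2.0 ** mantissa_bits, e)

# Accumulating many small values: round-to-nearest at low precision eventually
# drops them ("swamping"), while stochastic rounding keeps the sum unbiased.
rng = np.random.default_rng(0)
acc = 0.0
for _ in range(10000):
    acc = stochastic_round(acc + 1e-4, mantissa_bits=11, rng=rng)
print(acc)   # close to 1.0 on average
```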

One of the major bottlenecks in high-resolution Earth Observation (EO) space systems is the downlink between the satellite and the ground. Due to hardware limitations, onboard power limitations or ground-station operation costs, there is a strong need to reduce the amount of data transmitted. Various processing methods can be used to compress the data. One of them is the use of on-board deep learning to extract relevant information in the data. However, most ground-based deep neural network parameters and computations are performed using single-precision floating-point arithmetic, which is not adapted to the context of on-board processing. We propose to rely on quantized neural networks and study how to combine low-precision (mini) floating-point arithmetic with a Quantization-Aware Training methodology. We evaluate our approach with a semantic segmentation task for ship detection using satellite images from the Airbus Ship dataset. Our results show that 6-bit floating-point quantization for both weights and activations can compete with single precision without significant accuracy degradation. Using a Thin U-Net 32 model, only a 0.3% accuracy degradation is observed with 6-bit minifloat quantization (a 6-bit equivalent integer-based approach leads to a 0.5% degradation). An initial hardware study also confirms the potential impact of such low-precision floating-point designs, but further investigation at the scale of a full inference accelerator is needed before concluding whether they are relevant in a practical on-board scenario.
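To make the notion of minifloat quantization concrete, here is a rough NumPy model of rounding values to a small floating-point format. The exponent/mantissa split (a 6-bit format with 3 exponent and 2 mantissa bits is assumed here) and the handling of subnormals are illustrative choices and may differ from the formats used in the paper.

```python
import numpy as np

def quantize_minifloat(x, exp_bits=3, man_bits=2):
    """Round values to a small float format (1 sign, `exp_bits` exponent,
    `man_bits` mantissa bits). Values below the normal range fall onto the
    smallest binade's uniform grid (a crude stand-in for subnormals)."""
    x = np.asarray(x, dtype=np.float64)
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = 2 ** exp_bits - 2 - bias                   # top exponent code reserved
    min_exp = 1 - bias                                   # smallest normal exponent
    max_val = (2.0 - 2.0 ** (-man_bits)) * 2.0 ** max_exp

    sign = np.sign(x)
    mag = np.abs(x)
    e = np.clip(np.floor(np.log2(np.maximum(mag, 2.0 ** min_exp))), min_exp, max_exp)
    step = 2.0 ** (e - man_bits)                         # spacing of representable values
    q = np.round(mag / step) * step
    return sign * np.clip(q, 0.0, max_val)

print(quantize_minifloat(np.array([0.07, 1.3, 20.0])))
```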
List of figures (excerpt): a design example showcasing a transition band anomaly and how it can be removed; a degree n = 10 minimax approximation of f(x) = ln(x)·e^sin(x) in terms of relative error; relative errors (correct significant digits) when computing the starting p for Example 3.20 with both types of barycentric formulas.

HAL (Le Centre pour la Communication Scientifique Directe), Apr 30, 2023
Software implementations of mathematical functions often use approximations that can be either polynomial or rational in nature. While polynomials are the preferred approximation in most cases, rational approximations are nevertheless an interesting alternative when dealing with functions that have a pronounced "nonpolynomial behavior" (such as poles close to the approximation domain, asymptotes or finite limits at ±∞). The major challenge is that of computing good rational approximations with machine number coefficients (e.g., floating-point or fixed-point) with respect to the supremum norm, a key step in most procedures for evaluating a mathematical function. This is made more complicated by the fact that even when dealing with real-valued coefficients, optimal supremum norm solutions are sometimes difficult to obtain. Here, we introduce flexible and fast algorithms for computing such rational approximations with both real and machine number coefficients. Their effectiveness is explored on several examples.
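As background for the kind of object being computed, the snippet below fits a rational function p(x)/q(x) by a crude linearized least-squares step on Chebyshev-style nodes. It is only a common starting-point heuristic: it does not minimize the supremum norm and does not constrain coefficients to machine numbers, which is precisely what the paper's algorithms address. Function and degree choices in the example are arbitrary.

```python
import numpy as np

def rational_ls_fit(f, a, b, p_deg, q_deg, n_samples=500):
    """Linearized least-squares fit of p/q to f on [a, b]:
    solve f*q - p ≈ 0 with the constant coefficient of q fixed to 1."""
    x = np.cos(np.pi * np.arange(n_samples) / (n_samples - 1))   # Chebyshev-type nodes on [-1, 1]
    x = 0.5 * (b - a) * x + 0.5 * (a + b)
    y = f(x)
    Vp = np.vander(x, p_deg + 1, increasing=True)
    Vq = np.vander(x, q_deg + 1, increasing=True)
    A = np.hstack([Vp, -y[:, None] * Vq[:, 1:]])                 # unknowns: p coeffs, then q_1..q_deg
    c, *_ = np.linalg.lstsq(A, y, rcond=None)
    p = c[: p_deg + 1]
    q = np.concatenate([[1.0], c[p_deg + 1:]])
    return p, q

# Example: a (3, 3) rational approximation of tan on [0, 1].
p, q = rational_ls_fit(np.tan, 0.0, 1.0, 3, 3)
```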
HAL (Le Centre pour la Communication Scientifique Directe), May 12, 2017
This article presents an open-source tool for the automatic design of reliable finite impulse response (FIR) filters, targeting FPGAs. It shows that user intervention can be limited to a very small number of relevant input parameters: a high-level frequency-domain specification, and input/output formats. All the other design parameters are computed automatically, using novel approaches to filter coefficient quantization and direct-form architecture implementation. Our tool guarantees a priori that the resulting architecture respects the specification, while attempting to minimize its cost. Our approach is evaluated on a range of examples and shown to produce designs that are very competitive with the state of the art, with very little design effort.
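To make the inputs of such a flow concrete, here is a small SciPy sketch that designs a lowpass filter from a frequency-domain specification, naively rounds the coefficients to a fixed-point format, and checks the frequency-response deviation. The specification numbers are arbitrary, and the naive rounding only stands in for the tool's quantization step; the real tool also chooses the order and architecture automatically and guarantees the specification a priori.

```python
import numpy as np
from scipy.signal import remez, freqz

# Lowpass spec: passband up to 0.2, stopband from 0.25 (sampling rate normalized to 1).
taps = remez(numtaps=41, bands=[0, 0.2, 0.25, 0.5], desired=[1, 0], fs=1.0)

frac_bits = 7                                            # 8-bit signed fixed-point coefficients
q_taps = np.round(taps * 2**frac_bits) / 2**frac_bits    # naive coefficient rounding

w, h_ref = freqz(taps, worN=2048)
_, h_q = freqz(q_taps, worN=2048)
print("max magnitude-response deviation after rounding:",
      np.max(np.abs(np.abs(h_ref) - np.abs(h_q))))
```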

Multiplication by a constant is a frequently used operation. To implement it on Field Programmable Gate Arrays (FPGAs), the state of the art offers two completely different methods: one relying on bit shifts and additions/subtractions, and another one using look-up tables and additions. So far, it was unclear which method performs best for a given constant and input/output data types. The main contribution of this work is a thorough comparison of both methods in the main application contexts of constant multiplication: filters, signal-processing transforms, and elementary functions. Most of the previous state of the art addresses multiplication by an integer constant. This work shows that, in most of these application contexts, a formulation of the problem as multiplication by a real constant allows for more efficient architectures. Another contribution is a novel extension of the shift-and-add method to real constants. For that, an integer linear programming (ILP) formulation is proposed, which truncates each component in the shift-and-add network to a minimum necessary word size that is aligned with the approximation error of the coefficient. All methods are implemented within the open-source FloPoCo framework. (Footnote: with FloPoCo version 4.1.2, try the command flopoco FPExp we=11 wf=53 and look in the produced VHDL for the signal absKLog2.)
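For the classic integer case, the shift-and-add cost of a constant can be read off its canonical signed-digit recoding; the sketch below illustrates this baseline only. It does not reproduce the paper's extension to real constants with ILP-optimized intermediate truncations.

```python
def csd_digits(c):
    """Canonical signed-digit (CSD) recoding of a positive integer constant.

    Returns (shift, sign) pairs so that c == sum(sign << shift); each pair
    beyond the first costs one adder/subtractor in a plain shift-and-add
    implementation of c * x.
    """
    digits, shift = [], 0
    while c != 0:
        if c & 1:
            d = 1 if (c & 3) == 1 else -1   # pick +1/-1 so the remainder stays divisible by 4
            digits.append((shift, d))
            c -= d
        c >>= 1
        shift += 1
    return digits

# 7 * x == (x << 3) - x : one subtractor instead of two adders.
print(csd_digits(7))   # [(0, -1), (3, 1)]
```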

Siam Review, 2019
The usual way in which mathematicians work with randomness is by a rigorous formulation of the idea of Brownian motion, which is the limit of a random walk as the step length goes to zero. A Brownian path is continuous but nowhere differentiable, and this nonsmoothness is associated with technical complications that can be daunting. However, there is another approach to random processes that is more elementary, involving smooth random functions defined by finite Fourier series with random coefficients or, equivalently, by trigonometric polynomial interpolation through random data values. We show here how smooth random functions can provide a very practical way to explore random effects. For example, one can solve smooth random ordinary differential equations using standard mathematical definitions and numerical algorithms, rather than having to develop new definitions and algorithms for stochastic differential equations. In the limit as the number of Fourier coefficients defining a smooth random function goes to infinity, one obtains the usual stochastic objects in what is known as their Stratonovich interpretation.
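The construction is simple enough to sketch directly: a finite Fourier series with independent random coefficients. The normalization and the cutoff parameter below are assumptions for illustration and may not match Chebfun's randnfun conventions exactly.

```python
import numpy as np

def smooth_random_function(wavenumber_max, rng, n_plot=1000):
    """Sample a smooth random function on [0, 1] as a finite Fourier series
    with independent standard-normal coefficients."""
    x = np.linspace(0.0, 1.0, n_plot)
    f = rng.standard_normal() * np.ones_like(x)
    for k in range(1, wavenumber_max + 1):
        a, b = rng.standard_normal(2)
        f += a * np.cos(2 * np.pi * k * x) + b * np.sin(2 * np.pi * k * x)
    # Normalize so the pointwise variance stays O(1) as modes are added.
    return x, f / np.sqrt(2 * wavenumber_max + 1)

x, f = smooth_random_function(wavenumber_max=20, rng=np.random.default_rng(1))
```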

SIAM Journal on Scientific Computing, 2018
Computing rational minimax approximations can be very challenging when there are singularities on or near the interval of approximation, precisely the case where rational functions outperform polynomials by a landslide. We show that far more robust algorithms than previously available can be developed by making use of rational barycentric representations whose support points are chosen in an adaptive fashion as the approximant is computed. Three variants of this barycentric strategy are all shown to be powerful: (1) a classical Remez algorithm, (2) an "AAA-Lawson" method of iteratively reweighted least-squares, and (3) a differential correction algorithm. Our preferred combination, implemented in the Chebfun MINIMAX code, is to use (2) in an initial phase and then switch to (1) for generically quadratic convergence. By such methods we can calculate approximations up to type (80, 80) of |x| on [−1, 1] in standard 16-digit floating-point arithmetic, a problem for which Varga, Ruttan, and Carpenter required 200-digit extended precision.
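The representation shared by all three variants is the rational barycentric form; a small evaluator is sketched below. The support points, values, and weights are assumed to be given (e.g., by a Remez or AAA-Lawson phase, which is not reproduced here).

```python
import numpy as np

def barycentric_eval(x, support, values, weights):
    """Evaluate r(x) = sum_j w_j f_j / (x - t_j)  /  sum_j w_j / (x - t_j).
    Exact hits on support points return the stored value to avoid 0/0."""
    support = np.asarray(support, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    weights = np.asarray(weights, dtype=np.float64)
    x = np.asarray(x, dtype=np.float64)
    r = np.empty_like(x)
    for i, xi in np.ndenumerate(x):
        diff = xi - support
        hit = np.isclose(diff, 0.0)
        if hit.any():
            r[i] = values[hit][0]
        else:
            c = weights / diff
            r[i] = np.dot(c, values) / c.sum()
    return r
```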

The most compute-intensive stage of deep neural network (DNN) training is matrix multiplication, where the multiply-accumulate (MAC) operator is key. To reduce training costs, we consider using low-precision arithmetic for MAC operations. While low-precision training has been investigated in prior work, the focus has been on reducing the number of bits in weights or activations without compromising accuracy. In contrast, the focus in this paper is on implementation details beyond weight or activation width that affect area and accuracy. In particular, we investigate the impact of fixed- versus floating-point representations, multiplier rounding, and floating-point exceptional value support. Results suggest that (1) low-precision floating-point is more area-effective than fixed-point for multiplication, (2) standard IEEE-754 rules for subnormals, NaNs, and intermediate rounding provide little to no value in terms of accuracy but contribute significantly to area, (3) low-precision MACs require an adaptive loss-scaling step during training to compensate for limited representation range, and (4) fixed-point is more area-effective for accumulation, but the cost of format conversion and downstream logic can swamp the savings. Finally, we note that future work should investigate accumulation structures beyond the MAC level to achieve further gains. The paper also contributes a new, extensible, open-source DNN training framework called Archimedes-MPO which uses FPGA or GPU ...
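Point (3), the adaptive loss-scaling step, can be illustrated with a plain dynamic loss-scaling training step. The sketch below uses ordinary float32 PyTorch and hypothetical growth/backoff constants; it does not model the custom low-precision MAC hardware itself.

```python
import torch

def train_step_with_loss_scaling(model, optimizer, batch, loss_fn, scale_state):
    """One step of dynamic loss scaling: scale the loss before backprop,
    skip the step and shrink the scale on overflow, grow it periodically."""
    inputs, targets = batch
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    (loss * scale_state["scale"]).backward()

    grads = [p.grad for p in model.parameters() if p.grad is not None]
    if any(not torch.isfinite(g).all() for g in grads):
        scale_state["scale"] *= 0.5                   # overflow: back off and skip the update
        return None
    for g in grads:
        g /= scale_state["scale"]                     # undo the scaling before the update
    optimizer.step()
    scale_state["good_steps"] = scale_state.get("good_steps", 0) + 1
    if scale_state["good_steps"] % 1000 == 0:
        scale_state["scale"] *= 2.0                   # grow the scale after a stable run
    return loss.item()
```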
MPTorch and MPArchimedes: Open Source Frameworks to Explore Custom Mixed-Precision Operations for DNN Training on Edge Devices
HAL (Le Centre pour la Communication Scientifique Directe), Dec 5, 2021
Springer eBooks, 2022
The design and implementation of Deep Learning (DL) models is currently receiving a lot of attention from both industry and academia. However, the computational workload associated with DL is often out of reach for low-power embedded devices and is still costly when run on datacenters. By relaxing the need for fully precise operations, Approximate Computing (AxC) substantially improves performance and energy efficiency. DL is extremely relevant in this context, since tuning the accuracy of its computations can significantly enhance performance while keeping the quality of results within a user-constrained range. This chapter explores how AxC can improve the performance and energy efficiency of hardware accelerators in DL applications during inference and training.
IEEE Transactions on Signal Processing, May 15, 2018
Many applications of finite impulse response (FIR) digital filters impose strict format constraints for the filter coefficients. Such requirements increase the complexity of determining optimal designs for the problem at hand. We introduce a fast and efficient method, based on the computation of good nodes for polynomial interpolation and Euclidean lattice basis reduction. Experiments show that it returns quasi-optimal finite-wordlength FIR filters; compared to previous approaches it also scales remarkably well (length-125 filters are treated in < 9 s). It also proves useful for accelerating the determination of optimal finite-wordlength FIR filters.
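The combinatorial nature of the problem can be seen with a toy baseline: start from naively rounded coefficients and greedily try ±1 LSB moves while the frequency-response deviation decreases. This is not the paper's method (which relies on good interpolation nodes and lattice basis reduction and scales far better); it only illustrates why plain rounding leaves room for improvement. Filter and word-length parameters in the example are arbitrary.

```python
import numpy as np
from scipy.signal import remez, freqz

def greedy_lsb_search(ideal_taps, frac_bits, n_grid=2048):
    """Greedy +/-1 LSB improvement of naively rounded FIR coefficients,
    minimizing the sup-norm deviation from the ideal frequency response."""
    lsb = 2.0 ** (-frac_bits)
    _, h_ref = freqz(ideal_taps, worN=n_grid)
    def err(taps):
        _, h = freqz(taps, worN=n_grid)
        return np.max(np.abs(h - h_ref))
    taps = np.round(np.asarray(ideal_taps) / lsb) * lsb
    best, improved = err(taps), True
    while improved:
        improved = False
        for i in range(len(taps)):
            for step in (+lsb, -lsb):
                trial = taps.copy()
                trial[i] += step
                e = err(trial)
                if e < best:
                    taps, best, improved = trial, e, True
    return taps, best

ideal = remez(21, [0, 0.2, 0.3, 0.5], [1, 0], fs=1.0)
q_taps, dev = greedy_lsb_search(ideal, frac_bits=6)
print("sup-norm deviation after greedy search:", dev)
```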
ACM Transactions on Mathematical Software, Aug 13, 2016
With a long history dating back to the beginning of the 1970s, the Parks-McClellan algorithm is probably the most well-known approach for designing finite impulse response filters. Despite being a standard routine in many signal processing packages, it is possible to find practical design specifications where existing codes fail to work. Our goal is twofold. We first examine and present solutions for the practical difficulties related to weighted minimax polynomial approximation problems on multi-interval domains (i.e., the general setting under which the Parks-McClellan algorithm operates). Using these ideas, we then describe a robust implementation of this algorithm. It routinely outperforms existing minimax filter design routines.
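For reference, this is the kind of weighted multi-interval specification the algorithm operates on, expressed here through SciPy's remez routine. The band edges, weights, and filter length are arbitrary illustrative values, not one of the failure cases discussed in the paper, and the paper's implementation is a separate, more robust code.

```python
from scipy.signal import remez

# Weighted bandpass specification on three intervals (sampling rate normalized to 1):
# stopbands [0, 0.1] and [0.4, 0.5] weighted 10x more heavily than the passband.
taps = remez(numtaps=71,
             bands=[0.0, 0.1, 0.15, 0.35, 0.4, 0.5],
             desired=[0.0, 1.0, 0.0],
             weight=[10.0, 1.0, 10.0],
             fs=1.0)
```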

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
This work presents two novel methods that simultaneously optimize both the design of a finite impulse response (FIR) filter and its multiplierless hardware implementation. We use integer linear programming (ILP) to minimize the number of adders used to implement a direct/transposed FIR filter adhering to a given frequency specification. The proposed algorithms work by either fixing the number of adders used to implement the products (multiplier block adders) or by bounding the adder depth (AD) used for these products. The latter can be used to design filters with minimal AD for low power applications. In contrast to previous multiplierless FIR filter approaches, the methods introduced here ensure adder count optimality. We perform extensive numerical experiments which demonstrate that our simultaneous filter design approach yields results superior to those in the literature.

2021 24th International Symposium on Design and Diagnostics of Electronic Circuits & Systems (DDECS)
The design and implementation of Convolutional Neural Networks (CNNs) for deep learning (DL) is currently receiving a lot of attention from both industry and academia. However, the computational workload involved with CNNs is often out of reach for low-power embedded devices and is still very costly when running on datacenters. By relaxing the need for fully precise operations, approximate computing substantially improves performance and energy efficiency. Deep learning is very relevant in this context, since tuning the accuracy of its computations can significantly enhance performance while keeping the quality of results within a user-constrained range. AdequateDL is a project aiming to explore how approximations can improve performance and energy efficiency of hardware accelerators in DL applications. This paper presents the main concepts and techniques related to approximation of CNNs and preliminary results obtained in the AdequateDL framework.