2022, arXiv (Cornell University)
With the surging popularity of edge computing, the need to efficiently perform neural network inference on battery-constrained IoT devices has greatly increased. While algorithmic developments enable neural networks to solve increasingly complex tasks, deploying these networks on edge devices can be problematic due to stringent energy, latency, and memory requirements. One way to alleviate these requirements is to heavily quantize the neural network, i.e., lower the precision of its operands. Taking quantization to the extreme, e.g., down to binary values, opens new opportunities to increase energy efficiency. Several hardware accelerators exploiting the opportunities of low-precision inference have been created, all aiming to enable neural network inference at the edge. In this chapter, design choices and their implications for the flexibility and energy efficiency of several accelerators supporting extremely quantized networks are reviewed.
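To make the appeal of extreme quantization concrete, the following minimal sketch (not taken from the chapter) shows how a dot product over {-1, +1} vectors collapses into an XNOR and a population count on bit-packed operands, the core trick most binary accelerators build on; the helper names are illustrative only.

```python
# Minimal sketch: a dot product between two {-1, +1} vectors reduces to
# XNOR + popcount on their bit-packed representations.

def binarize(x):
    """Pack a list of {-1, +1} values into an integer bitmask (1 bit per value)."""
    bits = 0
    for i, v in enumerate(x):
        if v > 0:
            bits |= 1 << i
    return bits

def xnor_popcount_dot(a_bits, b_bits, n):
    """Dot product of two n-element {-1, +1} vectors from their bitmasks."""
    matches = bin(~(a_bits ^ b_bits) & ((1 << n) - 1)).count("1")  # XNOR + popcount
    return 2 * matches - n  # each match contributes +1, each mismatch -1

a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
assert xnor_popcount_dot(binarize(a), binarize(b), len(a)) == sum(x * y for x, y in zip(a, b))
```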
ArXiv, 2018
Deep learning as a means of inferencing has proliferated thanks to its versatility and ability to approach or exceed human-level accuracy. These computational models have seemingly insatiable appetites for computational resources, not only during training but also when deployed at scales ranging from data centers all the way down to embedded devices. As such, increasing attention is being paid to maximizing computational efficiency under limited hardware and energy resources, and inferencing with reduced precision has emerged as a viable alternative to the IEEE 754 Standard for Floating-Point Arithmetic. We propose a quantization scheme that allows inferencing to be carried out using arithmetic that is fundamentally more efficient than even half-precision floating point. Our quantization procedure is significant in that we determine the quantization scheme parameters by calibrating against the reference floating-point model using a single inference batc...
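The abstract does not spell out the calibration procedure, so the following is a hedged sketch of one common single-batch approach: derive a symmetric per-tensor scale from the maximum magnitude observed on a calibration batch. The function names and the int8 target are assumptions for illustration.

```python
import numpy as np

# Hedged sketch: pick quantization parameters from a single calibration batch
# by deriving a per-tensor scale from the observed max magnitude. The paper's
# actual scheme may differ in detail.

def calibrate_scale(activations, num_bits=8):
    """Symmetric scale mapping the observed range onto the signed integer grid."""
    qmax = 2 ** (num_bits - 1) - 1          # e.g. 127 for int8
    return np.abs(activations).max() / qmax

def quantize(x, scale, num_bits=8):
    qmax = 2 ** (num_bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

batch = np.random.randn(32, 64).astype(np.float32)   # single calibration batch
s = calibrate_scale(batch)
err = np.abs(dequantize(quantize(batch, s), s) - batch).max()
print(f"scale={s:.5f}, max abs error={err:.5f}")
```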
This paper presents TULIP, a new architecture for variable-precision Quantized Neural Network (QNN) inference, designed with the goal of maximizing energy efficiency per classification. TULIP is constructed by arranging a collection of unique processing elements (TULIP-PEs) in a single-instruction-multiple-data (SIMD) fashion. Each TULIP-PE contains binary neurons that are interconnected using multiplexers, and each neuron has a small dedicated local register. The binary neurons are implemented as standard cells and used to implement threshold functions, i.e., an inner-product-and-thresholding operation on binary inputs. The neurons can be reconfigured with a single change in the control signals to implement all the standard operations used in a QNN. This paper presents novel algorithms for implementing the operations of a QNN on the TULIP-PEs in the form of a schedule of threshold functions. TULIP was implemented as an ASIC in TSMC 40nm-LP technol...
Intelligent Automation & Soft Computing, 2022
Convolutional Neural Network (ConNN) implementations on Field Programmable Gate Arrays (FPGAs) are being studied since the computational capabilities of FPGAs have improved recently. Model compression is required to enable ConNN deployment on resource-constrained FPGA devices. Logarithmic quantization is an efficient compression method that can compress a model to very low bit-width without significant deterioration in performance; it is also hardware-friendly, since multiplication reduces to bitwise operations. However, logarithmic quantization suffers from low resolution at high inputs due to its exponential spacing. Therefore, we propose a modified logarithmic quantization method with fine resolution to compress a neural network model. In experiments, quantized models achieve a negligible loss of accuracy without the need for retraining. In addition, we propose a resource-efficient hardware accelerator for running ConNN inference. Our design eliminates multipliers entirely, replacing them with bit shifters and adders. Throughput is measured in Giga Operations Per Second (GOP/s), and hardware utilization efficiency is represented by GOP/s per Digital Signal Processing (DSP) block and per Look-up Table (LUT). The results show that the accelerator achieves resource efficiencies of 9.38 GOP/s/DSP and 3.33 GOP/s/kLUTs.
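As a sketch of why logarithmic quantization is hardware-friendly, the snippet below snaps a weight to a signed power of two so that multiplication by it becomes a plain bit shift. This illustrates plain logarithmic quantization only; the paper's modified fine-resolution variant is not reproduced here, and the helper names are illustrative.

```python
import numpy as np

# Hedged sketch of plain logarithmic quantization: weights are snapped to
# signed powers of two, so multiplication becomes a bit shift.

def log2_quantize(w, min_exp=-8, max_exp=0):
    """Return (sign, exponent) with w ~= sign * 2**exponent."""
    sign = np.sign(w).astype(np.int32)
    exp = np.clip(np.round(np.log2(np.abs(w) + 1e-12)), min_exp, max_exp)
    return sign, exp.astype(np.int32)

def shift_multiply(x_int, sign, exp):
    """x * w realized as a shift: x >> (-exp) for negative exponents."""
    return sign * (x_int >> (-exp))  # integer activation, power-of-two weight

x = np.int32(200)
sign, exp = log2_quantize(np.float32(-0.25))  # -0.25 = -(2**-2)
print(shift_multiply(x, sign, exp))           # -> -50, i.e. 200 * -0.25
```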
ACM Transactions on Embedded Computing Systems, 2022
Ternary Neural Networks (TNNs) and mixed-precision Ternary Binary Networks (TBNs) have demonstrated higher accuracy than Binary Neural Networks (BNNs) while providing fast, low-power, and memory-efficient inference. Related works have improved the accuracy of TNNs and TBNs but overlooked their optimization on CPU and GPU platforms. First, there is no unified encoding for the binary and ternary values in TNNs and TBNs. Second, existing works store the 2-bit quantized data sequentially in 32/64-bit integers, resulting in bit-extraction overhead. Last, adopting standard 2-bit multiplications for ternary values leads to a complex computation pipeline, and efficient mixed-precision multiplication between ternary and binary values is unavailable. In this paper, we propose TAB as a unified and optimized inference method for ternary, binary, and mixed-precision neural networks. TAB includes a unified value representation, an efficient data storage scheme, and a novel bitwise dot product pip...
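The following hedged sketch illustrates the general idea of a unified bitwise encoding for ternary values, not TAB's exact pipeline: each ternary vector is stored as two bit-planes (a nonzero mask and a sign plane), so a ternary dot product needs only AND/XNOR/popcount rather than 2-bit multiplications.

```python
# Hedged sketch (not TAB's actual encoding): ternary values {-1, 0, +1} as
# two bit-planes, with the dot product done entirely in bitwise operations.

def encode_ternary(values):
    nonzero, sign = 0, 0
    for i, v in enumerate(values):
        if v != 0:
            nonzero |= 1 << i
            if v > 0:
                sign |= 1 << i
    return nonzero, sign

def ternary_dot(a, b, n):
    (a_nz, a_sg), (b_nz, b_sg) = a, b
    active = a_nz & b_nz                 # positions where both operands are nonzero
    agree = ~(a_sg ^ b_sg) & active      # same sign -> +1, opposite sign -> -1
    pos = bin(agree).count("1")
    return 2 * pos - bin(active).count("1")

x = [1, 0, -1, 1]
w = [-1, 1, -1, 1]
assert ternary_dot(encode_ternary(x), encode_ternary(w), 4) == sum(a * b for a, b in zip(x, w))
```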
arXiv (Cornell University), 2023
The deployment of AI models on low-power, real-time edge devices requires accelerators for which energy, latency, and area are all first-order concerns. There are many approaches to enabling deep neural networks (DNNs) in this domain, including pruning, quantization, compression, and binary neural networks (BNNs), but with the emergence of the "extreme edge", there is now a demand for even more efficient models. In order to meet the constraints of ultra-low-energy devices, we propose ULEEN, a model architecture based on weightless neural networks. Weightless neural networks (WNNs) are a class of neural model that uses table lookups, not arithmetic, to perform computation. The elimination of energy-intensive arithmetic operations makes WNNs theoretically well suited for edge inference; however, they have historically suffered from poor accuracy and excessive memory usage. ULEEN incorporates algorithmic improvements and a novel training strategy inspired by BNNs to make significant strides in improving accuracy and reducing model size. We compare FPGA and ASIC implementations of an inference accelerator for ULEEN against edge-optimized DNN and BNN devices. On a Xilinx Zynq Z-7045 FPGA, we demonstrate classification on the MNIST dataset at 14.3 million inferences per second (13 million inferences/Joule) with 0.21 µs latency and 96.2% accuracy, while Xilinx FINN achieves 12.3 million inferences per second (1.69 million inferences/Joule) with 0.31 µs latency and 95.83% accuracy. In a 45nm ASIC, we achieve 5.1 million inferences/Joule and 38.5 million inferences/second at 98.46% accuracy, while a quantized Bit Fusion model achieves 9230 inferences/Joule and 19,100 inferences/second at 99.35% accuracy. In our search for ever more efficient edge devices, ULEEN shows that WNNs deserve consideration.
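For readers unfamiliar with weightless models, here is a minimal sketch of the classic WiSARD-style RAM neuron that WNNs such as ULEEN generalize: inference is a table lookup on a tuple of input bits, with no arithmetic in the datapath. ULEEN itself differs substantially (Bloom filters, multi-part models, BNN-inspired training), so treat this only as background.

```python
import random

# Hedged sketch of a WiSARD-style weightless neuron: computation is a set
# membership test on a tuple of input bits -- no multiplications anywhere.

class RAMNode:
    """One lookup-table neuron over a fixed tuple of input bit positions."""
    def __init__(self, bit_positions):
        self.bits = bit_positions
        self.table = set()                    # addresses seen during training

    def address(self, x):
        return tuple(x[i] for i in self.bits)

    def train(self, x):
        self.table.add(self.address(x))

    def respond(self, x):
        return 1 if self.address(x) in self.table else 0

# A class discriminator is just a sum of RAM-node responses.
random.seed(0)
positions = list(range(8))
random.shuffle(positions)
nodes = [RAMNode(positions[i:i + 2]) for i in range(0, 8, 2)]
pattern = [1, 0, 1, 1, 0, 0, 1, 0]
for n in nodes:
    n.train(pattern)
print(sum(n.respond(pattern) for n in nodes))  # perfect match -> 4
```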
2018
The performance of neural networks, especially the currently popular form of deep neural networks, is often limited by the underlying hardware. Computations in deep neural networks are expensive, have a large memory footprint, and are power hungry. Conventional reduced-precision numerical formats, such as fixed-point and floating point, cannot accurately represent deep neural network parameters with a nonlinear distribution and small dynamic range. The recently proposed posit numerical format with tapered precision represents small values more accurately than the other formats. In this work, we propose an ultra-low precision deep neural network, PositNN, that uses posits during inference. The efficacy of PositNN is demonstrated on a deep neural network architecture with three datasets (MNIST, Fashion MNIST, and CIFAR-10), where an 8-bit PositNN outperforms other {5-8}-bit low-precision neural networks and a 32-bit floating point baseline network.
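Since the abstract leans on the posit format's tapered precision, the sketch below decodes an n-bit posit (sign, regime, exponent, fraction fields) into a float, following Gustafson's definition: value = (-1)^s · 2^(k·2^es + e) · (1 + f). The bit manipulation is written for clarity, not speed.

```python
# Hedged sketch of posit decoding, illustrating tapered precision: values
# near 1 get more fraction bits; extreme values spend bits on the regime.

def decode_posit(bits, n=8, es=1):
    """Decode an n-bit posit (given as an unsigned int) to a float."""
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return float("inf")                    # NaR (Not a Real)
    sign = bits >> (n - 1)
    if sign:
        bits = (-bits) & ((1 << n) - 1)        # two's complement negate
    body = bits << 1 & ((1 << n) - 1)          # drop the sign bit
    first = body >> (n - 1)                    # leading regime bit
    run = 0
    while run < n - 1 and (body >> (n - 1 - run)) & 1 == first:
        run += 1
    k = run - 1 if first else -run             # regime value
    rest = body << (run + 1) & ((1 << n) - 1)  # skip regime + terminating bit
    e = rest >> (n - es)                       # es exponent bits
    f = (rest & ((1 << (n - es)) - 1)) / (1 << (n - es))  # fraction in [0, 1)
    value = 2.0 ** (k * 2 ** es + e) * (1.0 + f)
    return -value if sign else value

print(decode_posit(0b01000000, n=8, es=1))     # -> 1.0
print(decode_posit(0b01100000, n=8, es=1))     # -> 4.0 (useed = 2**(2**es) = 4)
```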
2020 IEEE International Symposium on Circuits and Systems (ISCAS), 2020
Emerging intelligent embedded devices rely on Deep Neural Networks (DNNs) to interact with the real-world environment. This interaction comes with the need to retrain DNNs, since environmental conditions change continuously over time. Stochastic Gradient Descent (SGD) is a widely used algorithm that trains DNNs by iteratively optimizing the parameters over the training data. In this work, we first present a novel approach for adding training capability to a baseline (inference-only) DNN accelerator by splitting the SGD algorithm into simple computational elements. Then, based on this heuristic approach, we propose TaxoNN, a lightweight accelerator for DNN training. TaxoNN tunes the DNN weights by reusing the hardware resources used in the inference process, via a time-multiplexing approach and low-bitwidth units. Our experimental results show that TaxoNN incurs, on average, only a 0.97% higher misclassification rate than a full-precision implementation, while providing 2.1× power savings and 1.65× area reduction over the state-of-the-art DNN training accelerator.
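As a toy illustration of splitting SGD into simple computational elements, the snippet below performs the weight update for a single linear neuron using the same multiply-accumulate primitive as inference, which is the property TaxoNN exploits to time-multiplex its hardware. The decomposition shown is an assumption for illustration, not TaxoNN's actual schedule.

```python
# Hedged sketch: the SGD update w <- w - lr * dL/dw for a linear neuron
# (y = sum(w_i * x_i), with dL/dw_i = upstream_grad * x_i) decomposes into
# the same MAC primitive used by the inference datapath.

def sgd_step(weights, inputs, upstream_grad, lr=0.01):
    """In-place SGD update for a single linear neuron."""
    for i in range(len(weights)):
        grad_i = upstream_grad * inputs[i]   # same MAC primitive as inference
        weights[i] -= lr * grad_i
    return weights

w = [0.5, -0.3]
print(sgd_step(w, inputs=[1.0, 2.0], upstream_grad=0.2))
# -> approximately [0.498, -0.304]
```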
2019
Low-precision DNNs have been extensively explored in order to reduce the size of DNN models for edge devices. Recently, the posit numerical format has shown promise for DNN data representation and compute at ultra-low precision ([5..8] bits). However, previous studies were limited to posits for DNN inference only. In this paper, we propose the Cheetah framework, which supports both DNN training and inference using posits, as well as other commonly used formats. Additionally, the framework is amenable to different quantization approaches and supports mixed-precision floating-point and fixed-point numerical formats. Cheetah is evaluated on three datasets: MNIST, Fashion MNIST, and CIFAR-10. Results indicate that 16-bit posits outperform 16-bit floating point in DNN training. Furthermore, performing inference with [5..8]-bit posits improves the trade-off between performance and energy-delay product over both [5..8]-bit float and fixed-point.
IEEE Access
In recent years, the need for efficient deployment of Neural Networks (NNs) on edge devices has been steadily increasing. However, the high computational demand of Machine Learning (ML) inference precludes direct software deployment on resource-constrained, microcontroller-based IoT devices. Therefore, various custom and application-specific NN hardware accelerators have been proposed to enable real-time ML inference on low-power, resource-limited edge devices. Efficient mapping of the computational load onto hardware and software resources is a key challenge for improving performance while keeping power and area footprints low; high performance at low power may be attained on embedded processors through hardware acceleration. This paper presents an efficient hardware-software framework to accelerate ML inference on edge devices, using a modified TensorFlow Lite for Microcontrollers (TFLM) model running on a microcontroller (MCU) together with a dedicated Neural Processing Unit (NPU) custom hardware accelerator, referred to as MCU-NPU. The proposed framework supports weight compression of pruned quantized NN models and exploits the pruned model sparsity to further reduce computational complexity. The methodology has been evaluated by employing MCU-NPU acceleration for various TFLM-based NN architectures from the common MLPerf Tiny benchmark. Experimental results demonstrate a significant speedup of up to 724× compared to a pure software implementation; for example, the runtime for CIFAR-10 classification is reduced from about 20 s to only 37 ms with the proposed hardware acceleration. Moreover, the proposed hardware accelerator outperforms all the reference models optimized for edge devices in terms of inference runtime. Index Terms: TinyML, neural processing unit, TensorFlow-Lite for microcontrollers, hardware-software codesign.
2023
One of the major bottlenecks in high-resolution Earth Observation (EO) space systems is the downlink between the satellite and the ground. Due to hardware limitations, onboard power limitations, or ground-station operation costs, there is a strong need to reduce the amount of data transmitted. Various processing methods can be used to compress the data; one of them is on-board deep learning that extracts the relevant information from the data. However, most ground-based deep neural network parameters and computations use single-precision floating-point arithmetic, which is not suited to the context of on-board processing. We propose to rely on quantized neural networks and study how to combine low-precision (mini) floating-point arithmetic with a Quantization-Aware Training methodology. We evaluate our approach on a semantic segmentation task for ship detection using satellite images from the Airbus Ship dataset. Our results show that 6-bit floating-point quantization for both weights and activations can compete with single precision without significant accuracy degradation. Using a Thin U-Net 32 model, only a 0.3% accuracy degradation is observed with 6-bit minifloat quantization (a 6-bit equivalent integer-based approach leads to a 0.5% degradation). An initial hardware study also confirms the potential impact of such low-precision floating-point designs, but further investigation at the scale of a full inference accelerator is needed before concluding whether they are relevant in a practical on-board scenario.
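As a hedged sketch of what 6-bit minifloat quantization can look like, the snippet below rounds a tensor onto a minifloat grid assumed to have 1 sign, 3 exponent, and 2 mantissa bits with a fixed bias and no subnormal handling; the paper's exact format and quantization-aware training loop may differ.

```python
import numpy as np

# Hedged sketch: snap values to a 6-bit minifloat grid (assumed 1s/3e/2m,
# simple bias, no subnormals). Illustrative only -- not the paper's format.

def minifloat_round(x, exp_bits=3, man_bits=2, bias=3):
    sign = np.sign(x)
    mag = np.abs(x)
    e = np.clip(np.floor(np.log2(np.maximum(mag, 1e-30))),
                1 - bias, (2 ** exp_bits - 2) - bias)      # exponent range
    scale = 2.0 ** (e - man_bits)                          # mantissa ulp at e
    q = np.round(mag / scale) * scale                      # round the mantissa
    max_val = 2.0 ** ((2 ** exp_bits - 2) - bias) * (2 - 2.0 ** -man_bits)
    return sign * np.minimum(q, max_val)                   # saturate at maxval

x = np.array([0.1, 0.7, 1.3, 5.0, 20.0])
print(minifloat_round(x))   # each value snapped to the nearest 6-bit grid point
```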
Applied Reconfigurable Computing. Architectures, Tools, and Applications, 2018
Modern Convolutional Neural Networks (CNNs) are typically implemented with floating-point linear algebra. Recently, reduced-precision Neural Networks (NNs) have been gaining popularity as they require significantly less memory and fewer computational resources than floating point, which is particularly important in power-constrained compute environments. However, in many cases a reduction in precision comes at a small cost to the accuracy of the resultant network. In this work, we investigate the accuracy-throughput trade-off for various parameter precisions applied to different types of NN models. We first propose a quantization training strategy that allows reduced-precision NN inference with a lower memory footprint and competitive model accuracy. We then quantitatively formulate the relationship between data representation and hardware efficiency. Our experiments provide insightful observations: for example, one test shows that 32-bit floating point is more hardware-efficient than 1-bit parameters for achieving 99% MNIST accuracy. In general, 2-bit and 4-bit fixed-point parameters show a better hardware trade-off on small-scale datasets like MNIST and CIFAR-10, while 4-bit provides the best trade-off on large-scale tasks like AlexNet on ImageNet within our tested problem domain.
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
The massive adoption of IoT devices, the recent developments in the efficiency of AI systems, and the increase in edge computational power have accelerated the deployment of edge AI systems. Implementing these systems on low-power embedded devices scattered across the edges of a network allows for reduced latency and cost compared to traditional cloud-based AI computing systems. Given the low-complexity AI models and low-power embedded systems available on the market, this paper provides a comparative study of the inference performance of convolutional neural networks on different edge devices, exploiting low-power GPUs and dedicated AI hardware. The benchmarks achieved 864 inferences/s for the Jetson AGX Xavier board on a pre-trained SqueezeNet, with a high power efficiency of 52.6 inferences/s per watt. The dedicated Movidius neural stick requires only 1.5 W while processing 24.2 inferences/s.
ArXiv, 2017
It remains a challenge to run Deep Learning on Internet-of-Things devices with stringent power budgets. This paper presents a low-power accelerator for processing Deep Neural Networks on embedded devices. The power reduction is realized by avoiding multiplications of near-zero valued data. A near-zero approximation and a dedicated Near-Zero Approximation Unit (NZAU) are proposed to predict and skip near-zero multiplications under certain thresholds. Compared with skipping only zero-valued computations, our design achieves a further 1.92× and 1.51× reduction of total multiplications in LeNet-5 and AlexNet respectively, with negligible loss of accuracy. In the proposed accelerator, 256 multipliers are grouped into 16 independent Processing Lanes (PLs) to support up to 16 neuron activations simultaneously. With the help of data pre-processing and buffering in each PL, multipliers can be clock-gated most of the time, even while data streams in continuously. Designed and si...
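The software analogue of the near-zero skipping idea is easy to state: predict that products of sub-threshold operands contribute roughly nothing and skip them. The sketch below is illustrative only; the actual design performs this prediction in a dedicated NZAU ahead of the multiplier lanes, and the threshold value here is an assumption.

```python
import numpy as np

# Hedged software analogue of the NZAU idea: multiplications whose operands
# fall below a threshold are predicted to contribute ~0 and skipped.

def nzau_dot(x, w, threshold=0.05):
    acc, mults = 0.0, 0
    for xi, wi in zip(x, w):
        if abs(xi) < threshold or abs(wi) < threshold:
            continue                      # skip the near-zero multiplication
        acc += xi * wi
        mults += 1
    return acc, mults

rng = np.random.default_rng(0)
x, w = rng.normal(0, 0.2, 256), rng.normal(0, 0.2, 256)
approx, used = nzau_dot(x, w)
print(f"exact={np.dot(x, w):.4f} approx={approx:.4f} multiplications={used}/256")
```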
ArXiv, 2021
In this paper, we propose a mixed-precision convolution unit architecture which supports different integer and floating-point (FP) precisions. The proposed architecture is based on low-bit inner-product units and realizes higher precision via temporal decomposition. We illustrate how to integrate FP computations into an integer-based architecture and evaluate the overheads incurred by FP arithmetic support. We argue that alignment and addition overhead for FP inner products can be significant, since the maximum exponent difference can be up to 58 bits, which results in large alignment logic. To address this issue, we illustrate empirically that at least 8 bits of alignment logic are required to maintain inference accuracy. We present novel optimizations based on the above observations to reduce the FP arithmetic hardware overheads. Our empirical results, based on simulation and hardware implementation, show a significant reduction in FP16 overhead. Over a typical mixed precision imple...
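A hedged sketch of truncated alignment in a floating-point inner product: products are decomposed into mantissa and exponent, mantissas are shifted relative to the largest exponent, and anything falling outside a narrow alignment window is truncated away. The window and mantissa widths below are illustrative, not the paper's design parameters.

```python
import math

# Hedged sketch: sum FP products via integer mantissas aligned to the max
# exponent, with shifts beyond a small alignment window truncated away.

def aligned_sum(products, align_bits=8, man_bits=10):
    """Sum floating-point products with truncated exponent alignment."""
    decomposed = [math.frexp(p) for p in products]       # p = m * 2**e, 0.5<=|m|<1
    e_max = max(e for _, e in decomposed)
    acc = 0
    for m, e in decomposed:
        shift = e_max - e
        if shift > align_bits + man_bits:
            continue                                      # falls outside the window
        acc += int(round(m * (1 << man_bits))) >> shift   # truncated alignment
    return acc * 2.0 ** (e_max - man_bits)

ps = [1.5, 0.001, -0.25, 3.0e-7]
print(aligned_sum(ps), sum(ps))   # close, despite the narrow alignment window
```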
2021 26th Asia and South Pacific Design Automation Conference (ASP-DAC), 2021
ReRAM-based accelerators have shown great potential for accelerating DNN inference because ReRAM crossbars can perform analog matrix-vector multiplication (MVM) operations with low latency and energy consumption. However, these crossbars require ADCs, which constitute a significant fraction of the cost of MVM operations. The overhead of ADCs can be mitigated via partial sum quantization; however, prior quantization flows for DNN inference accelerators do not consider partial sum quantization, as it is not highly relevant to traditional digital architectures. To address this issue, we propose a mixed-precision quantization scheme for ReRAM-based DNN inference accelerators, where weight quantization, input quantization, and partial sum quantization are jointly applied for each DNN layer. We also propose an automated quantization flow, powered by deep reinforcement learning, to search for the best quantization configuration in the large design space. Our evaluation shows that the propo...
2018 Design, Automation & Test in Europe Conference & Exhibition (DATE)
Artificial intelligence, and especially Machine Learning, has recently gained considerable interest from industry. Indeed, a new generation of neural networks built from many successive computing layers enables a wide range of new applications and services, implemented from smart sensors to data centers. These Deep Neural Networks (DNNs) can interpret signals to recognize objects or situations and drive decision processes. However, their integration into embedded systems remains challenging due to their high computing needs. This paper presents PNeuro, a scalable, energy-efficient hardware accelerator for the inference phase of DNN processing chains. Simple programmable processing elements arranged in SIMD clusters perform all the operations needed by DNNs (convolutions, pooling, non-linear functions, etc.). An FDSOI 28 nm prototype shows an energy efficiency of 700 GMACS/s/W at 800 MHz. These results open important perspectives for the development of smart, energy-efficient solutions based on Deep Neural Networks.
Cornell University - arXiv, 2022
Deep Learning has been one of the most disruptive technological advancements in recent times. The high performance of deep learning models comes at the expense of high computational, storage and power requirements. Sensing the immediate need for accelerating and compressing these models to improve on-device performance, we introduce Deeplite Neutrino for production-ready optimization of the models and Deeplite Runtime for deployment of ultra-low bit quantized models on Arm-based platforms. We implement low-level quantization kernels for Armv7 and Armv8 architectures enabling deployment on the vast array of 32-bit and 64-bit Arm-based devices. With efficient implementations using vectorization, parallelization, and tiling, we realize speedups of up to 2x and 2.2x compared to TensorFlow Lite with XNNPACK backend on classification and detection models, respectively. We also achieve significant speedups of up to 5x and 3.2x compared to ONNX Runtime for classification and detection models, respectively.
2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), 2019
Recent machine learning methods use increasingly large deep neural networks to achieve state-of-the-art results in various tasks. The gains in performance come at the cost of a substantial increase in computation and storage requirements, which makes real-time implementation on resource-limited hardware challenging. One popular approach to addressing this challenge is to perform low-bit precision computations via neural network quantization. However, aggressive quantization generally entails a severe penalty in terms of accuracy, and often requires retraining the network or resorting to higher-bit-precision quantization. In this paper, we formalize the linear quantization task as a Minimum Mean Squared Error (MMSE) problem for both weights and activations, allowing low-bit-precision inference without full network retraining. The main contributions of our approach are the optimization of the constrained MSE problem at each layer of the network, hardware-aware partitioning of the network parameters, and the use of multiple low-precision quantized tensors for poorly approximated layers. The proposed approach allows 4-bit integer (INT4) quantization for deployment of pretrained models on limited hardware resources. Multiple experiments on various network architectures show that the suggested method yields state-of-the-art results with minimal loss of task accuracy.
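To make the MMSE formulation concrete, the sketch below grid-searches the quantization scale that minimizes the mean squared reconstruction error of a tensor at 4-bit precision, compared against the naive max-abs scale. The paper's per-layer constrained optimization and multi-tensor refinements go well beyond this simple search.

```python
import numpy as np

# Hedged sketch of the MMSE idea: choose the quantization scale that
# minimizes the MSE between a tensor and its low-bit reconstruction.

def mmse_scale(x, num_bits=4, candidates=200):
    qmax = 2 ** (num_bits - 1) - 1
    best_scale, best_err = None, np.inf
    for s in np.linspace(np.abs(x).max() / qmax / 4,
                         np.abs(x).max() / qmax, candidates):
        q = np.clip(np.round(x / s), -qmax - 1, qmax)     # quantize
        err = np.mean((q * s - x) ** 2)                   # reconstruction MSE
        if err < best_err:
            best_scale, best_err = s, err
    return best_scale, best_err

w = np.random.randn(4096).astype(np.float32)
naive = np.abs(w).max() / (2 ** 3 - 1)                    # max-abs INT4 scale
s, e = mmse_scale(w)
print(f"max-abs scale={naive:.4f}  mmse scale={s:.4f}  mse={e:.6f}")
```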
2018 IEEE International Conference on Big Data (Big Data), 2018
Recent research has focused on Deep Neural Networks (DNNs) implemented directly in hardware. However, larger DNNs require significant energy and area, thereby limiting their wide adoption. We propose a novel DNN quantization technique and a corresponding hardware solution, CompactNet, that further optimizes the use of hardware resources through dynamic allocation of memory for each parameter. Experimental results on the MNIST and CIFAR-10 datasets show that CompactNet reduces the memory requirement by over 80%, the energy requirement by 12-fold, and the area requirement by 7-fold compared to a conventional DNN, with minimal degradation in classification accuracy. We demonstrate that CompactNet provides Pareto-optimal designs for trading off accuracy against resource requirements. The applications of CompactNet can be extended to datasets like ImageNet and to models like MobileNet. Index Terms: approximate computing, deep neural networks, low-power design, ASIC.
Proceedings of the Conference for Next Generation Arithmetic 2019
Deep neural networks (DNNs) have been demonstrated as effective prognostic models across various domains, e.g., natural language processing, computer vision, and genomics. However, modern-day DNNs demand high compute and memory storage for executing any reasonably complex task. To optimize inference time and reduce the power consumption of these networks, DNN accelerators with low-precision representations of data and DNN parameters are being actively studied. An interesting research question is how low-precision networks can be ported to edge devices with performance similar to high-precision networks. In this work, we employ the fixed-point, floating-point, and posit numerical formats at ≤8-bit precision within a DNN accelerator, Deep Positron, with exact multiply-and-accumulate (EMAC) units for inference. A unified analysis quantifies the trade-offs between overall network efficiency and performance across five classification tasks. Our results indicate that posits are a natural fit for DNN inference, outperforming at ≤8-bit precision, and can be realized with competitive resource requirements relative to floating point.