2021, Proceedings of the 5th International Workshop on Embedded and Mobile Deep Learning
The ever-growing computational demands of increasingly complex machine learning models frequently necessitate the use of powerful cloud-based infrastructure for their training. Binary neural networks are known to be promising candidates for on-device inference due to their extreme compute and memory savings over higher-precision alternatives. However, their existing training methods require the concurrent storage of high-precision activations for all layers, generally making learning on memory-constrained devices infeasible. In this paper, we demonstrate that the backward propagation operations needed for binary neural network training are strongly robust to quantization, thereby making on-the-edge learning with modern models a practical proposition. We introduce a low-cost binary neural network training strategy exhibiting sizable memory footprint and energy reductions while inducing little to no accuracy loss versus Courbariaux & Bengio's standard approach. These resource decreases are primarily enabled by retaining activations exclusively in binary format. Against the latter algorithm, our drop-in replacement sees coincident memory requirement and energy consumption drops of 2-6×, while reaching similar test accuracy in comparable time, across a range of small-scale models trained to classify popular datasets. We also demonstrate from-scratch ImageNet training of binarized ResNet-18, achieving a 3.12× memory reduction. Such savings will allow unnecessary cloud offloading to be avoided, reducing latency, increasing energy efficiency and safeguarding privacy.
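As a minimal illustration of why binary-only activation storage can suffice, consider how sign binarization is commonly trained with the straight-through estimator (STE). In the NumPy sketch below (our own simplification, not the paper's implementation), the backward pass needs only one bit per activation element, the in-range mask, rather than the full-precision activation:

    import numpy as np

    # Sketch of a sign-binarization step whose backward pass needs 1 bit/element.
    # The STE passes gradients where |x| <= 1 and blocks them elsewhere, so only
    # a binary mask must be saved between the forward and backward passes.

    def binarize_forward(x):
        x_bin = np.where(x >= 0, 1.0, -1.0)   # activations in {-1, +1}
        in_range = np.abs(x) <= 1.0           # 1-bit state saved for backward
        return x_bin, in_range

    def binarize_backward(grad_out, in_range):
        # STE: dL/dx = dL/dy inside [-1, 1], zero outside.
        return grad_out * in_range

    x = np.random.randn(4, 8).astype(np.float32)
    y, saved = binarize_forward(x)            # 'saved' can be bit-packed
    grad_x = binarize_backward(np.ones_like(y), saved)
    print(y[0, :4], saved[0, :4], grad_x[0, :4])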
ArXiv, 2021
Recent works on Binary Neural Networks (BNNs) have made promising progress in narrowing the accuracy gap of BNNs to their 32-bit counterparts. However, the accuracy gains are often based on specialized model designs using additional 32-bit components. Furthermore, almost all previous BNNs use 32-bit for feature maps and the shortcuts enclosing the corresponding binary convolution blocks, which helps to effectively maintain accuracy, but is not friendly to hardware accelerators with limited memory, energy, and computing resources. Thus, we raise the following question: "How can accuracy and energy consumption be balanced in a BNN design?" We extensively study this fundamental problem in this work and propose a novel BNN architecture without most commonly used 32-bit components: BoolNet. Experimental results on ImageNet demonstrate that BoolNet can achieve 4.6× energy reduction coupled with 1.2% higher accuracy than the commonly used BNN architecture Bi-RealNet [30]. Code ...
International Conference on Learning Representations, 2018
There are many application scenarios for which the computational performance and memory footprint of the prediction phase of Deep Neural Networks (DNNs) need to be optimized. Binary Deep Neural Networks (BDNNs) have been shown to be an effective way of achieving this objective. In this paper, we show how Convolutional Neural Networks (CNNs) can be implemented using binary representations. Espresso is a compact yet powerful library written in C/CUDA that features all the functionality required for the forward propagation of CNNs, in a binary file of less than 400 KB, without any external dependencies. Although it is mainly designed to take advantage of massive GPU parallelism, Espresso also provides an equivalent CPU implementation for CNNs. Espresso provides special convolutional and dense layers for binary CNNs (BCNNs), leveraging bit-packing and bitwise computations for efficient execution. These techniques provide a speed-up of matrix-multiplication routines and, at the same time, reduce memory usage when storing parameters and activations. We experimentally show that Espresso is significantly faster than existing implementations of optimized binary neural networks (≈ 2 orders of magnitude). Espresso is released under the Apache 2.0 license and is publicly available.
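The bitwise core that Espresso-style libraries build on can be sketched compactly. The snippet below (NumPy; the packing layout and function names are ours, not Espresso's API) packs {-1, +1} vectors into bit arrays and evaluates their dot product as n - 2·popcount(a XOR b):

    import numpy as np

    def pack_pm1(v):
        # Pack a {-1, +1} vector into bits: +1 -> 1, -1 -> 0 (8 elements per byte).
        return np.packbits((v > 0).astype(np.uint8))

    def binary_dot(a_packed, b_packed, n):
        # Differing bit positions are exactly the -1 products, so
        # dot = (#matches) - (#mismatches) = n - 2 * popcount(a XOR b).
        mismatches = int(np.unpackbits(np.bitwise_xor(a_packed, b_packed)).sum())
        return n - 2 * mismatches

    n = 1024
    a = np.where(np.random.randn(n) >= 0, 1.0, -1.0)
    b = np.where(np.random.randn(n) >= 0, 1.0, -1.0)
    assert binary_dot(pack_pm1(a), pack_pm1(b), n) == int(a @ b)

One XOR plus one popcount replaces dozens of multiply-accumulates per machine word, which is where both the speed-up and the memory reduction for parameters and activations come from.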
arXiv (Cornell University), 2022
Binary Neural Networks (BNNs) are receiving an upsurge of attention for bringing power-hungry deep learning towards edge devices. The traditional wisdom in this space is to employ sign(·) for binarizing feature maps. We argue and illustrate that sign(·) is a uniqueness bottleneck, limiting information propagation throughout the network. To alleviate this, we propose to dispense with sign(·), replacing it with a learnable activation binarizer (LAB), allowing the network to learn a fine-grained binarization kernel per layer, as opposed to global thresholding. LAB is a novel universal module that can seamlessly be integrated into existing architectures. To confirm this, we plug it into four seminal BNNs and show a considerable accuracy boost at the cost of a tolerable increase in delay and complexity. Finally, we build an end-to-end BNN (coined LAB-BNN) around LAB, and demonstrate that it achieves competitive performance on par with the state-of-the-art on ImageNet. Our code can be found at https://github.com/sfalkena/LAB.
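The paper's LAB module learns a convolutional binarization kernel; the stand-in below (NumPy, deliberately simplified, with details that differ from LAB) conveys only the basic departure from global thresholding by giving each channel its own learnable threshold:

    import numpy as np

    def learnable_binarize(x, t):
        # x: activations of shape (N, C, H, W); t: per-channel thresholds (C,).
        # Global sign(x) corresponds to t = 0 everywhere; here t would be
        # trained (via a straight-through estimator) instead of being fixed.
        return np.where(x >= t[None, :, None, None], 1.0, -1.0)

    x = np.random.randn(2, 4, 8, 8).astype(np.float32)
    t = np.zeros(4, dtype=np.float32)   # learned parameters in the real module
    print(learnable_binarize(x, t).shape)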
2022
Applications of neural networks on edge systems have proliferated in recent years, but ever-increasing model sizes prevent neural networks from being deployed efficiently on resource-constrained microcontrollers. We propose bit-serial weight pools, an end-to-end framework that includes network compression and acceleration at arbitrary sub-byte precision. The framework can achieve up to 8× compression compared to 8-bit networks by sharing a pool of weights across the entire network. We further propose a bit-serial, lookup-based software implementation that allows a runtime bitwidth tradeoff and achieves more than 2.8× speedup and 7.5× storage compression compared to 8-bit weight pool networks, with less than 1% accuracy drop.
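A simplified view of the weight-pool idea (our scalar-codebook rendering; the paper pools weight vectors and adds a bit-serial lookup kernel on top) is that every layer stores only small integer indices into one network-wide table of shared values:

    import numpy as np

    # A 2-bit pool: four shared weight values for the whole network.
    pool = np.array([-0.5, -0.1, 0.1, 0.5], dtype=np.float32)

    # Each layer keeps sub-byte indices instead of full-precision weights.
    idx = np.random.randint(0, len(pool), size=(64, 32)).astype(np.uint8)

    W = pool[idx]                        # reconstruct weights on the fly
    x = np.random.randn(32).astype(np.float32)
    y = W @ x                            # ordinary layer computation
    print(y.shape)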
2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), 2019
Recent machine learning methods use increasingly large deep neural networks to achieve state-of-the-art results in various tasks. The gains in performance come at the cost of a substantial increase in computation and storage requirements. This makes real-time implementations on resource-limited hardware a challenging task. One popular approach to address this challenge is to perform low-bit precision computations via neural network quantization. However, aggressive quantization generally entails a severe penalty in terms of accuracy, and often requires retraining of the network, or resorting to higher-bit precision quantization. In this paper, we formalize the linear quantization task as a Minimum Mean Squared Error (MMSE) problem for both weights and activations, allowing low-bit precision inference without the need for full network retraining. The main contributions of our approach are the optimization of the constrained MSE problem at each layer of the network, the hardware-aware partitioning of the network parameters, and the use of multiple low-precision quantized tensors for poorly approximated layers. The proposed approach allows 4-bit integer (INT4) quantization for deployment of pretrained models on limited hardware resources. Multiple experiments on various network architectures show that the suggested method yields state-of-the-art results with minimal loss of task accuracy.
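A brute-force flavor of the per-layer MMSE step reads as follows (NumPy sketch under our assumptions; the paper solves the constrained MSE problem per layer rather than by this simple grid search):

    import numpy as np

    def quantize(x, scale, bits):
        qmax = 2 ** (bits - 1) - 1
        return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

    def mmse_scale(x, bits, n_grid=100):
        # Search the clipping range that minimizes E[(x - Q(x))^2].
        best_s, best_err = None, np.inf
        for frac in np.linspace(0.1, 1.0, n_grid):
            s = frac * np.abs(x).max() / (2 ** (bits - 1) - 1)
            err = np.mean((x - quantize(x, s, bits)) ** 2)
            if err < best_err:
                best_s, best_err = s, err
        return best_s

    w = np.random.randn(10_000).astype(np.float32)
    s = mmse_scale(w, bits=4)
    print(s, np.mean((w - quantize(w, s, 4)) ** 2))

Clipping below the full range trades rare large outliers for finer resolution on the bulk of the distribution, which is why an MMSE scale usually beats naive max-scaling at 4 bits.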
ArXiv, 2022
For binary neural networks (BNNs) to become the mainstream on-device computer vision algorithm, they must achieve a speed-vs-accuracy tradeoff superior to 8-bit quantization and establish a similar degree of general applicability in vision tasks. To this end, we propose a BNN framework comprising 1) a minimalistic inference scheme for hardware-friendliness, 2) an over-parameterized training scheme for high accuracy, and 3) a simple procedure to adapt to different vision tasks. The resultant framework overtakes 8-bit quantization in the speed-vs-accuracy tradeoff for classification, detection, segmentation, super-resolution and matching: our BNNs not only retain the accuracy levels of their 8-bit baselines but also showcase 1.3-2.4× faster FPS on mobile CPUs. Similar conclusions can be drawn for prototypical systolic-array-based AI accelerators, where our BNNs promise 2.8-7× fewer execution cycles than 8-bit and 2.1-2.7× fewer cycles than alternative BNN designs. These results suggest...
Computational Intelligence and Neuroscience, 2020
The increase in sophistication of neural network models in recent years has exponentially expanded memory consumption and computational cost, thereby hindering their applications on ASICs, FPGAs, and other mobile devices. Therefore, compressing and accelerating neural networks is necessary. In this study, we introduce a novel strategy to train low-bit networks with weights and activations quantized to several bits, and address two corresponding fundamental issues. One is to approximate activations through low-bit discretization to decrease network computational cost and dot-product memory. The other is to specify the weight quantization and update mechanism for discrete weights to avoid gradient mismatch. With quantized low-bit weights and activations, costly full-precision operations can be replaced by shift operations. We evaluate the proposed method on common datasets, and the results show that this method can dramatically compress the neural network with slight accuracy loss.
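One common way to make multiplications collapse into shifts, in the spirit this abstract describes (our illustrative quantizer, not necessarily the paper's exact mechanism), is to round weights to signed powers of two:

    import numpy as np

    def quantize_pow2(w, min_exp=-4, max_exp=0):
        # Round |w| to the nearest power of two in [2^min_exp, 2^max_exp];
        # the stored exponent turns each multiply into an integer bit-shift.
        sign = np.sign(w)
        exp = np.clip(np.round(np.log2(np.abs(w) + 1e-12)), min_exp, max_exp)
        return sign * np.exp2(exp), exp.astype(np.int8)

    w = (np.random.randn(8) * 0.5).astype(np.float32)
    wq, exps = quantize_pow2(w)
    print(np.round(w, 3))
    print(np.round(wq, 3), exps)   # x * 2^e == x << e (or >> -e) on hardware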
2020 IEEE Winter Conference on Applications of Computer Vision (WACV), 2020
MobileNet and Binary Neural Networks are two of the most widely used techniques for constructing deep learning models that perform a variety of tasks on mobile and embedded platforms. In this paper, we present a simple yet efficient scheme to exploit MobileNet binarization at the activation function and model weights. However, training a binary network from scratch with separable depth-wise and point-wise convolutions, as in MobileNet, is not trivial and is prone to divergence. To tackle this training issue, we propose a novel neural network architecture, MoBiNet (Mobile Binary Network), in which skip connections are manipulated to prevent information loss and vanishing gradients, thus facilitating the training process. More importantly, while existing binary neural networks often make use of cumbersome backbones such as AlexNet, ResNet, and VGG-16 with float-type pre-trained weight initialization, our MoBiNet focuses on binarizing already-compressed neural networks like MobileNet without the need for a pre-trained model to start with. Therefore, our proposal results in an effectively small model while keeping accuracy comparable to existing ones. Experiments on the ImageNet dataset show the potential of MoBiNet as it achieves 54.40% top-1 accuracy and dramatically reduces computational cost with binary operators.
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
The massive adoption of IoT devices, recent developments in the efficiency of AI systems, and the increase of edge computational power have accelerated the deployment of edge AI systems. Implementing these systems using low-power embedded devices scattered across the edges of a network allows for reduced latency and cost compared to traditional cloud-based AI computing systems. Given the low-complexity AI models and low-power embedded systems available on the market, this paper provides a comparative study of the inference performance of convolutional neural networks on different edge devices, exploiting low-power GPUs and dedicated AI hardware. The benchmarks achieved 864 inferences/s for the Jetson AGX Xavier board on a pre-trained SqueezeNet, while reaching a high power efficiency of 52.6 inferences/s per watt. The dedicated Movidius neural compute stick requires only 1.5 W to process 24.2 inferences/s.
2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing - NeurIPS Edition (EMC2-NIPS), 2019
Quantization of deep neural networks has afforded models for edge devices that use less on-board memory and enable efficient low-power inference. In this paper, we present a comparison of model-parameter-driven quantization approaches that can achieve as low as 3-bit precision without affecting accuracy. The post-training quantization approaches are data-free, and the resulting weight values are closely tied to the dataset distribution on which the model converged to optimality. We show quantization results for a number of state-of-the-art deep neural networks (DNNs) using large datasets such as ImageNet. To better analyze the quantization results, we describe the overall range and local sparsity of values afforded by various quantization schemes. We show methods to lower bit-precision beyond quantization limits with object class clustering.
2019
Deep learning algorithms achieve high classification accuracy at the expense of significant computation cost. In order to reduce this cost, several quantization schemes have gained attention recently with some focusing on weight quantization, and others focusing on quantizing activations. This paper proposes novel techniques that individually target weight and activation quantizations resulting in an overall quantized neural network (QNN). Our activation quantization technique, PArameterized Clipping acTivation (PACT), uses an activation clipping parameter α that is optimized during training to find the right quantization scale. Our weight quantization scheme, statistics-aware weight binning (SAWB), finds the optimal scaling factor that minimizes the quantization error based on the statistical characteristics of weight distribution without the need for an exhaustive search. Furthermore, we provide an innovative insight for quantization in the presence of shortcut connections, which ...
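The PACT mechanism described above is compact enough to sketch directly (NumPy, forward pass plus the alpha gradient; training beyond this follows the usual straight-through estimator recipe, and details may differ from the paper's implementation):

    import numpy as np

    def pact_forward(x, alpha, bits):
        # Clip activations to the learnable bound [0, alpha], then quantize
        # uniformly to 2^bits levels over that range.
        y = np.clip(x, 0.0, alpha)
        scale = alpha / (2 ** bits - 1)
        return np.round(y / scale) * scale

    def pact_alpha_grad(x, grad_out, alpha):
        # dL/dalpha accumulates from elements clipped at the upper bound,
        # where dy/dalpha = 1; elsewhere it is zero.
        return float(np.sum(grad_out * (x >= alpha)))

    x = (np.random.randn(16) * 2).astype(np.float32)
    print(pact_forward(x, alpha=1.0, bits=4))
    print(pact_alpha_grad(x, np.ones_like(x), alpha=1.0))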
IEEE Access, 2020
High computational requirements and heavy memory costs are the significant issues that limit the deployability of Convolutional Neural Networks in the resource-constrained environments typically found in edge devices of the Internet-of-Things (IoT). To address the problem, binary and ternary networks have been proposed, constraining the weights to reduce computational and memory costs. However, owing to the binary or ternary values, backward propagation is not as efficient as normal during training, which makes these models tough to train on edge devices. In this paper, we find a different way to resolve the problem and propose the Fixed-Sign Binary Neural Network (FSB), which decomposes the convolution kernel into sign and scaling factor as in prior research, but trains only the scaling factors instead of both. By doing so, our FSB keeps the signs out of backward propagation and makes models easy to deploy and train on IoT devices. Meanwhile, the convolution-acceleration architecture we design for FSB reduces the computing burden while achieving the same performance. Thanks to the efficiency of our FSB, even though we randomly initialize the signs and fix them to be untrainable, FSB still achieves remarkable performance. Index Terms: convolutional neural network (CNN), Internet-of-Things (IoT), model compression, resource-constrained environment.
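The FSB decomposition is easy to state concretely. In the sketch below (NumPy, our rendering of the idea), the weight tensor is alpha * S with a fixed random sign matrix S, and only the per-filter scales alpha ever receive gradients:

    import numpy as np

    rng = np.random.default_rng(0)
    S = rng.choice([-1.0, 1.0], size=(16, 9))     # fixed, untrainable signs
    alpha = np.ones(16, dtype=np.float32)         # trainable per-filter scales

    def fsb_weights():
        return alpha[:, None] * S                 # W = alpha * S, elementwise

    def alpha_grad(grad_W):
        # Chain rule: dL/dalpha_i = sum_j dL/dW_ij * S_ij; the signs are
        # constants, so backward propagation never has to differentiate them.
        return np.sum(grad_W * S, axis=1)

    print(fsb_weights().shape, alpha_grad(np.ones((16, 9))).shape)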
ArXiv, 2021
Top-1 ImageNet optimization promotes enormous networks that may be impractical in inference settings. Binary neural networks (BNNs) have the potential to significantly lower the compute intensity, but existing models suffer from low quality. To overcome this deficiency, we propose PokeConv, a binary convolution block which improves the quality of BNNs via techniques such as adding multiple residual paths and tuning the activation function. We apply it to ResNet-50 and optimize ResNet's initial convolutional layer, which is hard to binarize. We name the resulting network family PokeBNN. These techniques are chosen to yield favorable improvements in both top-1 accuracy and the network's cost. In order to enable joint optimization of cost together with accuracy, we define arithmetic computation effort (ACE), a hardware- and energy-inspired cost metric for quantized and binarized networks. We also identify a need to optimize an under-explored hyper-parameter controlling the binarization gradient approximation.
Algorithms
Deep learning is now present in a wide range of services and applications, replacing and complementing other machine learning algorithms. Performing training and inference of deep neural networks using the cloud computing model is not viable for applications where low latency is required. Furthermore, the rapid proliferation of the Internet of Things will generate a large volume of data to be processed, which will soon overload the capacity of cloud servers. One solution is to process the data at the edge devices themselves, in order to alleviate cloud server workloads and improve latency. However, edge devices are less powerful than cloud servers, and many are subject to energy constraints. Hence, new resource and energy-oriented deep learning models are required, as well as new computing platforms. This paper reviews the main research directions for edge computing deep learning algorithms.
Cornell University - arXiv, 2022
Deep Learning has been one of the most disruptive technological advancements in recent times. The high performance of deep learning models comes at the expense of high computational, storage and power requirements. Sensing the immediate need for accelerating and compressing these models to improve on-device performance, we introduce Deeplite Neutrino for production-ready optimization of the models and Deeplite Runtime for deployment of ultra-low-bit quantized models on Arm-based platforms. We implement low-level quantization kernels for Armv7 and Armv8 architectures, enabling deployment on the vast array of 32-bit and 64-bit Arm-based devices. With efficient implementations using vectorization, parallelization, and tiling, we realize speedups of up to 2× and 2.2× compared to TensorFlow Lite with the XNNPACK backend on classification and detection models, respectively. We also achieve significant speedups of up to 5× and 3.2× compared to ONNX Runtime for classification and detection models, respectively.
ArXiv, 2018
Deep learning as a means to inferencing has proliferated thanks to its versatility and ability to approach or exceed human-level accuracy. These computational models have seemingly insatiable appetites for computational resources not only while training, but also when deployed at scales ranging from data centers all the way down to embedded devices. As such, increasing consideration is being given to maximizing computational efficiency under limited hardware and energy resources and, as a result, inferencing with reduced precision has emerged as a viable alternative to the IEEE 754 Standard for Floating-Point Arithmetic. We propose a quantization scheme that allows inferencing to be carried out using arithmetic that is fundamentally more efficient than even half-precision floating-point. Our quantization procedure is significant in that we determine our quantization scheme parameters by calibrating against the reference floating-point model using a single inference batch.
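The single-batch calibration step resembles the common max-calibration recipe sketched below (NumPy; the paper's exact parameter-fitting procedure may differ):

    import numpy as np

    def calibrate_scale(activations, bits=8):
        # Symmetric per-tensor scale from the range observed on one batch.
        return np.abs(activations).max() / (2 ** (bits - 1) - 1)

    batch_acts = np.random.randn(32, 128).astype(np.float32)  # hypothetical layer output
    s = calibrate_scale(batch_acts)
    q = np.clip(np.round(batch_acts / s), -128, 127).astype(np.int8)
    print(s, q.min(), q.max())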
2020
Convolutional neural networks (CNNs) have been widely used in many tasks, but training CNNs is time-consuming and energy-hungry. Using the low-bit integer format has proven promising for speeding up CNN inference and improving its energy efficiency, while the training phase of CNNs can hardly benefit from such a technique because of the following challenges: (1) the integer data format cannot meet the requirements of the data dynamic range in training, resulting in an accuracy drop; (2) the floating-point data format keeps a large dynamic range with many more exponent bits, resulting in higher accumulation power than the integer format; (3) some specially designed data formats (e.g., with group-wise scaling) have the potential to deal with the former two problems, but common hardware cannot support them efficiently. To tackle all these challenges and let the training phase of CNNs benefit from the low-bit format, we propose a low-bit training framework for convolutional neural networks.
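Group-wise scaling, the format family the abstract alludes to, can be sketched as follows (NumPy; the group size and layout are our choices for illustration):

    import numpy as np

    def quantize_groupwise(x, bits=4, group=8):
        # Each small group of elements gets its own scale, so a 4-bit integer
        # field can track a wide overall dynamic range across the tensor.
        qmax = 2 ** (bits - 1) - 1
        xg = x.reshape(-1, group)
        scales = np.abs(xg).max(axis=1, keepdims=True) / qmax + 1e-12
        q = np.clip(np.round(xg / scales), -qmax - 1, qmax)
        return q.astype(np.int8), scales.astype(np.float32)

    x = np.random.randn(64).astype(np.float32)
    q, s = quantize_groupwise(x)
    x_hat = (q * s).reshape(-1)
    print(np.abs(x - x_hat).max())   # per-group scaling keeps the error small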
Proceedings of the 18th ACM International Conference on Computing Frontiers, 2021
Human Activity Recognition (HAR) is a relevant inference task in many mobile applications. State-of-the-art HAR at the edge is typically achieved with lightweight machine learning models such as decision trees and Random Forests (RFs), whereas deep learning is less common due to its high computational complexity. In this work, we propose a novel implementation of HAR based on deep neural networks, specifically on Binary Neural Networks (BNNs), targeting low-power general-purpose processors with a RISC-V instruction set. BNNs yield very small memory footprints and low inference complexity, thanks to the replacement of arithmetic operations with bit-wise ones. However, existing BNN implementations on general-purpose processors impose constraints tailored to complex computer vision tasks, which result in over-parametrized models for simpler problems like HAR. Therefore, we also introduce a new BNN inference library, which explicitly targets ultra-compact models. With experiments on a single-core RISC-V processor, we show that BNNs trained on two HAR datasets obtain higher classification accuracy compared to a state-of-the-art baseline based on RFs. Furthermore, our BNN reaches the same accuracy as an RF with either less memory (up to 91%) or higher energy efficiency (up to 70%), depending on the complexity of the features extracted by the RF.
2018 IEEE International Conference on Big Data (Big Data), 2018
Recent research has focused on Deep Neural Networks (DNNs) implemented directly in hardware. However, larger DNNs require significant energy and area, thereby limiting their wide adoption. We propose a novel DNN quantization technique and a corresponding hardware solution, CompactNet, that further optimizes the use of hardware resources through dynamic allocation of memory for each parameter. Experimental results for the MNIST and CIFAR-10 datasets show that CompactNet reduces the memory requirement by over 80%, the energy requirement by 12-fold, and the area requirement by 7-fold, compared to a conventional DNN. This is achieved with minimal degradation of classification accuracy. We demonstrate that CompactNet provides Pareto-optimal designs for trading off between accuracy and resource requirements. The applications of CompactNet can be extended to datasets like ImageNet and models like MobileNet. Index Terms: approximate computing, deep neural networks, low-power design, ASIC.
2020
In this paper, we propose a number of novel techniques and numerical representation formats that enable, for the very first time, the precision of training systems to be aggressively scaled from 8 bits to 4 bits. To enable this advance, we explore a novel adaptive Gradient Scaling technique (GradScale) that addresses the challenges of insufficient range and resolution in quantized gradients, and we explore the impact of quantization errors observed during model training. We theoretically analyze the role of bias in gradient quantization and propose solutions that mitigate the impact of this bias on model convergence. Finally, we examine our techniques on a spectrum of deep learning models in computer vision, speech and NLP. In combination with previously proposed solutions for 4-bit quantization of weight and activation tensors, 4-bit training shows no significant loss in accuracy across application domains while enabling significant hardware acceleration (>7× over state-of-the-art FP16 systems).
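A minimal sketch of the gradient-scaling idea (our simplification; GradScale's actual rule is adaptive and is specified in the paper) rescales each gradient tensor into the representable range of the 4-bit format before quantizing, and remembers the scale for dequantization:

    import numpy as np

    def scale_and_quantize_grad(g, bits=4):
        qmax = 2 ** (bits - 1) - 1
        scale = np.abs(g).max() / qmax + 1e-30   # map max |g| onto the 4-bit range
        q = np.clip(np.round(g / scale), -qmax - 1, qmax)
        return q.astype(np.int8), scale          # dequantize later as q * scale

    g = (np.random.randn(256) * 1e-5).astype(np.float32)  # tiny training gradients
    q, s = scale_and_quantize_grad(g)
    print(q.min(), q.max(), s)

Without such rescaling, gradients this small would underflow a fixed 4-bit grid to all zeros; with it, range and resolution are spent where the tensor's values actually lie.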