2021 24th International Symposium on Design and Diagnostics of Electronic Circuits & Systems (DDECS)
The design and implementation of Convolutional Neural Networks (CNNs) for deep learning (DL) is currently receiving a lot of attention from both industry and academia. However, the computational workload involved with CNNs is often out of reach for low-power embedded devices and is still very costly when running on datacenters. By relaxing the need for fully precise operations, approximate computing substantially improves performance and energy efficiency. Deep learning is very relevant in this context, since trading just enough accuracy for adequate computation can significantly enhance performance while keeping the quality of results within a user-constrained range. AdequateDL is a project that explores how approximations can improve the performance and energy efficiency of hardware accelerators for DL applications. This paper presents the main concepts and techniques related to the approximation of CNNs and preliminary results obtained within the AdequateDL framework.
2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2019
The state-of-the-art approaches employ approximate computing to reduce the energy consumption of DNN hardware. Approximate DNNs then require extensive retraining to recover from the accuracy loss caused by the use of approximate operations. However, retraining of complex DNNs does not scale well. In this paper, we demonstrate that efficient approximations can be introduced into the computational path of DNN accelerators while retraining is completely avoided. ALWANN provides highly optimized implementations of DNNs for custom low-power accelerators in which the number of computing units is lower than the number of DNN layers. First, a fully trained DNN (e.g., in TensorFlow) is converted to operate with 8-bit weights and 8-bit multipliers in the convolutional layers. A suitable approximate multiplier is then selected for each computing element from a library of approximate multipliers in such a way that (i) one approximate multiplier serves several layers, and (ii) the overall classification error and energy consumption are minimized. The optimizations, including the multiplier-selection problem, are solved by means of the multi-objective NSGA-II algorithm. To completely avoid the computationally expensive retraining of DNNs, which is usually employed to improve the classification accuracy, we propose a simple weight-updating scheme that compensates for the inaccuracy introduced by the approximate multipliers. The proposed approach is evaluated for two architectures of DNN accelerators with approximate multipliers from the open-source EvoApprox library, while executing three versions of ResNet on CIFAR-10. We report that the proposed approach saves 30% of the energy needed for multiplication in the convolutional layers of ResNet-50 while accuracy is degraded by only 0.6% (0.9% for ResNet-14). The proposed technique and approximate layers are available as an open-source extension of TensorFlow at https://github.com/ehw-fit/tf-approximate.
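To make the retraining-free weight update concrete, here is a minimal Python sketch of the idea described above: for each trained 8-bit weight, a small neighborhood is searched for the replacement value whose approximate products best match the exact products over sampled layer inputs. The `approx_mult` model (a truncation that drops low-order bits of the product) and the search radius are illustrative assumptions, not the actual EvoApprox multipliers used in the paper.

```python
import numpy as np

# Hypothetical 8-bit approximate multiplier model: drop the 4 low-order
# bits of the exact product (a stand-in for an EvoApprox-style circuit).
def approx_mult(w, x):
    return ((w * x) >> 4) << 4

def tune_weight(w, inputs):
    """Retraining-free weight update in the spirit of the scheme above:
    choose the 8-bit weight whose approximate products best match the
    exact products w*x over a sample of layer inputs."""
    candidates = np.arange(max(0, w - 4), min(255, w + 4) + 1)
    errors = [np.abs(approx_mult(int(c), inputs) - w * inputs).sum()
              for c in candidates]
    return int(candidates[int(np.argmin(errors))])

# Usage: map every trained 8-bit weight to its compensated value.
rng = np.random.default_rng(0)
sample_inputs = rng.integers(0, 256, size=1024)
weights = rng.integers(0, 256, size=8)
print([(int(w), tune_weight(int(w), sample_inputs)) for w in weights])
```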
ACM Computing Surveys, 2022
Deep Neural Networks (DNNs) are very popular because of their high performance in various cognitive tasks in Machine Learning (ML). Recent advancements in DNNs have achieved beyond-human accuracy in many tasks, but at the cost of high computational complexity. To enable efficient execution of DNN inference, more and more research works therefore exploit the inherent error resilience of DNNs and employ Approximate Computing (AC) principles to address the elevated energy demands of DNN accelerators. This article provides a comprehensive survey and analysis of hardware approximation techniques for DNN accelerators. First, we analyze the state of the art and, by identifying approximation families, we cluster the respective works with respect to the approximation type. Next, we analyze the complexity of the performed evaluations (with respect to the dataset and DNN size) to assess the efficiency, the potential, and the limitations of approximate DNN accelerators. Moreover, a broad discussion...
Springer eBooks, 2022
The design and implementation of Deep Learning (DL) models is currently receiving a lot of attention from both industry and academia. However, the computational workload associated with DL is often out of reach for low-power embedded devices and is still costly when run on datacenters. By relaxing the need for fully precise operations, Approximate Computing (AxC) substantially improves performance and energy efficiency. DL is extremely relevant in this context, since trading just enough accuracy for adequate computation can significantly enhance performance while keeping the quality of results within a user-constrained range. This chapter explores how AxC can improve the performance and energy efficiency of hardware accelerators in DL applications during inference and training.
Lattice ML Journal, 2020
Deep Neural Networks (DNNs) are among the most powerful machine learning techniques and are becoming increasingly relevant for Big Data applications. In the context of embedded platforms, implementing DNNs that are efficient in terms of performance and energy consumption while maintaining the required quality is very challenging. Sparsity can be used as an effective technique for reducing the size of DNNs. The purpose of this research is to explore the possibilities of introducing sparsity into CNNs and to evaluate their performance.
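As a point of reference, the sketch below shows one common way to introduce sparsity into a trained network, unstructured magnitude pruning: the smallest-magnitude weights are zeroed until a target sparsity is reached. It is only an illustrative baseline, not necessarily the exact sparsification method evaluated in this work.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude entries so that roughly `sparsity`
    fraction of the tensor becomes zero (unstructured magnitude pruning)."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

w = np.random.randn(4, 4).astype(np.float32)
print(magnitude_prune(w, sparsity=0.75))
```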
Applied Reconfigurable Computing. Architectures, Tools, and Applications, 2020
Convolutional Neural Networks (CNNs) are widely used for image classification and object detection applications. Deploying these architectures in embedded applications is a great challenge, which arises from the high computational complexity of CNNs relative to platforms with limited hardware resources such as FPGAs. Since these applications are inherently error-resilient, approximate computing (AC) offers an interesting trade-off between resource utilization and accuracy. In this paper, we study the impact on CNN performance when several approximation techniques are applied simultaneously. We focus on two of the most widely used approximation techniques, namely quantization and pruning. Our experimental results show that, for CNN networks of different parameter sizes and a 3% loss in accuracy, we can obtain a 27.9%-47.2% reduction in computational complexity in terms of FLOPs for the CIFAR-10 and MNIST datasets.
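For context, the following sketch shows post-training uniform 8-bit quantization, one of the two approximation techniques studied here; in a combined setting it would be applied to the weights that survive pruning. The symmetric scheme and the tensor shape are illustrative assumptions.

```python
import numpy as np

def quantize_uniform(x, bits=8):
    """Uniform symmetric quantization of a weight/activation tensor to
    `bits` bits; returns the dequantized tensor and the scale factor."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(x))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale, scale

w = np.random.randn(3, 3, 16, 32).astype(np.float32)   # a conv kernel
w_q, s = quantize_uniform(w, bits=8)
print("max quantization error:", np.max(np.abs(w - w_q)))
```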
IET Computers & Digital Techniques, 2021
Deep Learning (DL) applications are becoming increasingly involved in many fields. Deploying Deep Neural Networks (DNNs) on embedded devices is still a challenging task given their massive computation and storage requirements. Since the number of operations and parameters increases with the complexity of the model architecture, performance strongly depends on the hardware target resources and, in particular, on the memory footprint of the accelerator. Recent research studies have discussed the benefits of implementing complex DL applications based on different models and platforms. However, it is necessary to guarantee the best performance when designing hardware accelerators for DL applications so that they run at full speed despite the constraints of low power, high accuracy, and throughput. Field Programmable Gate Arrays (FPGAs) are promising platforms for the deployment of large-scale DNNs that seek to balance these objectives. Moreover, the growing complexity of DL models has led researchers to apply optimization techniques that make them more hardware-friendly. Herein, the DL concept is presented. Then, a detailed description of the optimization techniques used in recent research works is explored. Finally, a survey of research works aiming to accelerate the implementation of DNN models on FPGAs is provided.
Artificial Intelligence Review
Deep neural networks (DNNs) have made significant achievements in a wide variety of domains. For deep learning tasks, multiple excellent hardware platforms provide efficient solutions, including graphics processing units (GPUs), central processing units (CPUs), field-programmable gate arrays (FPGAs), and application-specific integrated circuits (ASICs). Nonetheless, CPUs outperform other solutions, including GPUs, in many cases for DNN inference workloads, with the support of various techniques such as high-performance libraries that serve as the basic building blocks for DNNs. Thus, CPUs have been a preferred choice for DNN inference applications, particularly in low-latency scenarios. However, DNN inference efficiency remains a critical issue, especially when low latency is required under conditions with limited hardware resources, such as embedded systems. At the same time, hardware features have not been fully exploited for DNNs and there is much room for i...
2018 Design, Automation & Test in Europe Conference & Exhibition (DATE)
Artificial intelligence, and especially Machine Learning, has recently gained a lot of interest from industry. Indeed, a new generation of neural networks built with a large number of successive computing layers enables a wide range of new applications and services, implemented from smart sensors to data centers. These Deep Neural Networks (DNNs) can interpret signals to recognize objects or situations and drive decision processes. However, their integration into embedded systems remains challenging due to their high computing needs. This paper presents PNeuro, a scalable energy-efficient hardware accelerator for the inference phase of DNN processing chains. Simple programmable processing elements arranged in SIMD clusters perform all the operations needed by DNNs (convolutions, pooling, non-linear functions, etc.). An FDSOI 28 nm prototype shows an energy efficiency of 700 GMAC/s/W at 800 MHz. These results open important perspectives for the development of smart, energy-efficient solutions based on Deep Neural Networks.
ArXiv, 2017
It remains a challenge to run Deep Learning on devices with a stringent power budget in the Internet-of-Things. This paper presents a low-power accelerator for processing Deep Neural Networks on embedded devices. The power reduction is realized by avoiding multiplications of near-zero-valued data. A near-zero approximation and a dedicated Near-Zero Approximation Unit (NZAU) are proposed to predict and skip near-zero multiplications under certain thresholds. Compared with skipping only zero-valued computations, our design achieves 1.92X and 1.51X further reduction of the total multiplications in LeNet-5 and AlexNet respectively, with negligible loss of accuracy. In the proposed accelerator, 256 multipliers are grouped into 16 independent Processing Lanes (PLs) to support up to 16 neuron activations simultaneously. With the help of data pre-processing and buffering in each PL, multipliers can be clock-gated most of the time, even when data is continuously streaming in. Designed and si...
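A software model of the near-zero skipping idea is sketched below: multiplications whose activation magnitude falls under a threshold are simply not performed. The threshold value and the dot-product granularity are assumptions for illustration; the NZAU itself is a hardware predictor and is not modeled here.

```python
import numpy as np

def nz_dot(weights, activations, threshold=0.05):
    """Dot product that skips multiplications whose activation is near
    zero, emulating in software the idea of predicting and skipping
    near-zero products."""
    keep = np.abs(activations) >= threshold
    skipped = activations.size - int(keep.sum())
    return float(weights[keep] @ activations[keep]), skipped

w = np.random.randn(256)
a = np.maximum(np.random.randn(256), 0)   # ReLU outputs: many near-zero values
y, n_skipped = nz_dot(w, a, threshold=0.05)
print(f"result={y:.3f}, skipped {n_skipped}/256 multiplications")
```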
2018
Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications, including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, they come at the cost of high computational complexity. Accordingly, there has been a significant amount of research on energy-efficient processing of DNNs, from the design of efficient DNN algorithms to the design of efficient DNN processors. However, in surveying these techniques, we found certain limitations in the approaches used in this large body of work that need to be addressed. First, the number of weights and MACs is not sufficient for evaluating the energy consumption of DNNs; rather than focusing on weights and MACs, designers of efficient DNN algorithms should more directly target energy and incorporate it into their designs. Second, the wide range of techniques used for efficient DNN algorithm design has resulted in a ...
Journal of Low Power Electronics, 2018
Growing interest in the development of smart Cyber-Physical Systems (CPS) and the Internet of Things (IoT) has motivated researchers to explore the suitability of embedded machine learning. This has enabled a new age of smart CPS and IoT with emerging applications like autonomous vehicles, smart cities and homes, advanced robotics, IoT-healthcare, and Industry 4.0. Due to the availability of huge amounts of data and compute power, Deep Neural Networks (DNNs) have become one of the enabling technologies behind this current age of machine learning and intelligent systems. The benefits of DNNs, however, come at a high computational cost and require a tremendous amount of energy/power resources that are typically not available on (embedded) IoT and CPS devices, especially on IoT-edge nodes. To improve the performance and energy/power efficiency of these DNNs, this paper presents a cross-layer approximation methodology which exploits the error resiliency offered by DNNs at various hardware and software layers of the computing stack. We present case studies at both the software and hardware levels to demonstrate the energy benefits of the proposed methodology. At the software level we provide a systematic pruning methodology, while at the hardware level we provide a case study on the approximation of the multipliers used for the weighted-sum operation in the neural processing of DNNs.
2018 IEEE International Conference on Big Data (Big Data), 2018
Recent research has focused on Deep Neural Networks (DNNs) implemented directly in hardware. However, larger DNNs require significant energy and area, thereby limiting their wide adoption. We propose a novel DNN quantization technique and a corresponding hardware solution, CompactNet, that optimizes the use of hardware resources even further through dynamic allocation of memory for each parameter. Experimental results for the MNIST and CIFAR-10 datasets show that CompactNet reduces the memory requirement by over 80%, the energy requirement by 12-fold, and the area requirement by 7-fold when compared to a conventional DNN. This is achieved with minimal degradation of classification accuracy. We demonstrate that CompactNet provides Pareto-optimal designs for trading off accuracy against resource requirements. The applications of CompactNet can be extended to datasets like ImageNet and to models like MobileNet. Index Terms: approximate computing, deep neural networks, low power design, ASIC.
2020 18th IEEE International New Circuits and Systems Conference (NEWCAS), 2020
Convolutional Neural Networks (CNNs) have shown outstanding accuracy for many vision tasks in recent years. When deploying CNNs on portable devices and embedded systems, however, the large number of parameters and computations results in long processing times and short battery life. An important factor in designing CNN hardware accelerators is to efficiently map the convolution computation onto hardware resources. In addition, to save battery life and reduce energy consumption, it is essential to reduce the number of DRAM accesses, since DRAM consumes orders of magnitude more energy than other operations in hardware. In this paper, we propose an energy-efficient architecture which maximally utilizes its computational units for convolution operations while requiring a low number of DRAM accesses. The implementation results show that the proposed architecture performs one image recognition task using the VGGNet model with a latency of 393 ms and only 251.5 MB of DRAM accesses.
International Journal of Advanced Trends in Computer Science and Engineering, 2020
In today's technology era, Convolutional Neural Networks (CNNs) are in the limelight for various cognitive tasks because of their high accuracy. With the increasing complexity of applications, CNNs present high computation and storage demands, which call for customized hardware support to boost their performance. The streaming nature of CNN workloads makes them suitable for hardware implementations such as FPGAs and ASICs. Providing sufficient resources alone cannot solve this difficulty, which makes Approximate Computing a solution. This article gives an insight into the various approximate computing techniques used to accelerate CNNs at multiple levels of hardware implementation. The survey has been conducted by considering different metrics: the approximation technique used, the datasets used for evaluation, the network structure (AlexNet, LeNet, Visual Geometry Group (VGG)), the hardware platform for implementation (Application-Specific Integrated Circuit (ASIC) or Field Programmable Gate Array (FPGA)), the training or testing phase, and the results (in terms of accuracy, area, power, throughput, and resource utilization). The approximate computation techniques applied at the various levels of the network and its layers are discussed. Comparisons have also been made to assess the utility of these techniques for yielding significant performance gains with minimal losses in accuracy. Methods are presented along with recent contributions in state-of-the-art image processing applications, together with various future outlooks based on the studies made.
2018 Design, Automation & Test in Europe Conference & Exhibition (DATE)
Deep Neural Networks (DNNs) have emerged as a powerful and versatile set of techniques to address challenging artificial intelligence (AI) problems. Applications in domains such as image/video processing, natural language processing, speech synthesis and recognition, genomics, and many others have embraced deep learning as the foundational technique. DNNs achieve superior accuracy for these applications using very large models which require hundreds of MBs of data storage, ExaOps of computation, and high bandwidth for data movement. Despite advances in computing systems, training state-of-the-art DNNs on large datasets takes several days or weeks, directly limiting the pace of innovation and adoption. In this paper, we discuss how these challenges can be addressed via approximate computing. Based on our earlier studies demonstrating that DNNs are resilient to numerical errors from approximate computing, we present techniques to reduce the communication overhead of distributed deep learning training via Adaptive Residual Gradient Compression (AdaComp), and the computation cost of deep learning inference via PArameterized Clipping acTivation (PACT) based network quantization. Experimental evaluation demonstrates order-of-magnitude savings in communication overhead for training and in computational cost for inference while not compromising application accuracy.
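The PACT part of this pipeline can be stated in a few lines: activations are clipped to a range [0, α] and then quantized uniformly. In the paper α is a learned parameter trained by back-propagation; the sketch below fixes it to a constant for illustration, and the bit width is an assumption.

```python
import numpy as np

def pact(x, alpha=6.0, bits=4):
    """PACT-style activation quantization: clip activations to [0, alpha]
    (alpha is learned in the paper; fixed here for the sketch), then
    quantize uniformly to `bits` bits."""
    clipped = np.clip(x, 0.0, alpha)
    levels = 2 ** bits - 1
    return np.round(clipped * levels / alpha) * alpha / levels

acts = np.random.randn(8) * 4
print(pact(acts, alpha=6.0, bits=4))
```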
IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2018
Deep Neural Networks (DNNs) have emerged as the state-of-the-art technique in a wide range of machine learning tasks for analytics and computer vision in the next generation of embedded (mobile, IoT, wearable) devices. Despite their success, they suffer from high energy requirements. In recent years, the inherent error resiliency of DNNs has been exploited by introducing approximations at either the algorithmic or the hardware level (individually) to obtain energy savings while incurring tolerable accuracy degradation. However, there is a need to investigate the overall energy-accuracy trade-offs arising from introducing approximations at different levels in complex DNNs. We perform a comprehensive analysis to determine the effectiveness of cross-layer approximations for the energy-efficient realization of large-scale DNNs. The approximations considered are: (i) use of lower-complexity networks (containing a smaller number of layers and/or neurons per layer), (ii) pruning of synaptic weights, (iii) approximate multiplication in the neuronal MAC (Multiply-and-Accumulate) computation, and (iv) approximate write/read operations to/from the synaptic memory. Our experiments on recognition benchmarks (MNIST, CIFAR-10) show that cross-layer approximation provides substantial improvements in energy efficiency for different accuracy/quality requirements. Furthermore, we propose a synergistic framework for combining the approximation techniques to achieve maximal energy benefits from approximate DNNs.
2020 IEEE International Symposium on Circuits and Systems (ISCAS), 2020
Emerging intelligent embedded devices rely on Deep Neural Networks (DNNs) to interact with the real-world environment. This interaction comes with the ability to retrain DNNs, since environmental conditions change continuously over time. Stochastic Gradient Descent (SGD) is a widely used algorithm that trains DNNs by iteratively optimizing the parameters over the training data. In this work, we first present a novel approach to add training ability to a baseline DNN accelerator (inference only) by splitting the SGD algorithm into simple computational elements. Then, based on this heuristic approach, we propose TaxoNN, a lightweight accelerator for DNN training. TaxoNN can easily tune the DNN weights by reusing the hardware resources used in the inference process through a time-multiplexing approach and low-bitwidth units. Our experimental results show that TaxoNN delivers, on average, a 0.97% higher misclassification rate compared to a full-precision implementation, while providing 2.1× power savings and 1.65× area reduction over the state-of-the-art DNN training accelerator.
IEEE Access, 2019
Leveraging deep convolutional neural networks (DCNNs) for various application areas has become a recent inclination of many machine learning practitioners due to their impressive performance. Research trends show that state-of-the-art networks are getting deeper and deeper, and such networks have shown significant performance increases. Deeper and larger neural networks imply an increase in computational intensity and memory footprint. This is particularly a problem for inference-based applications on resource-constrained computing platforms. On the other hand, field-programmable gate arrays (FPGAs) are becoming a promising choice for hardware solutions for most deep learning implementations due to their high performance and low power consumption. With the rapid emergence of various state-of-the-art CNN architectures, a flexible CNN hardware processor that can handle different CNN architectures and yet customize itself to achieve higher resource efficiency and optimum performance is critically important. In this paper, a novel and highly flexible DCNN processor, MulNet, is proposed. MulNet can be used to process most regular state-of-the-art CNN variants while maximizing resource utilization of a target device. Processing cores with and without multipliers are employed to achieve that. We formulated an optimum fixed-point quantization format for MulNet by analyzing layer-by-layer quantization error. We also created a power-of-2 quantization for the multiplier-free (MF) processing core of MulNet. Both quantizations significantly reduce the memory space needed and the logic consumption in the target device. We utilized Xilinx Zynq SoCs to leverage the single-die hybrid (CPU and FPGA) architecture. We devised a scheme that utilizes the Zynq processing system (PS) for memory-intensive layers and the Zynq programmable logic (PL) for computationally intensive layers. We implemented a modified LeNet, CIFAR-10 full, ConvNet processor (CNP), MPCNN, and AlexNet to evaluate MulNet. Our architecture with MF processing cores shows promising results, saving 36%-72% of on-chip memory and 10%-44% of DSP48 IPs compared to the architecture with cores implemented using multipliers. Comparison with the state of the art showed a very promising 25-40× DSP48 and 25-29× on-chip memory reduction with up to 136.9 GOP/s performance and 88.49 GOP/s/W power efficiency. Hence, our results demonstrate that the proposed architecture can be very expedient for resource-constrained devices. Index Terms: DCNN, MulNet, constrained devices, hybrid embedded system.
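The power-of-2 quantization used by the multiplier-free (MF) cores can be illustrated with a short sketch: each weight is snapped to the nearest signed power of two, so a multiplication reduces to a bit shift by the stored exponent. The exponent range below is an assumption for illustration, not the format chosen for MulNet.

```python
import numpy as np

def pow2_quantize(w, min_exp=-8, max_exp=0):
    """Quantize weights to signed powers of two so that multiplication
    reduces to a bit shift (the idea behind multiplier-free cores)."""
    sign = np.sign(w)
    mag = np.clip(np.abs(w), 2.0 ** min_exp, 2.0 ** max_exp)
    exp = np.clip(np.round(np.log2(mag)), min_exp, max_exp)
    return sign * (2.0 ** exp), exp.astype(np.int8)

w = np.random.randn(6) * 0.3
w_q, exps = pow2_quantize(w)
print(w.round(3), "->", w_q.round(3), "exponents:", exps)
```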
2022
Dilated and transposed convolutions are widely used in modern convolutional neural networks (CNNs). These kernels are used extensively during CNN training and inference in applications such as image segmentation and high-resolution image generation. Although these kernels have grown in popularity, they stress current compute systems due to their high memory intensity, exascale compute demands, and large energy consumption. We find that commonly used low-power CNN inference accelerators based on spatial architectures are not optimized for these convolutional kernels. Dilated and transposed convolutions introduce significant zero padding when mapped to the underlying spatial architecture, significantly degrading performance and energy efficiency. Existing approaches that address this issue require significant design changes to the otherwise simple, efficient, and well-adopted architectures used to compute direct convolutions. To address this challenge, we propose EcoFlow, a ne...
IEEE Access
Deep learning algorithms have seen success in a wide variety of applications, such as machine translation, image and speech recognition, and self-driving cars. However, these algorithms have only recently gained a foothold in the embedded systems domain. Most embedded systems are based on cheap microcontrollers with limited memory capacity and are thus typically seen as incapable of running deep learning algorithms. Nevertheless, we consider that advancements in the compression of neural networks and in neural network architectures, coupled with an optimized instruction set architecture, could make microcontroller-grade processors suitable for specific low-intensity deep learning applications. We propose a simple instruction set extension with two main components: hardware loops and dot product instructions. To evaluate the effectiveness of the extension, we developed optimized assembly functions for the fully connected and convolutional neural network layers. When using the extensions and the optimized assembly functions, we achieve an average clock cycle count decrease of 73% for a small-scale convolutional neural network. On a per-layer basis, our optimizations decrease the clock cycle count for fully connected layers and convolutional layers by 72% and 78%, respectively. The average energy consumption per inference decreases by 73%. We have shown that adding just hardware loops and dot product instructions has a significant positive effect on processor efficiency when computing neural network functions.
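To illustrate how a dot product instruction combines with hardware loops, the sketch below models a hypothetical 4-element MAC instruction (`dot4`) and uses it as the inner loop of a 1-D convolution, the way an optimized assembly kernel would. The 4-element width and the padding requirement are assumptions for illustration, not the actual ISA extension proposed in the paper.

```python
import numpy as np

def dot4(acc, w, x):
    """Software model of a hypothetical 4-element MAC/dot-product
    instruction: acc += w[0]*x[0] + ... + w[3]*x[3]."""
    return acc + int(np.dot(w, x))

def conv1d_with_dot4(signal, kernel):
    """1-D convolution whose inner loop is expressed entirely as dot4
    calls, mimicking a dot-product instruction inside a hardware loop."""
    k = len(kernel)
    assert k % 4 == 0, "kernel padded to a multiple of 4 in this sketch"
    out = []
    for i in range(len(signal) - k + 1):
        acc = 0
        for j in range(0, k, 4):               # the 'hardware loop'
            acc = dot4(acc, kernel[j:j + 4], signal[i + j:i + j + 4])
        out.append(acc)
    return out

sig = np.arange(16, dtype=np.int32)
ker = np.array([1, -1, 2, -2, 1, 1, -1, -1], dtype=np.int32)
print(conv1d_with_dot4(sig, ker))
```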