2019, ArXiv
Convolutional neural networks (CNNs) inherently suffer from massively redundant computation (FLOPs) due to the dense connection pattern between feature maps and convolution kernels. Recent research has investigated the sparse relationship between channels; however, it has ignored the spatial relationship within a channel. In this paper, we present a novel convolutional operator, namely comb convolution, to exploit the intra-channel sparse relationship among neurons. The proposed convolutional operator eliminates nearly 50% of connections by inserting uniform mappings into standard convolutions, removing about half of the spatial connections in each convolutional layer. Notably, our work is orthogonal and complementary to existing methods that reduce channel-wise redundancy. Thus, it has great potential to further increase efficiency by integrating the comb convolution into existing architectures. Experimental results demonstrate that by simply replacing standard convolutions with ...
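To make the comb idea concrete, here is a minimal PyTorch sketch, not the authors' implementation: it assumes the "uniform mapping" is an identity copy of the input at roughly half of the spatial positions, arranged as alternating columns, and that input and output channel counts match. The dense convolution is computed and then masked only to illustrate the connectivity pattern; an efficient implementation would skip the masked positions entirely.

```python
# Hedged sketch of a comb-style convolution: roughly half of the spatial positions
# keep the convolution output, the other half pass the input through unchanged.
import torch
import torch.nn as nn

class CombConv2d(nn.Module):
    def __init__(self, channels, kernel_size=3, offset=0):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size,
                              padding=kernel_size // 2, bias=False)
        self.offset = offset  # alternate the pattern between consecutive layers

    def forward(self, x):
        y = self.conv(x)
        # Mask selecting every other column: these positions keep the convolution
        # output, the rest are replaced by the identity (uniform) mapping.
        cols = torch.arange(x.shape[-1], device=x.device)
        keep = ((cols + self.offset) % 2 == 0).view(1, 1, 1, -1)
        return torch.where(keep, y, x)

x = torch.randn(1, 16, 8, 8)
print(CombConv2d(16)(x).shape)  # torch.Size([1, 16, 8, 8])
```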
IEEE Transactions on Image Processing, 2021
The channel redundancy in feature maps of convolutional neural networks (CNNs) results in large consumption of memory and computational resources. In this work, we design a novel Slim Convolution (SlimConv) module to boost the performance of CNNs by reducing channel redundancy. Our SlimConv consists of three main steps: Reconstruct, Transform, and Fuse, through which the features are split and reorganized in a more efficient way, so that the learned weights can be compressed effectively. In particular, the core of our model is a weight-flipping operation that substantially improves feature diversity and is crucial to the performance gains. SlimConv is a plug-and-play architectural unit that can directly replace convolutional layers in CNNs. We validate the effectiveness of SlimConv by conducting comprehensive experiments on the ImageNet, MS COCO2014, Pascal VOC2012 segmentation, and Pascal VOC2007 detection datasets. The experiments show that SlimConv-equipped models consistently achieve better performance with less memory and computation than their non-equipped counterparts. For example, ResNet-101 fitted with SlimConv achieves 77.84% top-1 classification accuracy on ImageNet with 4.87 GFLOPs and 27.96M parameters, almost 0.5% better performance while reducing computation by about 3 GFLOPs and parameters by 38%.
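A heavily simplified sketch of the split/flip/fuse idea, under the assumption that the flipping is applied to SE-style channel weights and that fusion folds each path down to half of the channels; the exact SlimConv wiring, branch convolutions, and reduction ratio differ from this illustration.

```python
# Illustrative (not the SlimConv module itself): reweight channels, flip the weights
# for a second path, fold each path to half the channels, fuse, then convolve.
import torch
import torch.nn as nn

class SlimFuseSketch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Linear(channels, channels)                 # SE-style channel weights
        self.reduce = nn.Conv2d(channels // 2, channels // 2, 3, padding=1, bias=False)

    def forward(self, x):
        b, c, _, _ = x.shape                                    # assumes c is even
        w = torch.sigmoid(self.fc(x.mean(dim=(2, 3)))).view(b, c, 1, 1)
        top = x * w                                             # path with learned weights
        bottom = x * torch.flip(w, dims=[1])                    # path with flipped weights
        fold = lambda t: t[:, : c // 2] + t[:, c // 2:]         # fold to half the channels
        return self.reduce(fold(top) + fold(bottom))

print(SlimFuseSketch(64)(torch.randn(2, 64, 16, 16)).shape)     # torch.Size([2, 32, 16, 16])
```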
ArXiv, 2021
Deploying deep Convolutional Neural Networks (CNNs) is constrained by their memory footprint and speed requirements, which mainly come from convolution. Widely used convolution algorithms, im2col and MEC, produce a lowered matrix from an activation map by redundantly storing the map's elements included at horizontal and/or vertical kernel overlappings, without considering the sparsity of the map. Exploiting that sparsity, this paper proposes two new convolution algorithms, dubbed Compressed Pattern Overlap (CPO) and Compressed Pattern Sets (CPS), that simultaneously decrease the memory footprint and increase the inference speed while preserving accuracy. CPO recognizes non-zero elements (NZEs) at horizontal and vertical overlappings in the activation maps. CPS further improves the memory savings of CPO by compressing the index positions of neighboring NZEs. In both algorithms, channels/regions of the activation maps with all zeros are skipped. Then, CPO/CPS performs convolution v...
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2021
In deep convolutional neural networks (DCNNs), model size and computation complexity are two important factors governing throughput and energy efficiency when deployed to hardware for inference. Recent works on compact DCNNs as well as pruning methods are effective, yet with drawbacks. For instance, more than half the size of all MobileNet models lies in their last two layers, mainly because compact separable convolution (CONV) layers are not applicable to their last fully-connected (FC) layers. Also, in pruning methods the compression is gained at the expense of irregularity in the DCNN architecture, which necessitates additional indexing memory to address non-zero weights, thereby increasing memory footprint, decompression delays, and energy consumption. In this paper, we propose cyclic sparsely connected (CSC) architectures, with memory/computation complexity of O(N log N), where N is the number of nodes/channels in a given DCNN layer, that, contrary to compact depthwise separable layers, can be used as an overlay for both FC and CONV layers of O(N²) complexity. Also, contrary to pruning methods, CSC architectures are structurally sparse and require no indexing due to their cyclic nature. We show that both standard convolution and depthwise convolution layers are special cases of CSC layers, whose mathematical function, together with that of FC layers, can be unified into a single formulation, and whose hardware implementation can be carried out under one arithmetic logic component. We examine the efficacy of CSC architectures for compression of the LeNet, AlexNet, and MobileNet DCNNs with precision ranging from 2 to 32 bits. More specifically, we build on the compact 8-bit quantized 0.5 MobileNet V1 and show that by compressing its last two layers with CSC architectures, the model is compressed by ∼1.5× to a size of only 873 KB with little accuracy loss. Lastly, we design configurable hardware that implements all types of DCNN layers, including FC, CONV, depthwise, CSC-FC, and CSC-CONV, indistinguishably within a unified pipeline. We implement the hardware on a tiny Xilinx FPGA for total on-chip processing of the compressed MobileNet which, compared to related work, has the highest Inference/J while utilizing the smallest FPGA.
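As a rough illustration of structured cyclic sparsity, the sketch below builds a masked fully-connected layer whose connectivity follows power-of-two cyclic strides, giving on the order of N log N connections whose positions are derivable from the indices alone (so no stored indexing is needed). The specific stride pattern is an assumption for illustration, not the CSC construction from the paper.

```python
# Illustrative cyclically sparse FC layer: each node connects to itself and to nodes
# at cyclic offsets 1, 2, 4, ..., so the mask has roughly N * log2(N) non-zeros.
import torch
import torch.nn as nn

def cyclic_mask(n):
    mask = torch.zeros(n, n)
    strides = [1 << k for k in range((n - 1).bit_length())]  # 1, 2, 4, ...
    for i in range(n):
        mask[i, i] = 1.0
        for s in strides:
            mask[i, (i + s) % n] = 1.0
    return mask

class CyclicSparseLinear(nn.Module):
    def __init__(self, n):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n, n) * 0.01)
        self.register_buffer("mask", cyclic_mask(n))

    def forward(self, x):
        return x @ (self.weight * self.mask).t()

layer = CyclicSparseLinear(8)
print(int(layer.mask.sum()))                 # 32 connections instead of 64
print(layer(torch.randn(2, 8)).shape)        # torch.Size([2, 8])
```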
IEEE Transactions on Neural Networks and Learning Systems, 2018
Convolutional neural networks (CNNs) have become the dominant neural network architecture for solving many state-of-the-art (SOA) visual processing tasks. Even though Graphical Processing Units (GPUs) are most often used in training and deploying CNNs, their power consumption becomes a problem for real-time mobile applications. We propose a flexible and efficient CNN accelerator architecture which can support the implementation of SOA CNNs in low-power and low-latency application scenarios. This architecture exploits the sparsity of neuron activations in CNNs to accelerate the computation and reduce memory requirements. The flexible architecture allows full utilization of available computing resources across a wide range of convolutional network kernel sizes and numbers of input and output feature maps. We implemented the proposed architecture on an FPGA platform and present results showing how our implementation reduces external memory transfers and compute time in five different CNNs, ranging from small ones up to the widely known large VGG16 and VGG19 CNNs. We show how, in RTL simulations in a 28-nm process at a clock frequency of 500 MHz, the NullHop core is able to reach over 450 GOp/s and an efficiency of 368%, maintaining over 98% utilization of the MAC units and achieving a power efficiency of over 3 TOp/s/W in a core area of 5.8 mm².
2021
Motivated by the necessity for parameter efficiency in distributed machine learning and AI-enabled edge devices, we provide a general and easy-to-implement method for significantly reducing the number of parameters of Convolutional Neural Networks (CNNs) during both the training and inference phases. We introduce a simple auxiliary neural network which can generate the convolutional filters of any CNN architecture from a low-dimensional latent space. This auxiliary neural network, which we call the "Convolutional Slice Generator" (CSG), is unique to the network and provides the association between its convolutional layers. During the training of the CNN, instead of training the filters of the convolutional layers, only the parameters of the CSG and their corresponding "code vectors" are trained. This results in a significant reduction in the number of parameters, because the CNN can be fully represented using only the parameters of the CSG, the code vect...
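A minimal sketch of the hypernetwork idea described above, assuming a linear generator, a 16-dimensional code, and a fixed 16×16×3×3 slice shape (all illustrative choices): only the generator and the code vectors carry trainable parameters, and the convolution filters are assembled from generated slices.

```python
# Illustrative slice generator: code vectors -> filter slices -> assembled conv weight.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SliceGenerator(nn.Module):
    """Maps a low-dimensional code vector to one fixed-size filter slice."""
    def __init__(self, code_dim=16, slice_shape=(16, 16, 3, 3)):
        super().__init__()
        self.slice_shape = slice_shape
        self.gen = nn.Linear(code_dim, math.prod(slice_shape))

    def forward(self, codes):                  # codes: (num_slices, code_dim)
        return self.gen(codes).view(-1, *self.slice_shape)

csg = SliceGenerator()
codes = nn.Parameter(torch.randn(4, 16))       # trainable code vectors, not filters
slices = csg(codes)                            # (4, 16, 16, 3, 3)
weight = slices.reshape(64, 16, 3, 3)          # assemble a 64x16x3x3 conv weight
x = torch.randn(1, 16, 32, 32)
print(F.conv2d(x, weight, padding=1).shape)    # torch.Size([1, 64, 32, 32])
```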
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017
We propose a new method for creating computationally efficient and compact convolutional neural networks (CNNs) using a novel sparse connection structure that resembles a tree root. This allows a significant reduction in computational cost and number of parameters compared to state-of-the-art deep CNNs, without compromising accuracy, by exploiting the sparsity of inter-layer filter dependencies. We validate our approach by using it to train more efficient variants of state-of-the-art CNN architectures, evaluated on the CIFAR10 and ILSVRC datasets. Our results show similar or higher accuracy than the baseline architectures with much less computation, as measured by CPU and GPU timings. For example, for ResNet 50, our model has 40% fewer parameters, 45% fewer floating point operations, and is 31% (12%) faster on a CPU (GPU). For the deeper ResNet 200 our model has 48% fewer parameters and 27% fewer floating point operations, while maintaining state-of-the-art accuracy. For GoogLeNet, our model has 7% fewer parameters and is 21% (16%) faster on a CPU (GPU).
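The sparse inter-layer filter dependencies can be approximated in code with filter groups; the sketch below, with an assumed group count of 4, replaces a fully-coupled 3×3 convolution by a grouped 3×3 convolution followed by a 1×1 convolution that re-mixes the groups, cutting the 3×3 parameter count by the group factor. It is a generic grouped-convolution illustration, not the paper's exact root topology.

```python
# Grouped 3x3 + 1x1 mixing block; requires in_ch to be divisible by `groups`.
import torch
import torch.nn as nn

def root_block(in_ch, out_ch, groups=4):
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=groups, bias=False),  # sparse spatial filtering
        nn.Conv2d(in_ch, out_ch, 1, bias=False),                           # dense 1x1 channel mixing
    )

print(root_block(32, 64)(torch.randn(1, 32, 16, 16)).shape)  # torch.Size([1, 64, 16, 16])
```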
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018
Neural networks rely on convolutions to aggregate spatial information. However, spatial convolutions are expensive in terms of model size and computation, both of which grow quadratically with respect to kernel size. In this paper, we present a parameter-free, FLOP-free "shift" operation as an alternative to spatial convolutions. We fuse shifts and point-wise convolutions to construct end-to-end trainable shift-based modules, with a hyperparameter characterizing the tradeoff between accuracy and efficiency. To demonstrate the operation's efficacy, we replace ResNet's 3x3 convolutions with shift-based modules for improved CIFAR10 and CIFAR100 accuracy using 60% fewer parameters; we additionally demonstrate the operation's resilience to parameter reduction on ImageNet, outperforming ResNet family members. We finally show the shift operation's applicability across domains, achieving strong performance with fewer parameters on classification, face verification and style transfer.
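A small sketch of a shift-based module, assuming channels are partitioned into five groups shifted right, left, down, up, and not at all (the grouping and padding details are illustrative): the shift itself uses no parameters and no FLOPs, and a 1×1 convolution then mixes the shifted channels.

```python
# Parameter-free channel-group shift followed by a pointwise (1x1) convolution.
import torch
import torch.nn as nn

def shift(x):
    b, c, h, w = x.shape
    out = torch.zeros_like(x)
    g = c // 5
    out[:, 0*g:1*g, :, 1:] = x[:, 0*g:1*g, :, :-1]   # shift right
    out[:, 1*g:2*g, :, :-1] = x[:, 1*g:2*g, :, 1:]   # shift left
    out[:, 2*g:3*g, 1:, :] = x[:, 2*g:3*g, :-1, :]   # shift down
    out[:, 3*g:4*g, :-1, :] = x[:, 3*g:4*g, 1:, :]   # shift up
    out[:, 4*g:] = x[:, 4*g:]                        # remaining channels unshifted
    return out

class ShiftModule(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.pw = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        return self.pw(shift(x))

print(ShiftModule(20, 40)(torch.randn(1, 20, 8, 8)).shape)  # torch.Size([1, 40, 8, 8])
```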
2020
Deep convolutional neural networks (CNNs) have achieved significant improvements in different vision tasks, including classification, detection, and segmentation. However, the increasing model size and computation make it difficult to implement DNNs on embedded systems with limited hardware resources. Many approaches, such as MobileNets, ShuffleNet, and ESPNet, have been proposed to build lightweight networks and have achieved comparable performance. This paper proposes a lightweight and efficient network based on depthwise dilated separable convolution and the MobileNetv2 architecture. The depthwise dilated convolution in the depthwise dilated separable convolution module effectively enlarges the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Furthermore, instead of using a convolution with 3×3 kernel size for each depthwise separable convolution block in MobileNetv2, this paper uses dilated convolutions with different dila...
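A minimal sketch of a depthwise dilated separable block as described: a depthwise 3×3 convolution with dilation rate r enlarges the receptive field at no extra parameter cost, followed by a 1×1 pointwise convolution; the specific dilation rate and the BatchNorm/ReLU placement here are assumptions.

```python
# Depthwise 3x3 convolution with dilation, then pointwise 1x1 convolution.
import torch
import torch.nn as nn

def dw_dilated_separable(in_ch, out_ch, dilation=2):
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=dilation, dilation=dilation,
                  groups=in_ch, bias=False),   # depthwise, enlarged receptive field
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),  # pointwise channel mixing
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

print(dw_dilated_separable(32, 64)(torch.randn(1, 32, 16, 16)).shape)  # torch.Size([1, 64, 16, 16])
```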
IEEE Access, 2022
Convolutional neural networks have demonstrated impressive results in many computer vision tasks. However, the increasing size of these networks raises concerns about the information overload resulting from the large number of network parameters. In this paper, we propose Frequency Regularization to restrict the non-zero elements of the network parameters in the frequency domain. The proposed approach operates at the tensor level and can be applied to almost all network architectures. Specifically, the tensors of parameters are maintained in the frequency domain, where high-frequency components can be eliminated by setting tensor elements to zero in a zigzag order. Then, the inverse discrete cosine transform (IDCT) is used to reconstruct the spatial tensors for matrix operations during network training. Since high-frequency components of images are known to be less critical, a large proportion of these parameters can be set to zero when networks are trained with the proposed frequency regularization. Comprehensive evaluations on various state-of-the-art network architectures, including LeNet, AlexNet, VGG, ResNet, ViT, UNet, GAN, and VAE, demonstrate the effectiveness of the proposed frequency regularization. For a very small accuracy decrease (less than 2%), a LeNet5 with 0.4M parameters can be represented by only 776 float16 numbers (over 1100× reduction), and a UNet with 34M parameters can be represented by only 759 float16 numbers (over 80000× reduction). In particular, the original size of the UNet model is reduced from 366 MB to 4.5 KB.
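A rough sketch of the frequency-domain truncation for a single 2-D kernel, using SciPy's DCT routines for illustration: coefficients on high (i + j) diagonals stand in for the zigzag-ordered high-frequency components and are zeroed before the inverse DCT rebuilds the spatial kernel; the keep fraction and the diagonal criterion are assumptions, not the paper's exact procedure.

```python
# Keep only low-frequency DCT coefficients of a kernel and reconstruct it via IDCT.
import numpy as np
from scipy.fft import dctn, idctn

def truncate_high_freq(kernel, keep_fraction=0.25):
    """kernel: 2-D spatial filter; returns its low-frequency reconstruction."""
    coeffs = dctn(kernel, norm="ortho")
    h, w = coeffs.shape
    i, j = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    cutoff = np.quantile(i + j, keep_fraction)   # keep only the low (i + j) diagonals
    coeffs[i + j > cutoff] = 0.0                 # zigzag-style zeroing of high frequencies
    return idctn(coeffs, norm="ortho")

k = np.random.randn(7, 7)
print(np.round(truncate_high_freq(k), 3))
```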
IEEE Access, 2021
Convolutional Neural Networks (CNNs) have been shown to be very useful in image recognition and other Artificial Intelligence (AI) applications, however, at the expense of intensive computation requirements. To address the challenge of overwhelming calculation requirements, researchers have proposed various network pruning techniques. However, due to their irregular sparsity patterns, unstructured sparse networks are difficult to compute efficiently on either Graphics Processing Units (GPUs) or Field Programmable Gate Arrays (FPGAs). In this paper, we propose a software/hardware co-optimized Reconfigurable Sparse convolutional Neural Network accelerator design (RSNN) on FPGAs. A novel sparse convolution dataflow is proposed with simpler control logic than existing mux-based selection logic. To balance the computation load on different Processing Units (PUs), we propose a software-based load-balance-aware pruning technique as well as a kernel merging method. Experimental results show that RSNN is 2.41×-7.91× better in Digital Signal Processor (DSP) efficiency than previous dense CNN FPGA accelerators, and 1.23×-2.93× better than previous sparse CNN FPGA accelerators. INDEX TERMS Accelerator, convolutional neural network, FPGA, sparse neural network.
IEEE Access
Most convolutional neural network (CNN) designs are still bottlenecked by costly computational load, which impedes their utilization in industrial applications. To address this issue, a new Sparse-Split-Parallelism (SSP) design framework is proposed in this paper. It fuses three design strategies that can be applied to the majority of the popular state-of-the-art block-based CNN models to lighten their computing budget while maintaining comparable accuracies. At the block level, a design strategy based on the novel concept of sparse skip connections is introduced, which provides optimal connectivity, preventing a severe rise in channel numbers and keeping satisfactory feature reuse in the network model. As part of the module-level design, a new SSP module is created that preserves the design features of the targeted existing models, and a novel proportional channel split operation is employed to achieve an optimal trade-off between accuracy and model size. As the third strategy, at the layer level, the idea of the degree of parallelism is adopted, resulting in an equal number of channels in the layers, which decreases memory access and yields a better inference time. The effectiveness of the framework has been validated through comprehensive experimental work. The evaluation results, which are based on DenseNet, ResNet, ShiftNet, ShuffleNet, and ShuffleNet-v2, verify that the proposed SSP framework is notably capable of reducing the parameter number, FLOPs, and inference time of existing CNN models while achieving comparable accuracies. The models are evaluated on image classification using the ImageNet and CIFAR-10/CIFAR-100 datasets, as well as on object detection with the MS COCO dataset. INDEX TERMS Computer vision, convolutional neural networks, convolution types, real-time embedded systems, object detection, image classification, visual recognition, autonomous driving systems.

Any pixel (m, t) in Fig. 1 acts as a surrogate to show the dependency between the modules m and t. The scale of the colour bar on the right-hand side of the figure indicates the feature reuse intensity (i.e., the nearer to the top, the higher the feature reuse). For instance, a fully dark red pixel in any position (m, t) indicates that there is fully effective feature reuse between the modules. An apparent observation from Fig. 1 is that feature reuse intensity is strongest in connections skipping at most four steps; beyond this distance, a gradually rising decay starts. A similar behaviour has also been demonstrated in ShuffleNet-v2 and in DenseNet (k = 12) with 40 layers on the CIFAR dataset [33]. The motivations behind the second and third design strategies of the proposed SSP framework, originating from the ShuffleNet-v2 model [29], are to further enhance its efficiency. In the module-level design strategy, a novel 'proportional channel split' operation with concatenation is applied to the modules, which differs from the one introduced in ShuffleNet-v2 [29]. This operation primarily enables a trade-off between accuracy and model size. As for the layer-level design strategy, we directly apply the idea of the degree of parallelism [3], [29] to our models. In theory, FLOPs (multiply-adds) can be seen as a direct metric for the computational budget; in practice, however, the number of memory access operations [3] is a more decisive factor. At this point, ShuffleNet-v2 proposes a general remedy for this dilemma, in which an equal number of input and output channels in a convolutional layer ensures the lower bound on the number of memory accesses. Thus, we build the Sparse-Split-Parallelism (SSP) framework by means of the design strategies introduced above. [...] Thus, we create new SSP-enabled models and enhance their performance through the application of the framework. Another important feature of the SSP module is the novel proportional channel split operation, which allows control of the channel proportion for the underlying convolution and the identity mapping. This is achieved by means of a special hyperparameter, ε, which is also used in the SSP framework for managing the desired trade-off between accuracy and model size.
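A minimal sketch of a proportional channel split, assuming ε sets the fraction of channels routed through an identity branch while the remainder passes through a convolutional branch before concatenation; the branch contents are illustrative, not the SSP module itself.

```python
# Split channels by proportion eps: identity branch + convolutional branch, then concat.
import torch
import torch.nn as nn

class ProportionalSplit(nn.Module):
    def __init__(self, channels, eps=0.5):
        super().__init__()
        self.split = int(round(eps * channels))          # eps controls the proportion
        conv_ch = channels - self.split
        self.branch = nn.Conv2d(conv_ch, conv_ch, 3, padding=1, bias=False)

    def forward(self, x):
        identity, worked = x[:, : self.split], x[:, self.split:]
        return torch.cat([identity, self.branch(worked)], dim=1)

print(ProportionalSplit(32, eps=0.25)(torch.randn(1, 32, 8, 8)).shape)  # torch.Size([1, 32, 8, 8])
```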
Computer Vision – ECCV 2018
We introduce a fast and efficient convolutional neural network, ESPNet, for semantic segmentation of high-resolution images under resource constraints. ESPNet is based on a new convolutional module, the efficient spatial pyramid (ESP), which is efficient in terms of computation, memory, and power. ESPNet is 22 times faster (on a standard GPU) and 180 times smaller than the state-of-the-art semantic segmentation network PSPNet [1], while its category-wise accuracy is only 8% less. We evaluated ESPNet on a variety of semantic segmentation datasets, including Cityscapes, PASCAL VOC, and a breast biopsy whole-slide image dataset. Under the same constraints on memory and computation, ESPNet outperforms all the current efficient CNN networks such as MobileNet [16], ShuffleNet [17], and ENet [20] on both standard metrics and our newly introduced performance metrics that measure efficiency on edge devices. Our network can process high-resolution images at a rate of 112 and 9 frames per second on a standard GPU and an edge device, respectively. Related Work: Multiple different techniques, such as convolution factorization, network compression, and low-bit networks, have been proposed to speed up convolutional neural networks. We first briefly describe these approaches and then provide a brief overview of CNN-based semantic segmentation. Convolution factorization: Convolution factorization decomposes the convolutional operation into multiple steps to reduce the computational complexity. This factorization has successfully shown its potential in reducing the computational complexity of deep CNN networks (e.g. Inception [11-13], factorized network [22], ResNext [14], Xception [15], and MobileNets [16]). ESP modules are also built on this factorization principle. The ESP module decomposes a convolutional layer into a point-wise convolution and a spatial pyramid of dilated convolutions. This factorization helps in reducing the computational complexity, while simultaneously allowing the network to learn representations from a large effective receptive field. Network compression: Another approach for building efficient networks is compression. These methods use techniques such as hashing [23], pruning [24], vector quantization [25], and shrinking [26, 27] to reduce the size of the pre-trained network. Low-bit networks: Another approach towards efficient networks is low-bit networks, which quantize the weights to reduce the network size and complexity (e.g. [28-31]). Sparse CNN: To remove the redundancy in CNNs, sparse CNN methods, such as sparse decomposition [32], structural sparsity learning [33], and dictionary-based methods [34], have been proposed. We note that compression-based methods, low-bit networks, and sparse CNN methods are equally applicable to ESPNets and are complementary to our work. Dilated convolution: Dilated convolutions [35] are a special form of standard convolutions in which the effective receptive field of kernels is increased by inserting zeros (or holes) between each pixel in the convolutional kernel. For an n × n dilated convolutional kernel with a dilation rate of r, the effective size of the kernel is [(n − 1)r + 1]². The dilation rate specifies the number of zeros (or holes) between pixels. However, due to dilation, only n × n pixels participate in the convolutional operation, reducing the computational cost while increasing the effective kernel size.
Yu and Koltun [18] stacked dilated convolution layers with increasing dilation rate to learn contextual representations from a large effective receptive field. A similar strategy was adopted in [19, 36, 37]. Chen et al. [3] introduced an atrous spatial pyramid (ASP) module. This module can be viewed as a parallelized version of [3]. These modules are computationally inefficient (e.g. ASPs have high memory requirements and learn many more parameters; see Section 3.2). Our ESP module also learns multi-scale
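A quick numeric check of the effective-kernel-size expression quoted in the entry above, [(n − 1)r + 1]²: the covered area grows with the dilation rate while the number of weights stays at n × n.

```python
# Effective receptive area of an n x n kernel with dilation rate r.
def effective_size(n, r):
    return ((n - 1) * r + 1) ** 2

for r in (1, 2, 4):
    print(f"3x3 kernel, dilation {r}: covers {effective_size(3, r)} positions, uses 9 weights")
# dilation 1 -> 9, dilation 2 -> 25, dilation 4 -> 81 covered positions
```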
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2018
As convolution contributes most operations in a convolutional neural network (CNN), the convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution involves multiply and accumulate (MAC) operations with four levels of loops, which results in a large design space. Prior works either employ limited loop optimization techniques, e.g. loop unrolling, tiling and interchange, or only tune some of the design variables after the accelerator architecture and dataflow are already fixed. Without fully studying the convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit the data reuse and manage data movement efficiently. This work overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g. memory access) of the CNN accelerator based on multiple design variables. Then, we propose a specific dataflow of hardware CNN acceleration to minimize the data communication while maximizing the resource utilization to achieve high performance. The proposed CNN acceleration scheme and architecture are demonstrated by implementing end-to-end CNNs including NiN, VGG-16 and ResNet-50/ResNet-152 for inference. For VGG-16 CNN, the overall throughputs achieve 348 GOPS and 715 GOPS on Intel Stratix V and Arria 10 FPGAs, respectively.
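An illustrative sketch (not the paper's dataflow) of the four convolution loop levels the abstract refers to, with a simple tile over the output-channel loop to show the kind of loop transformation being optimized; all sizes and the tiling choice are assumed toy values.

```python
# Naive convolution loop nest with an output-channel tile, for illustration only.
import numpy as np

def conv_loops(x, w, tile_oc=4):
    C, H, W = x.shape                        # input channels and spatial size
    M, _, K, _ = w.shape                     # output channels and kernel size
    y = np.zeros((M, H - K + 1, W - K + 1))
    for m0 in range(0, M, tile_oc):          # loop 1: output feature maps (tiled)
        for m in range(m0, min(m0 + tile_oc, M)):
            for c in range(C):               # loop 2: input feature maps
                for i in range(y.shape[1]):  # loop 3: output spatial positions
                    for j in range(y.shape[2]):
                        for ki in range(K):  # loop 4: kernel window
                            for kj in range(K):
                                y[m, i, j] += w[m, c, ki, kj] * x[c, i + ki, j + kj]
    return y

x = np.random.randn(3, 8, 8)
w = np.random.randn(8, 3, 3, 3)
print(conv_loops(x, w).shape)                # (8, 6, 6)
```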
2014
This paper presents a clustered SIMD accelerator template for Convolutional Networks. These networks significantly outperform other methods in detection and classification tasks in the vision domain. Due to their excessive compute and data-transfer requirements, these applications benefit greatly from a dedicated accelerator. The proposed accelerator reduces memory traffic by loop transformations such as tiling and fusion to merge successive layers. Although fusion can introduce redundant computations, it often reduces data transfer and can therefore remove performance bottlenecks. The SIMD cluster is mapped to a Xilinx Zynq FPGA, where it achieves 6.4 Gops performance with a small amount of resources. The performance can be scaled by using multiple clusters.
ACM Journal on Emerging Technologies in Computing Systems
We provide here a novel method, called hypercolumn sparsification, to achieve high recognition performance for convolutional neural networks (CNNs) despite low-precision weights and activities during both training and test phases. This method is applicable to any CNN architecture that operates on signal patterns (e.g., audio, image, video) to extract information such as class membership. It operates on the stack of feature maps in each of the cascading feature matching and pooling layers through the processing hierarchy of the CNN by an explicit competitive process (k-WTA, winner-take-all) that generates a sparse feature vector at each spatial location. This principle is inspired by local brain circuits, where neurons tuned to respond to different patterns in the incoming signals from an upstream region inhibit each other using interneurons, such that only the ones that are maximally activated survive the quenching threshold. We show this process of sparsification is critical for ...
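A short sketch of per-location k-WTA sparsification along the channel stack, where only the k most active channels at each spatial position survive; the hard top-k form and the choice of k are assumptions for illustration.

```python
# Keep the k largest activations per spatial location across the channel dimension.
import torch

def kwta(x, k):
    """x: (B, C, H, W). Zero everything except the k channel winners at each (b, h, w)."""
    vals, idx = x.topk(k, dim=1)                     # winners along the channel axis
    out = torch.zeros_like(x)
    return out.scatter(1, idx, vals)                 # everything else stays zero

x = torch.randn(2, 64, 8, 8)
print((kwta(x, 8) != 0).float().mean().item())       # ~8/64 of activations survive
```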
IEEE Access
Designing small and efficient mobile neural networks is difficult because the challenge is to determine the architecture that achieves the best performance under a given limited computational scenario. Previous lightweight neural networks rely on a cell module that is repeated in all stacked layers across the network. These approaches do not permit layer diversity, which is critical for achieving strong performance. This paper presents an experimental study to develop an efficient mobile network using a hierarchical architecture. Our proposed mobile network, called Diversity Network (DivNet), has been shown to perform better, in terms of complexity cost and performance, than the basic simply-stacked-layers architecture generally employed by the best high-efficiency models. A set of architectural design decisions is described that reduces the proposed model size while yielding a significant performance improvement. Our experiments on image classification show that, compared to MobileNetV2, SqueezeNet, and ShuffleNetV2 respectively, our proposed DivNet improves accuracy by 2.09%, 0.76%, and 0.66% on the CIFAR100 dataset, and by 0.05%, 4.96%, and 1.13% on the CIFAR10 dataset. On more complex datasets, e.g., ImageNet, DivNet achieves 70.65% Top-1 accuracy and 90.23% Top-5 accuracy, still better than other small models such as MobileNet, SqueezeNet, and ShuffleNet. INDEX TERMS Deep neural network, mobile network, network compression.
IEEE Journal of Selected Topics in Signal Processing, 2020
Convolutional Neural Networks (CNNs) have become indispensable for solving machine learning tasks in speech recognition, computer vision, and other areas that involve high-dimensional data. A CNN filters the input feature using a network containing spatial convolution operators with compactly supported stencils. In practice, the input data and the hidden features consist of a large number of channels, which in most CNNs are fully coupled by the convolution operators. This coupling leads to immense computational cost in the training and prediction phases. In this paper, we introduce LeanConvNets, which are derived by sparsifying the fully-coupled operators in existing CNNs. Our goal is to improve the efficiency of CNNs by reducing the number of weights, floating-point operations, and latency, with minimal loss of accuracy. Our lean convolution operators involve tuning parameters that control the trade-off between the network's accuracy and computational costs. These convolutions can be used in a wide range of existing networks, and we exemplify their use in residual networks (ResNets). Using a range of benchmark problems from image classification and semantic segmentation, we demonstrate that the resulting LeanConvNets' accuracy is close to that of state-of-the-art networks while being computationally less expensive. In our tests, the lean versions of ResNet in most cases outperform comparable reduced architectures such as MobileNets and ShuffleNets.
2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019
Quantization is a popular way of increasing the speed and lowering the memory usage of Convolutional Neural Networks (CNNs). When labelled training data is available, network weights and activations have successfully been quantized down to 1 bit. The same cannot be said about the scenario when labelled training data is not available, e.g. when quantizing a pre-trained model, where current approaches show, at best, no loss of accuracy at 8-bit quantization. We introduce DSConv, a flexible quantized convolution operator that replaces single-precision operations with their far less expensive integer counterparts, while maintaining the probability distributions over both the kernel weights and the outputs. We test our model as a plug-and-play replacement for standard convolution on the most popular neural network architectures, ResNet, DenseNet, GoogLeNet, AlexNet and VGG-Net, and demonstrate state-of-the-art results, with less than 1% loss of accuracy, without retraining, using only 4-bit quantization. We also show how a distillation-based adaptation stage with unlabelled data can improve results even further.
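A hedged sketch in the spirit of the operator described above (not DSConv itself): each block along the kernel's channel depth is stored as signed low-bit integers plus one floating-point scale, so most multiply-accumulates can run in integer arithmetic; the block length of 32 and the 4-bit range are assumptions.

```python
# Blockwise low-bit weight quantization with one floating-point scale per block.
import torch

def quantize_blockwise(w, block=32, bits=4):
    """w: (out_ch, in_ch, kh, kw), in_ch divisible by block -> (int8 tensor, per-block scales)."""
    qmax = 2 ** (bits - 1) - 1                                   # e.g. 7 for 4-bit signed
    o, c, kh, kw = w.shape
    wb = w.reshape(o, c // block, block, kh, kw)
    scale = wb.abs().amax(dim=2, keepdim=True).clamp_min(1e-8) / qmax
    q = torch.clamp(torch.round(wb / scale), -qmax, qmax)
    return q.to(torch.int8), scale

def dequantize(q, scale, shape):
    return (q.float() * scale).reshape(shape)

w = torch.randn(64, 64, 3, 3)
q, s = quantize_blockwise(w)
err = (dequantize(q, s, w.shape) - w).abs().max()
print(q.dtype, err.item())                                        # int8 storage, small max error
```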
international conference on learning representations, 2018
Model pruning has become a useful technique that improves the computational efficiency of deep learning, making it possible to deploy solutions in resource-limited scenarios. A widely used practice in relevant work assumes that a smaller-norm parameter or feature plays a less informative role at inference time. In this paper, we propose a channel pruning technique for accelerating the computations of deep convolutional neural networks (CNNs) that does not critically rely on this assumption. Instead, it focuses on direct simplification of the channel-to-channel computation graph of a CNN without the need to perform the computationally difficult and not-always-useful task of making high-dimensional CNN tensors structurally sparse. Our approach takes two stages: first, an end-to-end stochastic training method that eventually forces the outputs of some channels to be constant; then, those constant channels are pruned from the original neural network by adjusting the biases of the layers they impact, so that the resulting compact model can be quickly fine-tuned. Our approach is mathematically appealing from an optimization perspective and easy to reproduce. We evaluate our approach on several image learning benchmarks and demonstrate its interesting aspects and competitive performance. * The research was done when J. Ye was an intern at Adobe in the summer of 2017.
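A small sketch of the bias-adjustment step the two-stage approach relies on: once a channel's output has been forced to a constant, its contribution through the following layer is itself constant and can be folded into that layer's bias before the channel is removed. The shapes, names, and the assumption that padding/boundary effects are negligible are illustrative simplifications.

```python
# Fold a constant input channel of the next conv layer into its bias, then drop the channel.
import torch

def fold_constant_channel(w_next, b_next, const_value, ch):
    """w_next: (out_ch, in_ch, kh, kw); absorb input channel `ch`, assumed spatially constant.
    Ignores border/padding effects, where fewer kernel taps overlap valid input."""
    b_next = b_next + const_value * w_next[:, ch].sum(dim=(1, 2))
    w_next = torch.cat([w_next[:, :ch], w_next[:, ch + 1:]], dim=1)   # remove the channel
    return w_next, b_next

w = torch.randn(8, 4, 3, 3)
b = torch.zeros(8)
w2, b2 = fold_constant_channel(w, b, const_value=0.7, ch=2)
print(w2.shape, b2.shape)   # torch.Size([8, 3, 3, 3]) torch.Size([8])
```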
ACM Transactions on Embedded Computing Systems, 2018
Convolutional neural networks (CNNs) are widely employed in many image recognition applications. With the proliferation of embedded and mobile devices, such applications are becoming commonplace on these devices. Network pruning is a commonly used strategy to reduce the memory and storage footprints of CNNs on mobile devices. In this article, we propose customized versions of the sparse matrix multiplication algorithm to speed up inference on mobile devices and make it more energy efficient. Specifically, we propose a Block Compressed Sparse Column algorithm and a bit-representation-based algorithm (BitsGEMM) that exploit sparsity to accelerate the fully connected layers of a network on the NVIDIA Jetson TK1 platform. We evaluate the proposed algorithms using real-world object classification and object detection applications. Experiments show that performance speedups can be achieved over the original baseline implementation using cuBLAS. On object detection CNNs, an average speedu...
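For reference, a minimal compressed-sparse-column matrix-vector product of the kind such customized kernels build on; the blocking and bit-packing refinements from the article are omitted, and SciPy's csc_matrix is used only to produce the index arrays.

```python
# Plain CSC sparse matrix-vector product, walking the non-zeros column by column.
import numpy as np
from scipy.sparse import csc_matrix

def csc_matvec(data, indices, indptr, x, n_rows):
    y = np.zeros(n_rows)
    for col in range(len(indptr) - 1):
        for k in range(indptr[col], indptr[col + 1]):
            y[indices[k]] += data[k] * x[col]       # indices[k] is the row of this non-zero
    return y

W = np.random.randn(6, 4)
W[np.abs(W) < 1.0] = 0.0                            # a pruned, sparse weight matrix
S = csc_matrix(W)
x = np.random.randn(4)
print(np.allclose(csc_matvec(S.data, S.indices, S.indptr, x, 6), W @ x))  # True
```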