Low-bit Quantization of Neural Networks for Efficient Inference

Pavel Kisilev

Low-bit Quantization of Neural Networks for Efficient Inference

2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)

Abstract

Recent machine learning methods use increasingly large deep neural networks to achieve state of the art results in various tasks. The gains in performance come at the cost of a substantial increase in computation and storage requirements. This makes real-time implementations on limited resources hardware a challenging task. One popular approach to address this challenge is to perform lowbit precision computations via neural network quantization. However, aggressive quantization generally entails a severe penalty in terms of accuracy, and often requires retraining of the network, or resorting to higher bit precision quantization. In this paper, we formalize the linear quantization task as a Minimum Mean Squared Error (MMSE) problem for both weights and activations, allowing low-bit precision inference without the need for full network retraining. The main contributions of our approach are the optimizations of the constrained MSE problem at each layer of the network, the hardware aware partitioning of the network parameters, and the use of multiple low precision quantized tensors for poorly approximated layers. The proposed approach allows 4 bits integer (INT4) quantization for deployment of pretrained models on limited hardware resources. Multiple experiments on various network architectures show that the suggested method yields state of the art results with minimal loss of tasks accuracy.

Deep Neural Networks (DNNs) have achieved extraordinary performance in various application domains. To support diverse DNN models, efficient implementations of DNN inference on edge-computing platforms, e.g., ASICs, FPGAs, and embedded systems, are extensively investigated. Due to the huge model size and computation amount, model compression is a critical step to deploy DNN models on edge devices. This paper focuses on weight quantization, a hardware-friendly model compression approach that is complementary to weight pruning. Unlike existing methods that use the same quantization scheme for all weights, we propose the first solution that applies different quantization schemes for different rows of the weight matrix. It is motivated by (1) the distribution of the weights in the different rows are not the same; and (2) the potential of achieving better utilization of heterogeneous FPGA hardware resources. To achieve that, we first propose a hardware-friendly quantization scheme named sum-of-power-of-2 (SP2) suitable for Gaussianlike weight distribution, in which the multiplication arithmetic can be replaced with logic shifter and adder, thereby enabling highly efficient implementations with the FPGA LUT resources. In contrast, the existing fixed-point quantization is suitable for Uniform-like weight distribution and can be implemented efficiently by DSP. Then to fully explore the resources, we propose an FPGA-centric mixed scheme quantization (MSQ) with an ensemble of the proposed SP2 and the fixed-point schemes. Combining the two schemes can maintain, or even increase accuracy due to better matching with weight distributions. For the FPGA implementations, we develop a parameterized architecture with heterogeneous Generalized Matrix Multiplication (GEMM) cores-one using LUTs for computations with SP2 quantized weights and the other utilizing DSPs for fixedpoint quantized weights. Given the partition ratio among the two schemes based on resource characterization, MSQ quantization training algorithm derives an optimally quantized model for the FPGA implementation. We evaluate our FPGA-centric quantization framework across multiple application domains. With optimal SP2/fixed-point ratios on two FPGA devices, i.e., Zynq XC7Z020 and XC7Z045, we achieve performance improvement of 2.1 × −4.1× compared to solely exploiting DSPs for all multiplication operations. In addition, the CNN implementations with the proposed MSQ scheme can achieve higher accuracy and comparable hardware utilization efficiency compared to the state-of-the-art designs.

Loading Preview

Sorry, preview is currently unavailable. You can download the paper by clicking the button above.

Log In

Low-bit Quantization of Neural Networks for Efficient Inference

Sign up for access to the world's latest research

Abstract

Related papers

Related topics