0% found this document useful (0 votes)
45 views14 pages

Tcas-I Haco Final

This document proposes a hardware-oriented CNN compression strategy and hardware architecture to implement compressed CNN models with high performance on FPGAs. It divides CNN layers into "no-pruning layers" that use parallel computing and "pruning layers" that achieve high compression. It applies uniform and incremental quantization for compression while maintaining accuracy. It implements a compressed VGG-16 CNN on FPGA, achieving 27.5x compression with minimal accuracy loss and processing 83 frames/second, 1.8x faster than state-of-the-art designs.

Uploaded by

dali mohamed
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
45 views14 pages

Tcas-I Haco Final

This document proposes a hardware-oriented CNN compression strategy and hardware architecture to implement compressed CNN models with high performance on FPGAs. It divides CNN layers into "no-pruning layers" that use parallel computing and "pruning layers" that achieve high compression. It applies uniform and incremental quantization for compression while maintaining accuracy. It implements a compressed VGG-16 CNN on FPGA, achieving 27.5x compression with minimal accuracy loss and processing 83 frames/second, 1.8x faster than state-of-the-art designs.

Uploaded by

dali mohamed
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. XX, NO.

X, 2020 1

High Performance CNN Accelerators based on


Hardware and Algorithm Co-Optimization
Tian Yuan, Weiqiang Liu, Senior Member, IEEE, Jie Han, Senior Member, IEEE,
and Fabrizio Lombardi, Fellow, IEEE

Abstract—Convolutional neural networks (CNNs) have been substantial increase of power limits its application to embed-
widely used in image classification and recognition due to their ded systems. For low power and high performance digital
effectiveness; however, CNNs use a large volume of weight systems, field programmable gate arrays (FPGA) [5], [6], [7]
data that is difficult to store in on-chip memory of embedded
designs. Pruning can compress the CNN model at a small and application specific integrated circuit (ASIC) [8], [9], [10]
accuracy loss; however, a pruned CNN model operates slower have been used for CNN accelerators in recent years.
when implemented on a parallel architecture. In this paper, However, the on-chip memory resources in current FPGAs
a hardware-oriented CNN compression strategy is proposed; a are not sufficient to completely store a large-scale CNN model.
deep neural network (DNN) model is divided into “no-pruning Therefore, off-chip memory is generally used in an FPGA
layers (N P -layers)” and “pruning layers (P -layers)”. A N P -
layer has a regular weights distribution for parallel computing implementation of CNNs; this causes a limitation in terms of
and high performance. A P -layer is irregular due to pruning, but bandwidth and speed. Therefore, model compression methods
it generates a high compression ratio. Uniform and incremental have been studied quite extensively. Among them, network
quantization schemes are used to achieve a tradeoff between pruning [11] is one of the most widely applied compression
compression ratio and processing efficiency at a small loss in methods [12], [13], [14]. As a cost of improving compression
accuracy. A distributed convolutional architecture with several
parallel finite impulse response (FIR) filters is further proposed ratio, the irregularity caused by pruning affects the perfor-
for the regular model in the N P -layers. A shift-accumulator mance of parallel computing. A compressed sparse model
based processing element with an activation-driven data flow not only requires decoding but it also causes an imbalanced
(ADF) is proposed for the irregular sparse model in the P -layers. weights load and a difficulty in activations reading. In [15], it
Based on the proposed compression strategy and hardware archi- has been shown that processing a sparse layer takes dozens of
tecture, a hardware/algorithm co-optimization (HACO) approach
is proposed for implementing a N P −P hybrid compressed CNN milliseconds and requires a large memory utilization.
model on FPGAs. For a hardware accelerator on a single FPGA To achieve a better tradeoff between model size and per-
chip without the use of off-chip memory, a 27.5× compression formance of large CNNs, hardware-oriented compression and
ratio is achieved with 0.44% top-5 accuracy loss for VGG-16. hybrid quantization strategies are proposed in this paper by
The implementation of the compressed VGG-16 model on a Xilinx requiring a smaller memory. By considering the processing
VCU118 evaluation board processes 83.0 frames per second (FPS)
for image applications, this is 1.8× superior than the state-of-
feature of a CNN, the size of feature maps is reduced, but the
the-art design found in the technical literature. model size expands as the layers deepen. The reduced feature
maps require less computation, and the expanded models
Index Terms—Convolutional neural network (CNN), field pro-
grammable gate array (FPGA), network compression, hardware require a larger memory. As per the above characteristic,
acceleration all layers are divided into two categories: “no-pruning layers
(N P -layers)” and “pruning layers (P -layers)”. With a regular
weights distribution, a N P -layer utilizes parallel computing
I. I NTRODUCTION for high performance. A P -layer is irregular due to the pruning
Convolutional neural networks (CNNs) have been exten- but it has a high compression ratio.
sively used for image classification [1], [2] and recognition Leveraging the proposed compression strategy, the VGG-16
[3]. For better accuracy, CNNs require intensive and extensive [1], one of the most useful CNN models for image classifi-
computation. For real-time processing, CNNs are usually cation, is implemented on a Xilinx VCU118 evaluation board
accelerated by parallel processors such as graphic processing without off-chip memory such as DRAM to store weight data.
units (GPUs) [4]. Although GPUs accelerate computation, a The proposed CNN accelerators achieve high performance
because only on-chip memory in an FPGA is used. Based on
Manuscript received XX, 2020; revised XX, 2020; accepted XX, 2020. the hardware-oriented compression-based architecture, a hard-
Date of publication XX, 2020; date of current version XX, 2020. This work
is supported by grants from the National Natural Science Foundation of ware/algorithm co-optimization scheme (HACO) is proposed
China (62022041 and 61871216), and the Six Talent Peaks Project in Jiangsu for implementation of the CNNs. To the best of the author’s
Province (2018XYDXX-009). knowledge, this is the first work that implements VGG-16 on
T. Yuan and W. Liu are with College of Electronic and Information
Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, a single FPGA chip without the use of off-chip memory. The
211106, China (email: liuweiqiang@[Link]). main contributions of this work are summarized as follows:
J. Han is with Department of Electrical and Computer Engineering, Uni- • A hardware-oriented compression method and a
versity of Alberta, Edmonton, AB, T6G 1H9, Canada.
F. Lombardi is with Department of Electrical and Computer Engineering, uniformly-incremental hybrid quantization strategy are
Northeastern University, Boston MA 02115, USA. proposed.
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. XX, NO. X, 2020 2

• The proposed compression strategy has been applied in FC layers are the last few layers in a CNN. All input neurons
the VGG-16 model and achieves a 27.5× compression are fully connected to every neuron in the next layer through
ratio with a 2.04% top-1 accuracy loss and a 0.44% top-5 weights. Therefore, FC layers have many weights to be stored.
accuracy loss compared to the single-precision floating- O denotes the number of multiply-accumulate (MAC) oper-
point VGG-16 model using the ISVRC2012 test data set. ations required in each layer (including both OF C and OConv ):
• A distributed convolutional architecture is proposed for
OF C = Uin × Uout
FPGAs with a fast pipeline data path of CNNs.
(2)
• A shift-accumulator based processing element and an
OConv = (W × H × N ) × (K × K × M )
efficient activation-driven data flow are proposed for a
sparse model. where Uin and Uout denote the number of input and output
• A hardware/algorithm co-optimization approach is pro- neurons, respectively.
posed for high performance implementations on FPGAs.
• The proposed VGG-16 network is implemented on the
B. Related Works
Xilinx VCU118 evaluation platform achieving 30.3∼83.0
frames per second (FPS) with a compression ratio of A tiling technique [16], [5] has widely been used to address
34.5∼27.5×, so attaining the highest performance com- insufficient memory in embedded architectures. For image
pared with the state-of-the-art designs. compression, the adaptive joint photographic experts group
This paper is organized as follows. Section II provides the (JPEG) method [17] has been proposed to dynamically adjust
background of CNNs. The proposed compression strategy is the compression ratio for the desired quality. For speeding
presented in Section III. Section IV proposes a distributed up convolution, a fast finite impulse response (FIR) algorithm
FIR based hardware architecture for the N P -layers and a (FFA) [6] has been proposed. The Winograd algorithm has
shift-accumulator based processing element for the P -layers. been studied for sparse networks [18]. To achieve a high
Section V evaluates the computation time and the hardware throughput, [7] has presented a design method to fully exploit
requirement. Section VI proposes a hardware/algorithm co- the limited resources in FPGAs. Approximate computing has
optimization method. Section VII provides the experimental widely been studied in recent years; its objectives are to
results and analysis. Comparison with the state-of-the-art achieve low energy and high performance at an acceptable
designs is also provided in this section. Section VIII concludes accuracy loss [19]. Neural networks require a significant
this paper. large-scale computation and have high error resilience, so
suitable for approximate computing. For example, approximate
multipliers [20] can be used in CNNs with a very small loss
II. BACKGROUND of accuracy [21]; however, this method has only been applied
to small-scale neural networks.
A. CNN Basics
Deep compression has been proposed in [12] to reduce the
CNNs extract the features of images and process these model size of CNNs. Using network pruning [11], weight
feature maps to classify images by finding and using the quantization and Huffman coding, a high compression ratio
weights in each layer. In a deep learning algorithm, the CNN has been achieved. A so-called dynamic network surgery has
weights are found through training. A typical CNN has three been used to accelerate training and avoid unacceptable prun-
types of layers: convolutional (Conv) layers, pooling layers ing [13]. Binary neural networks (BNNs) have been proposed
and fully connected (FC) layers. to train deep neural networks (DNNs) with weights and activa-
Convolutional layers are used to extract image features. tions constrained to +1 or -1 [22]. At a reduced complexity and
A convolutional layer receives a feature map XW,H,M and a small number of weights, BNNs achieve a high performance
generates a feature map of YW,H,N by the filter HK,K,M,N , for embedded systems [23], [24]. Activations for very low
which is calculated by: bit-width has also been proposed, thus saving memory and
M X
K X
K accelerating training.
X
Y (i, j, n) = H(p, q, m, n)X(i×s+p, j×s+q, m) Incremental Network Quantization (INQ) [25] has been pro-
m=1 q=1 p=1 posed for efficient CNN models with low precision weights.
(1) INQ divides weights into several groups, and incrementally
where W and H indicate the width and the height of the quantifies each set of data. At each time of quantization
feature map; K indicates the size of convolution kernel; M process, a network is retrained and weights that have not
and N indicate the input channels and output channels; s is been quantified are updated to compensate for the accuracy
the stride of filter. loss. Using the INQ algorithm, weights in a network can be
Pooling layers compute a local field of feature map to quantified as ±2n with only a small accuracy loss, where n
output a pixel, so reducing the size of feature maps. Average is an integer. The MAC is the main arithmetic unit in CNNs.
and max pooling are the two typical pooling operations that With the quantized data form, multiplication can be replaced
are commonly used in CNNs. Average pooling computes the by shift operations in a MAC; therefore, INQ has a high speed
average value of the local field while max pooling selects the performance on hardware.
largest value of the local field as the output. Max pooling is The significant difference between INQ and the proposed
used in this work due to its high efficiency. compression strategy is that the proposed strategy targets
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. XX, NO. X, 2020 3

Step 1
Pruning Remove Mask
NP-layers

Orignal NN Model Update Mask


Pruning
P-layers
Quantify
Part of weights

Step 2 Retrain
Compressed NN Model Model
Floating-point number Uniform
in NP-layers Quantization
Fixed-point number in
NP-layers Repeat Step 1
Floating-point number Reserve and Step 2
in P-layers

±𝟐𝐧 in P-layers

Fig. 1. Overview of the hardware-oriented compression strategy.

sparse neural network models, whereas the original INQ A. Hybrid Quantization Strategy
causes an unacceptable accuracy loss for sparse models. Due to
the use of non-pruning layers (NP-layers) defined in this work,
For P -layers, INQ is applied to achieve a higher compres-
the proposed method compensates the accuracy loss from the
sion ratio. For a layer l, the weights are stored in an array Wl .
pruning-layers (P-layers) in sparse networks. Furthermore, the
The network sparsity is determined by Tl as a binary array
implementation of NP-layers can increase parallel processing
with the same size of Wl . Tl is calculated by the following
performance due to its regular structure.
equation: 
0 Wl (i) = 0
Tl (i) = (3)
1 Wl (i) 6= 0
III. T HE P ROPOSED C OMPRESSION S TRATEGY A 0 and 1 in Tl indicates that the corresponding weight is zero
or non-zero, respectively. At each time of quantization, 1s in
We propose a hardware-oriented compression strategy for the array Tl are randomly divided into two groups: A1 and
large-scale CNNs to store weights on a chip. The front layers A2 . The data in A2 will be modified to zero which indicates
incur a significant level of computation but with a smaller that the corresponding weight will be quantified. The weight
model size. Also, the front layers are less error-resilient: [12] quantization in the P -layers is represented as:
has applied pruning (albeit at a smaller rate) in the front layers 
+2blog2 |Wl (i)|c Wl (i) > 0
of VGG and AlexNet. [26] shows that by pruning the filters Wl (i) = (4)
−2blog2 |Wl (i)|c Wl (i) ≤ 0
in the front layers, the accuracy significantly decreases. In
addition, an irregular sparse model affects the performance of By using quantization, the original multiplication in a MAC is
parallel computation. Hence, pruning the front layers is not replaced by a shift operation. Therefore, only the sign bit and
efficient. The principle of the proposed strategy is to apply exponent need to be stored for the direction and the number
different compression strategies to different layers, so making of bits to shift, respectively. After quantization, the model is
the network to be front-regular but back-irregular. retrained to compensate for the error. Weights are updated as
All layers are divided into two types: N P -layers and P - per the following equation:
layers. In the beginning, all layers are pruned. Then INQ ∂E
Wl (i) ← Wl (i) − η Tl (i) (5)
is used for quantifying the P -layers. During the incremental ∂Wl (i)
quantization, the error introduced by pruning and quantization where η is the learning rate that depends on the learning
in the P -layers is compensated by the weight update of policy; E is the objective function. Tl (i) denotes the mask
N P -layers, in which the N P -layers are no longer sparse. of weights, which determines whether the weight must be
Therefore, the proposed compressed model has more error re- updated. For zeros entries in Tl , the corresponding weight
silience, hence making the model easier to converge. When the will not be updated because it is already zero or has been
quantization in the P -layers is completed, the N P -layers are quantified. For the N P -layer l, the weights are quantified
quantified as fixed-point numbers so ready for computation. after the quantization of the P -layers. The following uniform
In general, the N P -layers have a regular structure for quantization method is applied:
higher parallel performance, while the P -layers are highly
Wl (i)
compressed for higher compression ratio. The N P -layers are Wl (i) = round( )Q (Q = 2−n ) (6)
generally Conv layers, but the P -layers can include Conv Q
layers and FC layers. The overview of the proposed strategy where Q indicates the quantization factor and n is the fraction
is shown in Fig. 1. bit. For the case that most of weights are less than 1, the
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. XX, NO. X, 2020 4

following quantization method is used: The standard VGG-16 model used in this work is down-
loaded from Caffe Model Zoo1 (as widely used). We tested this
1−Q Wl (i) ≥ 1

 model on the ImageNet dataset 2012 through the framework
Wl (i) = −1 Wl (i) < −1 (7)
Caffe; and obtained the same accuracy as in [29].
round( WQ l (i)
)Q Wl (i) ∈ [−1, 1)

TABLE I
VGG-16 C OMPRESSION R ESULT
Algorithm 1 Hybrid Sparse Network Quantization Strategy
Input: P -layers and N P -layers Data Type Top-1 error Top-5 error Size (MB)
Output: Quantized network model QM
1: for all layers ∈ P do
Baseline 31.90% 12.00% 553MB
2: Compute Tl through Eq. (3) Ours 33.94% 12.44% 20.1MB
3: end for
4: Set grouping ratio r
5: for all layers ∈ P do IV. T HE P ROPOSED H ARDWARE A RCHITECTURE AND
6: Randomly divide index of 1 from Tl into two parts: A1 , I MPROVED DATA F LOW
A2 by grouping ratio r Most state-of-the-art CNN designs on FPGAs use off-chip
7: Set Tl (i) (i ∈ A2 ) to 0 memory to store weights and feature maps. A key advantage
8: Quantify Wl (i) (i ∈ A2 ) through Eq. (4) of the proposed compression strategy is to implement VGG-16
9: end for on the latest Xilinx UltraScale FPGA using on-chip memory
10: Retrain and update weights by Eq. (5) only.
11: Goto 4 until all weights in P -layers have been quantified Based on the proposed compression strategy, a new hard-
12: for all layers ∈ N P do ware architecture for FPGA implementation is developed with
13: Quantify Wl through Eq. (6) or Eq. (7) a fast pipeline dataflow of the CNNs without off-chip memory.
14: end for As mentioned in Section III, the N P -layers are in front of the
15: return QM = P ∪ N P P -layers. Therefore, the compressed model can be processed
in a pipelined manner.
In the proposed compression strategy, weights in the N P - Two hardware architectures are proposed to process these
layers are given by fixed-point numbers, so processed without two types of layers. Due to the regular convolution model in
decoding and faster compared to the compressed weights in N P -layers, an FIR based convolution processing element and
the P -layers. The weights in the P -layers are discrete in an improved data flow are proposed for the N P -layers. For the
±2n , so requiring less memory due to the lower bit width irregular sparse model in P -layers, a parallel shift-accumulator
and smaller quantity. Moreover, the multiplication is replaced based processing element is proposed to reduce the redundant
by shift operations in the P -layers, requiring less resources. computation in the parallel processing of the sparse models.
Overall, the compressed CNN models have a high performance
on hardware implementation. The hybrid quantization strategy
A. Conv Processing Element
is detailed in Algorithm 1.
A 1-D convolution is processed by a FIR filter, and multiple
parallel FIR filters (the number of FIR filters is denoted as
B. Compression Experiments and Results F ) compute the 2-D convolution. The FIR filter is efficient
in convolution processing due to its stringent requirement on
We implemented the proposed compression strategy on the bandwidth. Considering a 3 × 3 convolution, three inputs are
Caffe [27] platform to compress the VGG-16 [1] model; required for an output in the FIR filters, while nine inputs are
this has been trained and tested with ILSVRC2012 data set required in traditional methods.
[28]. By using the proposed strategy, 96%, 96% and 77% of [6] has proposed a parallel fast FIR algorithm (FFA) for
the weights are pruned in three FC layers. For quantization, CNNs, which can save multiplications in convolution. Com-
weights bit-widths are set to 8-bit and 4-bit for N P -layers pared with [6], the proposed FIR filter design uses cut-set
and P -layers. Moreover, 5 bits are set to store the index for retiming for the convolution kernel; this approach requires
a weight in the pruned layers, which indicates the number less resources and can be used to process larger convolution
of zeros between two adjacent no-zero weights. Most of the kernels very efficiently. So, it is possible to highly parallelize
weights in the Conv layer are less than 1; therefore, we employ processing in higher levels with less resources. As shown in
Eq. (7) for the N P -layers. Overall, weights are incrementally Fig. 2, in our design, a 3-tap FIR filter is retimed to improve
quantified from 50%, to 75%, to 87.25%, and to 100%. In the frequency, and 3 parallel retimed FIR filters constitute a
addition, all activations bit-widths are set to 16-bits. In these 3 × 3 convolution processing element (Conv-PE) to compute
experiments, a 20.1MB compressed model is utilized with a the 2-D convolution.
33.94% top-1 error and a 12.44% top-5 error, as shown in A complex convolution processing unit (CCPU) consists of
Table I. Compared to the single-precision floating-point VGG- several Conv-PEs; the number of Conv-PEs in a CCPU is
16 model, we achieve a 27.5× compression ratio with 2.04%
top-1 accuracy loss and 0.44% top-5 accuracy loss. 1 [Link]
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. XX, NO. X, 2020 5

X0 Retimed
X
D D 3-tap FIR
cycles. To eliminate the redundancy caused by the FIR latency
X1
Retimed
Y Cf ir , neighbor rows of the feature map are connected by filling
3-tap FIR K zeros. By applying the so-called integrated rows of the
X2
Y Retimed feature map, the W ×H matrix becomes three (W +K−2)×H
3-tap FIR
arrays that requires (W + K − 2) × H + Cf ir cycles, where K
cut-set cut-set is the width of the Conv kernel. As shown in Fig. 4, Row 1 to
retiming retiming
Row H are integrated as the second input array of the Conv-
X0 Retimed
X D PE. In particular, the first and last arrays must be padded with
D D D 3-tap FIR
X1 Y zeros before processing. When W is small and Cf ir is large,
Retimed
D
3-tap FIR
such as in the last several layers of VGG-16, performance can
Y X2
D Retimed be significantly improved.
D
3-tap FIR

(a) (b)
It is worth mentioning that for convolution kernel of differ-
ent sizes, zero padding is also different. The processing speed
has been increased with the improved data flow by slightly
Fig. 2. Convolution processing element (Conv-PE): (a) a cut-set retimed 3-tap
FIR; (b) a cut-set retimed 3 × 3 convolution processing element. requiring more memory.
Input Feature Map Output Feature Map

0
denoted by Mnp . As Fig. 3 shows, a CCPU computes Mnp 0
Row 1 0 0
channels of a feature map in parallel. The Mnp data output 0
Row 2 0 0
from the Conv-PEs is also accumulated. To reduce the delay, 0

H
H
0 0
the Mnp data are divided into Mnp /a groups, where a denotes 0
0
Row H
the number of data that is added in a group. Therefore, the en-
W W+1
tire addition is divided into dloga (Mnp )e levels. Additionally, Integrate

when Mnp is less than the input channel M , the accumulated 0 0 Row1 0 0 Row H-1 0

data will be temporarily stored in the CCPU buffer; it will be Row1 0 Row2 0 0 Row H 0 Conv-PE X X X X

Row2 0 0 Row H 0 0 0

sent to the accumulator in the next iterations. After M/Mnp Accumulator,


ReLU and
Pooling
iterations, the CCPU outputs a channel of the feature map. 0 0 Row1 0 0 Row H-1 0

Row1 0 Row2 0 0 Row H 0 Conv-PE X X X X


Pooling layers may occasionally follow the Conv layers. Row2 0 0 Row H 0 0 0

When pooling is required, the data output from ReLU is stored (W+1)H (W+1)H
in a pooling buffer. Several rows of data need to be stored prior
to computing. Fig. 4. Feature map integration for 3 × 3 kernel and the data flow in Conv
layers.

Conv-PE Conv-PE Conv-PE


C. F×F Ping-Pong Buffer for Conv-PEs
a
PM F parallel FIR filters require an F times sampling rate of
Level 0 Addition
the non-parallel version. As the on-chip storage is limited, an
PM/a
Level 1 Addition F × F ping-pong buffer (FPPB) is proposed.
Pooling
A F parallel FIR requires F lines of input data in the
MUX

Accumulator
Pooling PE
proposed hardware architecture. The data in the feature map is
stored in rows. Then the F data in different addresses must be
PE Buffer ReLU Pooling Buffer
read to match the processing speed for parallel performance.
As shown in Figs. 5 and 6, we first integrate the F adjacent
data in the feature map. Then the integrated data (A, B, C...)
Fig. 3. Complex convolution processing unit (CCPU).
are sequentially sent to the FPPB, which consists of two F ×F
blocks, namely, BLOCK0 and BLOCK1. These two blocks
are controlled to alternatively receive and output data. Once a
B. Improved Data Flow in Conv Layers block is full, it outputs data in columns while the other block
For a better performance of 2-D convolution by FIR filters, receives the data.
an improved data flow in Conv layers is proposed; it integrates This scheme can transform spatial relations of data in a
a 2-D feature map matrix into an array. To integrate each row feature map to match the parallel FIR processing, while re-
of the feature map, zeros are filled between two neighbor rows, quiring less memory compared with the conventional caching
which are of no use after convolution. These values are reset method [6]. Through the FPPB, the bandwidth requirement is
to zero in the ReLU module. also relaxed with a small hardware utilization.
A Conv-PE receives three parallel arrays and output one
array. To acquire an array with a length of W , the process takes D. Distributed Conv Architecture
W + Cf ir cycles, where Cf ir denotes the latency of a FIR. To With no off-chip memory, a distributed Conv architecture is
acquire a W × H matrix, the process costs (W + Cf ir ) × H proposed for high speed processing. Each Conv-PE in the same
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. XX, NO. X, 2020 6

Input Array
A D
map buffer for the next layer computation.
B E
C ...
Mnp
Feature FMR Feature Map RAM
map FMR FMR FMR
Map FPPB 𝐅 × 𝐅 Ping-Pong Buffer
Buffer WR Weight RAM

FPPB FPPB FPPB WB Weight Buffer

𝐅 × 𝐅 Ping-Pong Buffer Output Array


A A A D D D A A A L
CCPU WB WR
[0:15] [16:31] [32:47] [0:15] [16:31] [32:47] [0:15] [16:31] [32:47] [32:47]

B B B E E E B B B M CCPU WB WR
[0:15] [16:31] [32:47] [0:15] [16:31] [32:47] [0:15] [16:31] [32:47] [32:47] Data

Nnp
C C C F F F C C C N Trans
[0:15] [16:31] [32:47] [0:15] [16:31] [32:47] [0:15] [16:31] [32:47] [32:47]

Block0 Block1
CCPU WB WR

Fig. 5. F×F ping-pong buffer for parallel FIRs.


Fig. 7. The distributed convolution architecture.

In the proposed distributed architecture, the feature map


storage format is determined by Mnp . As shown in Fig. 8,
the first Mnp feature maps are set as a group, stored in the
bottom of FMRs. The next groups are subsequently stored
in a stack fashion. Due to the different size of feature maps
between layers, the indexes of the next groups are hard to be
determined. To ensure the architecture work correctly, Mnp
should be a multiple of Nnp . Dual port RAMs are used in
this implementation and there is no read and write conflict.
Fig. 6. The timing diagram of the proposed FPPB.

Feature
CCPU has a RAM to transmit the feature map. In addition, Feature

FMR
Map
Map 1
PM+1
different CCPUs receive weights from different RAMs; so,

Nnp
the speed of data transmission matches the processing speed,
hence relieving the restriction of bandwidth.
Feature

FMR
Feature
As shown in Fig. 7, there are Mnp feature map RAMs CCPU Map
Map PN
PM+PN

Buffer
(FMRs) to store the feature map, and each one corresponds to

Mnp
Nnp

a FPPB. The Mnp data in the feature map is sent to each Conv-

FMR
Feature Feature
PE after the conversion of FPPB, so the Mnp input channels CCPU Map Map

are processed at the same time.


Nnp

In the Conv architecture, each CCPU processes a 3-D


Feature
feature map to output a 2-D feature map, as briefly discussed
FMR
Feature
Map
Map PM
2PM
in Section IV-A. Nnp denotes the number of CCPUs that
process the convolution in parallel. Similarly, Nnp weight
RAMs (WRs) store the weights in a distributed fashion, such Fig. 8. Feature map storage format in FMRs.
that each WR corresponds to a CCPU. All CCPUs receive
the same data of the input feature map. Due to the different
convolution kernels, this implies that Nnp channels are output E. Shift-Accumulator Based Processing Architecture for P -
at the same time. Layers
The entire convolution architecture processes Mnp channels Based on the proposed compression strategy, the P -layers
of input feature map in parallel, and outputs Nnp channels of make a highly compressed sparse model with ±2n weight
the output feature map. types. Multiplications are replaced by shift operations to
Nnp is generally smaller than the number of output channels reduce resources. To accelerate the processing of the P -layers,
N ; therefore, a buffer is required for temporarily storing an there are two levels in this parallel computing scheme:
intermediate feature map. Using the data integration method • Unrolling the multiple shift-accumulations.
of Section IV-C, the serial data that is output by a CCPU, is • Unrolling the different weight kernels (Conv kernel in
converted into F parallel streams and stored in the map buffer. Conv layers or all weights connected to a output neuron
The map buffer receives Nnp channels of data and output Mnp in FC layers). Due to the irregular sparse model, compu-
channels of data. If the input feature map data of the current tations of each weight kernels are different, so decreasing
layer is of no use, then they are replaced by the data from the the utilization of the hardware units.
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. XX, NO. X, 2020 7

Fig. 9 shows the Mp × Np shift-accumulator based pro- to determine Np .


cessing unit (SAC-PU). A shift-accumulator consists of Mp The architecture of SAC-PU has high fanouts when Np is
shifters while a SAC-PU consists of Np shift-accumulators. large. To reduce the influence of a large net delay, the output
Mp shift-accumulations are executed in a shift-accumulator; registers of the activations and decoder can be copied for
the Np shift-accumulators process different sparse weight sharing the loads.
kernels.
To fully exploit the sparsity of the model, computations in
F. Generalization of the Proposed Architecture
the shift-accumulator are selected by the non-zero weights.
Thus, activations must be read, and they require specific The proposed F × F Conv-PE also computes smaller Conv
indexes that depend on weights. As the sparse weight kernels kernels, as determined by the input weights. For example, 4
must be considered together, different indexes cause issues input non-zero weights and 5 input zeros correspond to a 2×2
when reading activations. So, an activations-driven data flow Conv kernel. The 3 × 3 Conv kernel is widely used in CNNs,
(ADF) is proposed for the P -layers to reduce redundant so with good efficiency when the size of the Conv-PE is 3×3.
computations and improve the read speed of activations. When Some CNNs use larger Conv kernels in some layers, such
an activation is not needed in a sparse weight kernel, it as 7×7 in the first layer of ResNet [2], but this is well beyond
may still be needed in other weight kernels; therefore, an the capability of a Conv-PE; however, it is not always effective
activation is processed at every shift-accumulator to assess to increase the size of Conv-PE due to rare use of large Conv
whether needed. Activations are processed in an active mode, kernels; therefore, a general Conv-PE is proposed for CNNs
so this is referred to as ADF. at a high efficiency.
Consider a 3 × 3 Conv-PE and three 3-tap FIR filters
connected in series, as shown in Fig. 10, MUXs are inserted
in each connection to select the functions of the Conv-PE.
Activations Activations 1 Shift-Accumulator
x0 x1 x2
Shifter

MUX
MUX
D D D D D D D
If needs?
Level 0 Addition

Level 1 Addition
Mp

0 0
Shifter

Indexes 1

MUX
MUX
Decoder
Shifter

Weights 1

Bit Splicing MUX MUX


y
Activations
SN
Shift-Accumulator
Fig. 10. The transformed Conv-PE for general CNN computation.
Shifter

D
If needs?
Level 0 Addition

Level 1 Addition

• Parallel Mode: When Conv kernels are smaller than or


Mp

equal to 3 × 3, the FIR filters receive multiple data from


Shifter

Indexes SN
the feature map to compute at the same time multiple
Shifter

Weights SN rows. A Conv-PE computes 2-D convolution. However,


when the size of Conv kernel is 1 × 1, this Conv-PE is
only 11.1% as powerful as the 3×3 scheme; therefore, the
Fig. 9. The architecture of SAC-PU and activations-driven data flow (ADF). output bit-width must be expanded 3 times for 3 output
data when Conv kernel is 1 × 1.
The efficiency of ADF is determined by Np and the sparsity • Serial Mode: When the Conv kernels are larger than
of P -layers. Eq. (8) is used to evaluate the efficiency, where 3 × 3, each FIR filter receives the data from the previous
Rkp denotes the pruning rate of each weight kernel. When filter to configure as a larger FIR filter. Each Conv-PE
Eadf is larger than 1, ADF is faster than traditional methods. loads one row of weights for 1-D convolution. If the
A large Np value implies that the activation is likely to be kernel size is smaller than 9 × 9, zeros will be filled
accepted by other weight kernels, so increasing the efficiency for the FIR filter. Thus, there may be some inefficiency
of ADF. However, Np cannot take a very large value because when the Conv-PE operates in serial mode; however as
it would require a substantial increase hardware; therefore, Mp large Conv kernels are a small part in most networks, the
can be reduced for a larger Np . performance will not be significantly affected. Multiple
PNp executions of the 1-D convolution are required for a 2-
Activations × (1 − Rkp ) D convolution. For example, if the size of the kernel is
Eadf = (8)
Activations 7 × 7, the feature maps should be input 7 times for a 2-D
The pruning rate of each weight kernel’s is difficult to find convolution. However, in some cases, the multiple 1-D
analytically; therefore, the average of the pruning rate in a convolutions can be unrolled; for example, if Mnp = 32
layer can be used to evaluate the efficiency, that can be used and M = 3, each seven Conv-PEs can load 49 weights
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. XX, NO. X, 2020 8

for the 7 × 7 convolution. The 7 outputs from Conv- Computation hardware is determined by Mnp and Nnp ,
PEs are added in the accumulator. These seven Conv-PEs which mostly originate from the Conv-PEs and the accumula-
compute a 2-D convolution; also, the 3 input channels tor. The number of multipliers and adders are given as follows:
should be expanded to 21 for parallel computing.
M ulnp = 9Mnp × Nnp
The proposed general Conv-PE can switch modes between dlog2 (Mnp )e
(13)
1-D and 2-D convolution with a processing capability for a
X
Mnp /2i ) × Nnp
 
Addnp = (8Mnp +
Conv kernel smaller than 9 × 9. The serial FIR filters cause a i=1
significant delay, that must be retimed for high performance.
In Eq. (13), all adders are converted into a two-input form for
evaluation.
V. E VALUATION

Under the N P -P hybrid model, the performance of each B. P -Layers Computation Time Evaluation
module is analyzed. In this section, computation time and The computation time in the P -layers can be divided into
hardware utilization are quantitatively evaluated for the pro- three parts: weights decoding, activation reading and SAC
posed optimization algorithm of Section VI. computing. The compressed weights are decoded into weights
and indexes, stored in caches. Then the activations are read
by ADF. After all data is cached, the SAC-PU processes such
A. Time Analysis of N P -Layer data. Each part operates based on the previous parts. Therefore,
the time consumption in each part must be accumulated.
For processing the Conv layers using the proposed CCPUs,
For weights decoding, the required number of clock cycles
a 2-D convolution takes (W + 1) × H + Cf ir clock cycles,
depends on the number of weights after pruning:
so all data of the feature map is read once. Consider the input
and output channels, then the clock cycles of all N P -layers P
X
can be computed as follows: Cw = Ql × (1 − Rlp ) (14)

XNP 
M
 
N
 where Rlp denotes the pruning rate of a layer.
Cnp = [(W + 1) × H + Cf ir ] × × (9) The P -layers may include Conv layers and FC layers, so
Mnp Nnp
two cases of clock cycle utilization must be differentiated:
If Conv-PEs work in a serial mode, it will take an additional
PX
time given by a number of clock cycles with a factor of Conv  
N
KH compared to the parallel mode. Consider the unrolling Ca = max (L) × W × H × +
Np
of convolution, then Eq. (9) is computed as follows: P
(15)
FC  
X Uout
NP 
M ×K
 
N
 max (L) ×
Np
X
Cnp = [(W + 1) × H + Cf ir ] × ×
Mnp Nnp
(10) where L is the length of the activations (equal to the last index
value of the weight kernels).
As weights are loaded in the buffer prior to the next
convolution, no additional time overhead is encountered. Due Similarly, the number of clock cycles required for SAC
to the map buffer, the feature map can be loaded in FMRs computing is given by:
when the map data in the last iteration is of no use. Therefore,
the total time is given by: PX
Conv    
max (Qk (1 − Rkp )) N
Cs = ×W ×H × +
Tnp = Cnp (Mnp , Nnp ) × tnp (11) Mp Np
PFC    
For storage, the largest feature map and the number of
X max (Qk (1 − Rkp )) Uout
×
weights determine the size of the used memory. As per reuse, Mp Np
(16)
Nnp affects the size of the map buffer. Additionally, each
where Qk is the weight number and Rkp is the pruning rate
CCPU requires a buffer to store at least a feature map for
of a weight kernel, different from Eq. (14).
accumulation. Therefore, the entire memory is given by:
As per the above equations, the total time execution of the
2N − Nnp P -layers is given as follows:
M Snp = × max(W × H × M ) × ba +
N NP
NP (12) Tp = Cw (Rlp )tw + Ca (L, Np )ta + Cs (Rkp , Mp , Np )ts (17)
X
Nnp max(W × H) × ba + Ql × bw
NP A pipelined design can be achieved between reading activa-
where ba and bw denote the activation bit-width and weight tions and the SAC computation for the processing time to be
bit-width; Ql denotes the weight number of a layer. In general, overlapped. Therefore, the time execution is given by:

the largest feature map can be found in the first several layers; Cw (Rlp )tw + Ca (L, Np )ta C a ta ≥ C s ts
thus the layer dividing method does not affect the size of the Tp =
Cw (Rlp )tw + Cs (Rkp , Mp , Np )ts Ca ta < Cs ts
memory in the N P -layers. (18)
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. XX, NO. X, 2020 9

Evaluation
Memory
Consumption
NP-Layers Compress

Splitter
Computation
Orignal Compressed
Move
Time NN Model NN Model
Adjust
Resources Pointer P-Layers

Fig. 11. Overview of hardware/algorithm co-optimization (HACO).

Storage in the P -layers is function of the largest feature


map, the number of weights and the size of the weight cache. NP-Layers
FMR Conv -PE P-Layers
The memory size is given by: Input
FMR Conv -PE Map Activations
P
X Buffer
Output
M Sp = 2 max(W × H × M ) × ba + Ql × bw + FMR Conv -PE
SAC-PE Activations
P
(19)
Np
X
Np × max (Qk ) × (bw + bi )
Computing AR AR AR AR
WD WD Margin
NP-layers SC SC SC SC
where bi denotes the index bit-width. WD Computing
Weights Decoding
Computational resources are determined by the number of AR Activations Reading
NP-layers

shifters and accumulators, which is the same as the N P -layers: SC SAC Computing Total time

Shtp = Mp × Np
dlog2 (Mp )e
Fig. 12. Time consumption of the pipelined hybrid system.
X (20)
Mp /2i × Np
 
Addp =
i=1
where Ar and Vr are constant, that are determined by user
In Eq. (20), all adders are also converted into a two-input form requirements.
for evaluation. Tnp and Tp are affected by multiple factors; among them,
the frequency is the most difficult to be evaluated quan-
VI. H ARDWARE /A LGORITHM C O -O PTIMIZATION titatively. However, it can be assessed after synthesis. By
In this section, HACO is proposed to find the best design utilizing more hardware, frequency could decrease, so a loss
by dividing the layers into N P -layers and P -layers and in performance, which is often unavoidable. Therefore, only
by proper sizing the CCPUs and SAC-PU, hence reducing parameters such as Mnp , Nnp , L, Rlp , Rkp , Mp , Np and layer
the computation time while retaining a high efficiency. The divisions need to be considered.
overview of HACO is shown in Fig. 11.
B. Goal Simplification
A. Optimization Objective For efficiency of the hardware, the considered parameters
For pipeline processing, N P -layers and P -layers are pro- are analyzed under a few sampling assumptions. As mentioned
cessed by CCPUs and SAC-PU, respectively. As Fig. 12 in Section IV-D, Mnp is a multiple of Nnp . Therefore, Tnp
shows, the N P -layers are processed by the CCPUs first. After is denoted as Tnp (k, Nnp ), where k is a positive integer.
the computation of the N P -layers has been completed, the However, to fully reuse the memory, Nnp must be increased
feature maps as output of the CCPUs are sent to the SAC-PU. as much as possible; due to hardware limitation, it can be
Meanwhile, the CCPUs execute the next computation. Thus, achieved when k is equal to 1.
the entire computation time is determined by the slower step L and Rkp are related to each weight kernel (that must be
in the entire process. considered when implementing pruning). To ensure correct-
Consider the limited resources on FPGA, then this problem ness, L is given by the largest amount such that all activations
can be formulated as: are read once per shift-accumulator. Therefore, there is a
min Tnp (Mnp , Nnp ) difference in values between the actual Tnp and Tp .
Rkp is difficult to determined prior to training due to the
s.t. Tnp (Mnp , Nnp ) ≥ TP (L, Rlp , Rkp , Mp , Np ) irregular pruning. Provided that SAC computing is faster than
Resources U tilization(Mnp , Nnp , Mp , Np ) ≤ Vr the process of activation reading, the proposed architecture
Accuracy(L, Rlp , Rkp ) ≥ Ar operates correctly. In the case that the same frequencies are
(21) used for them, Mp should be less than or equal to the number
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. XX, NO. X, 2020 10

of activations read out in a clock cycle. Due to the above A Larger Nnp may improve performance. However, the
simplification, the process is now given by: efficiency may decline due to the ceils in Eq. (9), which means
a waste in computation. The best scenario is that there are no
min Tnp (Nnp )
remainders of NMnp and NNnp in every N P -layers. Computation
s.t. Tnp (Nnp ) − Tp (Rlp , Mp , Np ) ≥ M argin is different between layers due to the different values for W
(22)
Resources U tilization(k, Nnp , Mp , Np ) ≤ Vr and H. If the efficiency needs to be taken into account to
Accuracy(Rlp ) ≥ Ar determine Nnp , it can be evaluated by:
The actual accuracy is hard to determine unless training and PN P
an implementation are pursued. However, the least value of the [(W + 1) × H + C] × NMnp × NNnp
EConv = P l m l m
accuracy can be acquired before retraining. For example, the NP
[(W + 1) × H + C] × NMnp × NNnp
requirement of accuracy is given by Ar . A network (all layers
(24)
are sparse), whose accuracy is A and A ≥ Ar , can be selected
as a baseline sparse network. Assume the actual accuracy is Algorithm 2 Hardware/Algorithm Co-Optimization Algorithm
A∗ , and A∗ is always larger than A because some layers are Input: spliter pointer, RA, prior Rlp
N P -layers. Therefore, it can be ensured that A∗ is larger than Output: Nnp , Mp , Np and new Rlp
Ar (A∗ ≥ A ≥ Ar ). 1: Allocate available hardware to CCPUs and SAC-PU by
RA.
C. Hardware/Algorithm Co-Optimization 2: for all Mp , Np and Nnp do
3: Find the largest value under the restriction of each
The utilization of CCPUs and the number of N P -layers are
available hardware.
two important parameters that determine the performance, but
4: end for
it is difficult to consider them together. Therefore, the proposed
5: for all layers do
approach considers them separately.
6: Compute the required memory and find the minimal
A spliter pointer is used to determine whether the layers
value (the furthest spliter pointer can move).
belong to either the N P -layers or the P -layers. At the begin-
7: end for
ning, the pointer must be initialized. As the Conv layers incur
8: Compute TN P , TP .
in significantly more computation than the FC layers, and the
9: if TN P − TP ≥ M argin then
number of weights is large in the FC layers, the spliter initially
10: while TN P − TP ≥ M argin and the pointer is under
indicates that N P -layers are Conv layers and FC layers are
the valid range do
P -layers.
11: Move the spliter pointer to the front layer.
Then available hardware is allocated for CCPUs by a RA.
12: Compute TN P , TP .
RA is defined as follows:
13: end while
Resourcesnp 14: Retrain the model through the spliter, acquiring new
RA = (23)
Resourcesp Rlp .
In FPGA, the Resources can be evaluated by the number of 15: Compute TN P , TP .
DSPs and LUTs, and the number of DSPs should be converted 16: if TN P − TP < M argin then
into the equivalent number of LUTs. 17: repeat
A higher RA means that more N P -layers can be processed 18: Reduce Nnp and enlarge Mp , Np .
with the restrictions of Tnp > Tp , which is faster due to higher 19: until Minimal TN P − TP
efficiency in N P -layers. A smaller RA means more P -layers, 20: else
which have a smaller model size. 21: repeat
After initialization, the spliter pointer moves to the front 22: Enlarge Nnp and reduce Mp , Np .
layers with a fixed RA, so reducing Tnp . Moreover, Tp is 23: until Minimal TN P − TP
calculated by using the prior value of Rlp which is evaluated 24: end if
by the pruning rate of a full sparse model. When the pointer 25: else
is fixed, the model must be retrained. Due to the N P -layers, 26: while TN P − TP < M argin do
the final Rlp is larger than the previous value. Then, RA is 27: Reduce Nnp and enlarge Mp , Np .
finely tuned for the final result. 28: Compute TN P , TP .
During the pointer adjustment process, the activation mem- 29: end while
ory increases, but the weight memory decreases. If the total 30: end if
memory increases, the pointer returns to the back layers, 31: return Nnp , Mp , Np and new Rlp
finding the least size for the memory. Therefore, the farthest
distance the spliter pointer can be adjusted, is rather limited, so
VII. E XPERIMENT AND R ESULTS
RA cannot be too small. Similarly, if Tnp − Tp < M argin at
the beginning, the pointer is not adjusted but more hardware is A. Experiment Set
allocated to SAC-PE for satisfying this condition. This process The proposed processing architecture is scalable with differ-
is described in Algorithm 2. ent numbers of Nnp and Mnp , so affecting performance and
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. XX, NO. X, 2020 11

TABLE II
C OMPARISON WITH OTHER VGG-16 D ESIGNS WITH THE S AME L EVEL ACCURACY (66.06% T OP -1)

Approaches FPGA’16 [29] TCAS-I’17 [6] TCAD’18 [7] Max RA Min RA


Platform Zynq ZC706 Virtex VC707 Virtex VC709 Virtex VCU118
LUTs 182616 215556 337152 695320 249432
FFs 127653 66792 606307 243802 127105
DSPs 780 2296 2877 4096 1024
BRAMs 486 - 882.5 1779 2045
URAMs - - - 779 661
DRAM used Yes Yes Yes No
FPS 8.9 33.8 45.5 83.0 30.3

the utilization of the DSP units. Memory depends on the size puting, as the improvement in frequency is not effective. If
of the network model and the input image; so, the number of activations are needed to be integrated, weights should be
on-chip memory resources of an FPGA determines whether decoded in the same form. However, the weights in a sparse
the network can be implemented on the FPGA without off- model are not serial; this causes the number of useful weights
chip DRAM. to be difficult to calculate. Therefore, parallel decoding cannot
HACO is implemented on VGG-16 with the Xilinx VCU118 be achieved when multiple activations are integrated. Since
platform. Performance varies as function of [Link] largest the feature map shrinks during pooling, the integration of
value of RA has the best performance and the largest model activations results in less efficiency in the P -layers. Therefore,
size (i.e., 21.1MB as mentioned in Section III-B). The least only the parallel weight decoding is used in our experiment.
value of RA has a slower speed but also the smallest model
size. Initially, the N P -layers include Conv layers only, and B. Performance Analysis and Comparison
P -layers are FC layers. With the decrease of RA, the spliter
In HACO, the largest and least values of RA are utilized
pointer moves to the front layers. When the pointer moves
as shown in Table IV. For the largest value of RA, Tnp is
to layer Conv 3-3, the needed memory increases due to the
larger than Tp initially. Therefore, the spliter pointer does not
large feature map. Therefore, the layer Conv 3-3 is the farthest
move, so staying at the end of the Conv layers. In this case,
distance that the pointer can move in VGG-16.
Nnp = 33 and Np = 51 are obtained by HACO. For the least
For the largest model, VCU118 requires significant on- value of RA, Np = 256 and Nnp = 16 are obtained, and
chip memory. URAMs and BRAMs are utilized for VCU118. finally the pointer moves to the layer Conv 4-3. To achieve
Due to the large model size, the memory for the weights is the highest efficiency, Nnp and Np are mapped to 2n in our
given by URAMs. To fully use the RAMs resources, 8 9-bit experiments. For the P -layers, the caches are synthesized as
weights are utilized. The remaining memories are allocated BRAMs and distributed RAMs. The results are shown in Table
for activations. 3 parallel activations are utilized for a bit- IV.
width of 48-bit. Therefore, every 3 FMRs with a bit-width of The FMRs and CCPUs use the same frequency (200 MHz
144-bit are synthesized as two groups of URAMs or BRAMs. for Nnp = 16; 150 MHz for Nnp = 32). Weights are loaded in
This allocation is adjusted by balancing URAMs and BRAMs. the buffer during convolution. As weight loading is faster than
Computation in VCU118 is performed by mostly DSPs and convolution computation and requires a large area and longer
LUTs. For high performance in multiplication, DSPs are delay in the design, the weight RAM uses a slower frequency
allocated for N P -layers, and LUTs are allocated for P -layers (100 MHz in this work). All frequencies are shown in Table
for shift operation. The available hardware are given by 60% of V.
the VCU118. As DSPs cannot be used in P -layers, a decrease Table II shows the hardware resources and performance
of RA is possible. For the highest efficiency in convolution, of previous VGG-16 FPGA designs and the proposed design
Nnp , Mp and Np should be given by 2n , where n is a positive with the largest and least values of RA. The proposed design
integer. Np is limited to a range of 32∼512 due to the ADF achieves the highest FPS of 83.0 at 150MHz; this is 1.8×
efficiency and the largest number of output channels. In the faster than [7]. 4,096 DSPs are used due to a large number
P -layers, activations and weights are required to be cached of CCPUs. All other three state-of-the-art designs use off-
before computing but accounting for large memory. When the chip DDR3 DRAM so limiting the processing speed due to
available BRAMs are insufficient, distributed RAMs can be bandwidth; the proposed design uses only on-chip BRAMs
used as alternatives. and URAMs for storing the feature map and weights.
To improve the speed of weight decoding and activation RA is reduced at a smaller model size. As shown is Table
reading, data must be integrated in RAMs for parallel com- II, the design with the least RA has a smaller memory (URAM
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. XX, NO. X, 2020 12

TABLE III
C OMPARISON BETWEEN P ROPOSED G ENERAL CCPU S AND OTHER D ESIGNS

Approaches [30] [31] [31] [31] Ours


Stratix V Arria 10 Arria 10 Arria 10
Platform Virtex VCU118
GSMD5 GX 1150 GX 1150 GX 1150
Network ResNet-152 VGG-16 ResNet-50 ResNet-152 VGG-16 ResNet-50 ResNet-152
(# of Operations (GOP)) (22.62) (30.70) (7.74) (22.62) (30.70) (7.74) (22.62)
Frequency (MHz) 150 200 200 200 150
Logic Elementsa 45.7K 138K 221K 235K 781K
b
DSPs 1044 1518 1518 1518 4096
(# of MAC Units) (2088) (3036) (3036) (3036) (4096)
Latency (ms) - 29.8 11.5 29.7 12.0 9.1 20.3
Throughput (GOPS)c 226.5 1030.2 672.0 761.6 2558.3 850.5 1114.3
a Xilinx FPGA in LUTs and Intel FPGA in ALMs.
b One DSP block in Intel FPGA can be configured as two independent 18-bit×19-bit multipliers.
c Throughput(GOPS) = Operations(GOP) / Latency(s)

TABLE IV with [7], the least RA achieves 67% performance with a 36%
R ESULTS ON VGG-16 BY HACO WITH L ARGEST AND L EAST RA usage of DSPs and, a 20% usage of FFs.
Although there are some designs of embedded BNNs which
Max RA Min RA could achieve even higher FPS, the accuracy loss is higher than
N P -layer P -layer N P -layer P -layer the designs in Table II. For example, [24] can only achieve
55.8% top-1 accuracy for VGG-16. Therefore, these designs
27241 105400 are not considered; only the designs with the same level of
LUTs 668079 144032
(86978) (342286) accuracy are compared in Table II.
18232 71468 CCPUs are also applied to ResNet. In the first layer of
FFs 225570 55637
(68176) (271143) ResNet, the size of the Conv kernels is 7 × 7, therefore, the
DSPs 4096 - 1024 - CCPUs will operate in a serial mode. Seven copies of the input
192 765 image are loaded into the FMRs, and each image is processed
BRAMs 1587 1280 by 1-D convolution. Then, the output channels is added to
(37) (125)
obtain an output feature map. For ResNet, an extra RAM is
URAMs 604 175 414 247 utilized to store the feature map of the shortcut connection.
Model Size 20.1MB/ 27.5× 16.0MB/ 37.5× The FMRs must store up to 224 × 224 × 32 data for input
feature maps; map buffer and shortcut buffer must store up to
FPS 83.0 30.3 112 × 112 × 64 data, separately. The required total memory
The results, of which caches are synthesized as distributed RAM, are is 67% of that used by VGG-16. As per the time evaluation
shown in parentheses. mentioned in Section V, computation in all the Conv layers
only requires 4.0 ms for ResNet-34, 9.1 ms for ResNet-50
TABLE V and 20.3ms for ResNet-152 when Nnp = 32 (largest RA).
F REQUENCIES OF N P - LAYERS AND P - LAYERS The proposed general CCPUs are compared to other state-of-
the-art FPGA designs in Table III.
TABLE V
FREQUENCIES OF NP-LAYERS AND P-LAYERS

          NP Frequency (MHz)      P Frequency (MHz)
          fW       fConv          fW      fA      fSAC
Max RA    100      150            170     300     300
Min RA    100      200            150     250     250

VIII. CONCLUSION

A hardware-oriented compression strategy has first been proposed in this paper. This strategy achieves high performance and a 27.5× compression ratio for VGG-16. It has been shown that the proposed strategy incurs a very small accuracy loss compared to the single-precision floating-point implementation, as tested on the ILSVRC2012 data set through the Caffe framework.

As a case study, the architecture of the proposed compressed VGG-16 has been designed with no off-chip memory on the Xilinx FPGA VCU118 platform.
It has been shown that the design generated by the proposed tool HACO achieves the highest performance of 83.0 FPS for the same level of accuracy in image processing. The proposed general Conv-PE structure has a high efficiency in processing large Conv kernels; therefore, the entire architecture can process a wide range of CNN models with little additional hardware. The proposed hardware design can be applied to several real-time, resource-constrained image processing applications.

The proposed compression method is applicable to other CNN models; new quantization and pruning methods can also be used in the proposed HACO framework to obtain designs with even higher performance levels.

REFERENCES

[1] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," arXiv preprint arXiv:1409.1556, 2014.
[2] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[3] R. Girshick, "Fast R-CNN," in Proc. IEEE International Conference on Computer Vision, 2015, pp. 1440-1448.
[4] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, "cuDNN: Efficient Primitives for Deep Learning," arXiv preprint arXiv:1410.0759, 2014.
[5] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127-138, 2016.
[6] J. Wang, J. Lin, and Z. Wang, "Efficient Hardware Architectures for Deep Convolutional Neural Network," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 65, no. 6, pp. 1941-1953, 2017.
[7] S. Yin, S. Tang, X. Lin, P. Ouyang, F. Tu, L. Liu, and S. Wei, "A High Throughput Acceleration for Hybrid Neural Networks with Efficient Resource Management on FPGA," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 38, no. 4, pp. 678-691, 2018.
[8] D. Han, J. Lee, J. Lee, and H. Yoo, "A Low-Power Deep Neural Network Online Learning Processor for Real-Time Object Tracking Application," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 66, no. 5, pp. 1794-1804, 2019.
[9] Y. Wang, J. Lin, and Z. Wang, "FPAP: A Folded Architecture for Energy-Quality Scalable Convolutional Neural Networks," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 66, no. 1, pp. 288-301, 2019.
[10] Y. Lin and T. S. Chang, "Data and Hardware Efficient Design for Convolutional Neural Network," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 65, no. 5, pp. 1642-1651, 2018.
[11] B. Hassibi and D. G. Stork, "Second Order Derivatives for Network Pruning: Optimal Brain Surgeon," in Advances in Neural Information Processing Systems, 1993, pp. 164-171.
[12] S. Han, H. Mao, and W. J. Dally, "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding," arXiv preprint arXiv:1510.00149, 2015.
[13] Y. Guo, A. Yao, and Y. Chen, "Dynamic Network Surgery for Efficient DNNs," in Advances in Neural Information Processing Systems, 2016, pp. 1379-1387.
[14] S. Ye, X. Feng, T. Zhang, X. Ma, S. Lin, Z. Li, K. Xu, W. Wen, S. Liu, J. Tang et al., "Progressive DNN Compression: A Key to Achieve Ultra-High Weight Pruning and Quantization Rates using ADMM," arXiv preprint arXiv:1903.09769, 2019.
[15] L. Lu and Y. Liang, "SpWA: An Efficient Sparse Winograd Convolutional Neural Networks Accelerator on FPGAs," in Proc. 55th ACM/ESDA/IEEE Design Automation Conference (DAC). IEEE, 2018, pp. 1-6.
[16] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, "DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning," ACM SIGARCH Computer Architecture News, vol. 42, no. 1, pp. 269-284, 2014.
[17] J. H. Ko, D. Kim, T. Na, and S. Mukhopadhyay, "Design and Analysis of a Neural Network Inference Engine Based on Adaptive Weight Compression," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 38, no. 1, pp. 109-121, 2018.
[18] S. Li, J. Park, and P. T. P. Tang, "Enabling Sparse Winograd Convolution by Native Pruning," arXiv preprint arXiv:1702.08597, 2017.
[19] W. Liu, F. Lombardi, and M. Shulte, "A Retrospective and Prospective View of Approximate Computing," Proceedings of the IEEE, vol. 108, no. 3, pp. 394-399, 2020.
[20] W. Liu, L. Qian, C. Wang, H. Jiang, J. Han, and F. Lombardi, "Design of Approximate Radix-4 Booth Multipliers for Error-Tolerant Computing," IEEE Transactions on Computers, vol. 66, no. 8, pp. 1435-1441, 2017.
[21] Z. Liu, K. Jia, W. Liu, Q. Wei, F. Qiao, and H. Yang, "INA: Incremental Network Approximation Algorithm for Limited Precision Deep Neural Networks," in Proc. IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 2019, pp. 1-7.
[22] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1," arXiv preprint arXiv:1602.02830, 2016.
[23] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks," in European Conference on Computer Vision (ECCV). Springer, 2016, pp. 525-542.
[24] J. Wang, Q. Lou, X. Zhang, C. Zhu, Y. Lin, and D. Chen, "Design Flow of Accelerating Hybrid Extremely Low Bit-Width Neural Network in Embedded FPGA," in Proc. 28th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2018, pp. 163-1636.
[25] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, "Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights," arXiv preprint arXiv:1702.03044, 2017.
[26] M. A. Hanif, R. Hafiz, and M. Shafique, "Error Resilience Analysis for Systematically Employing Approximate Computing in Convolutional Neural Networks," in Proc. 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 913-916.
[27] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional Architecture for Fast Feature Embedding," in Proc. 22nd ACM International Conference on Multimedia, 2014, pp. 675-678.
[28] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, "ImageNet: A Large-Scale Hierarchical Image Database," in Proc. 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255.
[29] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song et al., "Going Deeper with Embedded FPGA Platform for Convolutional Neural Network," in Proc. ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2016, pp. 26-35.
[30] Y. Guan, H. Liang, N. Xu, W. Wang, S. Shi, X. Chen, G. Sun, W. Zhang, and J. Cong, "FP-DNN: An Automated Framework for Mapping Deep Neural Networks onto FPGAs with RTL-HLS Hybrid Templates," in Proc. IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2017, pp. 152-159.
[31] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, "Optimizing the Convolution Operation to Accelerate Deep Neural Networks on FPGA," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 26, no. 7, pp. 1354-1367, 2018.

Tian Yuan received the B.S. degree in Information Engineering from Nanjing University of Aeronautics and Astronautics (NUAA), Nanjing, China, in 2018, where he is currently working toward the M.S. degree in circuits and systems. His research interests include hardware architecture design for deep learning, neural network compression and quantization, and approximate computing.

Weiqiang Liu is currently a Professor and the Vice Dean of the College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics (NUAA), Nanjing, China. He received the B.S. degree in Information Engineering from NUAA and the Ph.D. degree in Electronic Engineering from Queen's University Belfast (QUB), Belfast, United Kingdom, in 2006 and 2012, respectively. In Dec. 2013, he joined the College of Electronic and Information Engineering, NUAA. He has served as a Guest Editor of Proceedings of the IEEE, as an Associate Editor of IEEE Transactions on Circuits and Systems I: Regular Papers, IEEE Transactions on Computers, IEEE Transactions on Emerging Topics in Computing and IEEE Open Journal of the Computer Society, and as a Steering Committee Member of IEEE Transactions on Multi-Scale Computing Systems. He is a Program Co-Chair of the IEEE Symposium on Computer Arithmetic (ARITH) and a program committee member for a number of international conferences. He is a member of both the Circuits & Systems for Communications (CASCOM) Technical Committee and the VLSI Systems and Applications (VSA) Technical Committee of the IEEE Circuits and Systems Society. His research interests include emerging technologies in computing systems, computer arithmetic, hardware security and VLSI design for digital signal processing and cryptography. He has published one research book with Artech House and over 100 leading journal and conference papers. One of his papers was selected as the Feature Paper of IEEE TC in the December 2017 issue. He received the prestigious Outstanding Young Scholar Award from the National Natural Science Foundation of China (NSFC) in 2020. He is a Senior Member of the IEEE and the Chinese Institute of Electronics.

Fabrizio Lombardi (M'81-SM'02-F'09) received the B.S. degree (Hons.) in electronic engineering from the University of Essex, U.K., in 1977, the master's degree in microwaves and modern optics and the Diploma degree in microwave engineering from the Microwave Research Unit, University College London, in 1978, and the Ph.D. degree from the University of London in 1982. He currently holds the International Test Conference (ITC) Endowed Chair Professorship at Northeastern University, Boston, USA. His research interests are bio-inspired and nano manufacturing/computing, VLSI design, testing, and fault/defect tolerance of digital systems. He has extensively published in these areas and coauthored/edited seven books. He was the Editor-in-Chief of the IEEE Transactions on Computers from 2007 to 2010, the IEEE Transactions on Emerging Topics in Computing from 2013 to 2017, and the IEEE Transactions on Nanotechnology from 2014 to 2019. He is currently the Vice President for Publications of the IEEE Computer Society and a member of the executive committee of the IEEE Nanotechnology Council.

Jie Han received the B.S. degree in electronic engineering from Tsinghua University, Beijing, China, in
1999 and the Ph.D. degree from the Delft University
of Technology, The Netherlands, in 2004. He is
currently a Professor in the Department of Electrical
and Computer Engineering at the University of Al-
berta, Edmonton, AB, Canada. His research interests
include approximate computing, stochastic comput-
ing, reliability and fault tolerance, nanoelectronic
circuits and systems, novel computational models
for nanoscale and biological applications. Dr. Han
was a recipient of the Best Paper Award at the International Symposium
on Nanoscale Architectures (NanoArch) 2015 and Best Paper Nominations
at the 25th Great Lakes Symposium on VLSI (GLSVLSI) 2015, NanoArch
2016 and the 19th International Symposium on Quality Electronic Design
(ISQED) 2018. He was nominated for the 2006 Christiaan Huygens Prize of
Science by the Royal Dutch Academy of Science. His work was recognized
by Science, for developing a theory of fault-tolerant nanocircuits (2005). He is
currently an Associate Editor for the IEEE Transactions on Emerging Topics
in Computing (TETC), the IEEE Transactions on Nanotechnology, the IEEE
Circuits and Systems Magazine, the IEEE Open Journal of the Computer
Society and Microelectronics Reliability (Elsevier Journal). He served as a
General Chair for GLSVLSI 2017 and the IEEE International Symposium
on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)
2013, and a Technical Program Committee Chair for GLSVLSI 2016, DFT
2012 and the Symposium on Stochastic & Approximate Computing for Signal
Processing and Machine Learning, 2017.
