Article
High-Speed CNN Accelerator SoC Design Based on a Flexible
Diagonal Cyclic Array
Dong-Yeong Lee 1 , Hayotjon Aliev 1 , Muhammad Junaid 1 , Sang-Bo Park 1 , Hyung-Won Kim 1 ,
Keon-Myung Lee 2 and Sang-Hoon Sim 1, *
1 Department of Electronics Engineering, Chungbuk National University, Cheongju 28644, Republic of Korea;
dongyeonglee@[Link] (D.-Y.L.); hayotjon@[Link] (H.A.); junaid@[Link] (M.J.);
sangbopark@[Link] (S.-B.P.); hwkim@[Link] (H.-W.K.)
2 Department of Computer Science, Chungbuk National University, Cheongju 28644, Republic of Korea;
kmlee@[Link]
* Correspondence: shsim@[Link]
Abstract: The latest convolutional neural network (CNN) models for object detection include complex layered connections to process inference data. Each layer utilizes different types of kernel modes, so the hardware needs to support all kernel modes at an optimized speed. In this paper, we propose a high-speed and optimized CNN accelerator with flexible diagonal cyclic arrays (FDCA) that supports the acceleration of CNN networks with various kernel sizes and significantly reduces the time required for inference processing. The accelerator uses four FDCAs to simultaneously calculate 16 input channels and 8 output channels. Each FDCA features a 4 × 8 systolic array whose elements are 3 × 3 processing element (PE) arrays, designed to handle the most commonly used kernel sizes. To evaluate the proposed CNN accelerator, we mapped the widely used YOLOv5 CNN model and evaluated the performance of its implementation on the Zynq UltraScale+ MPSoC ZCU102 FPGA. The design consumes 249,357 logic cells, 2304 DSP blocks, and only 567 KB of BRAM. In our evaluation, the YOLOv5n model achieves an accuracy of 43.1% ([email protected]). A prototype accelerator has been implemented using Samsung's 14 nm CMOS technology. It achieves a peak performance of 1.075 TOPS at a 400 MHz clock frequency.
Keywords: convolutional neural network accelerator; flexible diagonal cyclic array; field-programmable gate arrays; YOLOv5n

1. Introduction
Convolutional neural networks are used in a wide range of applications for image recognition and object detection. Compared to other computer vision algorithms, CNNs offer a significant improvement in accuracy for object recognition, target detection, and video tracking. As a result, CNN models have become popular and have played a crucial role in the rapid advancement of computer vision applications. Meanwhile, CNN models are becoming more complex, using different kernel sizes and applying greater network depth and scale to achieve higher prediction accuracy. These advancements place significant demands on the storage and processing capabilities of current hardware accelerators. Thus, new acceleration architectures for object detection CNN models are necessary, applying efficient data streaming, storing, and processing methods [1].

Most CNN models comprise a series of convolutional layers, and each layer is convolved with different-sized kernels. For example, our target CNN model is the YOLOv5 object detector, which is made up of 99% convolutional computations with various kernel sizes [2]. Accelerating YOLO-like CNN networks on hardware devices can significantly improve their inference speed, enabling faster execution compared to traditional CPU or GPU implementations. Furthermore, it is essential to optimize the computations in the
• Input Zero Padding: The CNN accelerator supports adding zero-padding data around the input feature map. When the CNN accelerator reads the input feature map from DDR, it decides whether to add input zero padding at the corresponding position. This function provides several advantages, such as reducing DDR accesses and effectively utilizing on-chip memory, instead of writing padding data for the output feature map. To implement input padding, we designed the input zero padding circuit, which utilizes a 2-bit register to indicate the status required for each situation and a wire to control the global input buffer read enable signal.
• Reconfigurable Input FIFO and FIFO Output Cache Memory: The reconfigurable input FIFO consists of three SRAM FIFOs and one register connected in a predefined sequence. When we define the kernel mode for a convolutional operation using a specific stride and kernel size, the reconfigurable input FIFOs are interconnected according to the kernel mode to efficiently reuse the data. The data read from the first FIFO flow both to another FIFO and to the systolic array for processing. The FIFO output cache memory is a register bank that supplies a large amount of data to the PEs. It can transfer two different data words to a PE depending on the kernel mode. The address of this register activates the "read enable" for the reconfigurable input FIFO connected to the register and generates a "write enable" signal for the register, allowing data to be read sequentially whenever needed.
• Weight Parameter Quantization: Quantization is a method for reducing model size by converting model weights, biases, and activations from a high-precision floating-point representation to low-precision floating-point (FP) or integer (INT) representations, such as 16-bit or 8-bit. By converting the weights of a model from high-precision floating-point representation to lower precision, the model size and inference speed can improve significantly without sacrificing too much accuracy. Additionally, quantization improves model performance by reducing memory bandwidth requirements and increasing resource utilization [16].
In this work, we used 32-bit floating-point parameters for training and then quantized them to 8-bit integers to enable high-speed, lightweight CNN inference. By applying low-bit quantization, we can use small on-chip memories, multipliers, and adders.
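The sketch below illustrates this flow with per-tensor symmetric INT8 quantization; the function names and the symmetric scheme are our illustrative assumptions, not the accelerator's exact quantizer.

```python
import numpy as np

def quantize_int8(weights_fp32: np.ndarray):
    """Per-tensor symmetric INT8 quantization (illustrative sketch)."""
    # The scale maps the largest absolute weight onto the INT8 range [-127, 127].
    scale = np.abs(weights_fp32).max() / 127.0
    q = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover an FP32 approximation for accuracy checks.
    return q.astype(np.float32) * scale

w = np.random.randn(3, 3).astype(np.float32)   # a 3 x 3 kernel trained in FP32
w_q, s = quantize_int8(w)
print(np.abs(w - dequantize(w_q, s)).max())    # per-weight error is bounded by ~s/2
```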
The rest of this paper is organized as follows: Section 2 describes the background and the overall architecture of the CNN accelerator. A detailed explanation of the proposed flexible PE array architecture is given in Section 3. In Section 4, we outline the advantages of using the proposed architecture. Section 5 includes the verification process and the results obtained. Section 6 concludes the paper with future research plans.
Figure 2. Proposed CNN accelerator architecture with RISC-V processor.
In this study, we introduce a new systolic array (SA) structure called the flexible diagonal cyclic array (FDCA) that also supports the stride 2 kernel mode. The SA structure is designed in the form of an array of 3 × 3 PEs, called a kernel unit (KU), to optimize a convolution operation using a 3 × 3 kernel with stride 1. In general, each PE computes a partial sum of the convolution and sends it to other PEs to generate a single convolution result using an accumulator. If the CNN model includes a layer with N × N filters, the proposed PE array can be easily configured to support the required kernel size and stride. The FDCA consists of 4 × 8 KUs, each with 9 PEs, and it can simultaneously process four input channels and eight output channels.

3.1. Four FDCAs for Convolution Acceleration
To accelerate the convolution operation, we employ four FDCAs to calculate 16 input data channels simultaneously. The relevant processing architecture is illustrated in Figure 3. The convolutional result generated by a single filter using FDCA is stored in a
The design also includes an additional input buffer called "Instruction Memory", which is used for writing the instruction microcode from the RISC-V CPU core through the 32-bit AXI4-Lite bus protocol.
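To make the dataflow concrete, the sketch below gives a functional (not cycle-accurate) model of what one KU computes — a 3 × 3 stride-1 convolution in which each of the nine PEs owns one weight — and of how 16 input channels are reduced into 8 output channels across the four FDCAs. The feature-map sizes are illustrative values, not figures from the design.

```python
import numpy as np

def ku_conv3x3_stride1(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Functional model of one kernel unit (KU): 3x3 convolution, stride 1.

    Each of the nine PEs owns one weight w[ki, kj]; the partial products
    are accumulated along the array into a single convolution result.
    """
    H, W = x.shape
    out = np.zeros((H - 2, W - 2), dtype=x.dtype)
    for ki in range(3):            # one (ki, kj) pair per PE
        for kj in range(3):
            out += w[ki, kj] * x[ki:ki + H - 2, kj:kj + W - 2]
    return out

# Four FDCAs of 4 x 8 KUs: 16 input channels are reduced into 8 output channels.
x = np.random.randn(16, 8, 8)          # 16 input channels (8 x 8 slice, illustrative)
w = np.random.randn(8, 16, 3, 3)       # 8 output channels x 16 input channels
y = np.stack([sum(ku_conv3x3_stride1(x[c], w[o, c]) for c in range(16))
              for o in range(8)])
print(y.shape)                          # (8, 6, 6)
```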
Figure 4. Kernel unit for convolution 3 × 3 stride 1.
The PE2, PE5, and PE8 perform convolution operations for one input channel by accumulating nine data, including six data accumulated by the PEs of the previous two vertical lines. The last vertical line produces one convolution result in every clock cycle after the first result is produced. Since only one final result is produced in the KU for each clock, the conv_select signal is reused to select the results sequentially. The color of each PE and arrow indicates the path of partial sums used to compute one convolution result. The convolution results are repeatedly calculated in the order of red, blue, and green.

4.3. Convolution Using Kernel 3 × 3 with Stride 2 in Kernel Unit
The convolution operation with a 3 × 3 kernel and a stride of 2 uses the same PE array architecture, known as the KU, as that used for a stride of 1. However, in the stride 2 mode,
data sharing between PEs differs; the data accumulated in each PE will be transferred horizontally to the next PE, not diagonally. To perform a convolution operation with a stride of n (n > 1), six pixels of data must be loaded into the KU at the same time. We used a FIFO output cache memory between the reconfigurable input FIFO and the KU to prepare data for further processing in the PEs. Using this cache memory, we can read two pixels of data from the input FIFO at the same time.

As illustrated in Figure 5, the first three pixels of data will be sent to the PEs in the first column, as in the stride 1 mode. The next three pixels of data from the cache register will be simultaneously sent to the second-column PEs. After completing the clock calculations in the PEs, we have to pass the pixel values from the first-column PEs to the last-column PEs. In this order, we maintain the convolution operation with stride 2, efficiently reusing the data from the first-column PEs. Pixel data values from the first-column PEs of the KU will be transferred to the diagonally downward PEs in the last column. This means that data reuse only occurs in the first and third columns of the KU, specifically in the stride 2 mode. In addition, weights are reused in every clock cycle by vertically rotating them from top to bottom.

Figure 6b illustrates an example of a convolution operation with stride 2, representing the motion of the filter over the input feature data.
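The column-reuse property can be checked against a reference model: for stride 2, the 3 × 3 windows of horizontally adjacent outputs overlap in exactly one column — the last column of one window becomes the first column of the next — which is why only first- and third-column data are reused. A sketch under these assumptions:

```python
import numpy as np

def conv3x3(x: np.ndarray, w: np.ndarray, stride: int) -> np.ndarray:
    """Reference 3x3 convolution used to check the stride-2 dataflow."""
    H, W = x.shape
    oh, ow = (H - 3) // stride + 1, (W - 3) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i*stride:i*stride+3, j*stride:j*stride+3] * w)
    return out

x = np.arange(64, dtype=float).reshape(8, 8)
# For stride 2, window j covers input columns {2j, 2j+1, 2j+2} and window j+1
# covers {2j+2, 2j+3, 2j+4}; they share exactly one column.
cols_j, cols_j1 = {0, 1, 2}, {2, 3, 4}
print(cols_j & cols_j1)                             # {2}: last column reused as next first
print(conv3x3(x, np.ones((3, 3)), stride=2).shape)  # (3, 3)
```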
Figure 5. Kernel unit for convolution 3 × 3 stride 2.
Figure 6. Example of filter shifting over the input feature map: (a) 3 × 3 stride 1; (b) 3 × 3 stride 2.

4.4. Convolution Using Kernel 1 × 1 with Stride 1 in Kernel Unit
Using the proposed 3 × 3 PE array of the KU, we can run the convolution operation using a 1 × 1 kernel size with a stride of 1. For the 1 × 1 convolution operation, we have designed an accelerator circuit that exclusively utilizes the first horizontal line of PEs in the KU. In this mode, the upper left PE receives only one datum in every clock cycle, while the upper right PE generates the output of the operation. After the initial output result is calculated in the proposed circuit, each clock cycle produces a result of the convolution output, similar to convolution with a 3 × 3 kernel size.
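Functionally, this mode reduces to a per-pixel weighted sum across input channels, which the first row of PEs can stream at one datum per clock; a minimal reference model (shapes are illustrative):

```python
import numpy as np

def conv1x1(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """1x1 stride-1 convolution: a per-pixel dot product across input channels."""
    # x: (C_in, H, W); w: (C_out, C_in) -> output: (C_out, H, W)
    return np.einsum('oc,chw->ohw', w, x)

x = np.random.randn(16, 8, 8)
w = np.random.randn(8, 16)
print(conv1x1(x, w).shape)  # (8, 8, 8)
```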
4.6. Reconfigurable Input FIFO
Figure 8 shows the block diagram of the reconfigurable input FIFO. For each kernel mode, we utilize a portion of the reconfigurable input FIFO. Each FIFO stores one column of data for the input slice. Since each address of the FIFO contains four feature map data, the depth of the FIFO is equal to the slice size/4.
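A small helper makes the depth rule explicit; the 64-pixel slice in the usage line is an illustrative value, not a figure from the design.

```python
# Depth of each reconfigurable input FIFO: every FIFO address packs four
# feature-map words, so a column slice of `slice_size` pixels needs
# slice_size / 4 entries.
def fifo_depth(slice_size: int, words_per_address: int = 4) -> int:
    assert slice_size % words_per_address == 0
    return slice_size // words_per_address

print(fifo_depth(64))  # -> 16 entries for a 64-pixel column slice
```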
The following configurations of the reconfigurable input FIFO are used for different operation modes:
• The convolution operation with a 3 × 3 kernel and a stride of 1 requires the use of two SRAM-FIFOs and a register. While the KU reads data from the SRAM-FIFO or register, the data not only go to the KU but also to another SRAM-FIFO containing the previous data from the left column in the feature map.
• The convolution operation with a 3 × 3 kernel and a stride of 2 uses three SRAM-FIFOs. When the PE array loads data from the FIFO address that stores the data for the last column, it also sends the data to another FIFO for reuse in the next first column, bypassing the middle column. This indicates that the stride 2 mode requires the KU to reuse only the data from the last column as the first column next time.
• The convolution operation with a 1 × 1 kernel and a stride of 1 only utilizes one SRAM-FIFO. In this mode, the output data only come from PE2.

4.7. FIFO Output Cache Memory
We used FIFO output cache memories to share feature map data between the input FIFO and the KU. Each address in the input FIFO contains four feature map data at a moment. Therefore, the FIFO output cache memory block receives four data from the FIFO at once, queues them in order, and transfers them to the KU in turn. Each cache memory consists of eight 16-bit registers and enables the loading of one or two feature map data to the KU. We divided the eight 16-bit registers into two parts, named Area0 and Area1, each consisting of four 16-bit registers.

In our design, we used a cache memory with depth = 8, which is twice the size of the data read from the FIFO. By using a cache memory with eight registers, we are able to read two data from the cache without waiting for the next four data from the FIFO. In the case of using a cache memory with depth = 4, it would not be possible to read and send two data to the KU at the same time. For example, if we use a cache memory with depth = 4, after reading three data from the cache memory, we would not be able to read another two data from it. We would face a memory limitation problem in our circuit. If we read only one datum, which is left in the cache memory, then we must wait for the next four data readings from the FIFO. This problem causes circuit insufficiency and produces incorrect output.

Figure 9 shows that the "finish" signal becomes active when reading data from a specific location in Area0 and Area1. For example, if the address is 2, then finish[0] goes high when reading two data in Area0. The "finish" signal acts as the read enable signal for the input FIFO and generates the write enable signal for the FIFO output cache memory via a register. When data are initially stored in the FIFO output cache memory, the finish signal cannot be activated until the data are read from the FIFO output cache memory. Therefore, the cache memory must read data from the FIFO using the init_rd_en signal generated by the controller.
Figure 9. FIFO output cache memory.
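The depth-8 argument can be checked with a small behavioural model: the cache refills four words at a time from the input FIFO and must keep serving one or two words per cycle; with only four registers, a read pattern that strands a single word would stall, as described above. The class below is a sketch of this buffering behaviour, not the RTL; the refill policy stands in for the finish/init_rd_en handshake.

```python
from collections import deque

class FifoOutputCache:
    """Behavioural sketch of the FIFO output cache (two 4-register areas)."""
    def __init__(self, fifo, depth=8):
        self.fifo = fifo              # source FIFO yielding 4 words per read
        self.regs = deque(maxlen=depth)
    def refill(self):
        # Pull one FIFO address worth of data: four feature-map words.
        self.regs.extend(self.fifo.popleft())
    def read(self, n):
        # Serve one or two words per cycle to the KU, refilling an area
        # whenever too few words remain (stand-in for the "finish" signal).
        while len(self.regs) < n:
            self.refill()
        return [self.regs.popleft() for _ in range(n)]

fifo = deque([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]])
cache = FifoOutputCache(fifo)
print(cache.read(2), cache.read(2), cache.read(2))  # [0, 1] [2, 3] [4, 5]
```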
4.9. Input Padding
To add padding to the input feature map, most CNN accelerators use a software-based approach before loading the data to the DDR DRAM. In this work, we designed the circuit to add zero padding around the input feature map, a concept known as input padding in the circuit. Using input padding is more effective than using output padding. Figure 11 presents the ratio of input padding storage pixels to output padding based on the size of the input feature map. The figure shows that as the feature map size decreases, the proportion of storage pixels also decreases by up to 82.6%.
The input padding circuit includes the register and control signal. The first component is the 2-bit padding state register, known as the "FIFO write selection", which varies depending on the current/total slice iteration and kernel mode (Figure 12). Table 1 and Figure 13, along with the description below, explain the adding-zero-padding method for each case.
Figure 12. Padding circuit structure.
Figure 13. The use cases of a FIFO write select signal (number) depending on the padding position
(red frame).
1. ZEROZERO: When all data entering the reconfigurable FIFO are zero, only zeros are needed for padding.
2. READZERO: All data from the global input buffer are loaded into the FIFO when the
slice iteration does not require any padding.
3. ZEROREAD: All data in the global input buffer are loaded, and a single zero is
inserted in front of the data as padding. It is used to load a part of the zero padding at
the top slice of the input feature map.
4. SHIFTREAD: When importing new data, the last pixel of the previously imported data is concatenated with the newly read data. The function exists to use the most recently imported data in the ZEROREAD scenario. A behavioural sketch of these four actions is given below.
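The four states map naturally onto small datapath actions. The sketch below is a plausible reading of those actions, not the paper's exact table: the word width, the truncation of the overflow word, and the state encoding are all illustrative assumptions.

```python
# Behavioural sketch of the four "FIFO write selection" actions described
# above. `buf_read` models a global-input-buffer read of four words;
# `last_pixel` is the most recently imported pixel. Dropping the fourth word
# after a prepend is a simplification for this sketch.
def fifo_write(state: str, buf_read, last_pixel):
    if state == "ZEROZERO":                 # only zeros enter the FIFO
        return [0, 0, 0, 0]
    if state == "READZERO":                 # plain load, no padding
        return buf_read()
    if state == "ZEROREAD":                 # one zero prepended (top slice)
        return [0] + buf_read()[:3]
    if state == "SHIFTREAD":                # carry last pixel into new word
        return [last_pixel] + buf_read()[:3]
    raise ValueError(state)

data = iter(range(100))
read4 = lambda: [next(data) for _ in range(4)]
print(fifo_write("ZEROREAD", read4, None))   # [0, 0, 1, 2]
print(fifo_write("SHIFTREAD", read4, 2))     # [2, 4, 5, 6]
```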
The 1-bit wire, InB_flag, generates a read enable signal for the global input buffer by entering a two-input AND gate with a read enable signal asserted by the controller in the KU. If the "FIFO write selection" is ZEROZERO, then the wire has a value of 0, and the KU receives only a "zero" value for the padding. Therefore, the KU receives "zero" data without accessing the global input buffer.
The kernel mode also affects padding. The 3 × 3 convolution operation with a stride of 1 requires adding zeros around all sides of the input feature map. For a stride of 2, additional zeros are only required for the upper and left sides of the input. Padding is not applied in a 1 × 1 convolution operation.
4.10. Bias–Activation–Scaling Pipeline Architecture
In this study, we targeted the YOLOv5n model and quantized the model to an integer representation. Thus, we designed an extra circuit to calculate bias, activation, and scaling parameters for converting the final convolution result into the feature map for the next layer. Figure 14 illustrates the bias–activation–scaling (BAS) pipeline architecture used in the proposed SoC hardware. All parameters (bias, activation, and scaling) are represented as 16-bit integers, allowing for a fast and area-efficient architecture. The pipeline process consists of six stages and requires seven cycles to process one piece of data.
These parameters are generated during the training process, and they remain constant within each layer during the inference [14,28,29,31].

The convolution operation with bias can be represented by the following equation:

Y = Σ (w1 · X_FMAP) + β_bias, (2)

where w1 represents the weight, X_FMAP is the input feature map, and β_bias is the constant called "bias". We simplify the addition in Equation (2) as Equation (3). This simplification reduces hardware costs without sacrificing accuracy. In addition, we used techniques like rounding and truncation in the BAS circuit.
The 36-bit dividers are used in the Leaky ReLU activation circuit, and they require
two clock cycles to prevent setup-time violations at a high frequency of 400 MHz. The
scaling process consists of four stages and operates for four cycles. The process involves
multiplication, division, rounding, addition, truncation, and subtraction, in that order. The
scaling applies the parameters generated by quantization.
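A functional sketch of the BAS stages on integer data is given below; the 1/8 leak factor and the multiply–shift scaling are illustrative stand-ins, since the text specifies dividers and quantization-derived parameters but not their exact values.

```python
import numpy as np

def bas_pipeline(acc: np.ndarray, bias: int, mult: int, shift: int) -> np.ndarray:
    """Functional sketch of the bias-activation-scaling (BAS) stages.

    `acc` is the convolution accumulator; bias/mult/shift stand in for the
    16-bit per-layer parameters produced by quantization. The leak factor
    (x/8) and shift-based scaling are illustrative assumptions.
    """
    x = acc.astype(np.int64) + bias                 # 1) add bias
    x = np.where(x >= 0, x, x // 8)                 # 2) Leaky ReLU (divider stage)
    x = (x * mult + (1 << (shift - 1))) >> shift    # 3) scale, with rounding
    return np.clip(x, -128, 127).astype(np.int8)    # 4) truncate to INT8 range

acc = np.array([5000, -1200, 300], dtype=np.int32)
print(bas_pipeline(acc, bias=100, mult=77, shift=10))  # -> [127 -10  30]
```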
architecture provided more data in the same time frame for the speed optimization of the convolution operation with 3 × 3 stride 2.

Although the proposed architecture supports two additional kernel modes, the total chip area increases by only 6% compared to a separate, dedicated 3 × 3 stride 1 design. If we expand the proposed CNN accelerator architecture to support 6 × 6 convolution operations with stride 1 and stride 2 modes in the future, the expected chip area will increase by 6% compared to the current area.
these three reconfigurable input FIFOs, the KU can reuse feature map data, reducing the number of data loads to as little as one-third.
• The KU reuses data by sharing them among connected PEs. Typically, a GPU reads each pixel of input data from DDR memory three times. In our design, the KU reuses the same pixel data three times by passing them to other PEs. This mechanism reduces the number of memory accesses by a factor of three.
(3943 µm × 2780 µm). The chip operates at a frequency of 400 MHz, with a timing constraint set at 2.5 ns. The total power consumption of the chip is 18.52 mW. The implementation uses on-chip SRAM with a size of 275.75 KB. Figure 15 shows the overall chip layout of the proposed CNN accelerator SoC.

Figure 15. Full chip layout implemented in 14 nm CMOS process.
6. Conclusions
In this paper, we proposed a high-speed CNN accelerator architecture based on a flexible diagonal cyclic array (FDCA). The proposed four-FDCA architecture comprises 1152 PEs that can process the data for sixteen input channels and eight output channels simultaneously. The proposed architecture enables the execution of convolution operations with different kernel modes and strides to accelerate the latest CNN models. In the proposed design, we introduced new optimization techniques that improved chip area efficiency by 6% and reduced total chip area utilization by 2.14 times compared to individual block designs for each kernel mode. We also minimized the number of DRAM accesses by using data reuse methods.

The CNN accelerator was synthesized and verified on the Xilinx ZCU102 FPGA and implemented in SoC silicon using 14 nm CMOS process technology. The results demonstrate that the proposed CNN accelerator can perform convolution operations 3.8 times faster, using the proposed new PE array structure, compared to previous CNN accelerators.
Author Contributions: Conceptualization, D.-Y.L., H.A. and H.-W.K.; designing, D.-Y.L. and H.A.; verification, M.J.; validation, S.-B.P. and M.J.; formal analysis, H.A. and D.-Y.L.; writing—original draft preparation, D.-Y.L.; writing—review and editing, H.A. and S.-B.P.; funding, S.-H.S. and K.-M.L. All authors have read and agreed to the published version of the manuscript.
Funding: This work was supported by the National Research Foundation of Korea (NRF) grant for RLRC funded by the Korea government (MSIT) (No. 2022R1A5A8026986, RLRC, 25%), and was also supported by the Institute of Information and communications Technology Planning and Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-01304, Development of Self-Learnable Mobile Recursive Neural Network Processor Technology, 25%). It was partly supported by Innovative Human Resource Development for Local Intellectualization program through the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (IITP-2024-2020-0-01462, 25%). The National R&D Program supported this research through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (No. 2020M3H2A1076786, System Semiconductor specialist nurturing, 25%).
Data Availability Statement: Data are contained within the article.
Acknowledgments: We thank Thaising Thaing (thaisingtaing@[Link]) for his invaluable
contributions to this work.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Akkad, G.; Mansour, A.; Inaty, E. Embedded Deep Learning Accelerators: A Survey on Recent Advances. IEEE Trans. Artif. Intell.
2023, early access.
2. Jocher, G.; Stoken, A.; Chaurasia, A.; Borovec, J.; Xie, T.; Kwon, Y.; Michael, K.; Changyu, L.; Fang, J. Yolov5. NanoCode012.
v6.0—Models. 2021. Available online: [Link] (accessed on 12 October 2021).
3. Huang, W.; Wu, H.; Chen, Q.; Luo, C.; Zeng, S.; Li, T.; Huang, Y. FPGA-Based High-Throughput CNN Hardware Accelerator
with High Computing Resource Utilization Ratio. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 4069–4083. [CrossRef] [PubMed]
4. Yang, J.; Fu, W.; Cheng, X.; Ye, X.; Dai, P.; Zhao, W. S2 Engine: A Novel Systolic Architecture for Sparse Convolutional Neural
Networks. IEEE Trans. Comput. 2022, 71, 1440–1452.
5. Wei, X.; Yu, C.H.; Zhang, P.; Chen, Y.; Wang, Y.; Hu, H.; Liang, Y.; Cong, J. Automated systolic array architecture synthesis for
high throughput CNN inference on FPGAs. In Proceedings of the 2017 54th ACM/EDAC/IEEE Design Automation Conference
(DAC), Austin, TX, USA, 18–22 June 2017; pp. 1–6.
6. Andri, R.; Cavigelli, L.; Rossi, D.; Benini, L. Hyperdrive: A Multi-Chip Systolically Scalable Binary-Weight CNN Inference Engine.
IEEE J. Emerg. Sel. Top. Circuits Syst. 2019, 9, 309–322. [CrossRef]
7. Sedukhin, S.; Tomioka, Y.; Yamamoto, K. In search of the performance-and energy-efficient CNN accelerators. IEICE Trans.
Electron. 2022, 105, 209–221. [CrossRef]
8. Liu, C.-N.; Lai, Y.-A.; Kuo, C.-H.; Zhan, S.-A. Design of 2D Systolic Array Accelerator for Quantized Convolutional Neural
Networks. In Proceedings of the 2021 International Symposium on VLSI Design, Automation and Test (VLSI-DAT), Hsinchu,
Taiwan, 19–22 April 2021; pp. 1–4.
9. Jouppi, N.P.; Young, C.; Patil, N.; Patterson, D.; Agrawal, G.; Bajwa, R.; Bates, S.; Bhatia, S.; Boden, N.; Borchers, A.; et al.
In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 2017 ACM/IEEE 44th Annual International
Symposium on Computer Architecture (ISCA), Toronto, ON, Canada, 24–28 June 2017; pp. 1–12.
10. Wang, Y.; Wang, Y.; Shi, C.; Cheng, L.; Li, H.; Li, X. An Edge 3D CNN Accelerator for Low-Power Activity Recognition. IEEE
Trans. Comput. Aided Des. Integr. Circuits Syst. 2021, 40, 918–930. [CrossRef]
11. Parmar, Y.; Sridharan, K. A Resource-Efficient Multiplierless Systolic Array Architecture for Convolutions in Deep Networks.
IEEE Trans. Circuits Syst. II Express Briefs 2020, 67, 370–374. [CrossRef]
12. Chen, Y.-H.; Krishna, T.; Emer, J.S.; Sze, V. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional
Neural Networks. IEEE J. Solid-State Circuits 2017, 52, 127–138. [CrossRef]
13. Lu, Y.C.; Chen, C.W.; Pu, C.C.; Lin, Y.T.; Jhan, J.K.; Liang, S.P. Live Demo: An 176.3 GOPs Object Detection CNN Accelerator
Emulated in a 28 nm CMOS Technology. In Proceedings of the 2021 IEEE 3rd International Conference on Artificial Intelligence
Circuits and Systems (AICAS), Washington, DC, USA, 6–9 June 2021; pp. 1–4.
14. Nguyen, D.T.; Nguyen, T.N.; Kim, H.; Lee, H.J. A High-Throughput and Power-Efficient FPGA Implementation of YOLO CNN
for Object Detection. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2019, 27, 1861–1873. [CrossRef]
15. Yepez, J.; Ko, S.-B. Stride 2 1-D, 2-D, and 3-D Winograd for Convolutional Neural Networks. IEEE Trans. Very Large Scale Integr.
(VLSI) Syst. 2020, 28, 853–863. [CrossRef]
16. Li, Y.; Lu, S.; Luo, J.; Pang, W.; Liu, H. High-performance Convolutional Neural Network Accelerator Based on Systolic Arrays
and Quantization. In Proceedings of the 2019 IEEE 4th International Conference on Signal and Image Processing (ICSIP), Wuxi,
China, 19–21 July 2019; pp. 335–339.
17. Yang, G.; Lei, J.; Xie, W.; Fang, Z.; Li, Y.; Wang, J.; Zhang, X. Algorithm/Hardware Codesign for Real-Time On-Satellite CNN-Based
Ship Detection in SAR Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5226018. [CrossRef]
18. Ansari, A.; Ogunfunmi, T. Hardware Acceleration of a Generalized Fast2-D Convolution Method for Deep Neural Networks.
IEEE Access 2022, 10, 16843–16858. [CrossRef]
19. Yan, T.; Zhang, N.; Li, J.; Liu, W.; Chen, H. Automatic Deployment of Convolutional Neural Networks on FPGA for Spaceborne
Remote Sensing Application. Remote Sens. 2022, 14, 3130. [CrossRef]
20. Ardakani, A.; Condo, C.; Ahmadi, M.; Gross, W.J. An Architecture to Accelerate Convolution in Deep Neural Networks. IEEE
Trans. Circuits Syst. I Regul. Pap. 2018, 65, 1349–1362. [CrossRef]
21. Wang, J.; Yuan, Z.; Liu, R.; Feng, X.; Du, L.; Yang, H.; Liu, Y. GAAS: An Efficient Group Associated Architecture and Scheduler
Module for Sparse CNN Accelerators. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2020, 39, 5170–5182. [CrossRef]
22. Wang, J.; Park, S.; Park, C.S. Spatial Data Dependence Graph Based Pre-RTL Simulator for Convolutional Neural Network
Dataflows. IEEE Access 2022, 10, 11382–11403. [CrossRef]
23. Li, J.; Un, K.-F.; Yu, W.-H.; Mak, P.-I.; Martins, R.P. An FPGA-Based Energy-Efficient Reconfigurable Convolutional Neural
Network Accelerator for Object Recognition Applications. IEEE Trans. Circuits Syst. II Express Briefs 2021, 68, 3143–3147. [CrossRef]
24. Qiu, J.; Wang, J.; Yao, S.; Guo, K.; Li, B.; Zhou, E. Going deeper with embedded fpga platform for convolutional neural network.
In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA,
21–23 February 2016.
25. Huan, Y.; Xu, J.; Zheng, L.; Tenhunen, H.; Zou, Z. A 3D Tiled Low Power Accelerator for Convolutional Neural Network. In
Proceedings of the 2018 IEEE International Symposium on Circuits and Systems (ISCAS), Florence, Italy, 27–30 May 2018; pp. 1–5.
26. Tu, F.; Yin, S.; Ouyang, P.; Tang, S.; Liu, L.; Wei, S. Deep Convolutional Neural Network Architecture with Reconfigurable
Computation Patterns. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2017, 25, 2220–2233. [CrossRef]
27. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
28. Son, H.; Na, Y.; Kim, T.; Al-Hamid, A.A.; Kim, H. CNN Accelerator with Minimal On-Chip Memory Based on Hierarchical Array.
In Proceedings of the 2021 18th International SoC Design Conference (ISOCC), Jeju Island, Republic of Korea, 6–9 October 2021;
pp. 411–412.
29. Zhang, S.; Cao, J.; Zhang, Q.; Zhang, Q.; Zhang, Y.; Wang, Y. An FPGA-Based Reconfigurable CNN Accelerator for YOLO. In
Proceedings of the 2020 IEEE 3rd International Conference on Electronics Technology (ICET), Chengdu, China, 8–12 May 2020;
pp. 74–78.
30. Adiono, T.; Putra, A.; Sutisna, N.; Syafalni, I.; Mulyawan, R. Low Latency YOLOv3-Tiny Accelerator for Low-Cost FPGA Using
General Matrix Multiplication Principle. IEEE Access 2021, 9, 141890–141913. [CrossRef]
31. Li, P.; Che, C. Mapping YOLOv4-Tiny on FPGA-Based DNN Accelerator by Using Dynamic Fixed-Point Method. In Proceedings
of the 2021 12th International Symposium on Parallel Architectures, Algorithms and Programming (PAAP), Xi’an, China, 10–12
December 2021; pp. 125–129.
32. Babu, P.; Parthasarathy, E. Hardware acceleration for object detection using YOLOv4 algorithm on Xilinx Zynq platform.
J. Real-Time Image Process. 2022, 19, 931–940. [CrossRef]
33. Ma, Y.; Cao, Y.; Vrudhula, S.; Seo, J.-S. Optimizing the Convolution Operation to Accelerate Deep Neural Networks on FPGA.
IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2018, 26, 1354–1367. [CrossRef]
34. Zhang, C.; Sun, G.; Fang, Z.; Zhou, P.; Pan, P.; Cong, J. Caffeine: Toward Uniformed Representation and Acceleration for Deep
Convolutional Neural Networks. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2019, 38, 2072–2085. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.