
electronics

Article
High-Speed CNN Accelerator SoC Design Based on a Flexible
Diagonal Cyclic Array
Dong-Yeong Lee 1, Hayotjon Aliev 1, Muhammad Junaid 1, Sang-Bo Park 1, Hyung-Won Kim 1,
Keon-Myung Lee 2 and Sang-Hoon Sim 1,*

1 Department of Electronics Engineering, Chungbuk National University, Cheongju 28644, Republic of Korea;
dongyeonglee@[Link] (D.-Y.L.); hayotjon@[Link] (H.A.); junaid@[Link] (M.J.);
sangbopark@[Link] (S.-B.P.); hwkim@[Link] (H.-W.K.)
2 Department of Computer Science, Chungbuk National University, Cheongju 28644, Republic of Korea;
kmlee@[Link]
* Correspondence: shsim@[Link]

Abstract: The latest convolutional neural network (CNN) models for object detection include complex
layered connections to process inference data. Each layer utilizes different types of kernel modes, so
the hardware needs to support all kernel modes at an optimized speed. In this paper, we propose a
high-speed and optimized CNN accelerator with flexible diagonal cyclic arrays (FDCA) that supports
the acceleration of CNN networks with various kernel sizes and significantly reduces the time
required for inference processing. The accelerator uses four FDCAs to simultaneously calculate
16 input channels and 8 output channels. Each FDCA features a 4 × 8 systolic array that contains a
3 × 3 processing element (PE) array and is designed to handle the most commonly used kernel sizes.
To evaluate the proposed CNN accelerator, we mapped the widely used YOLOv5 CNN model and
evaluated the performance of its implementation on the Zynq UltraScale+ MPSoC ZCU102 FPGA.
The design consumes 249,357 logic cells, 2304 DSP blocks, and only 567 KB BRAM. In our evaluation,
the YOLOv5n model achieves an accuracy of 43.1% (mAP@0.5). A prototype accelerator has been
implemented using Samsung’s 14 nm CMOS technology. It achieves a peak performance of 1.075 TOPS
at a 400 MHz clock frequency.
Keywords: convolution neural network accelerator; flexible diagonal cyclic array; field-programmable gate arrays; YOLOv5n

Citation: Lee, D.-Y.; Aliev, H.; Junaid, M.; Park, S.-B.; Kim, H.-W.; Lee, K.-M.; Sim, S.-H. High-Speed CNN Accelerator SoC Design Based on a Flexible Diagonal Cyclic Array. Electronics 2024, 13, 1564. https://doi.org/10.3390/electronics13081564
Academic Editors: Antonio Vincenzo Radogna and Stefano D'Amico
Received: 20 March 2024; Revised: 17 April 2024; Accepted: 17 April 2024; Published: 19 April 2024
Copyright: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.

1. Introduction
Convolutional neural networks are widely used in a wide range of applications for image recognition and object detection. Compared to other computer vision algorithms, CNNs offer a significant improvement in accuracy for object recognition, target detection, and video tracking. As a result, CNN models have become popular and have played a crucial role in the rapid advancement of computer vision applications. Meanwhile, CNN models are becoming more complex, using different kernel sizes and applying more depth and scale to CNN networks to achieve higher prediction accuracy. These advancements and changes in CNN models have a significant impact on the storage and processing capabilities of current hardware accelerators. Thus, new acceleration architectures for object detection CNN models are necessary, applying efficient data streaming, storing, and processing methods [1].
Most CNN models comprise a series of multiple convolutional layers, and each layer is convolved with different sized kernels. For example, our target CNN model is the YOLOv5 object detector, which is made up of 99% convolutional computations with various kernel sizes [2]. Accelerating YOLO-like CNN networks on hardware devices can significantly improve their inference speed, enabling faster execution compared to traditional CPU or GPU implementations. Furthermore, it is essential to optimize the computations in the convolution operation to support various kernel sizes when designing a new high-speed CNN accelerator. These kernel-based optimizations of convolution operation blocks help to map any CNN model to the hardware accelerator.


Most of the CNN accelerator architectures use parallel processing element (PE) units, which consist of a multiplier and accumulator (MAC), as shown in Figure 1. A systolic array is a special PE array structure built for the fast and efficient operation of regular array algorithms and to reduce their computation time. The systolic arrays are also effective at reducing memory access by reusing data that have already been passed through other PEs. Therefore, many research works have proposed systolic array architectures to maximize the speed of iterative convolution computations [3–14].

Figure 1. Processing element architecture.
In many CNN models, the convolutional layer uses a stride of 2 to convolve the input data with a 3 × 3 kernel, which is faster than using a stride of 1. A convolutional layer with a stride of 2 is advantageous in that it requires less computation. The filter moves two pixels at a time over the input feature map, which results in faster down-sampling of the input. However, most research on CNN accelerators shows that the time consumption for stride 2 is the same as for the stride 1 mode. The reason for this is that the feature map data are supplied to the PEs in the same way as in the 3 × 3 stride 1 kernel mode, and it produces relatively low PE utilization. This indicates that the stride 2 mode is not as efficiently implemented as the stride 1 mode on CNN accelerators [15].
In this paper, we propose a hardware architecture for a CNN inference accelerator, using a novel systolic array architecture called the flexible diagonal cyclic array (FDCA) to accelerate the convolution operation and support various kernel sizes, including the 3 × 3 kernel with stride 2. This paper introduces the following new methods to minimize repeated memory accesses, optimize hardware resources for various kernel modes, and enable the mapping of diverse CNN models onto a wide range of FPGA devices.
• Flexible Diagonal Cyclic Array (FDCA) for kernel modes: The FDCA is a novel systolic array structure designed to maximize data reuse and speed up computation by efficiently performing convolutions. In the FDCA, PEs are arranged in a 3 × 3 systolic array. Multiplication and accumulation operations are performed to calculate a partial sum, which is then forwarded to the diagonal PEs to accumulate the output result. In this study, we optimized a DCA systolic array for the convolution operation using a 3 × 3 kernel with strides of 1 and 2, which are commonly used in CNN accelerators.
• Input Zero Padding: The CNN accelerator supports adding zero-padding data around the input feature map. When the CNN accelerator reads the input feature map from DDR, it decides whether to add input zero padding at the corresponding position. This function provides several advantages, such as reducing DDR access and effectively utilizing on-chip memory, instead of writing padding data for the output feature map. To implement input padding, we designed the input zero padding circuit, which utilizes a 2-bit register to indicate the status required for each situation and a wire to control the global input buffer read enable signal.
• Reconfigurable Input FIFO and FIFO Output Cache Memory: The reconfigurable input FIFO consists of three SRAM FIFOs and one register connected in a predefined sequence. When we define the kernel mode for a convolutional operation using a specific stride and kernel size, the reconfigurable input FIFOs are interconnected according to the kernel mode to efficiently reuse the data. The data read from the first FIFO flow both to another FIFO and to the systolic array for processing. The FIFO output cache memory is a register that supplies a large amount of data to the PEs; it can transfer two different data values to the PEs depending on the kernel mode. The address of this register is used to activate the "read enable" signal of the reconfigurable input FIFO connected to the register and to generate a "write enable" signal for the register, allowing data to be read sequentially whenever needed.
• Weight Parameter Quantization: Quantization is a method for reducing model size by
converting model weights, biases, and activations from high-precision floating-point
representation to low-precision floating-point (FP) or integer (INT) representations,
such as 16-bit or 8-bit. By converting the weights of a model from high-precision
floating-point representation to lower precision, the model size and inference speed
can significantly improve without sacrificing too much accuracy. Additionally, quanti-
zation improves the model performance by reducing memory bandwidth requirements
and increasing resource utilization [16].
In this work, we used 32-bit floating point parameters for training and then quantized
them to 8-bit integers to enable high-speed lightweight CNN inference. By applying a
low-bit quantization, we can utilize small-size on-chip memory, multipliers, and adders.
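A minimal software sketch of this quantization step is shown below (illustrative only; per-tensor symmetric scaling and the NumPy-based flow are assumptions, not the exact toolchain used for the accelerator):

```python
import numpy as np

def quantize_weights_int8(weights_fp32):
    """Symmetric per-tensor quantization of 32-bit floating-point weights to 8-bit integers.

    Returns the INT8 weights together with the FP32 scale that maps the integer
    values back to real numbers (later folded into the scaling stage).
    """
    max_abs = float(np.max(np.abs(weights_fp32)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights_fp32 / scale), -128, 127).astype(np.int8)
    return q, scale

# Example: quantize one 3 x 3 kernel trained in FP32.
w = np.random.randn(3, 3).astype(np.float32)
w_int8, w_scale = quantize_weights_int8(w)
reconstruction_error = np.max(np.abs(w - w_int8.astype(np.float32) * w_scale))
```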
The rest of this paper is organized as follows: In Section 2, the background of the
overall architecture of the CNN Accelerator is described. A detailed explanation of the
proposed flexible PE array architecture is given in Section 3. In Section 4, we outline the
advantages of using the proposed architecture. Section 5 includes the verification process
and the results obtained. Section 6 concludes the paper with future research plans.

2. Related Work and Motivation


Many researchers have studied lightweight object detectors and proposed hardware
accelerators to accomplish real-time object detection on edge devices. The acceleration of
CNN model inference for object detection is discussed in more detail, with a focus on FPGA-
based implementations. Thus, various hardware architecture approaches and optimization
methods are explored to examine their impact on throughput and accuracy [17–26].
Since its initial release [27] in 2016, several versions of YOLO have been developed
and accelerated for improving the processing efficiency. The problem is that designing a
new dedicated accelerator for a new version of YOLO is a time-consuming process.
Most of the studies analyze the acceleration of the YOLOv2 algorithm to improve
development speed, power efficiency, and computing performance. Many analyses provide
developers with new insights for choosing hardware and architectures to optimize the
YOLOv2 algorithm [14,28,29].
The lightweight YOLO versions, including Tinier-YOLO, Tiny-YOLOv3, and Tiny-
YOLOv4, have fewer parameters and require fewer computations compared to the full
versions. However, they also exhibit some reduction in accuracy. They are deployed on
FPGA-based embedded computing platforms and have achieved better real-time detection
results, utilizing architectures with high performance and low energy consumption [30–32].
The accelerators have been configured to run Tiny-YOLOv3 [30] and Tiny-YOLOv4 [32] in real time, achieving performance of over 8.3 and 30 frames per second.
Other studies [13,25,26,33,34] have investigated the implementation of entire CNN-based object detection networks on FPGA devices by building customized computation units and data flows into their accelerator designs. As a result, the impact of data communication bottlenecks is minimized and the overall performance is enhanced. However, the accelerator designs proposed in these works are tailored to specific versions of the YOLO network and lack the versatility to target more recent object detection models.
This paper introduces a novel architecture that enables the deployment of the next generation of models from the YOLO family on a variety of FPGA devices. Our proposed toolflow is designed to efficiently process YOLOv5 and the latest YOLO models, offering high performance and reconfigurability to accommodate new changes and architecture updates. To achieve efficient implementation of YOLOv5-based algorithms, we conducted research on various customized hardware accelerators and proposed new methods to optimize them.

3. Overall Hardware Architecture
Figure 2 illustrates the overall architecture of the proposed CNN accelerator. The block architecture consists of the following hardware components: a PE arrays block with four FDCAs, 5 × 5 max-pooling, element-wise adder, upsampling, global in/output buffers, an AXI4 data bus, and CNN controller blocks.

Figure 2. Proposed CNN accelerator architecture with RISC-V processor.

In this study, we introduce a new systolic array (SA) structure called the flexible diagonal cyclic array (FDCA) that also supports the stride 2 kernel mode. The SA structure is designed in the form of an array of 3 × 3 PEs, called a kernel unit (KU), to optimize the convolution operation using a 3 × 3 kernel with stride 1. In general, each PE computes a partial sum of the convolution and sends it to other PEs to generate a single convolution result using an accumulator. If the CNN model includes a layer with N × N filters, the proposed PE array can be easily configured to support the required kernel size and stride. The FDCA consists of 4 × 8 KUs, with each KU containing 9 PEs, and it can simultaneously process four input channels and eight output channels.

3.1. Four FDCA for Convolution Acceleration
To accelerate the convolution operation, we employ four FDCAs to calculate 16 input data channels simultaneously. The relevant processing architecture is illustrated in Figure 3.
The convolutional result generated by a single filter using an FDCA is stored in a convolutional memory (Conv_mem). After calculating the convolutional outputs for all input channels, the final result is produced by accumulating them and storing the result in the Conv_mem as the final output.

Figure 3. Four-FDCA processing architecture block diagram.

In addition, the four-FDCA architecture is specially designed to maximize data reuse and speed up the processing of the convolutional layer. Therefore, the optimized data reuse on the KUs of the architecture provides a higher utilization ratio in any 3 × 3 or 6 × 6 kernel mode compared to previous studies.

3.2. Max Pooling
The max pooling operation compares 25 input feature map values and produces the largest value from them. For the max pooling operation, we designed a 5 × 5 max pooling hardware block that is composed of 128 comparators. The max pooling block includes 16 in/out flip-flop memories and a (de)channeling controller that is used to reorder and write data back to DDR.

3.3. Element-Wise Adder
The element-wise adder is a hardware architecture designed to perform element-wise addition of data from two different feature maps with equal size. The hardware block consists of parallel adders and input/output buffers. The concept of the element-wise addition operation is derived from the latest YOLO models, which merge data from two streams.

3.4. Upsampling (Resize)
Upsampling is a novel hardware sub-circuit used to increase the size of the input feature map. For an input feature map with a size of n × n, the upsampling layer increases the output feature map size to 2n × 2n by making an exact copy of each input feature map pixel and placing it at the bottom, right, and bottom-right diagonal pixel positions.

3.5. Global Input/Output Buffers and AXI4 Data Bus
In the proposed architecture, the global input/output buffers are used for sending/receiving data to/from DDR via the 256-bit AXI4 data bus. The proposed architecture includes an additional input buffer called "Instruction Memory", which is used for writing the instruction microcode from the RISC-V CPU core through the 32-bit AXI4-Lite bus protocol.

3.6. CNN Controller
The CNN controller hardware block controls the processing of the functional hardware blocks in the proposed design, such as the four FDCAs for convolution acceleration, 5 × 5 max pooling, element-wise adder, and upsampling. The CNN controller uses microcode information to manage all data processing operations, from reading input data to the processing hardware block from a predefined address to writing output data to DDR DRAM using AXI4 transactions.

4. The Proposed CNN Accelerator
4.1. Stride in Convolutional Operation
The stride defines the number of moving steps of the filter through the input feature map to generate an output value. If the stride is bigger than one (>1), the output feature map size decreases compared to the input.
Equation (1) defines the output feature map size (o) for a given stride (s), where i is the input feature map size, k is the kernel size, and p is the padding added to the input feature map.

$o = \mathrm{Roundoff}\left(\dfrac{i - k + 2p}{s}\right) + 1$   (1)

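A small helper that evaluates Equation (1) is shown below (illustrative only; floor rounding is assumed for Roundoff, which matches the usual convolution arithmetic):

```python
def conv_output_size(i, k, s, p):
    """Output feature map size per Equation (1): o = Roundoff((i - k + 2p) / s) + 1."""
    return (i - k + 2 * p) // s + 1  # floor rounding assumed

# A 3 x 3 kernel with stride 2 and padding 1 halves a 640 x 640 input to 320 x 320.
assert conv_output_size(640, 3, 2, 1) == 320
# The same kernel with stride 1 and padding 1 keeps the spatial size unchanged.
assert conv_output_size(640, 3, 1, 1) == 640
```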

Most CNN accelerators only use a stride of 1 when shifting a kernel over an input. In this article, we introduce a new method for implementing stride 2 convolution on hardware.

4.2. Convolution Using Kernel 3 × 3 with Stride 1 in Kernel Unit
The convolution operation using a 3 × 3 kernel with a stride of 1 is a basic kernel mode that utilizes the entire 3 × 3 PE array structure, known as the KU. In this work, the filter convolves the input feature map in a direction from up to down; the kernel is shifted vertically over the input feature map with a stride of 1. In the KU, pixel data move horizontally from left to right and are reused for one clock cycle in each PE. The weights move vertically, from up to down, and are reused in every clock cycle in the KU. In each clock cycle, FIFOs supply three pixels and weight data values to the KU. Each PE transfers its accumulated value to the bottom-right diagonal PE over the KU during the processing. As shown in Figure 4, PE6 and PE7 do not have bottom-right diagonal PEs. Therefore, these PEs transfer accumulated values to PE1 and PE2, respectively.

Figure 4. Kernel unit for convolution 3 × 3 stride 1.

PE2, PE5, and PE8 perform convolution operations for one input channel by accumulating nine data, including six data accumulated by the PEs of the previous two vertical lines. The last vertical line produces one convolution result in every clock cycle after the first result is produced. Since only one final result is produced in the KU for each clock, the conv_select signal is reused to select the results sequentially. The color of each PE and arrow indicates the path of partial sums used to compute one convolution result. The convolution results are repeatedly calculated in the order of red, blue, and green.

4.3. Convolution Using Kernel 3 × 3 with Stride 2 in Kernel Unit
The convolution operation with a 3 × 3 kernel and a stride of 2 uses the same PE array architecture, known as the KU, as that used for a stride of 1. However, in the stride 2 mode, data sharing between PEs differs; the data accumulated in each PE will be transferred horizontally to the next PE, not diagonally. To perform a convolution operation with a stride of n (n > 1), six pixels of data must be loaded into the KU at the same time. We used a FIFO output cache memory between the reconfigurable input FIFO and the KU to prepare data for further processing in the PEs. Using this cache memory, we can read two pixels of data from the input FIFO at the same time.
As illustrated in Figure 5, the first three pixels of data will be sent to the PEs in the first column, as in the stride 1 mode. The next three pixels of data from the cache register will be simultaneously sent to the second-column PEs. After completing clock calculations in the PEs, we have to pass the pixel values from the first-column PEs to the last-column PEs. In this order, we maintain the convolution operation with stride 2, efficiently reusing the data from the first-column PEs. Pixel data values from the first-column PEs of the KU will be transferred to the diagonally downward PEs in the last column. This means that data reuse only occurs in the first and third columns of the KU, specifically in the stride 2 mode. In addition, weights are reused in every clock cycle by vertically rotating them from top to bottom.

Figure 5. Kernel unit for convolution 3 × 3 stride 2.

Figure 6b illustrates an example of a convolution operation with stride 2, representing the motion of the filter over the input feature data.

Figure 6. Example of filter shifting over the input feature map: (a) 3 × 3 stride 1; (b) 3 × 3 stride 2.
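For reference, the arithmetic that a KU produces in the 3 × 3 stride 1 and stride 2 modes corresponds to a plain sliding-window convolution. The sketch below is a behavioural software model only (symmetric padding is assumed here); it does not reflect the diagonal partial-sum dataflow of the hardware:

```python
import numpy as np

def conv2d_3x3(x, w, stride=1, pad=1):
    """Sliding-window 3 x 3 convolution over a single channel (behavioural model)."""
    xp = np.pad(x, pad)
    k = w.shape[0]
    oh = (xp.shape[0] - k) // stride + 1
    ow = (xp.shape[1] - k) // stride + 1
    out = np.zeros((oh, ow), dtype=xp.dtype)
    for r in range(oh):
        for c in range(ow):
            window = xp[r * stride:r * stride + k, c * stride:c * stride + k]
            out[r, c] = np.sum(window * w)
    return out

x = np.arange(36, dtype=np.int32).reshape(6, 6)
w = np.ones((3, 3), dtype=np.int32)
y_stride1 = conv2d_3x3(x, w, stride=1)  # 6 x 6 output with padding 1
y_stride2 = conv2d_3x3(x, w, stride=2)  # 3 x 3 output with padding 1
```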
4.4. Convolution Using Kernel 1 × 1 with Stride 1 in Kernel Unit
Using the proposed 3 × 3 PE array of the KU, we can run the convolution operation using a 1 × 1 kernel size with a stride of 1. For the 1 × 1 convolution operation, we have designed an accelerator circuit that exclusively utilizes the first horizontal line of PEs in the KU unit. In this mode, the upper left PE receives only one datum in every clock cycle, while the upper right PE generates the output of the operation. After the initial output result is calculated in the proposed circuit, each clock cycle produces a result of the convolution output, similar to convolution with a 3 × 3 kernel mode.
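The 1 × 1 mode reduces to a per-pixel weighted sum over the input channels, which is why a single row of PEs is sufficient. A behavioural sketch follows (the tensor shapes are illustrative assumptions only):

```python
import numpy as np

def conv1x1(x, w):
    """Pointwise (1 x 1) convolution: x has shape (C_in, H, W), w has shape (C_out, C_in)."""
    return np.tensordot(w, x, axes=([1], [0]))  # result shape (C_out, H, W)

x = np.random.randint(-128, 128, size=(16, 8, 8)).astype(np.int32)
w = np.random.randint(-128, 128, size=(8, 16)).astype(np.int32)
y = conv1x1(x, w)
assert y.shape == (8, 8, 8)
```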
4.5. Convolution Using Kernel 6 × 6 in Kernel Unit
The convolution operation with a 6 × 6 kernel is performed by executing the convolution operation with a 3 × 3 kernel on the KU. Figure 7 illustrates the segmentation of a convolution operation using a 6 × 6 kernel into four 3 × 3 kernels for processing with the proposed design. In this mode, the input feature data for each convolution operation with a 3 × 3 kernel process a dedicated part of the whole input data, with overlapped contents. To calculate the final output of the convolution operation with a 6 × 6 kernel, we use a two-stage adder tree after finishing the convolutions in the FDCAs.

Figure 7. Applying 3 × 3 kernel mode for convolution operation with a 6 × 6 kernel.
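The decomposition of Figure 7 can be expressed in software as four 3 × 3 convolutions over shifted views of the input, whose partial results are combined by a two-stage adder tree. The partitioning offsets below are an illustrative assumption; the sketch only checks the arithmetic equivalence, not the hardware schedule:

```python
import numpy as np

def conv2d_valid(x, w):
    """Plain valid (no padding, stride 1) convolution used as the building block."""
    k = w.shape[0]
    oh, ow = x.shape[0] - k + 1, x.shape[1] - k + 1
    out = np.zeros((oh, ow), dtype=x.dtype)
    for r in range(oh):
        for c in range(ow):
            out[r, c] = np.sum(x[r:r + k, c:c + k] * w)
    return out

def conv6x6_from_four_3x3(x, w6):
    """6 x 6 convolution assembled from four 3 x 3 convolutions (cf. Figure 7)."""
    oh, ow = x.shape[0] - 5, x.shape[1] - 5
    partials = []
    for br in (0, 3):
        for bc in (0, 3):
            sub_kernel = w6[br:br + 3, bc:bc + 3]
            shifted_view = x[br:br + oh + 2, bc:bc + ow + 2]
            partials.append(conv2d_valid(shifted_view, sub_kernel))
    # Two-stage adder tree: two pairwise sums, then the final sum.
    stage1_a = partials[0] + partials[1]
    stage1_b = partials[2] + partials[3]
    return stage1_a + stage1_b

x = np.random.randint(0, 16, size=(12, 12)).astype(np.int64)
w6 = np.random.randint(-4, 5, size=(6, 6)).astype(np.int64)
assert np.array_equal(conv6x6_from_four_3x3(x, w6), conv2d_valid(x, w6))
```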


4.6. Reconfigurable Input FIFO
Figure 8 shows the block diagram of the reconfigurable input FIFO. For each kernel mode, we utilize a portion of the reconfigurable input FIFO. Each FIFO stores one column of data for the input slice. Since each address of the FIFO contains four feature map data, the depth of the FIFO is equal to the slice size/4.

Figure 8. Reconfigurable input FIFO (red frame) with kernel unit.

The following configurations of the reconfigurable input FIFO are used for different operation modes:
• The convolution operation with a 3 × 3 kernel and a stride of 1 requires the use of two SRAM-FIFOs and a register. While the KU reads data from the SRAM-FIFO or register, the data not only go to the KU but also to another SRAM-FIFO containing the previous data from the left column in the feature map.
• The convolution operation with a 3 × 3 kernel and a stride of 2 uses three SRAM-FIFOs. When the PE array loads data from the FIFO address that stores the data for the last column, it also sends the data to another FIFO for reuse in the next first column, bypassing the middle column. This indicates that the stride 2 mode requires the KU to reuse only the data from the last column as the first column next time.
• The convolution operation with a 1 × 1 kernel and a stride of 1 only utilizes one SRAM-FIFO. In this mode, the output data only come from PE2.

4.7. FIFO Output Cache Memory
We used FIFO output cache memories to share feature map data between the input FIFO and the KU. Each address in the input FIFO contains four feature map data at a moment. Therefore, the FIFO output cache memory block receives four data from the FIFO at once, queues them in order, and transfers them to the KU, respectively. Each cache memory consists of eight 16-bit registers and enables the loading of one or two feature map data to the KU. We divided the eight 16-bit registers into two parts, named Area0 and Area1, each consisting of four 16-bit registers.
In our design, we used a cache memory with depth = 8, which is twice the size of the data read from the FIFO. By using cache memory with eight registers, we are able to read two data from the cache without waiting for the next four data from the FIFO. In the case of using cache memory with depth = 4, it would not be possible to read and send two data to the KU at the same time.
For example, if we use cache memory with depth = 4, after reading three data from the cache memory, we would not be able to read another two data from it. We face a memory limitation problem in our circuit. If we read only one datum, which is left in the cache memory, then we must wait for the next four data readings from the FIFO. This problem causes circuit insufficiency and produces incorrect output.
Figure 9 shows that the "finish" signal becomes active when reading data from a specific location in Area0 and Area1. For example, if the address is 2, then finish[0] goes high when reading two data in Area0. The "finish" signal acts as the read enable signal for the input FIFO and generates the write enable signal for the FIFO output cache memory via a register. When data are initially stored in the FIFO output cache memory, the finish signal cannot be activated until the data are read from the FIFO output cache memory. Therefore, the cache memory must read data from the FIFO using the init_rd_en signal generated by the controller.

Figure 9. FIFO output cache memory.

4.8. Slice and Iteration


Modern CNN models are becoming increasingly complex by using large image sizes
for input data and increasing the depth and scale of neural networks to achieve high
prediction accuracy. Due to these changes in CNN models, there is a need to develop new
hardware accelerator models with high processing capabilities and reconfigurability. This
work introduces the concept of slicing, which involves uniformly cutting a whole input
feature map into specific-sized parts. The concept of slicing is not only used for defining
the height and width of each slice but also the depth of the input channel, which is divided
into slices based on the number of input channels.
After applying the slicing of the input data, each slice must be processed separately with filters. It is determined by the number of iterations required to process the entire input data. In our design, we introduced and used new iteration ideas, such as input, slice, and output iteration, to process the input data efficiently and quickly.
By applying the concept of slicing, we reduce the amount of input data needed for processing in the CNN accelerator at once. As a result, the size of the on-chip memory and computational circuits is effectively reduced. To achieve the aforementioned improvements, this study not only introduced the concept of slicing but also implemented the idea for a limited number of kernels. Figure 10 illustrates how the iteration is defined in our architecture based on the feature map size and slice. When using the concept of slicing in convolution, there is an overlap between two slices. Figure 10a shows that the overlap between slices occurs when using 3 × 3 kernels. Figure 10b,c represent an example of input/output iteration. In this study, because 16 input channels and 8 filters can be used for calculation simultaneously, the number of channels used for one input iteration becomes 16, contrary to what is shown in the figure. Similarly, eight (8) filters are used for convolution processing in each output iteration.



Figure 10. Concept of slice and iteration: (a) slice iteration and overlap (the color indicates the area of each slice, and the number indicates the overlap size); (b) input iteration; (c) output iteration.
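The slice and iteration counts described above can be summarised with a small helper (illustrative only; the ceiling-based counting, the 2-pixel overlap for 3 × 3 kernels, and the example sizes are assumptions, while the 16-channel and 8-filter limits follow the text):

```python
import math

def count_slices(feature_size, slice_size, overlap=2):
    """Slices along one spatial dimension; consecutive slices overlap by `overlap` pixels (Figure 10a)."""
    step = slice_size - overlap
    return max(1, math.ceil((feature_size - overlap) / step))

def count_iterations(in_channels, out_channels, ch_per_input_iter=16, filters_per_output_iter=8):
    """Input/output iteration counts when 16 input channels and 8 filters are processed at once."""
    return math.ceil(in_channels / ch_per_input_iter), math.ceil(out_channels / filters_per_output_iter)

# Example: a 160 x 160 feature map cut into 42 x 42 slices, with 64 input and 128 output channels.
slices_per_dim = count_slices(160, 42)                  # 4 slices per dimension
input_iters, output_iters = count_iterations(64, 128)   # 4 input and 16 output iterations
```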

4.9. Input Padding
To add padding to the input feature map, most CNN accelerators use a software-based approach before loading the data to the DDR DRAM. In this work, we designed the circuit to add zero padding around the input feature map, a concept known as input padding in the circuit. Using input padding is more effective than using output padding. Figure 11 presents the ratio of input padding storage pixels to output padding based on the size of the input feature map. The figure shows that as the feature map size decreases, the proportion of storage pixels also decreases by up to 82.6%.

Figure 11. Input padding vs. output padding.

The input padding circuit includes the register and control signal. The first component is the 2-bit padding state register, known as the "FIFO write selection", which varies depending on the current/total slice iteration and kernel mode (Figure 12). Table 1 and Figure 13, along with the description below, explain the adding-zero-padding method for each case.

Figure 12. Padding circuit structure.

Table 1. FIFO write selection signal states.

FIFO Write Select | Reg0 | Reg1 | Reg2 | Reg3 | Reg4
① ZEROZERO | 0 | 0 | 0 | 0 | 0
② READZERO | InBuf[0] | InBuf[1] | InBuf[2] | InBuf[3] | 0
③ ZEROREAD | 0 | InBuf[0] | InBuf[1] | InBuf[2] | InBuf[3]
④ SHIFTREAD | Reg4 | InBuf[0] | InBuf[1] | InBuf[2] | InBuf[3]

Figure 13. The use cases of a FIFO write select signal (number) depending on the padding position (red frame).

1. ZEROZERO: When all data entering the reconfigurable FIFO is zero, only zeros are
needed for padding.
2. READZERO: All data from the global input buffer are loaded into the FIFO when the
slice iteration does not require any padding.
3. ZEROREAD: All data in the global input buffer are loaded, and a single zero is
inserted in front of the data as padding. It is used to load a part of the zero padding at
the top slice of the input feature map.
4. SHIFTREAD: When importing new data, the last pixel of the previously imported
data is concatenated with the newly read data. The function exists to use the most
recently imported data in the ZEROREAD scenario.
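A behavioural sketch of Table 1 and the four cases above follows (illustrative only, not the RTL); in_buf holds the four words read from the global input buffer, and prev_reg4 is the Reg4 value left by the previous transfer:

```python
def fifo_write_select(mode, in_buf, prev_reg4=0):
    """Return the Reg0..Reg4 values written toward the reconfigurable FIFO, following Table 1."""
    if mode == "ZEROZERO":     # only zero padding is needed
        return [0, 0, 0, 0, 0]
    if mode == "READZERO":     # plain data, no padding in front
        return [in_buf[0], in_buf[1], in_buf[2], in_buf[3], 0]
    if mode == "ZEROREAD":     # a single zero inserted before the data
        return [0, in_buf[0], in_buf[1], in_buf[2], in_buf[3]]
    if mode == "SHIFTREAD":    # reuse the last pixel of the previous read
        return [prev_reg4, in_buf[0], in_buf[1], in_buf[2], in_buf[3]]
    raise ValueError("unknown FIFO write selection: " + mode)

assert fifo_write_select("ZEROREAD", [11, 12, 13, 14]) == [0, 11, 12, 13, 14]
```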
The 1-bit wire, InB_flag, generates a read enable signal for the global input buffer by entering a two-input AND gate with a read enable signal asserted by the controller in the KU. If the "FIFO write selection" is ZEROZERO, then the wire has a value of 0, and the KU receives only a "zero" value for the padding. Therefore, the KU receives "zero" data without accessing the global input buffer.
The kernel mode also affects padding. The 3 × 3 convolution operation with a stride of 1 requires adding zeros around all sides of the input feature map. For a stride of 2, additional zeros are only required for the upper and left sides of the input. Padding is not applied in a 1 × 1 convolution operation.

4.10. Bias–Activation–Scaling Pipeline Architecture
In this
In this study,
study, we targeted the
we targeted YOLOv5n model
the YOLOv5n model and quantized the
and quantized model to
the model to an
an integer
integer
representation. Thus, we designed an extra circuit to calculate bias, activation, and
representation. Thus, we designed an extra circuit to calculate bias, activation, and scalingscaling
parameters
parameters for for converting
converting the
the final
final convolution
convolution result
result into
into the
the feature
feature map
map forfor the
the next
next
layer. Figure 14 illustrates the bias–activation–scaling (BAS) pipeline architecture
layer. Figure 14 illustrates the bias–activation–scaling (BAS) pipeline architecture used in used
in the proposed SoC hardware. All parameters, bias, activation, and scaling parameters
the proposed SoC hardware. All parameters, bias, activation, and scaling parameters are
are represented as 16-bit integers, allowing for a fast and low-cost area architecture. The
represented as 16-bit integers, allowing for a fast and low-cost area architecture. The pipe-
pipeline process consists of six stages and requires seven cycles to process one piece of data.
line process consists of six stages and requires seven cycles to process one piece of data.



Figure 14. Bias–activation–scaling pipeline architecture: (a) bias; (b) activation (Leaky ReLU); (c) scaling.

The bias is seen as a part of batch normalization (BN). BN is usually used to train CNN models. In YOLOv5 training, the BN for CNN layers is calculated using Equation (2):

Y = (X_conv − E(X)) / √(Var(X) + ε) × ρ + β.    (2)

Here, X_conv is the output of the convolutional filter, E(X) represents the mean, Var(X) means the variance, epsilon (ε) is added for numerical stability, ρ is the batch normalization scaling factor, and β is the shift factor (bias). These parameters are determined during the training process, and they remain constant within each layer during the inference [14,28,29,31].

The convolution operation with bias can be represented by the following equation:

Y = ∑(w1 ∗ X_FMAP) + β_bias,    (3)

where w1 represents the weight, X_FMAP is the input feature map, and β_bias is the constant called the "bias". We simplify the normalization of Equation (2) into the single bias addition of Equation (3). This simplification reduces hardware costs without sacrificing accuracy. In addition, we used techniques like rounding and truncation in the BAS circuit.
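The step from Equation (2) to Equation (3) corresponds to folding the trained BN parameters into the weights and a per-channel bias before inference. The following Python sketch shows this folding performed offline; it is a minimal illustration of the arithmetic under the assumption of per-output-channel BN statistics, and it does not reproduce the fixed-point formats used in the BAS circuit.

```python
import numpy as np

def fold_bn_into_bias(weights, mean, var, rho, beta, eps=1e-5):
    """Fold the BN of Equation (2) into folded weights and a bias (Equation (3)).

    weights : (out_ch, in_ch, kh, kw) convolution weights
    mean, var, rho, beta : per-output-channel BN statistics and parameters
    """
    scale = rho / np.sqrt(var + eps)                 # per-channel scale factor
    w_folded = weights * scale[:, None, None, None]  # scale every filter
    b_folded = beta - mean * scale                   # constant bias per channel
    return w_folded, b_folded

# After folding, the layer reduces to y = conv(x, w_folded) + b_folded,
# i.e. the form of Equation (3) with a single bias addition per output channel.
```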
36-bit dividers are used in the Leaky ReLU activation circuit, and they require
two clock cycles to prevent setup-time violations at a high frequency of 400 MHz. The
scaling process consists of four stages and operates for four cycles. The process involves
multiplication, division, rounding, addition, truncation, and subtraction, in that order. The
scaling applies the parameters generated by quantization.
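The behaviour of the three BAS stages of Figure 14 can be approximated in software as shown below. This is a hedged sketch only: the bias addition and the divider-based Leaky ReLU follow the description above, whereas the divisor value, the requantization constants, the zero point, and the output clamping range are assumptions introduced purely for illustration and are not taken from the chip.

```python
def bas_pipeline(acc, bias, slope_div=10, scale_mul=1, scale_div=1 << 8, zero_point=0):
    """Integer sketch of the bias -> activation -> scaling path of Figure 14.

    acc       : wide convolution accumulator value (int)
    bias      : folded bias for the output channel (int)
    slope_div : assumed divisor approximating the Leaky ReLU negative slope
    scale_mul, scale_div, zero_point : assumed requantization parameters
    """
    x = acc + bias                                 # (a) bias stage
    x = x if x >= 0 else int(x / slope_div)        # (b) Leaky ReLU via division (truncating toward zero)
    y = (x * scale_mul) / scale_div                # (c) scaling: multiply, then divide
    y = int(round(y)) + zero_point                 # rounding and addition
    return max(-128, min(127, y))                  # truncation to an assumed 8-bit output range

print(bas_pipeline(acc=5000, bias=-300))           # example requantized output pixel
```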

5. Advantage of the Proposed Architecture


The YOLOv5n model requires numerous computations for each layer, with almost 99%
of them involving the convolution operation. Therefore, we developed a reconfigurable
and optimized hardware accelerator for convolution operations. The proposed computing
method in the CNN accelerator supports stride 1 and stride 2 convolution operations with
various kernel sizes. It enhances computational efficiency by using slicing and iterations,
thereby accelerating image processing in hardware. Our design efficiently utilizes hardware
resources to perform fast convolution operations, offering numerous structural advantages.
In this section, we will discuss the improvements in the proposed architecture.

5.1. High PE Array Utilization with Flexibility


Most CNN accelerator architectures demonstrate good resource utilization, reaching
up to 90% in commonly used kernel modes [3,28]. However, as CNN models become more
complex, increasing the depth and scale of deep neural networks (DNNs) and using
different kernel sizes and strides for image processing, such accelerators cannot perform all
of the required convolution operations efficiently. Traditional CNN accelerators typically operate only with a kernel
size of 3 × 3 and a stride of 1, or they may support a stride of 2 with less than 25%
utilization. Therefore, we designed a new CNN accelerator architecture with FDCA to
efficiently perform convolution operations using newly introduced kernel modes. Our
proposed architecture demonstrates that PE resource utilization exceeds 95%, even for
convolutions with kernel sizes of 3 × 3 or 6 × 6 at a stride of 2. Table 2 presents the PE
utilization ratio for various kernel modes with a slice size of 160.

Table 2. Clock cycle and utilization according to kernel mode.

Architecture                Kernel Mode       Clock Cycle (N)   Utilization (%)
Previous Architecture       3 × 3 Stride 1    26,224            99.86
                            3 × 3 Stride 2    26,224            24.96
                            1 × 1 Stride 1    25,609            11.11
Proposed Architecture       3 × 3 Stride 1    26,015            99.98
                            3 × 3 Stride 2    6848              96.4
                            1 × 1 Stride 1    25,611            11.11
Future Work                 6 × 6 Stride 2    6848              96.4

5.2. Convolution Operation 3 × 3 Stride 2 Speed Optimization


Table 3 presents a comparison of the clock cycles required for 3 × 3 convolution with a
stride of 2 in the previous and proposed architectures for processing the YOLOv5n model.
The proposed architecture consumes about 9.4-times-fewer clock cycles. The proposed
architecture provides more data in the same time frame, which optimizes the speed of the
3 × 3 stride 2 convolution operation.

Table 3. Comparison of stride-2-mode clock cycle.

Layer Previous (1DCA) [28] Proposed (4FDCA)


Layer0 3,841,024 475,136
Layer1 1,920,512 237,568
Layer3 1,868,032 185,088
Layer5 1,841,792 158,848
Layer7 1,828,687 145,728
Layer18 920,896 79,424
Layer21 914,336 72,864
Total 13,135,264 1,354,656
Clock Cycle Ratio 9.39 1

5.3. Area Efficiency


In general, the optimized architecture for the specific kernel (size) mode demonstrates
high PE utilization. However, designing individual sub-circuits for each kernel mode
requires a significant amount of hardware resources. Therefore, in our proposed CNN
accelerator, we have designed it so that more than 90% of the KU area is shared among all
kernel modes. The multiplexer (Mux) is used to configure the connection between the PEs
of the KU for the required kernel mode.
Table 4 presents the total number of logic gates for the proposed CNN architecture
and previous architectures in different kernel modes. It shows that the proposed archi-
tecture's area is 2.14 times smaller than the total area of the previous architecture. The
area was reduced by a factor of 2.14 rather than 3 because the 1 × 1 kernel mode utilizes only
one PE, while the shared KU still contains nine PEs and is therefore nine times larger than that single mode requires.

Table 4. Comparison of area using gate count number.

Architecture                                 Kernel Mode                                       Gate Count Number
Application Using Previous                   3 × 3 Stride 1                                    10,100,000
Architecture [28]                            3 × 3 Stride 2                                    10,100,000
                                             1 × 1 Stride 1                                    2,700,000
                                             Merge                                             22,900,000
Proposed Architecture                        3 × 3 Stride 1, 3 × 3 Stride 2, 1 × 1 Stride 1    10,697,551
Expanded Architecture                        6 × 6 Stride 1, 2                                 +200,000

Although the proposed architecture supports two additional kernel modes, the total
chip area increases by only 6% compared to the dedicated 3 × 3 stride 1 design
alone. If we expand the proposed CNN accelerator architecture to support 6 × 6
convolution operations with stride 1 and stride 2 modes in the future, the expected chip
area will increase by 6% compared to the current area.

5.4. Data Load Optimization


The proposed design significantly improves data loading speed, performing nine
times faster than a GPU. In our design, we efficiently utilized the following components to
achieve faster data loading on the circuit:
• We used reconfigurable input FIFOs to organize, transfer, and reuse data on the KU.
FIFOs manage all data feeding and reusing procedures on the vertical and horizontal
lines of the KU unit during convolution operations. Our design allows for the use of
up to three FIFOs, depending on the configuration of the KU (PEs array). Because of
these three reconfigurable input FIFOs, the KU can reuse feature map data, reducing
the number of data loads by up to one-third.
• The KU reuses data by sharing them among connected PEs. Typically, the GPU reads
each pixel of input data from DDR memory three times. In our design, the KU reuses
the same pixel data three times by passing it to other PEs. This mechanism reduces
the number of memory accesses by three times; a short counting sketch of this reuse follows the list.
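The effect of this reuse can be illustrated with a simple count of external-memory reads for one input channel. The Python sketch below is illustrative arithmetic only, assuming a 3 × 3 kernel with a stride of 1 and no padding; it does not model the actual memory controller of the accelerator.

```python
def external_reads(height, width, kernel=3, reuse_rows=False):
    """Count input-pixel fetches from external memory for one channel.

    Without row reuse, every output row re-fetches the `kernel` input rows it
    needs, so interior pixels are read `kernel` times.  With line-buffer (FIFO)
    reuse, each pixel is fetched from external memory only once.
    """
    if reuse_rows:
        return height * width
    out_rows = height - kernel + 1        # stride-1 output rows, no padding
    return out_rows * kernel * width      # each output row loads `kernel` rows

h, w = 160, 160
ratio = external_reads(h, w) / external_reads(h, w, reuse_rows=True)
print(ratio)   # close to 3, i.e. roughly three times fewer external reads
```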

5.5. Small On-Chip Memory Size


Each of the four FDCAs consists of 32 KUs. Each FDCA is designed to simultaneously
compute four input and eight output channels. The overall design supports the computing
of 16 parallel input channels by employing four FDCAs in a parallel architecture. This
allows for the simultaneous processing of data from sixteen input feature map channels.
Modern CNN models, such as the YOLOv5n used in this study, require the processing
of more than sixteen input and output data channels during convolutional layer computa-
tion. Simultaneously processing all corresponding channels in parallel requires multiple
connected PE arrays and on-chip memories, which increases the hardware costs of the CNN
accelerator by occupying a large amount of hardware resources. Therefore, we applied the
slicing and iteration concepts to efficiently process the input data. As a result, we were able
to optimize power consumption and chip area utilization.
By using the concept of slicing, we convolve a part of the input feature map with given
filters to generate the sliced output result. In this scenario, only the essential slice data will
be copied from DDR memory for processing in the FDCA block. Therefore, the proposed
architecture stores partial input feature map data corresponding to a single slice, and it uses
a smaller on-chip memory size in the design compared to storing the entire feature map.
Utilizing small on-chip memories helps to minimize the number of accesses to the DDR
memory, reduce data loading time, and maximize data reuse in slice rotating operations,
which involve reusing the same data for different filter weights.
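The slicing and iteration scheme can be summarized by the loop nest below. This is a high-level Python sketch of the scheduling order only; the tile sizes reflect the 16-input-channel and 8-output-channel parallelism described above, while the slice height, the load_slice_from_ddr and convolve_tile helpers, and the omission of partial-sum accumulation across input-channel tiles are simplifications for illustration.

```python
IN_CH_TILE = 16    # input channels processed in parallel by the four FDCAs
OUT_CH_TILE = 8    # output channels produced in parallel
SLICE_ROWS = 160   # assumed slice height (rows kept on-chip at a time)

def run_layer(fmap_shape, in_ch, out_ch, load_slice_from_ddr, convolve_tile):
    """Sketch of slice-by-slice, tile-by-tile convolution scheduling."""
    height, _width = fmap_shape
    for row0 in range(0, height, SLICE_ROWS):          # one slice at a time
        for ic0 in range(0, in_ch, IN_CH_TILE):        # input-channel tiles
            # only the slice data needed now is copied from DDR to on-chip SRAM
            slice_data = load_slice_from_ddr(row0, SLICE_ROWS, ic0, IN_CH_TILE)
            for oc0 in range(0, out_ch, OUT_CH_TILE):  # rotate the slice over filters
                # the same on-chip slice is reused for every output-channel tile
                convolve_tile(slice_data, ic0, oc0)

# Example call with stub helpers (no real DDR or PE array involved):
# run_layer((160, 160), in_ch=32, out_ch=64,
#           load_slice_from_ddr=lambda r, n, c, t: None,
#           convolve_tile=lambda s, ic, oc: None)
```

Keeping only one slice on chip at a time is what allows the small SRAM budget described above, while the innermost loop reuses that slice against every output-channel tile.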

6. Hardware Implementation Results


In order to evaluate the hardware cost of the proposed CNN accelerator, we imple-
mented it on the Xilinx Zynq UltraScale+ MPSoC ZCU102 FPGA platform. For hardware
synthesis, Vivado 2022.2 was used. Our implementation occupies 249,357 LUTs, 2304 DSPs,
and 567 KB of BRAM. The CNN accelerator operates at
400 MHz, and the reference image inference speed is 47.17 frames per second (FPS). Table 5
shows the implementation results on FPGA.

Table 5. FPGA implementation result.

[14] [29] [31] Proposed Architecture


FPGA VC707 ZCU102 Zynq-7020 ZCU102
LUT 86k 95k 30.1k 249k
DSP 168 609 149 2304
BRAM (kB) 2308 2160 4731 567
GOPs 464.7 85.8 - 1075.2
Parameter Variable bit fixed point 16-bit fixed point 16-bit fixed point 16-bit integer
Model YOLOv2-tiny YOLOv2-tiny YOLOv4-tiny YOLOv5n
CMOS chip SRAM (kB) - - - 275.75

To test the performance of the proposed CNN accelerator, we accelerated a quantized
YOLOv5n model for inference. For this purpose, we developed a microcode-based
CNN controller circuit that allows for the programmability of any CNN model. Our modified
YOLOv5n is an object detection model pre-trained on the COCO dataset. All model
parameters, including weights and bias values, as well as the input feature map data, were
quantized to 8-bit integers. The model's object detection performance was evaluated using
mean average precision (mAP). We set our threshold at 0.5 ([email protected]) and achieved a detection
mAP of 43.1% on the FPGA. The results demonstrate that the proposed architecture implementation
significantly improves inference throughput while maintaining high accuracy, similar to
the software model.
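As a point of reference, the sketch below shows one common symmetric scheme for quantizing floating-point weights to 8-bit integers. The paper does not detail its exact quantization recipe, so the per-tensor scale, the helper names, and the rounding choice here are assumptions used purely for illustration.

```python
import numpy as np

def quantize_int8(tensor):
    """Symmetric 8-bit quantization: q = round(x / scale), scale = max|x| / 127."""
    scale = np.abs(tensor).max() / 127.0
    q = np.clip(np.round(tensor / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(8, 16, 3, 3).astype(np.float32)   # example 3 x 3 filter bank
q, s = quantize_int8(w)
print(np.abs(w - dequantize(q, s)).max())              # quantization error stays small
```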
Furthermore, the proposed CNN accelerator was implemented as a system on chip (SoC) using a Samsung 14 nm CMOS process. The die consists of a shared LPDDR, RISC-V core, and CNN accelerator with FDCA, which are utilized in collaboration with partner companies. The area allocated for the CNN accelerator architecture is 10.96 mm2 (3943 µm × 2780 µm). The chip operates at a frequency of 400 MHz, with a timing constraint set at 2.5 ns. The total power consumption of the chip is 18.52 mW. The implementation uses on-chip SRAM with a size of 275.75 KB. Figure 15 shows the overall chip layout of the proposed CNN accelerator SoC.

Figure 15. Full chip layout implemented in 14 nm CMOS process.

7. Conclusions

In this paper, we proposed a high-speed CNN accelerator architecture based on a flexible diagonal cyclic array (FDCA). The proposed four-FDCA architecture comprises 1152 PEs that can process the data for sixteen input channels and eight output channels simultaneously. The proposed architecture enables the execution of convolution operations with different kernel modes and strides to accelerate the latest CNN models. In the proposed design, we introduced new optimization techniques that improved chip area efficiency by 6% and reduced total chip area utilization by 2.14 times compared to individual block designs for each kernel mode. We also minimized the number of DRAM accesses by using data reuse methods.

The CNN accelerator was synthesized and verified on the Xilinx ZCU102 FPGA and implemented in SoC silicon using 14 nm CMOS process technology. The results demonstrate that the proposed CNN accelerator can perform convolution operations 3.8 times faster, using the proposed new PE array structure, compared to previous CNN accelerators.

Author Contributions: Conceptualization, D.-Y.L., H.A. and H.-W.K.; Designing, D.-Y.L. and H.A.; verification, M.J.; validation, S.-B.P. and M.J.; formal analysis, H.A. and D.-Y.L.; writing—original draft preparation, D.-Y.L.; writing—review and editing, H.A. and S.-B.P.; funding, S.-H.S. and K.-M.L. All authors have read and agreed to the published version of the manuscript.
Funding: This work was supported by the National Research Foundation of Korea (NRF) grant for
RLRC funded by the Korea government (MSIT) (No. 2022R1A5A8026986, RLRC, 25%), and was also
supported by the Institute of Information and communications Technology Planning and Evaluation
(IITP) grant funded by the Korea government (MSIT) (No. 2020-0-01304, Development of Self-
Learnable Mobile Recursive Neural Network Processor Technology, 25%). It was partly supported by
Innovative Human Resource Development for Local Intellectualization program through the Institute
of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the
Korea government (MSIT) (IITP-2024-2020-0-01462, 25%). The National R&D Program supported
this research through the National Research Foundation of Korea (NRF) funded by the Ministry of
Science and ICT (No. 2020M3H2A1076786, System Semiconductor specialist nurturing, 25%).
Data Availability Statement: Data are contained within the article.
Acknowledgments: We thank Thaising Thaing (thaisingtaing@[Link]) for his invaluable
contributions to this work.
Conflicts of Interest: The authors declare no conflicts of interest.

References
1. Akkad, G.; Mansour, A.; Inaty, E. Embedded Deep Learning Accelerators: A Survey on Recent Advances. IEEE Trans. Artif. Intell.
2023, early access.
2. Jocher, G.; Stoken, A.; Chaurasia, A.; Borovec, J.; Xie, T.; Kwon, Y.; Michael, K.; Changyu, L.; Fang, J. Yolov5. NanoCode012.
v6.0—Models. 2021. Available online: [Link] (accessed on 12 October 2021).
3. Huang, W.; Wu, H.; Chen, Q.; Luo, C.; Zeng, S.; Li, T.; Huang, Y. FPGA-Based High-Throughput CNN Hardware Accelerator
with High Computing Resource Utilization Ratio. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 4069–4083. [CrossRef] [PubMed]
4. Yang, J.; Fu, W.; Cheng, X.; Ye, X.; Dai, P.; Zhao, W. S2 Engine: A Novel Systolic Architecture for Sparse Convolutional Neural
Networks. IEEE Trans. Comput. 2022, 71, 1440–1452.
5. Wei, X.; Yu, C.H.; Zhang, P.; Chen, Y.; Wang, Y.; Hu, H.; Liang, Y.; Cong, J. Automated systolic array architecture synthesis for
high throughput CNN inference on FPGAs. In Proceedings of the 2017 54th ACM/EDAC/IEEE Design Automation Conference
(DAC), Austin, TX, USA, 18–22 June 2017; pp. 1–6.
6. Andri, R.; Cavigelli, L.; Rossi, D.; Benini, L. Hyperdrive: A Multi-Chip Systolically Scalable Binary-Weight CNN Inference Engine.
IEEE J. Emerg. Sel. Top. Circuits Syst. 2019, 9, 309–322. [CrossRef]
7. Sedukhin, S.; Tomioka, Y.; Yamamoto, K. In search of the performance-and energy-efficient CNN accelerators. IEICE Trans.
Electron. 2022, 105, 209–221. [CrossRef]
8. Liu, C.-N.; Lai, Y.-A.; Kuo, C.-H.; Zhan, S.-A. Design of 2D Systolic Array Accelerator for Quantized Convolutional Neural
Networks. In Proceedings of the 2021 International Symposium on VLSI Design, Automation and Test (VLSI-DAT), Hsinchu,
Taiwan, 19–22 April 2021; pp. 1–4.
9. Jouppi, N.P.; Young, C.; Patil, N.; Patterson, D.; Agrawal, G.; Bajwa, R.; Bates, S.; Bhatia, S.; Boden, N.; Borchers, A.; et al.
In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 2017 ACM/IEEE 44th Annual International
Symposium on Computer Architecture (ISCA), Toronto, ON, Canada, 24–28 June 2017; pp. 1–12.
10. Wang, Y.; Wang, Y.; Shi, C.; Cheng, L.; Li, H.; Li, X. An Edge 3D CNN Accelerator for Low-Power Activity Recognition. IEEE
Trans. Comput. Aided Des. Integr. Circuits Syst. 2021, 40, 918–930. [CrossRef]
11. Parmar, Y.; Sridharan, K. A Resource-Efficient Multiplierless Systolic Array Architecture for Convolutions in Deep Networks.
IEEE Trans. Circuits Syst. II Express Briefs 2020, 67, 370–374. [CrossRef]
12. Chen, Y.-H.; Krishna, T.; Emer, J.S.; Sze, V. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional
Neural Networks. IEEE J. Solid-State Circuits 2017, 52, 127–138. [CrossRef]
13. Lu, Y.C.; Chen, C.W.; Pu, C.C.; Lin, Y.T.; Jhan, J.K.; Liang, S.P. Live Demo: An 176.3 GOPs Object Detection CNN Accelerator
Emulated in a 28 nm CMOS Technology. In Proceedings of the 2021 IEEE 3rd International Conference on Artificial Intelligence
Circuits and Systems (AICAS), Washington, DC, USA, 6–9 June 2021; pp. 1–4.
14. Nguyen, D.T.; Nguyen, T.N.; Kim, H.; Lee, H.J. A High-Throughput and Power-Efficient FPGA Implementation of YOLO CNN
for Object Detection. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2019, 27, 1861–1873. [CrossRef]
15. Yepez, J.; Ko, S.-B. Stride 2 1-D, 2-D, and 3-D Winograd for Convolutional Neural Networks. IEEE Trans. Very Large Scale Integr.
(VLSI) Syst. 2020, 28, 853–863. [CrossRef]
16. Li, Y.; Lu, S.; Luo, J.; Pang, W.; Liu, H. High-performance Convolutional Neural Network Accelerator Based on Systolic Arrays
and Quantization. In Proceedings of the 2019 IEEE 4th International Conference on Signal and Image Processing (ICSIP), Wuxi,
China, 19–21 July 2019; pp. 335–339.
17. Yang, G.; Lei, J.; Xie, W.; Fang, Z.; Li, Y.; Wang, J.; Zhang, X. Algorithm/Hardware Codesign for Real-Time On-Satellite CNN-Based
Ship Detection in SAR Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5226018. [CrossRef]
18. Ansari, A.; Ogunfunmi, T. Hardware Acceleration of a Generalized Fast2-D Convolution Method for Deep Neural Networks.
IEEE Access 2022, 10, 16843–16858. [CrossRef]
19. Yan, T.; Zhang, N.; Li, J.; Liu, W.; Chen, H. Automatic Deployment of Convolutional Neural Networks on FPGA for Spaceborne
Remote Sensing Application. Remote Sens. 2022, 14, 3130. [CrossRef]
20. Ardakani, A.; Condo, C.; Ahmadi, M.; Gross, W.J. An Architecture to Accelerate Convolution in Deep Neural Networks. IEEE
Trans. Circuits Syst. I Regul. Pap. 2018, 65, 1349–1362. [CrossRef]
21. Wang, J.; Yuan, Z.; Liu, R.; Feng, X.; Du, L.; Yang, H.; Liu, Y. GAAS: An Efficient Group Associated Architecture and Scheduler
Module for Sparse CNN Accelerators. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2020, 39, 5170–5182. [CrossRef]
22. Wang, J.; Park, S.; Park, C.S. Spatial Data Dependence Graph Based Pre-RTL Simulator for Convolutional Neural Network
Dataflows. IEEE Access 2022, 10, 11382–11403. [CrossRef]
23. Li, J.; Un, K.-F.; Yu, W.-H.; Mak, P.-I.; Martins, R.P. An FPGA-Based Energy-Efficient Reconfigurable Convolutional Neural
Network Accelerator for Object Recognition Applications. IEEE Trans. Circuits Syst. II Express Briefs 2021, 68, 3143–3147. [CrossRef]
24. Qiu, J.; Wang, J.; Yao, S.; Guo, K.; Li, B.; Zhou, E. Going deeper with embedded fpga platform for convolutional neural network.
In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA,
21–23 February 2016.
25. Huan, Y.; Xu, J.; Zheng, L.; Tenhunen, H.; Zou, Z. A 3D Tiled Low Power Accelerator for Convolutional Neural Network. In
Proceedings of the 2018 IEEE International Symposium on Circuits and Systems (ISCAS), Florence, Italy, 27–30 May 2018; pp. 1–5.
26. Tu, F.; Yin, S.; Ouyang, P.; Tang, S.; Liu, L.; Wei, S. Deep Convolutional Neural Network Architecture with Reconfigurable
Computation Patterns. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2017, 25, 2220–2233. [CrossRef]
27. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
28. Son, H.; Na, Y.; Kim, T.; Al-Hamid, A.A.; Kim, H. CNN Accelerator with Minimal On-Chip Memory Based on Hierarchical Array.
In Proceedings of the 2021 18th International SoC Design Conference (ISOCC), Jeju Island, Republic of Korea, 6–9 October 2021;
pp. 411–412.
29. Zhang, S.; Cao, J.; Zhang, Q.; Zhang, Q.; Zhang, Y.; Wang, Y. An FPGA-Based Reconfigurable CNN Accelerator for YOLO. In
Proceedings of the 2020 IEEE 3rd International Conference on Electronics Technology (ICET), Chengdu, China, 8–12 May 2020;
pp. 74–78.
30. Adiono, T.; Putra, A.; Sutisna, N.; Syafalni, I.; Mulyawan, R. Low Latency YOLOv3-Tiny Accelerator for Low-Cost FPGA Using
General Matrix Multiplication Principle. IEEE Access 2021, 9, 141890–141913. [CrossRef]
31. Li, P.; Che, C. Mapping YOLOv4-Tiny on FPGA-Based DNN Accelerator by Using Dynamic Fixed-Point Method. In Proceedings
of the 2021 12th International Symposium on Parallel Architectures, Algorithms and Programming (PAAP), Xi’an, China, 10–12
December 2021; pp. 125–129.
32. Babu, P.; Parthasarathy, E. Hardware acceleration for object detection using YOLOv4 algorithm on Xilinx Zynq platform.
J. Real-Time Image Process. 2022, 19, 931–940. [CrossRef]
33. Ma, Y.; Cao, Y.; Vrudhula, S.; Seo, J.-S. Optimizing the Convolution Operation to Accelerate Deep Neural Networks on FPGA.
IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2018, 26, 1354–1367. [CrossRef]
34. Zhang, C.; Sun, G.; Fang, Z.; Zhou, P.; Pan, P.; Cong, J. Caffeine: Toward Uniformed Representation and Acceleration for Deep
Convolutional Neural Networks. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2019, 38, 2072–2085. [CrossRef]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
