AI Hardware Accelerators Overview
Artificial intelligence (AI) and deep learning have emerged as powerful forces driving
innovation across a vast spectrum of industries and applications. From revolution-
izing the way we interact with technology to tackling complex scientific challenges,
these advancements are shaping the future of our world. Fig. 1.1 shows the transfor-
mation of AI over the past decade marked by a series of groundbreaking innovations,
beginning with the introduction of AlexNet in 2012, which revolutionized the field
of deep learning. This journey has seen the development of increasingly sophis-
ticated algorithms, culminating in the latest generation of large language models
(LLMs). These advancements have significantly expanded AI capabilities, enabling
more complex and nuanced understanding and generation of human language, and
transforming a wide range of applications across industries.
Fig. 1.1 The evolution of Artificial Intelligence over the past decade, marking significant milestones
from the introduction of AlexNet in 2012, which revolutionized neural networks, to the development
of sophisticated Generative Pre-trained Transformer (GPT) models.
In the current landscape, several key factors stand out as crucial for the ongoing
advancement of AI.
• Availability of data: The explosion of data across various fields has provided
fertile ground for the training of deep learning models. From social media posts
and medical records to satellite imagery and financial transactions, the abundance
of data has been instrumental in fueling progress in AI.
• Increased computational power: Advancements in hardware, such as Graphics
Processing Units (GPUs) and specialized AI accelerator chips, have greatly en-
hanced the building and training of increasingly sophisticated neural networks.
This increase in processing capability enables faster training durations and the
handling of larger datasets, continuously expanding the horizons of AI potential.
• Deep learning architectures: Deep neural networks, including convolutional
neural networks (CNNs) and recurrent neural networks (RNNs), have excelled in
various applications such as image classification, natural language understand-
ing, and language translation. Their success in these domains has significantly
advanced the capabilities of AI in processing and interpreting visual and textual
data.
• Generative AI: This subfield of AI focuses on creating new data, such as text,
images, code, and music. Generative models like Generative Pre-trained Trans-
formers (GPTs) can generate human-quality content that is often indistinguishable
from the real thing. This has opened up exciting possibilities for creative applica-
tions, personalized content generation, and data augmentation.
Traditional computing architectures, such as the Von Neumann architecture, have long served as
the backbone of information processing. However, these architectures are increas-
ingly falling short when it comes to the specific demands of modern AI, particularly
in the realm of neural network processing.
• Bottlenecks of Traditional CPUs
Traditional Central Processing Units (CPUs) with Von Neumann architecture
face significant challenges with deep learning, primarily due to inefficiencies in
managing the high volume of multiply-accumulate (MAC) operations essential
for neural network computation. This issue is compounded by the limited parallel processing capabilities, narrow vector processing units, and comparatively low memory bandwidth of CPU architectures, which create bottlenecks in data handling.
Additionally, the complex cache hierarchies introduce latency, and the general-
purpose design of CPUs, including their instruction sets and control flows, leads
to underutilization in the predictable, repetitive tasks of neural network processing
[15]. These factors, combined with the energy-intensive nature of neural network
computations, underscore the inadequacy of traditional computing systems for
the parallel, data-heavy demands of AI workloads, especially as the depth and
complexity of neural networks continue to grow.
• GPUs: A Stepping Stone, but Not the Solution
GPUs, with their thousands of cores, excel at parallel processing, allowing them
to perform a large number of simple computations simultaneously, which is
a common requirement in AI and deep learning algorithms. Moreover, GPUs
incorporate cutting-edge memory solutions, such as High Bandwidth Memory
(HBM) and GDDR6X, to support the intensive data transfer required for these
computations. This high bandwidth enables GPUs to efficiently perform paral-
lel processing tasks, such as matrix multiplications, and accumulate operations
found in deep learning algorithms, thereby significantly speeding up AI model
training and inference processes. Despite their superiority over CPUs in AI appli-
cations, GPUs are increasingly overshadowed by their energy-intensive nature as
AI algorithms grow in complexity and data requirements. This substantial power
consumption stems from their adherence to the Von Neumann architecture, which
leads to notable limitations in power efficiency, memory bandwidth, and latency.
These constraints not only escalate operational costs but also impede the scal-
ability and sustainability of AI advancements [16]. Consequently, while GPUs
represent a significant advancement over CPUs, they are not the ultimate solution
for the evolving landscape of AI, particularly in an era where energy efficiency
and area benchmarks are paramount.
• Need for Specialized Hardware Accelerators: Tailored for AI Efficiency
Recognizing the shortcomings of traditional architectures, researchers have turned
to the development of specialized hardware accelerators for AI. These accelerators
are intended to address the unique computational requirements of AI in their
design, focusing on the following key aspects:
– High Parallelism: AI accelerators typically employ massive parallelism, utilizing hundreds or even thousands of processing cores to handle the massive number of concurrent multiply-accumulate (MAC) operations in neural network workloads (the short sketch below illustrates how quickly these MAC counts grow).
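To make these points concrete, the following minimal sketch (illustrative sizes only, using NumPy) counts the multiply-accumulate operations in even a small fully connected layer and spells out the sequential triple loop a general-purpose core would otherwise execute; it is precisely this regular, MAC-dominated, highly parallel structure that dedicated accelerators exploit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; real layers are far larger (millions of MACs each).
batch, in_features, out_features = 4, 256, 128
x = rng.normal(size=(batch, in_features))
w = rng.normal(size=(in_features, out_features))

# Every output element needs in_features multiply-accumulate (MAC) operations.
macs = batch * in_features * out_features
print(f"MACs for this single small layer: {macs:,}")   # 131,072

# The same computation as the triple loop a sequential core must grind through;
# each innermost iteration is one MAC, and the dot products are independent.
y = np.zeros((batch, out_features))
for b in range(batch):
    for o in range(out_features):
        acc = 0.0
        for i in range(in_features):
            acc += x[b, i] * w[i, o]
        y[b, o] = acc

# A parallel accelerator evaluates these independent dot products concurrently.
assert np.allclose(y, x @ w)
```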
Fig. 1.2 Taxonomy of AI algorithms: supervised learning (conventional ML: linear regression, decision trees, SVM, KNN, logistic regression, random forest; deep learning: CNNs, RNNs), unsupervised learning (clustering, PCA), reinforcement learning (Q-learning, DQN), semi/self-supervised learning, and advanced algorithms (Transformers, GANs).
The following subsections delve into the functionalities and applications of key AI algorithms, categorized into three major areas:
1 Supervised Learning
• Conventional Machine Learning:
– Linear Regression: This algorithm learns a linear relationship between fea-
tures and a continuous target variable. It operates by fitting a linear equation
to the data points, optimizing the line to minimize the sum of the squared
differences between the predicted and actual values. This straightforward yet
effective method is widely used in regression tasks such as predicting housing
prices based on factors like size and location.
– Decision Trees: This algorithm classifies data points by asking a series of
yes/no questions based on features. Each question splits the data into smaller
subsets until reaching a final classification. Decision trees offer interpretable
results and are efficient for handling large datasets.
– Support Vector Machine (SVM): SVM is a powerful classification algorithm
that finds the best hyperplane separating different classes in the feature space.
It maximizes the margin between data points of different categories, offering
robustness in classification tasks.
– K-Nearest Neighbors (KNN): The K-Nearest Neighbors (KNN) algorithm
classifies a data point based on the majority vote of its ‘k’ nearest neighbors
in the feature space. Versatile and straightforward, KNN is used for both
classification and regression tasks by analyzing the distances between data
points to find the most similar neighbors. Its simplicity and effectiveness make
it popular in various machine learning applications.
– Logistic Regression: Logistic regression is used for binary classification. It
estimates the probability that a given input point belongs to a certain class.
This is done by applying a logistic function to a linear equation, providing a
probabilistic foundation for classification.
– Random Forest: Random Forest is an ensemble learning technique that builds
multiple decision trees and combines their results for improved accuracy and
stability. It is effective for both classification and regression tasks, reducing
overfitting and enhancing predictive performance.
• Deep Learning:
– Convolutional Neural Networks (CNNs): These networks excel at image
recognition and computer vision tasks. They learn to extract features from
images by applying convolution filters over the input data. CNNs are widely
used in applications like face recognition, object detection, and medical image
analysis.
– Recurrent Neural Networks (RNNs): These networks are developed to pro-
cess sequential data like text and speech. They have internal loops that allow
them to remember information from previous inputs, making them ideal for
tasks like language modeling and machine translation.
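As a rough, library-agnostic illustration of how the conventional algorithms above operate (toy synthetic data and hand-picked parameters, chosen only for illustration), the following sketch fits least-squares linear regression in closed form, classifies a point with k-nearest neighbors, and evaluates a logistic-regression probability:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Linear regression: fit y = Xw by minimizing the sum of squared errors ---
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # closed-form least squares
print("recovered weights:", np.round(w_hat, 2))

# --- K-nearest neighbors: majority vote among the k closest training points ---
labels = (X[:, 0] > 0).astype(int)              # toy binary labels
query = rng.normal(size=3)
k = 5
dist = np.linalg.norm(X - query, axis=1)
nearest = labels[np.argsort(dist)[:k]]
print("KNN prediction:", np.bincount(nearest).argmax())

# --- Logistic regression: squash a linear score into a class probability ---
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w_logit, b = np.array([1.5, -0.7, 0.2]), 0.1    # assume weights already learned
print("P(class=1):", float(sigmoid(query @ w_logit + b)))
```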
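The deep learning models above similarly reduce to a handful of core kernels. The unoptimized sketch below (arbitrary small sizes, for illustration only) applies one convolution filter to an image and advances a vanilla recurrent cell by a single time step:

```python
import numpy as np

rng = np.random.default_rng(1)

# --- CNN building block: slide one 3x3 filter over a single-channel image ---
image = rng.normal(size=(8, 8))
kernel = rng.normal(size=(3, 3))
out = np.zeros((6, 6))
for i in range(6):
    for j in range(6):
        # each output pixel is a dot product between the filter and an image patch
        out[i, j] = np.sum(image[i:i + 3, j:j + 3] * kernel)

# --- RNN building block: one step of a vanilla recurrent cell ---
hidden_size, input_size = 4, 3
W_xh = rng.normal(size=(hidden_size, input_size))
W_hh = rng.normal(size=(hidden_size, hidden_size))
h = np.zeros(hidden_size)                 # state carried between time steps
x_t = rng.normal(size=input_size)         # current input token/frame
h = np.tanh(W_xh @ x_t + W_hh @ h)        # new state remembers past inputs
print("conv output shape:", out.shape, "| hidden state:", np.round(h, 2))
```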
2 Unsupervised Learning
Fig. 1.3 Modern Hardware Accelerators: (a) TPU v2 was unveiled at Google I/O in May 2017 [1],
(b) NVIDIA Tesla P100: The World’s First AI Supercomputing Data Center GPU [2], (c) Nvidia
H100 Hopper chip [3], (d) Intel Loihi 2 Neuromorphic Chip [4], (e) ALINX AX7Z020: SoC FPGA
Development Board [5]
Artificial intelligence (AI) algorithms are increasingly being deployed in various do-
mains, such as computer vision, natural language processing, robotics, and health-
care. However, the computational complexity and resource requirements of these
algorithms pose significant challenges for their efficient and scalable implementa-
tion on hardware platforms. In this section, we review some case studies of specific
hardware specially designed for handling fast and energy-efficient AI computation,
covering different types of AI algorithms and hardware platforms. Selected hardware
packages and boards are demonstrated in Fig. 1.3.
• Tensor Processing Unit (TPU): One example of a state-of-the-art AI algorithm
implementation on a hardware accelerator is the Google TPU (Tensor Process-
ing Unit), a custom ASIC (Application-Specific Integrated Circuit) specifically
designed to accelerate deep neural networks (DNNs). The TPU uses a systolic
array architecture, facilitating high-throughput matrix multiplication, essential
for DNN operations. It also supports reduced-precision arithmetic and data com-
pression techniques, enhancing both the performance and energy efficiency of
DNN inference. The TPU has been deployed in powering various Google ser-
vices, including Google Search, Google Photos, Google Translate, and AlphaGo.
A detailed architectural explanation of the TPU is provided later in Section 1.3.
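To convey the flavor of a systolic array and of reduced-precision arithmetic (without claiming to reproduce the TPU's actual datapath; the array size, the dataflow abstraction, and the 8-bit quantization scheme below are illustrative assumptions), consider the following sketch:

```python
import numpy as np

rng = np.random.default_rng(2)

def systolic_matmul(A, B):
    """Accumulate C = A @ B beat by beat, as a weight-stationary systolic array
    would: on each beat one column of A and one row of B flow through the grid,
    and every cell performs a single multiply-accumulate on its partial sum."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for step in range(k):                  # one "beat" of data moving through
        C += np.outer(A[:, step], B[step, :])
    return C

A = rng.normal(size=(4, 6))
B = rng.normal(size=(6, 5))
assert np.allclose(systolic_matmul(A, B), A @ B)

# Reduced precision: symmetric 8-bit quantization of the weights (illustrative).
scale = np.max(np.abs(B)) / 127.0
B_int8 = np.clip(np.round(B / scale), -127, 127).astype(np.int8)
C_quant = systolic_matmul(A, B_int8.astype(np.float64)) * scale
print("max error from int8 weights:", float(np.max(np.abs(C_quant - A @ B))))
```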
The rapid evolution of artificial intelligence (AI) has spurred an equally significant
demand for specialized hardware solutions capable of handling its increasingly
sophisticated algorithms. While traditional CPUs have played a crucial role in AI
development, they are often outmatched by the computational demands of modern
machine learning (ML) and neural network (NN) algorithms. This has led to the
rise of alternative hardware platforms specifically designed to accelerate AI tasks.
Here, we will explore these diverse hardware solutions, providing insight into their
comparative strengths and weaknesses.
A. GPUs:
– Strengths: Initially designed for rendering graphics in gaming and visual ap-
plications, GPUs have evolved significantly over the past 30 years to become
crucial for AI applications. The high data throughput and massive parallelism
of GPUs, with their hundreds of cores, allow them to excel in parallel calcula-
tions such as matrix multiplications. This makes them ideal for deep learning,
big data analytics, and genomic sequencing. In particular, GPUs are highly
effective in training AI models, where the ability to process large, similar
datasets simultaneously is essential. Extensively used in neural networks and
accelerated AI operations, GPUs provide the computational power necessary
for handling large volumes of identical or unstructured data. Today, their ad-
vanced capabilities make them a staple in data centers and cloud applications,
continuing to play a significant role in the AI revolution.
– Weaknesses: GPUs, while powerful, face challenges in multi-GPU setups due
to complexity and scalability issues. Programming for these systems requires
specialized knowledge, and communication overhead between GPUs can hin-
der performance. Memory management is critical, as large datasets may not
fit in a single GPU. Additionally, the high cost and increased power consump-
tion are significant considerations. Compatibility with AI frameworks and
diminishing performance gains with added GPUs further complicate their use.
Maintenance and limited availability of cloud services also pose challenges.
B. FPGAs (Field-Programmable Gate Arrays):
– Strengths: FPGAs, positioned between CPUs and GPUs, offer a unique blend
of benefits for AI acceleration. Unlike CPUs with limited parallelism and
GPUs that may lack power efficiency, FPGAs provide reconfigurable hard-
ware tailored to specific AI tasks. They enable custom hardware accelerators
that can be programmed to align perfectly with an AI algorithm’s require-
ments, resulting in enhanced performance and energy efficiency. FPGAs excel
due to their parallel processing capabilities, ideal for AI operations like matrix
multiplications. Additionally, they allow for bespoke hardware customization,
offloading intensive computations for faster execution. Crucially, FPGAs are
well-suited for low-latency inference in real-time applications, offering imme-
diate predictions. Their ability to be fine-tuned for specific tasks also makes
them highly energy-efficient, particularly beneficial for edge computing and
IoT applications.
– Weaknesses: Despite their versatility, FPGAs pose challenges in programming
complexity, requiring expertise in hardware description languages like VHDL
or Verilog, which can be a barrier for some AI practitioners. They also face
resource constraints, including limited logic gates, memory, and DSP blocks,
making optimization of AI algorithms within these confines a complex task.
Moreover, scalability with FPGAs in large clusters or cloud environments can
be more intricate compared to GPUs. FPGAs are highly efficient for specific
algorithms or tasks, but adapting them to different AI models may necessitate
substantial reconfiguration, limiting their flexibility in certain scenarios.
C. Neuromorphic Integrated Circuits (ICs):
– Strengths: Neuromorphic ICs, inspired by the human brain, excel in efficiency
and speed for specific AI tasks. Their design, mimicking neurons and synapses,
allows for lower power consumption and faster data processing, particularly in
pattern recognition and sensory data interpretation. This makes them ideal for
edge computing applications in AI.
– Weaknesses: The complexity of mimicking biological structures means neu-
romorphic ICs are still in the early development stages. They may not yet
match the versatility or raw power of more established platforms like GPUs
for a broad range of AI tasks. Their specialized nature also poses a challenge
for programming and integration into existing technology stacks.
D. Application-Specific Integrated Circuits (ASICs):
– Strengths: AI ASICs are custom hardware designed and optimized for specific
AI computations, allowing for the creation of innovative architectures tailored
to particular models or applications. This specialization in task execution leads
to exceptional energy efficiency and a smaller footprint compared to other
accelerators. Their design focus on specific functions also results in higher
speed and performance, making ASICs highly effective for high-performance,
energy-sensitive applications within AI. The combination of tailored architec-
ture, power efficiency, and compact size positions ASICs as a superior choice
for specialized AI computations.
– Weaknesses: The biggest drawback of ASICs is their lack of flexibility. Their
custom design locks them into performing a single function, making them
unsuitable for adapting to evolving AI algorithms and changing requirements.
Additionally, their high development cost can make them less attractive for
smaller projects.
E. Emerging Devices:
– Strengths: Emerging devices in AI hardware applications are advancing with
novel architectures and memory technologies. Process-in-memory and near-
memory computing, for instance, are changing how data is handled, bringing
computation closer to storage, thereby improving speed and efficiency. Inno-
vations in non-volatile memory technologies like ReRAM, PCM, and MRAM
are also pivotal, as they enable matrix-vector multiplication operations di-
rectly within the device, a critical operation in AI algorithms. Additionally,
new devices like FeFET (Ferroelectric Field Effect Transistor) are emerging,
promising advancements in AI hardware design due to their unique properties.
These technologies and architectures collectively represent a significant leap
forward in AI hardware, offering improvements in speed, energy efficiency,
and computing power.
– Weaknesses: Emerging device-based AI hardware accelerators, despite their
potential, face challenges like variation, aging, and device variabilities, leading
to inconsistent behavior. Process compatibility issues also present hurdles in
integrating these technologies into existing manufacturing systems. However,
engineers and physicists are actively working to optimize these devices for
AI hardware design. They are focused on addressing these variabilities and
compatibility issues to harness the full potential of these advanced technologies
for efficient and reliable AI computation.
The diverse landscape of hardware solutions for AI presents developers and re-
searchers with a range of options. Understanding the strengths and weaknesses of
each platform and carefully considering application requirements is key to selecting
the optimal hardware for successful AI implementation. As the field of AI contin-
ues to evolve, we can expect further innovations in hardware design, pushing the
boundaries of performance and efficiency to new heights.
Fig. 1.4 Dynamic Dataflow Architecture for NeuFlow (the figure is recreated from [18]).
4 Parallel Processing: NeuFlow exploits parallelism both within modules and across
images for efficient processing.
5 Smart DMA: This component handles off-chip memory interactions, ensuring
efficient data transfer and storage.
6 Real-Time Execution: The architecture is optimized for real-time vision tasks,
providing high throughput for filter-based algorithms like convolutional networks.
NeuFlow’s design allows for efficient, real-time processing of complex vision
algorithms, highlighting its potential in advanced vision systems.
The DianNao family of hardware accelerators, which falls under the ASIC category,
is designed for machine learning, especially neural networks, and includes DianNao,
DaDianNao [20], ShiDianNao [21], and PuDianNao [22]. Each architecture is special-
ized for specific aspects of machine learning, with a focus on minimizing memory
transfers and maximizing efficiency. DianNao, the first in the series, is optimized for small-footprint, high-throughput execution of neural network layers.
Fig. 1.5 DaDianNao Accelerator Architecture (the figure is recreated from [20]).
By having each tile work in parallel, DaDianNao achieves a higher processing capac-
ity, making it well-suited for more complex neural network tasks. The architecture
is depicted in Fig. 1.5.
• ShiDianNao takes a different direction, focusing on low-power applications, par-
ticularly in image processing. Its architecture is simplified compared to DianNao,
aiming to reduce energy consumption. This makes ShiDianNao ideal for embed-
ded systems where power efficiency is crucial. The design targets high-throughput
vision processing tasks, optimizing the architecture for embedded applications
that require efficient image processing capabilities.
• PuDianNao extends the capabilities of the DianNao series to support a broader
range of machine learning algorithms, not just neural networks. It includes special-
ized hardware units dedicated to different machine-learning tasks, offering a more
versatile approach to machine-learning computations. This general-purpose ori-
entation makes PuDianNao a flexible solution, capable of handling various types
of machine learning algorithms beyond the scope of traditional neural network
processing.
The Neural Processing Unit (NPU) falls under the category of a digital ASIC. The NPU
operates by transforming select segments of general-purpose programs into neu-
ral network (NN) models [17]. This process, known as the Parrot transformation,
involves identifying program segments that can tolerate approximation without sig-
nificantly impacting overall accuracy. Once transformed, these segments are executed
on the NPU, which is specially designed to efficiently process NN models. The inte-
gration of NPUs into traditional computing systems involves a co-processor model,
where NPUs work alongside standard CPUs. The CPU handles regular precise com-
putations, while the NPU accelerates approximate computations. This dual-processor
approach allows for a balance between accuracy and computational efficiency. The
architecture of the NPU, particularly the operational framework depicted in Fig. 1.6, showcases the detailed workflow of how program segments are processed. This
includes data flow, control mechanisms, and the interaction between the NPU and
the main CPU. The advantage of this setup lies in its ability to significantly improve
performance and reduce energy consumption for suitable tasks, with the trade-off
being a manageable decrease in computational accuracy. The Neural Processing
Unit (NPU) architecture is shown in Fig. 1.6. This specialized processing unit for AI
works in the following steps at the architectural level:
1 Identification of Approximable Code Segments: The CPU identifies code seg-
ments within a general-purpose program that are suitable for approximation.
2 Parrot Transformation: These identified segments are transformed into neural
network models using the Parrot transformation. This involves mapping the pro-
gram’s logic and data flow into a neural network structure.
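As a highly simplified, illustrative sketch of the Parrot idea (the target function, network size, and training loop below are assumptions chosen for illustration, not the implementation from [17]), the following trains a small multilayer perceptron to mimic an approximable code region, which an NPU-style co-processor could then evaluate cheaply:

```python
import numpy as np

rng = np.random.default_rng(3)

def exact_kernel(x):
    """Stand-in for an approximable program region (hypothetical example)."""
    return np.sin(x[:, 0]) + 0.5 * x[:, 1] ** 2

# 1. Collect input/output observations from the original code region.
X = rng.uniform(-2, 2, size=(2000, 2))
y = exact_kernel(X)

# 2. "Parrot"-style step: train a tiny MLP (2-16-1) to mimic the region.
W1, b1 = rng.normal(scale=0.5, size=(2, 16)), np.zeros(16)
W2, b2 = rng.normal(scale=0.5, size=(16, 1)), np.zeros(1)
lr = 0.1
for _ in range(3000):
    h = np.tanh(X @ W1 + b1)
    pred = (h @ W2 + b2).ravel()
    err = pred - y
    # plain backpropagation for the two-layer network
    gW2 = h.T @ err[:, None] / len(X)
    gb2 = err.mean(keepdims=True)
    dh = (err[:, None] @ W2.T) * (1 - h ** 2)
    gW1 = X.T @ dh / len(X)
    gb1 = dh.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

# 3. At run time the NPU-style co-processor evaluates the cheap surrogate;
#    the surrogate should roughly track the exact values after training.
x_test = rng.uniform(-2, 2, size=(5, 2))
approx = (np.tanh(x_test @ W1 + b1) @ W2 + b2).ravel()
print("exact :", np.round(exact_kernel(x_test), 3))
print("approx:", np.round(approx, 3))
```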
Fig. 1.6 The Neural Processing Unit (NPU) Architecture (a) 8-Processing engine (PE) NPU, (b)
Single processing engine (PE) (the figure is recreated from [17]).
Fig. 1.7 The RENO Architecture (The figure is recreated from [19]).
Fig. 1.8 Neurocube Architecture (left) and Processing Elements (PEs) Organization (right) (the
figure is recreated from [24]).
Fig. 1.9 The PRIME Framework Illustrated. On the left, the memory bank layout is shown, with
bold blue and red lines indicating pathways for standard memory operations and computational
tasks. On the right, enhancements specific to PRIME are highlighted: (A) a wordline driver with
multiple voltage levels; (B) a column multiplexer with analog subtraction and sigmoid functions;
(C) a versatile sense amplifier with counters for multi-level signals, including ReLU functions and
4-to-1 max pooling; (D) the linkage between flip-flop (FF) and Buffer subarrays; (E) the PRIME
management unit. (the figure is recreated from [28])
PRIME's in-memory design avoids the costly and energy-intensive data transfers between separate processing and storage units.
Furthermore, the PIM approach inherently minimizes data movement, significantly
cutting down on energy consumption and improving operational efficiency by pro-
cessing data directly within the memory arrays. PRIME’s configurable ReRAM
arrays provide remarkable flexibility, supporting various neural network architec-
tures and accommodating a wide range of computational needs, enhancing the
architecture’s utility across different application scenarios. However, PRIME also
has several limitations, primarily due to its reliance on ReRAM technology. The
integration of processing and memory in ReRAM arrays increases the complexity
of design and fabrication, posing challenges in integrating ReRAM technology at a
large scale while ensuring consistent performance. Additionally, ReRAM is not as
standardized as CMOS technology and may face reliability issues such as retention,
cycle-to-cycle, and device-to-device variations, which can affect the performance
and predictability of PRIME. The architecture is specifically designed for neural
network computations, limiting its effectiveness for other types of computations and
restricting its utility in more general-purpose computing scenarios. Lastly, ReRAM
technologies are known for variability and stability challenges, including retention
time, endurance, and variability in electrical characteristics, which can lead to perfor-
mance inconsistencies and impact the overall efficiency and reliability of the PRIME
architecture.
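The sketch below illustrates, in idealized form, how a ReRAM crossbar evaluates a matrix-vector product in the analog domain: input voltages applied to the rows produce column currents that sum voltage-conductance products. The conductance range, the weight-to-conductance mapping, and the injected device variation are illustrative assumptions rather than PRIME's actual circuit parameters:

```python
import numpy as np

rng = np.random.default_rng(4)

# Weight matrix of one neural-network layer, mapped onto crossbar conductances.
W = rng.normal(size=(8, 4))                        # 8 inputs x 4 outputs
g_min, g_max = 1e-6, 1e-4                          # assumed conductance range (S)
w_min, w_max = W.min(), W.max()
G = g_min + (W - w_min) / (w_max - w_min) * (g_max - g_min)

# Device-to-device variation: each programmed conductance deviates slightly.
G_actual = G * (1 + 0.05 * rng.normal(size=G.shape))

x = rng.uniform(0, 1, size=8)                      # input activations as voltages

# Kirchhoff's current law: each column current is sum_i V_i * G_ij,
# i.e. an analog dot product computed inside the memory array.
I_real = x @ G_actual

def currents_to_output(I):
    # Undo the linear conductance mapping to recover the layer's pre-activations.
    return (I - g_min * x.sum()) / (g_max - g_min) * (w_max - w_min) + w_min * x.sum()

print("digital  :", np.round(x @ W, 3))
print("crossbar :", np.round(currents_to_output(I_real), 3))
```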
Fig. 1.10 Tensor Processing Unit (TPU) Architecture Block Diagram (the figure is recreated from
[23])
Fig. 1.11 Eyeriss v2 Top Level Architecture (the figure is recreated from [27]).
Eyeriss v2, shown in Fig. 1.11, is a successor to the Eyeriss architecture [26] im-
plemented as an ASIC. It is tailored for efficient processing of compact and sparse
Deep Neural Networks (DNNs) and was published in 2019 by a research group at
MIT [27]. It addresses the challenges posed by varying layer shapes and sizes and
the demand for energy-efficient hardware in mobile devices. The architecture details
are given below:
1 Hierarchical Mesh Network (HM-NoC): A key feature of EyerissV2, HM-NoC
adapts to different bandwidth and data reuse requirements. It efficiently handles
the data flow across the processor, enhancing throughput and energy efficiency.
2 Processing Element (PE) Design: The architecture includes PEs tailored for pro-
cessing sparse DNNs. It utilizes Compressed Sparse Column (CSC) format for
data storage and movement, reducing storage costs and improving energy effi-
ciency.
3 Row-Stationary (RS) Dataflow: EyerissV2 adopts RS dataflow, optimizing data
reuse and minimizing data movement. This approach is particularly effective for
the compact DNN layers found in modern mobile-oriented networks.
4 Run-Length Coding (RLC) in EyerissV2: RLC, a compression technique, is not
explicitly mentioned in EyerissV2’s architecture. However, the use of CSC for data
compression can be seen as a related approach to handling sparse data efficiently.
This methodology plays a crucial role in optimizing data storage and transfers in
sparse DNNs, directly impacting the overall performance and energy efficiency
of EyerissV2.
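To make the sparse-data handling concrete, the sketch below builds a generic Compressed Sparse Column (CSC) representation of a small, mostly zero weight matrix and uses it to skip every zero multiplication in a matrix-vector product; it illustrates the format in general rather than Eyeriss v2's exact on-chip encoding:

```python
import numpy as np

# A small sparse weight matrix such as those produced by pruning.
W = np.array([
    [0, 3, 0, 0],
    [2, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 4, 0, 5],
], dtype=float)

# Compressed Sparse Column: store only nonzero values, their row indices,
# and one pointer per column marking where that column's entries begin.
values, row_idx, col_ptr = [], [], [0]
for j in range(W.shape[1]):
    for i in range(W.shape[0]):
        if W[i, j] != 0:
            values.append(W[i, j])
            row_idx.append(i)
    col_ptr.append(len(values))

print("values :", values)        # [2.0, 3.0, 4.0, 1.0, 5.0]
print("row_idx:", row_idx)       # [1, 0, 3, 1, 3]
print("col_ptr:", col_ptr)       # [0, 1, 3, 3, 5]

# Sparse matrix-vector product: only nonzero weights generate MACs.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.zeros(W.shape[0])
for j in range(W.shape[1]):
    for p in range(col_ptr[j], col_ptr[j + 1]):
        y[row_idx[p]] += values[p] * x[j]

assert np.allclose(y, W @ x)
```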
Eyeriss v2 offers several advantages for deep neural network (DNN) processing, particularly on mobile devices. It excels in energy-efficient processing, a crucial feature
for battery-powered mobile applications. The architecture is specifically designed to
handle the complexities and challenges associated with compact and sparse DNNs,
which are becoming increasingly common in mobile technology. Additionally, Ey-
eriss V2’s Hierarchical Mesh Network-on-Chip (HM-NoC) provides flexible data
handling, ensuring high throughput and efficiency across various DNN structures.
However, Eyeriss V2 also presents certain limitations. The advanced architecture,
particularly the HM-NoC, introduces significant design and implementation com-
plexities. Furthermore, being tailored for compact and sparse DNNs, Eyeriss V2
may have limited applicability for other types of neural networks or general-purpose
computing tasks.
Fig. 1.12 Illustration of the CompAct Architecture. The RLC encoding and decoding blocks ensure that activations are always stored on-chip in a compressed form (the figure is recreated from [25]).
5 Systolic Array-Based Design: The design utilizes a systolic array structure, which
is advantageous for parallel processing in neural network computations.
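The RLC idea referred to in Fig. 1.12 and discussed below can be illustrated with a simple zero-run encoder: post-ReLU activation streams contain long runs of zeros, so storing (zero-run, value) pairs shrinks buffer traffic. The format shown is a simplified stand-in, not the bit-level scheme of [25]:

```python
import numpy as np

def rlc_encode(stream):
    """Encode a 1-D activation stream as (zero_run, nonzero_value) pairs."""
    pairs, run = [], 0
    for v in stream:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    if run:                        # trailing zeros, paired with a zero sentinel
        pairs.append((run, 0))
    return pairs

def rlc_decode(pairs):
    out = []
    for run, v in pairs:
        out.extend([0] * run)
        if v != 0:
            out.append(v)
    return out

rng = np.random.default_rng(5)
acts = np.maximum(rng.normal(size=32), 0).round(2)   # ReLU output: roughly half zeros
encoded = rlc_encode(acts.tolist())
assert rlc_decode(encoded) == acts.tolist()
print(f"{len(acts)} activations -> {len(encoded)} (run, value) pairs")
```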
CompAct’s integration of Run-Length Coding (RLC) for on-chip compression
addresses the specific challenge of memory access energy in Convolutional Neural
Networks (CNNs). This architecture stands out due to its ability to switch between
Sparse and Lossy RLC, offering a customizable approach to balance compression
and accuracy. CompAct significantly enhances energy efficiency by reducing the
activation buffer energy, leading to lower total chip energy consumption. Its tailored
RLC scheme effectively minimizes the memory footprint, enhancing performance
without sacrificing accuracy. Furthermore, the architecture’s adaptability to differ-
ent CNN workloads makes it versatile and efficient in various application scenarios.
However, the sophisticated integration of RLC and its variations introduces complex-
ity in design and implementation. Additionally, especially with Lossy-RLC, there
is a potential trade-off between compression efficiency and the accuracy of neural
network outputs. In summary, CompAct is a sophisticated and efficient accelerator
• Energy Efficiency: One strategy is to use low-power on-chip memory technologies such as SRAM or non-volatile memory, which consume less power than tradi-
tional DRAM. Clock gating and power gating techniques can also be employed to
reduce power usage in idle circuitry. Additionally, optimizing the memory hier-
archy to reduce access to higher power-consuming memory units, and employing
approximate computing methods where high precision is not required, can fur-
ther lower energy consumption. These strategies not only reduce the power usage
during intensive computations but also minimize the overall energy footprint
of the hardware, making the accelerators more suitable for energy-constrained
environments.
• Scalability: Scalability in neural network accelerators is critical to adapt to the
diverse requirements of different neural network models. This extends beyond
simple modular design to encompass a range of techniques aimed at ensuring
the accelerator can handle networks of varying sizes and complexities. Adaptive
precision, where the computational precision is adjusted based on the requirement
of the task, is one approach. This not only aids in resource management but also
in energy efficiency. Scalability also involves employing advanced interconnect
architectures that can efficiently manage the increased data flow in larger networks.
These architectures should support both vertical scaling, where the capabilities
of an individual unit are enhanced, and horizontal scaling, where more units
are added to increase throughput. Another aspect is the use of reconfigurable
computing elements, like FPGAs, which allow the accelerator to adapt to different
network topologies. Additionally, the integration of heterogeneous computing
elements, combining CPUs, GPUs, and custom ASICs, can provide the flexibility
to optimize performance for various types of neural network computations. Lastly,
scalability must also consider the software stack, ensuring that the accelerator’s
hardware is fully utilized through efficient software frameworks and libraries that
can adapt to the changing hardware configurations.
• Hardware-Software Synergy: Optimal performance in neural network acceler-
ators hinges on the deep integration between hardware and software. This goes
beyond using specialized hardware instructions or adapting hardware to spe-
cific algorithms. It involves developing software that can dynamically interact
with hardware features, including real-time reconfigurability and adaptability to
various computational loads. Advanced machine learning techniques, such as
hardware-aware neural architecture search (NAS), play a pivotal role in this co-
optimization process, allowing for the simultaneous design of both hardware
and software to maximize performance and efficiency. Additionally, the synergy
should consider the use of compiler optimizations that can translate high-level
neural network models into efficient hardware-specific instructions, bridging the
gap between algorithmic design and hardware execution. This also includes the
development of robust APIs and SDKs that provide an abstraction layer, enabling
developers to efficiently utilize hardware resources without needing deep hard-
ware expertise. Furthermore, embracing emerging technologies like in-memory
computing and neuromorphic computing in the hardware design can open new
avenues for software to exploit these novel architectures, leading to breakthroughs
in processing speed and energy efficiency. This comprehensive approach ensures
that the hardware is not only capable of running current neural network models
efficiently but is also future-proofed against the evolving landscape of machine
learning algorithms.
• Processing Parallelism: Handling the vast parallel computations in neural net-
works demands architectures capable of concurrent processing. This extends
beyond mere multiple processing elements working in unison. One example is
the use of systolic arrays, as seen in Google’s TPU, where data flows across a
grid of processors, enabling highly efficient matrix operations. Another approach
is the employment of SIMD (Single Instruction, Multiple Data) architectures,
commonly found in GPUs, which are adept at handling vectorized operations.
Additionally, techniques like pipelining can be leveraged, where different stages
of a computation are overlapped, and data parallelism, where the same opera-
tion is performed on different data sets in parallel. More advanced techniques
include spatio-temporal parallelism, which combines both spatial and temporal
aspects to maximize resource utilization. This approach can be particularly ef-
fective in recurrent neural network (RNN) computations, where both spatial and
temporal data dependencies are prevalent. Moreover, emerging technologies like
neuromorphic computing, which mimics the parallel processing capabilities of
the human brain, present new frontiers in parallel processing architecture. These
technologies offer the potential for massively parallel processing capabilities, far
exceeding traditional computing architectures.
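As a simple, framework-agnostic illustration of data parallelism from the list above (the linear model, learning rate, and worker count are arbitrary choices made for illustration), the sketch below splits a mini-batch across simulated workers, computes a partial gradient on each shard, and averages the results as an all-reduce would on real hardware:

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy linear model y = Xw with a mean-squared-error loss.
N, D, workers = 128, 8, 4
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true
w = np.zeros(D)

def grad(X_shard, y_shard, w):
    """Gradient of 0.5 * mean((Xw - y)^2) on one worker's shard of the batch."""
    err = X_shard @ w - y_shard
    return X_shard.T @ err / len(y_shard)

for step in range(200):
    # Data parallelism: each worker processes an independent slice of the batch...
    shards = zip(np.array_split(X, workers), np.array_split(y, workers))
    grads = [grad(Xs, ys, w) for Xs, ys in shards]
    # ...and the partial gradients are combined (an all-reduce on real hardware).
    w -= 0.1 * np.mean(grads, axis=0)

print("max weight error after training:", float(np.max(np.abs(w - w_true))))
```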
In summary, AI accelerator designs must be evaluated against the specific needs of their target applications to choose the most appropriate design trade-offs. By
understanding and leveraging these various architectural features, designers can
significantly enhance the computational efficiency of AI accelerators.
The explosive growth of deep learning applications has fueled the demand for effi-
cient hardware accelerators. Traditional processors struggle with the immense com-
putational demands of deep learning algorithms, leading to high energy consumption
and long latency. Hardware accelerators offer a promising solution by exploiting the
inherent parallelism and data-intensive nature of these algorithms. However, design-
ing efficient hardware accelerators for deep learning requires careful consideration of
various optimization techniques. We will discuss these techniques in this section,
exploring their methodologies and impact on accelerator performance.
Hardware accelerators are used in various applications where deep learning models
play a critical role. These applications include:
The Diverse Applications of Hardware Accelerators for Deep Learning
Deep learn-
ing models have revolutionized countless industries, but their complex computational
needs pose significant challenges. Hardware accelerators, specialized hardware de-
signed to efficiently execute these models, are emerging as the key to unlocking their
full potential. By dramatically reducing training times and inference speeds, they
enable faster innovation and deployment across diverse fields.
1 Computer Vision:
– Image and Video Recognition: From facial recognition in security systems
to automatic image tagging on social media platforms, hardware accelerators
power the real-time analysis of vast amounts of visual data.
– Object Detection: Self-driving cars rely on hardware accelerators for real-time
object detection on the road, ensuring safe navigation. Similarly, retail stores
utilize these tools for inventory management and theft prevention.
– Autonomous Vehicles: The development of autonomous vehicles hinges on
the efficient processing of sensor data. Hardware accelerators enable real-time
obstacle detection and path planning, making self-driving cars a reality.
2 Natural Language Processing:
– Machine Translation: Breaking down language barriers is becoming easier
with hardware accelerators. They power real-time translation tools, facilitating
communication across cultures and languages.
– Sentiment Analysis: Businesses rely on sentiment analysis to understand cus-
tomer opinions and improve their products and services. Hardware accelerators
accelerate this process, allowing for faster and more accurate insights.
– Chatbot Development: Chatbots are taking over customer service and online
interactions. Hardware accelerators enable natural and engaging conversations
by powering the complex algorithms behind these virtual assistants.
3 Speech Recognition:
– Voice Assistants: From Siri to Alexa, voice assistants are changing the way we
interact with technology. Hardware accelerators power these tools, enabling
natural language understanding and accurate response generation.
– Smart Speakers: Smart speakers are revolutionizing the way we listen to music
and access information. Hardware accelerators make these devices responsive
and intelligent, providing users with a seamless experience.
– Transcription Services: Transcribing audio and video content is becoming
easier and more efficient with hardware acceleration. This technology is trans-
forming workflows in various industries, including education, media, and law.
4 Robotics:
6 Finance:
– Fraud Detection: The financial industry relies on robust fraud detection systems
to protect against financial crimes. Hardware accelerators enable real-time
analysis of financial transactions, preventing fraudulent activities.
– Risk Assessment: Evaluating financial risk is crucial for investors and lenders.
Hardware accelerators provide faster and more accurate risk assessments,
leading to better financial decisions.
– Algorithmic Trading: High-frequency trading relies on split-second decision-
making. Hardware accelerators enable faster execution of trades, providing a
competitive edge in the financial markets.
7 Internet of Things (IoT):
– Edge Computing: Processing data directly at the edge of the network is crucial
for many IoT applications. Hardware accelerators enable real-time analysis of
sensor data, leading to faster response times and improved decision-making.
– Sensor Data Analysis: The vast amount of data generated by IoT devices needs
to be processed efficiently. Hardware accelerators enable real-time analysis of
this data, unlocking valuable insights and enabling smarter devices.
– Anomaly Detection: Identifying unusual patterns in sensor data is critical
for various applications, such as predictive maintenance and fault detection.
Hardware accelerators power these anomaly detection algorithms, ensuring
timely identification and response to potential problems.
1.6 Conclusion
This chapter has provided an overview of AI hardware accelerators, including GPUs, FPGAs, and ASICs, emphasizing their pivotal roles in augmenting the efficiency
and performance of AI tasks. The chapter also encapsulates the challenges en-
countered in the development of these accelerators, their influence on advancing
AI capabilities, and the anticipation of future innovations in this domain. By
weaving together technical insights, practical examples, and forward-looking per-
spectives, the chapter aims to furnish a nuanced understanding of the current state
and emerging trends in AI hardware development, catering to a diverse audience
ranging from industry practitioners to academic researchers.
References
14. Ting-Chang Chang, Kuan-Chang Chang, Tsung-Ming Tsai, Tian-Jian Chu, Simon M. Sze,
Resistance random access memory, Materials Today, Volume 19, Issue 5, 2016, Pages 254-264,
ISSN 1369-7021, https://doi.org/10.1016/j.mattod.2015.11.009.
15. Capra, M.; Bussolino, B.; Marchisio, A.; Shafique, M.; Masera, G.; Martina, M. An Updated
Survey of Efficient Hardware Architectures for Accelerating Deep Convolutional Neural Net-
works. Future Internet 2020, 12, 113. https://doi.org/10.3390/fi12070113
16. E. Nurvitadhi, D. Sheffield, Jaewoong Sim, A. Mishra, G. Venkatesh and D. Marr, ”Acceler-
ating Binarized Neural Networks: Comparison of FPGA, CPU, GPU, and ASIC,” 2016 Inter-
national Conference on Field-Programmable Technology (FPT), Xi’an, China, 2016, pp. 77-
84, doi: 10.1109/FPT.2016.7929192.
17. H. Esmaeilzadeh, A. Sampson, L. Ceze and D. Burger, ”Neural Acceleration for General-
Purpose Approximate Programs,” 2012 45th Annual IEEE/ACM International Sympo-
sium on Microarchitecture, Vancouver, BC, Canada, 2012, pp. 449-460, doi: 10.1109/MI-
CRO.2012.48.
18. C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello and Y. Le-
Cun, ”NeuFlow: A runtime reconfigurable dataflow processor for vision,” CVPR
2011 WORKSHOPS, Colorado Springs, CO, USA, 2011, pp. 109-116, doi:
10.1109/CVPRW.2011.5981829.
19. X. Liu et al., ”RENO: A high-efficient reconfigurable neuromorphic computing accelerator
design,” 2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC), San Fran-
cisco, CA, USA, 2015, pp. 1-6, doi: 10.1145/2744769.2744900.
20. Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and
Olivier Temam. 2014. DianNao: a small-footprint high-throughput accelerator for ubiqui-
tous machine-learning. SIGARCH Comput. Archit. News 42, 1 (March 2014), 269–284.
https://doi.org/10.1145/2654822.2541967
21. Z. Du et al., ”ShiDianNao: Shifting vision processing closer to the sensor,” 2015 ACM/IEEE
42nd Annual International Symposium on Computer Architecture (ISCA), Portland, OR,
USA, 2015, pp. 92-104, doi: 10.1145/2749469.2750389.
22. Daofu Liu, Tianshi Chen, Shaoli Liu, Jinhong Zhou, Shengyuan Zhou, Olivier Te-
mam, Xiaobing Feng, Xuehai Zhou, and Yunji Chen. 2015. PuDianNao: A Polyva-
lent Machine Learning Accelerator. SIGPLAN Not. 50, 4 (April 2015), 369–381.
https://doi.org/10.1145/2775054.2694358
23. Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia,
S., Boden, N., Borchers, A., Boyle, R., Cantin, P., Chao, C., Clark, C., Coriell, J., Daley, M.,
Dau, M., Dean, J., Gelb, B., Yoon, D. H. (2017). In-Datacenter Performance Analysis of
a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on
Computer Architecture. ISCA ’17: The 44th Annual International Symposium on Computer
Architecture. ACM. https://doi.org/10.1145/3079856.3080246
24. D. Kim, J. Kung, S. Chai, S. Yalamanchili and S. Mukhopadhyay, ”Neurocube: A Pro-
grammable Digital Neuromorphic Architecture with High-Density 3D Memory,” 2016
ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul,
Korea (South), 2016, pp. 380-392, doi: 10.1109/ISCA.2016.41.
25. Jeff (Jun) Zhang, Parul Raj, Shuayb Zarar, Amol Ambardekar, and Siddharth Garg. 2019.
CompAct: On-chip Compression of Activations for Low Power Systolic Array Based CNN
Acceleration. ACM Trans. Embed. Comput. Syst. 18, 5s, Article 47 (October 2019), 24 pages.
https://doi.org/10.1145/3358178
26. Y. -H. Chen, T. Krishna, J. S. Emer and V. Sze, ”Eyeriss: An Energy-Efficient Reconfig-
urable Accelerator for Deep Convolutional Neural Networks,” in IEEE Journal of Solid-
State Circuits, vol. 52, no. 1, pp. 127-138, Jan. 2017, doi: 10.1109/JSSC.2016.2616357.
27. Y. -H. Chen, T. -J. Yang, J. Emer and V. Sze, ”Eyeriss v2: A Flexible Accelerator for Emerg-
ing Deep Neural Networks on Mobile Devices,” in IEEE Journal on Emerging and Selected
Topics in Circuits and Systems, vol. 9, no. 2, pp. 292-308, June 2019, doi: 10.1109/JET-
CAS.2019.2910232.
28. P. Chi et al., ”PRIME: A Novel Processing-in-Memory Architecture for Neural Net-
work Computation in ReRAM-Based Main Memory,” 2016 ACM/IEEE 43rd Annual In-
ternational Symposium on Computer Architecture (ISCA), Seoul, Korea (South), 2016,
pp. 27-39, doi: 10.1109/ISCA.2016.13.
29. X. Dong, C. Xu, Y. Xie and N. P. Jouppi, ”NVSim: A Circuit-Level Performance, Energy,
and Area Model for Emerging Nonvolatile Memory,” in IEEE Transactions on Computer-
Aided Design of Integrated Circuits and Systems, vol. 31, no. 7, pp. 994-1007, July 2012,
doi: 10.1109/TCAD.2012.2185930.
30. X. Peng, S. Huang, Y. Luo, X. Sun and S. Yu, ”DNN+NeuroSim: An End-to-End Benchmark-
ing Framework for Compute-in-Memory Accelerators with Versatile Device Technologies,”
2019 IEEE International Electron Devices Meeting (IEDM), San Francisco, CA, USA, 2019,
pp. 32.5.1-32.5.4, doi: 10.1109/IEDM19573.2019.8993491.
31. Pentecost, Lillian, et al. “NVMExplorer: A Framework for Cross-Stack Com-
parisons of Embedded Non-Volatile Memories.” 2022 IEEE International Sympo-
sium on High-Performance Computer Architecture (HPCA), IEEE, 2022. Crossref,
https://doi.org/10.1109/hpca53966.2022.00073.
32. Agrawal, Amogh (2021). Compute-in-Memory Primitives for Energy-
Efficient Machine Learning. Purdue University Graduate School. Thesis.
https://doi.org/10.25394/PGS.15048825.v1
33. V. Seshadri et al., ”Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Com-
modity DRAM Technology,” 2017 50th Annual IEEE/ACM International Symposium on
Microarchitecture (MICRO), Boston, MA, USA, 2017, pp. 273-287.