Article
Hardware Considerations for Tensor Implementation
and Analysis Using the Field Programmable
Gate Array
Ian Grout 1, * and Lenore Mullin 2
1 Department of Electronic and Computer Engineering, University of Limerick, V94 T9PX Limerick, Ireland
2 Department of Computer Science, College of Engineering and Applied Sciences, University at Albany,
State University of New York, Albany, NY 12222, USA; [email protected]
* Correspondence: [email protected]; Tel.: +353-61-202-298
Received: 18 October 2018; Accepted: 8 November 2018; Published: 13 November 2018
Abstract: In today’s complex embedded systems targeting internet of things (IoT) applications,
there is a greater need for embedded digital signal processing algorithms that can effectively and
efficiently process complex data sets. A typical application considered is for use in supervised and
unsupervised machine learning systems. With the move towards lower power, portable, and embedded
hardware-software platforms that meet the current and future needs for such applications, there is
a requirement on the design and development communities to consider different approaches to
design realization and implementation. Typical approaches are based on software programmed
processors that run the required algorithms on a software operating system. Whilst such approaches
are well supported, they can lead to solutions that are not necessarily optimized for a particular
problem. A consideration of different approaches to realize a working system is therefore required,
and hardware based designs rather than software based designs can provide performance benefits
in terms of power consumption and processing speed. In this paper, consideration is given to
utilizing the field programmable gate array (FPGA) to implement a combined inner and outer product
algorithm in hardware that utilizes the available hardware resources within the FPGA. These products
form the basis of tensor analysis operations that underlie the data processing algorithms in many
machine learning systems.
1. Introduction
Embedded system applications are today demanding greater levels of digital signal processing
(DSP) capabilities whilst providing low-power operation and with reduced processing times for
complex signal processing operations found typically in machine learning [1] systems. For example,
facial recognition [2] for safety- and security-conscious applications is a noticeable everyday example,
and many smartphones today incorporate facial recognition software applications for phone and
software app access. Embedded environmental sensors, as an alternative application, can input
multiple sensor data values over a period of time and, using DSP algorithms, can analyze the data
and autonomously provide specific outcomes. Although these applications may differ, within the
system hardware and software, these are simply algorithms accessing data values that need to be
processed. The system does not need to know the context of the data it is obtaining. Data processing is
rather concerned with how effectively and efficiently it can obtain, store, and process the data before
transmitting a result to an external system. This requires not only an understanding of regular access
patterns in important internet of things (IoT) algorithms, but also an ability to identify similarities
amongst such algorithms. Research presented herein shows how scalar operations, such as plus and
times, extended to all scalar operations, can be defined in a single circuit that implements all scalar
operations extended to: (i) n-dimensional tensors (arrays); (ii) the inner product (matrix multiply is a 2-d instance) and the outer product, both on n-dimensional arrays (the Kronecker Product is a 2-d instance); and (iii) compressions, or reductions, over arbitrary dimensions. However, even more relationships exist. One of the most compute intensive operations in IoT is the Khatri-Rao, or parallel Kronecker Product, which, from the perspective of this research, is an outer product projected to a matrix, enabling contiguous reads and writes of data values at machine speeds.
In terms of the data, when this data is obtained, it must be stored in the available memory. This will be a mixture of cache memory within a suitably selected software programmed processor (microcontroller (µC), microprocessor (µP), or digital signal processor (DSP)), locally connected external volatile or non-volatile memory, memory connected to the processor via a local area network (LAN), or some form of Cloud based memory (Cloud storage). Identifying what to use and when is the challenge. Ideally, the data would be stored in specific memory locations so that the processor can optimally access the stored input data, process the data, and store the result (the output data) again in memory, either in suitable new locations or by overwriting existing data in already utilized memory. Knowing and anticipating cache memory misses, for example, enables a design that minimizes overheads such as signal delays, energy, heat, and power.
In many embedded systems implemented today, the software programmed processor is the commonly used programmable device to perform complex tasks and interface to input and output systems. The software approach has been developed over a number of years and is supported through tools (usually available via an integrated development environment (IDE)) and programming language constructs, providing the necessary syntax and semantics to perform the required complex tasks. However, increasingly, the programmable logic device (PLD) [3], which allows a hardware configuration to be downloaded into the device in terms of digital logic operations, is utilized. Figure 1 shows the target device choices available to the designer today. Alternatively, an application specific integrated circuit (ASIC) solution, whereby a custom integrated circuit is designed and fabricated, could be considered. Design goals include not only semantic, denotational, and functional descriptions of a circuit, but also an operational description (how to build the circuit and associated memory relative to access patterns of important algorithms).
Figure 1. Programmable/configurable device choices for implementing digital signal processing operations in hardware and software.
Electronics 2018, 7, 320 3 of 24
In this paper, consideration is given to a general algorithm, and the resultant circuit, for an
n-dimensional inner and outer product. This algorithm (circuit) builds upon scalar operations,
thus creating a single IP (intellectual property) core that utilizes an efficient memory access algorithm.
The field programmable gate array (FPGA) is used as the target hardware and the Xilinx® [4] Artix-7 [5]
device is utilized in this case study. The two algorithms, the matrix multiplication, and Tensor Product
(Kronecker Product), are foundational to essential algorithms in AI and IoT. The paper is presented in a
way to discuss the necessary links between the computer science (algorithm design and development)
and the engineering (circuit design, implementation, test, and verification) actions that need to be
undertaken as a single, combined approach to system realization.
The paper is structured as follows. Section 2 will introduce and discuss algorithms for complex
data analysis with a focus on tensor [6] analysis. An approach using tensor based computations with n-dimensional data arrays that are to be developed and processed is introduced. Section 3 will discuss
memory considerations for tensor analysis operations, and Section 4 will introduce the use of the FPGA
in implementing hardware and hardware/software co-design realizations of tensor computations.
Section 5 will provide a case study design created using the VHDL (Very High Speed Integrated Circuit
(VHSIC) Hardware Description Language (HDL)) [7] for synthesis and implementation within the
FPGA. The design architecture, simulation results, and physical prototype test results are presented,
along with a discussion into implementation possibilities. Section 6 will conclude the paper.
2.1. Introduction
In this section, data structures using tensor notation are introduced and discussed with the need
to consider and implement high performance computing (HPC) applications [8], such as required in
artificial intelligence (AI), machine learning (ML), and deep learning (DL) systems [9]. The section
commences with an introduction to tensors, followed by a discussion of the use of tensors
in HPC applications. The algorithms foundational to IoT (Matrix Multiply, Kronecker Product,
and Compressions (Reductions)) are targeted with the need for a unified n-dimensional inner and
outer product circuit that can optimally identify and access suitable memories to store input and
processed data.
Scalars are rank 0 tensors, and vectors (rank 1 tensors) have magnitude and direction (e.g., the velocity of a moving object: speed and direction of motion). Matrices
(n × m arrays) have two dimensions and are rank 2 tensors. A three-dimensional (n × m × p) array can
be visualized as a cube and is a rank 3 tensor. Tensors with ranks greater than 3 can readily be created, and analysis of the data they hold would be performed by accessing the appropriate element within the tensor and performing a suitable mathematical operation before storing the result in another tensor.
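The rank-n view of tensors described above can be sketched in plain Python (an illustrative sketch only; the helper names `shape` and `map_elements` are mine, not the paper's):

```python
# Illustrative sketch: tensors of increasing rank represented as nested
# Python lists, with an element-wise operation applied to every element
# and the results stored in a new tensor.

def shape(t):
    """Return the shape of a nested-list tensor (a scalar has shape ())."""
    s = []
    while isinstance(t, list):
        s.append(len(t))
        t = t[0]
    return tuple(s)

def map_elements(t, f):
    """Apply scalar function f to every element, returning a new tensor."""
    if isinstance(t, list):
        return [map_elements(x, f) for x in t]
    return f(t)

vector = [1.0, 2.0, 3.0]                                  # rank 1
matrix = [[1, 2], [3, 4]]                                 # rank 2 (n x m)
cube   = [[[0] * 2 for _ in range(2)] for _ in range(2)]  # rank 3 (n x m x p)

print(shape(cube))                            # (2, 2, 2)
print(map_elements(matrix, lambda x: x * x))  # [[1, 4], [9, 16]]
```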
In a physical realization of this process, the tensor data would be stored in a suitable size
memory, the data would be accessed (typically using a software programmed processor), and the
computation would be undertaken using fixed- or floating-point arithmetic. This entire process should,
ideally, stream data contiguously, and ideally anticipate where cache memory misses might occur,
thus minimizing overhead up and down the memory hierarchy. For example, in an implementation
using cache memory, an L1 cache memory miss could also miss in L2 and L3, and fault in page memory.
A tensor rank, or a tensor’s dimensionality, can be thought of in at least two ways. The more
traditional way being, as the number of rows and columns change in a matrix, so does the dimension.
Even with that perspective, computation methods often decompose such matrices into blocks.
Conceptually, this can be thought of as “lifting” the dimension of an array. Further blocking “lifts”
the dimension even more. Another way of viewing a tensor’s dimensionality is by the number of
arguments in a function input over time. The most general way to view dimensionality is to combine
these attributes. The idealized methods for formulating architectural components are chosen to match
the arithmetic and memory access patterns of the algorithms under investigation. In this paper,
the n-dimensional inner and outer products are considered. Thus, in this case, what might be thought
of as a two-dimensional problem can be lifted to perhaps eight or more dimensions to reflect a physical
implementation, considering the memory as registers, the levels of cache memory, RAM (random
access memory), and HDD (hard disk drive). With that formulation, it is possible to create deterministic
cost functions validated by experimentation, and, ideally, an idealized component construction can be
realized that meets desired goals, such as heat dissipation, time, power, and hardware cost.
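The "dimension lifting" idea above can be made concrete with a minimal sketch (function name and blocking scheme are my illustration, not the authors' formulation): a 2-d matrix stored row-major is re-viewed as a grid of blocks, lifting it to four dimensions (block row, block column, row within block, column within block).

```python
# Hedged sketch of dimension lifting: a flat row-major n x n array is
# "lifted" to shape (n//b, n//b, b, b), i.e., a grid of b x b blocks.
# Further blocking of the blocks would lift the dimension again.

def lift(flat, n, b):
    """View an n x n row-major array as an (n//b, n//b, b, b) block tensor."""
    g = n // b
    return [[[[flat[(bi * b + i) * n + (bj * b + j)]
               for j in range(b)]
              for i in range(b)]
             for bj in range(g)]
            for bi in range(g)]

flat = list(range(16))        # a 4x4 matrix, row-major
blocks = lift(flat, 4, 2)
print(blocks[0][1])           # top-right 2x2 block: [[2, 3], [6, 7]]
```

Each level of blocking can be matched to one level of the memory hierarchy (registers, L1, L2, ...), which is the sense in which a two-dimensional problem is lifted to eight or more dimensions.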
When an algorithm is run, the hardware implementing the algorithm will access available memory.
In Figure 2, a prototypical graph is presented for an algorithm that does not have cache memory misses or page memory faults. The shape of the graph changes as the algorithm moves through the memory
hierarchy. This identifies the time requirements associated with the different memories from L1 cache
memory through to disk (HDD). Note the change in the slope with memory type. The slope reflects
how attributes, such as speed, cost, and power, would affect performance. Algorithm execution
(memory access) time is, however, relative to the L1 cache memory chosen. For example, it could be
nanoseconds, microseconds, milliseconds, seconds, minutes, or hours as the data moves further up the
memory hierarchy. Often, performance is related to a decrease in arithmetic operations, i.e., a reduction
of arithmetic complexity. In an ideal computing environment, where memory and computation would
have the same costs, this would be the case. Unfortunately, it is also necessary to be concerned with the
cost of data input/output (I/O). In parallel to this, it is a necessity to consider memory access patterns
and how these relate to the levels of memory. Pre-fetching is one way to alleviate delays. However,
often the algorithm developer must rely on the attributes of a compiler and hope the compiler is
pre-fetching data in an optimum manner. The developer must trust that this action is performed
correctly. This is becoming harder to achieve given that machines are becoming ever more complex.
Figure 2. Algorithm execution time through the memory hierarchy (L1, L2, and L3 cache, RAM, disk), with time increasing up the hierarchy; memory access time is relative to the L1 cache memory chosen.
Presently, the goal is to achieve a situation where the graph is polynomial, avoiding exponential behavior, such as the one in Figure 2, using HDDs. A co-design approach, complemented with dimension lifting and analysis, as discussed above, can be used to calculate upper and lower bounds of algorithms relative to their data size, memory access patterns, and arithmetic. The goal is to ensure performance stays as linear as possible. This type of information gives the algorithm developers insight into what memories to select for use, i.e., what type and size of memory should be used to keep the slope constant. This, of course, would include pre-fetching, buffering, and timings to feed the prior levels at memory speed. If this is not possible, given the available memory choices, the slope change can be minimized.
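The memory access pattern point can be illustrated with a small sketch (function names are mine; this illustrates the general principle, not the authors' measurements): for a matrix stored in row-major order, a row-by-row walk touches the buffer with unit stride, whereas a column-by-column walk jumps n elements between consecutive reads, even though both compute the same result.

```python
# Contiguous vs. strided access over the same row-major buffer.

def sum_row_major(flat, n):
    """Contiguous walk: unit stride through the row-major buffer."""
    return sum(flat[i * n + j] for i in range(n) for j in range(n))

def sum_col_major(flat, n):
    """Strided walk: jumps of n elements between consecutive reads."""
    return sum(flat[i * n + j] for j in range(n) for i in range(n))

flat = list(range(9))         # a 3x3 matrix, row-major
print(sum_row_major(flat, 3), sum_col_major(flat, 3))  # 36 36
```

The results are identical; only the access pattern differs, and it is the pattern that determines cache behavior and hence where on the Figure 2 graph the algorithm sits.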
2.3. Machine Learning, Deep Learning, and Tensors
Tensor and machine learning communities have provided a solid research infrastructure, reaching from the efficient routines for tensor calculus to methods of multi-way data analysis, i.e., from tensor decompositions to methods for consistent and efficient estimation of parameters of probabilistic models. Some tensor-based models have the characteristic that, if there is a good match between the model and the underlying structure in the data, the models are much more interpretable than alternative techniques. Their interpretability is an essential feature for machine learning techniques to gain acceptance in the rather engineering intensive fields of automation and control of cyber-physical systems. Many of these systems show intrinsically multi-linear behavior, which is appropriately modeled by tensor methods, and tools for controller design can use these models. The calibration of sensors delivering data and the higher resolution of measured data will have an additional impact on the interpretability of models.
Deep learning is a subfield of machine learning that supports a set of algorithms inspired by the structure and function of the human brain. Tensorflow™ [15], PyTorch [16], Keras [17], MXNet [18], The Microsoft Cognitive Toolkit (CNTK) [19], Caffe [20], Deeplearning4j [21], and Chainer [22] are machine learning frameworks that are used to design, build, and train deep learning models. Such frameworks continue to emerge. These frameworks support numerical computations on multidimensional data arrays, or tensors, e.g., point-wise operations, such as add, sub, mul, pow, exp, sqrt, div, and mod. They also support numerous linear algebra operations, such as Matrix-Multiply, Kronecker Product, Cholesky Factorization, LU (Lower-Upper) Decomposition, singular-value decomposition (SVD), and Transpose. The programs would be written in various languages, such as Python, C, C++, and Java. These languages also include libraries/packages/modules that have been developed to support high-level tensor operations, in many cases under the umbrellas of machine learning and deep learning.
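The point-wise operations listed above can be sketched in plain Python (a hedged illustration of the concept, not any particular framework's API; the function name `pointwise` is mine):

```python
# Point-wise (element-wise) tensor operations of the kind deep learning
# frameworks provide: combine two same-shaped nested-list tensors element
# by element with a scalar operator such as add, sub, mul, or pow.

def pointwise(a, b, op):
    """Apply binary scalar op element-wise over same-shaped tensors."""
    if isinstance(a, list):
        return [pointwise(x, y, op) for x, y in zip(a, b)]
    return op(a, b)

A = [[1, 2], [3, 4]]
B = [[10, 20], [30, 40]]
print(pointwise(A, B, lambda x, y: x + y))  # [[11, 22], [33, 44]]
```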
The design presented in this paper is for an n-dimensional inner and outer product, e.g., for 2-d
matrix multiply, which builds upon the scalar operations of + and × [24]. Some operations may be
realized in hardware, firmware, or software. This generalized inner product is defined using reductions
and outer products [24], and reduces to three loops independent of conformable argument dimensions
and shapes. This is due to Psi Reduction, whereby, through linear and multilinear transformations, an array expression can be reduced to a normal form. Then, through "dimension lifting" of a normal form, idealized hardware can be realized, in which the sizes of buffers (relative to the speed and size of connecting memories), DMA (Direct Memory Access) hardware (contiguous and strided), and other memory forms can be determined when a problem size is known, or conjectured, and details of the hardware are available and known.
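The 2-d instance of the generalized inner product described above can be sketched as follows (a minimal pure-Python sketch of the plus-over-times contraction; it is not the paper's VHDL implementation):

```python
# Three-loop inner product (2-d instance = matrix multiply): for each
# (i, j), an outer product over the shared index p is reduced with +
# ("plus over times"). The loop structure is independent of the argument
# shapes beyond conformability of the contracted dimension.

def inner(A, B):
    """C[i][j] = sum_p A[i][p] * B[p][j] for conformable matrices."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            C[i][j] = sum(A[i][p] * B[p][j] for p in range(k))
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(inner(A, B))   # [[19, 22], [43, 50]]
```

Replacing + and × with other scalar operations yields the family of generalized inner products the paper refers to, which is why a single circuit can serve all of them.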
3.1. Introduction
In order to understand memory considerations, it is important to understand the algorithms that
dominate tensor analysis: Inner Products (Matrix Multiply), and Outer Products (Kronecker or Tensor
Product). Others include transformations and selections of components. Models in AI and IoT [13]
are dominated by multiple Kronecker Products, parallel Kronecker Products (Khatri-Rao), and outer
products of Kronecker Products (Tracy-Singh), in conjunction with compressions over dimensions. Memory access patterns are well known. Moving on from an algorithmic specification to an optimized software or hardware instantiation of that algorithm requires optimizing how the data structures that represent the algorithm map onto the memory(ies) of a computer.
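The products named above can be sketched for the 2-d case (illustrative pure-Python definitions; the function names are mine): the Kronecker product is the 2-d instance of the outer product, and the Khatri-Rao product is its column-wise (parallel) form for matrices with the same number of columns.

```python
# Kronecker product of matrices A (I x J) and B (K x L): each element
# a_ij scales a full copy of B, giving an (I*K) x (J*L) matrix.
def kron(A, B):
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

# Khatri-Rao (column-wise Kronecker) of A (I x K) and B (J x K):
# column k of the result is kron of column k of A with column k of B,
# giving an (I*J) x K matrix.
def khatri_rao(A, B):
    K = len(A[0])
    return [[A[i][k] * B[j][k] for k in range(K)]
            for i in range(len(A)) for j in range(len(B))]

print(kron([[1, 2], [3, 4]], [[0, 1], [1, 0]]))
# [[0, 1, 0, 2], [1, 0, 2, 0], [0, 3, 0, 4], [3, 0, 4, 0]]
```

Because both are built from the same scalar multiply arranged over different index patterns, they share the unified outer-product circuit discussed in this paper.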
3.3. Cache Memory: Memory Types and Caches in a Typical Processor System
Over the years, memory has become faster in conjunction with memory types becoming
more diverse. Architectures now support multiple, non-uniform memories, multiple processors,
and multiple networks, and those architectures are combined to form complex, multiple networks.
In an IoT application, there may be a case that one application requires the use of a substantial portion
of the available resources and those resources must have a reproducible capacity. Figure 3 presents
a view of the different memories that may be available in an IoT application, from processor to the
Cloud. This view is based on a software programmed processor approach. Different memory types
(principle of operation, storage capacity, speed, cost, ability to retain the data when the device power
supply is removed (volatile vs. non-volatile memory), and physical location in relation to the processor
core) would be considered based on the system requirements. The fastest memories with the shortest
read and write times would be closest to the processor core, and are referred to as the cache memory.
Figure 3 considers the cache memory as three levels (L1, L2, and L3), where the registers are closest to
the core and on the same integrated circuit (IC) die as the processor itself before the cache memory
would be accessed. L1 cache memory would be SRAM (static RAM) fabricated onto the same IC
die as the processor, and would be limited in the amount of data it could hold. The registers and
L1 cache memory would be used to retain the data of immediate use by the processor. External to
the processor would be external cache memory (L2 and L3), where this memory may be fast SRAM
with limited data storage potential or slower dynamic RAM (DRAM) that would have a greater data
storage potential. RAM is volatile memory, so for data retention when the power supply is removed,
non-volatile memory types would be required: EEPROM (electrically erasable programmable read
only memory), Flash memory based on EEPROM, and HDD would be local memory followed by
external memory connected to a local area network (a “network drive”) and Cloud memory. However,
there are costs associated with each memory type that would need to be factored into a cost function for
the memory.
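A cost function of the kind mentioned above can be sketched very simply (the access times below are invented placeholder orders of magnitude, not measurements from this paper): weight each memory level's access time by how many accesses an algorithm makes at that level and sum.

```python
# Hypothetical additive cost model for the memory hierarchy.
# The per-access times are illustrative placeholders only.

ACCESS_TIME_NS = {
    "L1": 1, "L2": 4, "L3": 20, "RAM": 100,
    "Flash": 25_000, "HDD": 10_000_000,
}

def access_cost(access_counts):
    """Total access time (ns) for a profile of {level: access count}."""
    return sum(ACCESS_TIME_NS[level] * n for level, n in access_counts.items())

print(access_cost({"L1": 1_000_000, "RAM": 1_000}))  # 1100000
```

Real cost functions would also weight monetary cost, energy, and volatility, but even this simple form shows why keeping accesses at the fast end of the hierarchy dominates performance.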
Figure 3. Availability of memory types in IoT applications.
4.1. Introduction
In this section, the FPGA is introduced as a configurable hardware device that has an internal
circuit structure that can be configured to different digital circuit or system architectures. It can,
for example, be configured to implement a range of digital circuits from a simple combinational
logic circuit through to a complex processor architecture. With the available hardware resources and
ability to describe the circuit/system design using a hardware description language (HDL), such as
VHDL or Verilog [26], the designer can implement custom design architectures that are optimized
to a set of requirements. For example, it is possible to describe a processor architecture using VHDL
or Verilog, and to synthesize the design description using a set of design synthesis constraints into a
logic description that can then be targeted to a specific FPGA device (design implementation, “place
and route”). This processor, which is hardware, would then be connected to a memory containing a
program for the processor to run, the memory may be registers (flip-flops), available memory macros
within the FPGA or external memory devices connected to the pins of the FPGA. Therefore, it would
be possible to implement a hardware only design or a hardware/software co-design. In addition,
if adequate hardware resources were available, more than one processor could be configured into the
FPGA and a multi-processor device therefore developed.
• Volatile memory: When data are stored within the memory, the data are retained in the memory
whilst the memory is connected to a power supply. Once the power supply has been removed,
then the contents of the memory (the data) are lost. The early FPGAs utilized volatile SRAM
based memory.
• Non-volatile memory: When data are stored within the memory, the data are retained in the
memory even when the power supply has been removed. Specific FPGAs available today utilize
Flash memory for holding the configuration.
5. Inner and Outer Product Implementation in Hardware Using the FPGA Case Study
5.1. Introduction
In this section, the design, simulation, and physical prototype testing of a single IP core that
implements the inner and outer products are presented. The idea here is to have a hardware macro cell,
or core, that can be accessed from an external digital system (e.g., a software programmed processor
that can pass the computation tasks to this cell whilst it performs other operations in parallel). The input
array data are stored as constants within arrays in the ipOpCore module, as shown in Figure 4, and are
therefore, in this case, read-only. However, in another application, it would be necessary to allow the arrays to be read-write for entering new data to be analyzed, and the design would then be modified to allow array A and B data to be loaded into the core, either as serial or parallel data.
Hence, the discussion provided in this section relates to the specific case study. In addition, a single
result output could be considered and the need for test data output might not be a requirement.
The motivation behind this work is to model tensors as multi-dimensional arrays and to analyze these
using tensor analysis in hardware. This requires a suitable array access algorithm to be developed,
the use of suitable memory for storing data in a specific application, and a suitable implementation
strategy. In this paper, the inner and outer products are considered using only the FPGA as the target device; an efficient algorithm that implements the inner and outer products in a single circuit in hardware is used, and appropriate embedded FPGA memory resources are used to enable fast memory access.
Figure 4. ipOpCore case study design.
The design shown in Figure 4 was created to allow for both product results to be independently accessed during normal runtime operation and for specific internal data to be accessed for test and debug purposes. The design description was written in VHDL as a combination of behavioral, RTL, and structural code targeting the Xilinx® Artix-7 (XC7A35TICSG324-1L) FPGA. This specific device was chosen for practical reasons as it contains hardware resources suited for this application. The design, however, is portable and is readily transferred to other FPGAs, or could be part of a larger digital ASIC design, if required. For any design implementation, the choice of hardware, and potentially software, to use would be based on a number of considerations. The FPGA was mounted on the Digilent® Arty A7-35T Development Board, which was chosen for the following reasons:
The FPGA considered is used in other project work and, as such, the work described in this paper could readily be incorporated into these projects. Specifically, sensor data acquisition using the FPGA and data analysis within the FPGA projects would benefit from this work, where the algorithm and memory access operations used in this paper would provide additional value to the work undertaken.
1. The development board used provided hardware resources that were useful for project work,
such as the 100 MHz clock, external memory, switches, push buttons, light emitting diodes
(LEDs), expansion connectors, LAN connection, and a universal serial bus (USB) interface for
FPGA configuration and runtime serial I/O.
2. The development board was physically compact and could be readily integrated into an enclosure
for mobility purposes and operated from a battery rather than powered through the USB
+5 V power.
3. The Artix-7 FPGA provided adequate internal resources and I/O for the work undertaken and
external resources could be readily added via the expansion connectors if required.
4. For memory implementation, the FPGA can use the internal look-up tables (LUTs) as distributed
memory for small memories, can use internal BRAM (Block RAM) for larger memories, and
external volatile/non-volatile memories connected to the I/O.
5. For computation requirements, the FPGA allows for both fixed-point and floating-point
arithmetic operations to be implemented.
6. For an embedded processor based approach, the MicroBlaze CPU can be instantiated within the
FPGA for software based implementations.

C = A ⊗ B
The above products were initially developed using C and Python coding where the data in C were
stored in arrays and in Python were stored in lists. The combined inner/outer product algorithm was
verified through running the algorithm with different data sets and verifying the software simulation
model results with manual hand calculation results. Once the software version of the design was
verified, the Python code functionality was manually translated to a VHDL equivalent. The two key
design decisions to make were:
1. How to model the arrays for early-stage evaluation work and how to map the arrays to hardware
in the FPGA.
2. How to design the algorithm to meet timing constraints, such as maximum processing time,
number of clock cycles required, hardware size considerations, and the potential clock frequency,
with the hardware once it is configured within the FPGA.
In this design, the data set was small and so VHDL arrays were used for both the early-stage
evaluation work and for synthesis purposes. In VHDL, the input and results arrays were defined and
initialized as follows:
TYPE array_1by4 IS ARRAY (0 TO 3) OF INTEGER;
TYPE array_1by6 IS ARRAY (0 TO 5) OF INTEGER;
TYPE array_1by9 IS ARRAY (0 TO 8) OF INTEGER;
TYPE array_1by36 IS ARRAY (0 TO 35) OF INTEGER;
TYPE array_1by54 IS ARRAY (0 TO 53) OF INTEGER;
software models. The input arrays (arrayA and arrayB) contain the input data. The results arrays
(arrayResultIp (inner product) and arrayResultOp (outer product)) were initialized with 0s. It was
not necessary, in this case, to map to any embedded BRAM or external memory as the data set size
was small and easily mapped by the synthesis tool to distributed RAM within the FPGA. The PC
(product code) array is not shown above, but this is an array that contains the shape and size of arrays
A and B. For the algorithm, with direct mapping to VHDL from the Python code, the inner product
and outer product each required a set number of clock cycles. Figure 5 shows a simplified timing
diagram identifying the signals required to implement the inner/outer product computation. Once the
computation has been completed, the array contents could then be read out one element at a time.
For evaluation purposes, all array values were made accessible concurrently, but could readily be
made available serially via a multiplexor arrangement to reduce the number of output signals required
in the design.
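To make the flat-array storage and index arithmetic concrete, the following Python sketch mirrors the software model described in the text. It is an illustrative reconstruction, not the authors' code: the shapes follow the declared array sizes (A as 3 × 3 stored in nine elements, B as 3 × 2 stored in six elements), while the data values and function names are example choices.

```python
# Illustrative flat-array model of the combined inner/outer product.
# A is a 3x3 matrix stored row-major in a 9-element list (cf. array_1by9),
# B is a 3x2 matrix stored row-major in a 6-element list (cf. array_1by6).

def inner_product(a, b, n, m, q):
    """C[i][j] = sum over k of A[i][k] * B[k][j], all arrays stored flat."""
    c = [0] * (n * q)
    for i in range(n):
        for j in range(q):
            acc = 0
            for k in range(m):
                acc += a[i * m + k] * b[k * q + j]
            c[i * q + j] = acc
    return c

def outer_product(a, b):
    """Every element of A multiplied by every element of B (flat result)."""
    return [x * y for x in a for y in b]

arrayA = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # example data, 3x3 row-major
arrayB = [1, 0, 0, 1, 1, 1]           # example data, 3x2 row-major

arrayResultIp = inner_product(arrayA, arrayB, 3, 3, 2)  # 6 values
arrayResultOp = outer_product(arrayA, arrayB)           # 54 values
```

Note that the 3 × 3 × 2 = 18 multiply-accumulate steps of the inner product and the 9 × 6 = 54 multiplications of the outer product correspond to the 18 and 54 clock cycles reported for the hardware implementation.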
A computation run would commence with the run control signal being pulsed 0-1-0 with the
product selection input ipOp set to either logic 0 (inner product) or logic 1 (outer product). In this
implementation, the inner product required 18 clock cycles and the outer product required 54 clock
cycles to complete. The array data read-out operations are not, however, shown in Figure 5. The data
values were defined using the INTEGER data type for modelling and simulation purposes, and these
values were mapped to an 8-bit STD_LOGIC_VECTOR data type for synthesis into hardware. The 8-bit
width data bus was sufficient to account for all data values in this study.
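The run/ipOp handshake and the fixed cycle counts can be sketched as a small cycle-count model. This is an assumed behavioural abstraction for illustration, not the VHDL control unit itself:

```python
class IpOpCoreModel:
    """Toy cycle-count model of the run/ipOp handshake (assumed behaviour)."""

    INNER_CYCLES = 18   # clock cycles for the inner product (ipOp = 0)
    OUTER_CYCLES = 54   # clock cycles for the outer product (ipOp = 1)

    def __init__(self):
        self.busy = False
        self.cycles_left = 0

    def tick(self, run, ip_op):
        """Advance one clock cycle; a run pulse starts a computation when idle."""
        if not self.busy and run:
            self.busy = True
            self.cycles_left = self.OUTER_CYCLES if ip_op else self.INNER_CYCLES
        elif self.busy:
            self.cycles_left -= 1
            if self.cycles_left == 0:
                self.busy = False

core = IpOpCoreModel()
core.tick(run=1, ip_op=0)   # 0-1-0 run pulse selecting the inner product
cycles = 0
while core.busy:            # count cycles until the computation completes
    core.tick(run=0, ip_op=0)
    cycles += 1
```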
[Figure: simplified timing diagram showing the logic levels of the resetN, run, and ipOp signals against time during a calculation; a further figure panel shows the right array (B) and the outer product array.]
[Figure: elaborated schematic showing the control unit (ControlUnit) and ipOp core (ipOpCore) modules (instances I0 and I1) with the address, data, and control signals addrA[7:0], addrB[7:0], addrPC[7:0], addrResIp[7:0], addrResOp[7:0], dataA[7:0], dataB[7:0], productCode[7:0], dataResIp[7:0], dataResOp[7:0], clock, calculationClock, ipOp, run, and resetN.]

Figure 7. Simplified schematic view of the elaborated VHDL code.
[Figure: board photograph labelling the USB interface to a computer for +5 V power, device configuration, and serial data transfer; the reset push-button; the Artix-7 FPGA; and the 100 MHz clock module on the underside of the board.]

Figure 8. Xilinx® Artix-7 FPGA on the Digilent® Arty board identifying key components used in
the experimentation.
The design must eventually be implemented within the FPGA and this is a two-step process.
Firstly, the VHDL code is synthesized and then the synthesized design is implemented in the target
FPGA. The synthesis and implementation operations can be run using the default settings, or the user
can set constraints to direct the tools. In this case study, the default tool settings were used and Table 3
identifies the hardware resources required after synthesis and implementation for the design.
Table 3. Artix-7 FPGA resource utilization in the case study design.

Item                       Use                                Number Used
Package pin                Input                              44
                           Output                             40
Design synthesis results
Post-synthesis I/O         Inputs                             23 *
                           Outputs                            40
Slice LUTs                 Total used                         454
                           LUT as logic                       442
                           LUT as memory (distributed RAM)    12
Slice registers            Total used                         217
                           Slice register as flip-flop        217
Other logic                Clock buffer                       2
Design implementation results
Post-implementation I/O    Inputs                             23
                           Outputs                            40
Slice LUTs                 Total used                         391
                           LUT as logic                       379
                           LUT as memory (distributed RAM)    12
Slice registers            Total used                         217
                           Slice register as flip-flop        217
Other logic                Clock buffer                       2

* Note that the number of inputs required in the design after synthesis does not include the address input bits that
were always a constant logic 0 in this case study. This was due to the standard 8-bit address bus used for all
input addresses and the sizes of the arrays meant that the most significant bits (MSBs) of the array addresses were not
required. Note also that post-implementation, the number of slice LUTs required was less than that post-synthesis.

5.4. Design Simulation

Design simulation was undertaken to ensure that the correct values were stored, calculated,
and accessed. The Xilinx® Vivado software tool was used for design entry and simulation was
performed using the built-in Vivado simulator. A VHDL test bench was used to perform the
computation and array data read-out operations. Figure 9 shows the complete simulation run where
the clock frequency in simulation was set to 50 MHz (the master clock frequency divided by two).
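As a quick check of the footnote to Table 3, the number of address bits each array actually needs follows from its element count; the remaining bits of the fixed 8-bit address bus are the constant-logic-0 MSBs. The mapping of array_1by4 to the PC array is an assumption here:

```python
import math

# Address bits needed per array versus the fixed 8-bit address bus; the
# unused MSBs are the constant-logic-0 inputs noted under Table 3.
array_sizes = {
    "arrayA": 9,          # array_1by9
    "arrayB": 6,          # array_1by6
    "arrayResultIp": 6,   # array_1by6
    "arrayResultOp": 54,  # array_1by54
    "PC": 4,              # array_1by4 (assumed to hold the product code)
}

bits_used = {name: max(1, math.ceil(math.log2(size)))
             for name, size in array_sizes.items()}

for name, bits in bits_used.items():
    print(f"{name}: {bits} address bits used, {8 - bits} MSBs constant 0")
```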
Figure 9. Simulation study results: Computation and results read-out.
For the inner product data read-out, Figure 10 shows the simulation results for all nine product
array element values (dataResIp) being read out of the arrayResultIp array. The iPOp control signal
is not used (set to logic 1 in the simulation test bench) as it is only used for the computation, the clock
is held at logic 0 as it is also only used for the computation, and the reset signal is not asserted (resetN
= 1). The inner product array address (addrResIp) is provided to access each element in the array
sequentially.

Figure 10. Simulation study results: Inner product.
This shows the specific results for the complete inner product as follows:

              10 13
Cip = A ⊗ B = 28 40
              46 67
For the outer product data read-out, Figure 11 shows the simulation results for the last 13 values
(dataResOp) being read out of the arrayResultOp array. The iPOp control signal is not used (set to
logic 1 in the simulation test bench) as it is only used for the computation, the clock is held at logic 0
as it is also only used for the computation, and the reset signal is not asserted (resetN = 1). The outer
product array address (addrResOp) is provided to access each element in the array sequentially.
Figure 11. Simulation study results: Outer product (final set of results read-out only).
This shows the specific results for the last 13 values in the results array as follows:

              .  .  .  .  .  .
              .  .  .  .  .  .
              .  .  .  .  .  .
              .  .  .  .  .  .
Cop = A ⊗ B = .  .  .  .  .  .
              .  .  .  .  .  .
              .  .  0  7  0  8
              .  . 14 21 16 24
              . 30 28 35 32 40
5.5. Hardware Test Set-Up

Design analysis was in the main performed using simulation to determine correct functionality
and signal timing considering the initial design description prior to synthesis (behavioral simulation),
the synthesized design (post-synthesis simulation), and the implemented design (post-implementation
simulation). This is a typical simulation approach that is supported by the FPGA simulator for verifying
the design operation at different steps in the design process. Given that the design is intended to be
used as a block within a larger digital system, the simulation results would give an appropriate
estimate of the signal timing and the circuit power consumption.

In addition to the simulation study, the design was also implemented within the FPGA and
signals monitored using the development board connectors (the Pmod™ (peripheral module) connectors)
using a logic analyzer and oscilloscope. This test arrangement is shown in Figure 12.
[Figure 12: the Artix-7 FPGA containing the control unit, ipOp core, and built-in tester, connected via the Pmod™ connectors to a logic analyser and an oscilloscope.]
To generate the top-level design module input signals, a built-in tester circuit was developed
and incorporated into the FPGA. This was a form of a built-in self-test (BIST) [30] circuit that generated
the control signals identified in Figure 5 and allowed the internal array address and data signals to
be accessed. The tester circuit was set-up to continuously repeat the sequence in Figure 5 rather than
run just once and so did not require any user set input control signals to operate. As the number of
address and data bits required (40 address bits and 40 data bits) for the five arrays exceeded the
number of Pmod™ connections available, these signals were multiplexed to eight address and
eight data bits within the built-in tester and the multiplexor control signals were output for identifying
the array being accessed. The control signals were also accessible on the Pmod™ connectors for
test purposes.

Figure 13 shows a simplified schematic view of the elaborated VHDL code, where I0 is the
top-level design module and I1 is the built-in tester module.
[Figure: the built-in tester module (I1) drives the top-level design module (I0) signals — addrA[7:0], addrB[7:0], addrPC[7:0], addrResIp[7:0], addrResOp[7:0], ClockTop, ipOp, run, and resetN — and outputs addressOut[7:0], dataOut[7:0], muxOut[2:0], clockTopOut, ipOpOut, runOut, and resetOut for monitoring, with clock and resetN as external inputs.]

Figure 13. Embedded hardware tester: Simplified schematic view of the elaborated VHDL code.
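The 40-to-8 address multiplexing performed by the built-in tester can be sketched as follows. The select-code-to-array mapping is an assumption for illustration; the paper only states that a multiplexor control output identifies the array being accessed:

```python
# Sketch of the built-in tester multiplexing: five 8-bit address buses
# (40 bits) reduced to one 8-bit output, selected by a 3-bit code that is
# also output so the array being accessed can be identified externally.
# The select-value-to-array mapping below is assumed, not from the paper.

ARRAYS = ["addrA", "addrB", "addrPC", "addrResIp", "addrResOp"]

def mux_address(select, buses):
    """Return the selected 8-bit address bus value (select = muxOut[2:0])."""
    if not 0 <= select < len(ARRAYS):
        raise ValueError("invalid mux select code")
    return buses[ARRAYS[select]] & 0xFF

# Example bus values for one clock cycle of the tester sequence.
buses = {"addrA": 0x05, "addrB": 0x02, "addrPC": 0x00,
         "addrResIp": 0x03, "addrResOp": 0x2A}
```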
The hardware test arrangement was useful to verify that the signals were generated correctly
and matched the logic levels expected during normal design operation. However, it was necessary to
reduce the speed of operation to account for non-ideal electrical parasitic effects that caused ringing of
the signal. In this specific set-up, the speed of operation of the circuit when monitoring the signals using
the logic analyzer and oscilloscope was not deemed important, so the 100 MHz clock was internally
divided within the built-in tester circuit to 2 MHz in the study. However, further analysis could
determine how fast the signals could change if the Pmod™ connector was required to connect external
memory for larger data sets.
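The internal division of the 100 MHz clock down to 2 MHz can be modelled with a simple counter: dividing by 50 toggles the slow clock every 25 fast-clock cycles. This is a behavioural sketch, not the VHDL divider:

```python
def divided_clock(num_fast_ticks, divide_by):
    """Model a clock divider: return the slow-clock level at each fast tick."""
    half_period = divide_by // 2        # fast ticks per slow half-period
    return [(t // half_period) % 2 for t in range(num_fast_ticks)]

# 100 MHz divided by 50 gives 2 MHz: the level toggles every 25 fast ticks.
slow = divided_clock(100, 50)
```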
Figure 14 shows the logic analyzer test set-up with the Artix-7 FPGA Development Board (bottom
left) and the Digilent® Analog Discovery "USB Oscilloscope and Logic Analyzer" (top right) [31].
The 16 digital inputs for the logic analyzer function were available for use and a GND (ground, 0)
connection was required to monitor the test circuit outputs. Internal control signals (ClockTop, ipOp,
run, and resetN) were also available for monitoring in this arrangement.

The Digilent® Waveforms software [32] was utilized to control the test hardware and view the
results. Figure 15 shows the logic analyzer output in Waveforms. Here, one complete cycle of the
test process is shown, where the calculations are initially performed and the array outputs then
read. The logic level values obtained in this physical prototype test agreed with the results from the
simulation study. The data values are shown as a combined integer number value (data, top) and the
values of the individual bits (7 down to 0).
Figure 14. Logic analyzer test set-up using the Digilent® Analog Discovery.
Figure 15. Logic analyzer test results using the Digilent® Analog Discovery: Complete cycle.
The run signal (a 0-1-0 pulse) initiates the computation that is selected by the ipOp signal at the
start of the cycle. The data readout on the 8-bit data bus can be seen towards the end of the cycle as
both a bus value and individual bit values.
Figure 16 shows the data readout operation towards the end of the cycle. The data output
identifies the values for array A (nine values), array B (six values), and the inner product result array
(six values) as identified in Section 5.2.
Figure 16. Logic analyzer test results using the Digilent® Analog Discovery: Data readout.
5.6. Design Implementation Considerations

This case study design has presented one example implementation of the combined inner
and outer product algorithm. The study focused on creating a custom hardware only design
implementation rather than developing the algorithm in software to run on a suitable processor
architecture. The approach taken to create the hardware design was to map the algorithm operations
in software to a hardware equivalence. The hardware design was created using two main modules:
and synthesized into the available FPGA logic (a so-called soft core). For example, in Xilinx® FPGAs,
the MicroBlaze 32-bit RISC (reduced instruction set computer) CPU can be instantiated into
a custom design. It is also possible, if the hardware resources are sufficient, to instantiate
multiple soft cores within the FPGA. This would allow for a multi-processor solution and on-chip
processor-to-processor communications with parallel processing. A second approach would be
to develop a custom architecture solution that maps the algorithm and memory modules to the
user requirements, giving a choice to implement sequential or parallel (concurrent) operations.
This provides a high level of flexibility for the designer, but requires a different design approach,
thinking in terms of hardware rather than software operations. A third approach would be to create
a hardware-software co-design incorporating custom architecture hardware and standard processor
architectures working concurrently.

A final consideration in implementation would be to identify example processor architectures and
target hardware used in machine and deep learning applications, where their benefits and limitations
for specific applications could be assessed. For example, in software processor applications, the
CPU is used for tensor computations where a GPU (graphics processing unit) is not available. GPUs
have architectures and software programming capabilities that are better suited than a CPU for applications,
such as gaming, where high-speed data processing and parallel computing operations are required.
An example GPU is the Nvidia® Tensor Core [33].
6. Conclusions
In this paper, the design and simulation of a hardware block to implement a combined inner and
outer product was introduced and elaborated. This work was considered in the context of developing
embedded digital signal processing algorithms that can effectively and efficiently process complex
data sets. The FPGA was used as the target hardware and the product algorithm developed as VHDL
modules. The algorithm was initially evaluated using C and Python code before translation to a
hardware description in VHDL. The paper commenced with a discussion of tensors and the need
for effective and efficient memory access to control memory access times and the cost associated with
such memory access operations. To develop the ideas, the FPGA hardware implementation developed
was an example design that paralleled an initial software algorithm (C and Python coding) used for
algorithm development. The design was evaluated in simulation and hardware implementation issues
were discussed.
Author Contributions: The authors contributed equally to the work undertaken as a co-design effort. Co-design
is not just giving tasks to scholars in separate yet complementary disciplines. It is an ongoing attempt to
communicate ideas across disciplines and a desire to achieve a common goal. This type of research not only
helps both disciplines but it gives definitive proof of how that collaboration achieves greater results, and mentors
students and other colleagues to do the same. The case study design was a hardware implementation by Ian
Grout using the FPGA of the algorithm developed by Lenore Mullin. Both authors contributed equally to the
background discussions and the writing of the paper.
Funding: This research received no external funding.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Earley, S. Analytics, Machine Learning, and the Internet of Things. IT Prof. 2015, 17, 10–13. [CrossRef]
2. Corneanu, C.A.; Simón, M.O.; Cohn, J.F.; Guerrero, S.E. Survey on RGB, 3D, Thermal, and Multimodal
Approaches for Facial Expression Recognition: History, Trends, and Affect-Related Applications. IEEE Trans.
Pattern Anal. Mach. Intell. 2016, 38, 1548–1568. [CrossRef] [PubMed]
3. Xilinx® . FPGA Leadership across Multiple Process Nodes. Available online: https://www.xilinx.com/
products/silicon-devices/fpga.html (accessed on 9 October 2018).
4. Xilinx® . Homepage. Available online: https://www.xilinx.com/ (accessed on 9 October 2018).
5. Xilinx® . Artix-7 FPGA Family. Available online: https://www.xilinx.com/products/silicon-devices/fpga/
artix-7.html (accessed on 9 October 2018).
Electronics 2018, 7, 320 23 of 24
6. Fleisch, D. A Student’s Guide to Vectors and Tensors; Cambridge University Press: Cambridge, UK, 2011;
ISBN-10 0521171903, ISBN-13 978-0521171908.
7. Institute of Electrical and Electronics Engineers. IEEE Std 1076-2008—IEEE Standard VHDL Language Reference
Manual; IEEE: New York, NY, USA, 2009; ISBN 978-0-7381-6853-1, ISBN 978-0-7381-6854-8.
8. Kindratenko, V.; Trancoso, P. Trends in High Performance Computing. Comput. Sci. Eng. 2011, 13, 92–95.
[CrossRef]
9. Lane, N.D.; Bhattacharya, S.; Mathur, A.; Georgiev, P.; Forlivesi, C.; Kawsar, F. Squeezing Deep Learning into
Mobile and Embedded Devices. IEEE Pervasive Comput. 2017, 16, 82–88. [CrossRef]
10. Mullin, L.; Raynolds, J. Scalable, Portable, Verifiable Kronecker Products on Multi-scale Computers.
In Constraint Programming and Decision Making. Studies in Computational Intelligence; Ceberio, M.,
Kreinovich, V., Eds.; Springer: Cham, Switzerland, 2014; Volume 539.
11. Gustafson, J.; Mullin, L. Tensors Come of Age: Why the AI Revolution Will Help HPC. 2017. Available
online: https://www.hpcwire.com/2017/11/13/tensors-come-age-ai-revolution-will-help-hpc/ (accessed
on 9 October 2018).
12. Workshop Report: Future Directions in Tensor Based Computation and Modeling. 2009. Available
online: https://www.researchgate.net/publication/270566449_Workshop_Report_Future_Directions_in_
Tensor-Based_Computation_and_Modeling (accessed on 9 October 2018).
13. Tensor Computing for Internet of Things (IoT). 2016. Available online: http://drops.dagstuhl.de/opus/
volltexte/2016/6691/ (accessed on 9 October 2018).
14. Python.org, Python. Available online: https://www.python.org/ (accessed on 9 October 2018).
15. TensorflowTM . Available online: https://www.tensorflow.org/ (accessed on 9 October 2018).
16. Pytorch. Available online: https://pytorch.org/ (accessed on 9 October 2018).
17. Keras. Available online: https://en.wikipedia.org/wiki/Keras (accessed on 9 October 2018).
18. Apache MxNet. Available online: https://mxnet.apache.org/ (accessed on 9 October 2018).
19. Microsoft Cognitive Toolkit (CNTK). Available online: https://www.microsoft.com/en-us/cognitive-toolkit/
(accessed on 9 October 2018).
20. CAFFE: Deep Learning Framework. Available online: http://caffe.berkeleyvision.org/ (accessed on 9
October 2018).
21. DeepLearning4J. Available online: https://deeplearning4j.org/ (accessed on 9 October 2018).
22. Chainer. Available online: https://chainer.org/ (accessed on 9 October 2018).
23. Google, Cloud TPU. Available online: https://cloud.google.com/tpu/ (accessed on 9 October 2018).
24. Mullin, L.M.R. A Mathematics of Arrays. Ph.D. Thesis, Syracuse University, Syracuse, NY, USA, 1988.
25. Mullin, L.; Raynolds, J. Conformal Computing: Algebraically connecting the hardware/software boundary
using a uniform approach to high-performance computation for software and hardware. arXiv 2008,
arXiv:0803.2386. Available online: https://arxiv.org/pdf/0803.2386.pdf (accessed on 1 November 2018).
26. Institute of Electrical and Electronics Engineers. IEEE Std 1364™-2005 (Revision of IEEE Std 1364-2001), IEEE
Standard for Verilog® Hardware Description Language; IEEE: New York, NY, USA, 2006; ISBN 0-7381-4850-4,
ISBN 0-7381-4851-2.
27. Ong, Y.S.; Grout, I.; Lewis, E.; Mohammed, W. Plastic optical fibre sensor system design using the field
programmable gate array. In Selected Topics on Optical Fiber Technologies and Applications; IntechOpen: Rijeka,
Croatia, 2018; pp. 125–151, ISBN 978-953-51-3813-6.
28. Dou, Y.; Vassiliadis, S.; Kuzmanov, G.K.; Gaydadjiev, G.N. 64 bit Floating-point FPGA Matrix Multiplication.
In Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays,
Monterey, CA, USA, 20–22 February 2005. [CrossRef]
29. Amira, A.; Bouridane, A.; Milligan, P. Accelerating Matrix Product on Reconfigurable Hardware for
Signal Processing. In Field-Programmable Logic and Applications, Proceedings of the 11th International
Conference, FPL 2001, Belfast, UK, 27–29 August 2001; Lecture Notes in Computer Science; Brebner, G.,
Woods, R., Eds.; Springer: Berlin/Heidelberg, Germany, 2001; pp. 101–111, ISBN 978-3-540-42499-4 (print),
ISBN 978-3-540-44687-3 (online).
30. Hurst, S.L. VLSI Testing—Digital and Mixed Analogue/Digital Techniques; The Institution of Engineering and
Technology: London, UK, 1998; pp. 241–242, ISBN 0-85296-901-5.
31. Digilent® . Analog Discovery. Available online: https://reference.digilentinc.com/reference/instrumentation/
analog-discovery/start?redirect=1 (accessed on 1 November 2018).
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).