
electronics

Article
Hardware Considerations for Tensor Implementation
and Analysis Using the Field Programmable
Gate Array
Ian Grout 1, * and Lenore Mullin 2
1 Department of Electronic and Computer Engineering, University of Limerick, V94 T9PX Limerick, Ireland
2 Department of Computer Science, College of Engineering and Applied Sciences, University at Albany,
State University of New York, Albany, NY 12222, USA; [email protected]
* Correspondence: [email protected]; Tel.: +353-61-202-298

Received: 18 October 2018; Accepted: 8 November 2018; Published: 13 November 2018

Abstract: In today’s complex embedded systems targeting internet of things (IoT) applications,
there is a greater need for embedded digital signal processing algorithms that can effectively and
efficiently process complex data sets. A typical application considered is for use in supervised and
unsupervised machine learning systems. With the move towards lower power, portable, and embedded
hardware-software platforms that meet the current and future needs for such applications, there is
a requirement on the design and development communities to consider different approaches to
design realization and implementation. Typical approaches are based on software programmed
processors that run the required algorithms on a software operating system. Whilst such approaches
are well supported, they can lead to solutions that are not necessarily optimized for a particular
problem. A consideration of different approaches to realize a working system is therefore required,
and hardware based designs rather than software based designs can provide performance benefits
in terms of power consumption and processing speed. In this paper, consideration is given to
utilizing the field programmable gate array (FPGA) to implement a combined inner and outer product
algorithm in hardware that utilizes the available hardware resources within the FPGA. These products
form the basis of tensor analysis operations that underlie the data processing algorithms in many
machine learning systems.

Keywords: Inner and outer product; tensor; FPGA; hardware

1. Introduction
Embedded system applications are today demanding greater levels of digital signal processing
(DSP) capabilities whilst providing low-power operation and with reduced processing times for
complex signal processing operations found typically in machine learning [1] systems. For example,
facial recognition [2] for safety and security conscious applications is a noticeable everyday example,
and many smartphones today incorporate facial recognition software applications for phone and
software app access. Embedded environmental sensors, as an alternative application, can input
multiple sensor data values over a period of time and, using DSP algorithms, can analyze the data
and autonomously provide specific outcomes. Although these applications may differ, within the
system hardware and software, these are simply algorithms accessing data values that need to be
processed. The system does not need to know the context of the data it is obtaining. Data processing is
rather concerned with how effectively and efficiently it can obtain, store, and process the data before
transmitting a result to an external system. This requires not only an understanding of regular access
patterns in important internet of things (IoT) algorithms, but also an ability to identify similarities

Electronics 2018, 7, 320; doi:10.3390/electronics7110320 www.mdpi.com/journal/electronics


amongst such algorithms. Research presented herein shows how scalar operations, such as plus and times, extended to all scalar operations, can be defined in a single circuit that implements all scalar operations extended to: (i) n-dimensional tensors (arrays); (ii) the inner product (matrix multiply is a 2-d instance) and the outer product, both on n-dimensional arrays (the Kronecker Product is a 2-d instance); and (iii) compressions, or reductions, over arbitrary dimensions. However, even more relationships exist. One of the most compute intensive operations in IoT is the Khatri-Rao, or parallel Kronecker Product, which, from the perspective of this research, is an outer product projected to a matrix, enabling contiguous reads and writes of data values at machine speeds.

In terms of the data, when this data is obtained, it must be stored in the available memory. This will be a mixture of cache memory within a suitably selected software programmed processor (microcontroller (µC), microprocessor (µP), or digital signal processor (DSP)), locally connected external volatile or non-volatile memory connected to the processor, memory connected to the processor via a local area network (LAN), or some form of Cloud based memory (Cloud storage). Identifying what to use and when is the challenge. Ideally, the data would be stored in specific memory locations so that the processor can optimally access the stored input data, process the data, and store the result (the output data) again in memory in suitable new locations, or overwrite existing data in already utilized memory. Knowing and anticipating cache memory misses, for example, enables a design that minimizes overhead(s), such as signal delays, energy, heat, and power.

In many embedded systems implemented today, the software programmed processor is the commonly used programmable device to perform complex tasks and interface to input and output systems. The software approach has been developed over the last number of years and is supported through tools (usually available via an integrated development environment (IDE)) and programming language constructs, providing the necessary syntax and semantics to perform the required complex tasks. However, increasingly, the programmable logic device (PLD) [3] that allows a hardware configuration to be downloaded into the PLD in terms of digital logic operations is utilized. Figure 1 shows the target device choices available to the designer today. Alternatively, an application specific integrated circuit (ASIC) solution, whereby a custom integrated circuit is designed and fabricated, could be considered. Design goals include not only semantic, denotational, and functional descriptions of a circuit, but also an operational description (how to build the circuit and associated memory relative to access patterns of important algorithms).
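The operations discussed above can be sketched in plain Python using nested lists (an illustration only, not the hardware circuit presented later; all variable names are this sketch's own): the same scalar operations, + and ×, yield the inner product, the outer product, and, by re-indexing the outer product, the 2-d Kronecker Product.

```python
# Illustrative sketch only: scalar + and * generalized to matrices,
# using plain Python lists.
A = [[0, 1], [2, 3]]                   # 2 x 2 rank 2 tensor
B = [[4, 5], [6, 7]]

# (ii) Inner product: matrix multiply is the 2-d instance.
inner = [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]

# (ii) Outer product of the same arguments: every element of A times
# every element of B, a rank 4 result indexed as outer[i][j][k][l].
outer = [[[[A[i][j] * B[k][l] for l in range(2)] for k in range(2)]
          for j in range(2)] for i in range(2)]

# The Kronecker Product is this outer product projected to a matrix:
# kron[i*2 + k][j*2 + l] = A[i][j] * B[k][l].
kron = [[outer[i][j][k][l]
         for j in range(2) for l in range(2)]
        for i in range(2) for k in range(2)]
```

Note that the Kronecker result is simply a re-indexing of the outer product values, which is the observation the later sections build on.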

[Figure 1 diagram: "Algorithm to be implemented" branches into a software programmed processor (microcontroller (µC), microprocessor (µP), or digital signal processor (DSP)), a hardware configured programmable logic device (field programmable gate array (FPGA), complex programmable logic device (CPLD), or simple programmable logic device (SPLD)), and an ASIC.]

Figure 1. Programmable/configurable device choices for implementing digital signal processing operations in hardware and software.

In this paper, consideration is given to a general algorithm, and the resultant circuit, for an
n-dimensional inner and outer product. This algorithm (circuit) builds upon scalar operations,
thus creating a single IP (intellectual property) core that utilizes an efficient memory access algorithm.
The field programmable gate array (FPGA) is used as the target hardware and the Xilinx® [4] Artix-7 [5]
device is utilized in this case study. The two algorithms, the matrix multiplication, and Tensor Product
(Kronecker Product), are foundational to essential algorithms in AI and IoT. The paper is presented in a
way to discuss the necessary links between the computer science (algorithm design and development)
and the engineering (circuit design, implementation, test, and verification) actions that need to be
undertaken as a single, combined approach to system realization.
The paper is structured as follows. Section 2 will introduce and discuss algorithms for complex
data analysis with a focus on tensor [6] analysis. An approach using tensor based computations with
dimension data arrays that are to be developed and processed is introduced. Section 3 will discuss
memory considerations for tensor analysis operations, and Section 4 will introduce the use of the FPGA
in implementing hardware and hardware/software co-design realizations of tensor computations.
Section 5 will provide a case study design created using the VHDL (Very High Speed Integrated Circuit
(VHSIC) Hardware Description Language (HDL)) [7] for synthesis and implementation within the
FPGA. The design architecture, simulation results, and physical prototype test results are presented,
along with a discussion into implementation possibilities. Section 6 will conclude the paper.

2. Algorithms for Tensor Analysis

2.1. Introduction
In this section, data structures using tensor notation are introduced and discussed with the need
to consider and implement high performance computing (HPC) applications [8], such as required in
artificial intelligence (AI), machine learning (ML), and deep learning (DL) systems [9]. The section
commences with an introduction to tensors, followed by a discussion of the use of tensors
in HPC applications. The algorithms foundational to IoT (Matrix Multiply, Kronecker Product,
and Compressions (Reductions)) are targeted with the need for a unified n-dimensional inner and
outer product circuit that can optimally identify and access suitable memories to store input and
processed data.

2.2. Tensors as Algebraic Objects


As the need for IoT [10] and AI solutions grows, so does the need for High Performance Tensor
(HPT) operations [11]. Tensors often provide a natural and compact representation for multidimensional
data. For example, a function with five parameters can be thought of as a five-dimensional array. This is a
particularly useful approach to structuring complex data sets for analysis.
With the complexity of tensor analysis requirements in real-world scenarios, there is a need
for suitable hardware and software platforms to effectively and efficiently perform tensor analysis
operations. Although there is a plethora of tensor platforms available for use, all the platforms are
built upon tensors using various software programming languages and approaches, with varying performance.
Selecting and obtaining the right programming language and hardware platform to run tensor
computation programs on is not a trivial task. Fortunately, numerous efforts are underway to identify
hot spots and build firmware and hardware. These efforts are built upon over 10 years of national and
international workshops (e.g., [12,13]) uniting scientists to address these issues.
Tensors are algebraic objects that describe linear and multi-linear relationships. Tensors can be
represented as multidimensional arrays. A tensor is denoted by its rank from 0 upwards. Each rank
represents an array of a particular dimension. This idea is shown in Table 1 that identifies the tensor
rank, its mathematical entity, and an example realization using the Python language [14], using Python
lists to hold the data (in the examples, using integer numbers). A scalar value representing a magnitude
(e.g., the speed of a moving object) is a tensor of rank 0. A rank 1 tensor is a vector representing a

magnitude and direction (e.g., the velocity of a moving object: Speed and direction of motion). Matrices
(n × m arrays) have two dimensions and are rank 2 tensors. A three-dimensional (n × m × p) array can
be visualized as a cube and is a rank 3 tensor. Tensors with ranks greater than 3 can readily be created,
and analysis of the data they hold would be performed by accessing the appropriate element within
the tensor and performing a suitable mathematical operation before storing the result in another tensor.
In a physical realization of this process, the tensor data would be stored in a suitable size
memory, the data would be accessed (typically using a software programmed processor), and the
computation would be undertaken using fixed- or floating-point arithmetic. This entire process should,
ideally, stream data contiguously, and ideally anticipate where cache memory misses might occur,
thus minimizing overhead up and down the memory hierarchy. For example, in an implementation
using cache memory, an L1 cache memory miss could also miss in L2, L3, and page memory.

Table 1. Tensor rank (0 to n) with an example code using Python lists.

Rank Mathematical Entity Example Realization in Python Code


0 Scalar (magnitude only) A=1
1 Vector (magnitude and direction) B = [0,1,2]
2 Matrix (two dimensions) C = [[0,1,2], [3,4,5], [6,7,8]]
3 Cube (three dimensions) D = [[[0,1,2], [3,4,5]], [[6,7,8], [9,10,11]]]
n n dimensions ...
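The rank examples in Table 1, together with the element-wise access pattern described above, can be sketched in Python (the scalar operation chosen here, squaring, is arbitrary and purely illustrative):

```python
# Sketch of the process described in the text: access each element of a
# rank 2 tensor, apply a scalar operation, and store the result in
# another tensor of the same shape.
C = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]          # rank 2 tensor from Table 1

result = [[0] * 3 for _ in range(3)]           # destination tensor
for i in range(3):
    for j in range(3):
        result[i][j] = C[i][j] * C[i][j]       # example scalar operation
```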

A tensor rank, or a tensor’s dimensionality, can be thought of in at least two ways. The more
traditional way being, as the number of rows and columns change in a matrix, so does the dimension.
Even with that perspective, computation methods often decompose such matrices into blocks.
Conceptually, this can be thought of as “lifting” the dimension of an array. Further blocking “lifts”
the dimension even more. Another way of viewing a tensor’s dimensionality is by the number of
arguments in a function input over time. The most general way to view dimensionality is to combine
these attributes. The idealized methods for formulating architectural components are chosen to match
the arithmetic and memory access patterns of the algorithms under investigation. In this paper,
the n-dimensional inner and outer products are considered. Thus, in this case, what might be thought
of as a two-dimensional problem can be lifted to perhaps eight or more dimensions to reflect a physical
implementation, considering the memory as registers, the levels of cache memory, RAM (random
access memory), and HDD (hard disk drive). With that formulation, it is possible to create deterministic
cost functions validated by experimentation, and, ideally, an idealized component construction can be
realized that meets desired goals, such as heat dissipation, time, power, and hardware cost.
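Blocking as dimension "lifting" can be sketched in plain Python (sizes chosen arbitrarily for illustration): a 4 × 4 matrix decomposed into 2 × 2 blocks becomes a 4-dimensional array.

```python
# "Lifting" a matrix by blocking: blocked[bi][bj][i][j] is element
# (i, j) of block (bi, bj), so a rank 2 tensor becomes rank 4.
N, BLK = 4, 2
M = [[i * N + j for j in range(N)] for i in range(N)]   # 4 x 4 matrix

blocked = [[[[M[bi * BLK + i][bj * BLK + j] for j in range(BLK)]
             for i in range(BLK)]
            for bj in range(N // BLK)]
           for bi in range(N // BLK)]
```

Further blocking each 2 × 2 block would lift the dimension again, which is how a nominally two-dimensional problem can grow to eight or more dimensions once registers, cache levels, RAM, and disk are modeled explicitly.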
When an algorithm is run, the hardware implementing the algorithm will access available memory.
In Figure 2, a prototypical graph of how an algorithm that does not have cache memory misses or
page memory faults is presented. The shape of the graph changes as it moves through the memory
hierarchy. This identifies the time requirements associated with the different memories from L1 cache
memory through to disk (HDD). Note the change in the slope with memory type. The slope reflects
how attributes, such as speed, cost, and power, would affect performance. Algorithm execution
(memory access) time is, however, relative to the L1 cache memory chosen. For example, it could be
nanoseconds, microseconds, milliseconds, seconds, minutes, or hours as the data moves further up the
memory hierarchy. Often, performance is related to a decrease in arithmetic operations, i.e., a reduction
of arithmetic complexity. In an ideal computing environment, where memory and computation would
have the same costs, this would be the case. Unfortunately, it is also necessary to be concerned with the
cost of data input/output (I/O). In parallel to this, it is a necessity to consider memory access patterns
and how these relate to the levels of memory. Pre-fetching is one way to alleviate delays. However,
often the algorithm developer must rely on the attributes of a compiler and hope the compiler is
pre-fetching data in an optimum manner. The developer must trust that this action is performed
correctly. This is becoming harder to achieve given that machines are becoming ever more complex

and compiler writers are getting scarcer. Empirical methods of experimentation reveal graphs, such as the one shown in Figure 2. Such diagnostic methods allow the algorithm developer to observe the performance of a particular algorithm running on a machine. It is then possible to look at memory speed, size, and other cost factors to put together a model of how performance might be improved through "dimension lifting". That said, the goal is always to try to keep the slope linear, i.e., to minimize the slope of the linear part of a polynomial curve.

[Figure 2 plot: algorithm execution time (increasing) vs. memory level (L1, L2, L3, RAM, DISK); memory access time is relative to the L1 cache memory chosen.]

Figure 2. Algorithm execution (memory access) time vs. memory hierarchy.

Presently, the goal is to achieve a situation where the graph is polynomial, avoiding exponential behavior, such as the one in Figure 2, using HDDs. A co-design approach, complemented with dimension lifting and analysis, as discussed above, can be used to calculate upper and lower bounds of algorithms relative to their data size, memory access patterns, and arithmetic. The goal is to ensure performance stays as linear as possible. This type of information gives the algorithm developers insight into what memories to select for use, i.e., what type and size of memory should be used to keep the slope constant. This, of course, would include pre-fetching, buffering, and timings to feed the prior levels at memory speed. If this is not possible, given the available memory choices, the slope change can be minimized.
2.3. Machine Learning, Deep Learning, and Tensors

Tensor and machine learning communities have provided a solid research infrastructure, reaching from the efficient routines for tensor calculus to methods of multi-way data analysis, i.e., from tensor decompositions to methods for consistent and efficient estimation of parameters of probabilistic models. Some tensor-based models have the characteristic that, if there is a good match between the model and the underlying structure in the data, the models are much more interpretable than alternative techniques. Their interpretability is an essential feature for the machine learning techniques to gain acceptance in the rather engineering intensive fields of automation and control of cyber-physical systems. Many of these systems show intrinsically multi-linear behavior, which is appropriately modeled by tensor methods, and tools for controller design can use these models. The calibration of sensors delivering data and the higher resolution of measured data will have an additional impact on the interpretability of models.

Deep learning is a subfield of machine learning that supports a set of algorithms inspired by the structure and function of the human brain. Tensorflow™ [15], PyTorch [16], Keras [17], MXNet [18], The Microsoft Cognitive Toolkit (CNTK) [19], Caffe [20], Deeplearning4j [21], and Chainer [22] are machine learning frameworks that are used to design, build, and train deep learning models. Such frameworks continue to emerge. These frameworks support numerical computations on multidimensional data arrays, or tensors, e.g., point-wise operations, such as add, sub, mul, pow, exp, sqrt, div, and mod. They also support numerous linear algebra operations, such as Matrix-Multiply, Kronecker Product, Cholesky Factorization, LU (Lower-Upper) Decomposition, singular-value
decomposition (SVD), and Transpose. The programs would be written in various languages, such as
Python, C, C++, and Java. These languages also include libraries/packages/modules that have been
developed to support high-level tensor operations, in many cases under the umbrellas of machine
learning and deep learning.
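A brief sketch of the two families of operations named above, assuming NumPy (the numerical library named in Section 2.4) purely for illustration; any of the frameworks listed expose equivalents:

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0]])
Y = np.array([[5.0, 6.0], [7.0, 8.0]])

# Point-wise operations: add followed by sqrt, applied element by element.
pointwise = np.sqrt(np.add(X, Y))

# Linear algebra operations of the kind listed in the text.
matmul    = X @ Y          # Matrix-Multiply
kron      = np.kron(X, Y)  # Kronecker Product
transpose = X.T            # Transpose
```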

2.4. Tensor Hardware


Google’s introduction of a Tensor Processing Unit (TPU) [23] that works in conjunction with
TensorFlow emphasizes that there is a need for fast tensor computation. That need will only grow
exponentially as the use of AI increases. Consequently, what would an idealized processor for
tensors look like? What would idealized software defined hardware look like? What are important
pervasive algorithms? Two workshops, one at the NSF (National Science Foundation) [12] in America,
and another at Dagstuhl [13], validated and promoted how tensors are used in numerous domains,
considering AI and IoT in general. Charles Van Loan, a co-organizer of the NSF Workshop, emphasized
the importance of The Kronecker Product. He called it the Product of the Times. The algorithm (circuit)
presented herein is foundational to this very important algorithm. The goal is to develop designs that
could be used to build a Universal Algebraic Unit© (UAU©) that could support all the mathematics in
numerical libraries, such as NumPy, which most, if not all, applications mentioned above, use and
rely on for performance. There are two challenges in the design and development of applications that
require tensor support: Optimal software and hardware, necessitating a co-design approach. Due to
the ubiquitous nature of tensors, a co-design approach is used to achieve the goals of the work.

2.5. Contribution of this Paper


This paper demonstrates the Matrix Multiplication and Kronecker Product that are both built
from a common algorithm, the outer product. This design is unique in that it provides:

• a general approach to the inner and outer product, n-dimensional, 0 ≤ n;
• a general approach relative to scalar operations other than + and ×; and
• a demonstration of how the design enables speed-up for Kronecker Products.

The design presented in this paper is for an n-dimensional inner and outer product, e.g., for 2-d
matrix multiply, which builds upon the scalar operations of + and × [24]. Some operations may be
realized in hardware, firmware, or software. This generalized inner product is defined using reductions
and outer products [24], and reduces to three loops independent of conformable argument dimensions
and shapes. This is due to Psi Reduction, where it is possible to, through linear and multilinear
transformations, reduce an array expression to a normal form. Then, through “dimension lifting” of a
normal form, idealized hardware can be realized, where the size of buffers relative to the speed and size of
connecting memories, DMA (Direct Memory Access) hardware (contiguous and strided), and other
memory forms can be chosen when a problem size is known, or conjectured, and details of the hardware
are available and known.
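The three-loop structure of the generalized inner product described above can be sketched as follows (an illustrative Python sketch, not the IP core itself; the helper name general_inner and its parameters are this sketch's own):

```python
# Generalized inner product: three loops, independent of the
# higher-dimensional shapes, with the scalar pair (*, +) replaced by
# any combine/reduce pair of scalar operations.
def general_inner(A, B, combine, reduce_op, identity):
    n, m, p = len(A), len(B[0]), len(B)
    C = [[identity] * m for _ in range(n)]
    for i in range(n):            # loop 1: rows of A
        for j in range(m):        # loop 2: columns of B
            for k in range(p):    # loop 3: the reduced dimension
                C[i][j] = reduce_op(C[i][j], combine(A[i][k], B[k][j]))
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

# Ordinary matrix multiply: combine with *, reduce with +.
ordinary = general_inner(A, B, lambda a, b: a * b, lambda a, b: a + b, 0)

# The same circuit structure with (+, min): the min-plus (tropical) product.
minplus = general_inner(A, B, lambda a, b: a + b, min, float("inf"))
```

Swapping the scalar pair changes only the two operation arguments, not the loop structure, which is the point of defining all scalar operations in a single circuit.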

2.6. The Kronecker Family of Algorithms


With an ability to build an idealized Kronecker Product, it is possible to address multiple
Kronecker Products, parallel Kronecker Products, and outer products of Kronecker Products.
These algorithms are used throughout the models built by mathematicians. Moreover, they are
often used many times in sequence, necessitating an optimization study. If strides are required, as in
the classical approach, performance will suffer. The Kronecker Product is viewed as an outer product,
no matter how many there are in an expression. Consequently, it is not necessary to be concerned with
strided access until the end when the outer product result is transposed and reshaped, thus saving
energy and time. It is then possible to capitalize on contiguous access streaming from component to
component. The analysis may consider time, space, speed, and other parameters, such as energy and
heat, to determine cost.
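The access pattern argued for above can be sketched with NumPy (assumed here for illustration; the paper's design is in hardware): compute the Kronecker Product as a contiguous outer product, deferring all strided access to a single transpose-and-reshape at the end.

```python
import numpy as np

A = np.arange(1, 5).reshape(2, 2)
B = np.arange(5, 9).reshape(2, 2)

# Outer product over flattened arguments: contiguous reads and writes.
outer = np.outer(A.ravel(), B.ravel())              # shape (4, 4)

# One final strided step: transpose and reshape into the Kronecker layout.
kron = outer.reshape(2, 2, 2, 2).transpose(0, 2, 1, 3).reshape(4, 4)
assert np.array_equal(kron, np.kron(A, B))
```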

3. Memory Considerations for Tensor Analysis

3.1. Introduction
In order to understand memory considerations, it is important to understand the algorithms that
dominate tensor analysis: Inner Products (Matrix Multiply), and Outer Products (Kronecker or Tensor
Product). Others include transformations and selections of components. Models in AI and IoT [13]
are dominated by multiple Kronecker Products, parallel Kronecker Products (Khatri-Rao), and outer
products of Kronecker Products (Tracy-Singh), in conjunction with compressions over dimensions.
Memory access patterns are well known. Moving on from an algorithmic specification to an optimized
software or hardware instantiation of that algorithm requires maximizing the data structures that
represent the algorithm in conjunction with the memory(ies) of a computer.
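As an illustration of the products named above (assuming NumPy; the variable names are this sketch's own), the Khatri-Rao product is the column-wise Kronecker Product of two matrices with the same number of columns:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])          # 2 x 2
B = np.array([[0, 1], [2, 3], [4, 5]])  # 3 x 2

# Column j of the result is kron(A[:, j], B[:, j]); shape (2*3, 2).
khatri_rao = np.stack([np.kron(A[:, j], B[:, j]) for j in range(A.shape[1])],
                      axis=1)
```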

3.2. Computer Memory Access


From the onset of computing, computer scientists and mathematicians have discussed the
complexity of an algorithm that translates to finding the least amount of arithmetic to perform.
In an ideal world, where memories had the same speed no matter where they were, the computation
effort would be based on the complexity of the algorithm. In fact, in the early days of computing,
that was the case where memory was only one clock cycle away from the CPU (central processing
unit). This is not true now. Now, what matters is the least amount of arithmetic and an optimal use of
memory. From an engineering point of view, this means an understanding of the algorithm operation
from a memory access pattern perspective. Moreover, through that understanding, it is possible to
achieve optimal, predictable, and reproducible performance.

3.3. Cache Memory: Memory Types and Caches in a Typical Processor System
Over the years, memory has become faster in conjunction with memory types becoming
more diverse. Architectures now support multiple, non-uniform memories, multiple processors,
and multiple networks, and those architectures are combined to form complex, multiple networks.
In an IoT application, there may be a case that one application requires the use of a substantial portion
of the available resources and those resources must have a reproducible capacity. Figure 3 presents
a view of the different memories that may be available in an IoT application, from processor to the
Cloud. This view is based on a software programmed processor approach. Different memory types
(principle of operation, storage capacity, speed, cost, ability to retain the data when the device power
supply is removed (volatile vs. non-volatile memory), and physical location in relation to the processor
core) would be considered based on the system requirements. The fastest memories with the shortest
read and write times would be closest to the processor core, and are referred to as the cache memory.
Figure 3 considers the cache memory as three levels (L1, L2, and L3), where the registers are closest to
the core and on the same integrated circuit (IC) die as the processor itself before the cache memory
would be accessed. L1 cache memory would be SRAM (static RAM) fabricated onto the same IC
die as the processor, and would be limited in the amount of data it could hold. The registers and
L1 cache memory would be used to retain the data of immediate use by the processor. External to
the processor would be external cache memory (L2 and L3), where this memory may be fast SRAM
with limited data storage potential or slower dynamic RAM (DRAM) that would have a greater data
storage potential. RAM is volatile memory, so for data retention when the power supply is removed,
non-volatile memory types would be required: EEPROM (electrically erasable programmable read
only memory), Flash memory based on EEPROM, and HDD would be local memory followed by
external memory connected to a local area network (a “network drive”) and Cloud memory. However,
there are costs associated with each memory type that would need to be factored into a cost function
for the memory.
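The hierarchy just described can be captured in a toy cost model; the level names follow Figure 3, but every latency value below is an assumed order-of-magnitude placeholder, not a measurement from the paper:

```python
# Illustrative (assumed, order-of-magnitude) access times in seconds for
# the memory levels of Figure 3 -- not measured values from the paper.
LEVELS = [
    ("registers", 1e-9),
    ("L1 SRAM cache", 2e-9),
    ("L2/L3 cache", 10e-9),
    ("DRAM", 100e-9),
    ("Flash/EEPROM", 100e-6),
    ("HDD", 10e-3),
    ("LAN / Cloud storage", 1.0),
]

def total_cost(access_counts):
    """Weighted access-time cost for a profile of accesses per level."""
    return sum(n * t for (_, t), n in zip(LEVELS, access_counts))

# The same million accesses cost vastly more the further they fall
# from the processor core.
print(total_cost([1_000_000, 0, 0, 0, 0, 0, 0]))  # all served by registers
print(total_cost([0, 0, 0, 1_000_000, 0, 0, 0]))  # all served by DRAM
```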
Figure 3. Availability of memory types in IoT applications (from processor registers and on-chip L1
cache, through L2 and L3 cache, to external Flash/EEPROM, HDD, LAN memory, and Cloud memory).

3.4. Cache Misses and Implications

To help understand why cache memory misses, page faults, and other memory faults cause delays
in computation, reference to the 1-d fast Fourier transform (FFT) can be made. Theory states that a
length n FFT has an n log n complexity, and so an idealized computation time could be determined
from this assumption. However, the computation could take significantly longer to complete, depending
on a number of hardware related issues that include the availability of cache memory, the associativity
of the cache, how many levels of memory there are, and the size of the problem. For example, if a
four-way associative cache is used, a radix 4 FFT might be selected based on the size of the available
associative cache. However, suppose a radix 2 is used. If the input vector for the FFT was
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15, and the cache could fit only the first eight values, that
means that on the 4th cycle, there is a cache miss. Knowing that the data could be reshaped and
transposed to obtain data locality, i.e., 0 8 1 9 2 10 3 11 4 12 5 13 6 14 7 15, the operation could be
completed with the data locally stored. However, as the input data set size increases, this not only
results in cache misses, it also results in page faults, significantly slowing down the performance of
an algorithm, which can be viewed graphically as an exponential rise [25]. For signal processing
applications, the implication is that there will be a computation time increase.
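The reshape-and-transpose step described above can be reproduced directly (illustrative NumPy, not the paper's code):

```python
import numpy as np

# The 16-point input vector from the example above.
x = np.arange(16)
# Reshape to two rows of eight and transpose, so that the pairs
# (k, k + 8) that a radix-2 stage combines become adjacent in memory.
local = x.reshape(2, 8).T.ravel()
print(local.tolist())  # [0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15]
```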
3.5. Cost Functions for Memory Access

Usually, cost functions are based on statistical methods. However, the analysis used in this work
creates a Normal Form that depicts the levels of memory desired relative to the access patterns of
algorithms. With this, it is possible, a priori, to define the implementation requirements, such as
maximum heat dissipation, power, cost, and time. With this information, as an FPGA designer, it is
possible to utilize the available hardware resources and add the right types and levels of memory,
the number of FPGAs linked together, and use FPGAs with other forms of processing unit. Such
considerations would come from knowledge of the hardware and knowledge of algorithm
requirements. Through experimental methods, developed by one of the co-authors, it can be seen that
each level of memory as a Normal Form moves through the memory relative to its access patterns
and arithmetic. What can be seen for any algorithm is that the curves, referring to Figure 2, start out
constant, but then move to become a linear curve(s) while computation is still in real memory. Then,
it is noticed that for each small piece of linearity, the slope gets steeper, indicating a change in memory
speed. Thus, an evolution of a polynomial curve is seen that finally goes exponential when the access
is to HDD. In parallel, if the available sizes and speeds of the various architectural components
available, such as registers, buffers, and memories, are known, then it is possible to "dimension lift"
the Normal Form to include all these attributes. Thus, performance can be predicted and verified via
suitably designed experiments.
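A minimal sketch of this piecewise cost behaviour, assuming invented level capacities and per-access costs (none of these numbers come from the paper, and this is a simplification of the Normal Form analysis, not a reproduction of it):

```python
def access_cost(accesses, levels):
    """Serve accesses from the fastest level first; overflow spills to
    slower levels, so the overall cost curve is piecewise with an
    ever-steeper slope (toy model of the behaviour described above)."""
    total, remaining = 0.0, accesses
    for capacity, per_access_cost in levels:
        served = min(remaining, capacity)
        total += served * per_access_cost
        remaining -= served
        if remaining <= 0:
            break
    return total

# (capacity in accesses, relative cost per access) per level: assumed values.
LEVEL_MODEL = [(1_000, 1.0), (10_000, 5.0), (100_000, 50.0), (float("inf"), 5_000.0)]

for n in (500, 5_000, 50_000, 500_000):
    print(n, access_cost(n, LEVEL_MODEL))
```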

4. The Field Programmable Gate Array (FPGA)

4.1. Introduction
In this section, the FPGA is introduced as a configurable hardware device that has an internal
circuit structure that can be configured to different digital circuit or system architectures. It can,
for example, be configured to implement a range of digital circuits from a simple combinational
logic circuit through to a complex processor architecture. With the available hardware resources and
ability to describe the circuit/system design using a hardware description language (HDL), such as
VHDL or Verilog [26], the designer can implement custom design architectures that are optimized
to a set of requirements. For example, it is possible to describe a processor architecture using VHDL
or Verilog, and to synthesize the design description using a set of design synthesis constraints into a
logic description that can then be targeted to a specific FPGA device (design implementation, “place
and route”). This processor, which is hardware, would then be connected to a memory containing a
program for the processor to run; the memory may be registers (flip-flops), memory macros available
within the FPGA, or external memory devices connected to the pins of the FPGA. Therefore, it would
be possible to implement a hardware only design or a hardware/software co-design. In addition,
if adequate hardware resources were available, more than one processor could be configured into the
FPGA and a multi-processor device therefore developed.

4.2. Programmable Logic Devices (PLDs)


The basic concept of the PLD is to provide a programmable (configurable) IC that enables the
designer to configure logic cells and interconnect within the device itself to form a digital electronic
circuit that is housed within a single packaged IC. In this, the hardware resources (the available
hardware for use by the designer) will be configured to implement a required functionality. By changing
the hardware configuration, the PLD will perform a different function. Hardware configured PLDs are
becoming increasingly popular due to potential benefits in terms of logic obsolescence management,
rapid prototyping capabilities in digital ASIC design (early stage prototyping, design debugging,
and performance evaluation), and design speed, where PLD based hardware can implement
the same functions as a software programmed processor, but in a reduced time.
operations can be built into the PLD circuit configuration that would otherwise be implemented
sequentially within a processor. This is particularly important for computationally expensive
mathematical operations, such as the FFT, digital filtering, and other mathematical operations that
require complex data sets to be analyzed in a short time. Table 2 summarizes available devices and
their vendors. It is not, however, a trivial task to select the right device for a specific application or
range of applications.

Table 2. PLD vendors and devices [27].

Vendor                         | FPGA                                         | SPLD/CPLD                                   | Company Homepage
Xilinx®                        | Virtex, Kintex, Artix and Spartan            | CoolRunner-II, XA CoolRunner-II and XC9500XL | https://www.xilinx.com/
Intel®                         | Stratix, Arria, MAX, Cyclone and Enpirion    | —                                           | https://www.intel.com/content/www/us/en/fpga/devices.html
Atmel Corporation (Microchip)  | AT40Kxx family FPGA                          | ATF15xx, ATF25xx, ATF75xx CPLD families and ATF16xx, ATF22xx SPLD families | https://www.microchip.com/
Lattice Semiconductor          | ECP, MachX and iCE FPGA families             | ispMACH CPLD family                         | http://www.latticesemi.com/
Microsemi                      | PolarFire, IGLOO, IGLOO2, ProASIC3, Fusion and Rad-Tolerant FPGA families | —              | https://www.microsemi.com/product-directory/fpga-soc/1638-fpgas

4.3. Hardware Functionality within the FPGA


Each FPGA provides a set of hardware resources available to the designer, where the use of specific
resources would be considered to obtain a required performance in a specific application. However,
this does rely on the use of the correct FPGA for the application and the knowledge of the designer in
using these available hardware resources.
There are specific advantages in selecting an FPGA for use rather than an off-the-shelf processor.
By selecting the appropriate hardware architecture, high speed DSP operation, such as digital filtering
and FFT operations, can be achieved, which might not be possible in software. This is partly due to the
ability to create a custom design architecture and partly due to concurrent operation, which means
that operations in hardware can be run in parallel as well as sequentially. A typical FPGA also has a
high number of digital input and output pins for connecting to peripheral devices with programmable
I/O standards. This allows for flexibility in the types of peripheral devices, such as memory and
communications ICs, that could be connected to the FPGA. Within the device, as well as programmable
logic circuits, built-in memories for data storage are available, which have an immediate and temporary
use, i.e., for cache memory scenarios. The DSP operations are supported using built-in hardware
multipliers, and fast fixed-point and floating-point calculations can be implemented. In some FPGAs,
built-in analog-to-digital converters (ADCs) are available for analog input sampling as well as IP
blocks, such as FFT and digital filter blocks. These resources give the ability to develop a custom
design architecture suited to the specific application. The FPGA is configured by downloading a
design configuration as a sequence of binary logic values (sequence of 0’s and 1’s). The configuration
would be initially created as a file using the FPGA design tools that is then downloaded into the device.
The configuration values are stored in memory within the device, where the memory may be volatile
or non-volatile:

• Volatile memory: When data are stored within the memory, the data are retained in the memory
whilst the memory is connected to a power supply. Once the power supply has been removed,
then the contents of the memory (the data) are lost. The early FPGAs utilized volatile SRAM
based memory.
• Non-volatile memory: When data are stored within the memory, the data are retained in the
memory even when the power supply has been removed. Specific FPGAs available today utilize
Flash memory for holding the configuration.

5. Inner and Outer Product Implementation in Hardware Using the FPGA Case Study

5.1. Introduction
In this section, the design, simulation, and physical prototype testing of a single IP core that
implements the inner and outer products are presented. The idea here is to have a hardware macro cell,
or core, that can be accessed from an external digital system (e.g., a software programmed processor
that can pass the computation tasks to this cell whilst it performs other operations in parallel). The input
array data are stored as constants within arrays in the ipOpCore module, as shown in Figure 4, and are
therefore, in this case, read-only. However, in another application, it would be necessary to
allow the arrays to be read-write for entering new data to be analyzed, and the design would
be modified to allow array A and B data to be loaded into the core, either as serial or parallel data.
Hence, the discussion provided in this section relates to the specific case study. In addition, a single
result output could be considered and the need for test data output might not be a requirement.
The motivation behind this work is to model tensors as multi-dimensional arrays and to analyze these
using tensor analysis in hardware. This requires a suitable array access algorithm to be developed,
the use of suitable memory for storing data in a specific application, and a suitable implementation
strategy. In this paper, only the inner and outer products are considered, using the FPGA as the
target device; an efficient algorithm to implement the inner and outer products in a single circuit

implemented in hardware is used, and appropriate embedded FPGA memory resources to enable fast
memory access are used.

Figure 4. ipOpCore case study design.

The design shown in Figure 4 was created to allow for both product results to be independently
accessed during normal runtime operation and for specific internal data to be accessed for test and
debug purposes. The design description was written in VHDL as a combination of behavioral, RTL,
and structural code targeting the Xilinx® Artix-7 (XC7A35TICSG324-1L) FPGA. This specific device
was chosen for practical reasons as it contains hardware resources suited for this application. The
design, however, is portable and is readily transferred to other FPGAs, or to be part of a larger digital
ASIC design, if required. For any design implementation, the choice of hardware, and potentially
software, to use would be based on a number of considerations. The FPGA was mounted on the
Digilent® Arty A7-35T Development Board and was chosen for the following reasons:

The FPGA considered is used in other project work and as such, the work described in this paper
could readily be incorporated into these projects. Specifically, sensor data acquisition using the FPGA
and data analysis within the FPGA projects would benefit from this work where the algorithm and
memory access operations used in this paper would provide additional value to the work undertaken.

1. The development board used provided hardware resources that were useful for project work,
such as the 100 MHz clock, external memory, switches, push buttons, light emitting diodes
(LEDs), expansion connectors, LAN connection, and a universal serial bus (USB) interface for
FPGA configuration and runtime serial I/O.
2. The development board was physically compact and could be readily integrated into an enclosure
for mobility purposes and operated from a battery rather than powered through the USB
+5 V power.
3. The Artix-7 FPGA provided adequate internal resources and I/O for the work undertaken and
external resources could be readily added via the expansion connectors if required.
4. For memory implementation, the FPGA can use the internal look-up tables (LUTs) as distributed
memory for small memories, can use internal BRAM (Block RAM) for larger memories, and
external volatile/non-volatile memories connected to the I/O.
5. For computation requirements, the FPGA allows for both fixed-point and floating-point
arithmetic operations to be implemented.
6. For an embedded processor based approach, the MicroBlaze CPU can be instantiated within the
FPGA for software based implementations.

The I/O for this module are as follows:


ipOp: User to select whether the inner or outer product is to be performed.
clock: Master clock (100 MHz).
resetN: Master, asynchronous active low reset.
addrA: Array A address for reading array contents (input tensor A).
addrB: Array B address for reading array contents (input tensor B).
addrPC: Array PC address for reading array contents (product code array).
addrResIp: Address of inner product for reading array contents (output tensor IP).
addrResOp: Address of outer product for reading array contents (output tensor OP).
dataA: Array A data element being accessed (for test purposes only).
dataB: Array B data element being accessed (for test purposes only).
productCode: Input array size and shape information for algorithm operation.
dataResIp: Inner product result array contents (serial read-out).
dataResOp: Outer product result array contents (serial read-out).
These I/O signals can be categorized as input control, input address, and output data.

5.2. Design Approach and Target FPGA


The operation of the combined inner and outer product is demonstrated by reference to a case
study design that implements the necessary memory and algorithms functions within a single IP core.
Given that these functions are to be mapped to a custom design architecture and configured within the
FPGA, a range of possible solutions can be created. The starting point for the design is the computation
to perform. Consider the tensor product of two arrays (A and B), where A is a 3 × 3 array and B is
a 3 × 2 array. For demonstration purposes, the numbers are limited to being 8-bit signed integers
rather than real numbers. The principle of evaluation is the same for both number types, but the HDL
coding style to be adopted would be different. Therefore, the possible numbers considered would
be integer values in the range of −128 to +127 (decimal). Internally within the VHDL code, these values
were modelled as INTEGER data types that were suitable for simulation and synthesis. For synthesis,
the integer numbers were translated to an 8-bit wide STD_LOGIC_VECTOR data type. This meant
that the physical digital circuit utilized an 8-bit data bus and this size bus was selected as a standard
width for all array input addresses and output data. Fixed-point, 20 s complement arithmetic was also
implemented. Whilst the data range was limited in size, this approach was chosen as the purpose of the
work was to implement and demonstrate the algorithm and memory utilization. The VHDL code was
written such that the data range and array sizes were readily adjusted within the array definitions and
no modification to the algorithm code was required. Floating-point arithmetic rather than fixed-point
arithmetic could be used by coding a floating-point multiplier for matrix multiplication operations
(e.g., [28,29]), and modelling the data as floating point numbers rather than simple fixed-point scalar
numbers as used here. Considering arrays A and B, these two arrays can be operated on to form the
tensor product as both the inner product and the outer product:
   
0 1 2 0 1
A= 3 4 5  B= 2 3 
   
6 7 8 4 5

The tensor product for A and B is noted as:

C = A ⊗ B

The result of the inner product, Cip , is:


 
              | 10 13 |
Cip = A ⊗ B = | 28 40 |
              | 46 67 |

The result of the outer product, Cop , is:


 
              |  0  0  0  1  0  2 |
              |  0  0  2  3  4  6 |
              |  0  0  4  5  8 10 |
              |  0  3  0  4  0  5 |
Cop = A ⊗ B = |  6  9  8 12 10 15 |
              | 12 15 16 20 20 25 |
              |  0  6  0  7  0  8 |
              | 12 18 14 21 16 24 |
              | 24 30 28 35 32 40 |
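Both results can be checked with a short Python sketch over flat row-major lists, in the spirit of (but not reproducing) the authors' C and Python models:

```python
# Flat, row-major storage mirroring the VHDL arrays in the case study
# (an illustrative sketch, not the authors' code).
A = [0, 1, 2, 3, 4, 5, 6, 7, 8]   # 3 x 3 input tensor A
B = [0, 1, 2, 3, 4, 5]            # 3 x 2 input tensor B

def inner_product(a, b, n, m, p):
    """(n x m) . (m x p) matrix multiply over flat row-major lists."""
    c = [0] * (n * p)
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i * p + j] += a[i * m + k] * b[k * p + j]
    return c

def outer_product(a, b):
    """Outer product of the flattened operands: 9 * 6 = 54 elements,
    matching the 54-element arrayResultOp in the VHDL code."""
    return [x * y for x in a for y in b]

print(inner_product(A, B, 3, 3, 2))  # [10, 13, 28, 40, 46, 67]
```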

The above products were initially developed using C and Python coding where the data in C were
stored in arrays and in Python were stored in lists. The combined inner/outer product algorithm was
verified through running the algorithm with different data sets and verifying the software simulation
model results with manual hand calculation results. Once the software version of the design was
verified, the Python code functionality was manually translated to a VHDL equivalent. The two key
design decisions to make were:
1. How to model the arrays for early-stage evaluation work and how to map the arrays to hardware
in the FPGA.
2. How to design the algorithm to meet timing constraints, such as maximum processing time,
number of clock cycles required, hardware size considerations, and the potential clock frequency,
with the hardware once it is configured within the FPGA.
In this design, the data set was small and so VHDL arrays were used for both the early-stage
evaluation work and for synthesis purposes. In VHDL, the input and results arrays were defined and
initialized as follows:
TYPE array_1by4 IS ARRAY (0 TO 3) OF INTEGER;
TYPE array_1by6 IS ARRAY (0 TO 5) OF INTEGER;
TYPE array_1by9 IS ARRAY (0 TO 8) OF INTEGER;
TYPE array_1by36 IS ARRAY (0 TO 35) OF INTEGER;
TYPE array_1by54 IS ARRAY (0 TO 53) OF INTEGER;

CONSTANT arrayA : array_1by9 := (0, 1, 2, 3, 4, 5, 6, 7, 8);


CONSTANT arrayB : array_1by6 := (0, 1, 2, 3, 4, 5);

SIGNAL arrayResultIp : array_1by6 := (0, 0, 0, 0, 0, 0);


SIGNAL arrayResultOp : array_1by54 := (0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0);

These are one-dimensional arrays suited for ease of memory addressing, appropriate for the
algorithm operation, synthesizable into logic, and have a direct equivalence in the C and Python

software models. The input arrays (arrayA and arrayB) contain the input data. The results arrays
(arrayResultIp (inner product) and arrayResultOp (outer product)) were initialized with 0’s. It was
not necessary, in this case, to map to any embedded BRAM or external memory as the data set size
was small and easily mapped by the synthesis tool to distributed RAM within the FPGA. The PC
(product code) array is not shown above, but this is an array that contains the shape and size of arrays,
A and B. For the algorithm, with direct mapping to VHDL from the Python code, the inner product
and outer product each required a set number of clock cycles. Figure 5 shows a simplified timing
diagram identifying the signals required to implement the inner/outer product computation. Once the
computation has been completed, the array contents could then be read out one element at a time.
For evaluation purposes, all array values were made accessible concurrently, but could readily be
made available serially via a multiplexor arrangement to reduce the number of output signals
required in the design.
A computation run would commence with the run control signal being pulsed 0-1-0 with the
product selection input ipOp set to either logic 0 (inner product) or logic 1 (outer product). In this
implementation, the inner product required 18 clock cycles and the outer product required 54 clock
cycles to complete. The array data read-out operations are not, however, shown in Figure 5. The data
values were defined using the INTEGER data type for modelling and simulation purposes, and these
values were mapped to an 8-bit STD_LOGIC_VECTOR data type for synthesis into hardware. The 8-bit
width data bus was sufficient to account for all data values in this study.

Figure 5. Computation control signal timing diagram.
5.3. System Architecture
How the memory and algorithm would generally be mapped to a hardware-only or a
hardware/software co-design would be dependent on the design requirements, specification
resulting from the requirements identification, available hardware, and the designer. Therefore, a
range of possible solutions would be possible, but in this design, a hardware-only solution was a
design requirement. The memory was modelled as VHDL arrays, and the algorithm was
implemented using a counter and state machine arrangement. Both the inner and outer products
were to be selectable for computation that required a design decision as to whether a single memory
space for both products or separate memory spaces for each product would be suitable. Given the
relatively small size of the data set and to support design evaluation, separate memory spaces for the
inner and outer products were developed. However, an alternative implementation could utilize a
single memory space. The system architecture is shown in Figure 6. Here, the ipOpCore module
implements the memory computation (2’s complement number multiplication) whilst the control
unit module implements the system control and algorithm. The control unit module input control signals
are:
• ipOp User to select whether the inner or outer product is to be performed;
• clock Master clock (100 MHz);
Electronics 2018, 7, 320 15 of 24

• resetN Master, asynchronous active low reset; and
• run Initiate a computation run (0-1-0 pulse).
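The handshake on these signals can be sketched as a small behavioral model. This is an assumption-laden Python illustration (the class shape and method names are ours, not the VHDL identifiers); it shows an asynchronous active-low reset, and a 0-1-0 run pulse starting a computation whose length depends on ipOp:

```python
# Hypothetical model of the control unit handshake: a rising edge on run
# starts a computation occupying a fixed number of clock cycles (18 for the
# inner product, 54 for the outer product in this design).

class ControlUnit:
    def __init__(self):
        self.busy = False
        self.cycles_left = 0
        self.prev_run = 0

    def clock_edge(self, run, ip_op, reset_n=1):
        if reset_n == 0:                       # asynchronous active-low reset
            self.busy, self.cycles_left, self.prev_run = False, 0, 0
            return
        if self.busy:                          # computation in progress
            self.cycles_left -= 1
            self.busy = self.cycles_left > 0
        elif self.prev_run == 0 and run == 1:  # rising edge of the run pulse
            # ipOp: logic 0 selects the inner product, logic 1 the outer
            self.cycles_left = 54 if ip_op == 1 else 18
            self.busy = True
        self.prev_run = run

cu = ControlUnit()
cu.clock_edge(run=1, ip_op=0)   # the 0-1-0 pulse is sampled here
cu.clock_edge(run=0, ip_op=0)   # pulse returns low; computation continues
busy_cycles = 1                 # the edge above was the first busy cycle
while cu.busy:
    cu.clock_edge(run=0, ip_op=0)
    busy_cycles += 1
print(busy_cycles)              # 18
```
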
Figure 7 shows a simplified view of the elaborated VHDL code schematic that was generated by the Xilinx® Vivado v2015.3 (HL WebPACK Edition) software. This schematic shows the two modules (ipOpCore (I0) and controlUnit (I1)) that connect together to form the top-level design with 44 inputs and 40 outputs. The target FPGA was the Xilinx® Artix-7 mounted on the Digilent® Arty A7-35T Development Board. This board is shown in Figure 8, which identifies the key features of the board used, and it provided a convenient hardware platform to undertake the required design development and experiments. The FPGA was provided with an on-board 100 MHz clock module for the clock, and the resetN signal was provided by one of the available on-board push buttons. On the board, the array address and data signals would be available internally within the FPGA (to connect to a system that would be integrated within the FPGA alongside this design) or to the external header pins on the development board (for connecting to another system external to the FPGA).

Figure 6. Simplified system block diagram.

Figure 7. Simplified schematic view of the elaborated VHDL code.

Figure 8. Xilinx® Artix-7 FPGA on the Digilent® Arty board identifying key components used in the experimentation.
The design must eventually be implemented within the FPGA and this is a two-step process. Firstly, the VHDL code is synthesized and then the synthesized design is implemented in the target FPGA. The synthesis and implementation operations can be run using the default settings, or the user can set constraints to direct the tools. In this case study, the default tool settings were used and Table 3 identifies the hardware resources required after synthesis and implementation for the design.

Table 3. Artix-7 FPGA resource utilization in the case study design.
Item                       Use                                  Number Used
Package pin                Input                                44
                           Output                               40
Design synthesis results
Post-synthesis I/O         Inputs                               23 *
                           Outputs                              40
Slice LUTs                 Total used                           454
                           LUT as logic                         442
                           LUT as memory (distributed RAM)      12
Slice registers            Total used                           217
                           Slice register as flip-flop          217
Other logic                Clock buffer                         2
Design implementation results
Post-implementation I/O    Inputs                               23
                           Outputs                              40
Slice LUTs                 Total used                           391
                           LUT as logic                         379
                           LUT as memory (distributed RAM)      12
Slice registers            Total used                           217
                           Slice register as flip-flop          217
Other logic                Clock buffer                         2

* Note that the number of inputs required in the design after synthesis does not include the address input bits that were always a constant logic 0 in this case study. This was due to the standard 8-bit address bus used for all input addresses; the sizes of the arrays meant that the most significant bits (MSBs) of the array addresses were not required. Note also that, post-implementation, the number of slice LUTs required was less than that post-synthesis.

5.4. Design Simulation

Design simulation was undertaken to ensure that the correct values were stored, calculated, and accessed. The Xilinx® Vivado software tool was used for design entry and simulation was performed using the built-in Vivado simulator. A VHDL test bench was used to perform the computation and array data read-out operations. Figure 9 shows the complete simulation run where the clock frequency in simulation was set to 50 MHz (the master clock frequency divided by two).
This simulation clock frequency was selected to allow external control signals from an external system operating at 100 MHz to be provided on the falling edge of the 50 MHz clock.

Figure 9. Simulation study results: Computation and results read-out.
read-out.

For the inner product data read-out, Figure 10 shows the simulation results for all nine product array element values (dataResIp) being read out of the arrayResultIp array. The ipOp control signal is not used (set to logic 1 in the simulation test bench) as it is only used for the computation, the clock is held at logic 0 as it is also only used for the computation, and the reset signal is not asserted (resetN = 1). The inner product array address (addrResIp) is provided to access each element in the array sequentially.
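The sequential read-out can be sketched as an address counter stepping through the result memory (illustrative Python; the six values used below are the inner product values confirmed by the simulation results):

```python
# Sketch of the sequential read-out: an address counter steps through the
# result memory one element per access, presenting the address (addrResIp in
# the text) and sampling the data (dataResIp). The memory is a plain list.

def read_out(memory):
    for addr in range(len(memory)):   # address counts 0, 1, 2, ...
        yield addr, memory[addr]      # data follows the presented address

values = [data for _, data in read_out([10, 13, 28, 40, 46, 67])]
print(values)   # [10, 13, 28, 40, 46, 67]
```
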

Figure 10. Simulation study results: Inner product.

This shows the specific results for the complete inner product as follows:

$$C_{ip} = A \otimes B = \begin{bmatrix} 10 & 13 \\ 28 & 40 \\ 46 & 67 \end{bmatrix}$$
For the outer product data read-out, Figure 11 shows the simulation results for the last 13 values (dataResOp) being read out of the arrayResultOp array. The ipOp control signal is not used (set to logic 1 in the simulation test bench) as it is only used for the computation, the clock is held at logic 0 as it is also only used for the computation, and the reset signal is not asserted (resetN = 1). The outer product array address (addrResOp) is provided to access each element in the array sequentially.

Figure 11. Simulation study results: Outer product (final set of results read-out only).

This shows the specific results for the last 13 values in the results array as follows:
$$C_{op} = A \otimes B = \begin{bmatrix}
\cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\
\vdots & & & & & \vdots \\
\cdot & \cdot & 0 & 7 & 0 & 8 \\
\cdot & \cdot & 14 & 21 & 16 & 24 \\
\cdot & 30 & 28 & 35 & 32 & 40
\end{bmatrix}$$
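The outer product structure can be illustrated in the same way. Caution: the arrays A and B used here are inferred from the published read-out values, not quoted from Section 5.2 (which is outside this excerpt); treat them as an assumption.

```python
# Inferred input arrays (see the caution in the lead-in):
A = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
B = [[0, 1], [2, 3], [4, 5]]

# Every element of A multiplies every element of B: 9 x 6 = 54 products,
# matching the 54 clock cycles quoted for the outer product computation.
C_op = [[[[A[i][j] * B[k][l] for l in range(2)] for k in range(3)]
         for j in range(3)] for i in range(3)]

total = len(A) * len(A[0]) * len(B) * len(B[0])
print(total)              # 54
# Spot-checks against values visible in the final read-out above:
print(C_op[2][1][0][1])   # A[2][1] * B[0][1] = 7 * 1 = 7
print(C_op[2][2][2][1])   # A[2][2] * B[2][1] = 8 * 5 = 40
```
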

5.5. Hardware Test Set-Up

Design analysis was in the main performed using simulation to determine correct functionality and signal timing, considering the initial design description prior to synthesis (behavioral simulation), the synthesized design (post-synthesis simulation), and the implemented design (post-implementation simulation). This is a typical simulation approach that is supported by the FPGA simulator for verifying the design operation at different steps in the design process. Given that the design is intended to be used as a block within a larger digital system, the simulation results would give an appropriate level of estimation of the signal timing and the circuit power consumption.

In addition to the simulation study, the design was also implemented within the FPGA and the signals were monitored on the development board connectors (the PmodTM (peripheral module) connectors) using a logic analyzer and oscilloscope. This test arrangement is shown in Figure 12.
Figure 12. Embedded hardware tester.

To generate the top-level design module input signals, a built-in tester circuit was developed and incorporated into the FPGA. This was a form of a built-in self-test (BIST) [30] circuit that generated the control signals identified in Figure 5 and allowed the internal array address and data signals to be accessed. The tester circuit was set up to continuously repeat the sequence in Figure 5 rather than run just once, and so did not require any user-set input control signals to operate. As the number of address and data bits required (40 address bits and 40 data bits) for the five arrays exceeded the number of PmodTM connections available, these signals were multiplexed to eight address and eight data bits within the built-in tester, and the multiplexor control signals were output for identifying the array being accessed. The control signals were also accessible on the PmodTM connectors for test purposes.

Figure 13 shows a simplified schematic view of the elaborated VHDL code, where I0 is the top-level design module and I1 is the built-in tester module.

Figure 13. Embedded hardware tester: Simplified schematic view of the elaborated VHDL code.
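The time-multiplexing of the five 8-bit address and data buses onto shared 8-bit outputs, described above, can be sketched as follows (illustrative Python; the array ordering and names are assumptions, not the VHDL identifiers):

```python
# Illustrative model of the built-in tester's bus multiplexing: five 8-bit
# address/data pairs (40 bits each) share one 8-bit address output and one
# 8-bit data output, with a 3-bit select code identifying the array so a
# logic analyzer can tell which array is currently on the shared pins.
ARRAYS = ["A", "B", "productCode", "resultIp", "resultOp"]  # assumed order

def mux_out(select, addresses, data):
    """Return (address, data, select) for the chosen array."""
    assert 0 <= select < len(ARRAYS)
    return addresses[select] & 0xFF, data[select] & 0xFF, select & 0b111

addr = [0x01, 0x02, 0x03, 0x04, 0x05]
data = [0x10, 0x20, 0x30, 0x40, 0x50]
print(mux_out(3, addr, data))   # (4, 64, 3)
```
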

The hardware test arrangement was useful to verify that the signals were generated correctly and matched the logic levels expected during normal design operation. However, it was necessary to reduce the speed of operation to account for non-ideal electrical parasitic effects that caused ringing of the signal. In this specific set-up, the speed of operation of the circuit when monitoring the signals using the logic analyzer and oscilloscope was not deemed important, so the 100 MHz clock was internally divided within the built-in tester circuit to 2 MHz in the study. However, further analysis could determine how fast the signals could change if the PmodTM connector was required to connect external memory for larger data sets.
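The 100 MHz to 2 MHz division mentioned above implies a divide-by-50. A minimal counter-based divider model (a Python sketch of the common hardware idiom, not the actual tester code):

```python
# Counter-based clock divider model: the output toggles every N input clock
# edges, so the output frequency is f_in / (2 * N). For 100 MHz in and
# 2 MHz out, N = 25 (a divide-by-50 overall).

def divider_output(n_toggle, num_edges):
    """Return the divided-clock level after each input edge."""
    out, count, levels = 0, 0, []
    for _ in range(num_edges):
        count += 1
        if count == n_toggle:
            out ^= 1          # toggle the divided clock
            count = 0
        levels.append(out)
    return levels

levels = divider_output(25, 100)
# one full output period spans 50 input edges
print(levels[24], levels[49], levels[74], levels[99])   # 1 0 1 0
```
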
Figure 14 shows the logic analyzer test set-up with the Artix-7 FPGA Development Board (bottom left) and the Digilent® Analog Discovery "USB Oscilloscope and Logic Analyzer" (top right) [31]. The 16 digital inputs for the logic analyzer function were available for use and a GND (ground, 0) connection was required to monitor the test circuit outputs. Internal control signals (ClockTop, ipOp, run, and resetN) were also available for monitoring in this arrangement.

The Digilent® Waveforms software [32] was utilized to control the test hardware and view the results. Figure 15 shows the logic analyzer output in Waveforms. Here, one complete cycle of the test process is shown, where the calculations are initially performed and the array outputs then read. The logic level values obtained in this physical prototype test agreed with the results from the simulation study. The data values are shown as a combined integer number value (data, top) and the values of the individual bits (7 down to 0).

Figure 14. Logic analyzer test set-up using the Digilent® Analog Discovery.

Figure 15. Logic analyzer test results using the Digilent® Analog Discovery: Complete cycle.

The run signal (a 0-1-0 pulse) initiates the computation that is selected by the ipOp signal at the start of the cycle. The data readout on the 8-bit data bus can be seen towards the end of the cycle as both a bus value and individual bit values.
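The combined bus value and its individual bit levels are related by a plain binary expansion (illustrative Python):

```python
# The logic analyzer shows each 8-bit value both as a combined integer and
# as individual bit levels (bit 7 down to bit 0).

def to_bits(value):
    """8-bit value -> list of bit levels, most significant bit first."""
    return [(value >> b) & 1 for b in range(7, -1, -1)]

print(to_bits(40))   # [0, 0, 1, 0, 1, 0, 0, 0]
```
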
Figure 16 shows the data readout operation towards the end of the cycle. The data output identifies the values for array A (nine values), array B (six values), and the inner product result array (six values) as identified in Section 5.2.

Figure 16. Logic analyzer test results using the Digilent® Analog Discovery: Data readout.

5.6. Design Implementation Considerations

This case study design has presented one example implementation of the combined inner and outer product algorithm. The study focused on creating a custom hardware-only design implementation rather than developing the algorithm in software to run on a suitable processor architecture. The approach taken to create the hardware design was to map the algorithm operations in software to a hardware equivalence. The hardware design was created using two main modules:
1. The computation module.
2. The control module. The control module was required to receive control signals from an external system and transform these to internal control signals for the computation module.
The computation module itself was modelled as two separate sub-modules as this was based on the underlying structure of the problem, which was to efficiently access data from memory for running a computation on data held in specific memory locations:
1. The memory module.
2. The algorithm module.
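This memory/algorithm split can be sketched as two cooperating objects (an illustrative Python analogue of the structure; the class and method names are assumptions, not taken from the VHDL):

```python
# Sketch of the two sub-modules: a memory module holding the input and
# output arrays, and an algorithm module that steps through addresses,
# reading operands and writing results (one multiply-accumulate per step).

class MemoryModule:
    def __init__(self, a, b):
        self.a, self.b = a, b              # constant input arrays
        self.result_ip = {}                # written by the algorithm

    def read_a(self, i, j): return self.a[i][j]
    def read_b(self, i, j): return self.b[i][j]
    def write_ip(self, i, j, value): self.result_ip[(i, j)] = value

class AlgorithmModule:
    """Counter/state-machine equivalent of the inner product sequencing."""
    def __init__(self, mem): self.mem = mem

    def run_inner(self, n, m, p):
        for i in range(n):
            for j in range(p):
                acc = 0
                for k in range(m):
                    acc += self.mem.read_a(i, k) * self.mem.read_b(k, j)
                self.mem.write_ip(i, j, acc)

mem = MemoryModule([[1, 2], [3, 4]], [[5, 6], [7, 8]])
AlgorithmModule(mem).run_inner(2, 2, 2)
print(mem.result_ip[(0, 0)])   # 1*5 + 2*7 = 19
```
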
For a specific application, the memory module would be used for storing input data, intermediate results data, and final (output) data. For this design, the physical memory used was internal to the FPGA using distributed memory within the LUTs, given the size of the data set, the availability of hardware resources within the FPGA, and the synthesis tool that automatically determined what hardware resources were to be used. The memory was modelled using VHDL arrays where the input data arrays held constant values and the intermediate and output data arrays held variables. In a different scenario, the memory modelling in VHDL might be different, for example, explicitly targeting internal BRAM cells and external memory attached to the FPGA pins. Such an approach would resemble a standard processor architecture with different levels of memory. The internal latches, flip-flops, distributed RAM, and BRAM cells within the FPGA would map to cache memory internal to the processor, and external memory to attached memory devices as depicted in Figure 3.

The designer would have design choices when considering the algorithm module. One approach would be to use a standard processor architecture that would be software programmed and mapped to hardware resources within the FPGA. Depending on the device, the processor may be an embedded core (a so-called hard core) or may be an IP block that can be instantiated in a design and synthesized into the available FPGA logic (a so-called soft core). For example, in Xilinx® FPGAs, the MicroBlaze 32-bit RISC (reduced instruction set computer) CPU can be instantiated into
a custom design. It is also possible, if the hardware resources are sufficient, to instantiate multiple soft cores within the FPGA. This would allow for a multi-processor solution and on-chip processor-to-processor communications with parallel processing. A second approach would be
to develop a custom architecture solution that maps the algorithm and memory modules to the
user requirements, giving a choice to implement sequential or parallel (concurrent) operations.
This provides a high level of flexibility for the designer, but requires a different design approach,
thinking in terms of hardware rather than software operations. A third approach would be to create
a hardware-software co-design incorporating custom architecture hardware and standard processor
architectures working concurrently.
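Whichever of these approaches is taken, the computation being mapped is the combined inner and outer product. As a minimal pure-Python sketch, in the spirit of the C/Python prototyping stage described in this paper (the function names are illustrative, not taken from the paper's original code), the two products over a pair of vectors are:

```python
# Illustrative sketch (not the paper's original code) of the inner and
# outer products that a soft-core processor, or a custom sequential
# datapath, would compute over two input vectors.

def inner_product(a, b):
    """Sum of element-wise products: a single scalar result."""
    assert len(a) == len(b)
    acc = 0
    for x, y in zip(a, b):
        acc += x * y          # one multiply-accumulate per element pair
    return acc

def outer_product(a, b):
    """len(a) x len(b) matrix of all pairwise products."""
    return [[x * y for y in b] for x in a]

a = [1, 2, 3]
b = [4, 5, 6]
print(inner_product(a, b))    # 32
print(outer_product(a, b))    # [[4, 5, 6], [8, 10, 12], [12, 15, 18]]
```

In hardware terms, the inner product maps naturally to a sequential multiply-accumulate loop (or an adder tree when parallelised), while the outer product exposes fully independent multiplications that a concurrent custom architecture can exploit.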
A final consideration in implementation would be to identify example processor architectures and target hardware used in machine and deep learning applications, where their benefits and limitations for specific applications could be assessed. For example, in software processor applications, the CPU is used for tensor computations where a GPU (graphics processing unit) is not available. GPUs have architectures and software programming capabilities that are better suited than a CPU for applications, such as gaming, where high-speed data processing and parallel computing operations are required.
An example GPU is the Nvidia® Tensor Core [33].
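The array-based memory modelling described earlier, with constant input arrays and variable intermediate and output arrays, can also be mirrored in software when validating the algorithm before translation to VHDL. A hypothetical Python sketch of that separation (the array names are our own, chosen for illustration):

```python
# Hypothetical software mirror of the VHDL array-based memory model:
# input memories hold constants (ROM-like), while intermediate and
# output memories hold variables (RAM-like).

INPUT_A = (1, 2, 3, 4)       # tuple: immutable, like a constant VHDL array
INPUT_B = (5, 6, 7, 8)

result = [0] * len(INPUT_A)  # list: mutable, like a variable VHDL array

for i in range(len(INPUT_A)):
    result[i] = INPUT_A[i] * INPUT_B[i]   # element-wise product stage

print(sum(result))           # accumulate the products: 70
```

The immutable/mutable distinction in the sketch plays the role that constant and variable VHDL arrays play in the hardware description, where the synthesis tool decides whether each array maps to LUT-based distributed memory or BRAM.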

6. Conclusions
In this paper, the design and simulation of a hardware block to implement a combined inner and
outer product was introduced and elaborated. This work was considered in the context of developing
embedded digital signal processing algorithms that can effectively and efficiently process complex
data sets. The FPGA was used as the target hardware and the product algorithm developed as VHDL
modules. The algorithm was initially evaluated using C and Python code before translation to a
hardware description in VHDL. The paper commenced with a discussion into tensors and the need
for effective and efficient memory access to control memory access times and the cost associated with
such memory access operations. To develop the ideas, the FPGA hardware implementation developed
was an example design that paralleled an initial software algorithm (C and Python coding) used for
algorithm development. The design was evaluated in simulation and hardware implementation issues
were discussed.

Author Contributions: The authors contributed equally to the work undertaken as a co-design effort. Co-design
is not just giving tasks to scholars in separate yet complementary disciplines. It is an ongoing attempt to
communicate ideas across disciplines and a desire to achieve a common goal. This type of research not only
helps both disciplines but it gives definitive proof of how that collaboration achieves greater results, and mentors
students and other colleagues to do the same. The case study design was a hardware implementation by Ian
Grout using the FPGA of the algorithm developed by Lenore Mullin. Both authors contributed equally to the
background discussions and the writing of the paper.
Funding: This research received no external funding.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Earley, S. Analytics, Machine Learning, and the Internet of Things. IT Prof. 2015, 17, 10–13. [CrossRef]
2. Corneanu, C.A.; Simón, M.O.; Cohn, J.F.; Guerrero, S.E. Survey on RGB, 3D, Thermal, and Multimodal
Approaches for Facial Expression Recognition: History, Trends, and Affect-Related Applications. IEEE Trans.
Pattern Anal. Mach. Intell. 2016, 38, 1548–1568. [CrossRef] [PubMed]
3. Xilinx® . FPGA Leadership across Multiple Process Nodes. Available online: https://www.xilinx.com/
products/silicon-devices/fpga.html (accessed on 9 October 2018).
4. Xilinx® . Homepage. Available online: https://www.xilinx.com/ (accessed on 9 October 2018).
5. Xilinx® . Artix-7 FPGA Family. Available online: https://www.xilinx.com/products/silicon-devices/fpga/
artix-7.html (accessed on 9 October 2018).
Electronics 2018, 7, 320 23 of 24

6. Fleisch, D. A Student’s Guide to Vectors and Tensors; Cambridge University Press: Cambridge, UK, 2011;
ISBN-10 0521171903, ISBN-13 978-0521171908.
7. Institute of Electrical and Electronics Engineers. IEEE Std 1076-2008—IEEE Standard VHDL Language Reference
Manual; IEEE: New York, NY, USA, 2009; ISBN 978-0-7381-6853-1, ISBN 978-0-7381-6854-8.
8. Kindratenko, V.; Trancoso, P. Trends in High Performance Computing. Comput. Sci. Eng. 2011, 13, 92–95.
[CrossRef]
9. Lane, N.D.; Bhattacharya, S.; Mathur, A.; Georgiev, P.; Forlivesi, C.; Kawsar, F. Squeezing Deep Learning into
Mobile and Embedded Devices. IEEE Pervasive Comput. 2017, 16, 82–88. [CrossRef]
10. Mullin, L.; Raynolds, J. Scalable, Portable, Verifiable Kronecker Products on Multi-scale Computers.
In Constraint Programming and Decision Making. Studies in Computational Intelligence; Ceberio, M.,
Kreinovich, V., Eds.; Springer: Cham, Switzerland, 2014; Volume 539.
11. Gustafson, J.; Mullin, L. Tensors Come of Age: Why the AI Revolution Will Help HPC. 2017. Available
online: https://www.hpcwire.com/2017/11/13/tensors-come-age-ai-revolution-will-help-hpc/ (accessed
on 9 October 2018).
12. Workshop Report: Future Directions in Tensor Based Computation and Modeling. 2009. Available
online: https://www.researchgate.net/publication/270566449_Workshop_Report_Future_Directions_in_
Tensor-Based_Computation_and_Modeling (accessed on 9 October 2018).
13. Tensor Computing for Internet of Things (IoT). 2016. Available online: http://drops.dagstuhl.de/opus/
volltexte/2016/6691/ (accessed on 9 October 2018).
14. Python.org, Python. Available online: https://www.python.org/ (accessed on 9 October 2018).
15. TensorflowTM . Available online: https://www.tensorflow.org/ (accessed on 9 October 2018).
16. Pytorch. Available online: https://pytorch.org/ (accessed on 9 October 2018).
17. Keras. Available online: https://en.wikipedia.org/wiki/Keras (accessed on 9 October 2018).
18. Apache MxNet. Available online: https://mxnet.apache.org/ (accessed on 9 October 2018).
19. Microsoft Cognitive Toolkit (MTK). Available online: https://www.microsoft.com/en-us/cognitive-toolkit/
(accessed on 9 October 2018).
20. CAFFE: Deep Learning Framework. Available online: http://caffe.berkeleyvision.org/ (accessed on 9
October 2018).
21. DeepLearning4J. Available online: https://deeplearning4j.org/ (accessed on 9 October 2018).
22. Chainer. Available online: https://chainer.org/ (accessed on 9 October 2018).
23. Google, Cloud TPU. Available online: https://cloud.google.com/tpu/ (accessed on 9 October 2018).
24. Mullin, L.M.R. A Mathematics of Arrays. Ph.D. Thesis, Syracuse University, Syracuse, NY, USA, 1988.
25. Mullin, L.; Raynolds, J. Conformal Computing: Algebraically connecting the hardware/software boundary
using a uniform approach to high-performance computation for software and hardware. arXiv 2018.
Available online: https://arxiv.org/pdf/0803.2386.pdf (accessed on 1 November 2018).
26. Institute of Electrical and Electronics Engineers. IEEE Std 1364™-2005 (Revision of IEEE Std 1364-2001), IEEE
Standard for Verilog® Hardware Description Language; IEEE: New York, NY, USA, 2006; ISBN 0-7381-4850-4,
ISBN 0-7381-4851-2.
27. Ong, Y.S.; Grout, I.; Lewis, E.; Mohammed, W. Plastic optical fibre sensor system design using the field
programmable gate array. In Selected Topics on Optical Fiber Technologies and Applications; IntechOpen: Rijeka,
Croatia, 2018; pp. 125–151, ISBN 978-953-51-3813-6.
28. Dou, Y.; Vassiliadis, S.; Kuzmanov, G.K.; Gaydadjiev, G.N. 64 bit Floating-point FPGA Matrix Multiplication.
In Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays,
Monterey, CA, USA, 20–22 February 2005. [CrossRef]
29. Amira, A.; Bouridane, A.; Milligan, P. Accelerating Matrix Product on Reconfigurable Hardware for
Signal Processing. In Field-Programmable Logic and Applications, Proceedings of the 11th International
Conference, FPL 2001, Belfast, UK, 27–29 August 2001; Lecture Notes in Computer Science; Brebner, G.,
Woods, R., Eds.; Springer: Berlin/Heidelberg, Germany, 2001; pp. 101–111, ISBN 978-3-540-42499-4 (print),
ISBN 978-3-540-44687-3 (online).
30. Hurst, S.L. VLSI Testing—Digital and Mixed Analogue/Digital Techniques; The Institution of Engineering and
Technology: London, UK, 1998; pp. 241–242, ISBN 0-85296-901-5.
31. Digilent® . Analog Discovery. Available online: https://reference.digilentinc.com/reference/instrumentation/
analog-discovery/start?redirect=1 (accessed on 1 November 2018).

32. Digilent® . Waveforms. Available online: https://reference.digilentinc.com/reference/software/waveforms/waveforms-3/start (accessed on 1 November 2018).
33. Nvidia. Nvidia Tensor Cores. Available online: https://www.nvidia.com/en-us/data-center/tensorcore/
(accessed on 1 November 2018).

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
