AnyHLS: High-Level Synthesis with Partial Evaluation
Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
arXiv:2002.05796v2 [cs.PL] 21 Jul 2020

Abstract—FPGAs excel in low power and high throughput computations, but they are challenging to program. Traditionally, developers rely on hardware description languages like Verilog or VHDL to specify the hardware behavior at the register-transfer level. High-Level Synthesis (HLS) raises the level of abstraction, but still requires FPGA design knowledge. Programmers usually write pragma-annotated C/C++ programs to define the hardware architecture of an application. However, each hardware vendor extends its own C dialect using its own vendor-specific set of pragmas. This prevents portability across different vendors. Furthermore, pragmas are not first-class citizens in the language. This makes it hard to use them in a modular way or to design proper abstractions.

In this paper, we present AnyHLS, an approach to synthesize FPGA designs in a modular and abstract way. AnyHLS is able to raise the abstraction level of existing HLS tools by resorting to programming language features such as types and higher-order functions as follows: It relies on partial evaluation to specialize and to optimize the user application based on a library of abstractions. Then, vendor-specific HLS code is generated for Intel and Xilinx FPGAs. Portability is obtained by avoiding any vendor-specific pragmas in the source code. In order to validate the achievable gains in productivity, a library for the domain of image processing is introduced as a case study, and its synthesis results are compared with several state-of-the-art Domain-Specific Language (DSL) approaches for this domain.

I. INTRODUCTION

Field Programmable Gate Arrays (FPGAs) consist of a network of reconfigurable digital logic cells that can be configured to implement any combinatorial logic or sequential circuit. This allows the design of custom application-tailored hardware. In particular, memory-intensive applications benefit from FPGA implementations by exploiting fast on-chip memory for high throughput. These features make FPGA implementations orders of magnitude faster or more energy-efficient than CPU implementations in these areas. However, FPGA programming poses challenges to programmers unacquainted with hardware design.

FPGAs are traditionally programmed at the Register-Transfer Level (RTL). This requires modeling digital signals, their timing, the flow between registers, as well as the operations performed on them. Hardware Description Languages (HDLs) such as Verilog or VHDL allow for the explicit description of arbitrary circuits but require significant coding effort and verification time. This makes design iterations time-consuming and error-prone, even for experts: The code needs to be rewritten for different performance or area objectives. In recent languages such as Chisel [1], VeriScala [2], and MyHDL [3], programmers can create a functional description of their design but still stick to the RTL.

High-Level Synthesis (HLS) increases the abstraction level to an untimed high-level specification similar to imperative programming languages and automatically solves low-level design issues such as clock-level timing, register allocation, and structural pipelining [4]. However, HLS code that is optimized for the synthesis of high-performance circuits is fundamentally different from a software program delivering high performance on a CPU. This is due to the significant gap between the programming paradigms: An HLS compiler has to optimize the memory hierarchy of a hardware implementation and parallelize its data paths [5].

In order to achieve good Quality of Results (QoR), HLS languages demand that programmers also specify the hardware architecture of an application instead of just its algorithm. For this reason, HLS languages offer hardware-specific pragmas. This ad-hoc mix of software and hardware features makes it difficult for programmers to optimize an application. In addition, most HLS tools rely on their own C dialect, which prevents code portability. For example, Xilinx Vivado HLS [6] uses C++ as its base language while the Intel SDK [7] (formerly Altera) uses OpenCL C. These severe restrictions make it hard to use existing HLS languages in a portable and modular way.

In this paper, we advocate describing FPGA designs using functional abstractions and partial evaluation to generate optimized HLS code. Consider Figure 1 for an example from image processing: With a functional language, we separate the description of the sobel_x operator from its realization in hardware. The hardware realization make_local_op is a function that specifies the data path, the parallelization, and the memory architecture. Thus, the algorithm and hardware architecture are described by a set of higher-order functions. A partial evaluator, ultimately, combines these functions to generate HLS code that delivers high-performance circuit designs when compiled with HLS tools. Since the initial descriptions are high-level, compact, and functional, they are reusable and distributable as a library. We leverage the AnyDSL compiler framework [8] to perform partial evaluation and extend it to generate input code for HLS tools targeting Intel and Xilinx FPGA devices. We claim that this approach leads to more modular and portable code than existing HLS approaches, and is able to produce highly
[Figure 1 omitted: block diagram of the local operator pipeline — line buffers feed a sliding window and v parallel operators op1 … opv, with Mem1D (W × H, v) streams at the input and output and Mem2D windows of sizes (1, h, v) and (w + v − 1, h, 1) in between.]

Figure 1. AnyHLS example: The algorithm description sobel_x is decoupled from its realization in hardware make_local_op. The hardware realization is a function that specifies important transformations for the exploitation of parallelism and the memory architecture. The function generate(vhls) selects the backend for code generation, which is Vivado HLS in this case. Ultimately, an optimized input code for HLS is generated by partially evaluating the algorithm and realization functions.
application domain by themselves (see Section III). AnyHLS is thereby built on top of AnyDSL [8] (see Section II-C). AnyDSL offers partial evaluation to enable shallow embedding [34] without the need for modifying a compiler. This means that there is no need to change the compiler when adding support for a new application domain, since programmers can design custom control structures. Partial evaluation specializes algorithmic variants of a program at compile time. Compared to metaprogramming, partial evaluation operates in a single language and preserves the well-typedness of programs [8]. Furthermore, different combinations of static/dynamic parameters can be instantiated from the same code. Previously, we have shown how to abstract image border handling implementations for Intel FPGAs using AnyDSL [35]. In this paper, we present AnyHLS and an image processing library to synthesize FPGA designs in a modular and abstract way for both Intel and Xilinx FPGAs.

2 https://anydsl.github.io

Thus, the calls

let z = pow(x, 5);        let z = pow(3, 5);

will result in the following equivalent sequences of instructions after specialization:

let y = x * x;            let z = 243;
let z = x * y * y;

As syntactic sugar, @ is available as shorthand for @(true). This causes the partial evaluator to always specialize the annotated function.

FPGA implementations must be statically defined for QoR: types, loops, functions, and interfaces must be resolved at compile time [16], [18], [19]. Partial evaluation has many advantages compared to metaprogramming, as discussed in Section II-B. Hence, Impala's partial evaluation is particularly useful to optimize HLS descriptions.
[Figure 2 omitted: code generation flows — halide-app.cpp and hipacc-app.cpp pass through the Halide and Hipacc compilers to VHLS/AOCL backends, while anyhls-app.impala passes through the AnyDSL compiler (partial evaluator) together with the Image Processing Lib.impala, targeting Xilinx and Intel FPGAs.]

Figure 2. FPGA code generation flows for Halide, Hipacc, and AnyHLS (from left to right). VHLS and AOCL are used as acronyms for Vivado HLS and Intel FPGA SDK for OpenCL, respectively. Halide and Hipacc rely on domain-specific compilers for image processing that instantiate template libraries. AnyHLS allows defining all abstractions for a domain in a language called Impala and relies on partial evaluation for code specialization. This ensures maintainability and extensibility of the provided domain-specific library—for image processing in this example.

2) Generators: Because iteration on various domains is a common pattern, Impala provides syntactic sugar for invoking certain higher-order functions. The loop

for var1, ..., varn in iter(arg1, ..., argn) { /* ... */ }

translates to an anonymous function

|var1, ..., varn| { /* ... */ }

that is passed to iter as the last argument. We call functions that are invokable like this generators. Domain-specific libraries implemented in Impala make heavy use of these features, as they allow programmers to write custom generators that take advantage of both domain knowledge and certain hardware features, as we will see in the next section.

Generators are particularly powerful in combination with partial evaluation. Consider the following functions:
type Body = fn(int) -> ();
fn @(?a & ?b) unroll(a: int, b: int, body: Body) -> () {
    if a < b { body(a); unroll(a+1, b, body) }
}
fn @ range(a: int, b: int, body: Body) -> () {
    unroll($a, b, body)
}

Both generators iterate from a (inclusive) to b (exclusive) while invoking body each time. The filter unroll tells the partial evaluator to completely unroll the recursion if both loop bounds are statically known at a particular call site.

III. THE ANYHLS LIBRARY

Efficient and resource-friendly FPGA designs require application-specific optimizations. These optimizations and transformations are well known in the community. For example, de Fine Licht et al. [20] discuss the key transformations of HLS codes such as loop unrolling and pipelining. They describe the whole hardware design from the low-level memory layout to the operator implementations with support for low-level loop transformations throughout the design. In our setting, the programmer defines and provides these abstractions using AnyDSL for a given domain in the form of a library. We rely on partial evaluation to combine those abstractions and to remove the overhead associated with them. Ultimately, the AnyDSL compiler synthesizes optimized HLS code (C++ or OpenCL C) from a given functional description of an algorithm, as shown in Figure 2. The generated code goes to the selected HLS tool. This is in contrast to other domain-specific approaches like Halide-HLS [25] or Hipacc [27], which rely on domain-specific compilers to instantiate predefined templates or macros. Hipacc makes use of two distinct libraries to synthesize algorithmic abstractions to Vivado HLS and Intel AOCL, while AnyHLS uses the same image processing library that is described in Impala.

A. HLS Code Generation

For HLS code generation, we implemented an intrinsic named vhls in AnyHLS to emit Vivado HLS and an intrinsic named opencl to emit AOCL:

with vhls() { body() }        with opencl() { body() }

With opencl we use a grid and block size of (1, 1, 1) to generate a single work-item kernel, as the official AOCL documentation recommends [7]. We extended AnyDSL's OpenCL runtime with the extensions of the Intel OpenCL SDK. To provide an abstraction over both HLS backends, we create a wrapper generate that expects a code generation function:

type Backend = fn(fn() -> ()) -> ();
fn @ generate(be: Backend, body: fn() -> ()) -> () {
    with be() { body() }
}

Switching backends is now just a matter of passing an appropriate function to generate:

let backend = vhls; // or opencl
with generate(backend) { body() }

B. Building Abstractions for FPGA Designs

In the following, we present abstractions for the key transformations and design patterns that are common in FPGA design. These include (a) important loop transformations, (b) control flow and data flow descriptions such as reductions and Finite State Machines (FSMs), and (c) the explicit utilization of different memory types. Approaches like Spatial [15] expose these patterns within the language—new patterns require dedicated support from the compiler. Hence, these languages and compilers are restricted to the specialized application domain they have been designed for. In AnyHLS, Impala's functional language and partial evaluation allow us to design the abstractions needed for FPGA synthesis in the form of a library. New patterns can be added to the library without dedicated support from the compiler. This makes AnyHLS easier to extend compared to the approaches mentioned before.

1) Loop Transformations: C++ compilers usually provide certain preprocessor directives that perform particular code transformations. A common feature is to unroll loops (see the left-hand side):

for (int i=0; i<N/W; ++i) {        for i in range(0, N/W) {
    #pragma unroll                     for w in unroll(0, W) {
    for (int w=0; w<W; ++w) {              body(i*W + w);
        body(i*W + w);                 }
    }                              }
}

Such pragmas are built into the compiler. The Impala version (shown at the right) uses generators that are entirely implemented as a library. Partial evaluation optimizes Impala's range and unroll abstractions as well as the input body function according to their static inputs, i.e., N, W. The residual program consists of the consecutive body functions according to the value of W, as shown in Figure 3. This generates concise and clean code for the target HLS compiler, which is drastically different from using a pragma.

[Figure 3 omitted: replication of body for the four cases — no unrolling, unroll inner loop, unroll outer loop, unroll inner and outer loop.]

Figure 3. Parallel processing

Generators, unlike C++ pragmas, are first-class citizens of the Impala language. This allows programmers to implement sophisticated loop transformations. For example, the following function tile returns a new generator. It instantiates a tiled loop nest of the specified tile size with the loops inner and outer:

type Loop = fn(int, int, fn(int) -> ()) -> ();
fn @ tile(size: int, inner: Loop, outer: Loop) -> Loop {
    @|beg, end, body| outer(0, (end-beg)/size,
        |i| inner(i*size + beg, (i+1)*size + beg, |j| body(j)))
}

let schedule = tile(W, unroll, range);
for i in schedule(0, N) {
    body(i)
}

Passing W for the tiling size, unroll for the inner loop, and range for the outer loop yields a generator that is identical to the loop nest at the beginning of this paragraph. With this design, we can reuse or explore iteration techniques without touching the actual body of a for loop. For example, consider the processing options for a two-dimensional loop nest as shown in Figure 3: When just passing range as inner and outer loop, the partial evaluator will keep the loop nest and, hence, not unroll body and instantiate it only once. Unrolling the inner loop replicates body and increases the bandwidth requirements accordingly. Unrolling the outer loop also replicates body, but in a way that benefits data reuse from the temporal locality of an iterative algorithm. Unrolling both loops replicates body for increased bandwidth and data reuse for temporal locality.

C/C++-based HLS solutions often use a pragma to mark a loop amenable for pipelining. This means parallel execution of the loop iterations in hardware. For example, the following code on the left uses an initiation interval (II) of 3:

for (int i=0; i<N; ++i) {          let II = 3;
    #pragma HLS pipeline II=3      for i in pipeline(II, 0, N) {
    body(i);                           body(i)
}                                  }

Instead of a pragma (on the left), AnyHLS uses the intrinsic generator pipeline (on the right). Unlike the above loop abstractions (e.g., unroll), Impala emits a tool-specific pragma for the pipeline abstraction. This provides portability across different HLS tools. Furthermore, it allows the programmer to invoke and pass around pipeline—just like any other generator.

2) Reductions: Reductions are useful in many contexts. The following function takes an array of values, a range within, and an operator:

type T = int;
fn @(?beg & ?end) reduce(beg: int, end: int, input: &[T],
                         op: fn(T, T) -> T) -> T {
    let n = end - beg;
    if n == 1 {
        input(beg)
    } else {
        let m = (end + beg) / 2;
        let a = reduce(beg, m, input, op);
        let b = reduce(m, end, input, op);
        op(a, b)
    }
}

In the above filter, the recursion will be completely unfolded if the range is statically known. Thus,

reduce(0, 4, [a, b, c, d], |x, y| x + y)

yields: (a + b) + (c + d).

3) Finite State Machines: AnyHLS models computations that depend not only on the inputs but also on an internal state with an FSM. To define an FSM, programmers need to specify states and a transition function that determines when to change the current state based on the machine's input. This is especially beneficial for modeling control flow. To describe an FSM in Impala, we start by introducing types to represent the states and the machine itself:

type State = int;
struct FSM {
    add: fn(State, fn() -> (), fn() -> State) -> (),
    run: fn(State) -> ()
}

An object of type FSM provides two operations: adding one state with add or running the computation. The add method takes the name of the state, an action to be performed for this state, and a transition function associated with this state. Once all states are added, the programmer runs the machine by passing the initial state as an input parameter. The following example adds 1 to every element of an array:

let buf = /*...*/;
let mut (idx, pixel) = (0, 0);
let fsm = make_fsm();
fsm.add(Read, || pixel = buf(idx),
        || if idx >= len { Exit } else { Compute });
fsm.add(Compute, || pixel += 1, || Write);
fsm.add(Write, || buf(idx++) = pixel, || Read);
fsm.run(Read);

Similar to the other abstractions introduced in this section, the constructor for an FSM is not a built-in function of the compiler but a regular Impala function. In some cases, we want to execute the FSM in a pipelined way. For this scenario, we add a second method run_pipelined. As all the methods, e.g., make_fsm, add, run, are annotated for partial evaluation (by @), the input functions to these methods will be optimized according to their static inputs. Ultimately, AnyHLS will emit the states of an FSM as part of a loop according to the selected run method.
4) Memory Types and Memory Abstractions: FPGAs have different memory types of varying sizes and access properties. Impala supports four memory types specific to hardware design (see Figure 4): global memory, on-chip memory, registers, and streams. Global memory (typically DRAM) is allocated on the host using our runtime and accessed through regular pointers. On-chip memory (e.g., BRAM or M10K/M20K) for the FPGA is allocated using the reserve_onchip compiler intrinsic. Memory accesses using the pointer returned by this intrinsic will map to on-chip memory. Standard variables are mapped to registers, and a specific stream type is available to allow for the communication between FPGA kernels. Memory-wise, a stream is mapped to registers or on-chip memory by the HLS tools. These FPGA-specific memory types in Impala will be mapped to their corresponding tool-specific declarations in the residual program (on-chip memory will be defined as local memory for AOCL, whereas it will be defined as an array in Vivado HLS).

[Figure 4 omitted: the four memory types — global memory, on-chip memory, register, stream.]

Figure 4. Memory types provided for FPGA design

[Figure 5 omitted: the memory abstractions — Regs1D (1D register array), Regs2D (2D register array), OnChipArray (on-chip array), StreamArray (stream array).]

Figure 5. Memory abstractions

a) Memory partitioning: An array partitioning pragma must be defined as follows to implement a C array with hardware registers using Vivado HLS [6]:

typedef int T;
T Regs1D[size];
#pragma HLS variable=Regs1D array_partition dim=0

Listing 1. A typical way of partitioning an array by using pragmas in existing HLS tools.

Other HLS tools offer similar pragmas for the same task. Instead, AnyHLS provides a more concise description of a register array without using any tool-specific pragma by the recursive declaration of registers as follows:

type T = int;
struct Regs1D {
    read: fn(int) -> T,
    write: fn(int, T) -> (),
    size: int
}
fn @ make_regs1d(size: int) -> Regs1D {
    if size == 0 {
        Regs1D {
            read: @|_| 0,
            write: @|_, _| (),
            size: size
        }
    } else {
        let mut reg: T;
        let others = make_regs1d(size - 1);
        Regs1D {
            read: @|i| if i+1 == size { reg }
                       else { others.read(i) },
            write: @|i, v| if i+1 == size { reg = v }
                           else { others.write(i, v) },
            size: size
        }
    }
}

Listing 2. Recursive description of a register array using partial evaluation instead of declaring an array and partitioning it by HLS pragmas.

When the size is not zero, each recursive call to this function allocates a register variable named reg, and creates a smaller register array with one element less named others. The read and write functions test if the index i is equal to the index of the current register. In the case of a match, the current register is used. Otherwise, the search continues in the smaller array. The generator make_regs1d returns an Impala variable that can be read and written by index values (regs in the following code), similar to C arrays.

let regs = make_regs1d(size);

However, it defines size number of registers in the residual program instead of declaring an array and partitioning it by tool-specific pragmas as in Listing 1. The generated code does not contain any compiler directives; hence it can be used for different HLS tools (e.g., Vivado HLS, AOCL). Since we annotated make_regs1d, read, and write for partial evaluation, any call to these functions will be inlined recursively. This means that the search to find the register to read from or write to will be performed at compile time. These registers will be optimized by the AnyDSL compiler, just like any other variables: unnecessary assignments will be avoided, and clean HLS code will be generated.

Correspondingly, AnyHLS provides generators (similar to Listing 2) for one- and two-dimensional arrays of on-chip memory (e.g., line buffers in Section IV), global memory, and streams (as illustrated in Figure 5) instead of using the memory partitioning pragmas encouraged in existing HLS tools (as in Listing 1).

IV. A LIBRARY FOR IMAGE PROCESSING ON FPGA

AnyHLS allows for defining domain-specific abstractions and optimizations that are used and applied prior to generating customized input to existing HLS tools. In this section, we introduce a library that is developed to support HLS for the domain of image processing applications. It is based on the fundamental abstractions introduced in Section III-B. Our low-level implementation is similar to existing domain-specific languages targeting FPGAs [24], [27]. For this reason, we focus on the interface of our abstractions as seen by the programmer.

We design applications by decoupling their algorithmic description from their schedule and memory operations. For instance, typical image operators, such as the following Sobel filter, just resort to the make_local_op generator. Similarly, we implement a point operator for RGB-to-gray color conversion as follows (Listing 3):

fn sobel_edge(output: &mut [T], input: &[T]) -> () {
    let img = make_raw_mem2d(width, height, input);
    let dx = make_raw_mem2d(width, height, output);
    let sobel_extents = extents(1, 1); // for 3x3 filter
    let operator = make_local_op(4, // vector factor
        sobel_operator_x, sobel_extents, mirror, mirror);
    with generate(hls) { operator(img, dx); }
}

fn rgb2gray(output: &mut [T], input: &[T]) -> () {
    let img = make_raw_img(width, height, input);
    let gray = make_raw_img(width, height, output);
    let operator = make_point_op(@ |pix| {
        let r = pix & 0xFF;
        let g = (pix >> 8) & 0xFF;
        let b = (pix >> 16) & 0xFF;
        (r + g + b) / 3
    });
    with generate(hls) { operator(img, gray); }
}

Listing 3. Sobel filter and RGB-to-gray color conversion as example applications described by using our library.

The image data structure is opaque. The target platform mapping determines its layout. AnyHLS provides common border handling functions as well as point and global operators such as reductions (see Section III-B2). These operators are composable to allow for more sophisticated ones.

A. Vectorization

Image processing applications consist of loops that possess a very high degree of spatial parallelism. This should be exploited to reach the bandwidth speed of memory technologies. A resource-efficient approach, so-called vectorization or loop coarsening, is to aggregate the input pixels to vectors and process multiple input data at the same time to calculate multiple output pixels in parallel [39]–[41]. This replicates only the arithmetic operations applied to the data (the so-called datapath) instead of the whole accelerator, similar to Single Instruction Multiple Data (SIMD) architectures. Vectorization requires a control structure specialized to the considered hardware design. We support the automatic vectorization of an application by a given factor v when using our image processing library. In particular, our library uses the vectorization techniques proposed in [40]. For example, the make_local_op function has an additional parameter to specify the desired vectorization and will propagate this information to the functions it uses internally: make_local_op(op, v). For brevity, we omit the parameter for the vectorization factor for the remaining abstractions in this section.

B. Memory Abstractions for Image Processing

1) Memory Accessor: In order to optimize memory access and encapsulate the contained memory type (on-chip memory, etc.) into a data structure, we decouple the data transfer from the data use via the following memory abstractions:

struct Mem1D {                    struct Mem2D {
    read: fn(int) -> T,               read: fn(int, int) -> T,
    write: fn(int, T) -> (),          write: fn(int, int, T) -> (),
    update: fn(int) -> (),            update: fn(int, int) -> (),
    size: int                         width: int, height: int
}                                 }

Similar to hardware design practices, these memory abstractions require the memory address to be updated before the read/write operations. The update function transfers data from/to the encapsulated memory to/from staging registers using vector data types. Then, the read/write functions access an element of the vector. This increases data reuse and the DRAM-to-on-chip memory bandwidth [42].

2) Stream Processing: Inter-kernel dependencies of an algorithm should be accessed on-the-fly in combination with fine-granular communication in order to pipeline the full implementation with a fixed throughput. That is, as soon as a block produces one data item, the next block consumes it. In the best case, this requires only a single register or a small buffer instead of reading/writing to temporary images:

[Figure omitted: three kernels chained by streams — Mem1D → Kernel1 → Mem1D → Kernel2 → Mem1D → Kernel3 → Mem1D.]

We define a stream between two kernels as follows:

fn make_mem_from_stream(size: int, data: stream) -> Mem1D;

3) Line Buffers: Storing an entire image in on-chip memory before execution is not feasible since on-chip memory blocks are limited in FPGAs. On the other hand, feeding the data on demand from main memory is extremely slow. Still, it is possible to leverage fast on-chip memory by using it as FIFO buffers containing only the necessary lines of the input images (W pixels per line).

[Figure omitted: an array of line buffers (W, h, v) turning a Mem1D (W, v) input into a Mem2D (1, h, v) column output.]

This enables parallel reads at the output for every pixel read at the input. We model a line buffer as follows:

type LineBuf1D = fn(Mem1D) -> Mem1D;
fn make_linebuf1d(width: int) -> LineBuf1D;
// similar for LineBuf2D

Akin to Regs1D (see Section III-B4), a recursive call builds an array of line buffers (each line buffer will be declared by a separate memory component in the residual program, similar to the on-chip array in Figure 5).

4) Sliding Window: Registers are the most amenable resources to hold data for highly parallelized access. A sliding window of size w × h updates the constituting shift registers by a new column of h pixels and enables parallel access to w · h pixels.

[Figure omitted: a sliding window turning a Mem2D (1, h, v) column input into a Mem2D (w, h, 1) window output.]

This provides high data reuse for temporal locality and avoids the waste of on-chip memory blocks that might be utilized for a similar data bandwidth. Our implementation uses make_regs2d for an explicit declaration of registers and supports pixel-based indexing at the output. This will instantiate w · h registers in the residual program.
type Swin2D = fn(Mem2D) -> Mem2D;
fn @ make_sliding_window(w: int, h: int) -> Swin2D {
    let win = make_regs2d(w, h);
    // ...
}

C. Loop Abstractions for Image Processing

1) Point Operators: Algorithms such as image scaling and color transformation calculate an output pixel for every input pixel. The point operator abstraction (see Listing 4) in AnyHLS yields a vectorized pipeline over the input and output image. This abstraction is parametric in its vector factor v and the desired operator function op.

type PointOp = fn(Mem1D) -> Mem1D;
fn @ make_point_op(v: int, op: Op) -> PointOp {
    @ |img, out| {
        for idx in pipeline(1, 0, img.size) {
            img.update(idx);
            for i in unroll(0, v) {
                out.write(i, op(img.read(i)));
            }
            out.update(idx);
        }
    }
}

Listing 4. Implementation of the point operator abstraction.

The total latency is

L = Larith + ⌈W/v⌉ · H cycles    (2)

where W and H are the width and height of the input image, and Larith is the latency of the data path.

type LocalOp = fn(Mem1D) -> Mem1D;
fn @ make_local_op(v: int, op: Op, ext: Extents,
                   bh_lower: FnBorder,
                   bh_upper: FnBorder) -> LocalOp {
    @ |img, out| {
        let mut (col, row, idx) = (0, 0, 0);
        let wait = /* initial latency */;
        let fsm = make_fsm();
        fsm.add(Read, || img.update(idx), || Compute);
        fsm.add(Compute, || {
            line_buffer.update(col);
            sliding_window.update(row);
            col_sel.update(col);
            for i in unroll(0, v) {
                out.write(i, op(col_sel.read(i)));
            }
        }, || if idx > wait { Write } else { Index });
        fsm.add(Write, || out.update(idx-wait-1), || Index);
        fsm.add(Index, || {
            idx++; col++;
            if col == img_width { col=0; row++; }
        }, || if idx < img.size { Read } else { Exit });
        fsm.run_pipelined(Read, 1, 0, img.size);
    }
}

Listing 5. Implementation of the local operator abstraction.

Compared to the local operator in Figure 1, we also support boundary handling. We specify the extent of the local operator (filter size / 2) as well as functions specifying the boundary handling for the lower and upper bounds. Then, row and column selection functions apply border handling correspondingly in the x- and y-directions by using one-dimensional multiplexer arrays similar to Özkan et al. [40].
2) Local Operators: Algorithms such as Gaussian blur and Sobel edge detection calculate an output pixel by considering the corresponding input pixel and a certain neighborhood of it in a local window. Thus, a local operator with a w × h window requires w · h pixel reads for every output. The same (w − 1) · h pixels are used to calculate results at the image coordinates (x, y) and (x + 1, y). This spatial locality is transformed into temporal locality when input images are read in raster order in burst mode, and subsequent pixels are sequentially processed with a streaming pipeline implementation. The local operator implementation in AnyHLS (shown in Listing 5) consists of line buffers and a sliding window that hold dependency pixels in on-chip memory, and it calculates a new result for every new pixel read.
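The line-buffer and sliding-window scheme just described can be sketched behaviorally in Python. This is an illustrative software model under assumptions (zero padding, window anchored at the newest pixel), not the Impala implementation; `local_op_stream` is a hypothetical name:

```python
from collections import deque

# Behavioral sketch of a streaming local operator: (h - 1) line buffers
# delay the input by one image row each, and a w-column sliding window
# holds the current neighborhood; one output is produced per input pixel.
def local_op_stream(pixels, width, w, h, op, pad=0):
    line_buffers = [deque([pad] * width) for _ in range(h - 1)]
    window = deque([[pad] * h for _ in range(w)], maxlen=w)
    outputs = []
    for p in pixels:
        # Column of h vertically adjacent pixels ending at the new pixel.
        column = [pad] * h
        column[h - 1] = p
        carry = p
        for i in range(h - 2, -1, -1):
            line_buffers[i].append(carry)       # shift pixel into buffer
            carry = line_buffers[i].popleft()   # pixel from one row above
            column[i] = carry
        window.append(column)                   # shift window right by one
        outputs.append(op(window))
    return outputs

# 3x3 box sum over a 4x4 image of ones; the last output sees a full window.
res = local_op_stream([1] * 16, width=4, w=3, h=3,
                      op=lambda win: sum(sum(c) for c in win))
print(res[-1])  # 9
```

Each fixed-length deque models an on-chip FIFO of one image line; because its delay is exactly `width` cycles, popping it yields the pixel directly above the incoming one, which is why the hardware needs only w · h registers plus h − 1 line buffers instead of re-reading the neighborhood from memory.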
[Diagram: local operator — a Mem1D input stream (W × H, v) feeds line buffers (Mem2D, (1, h, v)), which feed a sliding window (Mem2D, (w + v − 1, h, 1)); row and column selection (sel) provides the operands for v parallel operators op1 … opv, which write a Mem1D output stream (W × H, v).]

This provides a throughput of v pixels per clock cycle at the cost of an initial latency (v is the vectorization factor)

    Linitial = Larith + (⌊h/2⌋ · ⌈W/v⌉ + ⌊⌈w/v⌉/2⌋)    (3)

that is spent for caching neighboring pixels of the first calculation. The final latency is thus:

    L = Linitial + ⌈W/v⌉ · H    (4)

V. EVALUATION AND RESULTS

In the following, we compare the Post Place and Route (PPnR) results of AnyHLS and other state-of-the-art domain-specific approaches, including Halide-HLS [25] and Hipacc [27]. The generated HLS codes are compiled using Intel FPGA SDK for OpenCL 18.1 and Xilinx Vivado HLS 2017.2, targeting a Cyclone V GT 5CGTD9 FPGA and a Zynq XC7Z020 FPGA, respectively.

The generated hardware designs are evaluated for their throughput, latency, and resource utilization. FPGAs possess two types of resources: (i) computational: LUTs and DSP blocks; (ii) memory: flip-flops (FFs) and on-chip memory (BRAM/M20K). A SLICE/ALM comprises look-up tables (LUTs) and flip-flops; considered together with the DSP and on-chip memory blocks, it thus indicates the overall resource usage.

The implementation results presented for Vivado HLS feature only the kernel logic, while those by Intel OpenCL include PCIe interfaces. The execution time of an FPGA circuit (Vivado HLS implementation) equals Tclk · latency, where Tclk is the clock period of the maximum achievable clock frequency (lower is better). We measured the timing results for Intel OpenCL by executing the applications on a Cyclone V GT 5CGTD9 FPGA. This is the case for all analyzed applications. We have no intention nor license rights [43, §4] [44, §2] to benchmark and compare the considered FPGA technologies or HLS tools.

A. Applications

In our experimental evaluation, we consider the following applications:
• … as a pre-processing algorithm
• bilateral filter (Bilateral), a 5 × 5 floating-point kernel as an edge-preserving and noise-reducing function based on exponential functions
• mean filter (MF), a 5 × 5 filter that determines the average within a local window via 8-bit arithmetic
• SobelLuma, an edge detection algorithm provided as a design example by Intel. The algorithm consists of RGB to Luma color conversion, Sobel filters, and thresholding

B. Library Optimizations

Figure 6. Execution time for naïve and streaming pipeline implementations. [Bar chart: execution time in ms for naïve and streaming pipeline versions of Harris and FChain.]

2) Vectorization: Many FPGA implementations benefit from parallel processing in order to increase memory bandwidth. AnyHLS implicitly parallelizes a given image pipeline by a vectorization factor v. As an example, Figure 7 shows the PPnR results, along with the achieved memory throughput, for different vectorization factors of the mean filter on a Cyclone V. The memory bound of the Cyclone V is reported by Intel's …

Figure 7. PPnR results of AnyHLS's mean filter implementation on an Intel Cyclone V. The memory bound of the device for our setup is 1344.80 MB/s. [Plot: memory throughput and memory bound [MB/s], and resource usage in % (on-chip memory blocks, logic resources) over vectorization factors v = 1, 2, 4, 8, 16, 32.]

… latency given in Equation (4), which is L = Larith + 1,042,442 clock cycles for Gauss when v = 1. Larith = 14 for AnyHLS' Gauss implementation, as shown in Table II.

Table II (fragment). PPnR results for Gauss (v = 1):

Framework    #BRAM  #SLICE  #DSP  Latency [cycles]  Throughput [MB/s]
AnyHLS           8    1646    16         1050641              801.8
Halide-HLS      16    2096    50         1060897              458.7
Hipacc           8    1709    16         1052693              820.1
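The latency and line-buffer BRAM figures in Table II can be reproduced from Equations (2)–(4) together with a simple memory model. This is a hedged sketch: the 18Kb BRAM geometry (at most 18-bit words by 1024 entries per block) is an assumption about the target device, and the computed latency matches the reported value only up to a one-cycle difference:

```python
import math

def latency_point(L_arith, W, H, v):
    # Equation (2): point-operator latency.
    return L_arith + math.ceil(W / v) * H

def latency_local(L_arith, W, H, v, w, h):
    # Equations (3) and (4): line-buffer fill time plus streaming phase.
    L_init = L_arith + (h // 2) * math.ceil(W / v) + math.ceil(w / v) // 2
    return L_init + math.ceil(W / v) * H

def brams_18k(pixels_per_line, pixel_bits, lines, depth=1024, word_bits=18):
    # Blocks to cover the pixel width, times segments to cover one line,
    # times the number of buffered lines.
    return (lines * math.ceil(pixel_bits / word_bits)
                  * math.ceil(pixels_per_line / depth))

# Gauss, 5x5 window, 1024x1024 image, v = 1, L_arith = 14:
print(latency_local(14, 1024, 1024, 1, 5, 5))  # 1050640 cycles
print(brams_18k(1024, 32, 4))  # 8  (buffering exactly 1024 pixels/line)
print(brams_18k(1028, 32, 4))  # 16 (buffering 1028 padded pixels/line)
```

Note how exceeding the 1024-entry depth of a single block by even a few pixels per line doubles the BRAM count, which is the effect discussed below for Halide-HLS.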
ii) Halide-HLS pads input images according to the selected border handling mode (even when no border handling is defined). This increases the input image size from (W, H) to (W + w − 1, H + h − 1), and thus the latency.
iii) Hipacc does not pad input images, but runs (H + ⌊h/2⌋) · (W + ⌊w/2⌋) loop iterations for a (W × H) image and a (w × h) window. This is similar to the convolution example in the Vivado Design Suite User Guide [6], but not optimal.

The execution time of an implementation equals Tclk · latency, where Tclk is the clock period of the maximum achievable clock frequency (lower is better). Overall, AnyHLS processes a given image faster than the other DSL implementations.

Halide-HLS uses more on-chip memory for line buffers (see Section IV-C2) than Hipacc and AnyHLS because of its image padding for border handling. Let us consider the number of BRAMs utilized for the Gaussian blur: the line buffers need to hold 4 image lines for the 5 × 5 kernel. The image width is 1024 and the pixel size is 32 bits. Therefore, AnyHLS and Hipacc use eight 18K BRAMs, as shown in Table II. However, Halide-HLS stores 1028 integer pixels per line, which requires 16 18K BRAMs to buffer four image lines. This doubles the BRAM usage (see Table III).

AnyHLS uses the vectorization architecture proposed in [40]. This improves the use of registers compared to Hipacc and Halide.

The performance metrics and resource usage reported by Vivado HLS correlate with our Impala descriptions; hence we claim that the HLS code generated from AnyHLS' image processing library does not entail severe side effects for the synthesis with Vivado HLS. Hipacc and Halide-HLS have dedicated compiler backends for HLS code generation. These could be improved to achieve performance similar to AnyHLS. However, this is not a trivial task and is prone to errors. The advantage of AnyDSL's partial evaluation is that the user has control over code generation. Extending AnyHLS' image processing library only requires adding new functions in Impala (see Figure 2). Our intention in comparing AnyHLS with these DSLs is to show that we can generate equally good designs without creating an entire compiler backend.

2) Experiments using Intel FPGA SDK for OpenCL (AOCL): Table IV presents the implementation results for an edge detection algorithm provided as a design example by Intel. The algorithm consists of RGB to Luma color conversion, Sobel filters, and thresholding. Intel's implementation consists of a single work-item kernel that utilizes shift registers according to the FPGA design paradigm. These types of techniques are recommended by Intel's optimization guide [7], despite the fact that the same OpenCL code performs drastically worse on other computing platforms.

Table IV. PPnR results of an edge detection application for the Intel Cyclone V. Image sizes are 1024 × 1024. None of the implementations use DSPs.

v   Framework     #M10K  #ALM   #DSP  Throughput [MB/s]
1   Intel's Imp.    290  23830     0              419.5
1   AnyHLS          291  23797     0              422.5
1   Hipacc          318  25258     0              449.1
16  Intel's Imp.      -      -     0                  -
16  AnyHLS          337  29126     0             1278.3
16  Hipacc          362  35079     0             1327.7
32  Intel's Imp.      -      -     0                  -
32  AnyHLS          401  38069     0             1303.8
32  Hipacc          421  44059     0             1320.0

We described Intel's handwritten SobelLuma example using Hipacc and AnyHLS. Both Hipacc and AnyHLS provide a higher throughput even without vectorization. In order to reach the memory bound, we would have to rewrite Intel's hand-tuned design example to exploit further parallelism. AnyHLS uses slightly fewer resources, whereas Hipacc provides slightly higher throughput for all the vectorization factors. Similar to Figure 7, both frameworks yield throughputs very close to the memory bound of the Intel Cyclone V.
Figure 8. Design space for a 5 × 5 mean filter using an NDRange kernel (using the num_compute_units / num_simd_work_items attributes) and AnyHLS (using the vectorization factor v) for an Intel Cyclone V. [Scatter plot: throughput in MPixel/s (log scale) over hardware resources (logic utilization 20–80%); AnyHLS designs for v = 1, 2, 4, 8, 16; NDRange configurations such as CU1/SIMD1, CU4/SIMD16, CU16/SIMD1.]

Table V. PPnR results for the Intel Cyclone V. Missing numbers (-) indicate that the generated implementations do not fit the board.

App     v   Framework  #M10K  #ALM   #DSP  Throughput [MB/s]
Gauss   16  AnyHLS       401  37509     0             1330.1
Gauss   16  Hipacc       402  35090     0             1301.2
Jacobi  16  AnyHLS       370  31446     0             1328.8
Jacobi  16  Hipacc       372  30296     0             1282.9
Bilat.   1  AnyHLS       399  79270   153              326.6
Bilat.   1  Hipacc       422  79892   159              434.7
MF      16  AnyHLS       400  39266     0            1255.68
MF      16  Hipacc         -      -     -                  -
MF       8  Hipacc       351  31796     0             1275.9
FChain   8  AnyHLS       418  44807     0             1230.6
FChain   8  Hipacc       645  64225     0              427.4
Harris   8  AnyHLS       442  50537    96             1158.5
Harris   8  Hipacc       668  74246    96             187.14
The OpenCL NDRange kernel paradigm expresses data-level parallelism via multiple concurrent threads. OpenCL-based HLS tools exploit this paradigm to synthesize hardware. AOCL provides attributes for NDRange kernels to transform their iteration space. The num_compute_units attribute replicates the kernel logic, whereas num_simd_work_items vectorizes the kernel implementation.³ Combinations of these provide a vast design space for the same NDRange kernel. However, as Figure 8 demonstrates, AnyHLS achieves implementations that are orders of magnitude faster than using attributes in AOCL.

Finally, Table V and Figure 9 present a comparison between AnyHLS and the AOCL backend of Hipacc [45]. As shown in Figure 2, Hipacc has an individual backend and template library written with preprocessor directives to generate high-performance OpenCL code for FPGAs. In contrast, the application and library code in AnyHLS stays the same. The generated AOCL code consists of a loop that iterates over the input image. Compared to Hipacc, AnyHLS achieves similar performance but outperforms Hipacc for multi-kernel applications such as the Harris corner detector. This shows that AnyHLS optimizes the inter-kernel dependencies better than Hipacc (see Section IV-B2).

Figure 9. Throughput measurements for an Intel Cyclone V for the implementations generated from AnyHLS and Hipacc. Resource utilization for the same implementations is shown in Table V. [Bar chart: throughput in MPixel/s for Harris, Gauss, Bilateral, Jacobi, FChain, and MF.]

³ These parallelization attributes are suggested in [7] for NDRange kernels, not for single work-item kernels using shift registers such as the edge detection application shown in Table IV.

VI. CONCLUSIONS

In this paper, we advocate the use of modern compiler technologies for high-level synthesis. We combine functional abstractions with the power of partial evaluation to decouple a high-level algorithm description from the hardware design that implements it. This process is entirely driven by code refinement, generating input code to HLS tools, such as Vivado HLS and AOCL, from the same code base. To specify important abstractions for hardware design, we have introduced a set of basic primitives. Library developers can rely on these primitives to create domain-specific libraries. As an example, we have implemented an image processing library for synthesis to both Intel and Xilinx FPGAs. Finally, we have shown that our results are on par with or even better in performance than state-of-the-art approaches.

ACKNOWLEDGMENTS

This work is supported by the Federal Ministry of Education and Research (BMBF) as part of the Metacca, MetaDL, ProThOS, and REACT projects, as well as the Intel Visual Computing Institute (IVCI) at Saarland University. It was also partially funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – project number 146371743 – TRR 89 "Invasive Computing". Many thanks to our colleague Puya Amiri for his work on the pipeline support.

REFERENCES

[1] J. Bachrach et al., "Chisel: Constructing hardware in a Scala embedded language", in Proc. of the 49th Annual Design Automation Conf. (DAC), IEEE, Jun. 3–7, 2012.
[2] Y. Liu et al., "A Scala based framework for developing acceleration systems with FPGAs", Journal of Systems Architecture, vol. 98, 2019.
[3] J. Decaluwe, "MyHDL: A Python-based hardware description language", Linux Journal, no. 127, 2004.
[4] J. Cong et al., "High-level synthesis for FPGAs: From prototyping to deployment", IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 30, no. 4, 2011.
[5] J. Cong et al., "Automated accelerator generation and optimization with composable, parallel and pipeline architecture", in Proc. of the 55th Annual Design Automation Conf. (DAC), ACM, Jun. 24–29, 2018.
[6] Xilinx, Vivado Design Suite user guide: High-level synthesis, UG902, 2017.
[7] Intel, Intel FPGA SDK for OpenCL: Best practices guide, 2017.
[8] R. Leißa et al., "AnyDSL: A partial evaluation framework for programming high-performance libraries", Proc. of the ACM on Programming Languages (PACMPL), vol. 2, no. OOPSLA, Nov. 4–9, 2018.
[9] L.-N. Pouchet et al., "Polyhedral-based data reuse optimization for configurable computing", in Proc. of the ACM/SIGDA Int'l Symp. on Field Programmable Gate Arrays, ACM, 2013.
[10] R. Nane et al., "A survey and evaluation of FPGA high-level synthesis tools", IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 35, no. 10, 2015.
[11] G. Martin and G. Smith, "High-level synthesis: Past, present, and future", IEEE Design & Test of Computers, vol. 26, no. 4, 2009.
[12] D. F. Bacon et al., "FPGA programming for the masses", Communications of the ACM, vol. 56, no. 4, 2013.
[13] S. A. Edwards, "The challenges of synthesizing hardware from C-like languages", IEEE Design & Test of Computers, vol. 23, no. 5, 2006.
[14] J. Sanguinetti, "A different view: Hardware synthesis from SystemC is a maturing technology", IEEE Design & Test of Computers, vol. 23, no. 5, 2006.
[15] D. Koeplinger et al., "Spatial: A language and compiler for application accelerators", in Proc. of the 39th ACM SIGPLAN Conf. on Programming Language Design and Implementation (PLDI), ACM, Jun. 18–22, 2018.
[16] H. Eran et al., "Design patterns for code reuse in HLS packet processing pipelines", in 27th Annual Int'l Symp. on Field-Programmable Custom Computing Machines (FCCM), IEEE, 2019.
[17] J. S. da Silva et al., "Module-per-object: A human-driven methodology for C++-based high-level synthesis design", in 27th Annual Int'l Symp. on Field-Programmable Custom Computing Machines (FCCM), IEEE, 2019.
[18] D. Richmond et al., "Synthesizable higher-order functions for C++", IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 11, 2018.
[19] M. A. Özkan et al., "A highly efficient and comprehensive image processing library for C++-based high-level synthesis", in Proc. of the 4th Int'l Workshop on FPGAs for Software Programmers (FSP), VDE, 2017.
[20] J. de Fine Licht et al., "Transformations of high-level synthesis codes for high-performance computing", The Computing Research Repository (CoRR), 2018. arXiv: 1805.08288 [cs.DC].
[21] G. Ofenbeck et al., "Spiral in Scala: Towards the systematic construction of generators for performance libraries", in Proc. of the Int'l Conf. on Generative Programming: Concepts & Experiences (GPCE), ACM, Oct. 27–28, 2013.
[22] P. Milder et al., "Computer generation of hardware for linear digital signal processing transforms", ACM Trans. on Design Automation of Electronic Systems (TODAES), vol. 17, no. 2, 2012.
[23] J. Hegarty et al., "Darkroom: Compiling high-level image processing code into hardware pipelines", ACM Trans. on Graphics (TOG), vol. 33, no. 4, 2014.
[24] J. Hegarty et al., "Rigel: Flexible multi-rate image processing hardware", ACM Trans. on Graphics (TOG), vol. 35, no. 4, 2016.
[25] J. Pu et al., "Programming heterogeneous systems from an image processing DSL", ACM Trans. on Architecture and Code Optimization (TACO), vol. 14, no. 3, 2017.
[26] J. Ragan-Kelley et al., "Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines", in Proc. of the Conf. on Programming Language Design and Implementation (PLDI), ACM, Jun. 16–19, 2013.
[27] O. Reiche et al., "Generating FPGA-based image processing accelerators with Hipacc", in Proc. of the Int'l Conf. on Computer Aided Design (ICCAD), IEEE, Nov. 13–16, 2017.
[28] N. Chugh et al., "A DSL compiler for accelerating image processing pipelines on FPGAs", in Proc. of the Int'l Conf. on Parallel Architecture and Compilation Techniques (PACT), ACM, Sep. 11–15, 2016.
[29] Y. Chi et al., "SODA: Stencil with optimized dataflow architecture", in Proc. of the IEEE/ACM Int'l Conf. on Computer-Aided Design (ICCAD), IEEE, 2018.
[30] R. Stewart et al., "A dataflow IR for memory efficient RIPL compilation to FPGAs", in Proc. of the Int'l Conf. on Algorithms and Architectures for Parallel Processing (ICA3PP), Springer, Dec. 14–16, 2016.
[31] M. Kristien et al., "High-level synthesis of functional patterns with Lift", in Proc. of the 6th ACM SIGPLAN Int'l Workshop on Libraries, Languages and Compilers for Array Programming (ARRAY@PLDI), Phoenix, AZ, USA, Jun. 22, 2019.
[32] R. Baghdadi et al., "Tiramisu: A polyhedral compiler for expressing fast and portable code", in Proc. of the IEEE/ACM Int'l Symp. on Code Generation and Optimization (CGO), IEEE, Feb. 16–20, 2019.
[33] E. Del Sozzo et al., "A unified backend for targeting FPGAs from DSLs", in Proc. of the 29th Annual IEEE Int'l Conf. on Application-specific Systems, Architectures and Processors (ASAP), IEEE, Jul. 10–12, 2018.
[34] R. Leißa et al., "Shallow embedding of DSLs via online partial evaluation", in Proc. of the Int'l Conf. on Generative Programming: Concepts & Experiences (GPCE), ACM, Oct. 26–27, 2015.
[35] M. A. Özkan et al., "A journey into DSL design using generative programming: FPGA mapping of image border handling through refinement", in Proc. of the 5th Int'l Workshop on FPGAs for Software Programmers (FSP), VDE, 2018.
[36] N. D. Jones, C. K. Gomard, and P. Sestoft, Partial evaluation and automatic program generation. Prentice Hall, 1993.
[37] Y. Futamura, "Partial computation of programs", in Proc. of the RIMS Symposia on Software Science and Engineering, 1982.
[38] C. Consel, "New insights into partial evaluation: The SCHISM experiment", in Proc. of the 2nd European Symp. on Programming (ESOP), Springer, Mar. 21–24, 1988.
[39] M. Schmid et al., "Loop coarsening in C-based high-level synthesis", in Proc. of the 26th Annual IEEE Int'l Conf. on Application-specific Systems, Architectures and Processors (ASAP), IEEE, 2015.
[40] M. A. Özkan et al., "Hardware design and analysis of efficient loop coarsening and border handling for image processing", in Proc. of the Int'l Conf. on Application-specific Systems, Architectures and Processors (ASAP), IEEE, Jul. 10–12, 2017.
[41] G. Stitt et al., "Scalable window generation for the Intel Broadwell+Arria 10 and high-bandwidth FPGA systems", in Proc. of the ACM/SIGDA Int'l Symp. on Field-Programmable Gate Arrays (FPGA), ACM, Feb. 25–27, 2018.
[42] Y.-k. Choi et al., "A quantitative analysis on microarchitectures of modern CPU-FPGA platforms", in Proc. of the 53rd Annual Design Automation Conf. (DAC), ACM, Jun. 5–9, 2016.
[43] Core evaluation license agreement, version 2014.06, Xilinx, Inc., Jun. 2014. [Online]. Available: https://www.xilinx.com/products/intellectual-property/license/core-evaluation-license-agreement.html.
[44] Intel program license subscription agreement, version Rev. 10/2009, Intel Corporation, Oct. 2009. [Online]. Available: https://www.intel.com/content/www/us/en/programmable/downloads/software/license/lic-prog_lic.html.
[45] M. A. Özkan et al., "FPGA-based accelerator design from a domain-specific language", in Proc. of the 26th Int'l Conf. on Field-Programmable Logic and Applications (FPL), IEEE, Aug. 29–Sep. 2, 2016.