AnyHLS: High-Level Synthesis with Partial Evaluation
Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
arXiv:2002.05796v2 [cs.PL] 21 Jul 2020

Abstract—FPGAs excel in low power and high throughput computations, but they are challenging to program. Traditionally, developers rely on hardware description languages like Verilog or VHDL to specify the hardware behavior at the register-transfer level. High-Level Synthesis (HLS) raises the level of abstraction, but still requires FPGA design knowledge. Programmers usually write pragma-annotated C/C++ programs to define the hardware architecture of an application. However, each hardware vendor extends its own C dialect using its own vendor-specific set of pragmas. This prevents portability across different vendors. Furthermore, pragmas are not first-class citizens in the language. This makes it hard to use them in a modular way or to design proper abstractions.

In this paper, we present AnyHLS, an approach to synthesize FPGA designs in a modular and abstract way. AnyHLS is able to raise the abstraction level of existing HLS tools by resorting to programming language features such as types and higher-order functions as follows: It relies on partial evaluation to specialize and to optimize the user application based on a library of abstractions. Then, vendor-specific HLS code is generated for Intel and Xilinx FPGAs. Portability is obtained by avoiding any vendor-specific pragmas in the source code. In order to validate the achievable gains in productivity, a library for the domain of image processing is introduced as a case study, and its synthesis results are compared with several state-of-the-art Domain-Specific Language (DSL) approaches for this domain.

I. INTRODUCTION

Field Programmable Gate Arrays (FPGAs) consist of a network of reconfigurable digital logic cells that can be configured to implement any combinatorial logic or sequential circuit. This allows the design of custom application-tailored hardware. In particular, memory-intensive applications benefit from FPGA implementations by exploiting fast on-chip memory for high throughput. These features make FPGA implementations orders of magnitude faster or more energy-efficient than CPU implementations in these areas. However, FPGA programming poses challenges to programmers unacquainted with hardware design.

FPGAs are traditionally programmed at the Register-Transfer Level (RTL). This requires modeling digital signals, their timing, the flow between registers, as well as the operations performed on them. Hardware Description Languages (HDLs) such as Verilog or VHDL allow for the explicit description of arbitrary circuits but require significant coding effort and verification time. This makes design iterations time-consuming and error-prone, even for experts: The code needs to be rewritten for different performance or area objectives. In recent languages such as Chisel [1], VeriScala [2], and MyHDL [3], programmers can create a functional description of their design but still stick to the RTL.

High-Level Synthesis (HLS) increases the abstraction level to an untimed high-level specification similar to imperative programming languages and automatically solves low-level design issues such as clock-level timing, register allocation, and structural pipelining [4]. However, HLS code that is optimized for the synthesis of high-performance circuits is fundamentally different from a software program delivering high performance on a CPU. This is due to the significant gap between the programming paradigms: An HLS compiler has to optimize the memory hierarchy of a hardware implementation and parallelize its data paths [5].

In order to achieve good Quality of Results (QoR), HLS languages demand that programmers also specify the hardware architecture of an application instead of just its algorithm. For this reason, HLS languages offer hardware-specific pragmas. This ad-hoc mix of software and hardware features makes it difficult for programmers to optimize an application. In addition, most HLS tools rely on their own C dialect, which prevents code portability. For example, Xilinx Vivado HLS [6] uses C++ as its base language while the Intel SDK [7] (formerly Altera) uses OpenCL C. These severe restrictions make it hard to use existing HLS languages in a portable and modular way.

In this paper, we advocate describing FPGA designs using functional abstractions and partial evaluation to generate optimized HLS code. Consider Figure 1 for an example from image processing: With a functional language, we separate the description of the sobel_x operator from its realization in hardware. The hardware realization make_local_op is a function that specifies the data path, the parallelization, and the memory architecture. Thus, the algorithm and hardware architecture are described by a set of higher-order functions. A partial evaluator, ultimately, combines these functions to generate HLS code that delivers high-performance circuit designs when compiled with HLS tools. Since the initial descriptions are high-level, compact, and functional, they are reusable and distributable as a library. We leverage the AnyDSL compiler framework [8] to perform partial evaluation and extend it to generate input code for HLS tools targeting Intel and Xilinx FPGA devices. We claim that this approach leads to more modular and portable code than existing HLS approaches, and is able to produce highly
[Figure 1 omitted: block diagram of the local operator pipeline — line buffers feed a sliding window and v parallel operators op1 … opv, with Mem1D (W × H, v) streams at the input and output and Mem2D windows of sizes (1, h, v) and (w + v − 1, h, 1) in between.]

Figure 1. AnyHLS example: The algorithm description sobel_x is decoupled from its realization in hardware make_local_op. The hardware realization is a function that specifies important transformations for the exploitation of parallelism and the memory architecture. The function generate(vhls) selects the backend for code generation, which is Vivado HLS in this case. Ultimately, an optimized input code for HLS is generated by partially evaluating the algorithm and realization functions.
application domain by themselves (see Section III). AnyHLS is thereby built on top of AnyDSL [8] (see Section II-C). AnyDSL offers partial evaluation to enable shallow embedding [34] without the need for modifying a compiler. This means that there is no need to change the compiler when adding support for a new application domain, since programmers can design custom control structures. Partial evaluation specializes algorithmic variants of a program at compile time. Compared to metaprogramming, partial evaluation operates in a single language and preserves the well-typedness of programs [8]. Furthermore, different combinations of static/dynamic parameters can be instantiated from the same code. Previously, we have shown how to abstract image border handling implementations for Intel FPGAs using AnyDSL [35]. In this paper, we present AnyHLS and an image processing library to synthesize FPGA designs in a modular and abstract way for both Intel and Xilinx FPGAs.

2 https://anydsl.github.io

Thus, the calls

let z = pow(x, 5);        let z = pow(3, 5);

will result in the following equivalent sequences of instructions after specialization:

let y = x * x;            let z = 243;
let z = x * y * y;

As syntactic sugar, @ is available as shorthand for @(true). This causes the partial evaluator to always specialize the annotated function.

FPGA implementations must be statically defined for QoR: types, loops, functions, and interfaces must be resolved at compile time [16], [18], [19]. Partial evaluation has many advantages compared to metaprogramming, as discussed in Section II-B. Hence, Impala's partial evaluation is particularly useful to optimize HLS descriptions.
[Figure 2 omitted: code generation flows — halide-app.cpp and hipacc-app.cpp pass through the Halide and Hipacc compilers to VHLS/AOCL backends, while anyhls-app.impala passes through the AnyDSL compiler (partial evaluator) together with the Image Processing Lib.impala, targeting Xilinx and Intel FPGAs.]

Figure 2. FPGA code generation flows for Halide, Hipacc, and AnyHLS (from left to right). VHLS and AOCL are used as acronyms for Vivado HLS and Intel FPGA SDK for OpenCL, respectively. Halide and Hipacc rely on domain-specific compilers for image processing that instantiate template libraries. AnyHLS allows defining all abstractions for a domain in a language called Impala and relies on partial evaluation for code specialization. This ensures maintainability and extensibility of the provided domain-specific library—for image processing in this example.

2) Generators: Because iteration on various domains is a common pattern, Impala provides syntactic sugar for invoking certain higher-order functions. The loop

for var1, ..., varn in iter(arg1, ..., argn) { /* ... */ }

translates to an anonymous function

|var1, ..., varn| { /* ... */ }

that is passed to iter as the last argument. We call functions that are invokable like this generators. Domain-specific libraries implemented in Impala make heavy use of these features, as they allow programmers to write custom generators that take advantage of both domain knowledge and certain hardware features, as we will see in the next section.

Generators are particularly powerful in combination with partial evaluation. Consider the following functions:
type Body = fn(int) -> ();
fn @(?a & ?b) unroll(a: int, b: int, body: Body) -> () {
    if a < b { body(a); unroll(a+1, b, body) }
}
fn @ range(a: int, b: int, body: Body) -> () {
    unroll($a, b, body)
}

Both generators iterate from a (inclusive) to b (exclusive) while invoking body each time. The filter unroll tells the partial evaluator to completely unroll the recursion if both loop bounds are statically known at a particular call site.

III. THE ANYHLS LIBRARY

Efficient and resource-friendly FPGA designs require application-specific optimizations. These optimizations and transformations are well known in the community. For example, de Fine Licht et al. [20] discuss the key transformations of HLS codes such as loop unrolling and pipelining. They describe the whole hardware design from the low-level memory layout to the operator implementations with support for low-level loop transformations throughout the design. In our setting, the programmer defines and provides these abstractions using AnyDSL for a given domain in the form of a library. We rely on partial evaluation to combine those abstractions and to remove the overhead associated with them. Ultimately, the AnyDSL compiler synthesizes optimized HLS code (C++ or OpenCL C) from a given functional description of an algorithm, as shown in Figure 2. The generated code goes to the selected HLS tool. This is in contrast to other domain-specific approaches like Halide-HLS [25] or Hipacc [27], which rely on domain-specific compilers to instantiate predefined templates or macros. Hipacc makes use of two distinct libraries to synthesize algorithmic abstractions to Vivado HLS and Intel AOCL, while AnyHLS uses the same image processing library that is described in Impala.

A. HLS Code Generation

For HLS code generation, we implemented an intrinsic named vhls in AnyHLS to emit Vivado HLS and an intrinsic named opencl to emit AOCL:

with vhls() { body() }        with opencl() { body() }

With opencl we use a grid and block size of (1, 1, 1) to generate a single work-item kernel, as the official AOCL documentation recommends [7]. We extended AnyDSL's OpenCL runtime with the extensions of the Intel OpenCL SDK. To provide an abstraction over both HLS backends, we create a wrapper generate that expects a code generation function:

type Backend = fn(fn() -> ()) -> ();
fn @ generate(be: Backend, body: fn() -> ()) -> () {
    with be() { body() }
}

Switching backends is now just a matter of passing an appropriate function to generate:

let backend = vhls; // or opencl
with generate(backend) { body() }

B. Building Abstractions for FPGA Designs

In the following, we present abstractions for the key transformations and design patterns that are common in FPGA design. These include (a) important loop transformations, (b) control flow and data flow descriptions such as reductions and Finite State Machines (FSMs), and (c) the explicit utilization of different memory types. Approaches like Spatial [15] expose these patterns within the language—new patterns require dedicated support from the compiler. Hence, these languages and compilers are restricted to the specialized application domain they have been designed for. In AnyHLS, Impala's functional language and partial evaluation allow us to design the abstractions needed for FPGA synthesis in the form of a library. New patterns can be added to the library without dedicated support from the compiler. This makes AnyHLS easier to extend compared to the approaches mentioned before.

1) Loop Transformations: C++ compilers usually provide certain preprocessor directives that perform particular code transformations. A common feature is to unroll loops (see the left-hand side):

for (int i=0; i<N/W; ++i) {        for i in range(0, N/W) {
    #pragma unroll                     for w in unroll(0, W) {
    for (int w=0; w<W; ++w) {              body(i*W + w);
        body(i*W + w);                 }
    }                              }
}

Such pragmas are built into the compiler. The Impala version (shown at the right) uses generators that are entirely implemented as a library. Partial evaluation optimizes Impala's range and unroll abstractions as well as the input body function according to their static inputs, i.e., N, W. The residual program consists of the consecutive body functions according to the value of W, as shown in Figure 3. This generates concise and clean code for the target HLS compiler, which is drastically different from using a pragma.

[Figure 3 omitted: replication of body for the four cases — no unrolling, unroll inner loop, unroll outer loop, unroll inner and outer loop.]

Figure 3. Parallel processing

Generators, unlike C++ pragmas, are first-class citizens of the Impala language. This allows programmers to implement sophisticated loop transformations. For example, the following function tile returns a new generator. It instantiates a tiled loop nest of the specified tile size with the loops inner and outer:

type Loop = fn(int, int, fn(int) -> ()) -> ();
fn @ tile(size: int, inner: Loop, outer: Loop) -> Loop {
    @|beg, end, body| outer(0, (end-beg)/size,
        |i| inner(i*size + beg, (i+1)*size + beg, |j| body(j)))
}

let schedule = tile(W, unroll, range);
for i in schedule(0, N) {
    body(i)
}

Passing W for the tiling size, unroll for the inner loop, and range for the outer loop yields a generator that is identical to the loop nest at the beginning of this paragraph. With this design, we can reuse or explore iteration techniques without touching the actual body of a for loop. For example, consider the processing options for a two-dimensional loop nest as shown in Figure 3: When just passing range as inner and outer loop, the partial evaluator will keep the loop nest and, hence, not unroll body and instantiate it only once. Unrolling the inner loop replicates body and increases the bandwidth requirements accordingly. Unrolling the outer loop also replicates body, but in a way that benefits data reuse from the temporal locality of an iterative algorithm. Unrolling both loops replicates body for increased bandwidth and data reuse for temporal locality.

C/C++-based HLS solutions often use a pragma to mark a loop amenable for pipelining. This means parallel execution of the loop iterations in hardware. For example, the following code on the left uses an initiation interval (II) of 3:

for (int i=0; i<N; ++i) {          let II = 3;
    #pragma HLS pipeline II=3      for i in pipeline(II, 0, N) {
    body(i);                           body(i)
}                                  }

Instead of a pragma (on the left), AnyHLS uses the intrinsic generator pipeline (on the right). Unlike the above loop abstractions (e.g., unroll), Impala emits a tool-specific pragma for the pipeline abstraction. This provides portability across different HLS tools. Furthermore, it allows the programmer to invoke and pass around pipeline—just like any other generator.

2) Reductions: Reductions are useful in many contexts. The following function takes an array of values, a range within, and an operator:

type T = int;
fn @(?beg & ?end) reduce(beg: int, end: int, input: &[T],
                         op: fn(T, T) -> T) -> T {
    let n = end - beg;
    if n == 1 {
        input(beg)
    } else {
        let m = (end + beg) / 2;
        let a = reduce(beg, m, input, op);
        let b = reduce(m, end, input, op);
        op(a, b)
    }
}

In the above filter, the recursion will be completely unfolded if the range is statically known. Thus,

reduce(0, 4, [a, b, c, d], |x, y| x + y)

yields: (a + b) + (c + d).

3) Finite State Machines: AnyHLS models computations that depend not only on the inputs but also on an internal state with an FSM. To define an FSM, programmers need to specify states and a transition function that determines when to change the current state based on the machine's input. This is especially beneficial for modeling control flow. To describe an FSM in Impala, we start by introducing types to represent the states and the machine itself:

type State = int;
struct FSM {
    add: fn(State, fn() -> (), fn() -> State) -> (),
    run: fn(State) -> ()
}

An object of type FSM provides two operations: adding one state with add or running the computation. The add method takes the name of the state, an action to be performed for this state, and a transition function associated with this state. Once all states are added, the programmer runs the machine by passing the initial state as an input parameter. The following example adds 1 to every element of an array:

let buf = /*...*/;
let mut (idx, pixel) = (0, 0);
let fsm = make_fsm();
fsm.add(Read, || pixel = buf(idx),
        || if idx >= len { Exit } else { Compute });
fsm.add(Compute, || pixel += 1, || Write);
fsm.add(Write, || buf(idx++) = pixel, || Read);
fsm.run(Read);

Similar to the other abstractions introduced in this section, the constructor for an FSM is not a built-in function of the compiler but a regular Impala function. In some cases, we want to execute the FSM in a pipelined way. For this scenario, we add a second method run_pipelined. As all the methods, e.g., make_fsm, add, run, are annotated for partial evaluation (by @), the input functions to these methods will be optimized according to their static inputs. Ultimately, AnyHLS will emit the states of an FSM as part of a loop according to the selected run method.
4) Memory Types and Memory Abstractions: FPGAs have different memory types of varying sizes and access properties. Impala supports four memory types specific to hardware design (see Figure 4): global memory, on-chip memory, registers, and streams. Global memory (typically DRAM) is allocated on the host using our runtime and accessed through regular pointers. On-chip memory (e.g., BRAM or M10K/M20K) for the FPGA is allocated using the reserve_onchip compiler intrinsic. Memory accesses using the pointer returned by this intrinsic will map to on-chip memory. Standard variables are mapped to registers, and a specific stream type is available to allow for the communication between FPGA kernels. Memory-wise, a stream is mapped to registers or on-chip memory by the HLS tools. These FPGA-specific memory types in Impala will be mapped to their corresponding tool-specific declarations in the residual program (on-chip memory will be defined as local memory for AOCL, whereas it will be defined as an array in Vivado HLS).

[Figure 4 omitted: the four memory types — global memory, on-chip memory, register, stream.]

Figure 4. Memory types provided for FPGA design

[Figure 5 omitted: the memory abstractions — Regs1D (1D register array), Regs2D (2D register array), OnChipArray (on-chip array), StreamArray (stream array).]

Figure 5. Memory abstractions

a) Memory partitioning: An array partitioning pragma must be defined as follows to implement a C array with hardware registers using Vivado HLS [6]:

typedef int T;
T Regs1D[size];
#pragma HLS variable=Regs1D array_partition dim=0

Listing 1. A typical way of partitioning an array by using pragmas in existing HLS tools.

Other HLS tools offer similar pragmas for the same task. Instead, AnyHLS provides a more concise description of a register array without using any tool-specific pragma by the recursive declaration of registers as follows:

type T = int;
struct Regs1D {
    read: fn(int) -> T,
    write: fn(int, T) -> (),
    size: int
}
fn @ make_regs1d(size: int) -> Regs1D {
    if size == 0 {
        Regs1D {
            read: @|_| 0,
            write: @|_, _| (),
            size: size
        }
    } else {
        let mut reg: T;
        let others = make_regs1d(size - 1);
        Regs1D {
            read: @|i| if i+1 == size { reg }
                       else { others.read(i) },
            write: @|i, v| if i+1 == size { reg = v }
                           else { others.write(i, v) },
            size: size
        }
    }
}

Listing 2. Recursive description of a register array using partial evaluation instead of declaring an array and partitioning it by HLS pragmas.

When the size is not zero, each recursive call to this function allocates a register variable named reg, and creates a smaller register array with one element less named others. The read and write functions test if the index i is equal to the index of the current register. In the case of a match, the current register is used. Otherwise, the search continues in the smaller array. The generator make_regs1d returns an Impala variable that can be read and written by index values (regs in the following code), similar to C arrays.

let regs = make_regs1d(size);

However, it defines size number of registers in the residual program instead of declaring an array and partitioning it by tool-specific pragmas as in Listing 1. The generated code does not contain any compiler directives; hence it can be used for different HLS tools (e.g., Vivado HLS, AOCL). Since we annotated make_regs1d, read, and write for partial evaluation, any call to these functions will be inlined recursively. This means that the search to find the register to read from or write to will be performed at compile time. These registers will be optimized by the AnyDSL compiler, just like any other variables: unnecessary assignments will be avoided, and clean HLS code will be generated.

Correspondingly, AnyHLS provides generators (similar to Listing 2) for one- and two-dimensional arrays of on-chip memory (e.g., line buffers in Section IV), global memory, and streams (as illustrated in Figure 5) instead of using the memory partitioning pragmas encouraged in existing HLS tools (as in Listing 1).

IV. A LIBRARY FOR IMAGE PROCESSING ON FPGA

AnyHLS allows for defining domain-specific abstractions and optimizations that are used and applied prior to generating customized input to existing HLS tools. In this section, we introduce a library that is developed to support HLS for the domain of image processing applications. It is based on the fundamental abstractions introduced in Section III-B. Our low-level implementation is similar to existing domain-specific languages targeting FPGAs [24], [27]. For this reason, we focus on the interface of our abstractions as seen by the programmer.

We design applications by decoupling their algorithmic description from their schedule and memory operations. For instance, typical image operators, such as the following Sobel filter, just resort to the make_local_op generator. Similarly, we implement a point operator for RGB-to-gray color conversion as follows (Listing 3):

fn sobel_edge(output: &mut [T], input: &[T]) -> () {
    let img = make_raw_mem2d(width, height, input);
    let dx = make_raw_mem2d(width, height, output);
    let sobel_extents = extents(1, 1); // for 3x3 filter
    let operator = make_local_op(4, // vector factor
        sobel_operator_x, sobel_extents, mirror, mirror);
    with generate(hls) { operator(img, dx); }
}

fn rgb2gray(output: &mut [T], input: &[T]) -> () {
    let img = make_raw_img(width, height, input);
    let gray = make_raw_img(width, height, output);
    let operator = make_point_op(@ |pix| {
        let r = pix & 0xFF;
        let g = (pix >> 8) & 0xFF;
        let b = (pix >> 16) & 0xFF;
        (r + g + b) / 3
    });
    with generate(hls) { operator(img, gray); }
}

Listing 3. Sobel filter and RGB-to-gray color conversion as example applications described by using our library.

The image data structure is opaque. The target platform mapping determines its layout. AnyHLS provides common border handling functions as well as point and global operators such as reductions (see Section III-B2). These operators are composable to allow for more sophisticated ones.

A. Vectorization

Image processing applications consist of loops that possess a very high degree of spatial parallelism. This should be exploited to reach the bandwidth speed of memory technologies. A resource-efficient approach, so-called vectorization or loop coarsening, is to aggregate the input pixels to vectors and process multiple input data at the same time to calculate multiple output pixels in parallel [39]–[41]. This replicates only the arithmetic operations applied to the data (the so-called datapath) instead of the whole accelerator, similar to Single Instruction Multiple Data (SIMD) architectures. Vectorization requires a control structure specialized to the considered hardware design. We support the automatic vectorization of an application by a given factor v when using our image processing library. In particular, our library uses the vectorization techniques proposed in [40]. For example, the make_local_op function has an additional parameter to specify the desired vectorization and will propagate this information to the functions it uses internally: make_local_op(op, v). For brevity, we omit the parameter for the vectorization factor for the remaining abstractions in this section.

B. Memory Abstractions for Image Processing

1) Memory Accessor: In order to optimize memory access and encapsulate the contained memory type (on-chip memory, etc.) into a data structure, we decouple the data transfer from the data use via the following memory abstractions:

struct Mem1D {                    struct Mem2D {
    read: fn(int) -> T,               read: fn(int, int) -> T,
    write: fn(int, T) -> (),          write: fn(int, int, T) -> (),
    update: fn(int) -> (),            update: fn(int, int) -> (),
    size: int                         width: int, height: int
}                                 }

Similar to hardware design practices, these memory abstractions require the memory address to be updated before the read/write operations. The update function transfers data from/to the encapsulated memory to/from staging registers using vector data types. Then, the read/write functions access an element of the vector. This increases data reuse and the DRAM-to-on-chip memory bandwidth [42].

2) Stream Processing: Inter-kernel dependencies of an algorithm should be accessed on-the-fly in combination with fine-granular communication in order to pipeline the full implementation with a fixed throughput. That is, as soon as a block produces one data item, the next block consumes it. In the best case, this requires only a single register or a small buffer instead of reading/writing to temporary images:

[Figure omitted: three kernels chained by streams — Mem1D → Kernel1 → Mem1D → Kernel2 → Mem1D → Kernel3 → Mem1D.]

We define a stream between two kernels as follows:

fn make_mem_from_stream(size: int, data: stream) -> Mem1D;

3) Line Buffers: Storing an entire image in on-chip memory before execution is not feasible since on-chip memory blocks are limited in FPGAs. On the other hand, feeding the data on demand from main memory is extremely slow. Still, it is possible to leverage fast on-chip memory by using it as FIFO buffers containing only the necessary lines of the input images (W pixels per line).

[Figure omitted: an array of line buffers (W, h, v) turning a Mem1D (W, v) input into a Mem2D (1, h, v) column output.]

This enables parallel reads at the output for every pixel read at the input. We model a line buffer as follows:

type LineBuf1D = fn(Mem1D) -> Mem1D;
fn make_linebuf1d(width: int) -> LineBuf1D;
// similar for LineBuf2D

Akin to Regs1D (see Section III-B4), a recursive call builds an array of line buffers (each line buffer will be declared by a separate memory component in the residual program, similar to the on-chip array in Figure 5).

4) Sliding Window: Registers are the most amenable resources to hold data for highly parallelized access. A sliding window of size w × h updates the constituting shift registers by a new column of h pixels and enables parallel access to w · h pixels.

[Figure omitted: a sliding window turning a Mem2D (1, h, v) column input into a Mem2D (w, h, 1) window output.]

This provides high data reuse for temporal locality and avoids the waste of on-chip memory blocks that might be utilized for a similar data bandwidth. Our implementation uses make_regs2d for an explicit declaration of registers and supports pixel-based indexing at the output. This will instantiate w · h registers in the residual program.
type Swin2D = fn(Mem2D) -> Mem2D;
fn @ make_sliding_window(w: int, h: int) -> Swin2D {
    let win = make_regs2d(w, h);
    // ...
}

C. Loop Abstractions for Image Processing

1) Point Operators: Algorithms such as image scaling and color transformation calculate an output pixel for every input pixel. The point operator abstraction (see Listing 4) in AnyHLS yields a vectorized pipeline over the input and output image. This abstraction is parametric in its vector factor v and the desired operator function op.

type PointOp = fn(Mem1D) -> Mem1D;
fn @ make_point_op(v: int, op: Op) -> PointOp {
    @ |img, out| {
        for idx in pipeline(1, 0, img.size) {
            img.update(idx);
            for i in unroll(0, v) {
                out.write(i, op(img.read(i)));
            }
            out.update(idx);
        }
    }
}

Listing 4. Implementation of the point operator abstraction.

The total latency is

L = Larith + ⌈W/v⌉ · H cycles    (2)

where W and H are the width and height of the input image, and Larith is the latency of the data path.

type LocalOp = fn(Mem1D) -> Mem1D;
fn @ make_local_op(v: int, op: Op, ext: Extents,
                   bh_lower: FnBorder,
                   bh_upper: FnBorder) -> LocalOp {
    @ |img, out| {
        let mut (col, row, idx) = (0, 0, 0);
        let wait = /* initial latency */;
        let fsm = make_fsm();
        fsm.add(Read, || img.update(idx), || Compute);
        fsm.add(Compute, || {
            line_buffer.update(col);
            sliding_window.update(row);
            col_sel.update(col);
            for i in unroll(0, v) {
                out.write(i, op(col_sel.read(i)));
            }
        }, || if idx > wait { Write } else { Index });
        fsm.add(Write, || out.update(idx-wait-1), || Index);
        fsm.add(Index, || {
            idx++; col++;
            if col == img_width { col=0; row++; }
        }, || if idx < img.size { Read } else { Exit });
        fsm.run_pipelined(Read, 1, 0, img.size);
    }
}

Listing 5. Implementation of the local operator abstraction.

Compared to the local operator in Figure 1, we also support boundary handling. We specify the extent of the local operator (filter size / 2) as well as functions specifying the boundary handling for the lower and upper bounds. Then, row and column selection functions apply border handling correspondingly in the x- and y-directions by using one-dimensional multiplexer arrays similar to Özkan et al. [40].
2) Local Operators: Algorithms such as Gaussian blur and Sobel edge detection calculate an output pixel by considering the corresponding input pixel and a certain neighborhood of it in a local window. Thus, a local operator with a w × h window requires w · h pixel reads for every output. The same (w − 1) · h pixels are used to calculate results at the image coordinates (x, y) and (x + 1, y). This spatial locality is transformed into temporal locality when input images are read in raster order in burst mode, and subsequent pixels are sequentially processed with a streaming pipeline implementation. The local operator implementation in AnyHLS (shown in Listing 5) consists of line buffers and a sliding window that hold dependency pixels in on-chip memory, and it calculates a new result for every new pixel read.
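The line-buffer and sliding-window scheme just described can be sketched behaviorally in Python. This is an illustrative software model under assumptions (zero padding, window anchored at the newest pixel), not the Impala implementation; `local_op_stream` is a hypothetical name:

```python
from collections import deque

# Behavioral sketch of a streaming local operator: (h - 1) line buffers
# delay the input by one image row each, and a w-column sliding window
# holds the current neighborhood; one output is produced per input pixel.
def local_op_stream(pixels, width, w, h, op, pad=0):
    line_buffers = [deque([pad] * width) for _ in range(h - 1)]
    window = deque([[pad] * h for _ in range(w)], maxlen=w)
    outputs = []
    for p in pixels:
        # Column of h vertically adjacent pixels ending at the new pixel.
        column = [pad] * h
        column[h - 1] = p
        carry = p
        for i in range(h - 2, -1, -1):
            line_buffers[i].append(carry)       # shift pixel into buffer
            carry = line_buffers[i].popleft()   # pixel from one row above
            column[i] = carry
        window.append(column)                   # shift window right by one
        outputs.append(op(window))
    return outputs

# 3x3 box sum over a 4x4 image of ones; the last output sees a full window.
res = local_op_stream([1] * 16, width=4, w=3, h=3,
                      op=lambda win: sum(sum(c) for c in win))
print(res[-1])  # 9
```

Each fixed-length deque models an on-chip FIFO of one image line; because its delay is exactly `width` cycles, popping it yields the pixel directly above the incoming one, which is why the hardware needs only w · h registers plus h − 1 line buffers instead of re-reading the neighborhood from memory.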
[Diagram: local operator — a Mem1D input stream (W × H, v) feeds line buffers (Mem2D, (1, h, v)), which feed a sliding window (Mem2D, (w + v − 1, h, 1)); row and column selection (sel) provides the operands for v parallel operators op1 … opv, which write a Mem1D output stream (W × H, v).]

This provides a throughput of v pixels per clock cycle at the cost of an initial latency (v is the vectorization factor)

    Linitial = Larith + (⌊h/2⌋ · ⌈W/v⌉ + ⌊⌈w/v⌉/2⌋)    (3)

that is spent for caching neighboring pixels of the first calculation. The final latency is thus:

    L = Linitial + ⌈W/v⌉ · H    (4)

V. EVALUATION AND RESULTS

In the following, we compare the Post Place and Route (PPnR) results of AnyHLS and other state-of-the-art domain-specific approaches, including Halide-HLS [25] and Hipacc [27]. The generated HLS codes are compiled using Intel FPGA SDK for OpenCL 18.1 and Xilinx Vivado HLS 2017.2, targeting a Cyclone V GT 5CGTD9 FPGA and a Zynq XC7Z020 FPGA, respectively.

The generated hardware designs are evaluated for their throughput, latency, and resource utilization. FPGAs possess two types of resources: (i) computational: LUTs and DSP blocks; (ii) memory: flip-flops (FFs) and on-chip memory (BRAM/M20K). A SLICE/ALM comprises look-up tables (LUTs) and flip-flops; considered together with the DSP and on-chip memory blocks, it thus indicates the overall resource usage.

The implementation results presented for Vivado HLS feature only the kernel logic, while those by Intel OpenCL include PCIe interfaces. The execution time of an FPGA circuit (Vivado HLS implementation) equals Tclk · latency, where Tclk is the clock period of the maximum achievable clock frequency (lower is better). We measured the timing results for Intel OpenCL by executing the applications on a Cyclone V GT 5CGTD9 FPGA. This is the case for all analyzed applications. We have no intention nor license rights [43, §4] [44, §2] to benchmark and compare the considered FPGA technologies or HLS tools.

A. Applications

In our experimental evaluation, we consider the following applications:
• … as a pre-processing algorithm
• bilateral filter (Bilateral), a 5 × 5 floating-point kernel as an edge-preserving and noise-reducing function based on exponential functions
• mean filter (MF), a 5 × 5 filter that determines the average within a local window via 8-bit arithmetic
• SobelLuma, an edge detection algorithm provided as a design example by Intel. The algorithm consists of RGB to Luma color conversion, Sobel filters, and thresholding

B. Library Optimizations

Figure 6. Execution time for naïve and streaming pipeline implementations. [Bar chart: execution time in ms for naïve and streaming pipeline versions of Harris and FChain.]

2) Vectorization: Many FPGA implementations benefit from parallel processing in order to increase memory bandwidth. AnyHLS implicitly parallelizes a given image pipeline by a vectorization factor v. As an example, Figure 7 shows the PPnR results, along with the achieved memory throughput, for different vectorization factors of the mean filter on a Cyclone V. The memory bound of the Cyclone V is reported by Intel's …

Figure 7. PPnR results of AnyHLS's mean filter implementation on an Intel Cyclone V. The memory bound of the device for our setup is 1344.80 MB/s. [Plot: memory throughput and memory bound [MB/s], and resource usage in % (on-chip memory blocks, logic resources) over vectorization factors v = 1, 2, 4, 8, 16, 32.]

… latency given in Equation (4), which is L = Larith + 1,042,442 clock cycles for Gauss when v = 1. Larith = 14 for AnyHLS' Gauss implementation, as shown in Table II.

Table II (fragment). PPnR results for Gauss (v = 1):

Framework    #BRAM  #SLICE  #DSP  Latency [cycles]  Throughput [MB/s]
AnyHLS           8    1646    16         1050641              801.8
Halide-HLS      16    2096    50         1060897              458.7
Hipacc           8    1709    16         1052693              820.1
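The latency and line-buffer BRAM figures in Table II can be reproduced from Equations (2)–(4) together with a simple memory model. This is a hedged sketch: the 18Kb BRAM geometry (at most 18-bit words by 1024 entries per block) is an assumption about the target device, and the computed latency matches the reported value only up to a one-cycle difference:

```python
import math

def latency_point(L_arith, W, H, v):
    # Equation (2): point-operator latency.
    return L_arith + math.ceil(W / v) * H

def latency_local(L_arith, W, H, v, w, h):
    # Equations (3) and (4): line-buffer fill time plus streaming phase.
    L_init = L_arith + (h // 2) * math.ceil(W / v) + math.ceil(w / v) // 2
    return L_init + math.ceil(W / v) * H

def brams_18k(pixels_per_line, pixel_bits, lines, depth=1024, word_bits=18):
    # Blocks to cover the pixel width, times segments to cover one line,
    # times the number of buffered lines.
    return (lines * math.ceil(pixel_bits / word_bits)
                  * math.ceil(pixels_per_line / depth))

# Gauss, 5x5 window, 1024x1024 image, v = 1, L_arith = 14:
print(latency_local(14, 1024, 1024, 1, 5, 5))  # 1050640 cycles
print(brams_18k(1024, 32, 4))  # 8  (buffering exactly 1024 pixels/line)
print(brams_18k(1028, 32, 4))  # 16 (buffering 1028 padded pixels/line)
```

Note how exceeding the 1024-entry depth of a single block by even a few pixels per line doubles the BRAM count, which is the effect discussed below for Halide-HLS.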
ii) Halide-HLS pads input images according to the selected border handling mode (even when no border handling is defined). This increases the input image size from (W, H) to (W + w − 1, H + h − 1), and thus the latency.
iii) Hipacc does not pad input images, but runs (H + ⌊h/2⌋) · (W + ⌊w/2⌋) loop iterations for a (W × H) image and a (w × h) window. This is similar to the convolution example in the Vivado Design Suite User Guide [6], but not optimal.

The execution time of an implementation equals Tclk · latency, where Tclk is the clock period of the maximum achievable clock frequency (lower is better). Overall, AnyHLS processes a given image faster than the other DSL implementations.

Halide-HLS uses more on-chip memory for line buffers (see Section IV-C2) than Hipacc and AnyHLS because of its image padding for border handling. Let us consider the number of BRAMs utilized for the Gaussian blur: the line buffers need to hold 4 image lines for the 5 × 5 kernel. The image width is 1024 and the pixel size is 32 bits. Therefore, AnyHLS and Hipacc use eight 18K BRAMs, as shown in Table II. However, Halide-HLS stores 1028 integer pixels per line, which requires 16 18K BRAMs to buffer four image lines. This doubles the BRAM usage (see Table III).

AnyHLS uses the vectorization architecture proposed in [40]. This improves the use of registers compared to Hipacc and Halide.

The performance metrics and resource usage reported by Vivado HLS correlate with our Impala descriptions; hence we claim that the HLS code generated from AnyHLS' image processing library does not entail severe side effects for the synthesis with Vivado HLS. Hipacc and Halide-HLS have dedicated compiler backends for HLS code generation. These could be improved to achieve performance similar to AnyHLS. However, this is not a trivial task and is prone to errors. The advantage of AnyDSL's partial evaluation is that the user has control over code generation. Extending AnyHLS' image processing library only requires adding new functions in Impala (see Figure 2). Our intention in comparing AnyHLS with these DSLs is to show that we can generate equally good designs without creating an entire compiler backend.

2) Experiments using Intel FPGA SDK for OpenCL (AOCL): Table IV presents the implementation results for an edge detection algorithm provided as a design example by Intel. The algorithm consists of RGB to Luma color conversion, Sobel filters, and thresholding. Intel's implementation consists of a single work-item kernel that utilizes shift registers according to the FPGA design paradigm. These types of techniques are recommended by Intel's optimization guide [7], despite the fact that the same OpenCL code performs drastically worse on other computing platforms.

Table IV. PPnR results of an edge detection application for the Intel Cyclone V. Image sizes are 1024 × 1024. None of the implementations use DSPs.

v   Framework     #M10K  #ALM   #DSP  Throughput [MB/s]
1   Intel's Imp.    290  23830     0              419.5
1   AnyHLS          291  23797     0              422.5
1   Hipacc          318  25258     0              449.1
16  Intel's Imp.      -      -     0                  -
16  AnyHLS          337  29126     0             1278.3
16  Hipacc          362  35079     0             1327.7
32  Intel's Imp.      -      -     0                  -
32  AnyHLS          401  38069     0             1303.8
32  Hipacc          421  44059     0             1320.0

We described Intel's handwritten SobelLuma example using Hipacc and AnyHLS. Both Hipacc and AnyHLS provide a higher throughput even without vectorization. In order to reach the memory bound, we would have to rewrite Intel's hand-tuned design example to exploit further parallelism. AnyHLS uses slightly fewer resources, whereas Hipacc provides slightly higher throughput for all the vectorization factors. Similar to Figure 7, both frameworks yield throughputs very close to the memory bound of the Intel Cyclone V.
Figure 8. Design space for a 5 × 5 mean filter using an NDRange kernel (using the num_compute_units / num_simd_work_items attributes) and AnyHLS (using the vectorization factor v) for an Intel Cyclone V. [Scatter plot: throughput in MPixel/s (log scale) over hardware resources (logic utilization 20–80%); AnyHLS designs for v = 1, 2, 4, 8, 16; NDRange configurations such as CU1/SIMD1, CU4/SIMD16, CU16/SIMD1.]

Table V. PPnR results for the Intel Cyclone V. Missing numbers (-) indicate that the generated implementations do not fit the board.

App     v   Framework  #M10K  #ALM   #DSP  Throughput [MB/s]
Gauss   16  AnyHLS       401  37509     0             1330.1
Gauss   16  Hipacc       402  35090     0             1301.2
Jacobi  16  AnyHLS       370  31446     0             1328.8
Jacobi  16  Hipacc       372  30296     0             1282.9
Bilat.   1  AnyHLS       399  79270   153              326.6
Bilat.   1  Hipacc       422  79892   159              434.7
MF      16  AnyHLS       400  39266     0            1255.68
MF      16  Hipacc         -      -     -                  -
MF       8  Hipacc       351  31796     0             1275.9
FChain   8  AnyHLS       418  44807     0             1230.6
FChain   8  Hipacc       645  64225     0              427.4
Harris   8  AnyHLS       442  50537    96             1158.5
Harris   8  Hipacc       668  74246    96             187.14
The OpenCL NDRange kernel paradigm expresses data-level parallelism via multiple concurrent threads. OpenCL-based HLS tools exploit this paradigm to synthesize hardware. AOCL provides attributes for NDRange kernels to transform their iteration space. The num_compute_units attribute replicates the kernel logic, whereas num_simd_work_items vectorizes the kernel implementation.³ Combinations of these provide a vast design space for the same NDRange kernel. However, as Figure 8 demonstrates, AnyHLS achieves implementations that are orders of magnitude faster than using attributes in AOCL.

Finally, Table V and Figure 9 present a comparison between AnyHLS and the AOCL backend of Hipacc [45]. As shown in Figure 2, Hipacc has an individual backend and template library written with preprocessor directives to generate high-performance OpenCL code for FPGAs. In contrast, the application and library code in AnyHLS stays the same. The generated AOCL code consists of a loop that iterates over the input image. Compared to Hipacc, AnyHLS achieves similar performance but outperforms Hipacc for multi-kernel applications such as the Harris corner detector. This shows that AnyHLS optimizes the inter-kernel dependencies better than Hipacc (see Section IV-B2).

Figure 9. Throughput measurements for an Intel Cyclone V for the implementations generated from AnyHLS and Hipacc. Resource utilization for the same implementations is shown in Table V. [Bar chart: throughput in MPixel/s for Harris, Gauss, Bilateral, Jacobi, FChain, and MF.]

³ These parallelization attributes are suggested in [7] for NDRange kernels, not for single work-item kernels using shift registers such as the edge detection application shown in Table IV.

VI. CONCLUSIONS

In this paper, we advocate the use of modern compiler technologies for high-level synthesis. We combine functional abstractions with the power of partial evaluation to decouple a high-level algorithm description from the hardware design that implements it. This process is entirely driven by code refinement, generating input code to HLS tools, such as Vivado HLS and AOCL, from the same code base. To specify important abstractions for hardware design, we have introduced a set of basic primitives. Library developers can rely on these primitives to create domain-specific libraries. As an example, we have implemented an image processing library for synthesis to both Intel and Xilinx FPGAs. Finally, we have shown that our results are on par with or even better in performance than state-of-the-art approaches.

ACKNOWLEDGMENTS

This work is supported by the Federal Ministry of Education and Research (BMBF) as part of the Metacca, MetaDL, ProThOS, and REACT projects, as well as the Intel Visual Computing Institute (IVCI) at Saarland University. It was also partially funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – project number 146371743 – TRR 89 "Invasive Computing". Many thanks to our colleague Puya Amiri for his work on the pipeline support.

REFERENCES

[1] J. Bachrach et al., "Chisel: Constructing hardware in a Scala embedded language", in Proc. of the 49th Annual Design Automation Conf. (DAC), IEEE, Jun. 3–7, 2012.
[2] Y. Liu et al., "A Scala based framework for developing acceleration systems with FPGAs", Journal of Systems Architecture, vol. 98, 2019.
[3] J. Decaluwe, "MyHDL: A Python-based hardware description language", Linux Journal, no. 127, 2004.
[4] J. Cong et al., "High-level synthesis for FPGAs: From prototyping to deployment", IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 30, no. 4, 2011.
[5] J. Cong et al., "Automated accelerator generation and optimization with composable, parallel and pipeline architecture", in Proc. of the 55th Annual Design Automation Conf. (DAC), ACM, Jun. 24–29, 2018.
[6] Xilinx, Vivado Design Suite user guide: High-level synthesis, UG902, 2017.
[7] Intel, Intel FPGA SDK for OpenCL: Best practices guide, 2017.
[8] R. Leißa et al., "AnyDSL: A partial evaluation framework for programming high-performance libraries", Proc. of the ACM on Programming Languages (PACMPL), vol. 2, no. OOPSLA, Nov. 4–9, 2018.
[9] L.-N. Pouchet et al., "Polyhedral-based data reuse optimization for configurable computing", in Proc. of the ACM/SIGDA Int'l Symp. on Field Programmable Gate Arrays, ACM, 2013.
[10] R. Nane et al., "A survey and evaluation of FPGA high-level synthesis tools", IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 35, no. 10, 2015.
[11] G. Martin and G. Smith, "High-level synthesis: Past, present, and future", IEEE Design & Test of Computers, vol. 26, no. 4, 2009.
[12] D. F. Bacon et al., "FPGA programming for the masses", Communications of the ACM, vol. 56, no. 4, 2013.
[13] S. A. Edwards, "The challenges of synthesizing hardware from C-like languages", IEEE Design & Test of Computers, vol. 23, no. 5, 2006.
[14] J. Sanguinetti, "A different view: Hardware synthesis from SystemC is a maturing technology", IEEE Design & Test of Computers, vol. 23, no. 5, 2006.
[15] D. Koeplinger et al., "Spatial: A language and compiler for application accelerators", in Proc. of the 39th ACM SIGPLAN Conf. on Programming Language Design and Implementation (PLDI), ACM, Jun. 18–22, 2018.
[16] H. Eran et al., "Design patterns for code reuse in HLS packet processing pipelines", in 27th Annual Int'l Symp. on Field-Programmable Custom Computing Machines (FCCM), IEEE, 2019.
[17] J. S. da Silva et al., "Module-per-object: A human-driven methodology for C++-based high-level synthesis design", in 27th Annual Int'l Symp. on Field-Programmable Custom Computing Machines (FCCM), IEEE, 2019.
[18] D. Richmond et al., "Synthesizable higher-order functions for C++", IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 11, 2018.
[19] M. A. Özkan et al., "A highly efficient and comprehensive image processing library for C++-based high-level synthesis", in Proc. of the 4th Int'l Workshop on FPGAs for Software Programmers (FSP), VDE, 2017.
[20] J. de Fine Licht et al., "Transformations of high-level synthesis codes for high-performance computing", The Computing Research Repository (CoRR), 2018. arXiv: 1805.08288 [cs.DC].
[21] G. Ofenbeck et al., "Spiral in Scala: Towards the systematic construction of generators for performance libraries", in Proc. of the Int'l Conf. on Generative Programming: Concepts & Experiences (GPCE), ACM, Oct. 27–28, 2013.
[22] P. Milder et al., "Computer generation of hardware for linear digital signal processing transforms", ACM Trans. on Design Automation of Electronic Systems (TODAES), vol. 17, no. 2, 2012.
[23] J. Hegarty et al., "Darkroom: Compiling high-level image processing code into hardware pipelines", ACM Trans. on Graphics (TOG), vol. 33, no. 4, 2014.
[24] J. Hegarty et al., "Rigel: Flexible multi-rate image processing hardware", ACM Trans. on Graphics (TOG), vol. 35, no. 4, 2016.
[25] J. Pu et al., "Programming heterogeneous systems from an image processing DSL", ACM Trans. on Architecture and Code Optimization (TACO), vol. 14, no. 3, 2017.
[26] J. Ragan-Kelley et al., "Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines", in Proc. of the Conf. on Programming Language Design and Implementation (PLDI), ACM, Jun. 16–19, 2013.
[27] O. Reiche et al., "Generating FPGA-based image processing accelerators with Hipacc", in Proc. of the Int'l Conf. on Computer Aided Design (ICCAD), IEEE, Nov. 13–16, 2017.
[28] N. Chugh et al., "A DSL compiler for accelerating image processing pipelines on FPGAs", in Proc. of the Int'l Conf. on Parallel Architecture and Compilation Techniques (PACT), ACM, Sep. 11–15, 2016.
[29] Y. Chi et al., "SODA: Stencil with optimized dataflow architecture", in Proc. of the IEEE/ACM Int'l Conf. on Computer-Aided Design (ICCAD), IEEE, 2018.
[30] R. Stewart et al., "A dataflow IR for memory efficient RIPL compilation to FPGAs", in Proc. of the Int'l Conf. on Algorithms and Architectures for Parallel Processing (ICA3PP), Springer, Dec. 14–16, 2016.
[31] M. Kristien et al., "High-level synthesis of functional patterns with Lift", in Proc. of the 6th ACM SIGPLAN Int'l Workshop on Libraries, Languages and Compilers for Array Programming (ARRAY@PLDI), Phoenix, AZ, USA, Jun. 22, 2019.
[32] R. Baghdadi et al., "Tiramisu: A polyhedral compiler for expressing fast and portable code", in Proc. of the IEEE/ACM Int'l Symp. on Code Generation and Optimization (CGO), IEEE, Feb. 16–20, 2019.
[33] E. Del Sozzo et al., "A unified backend for targeting FPGAs from DSLs", in Proc. of the 29th Annual IEEE Int'l Conf. on Application-specific Systems, Architectures and Processors (ASAP), IEEE, Jul. 10–12, 2018.
[34] R. Leißa et al., "Shallow embedding of DSLs via online partial evaluation", in Proc. of the Int'l Conf. on Generative Programming: Concepts & Experiences (GPCE), ACM, Oct. 26–27, 2015.
[35] M. A. Özkan et al., "A journey into DSL design using generative programming: FPGA mapping of image border handling through refinement", in Proc. of the 5th Int'l Workshop on FPGAs for Software Programmers (FSP), VDE, 2018.
[36] N. D. Jones, C. K. Gomard, and P. Sestoft, Partial evaluation and automatic program generation. Prentice Hall, 1993.
[37] Y. Futamura, "Partial computation of programs", in Proc. of the RIMS Symposia on Software Science and Engineering, 1982.
[38] C. Consel, "New insights into partial evaluation: The SCHISM experiment", in Proc. of the 2nd European Symp. on Programming (ESOP), Springer, Mar. 21–24, 1988.
[39] M. Schmid et al., "Loop coarsening in C-based high-level synthesis", in Proc. of the 26th Annual IEEE Int'l Conf. on Application-specific Systems, Architectures and Processors (ASAP), IEEE, 2015.
[40] M. A. Özkan et al., "Hardware design and analysis of efficient loop coarsening and border handling for image processing", in Proc. of the Int'l Conf. on Application-specific Systems, Architectures and Processors (ASAP), IEEE, Jul. 10–12, 2017.
[41] G. Stitt et al., "Scalable window generation for the Intel Broadwell+Arria 10 and high-bandwidth FPGA systems", in Proc. of the ACM/SIGDA Int'l Symp. on Field-Programmable Gate Arrays (FPGA), ACM, Feb. 25–27, 2018.
[42] Y.-k. Choi et al., "A quantitative analysis on microarchitectures of modern CPU-FPGA platforms", in Proc. of the 53rd Annual Design Automation Conf. (DAC), ACM, Jun. 5–9, 2016.
[43] Core evaluation license agreement, version 2014.06, Xilinx, Inc., Jun. 2014. [Online]. Available: https://www.xilinx.com/products/intellectual-property/license/core-evaluation-license-agreement.html.
[44] Intel program license subscription agreement, version Rev. 10/2009, Intel Corporation, Oct. 2009. [Online]. Available: https://www.intel.com/content/www/us/en/programmable/downloads/software/license/lic-prog_lic.html.
[45] M. A. Özkan et al., "FPGA-based accelerator design from a domain-specific language", in Proc. of the 26th Int'l Conf. on Field-Programmable Logic and Applications (FPL), IEEE, Aug. 29–Sep. 2, 2016.