2010, International Journal of …
Field-Programmable Gate Arrays (FPGAs) are becoming increasingly important in embedded and high-performance computing systems. They offer performance close to that of Application-Specific Integrated Circuits (ASICs) while retaining design and implementation flexibility. However, programming FPGAs efficiently requires hardware-developer expertise and mastery of hardware description languages (HDLs) such as VHDL or Verilog. Attempts to provide a high-level compilation flow (e.g., from C programs) must still address open issues before broadly efficient results can be obtained. Bearing in mind the hardware resources available in contemporary FPGAs, we developed LALP (Language for Aggressive Loop Pipelining), a novel language for programming FPGA-based accelerators, together with its compilation framework. The main ideas behind LALP are to provide a higher abstraction level than HDLs, to exploit the intrinsic parallelism of hardware resources, and to let the programmer control execution stages whenever compiler techniques fail to generate efficient implementations. These features are particularly useful for implementing loop pipelining, a well-regarded technique for accelerating computations in several application domains. This paper describes LALP and shows how it can be used to achieve high-performance embedded computing solutions.
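The payoff of loop pipelining can be seen in a back-of-the-envelope cycle model (a minimal sketch of the general technique, not of LALP itself; the function names and parameters are ours): without pipelining, each iteration occupies the whole datapath, while with pipelining a new iteration starts every II (initiation interval) cycles.

```python
# Toy cycle model of loop pipelining: a loop body with `depth` pipeline
# stages, executed `n_iters` times.

def sequential_cycles(n_iters, depth):
    # Without pipelining, each iteration runs start-to-finish alone.
    return n_iters * depth

def pipelined_cycles(n_iters, depth, ii=1):
    # With pipelining, a new iteration enters every `ii` cycles; the
    # last iteration still needs `depth` cycles to drain the pipeline.
    return depth + (n_iters - 1) * ii

# A 5-stage body over 1000 iterations:
print(sequential_cycles(1000, 5))   # 5000
print(pipelined_cycles(1000, 5))    # 1004
```

At II = 1 the speedup approaches the pipeline depth for large iteration counts, which is why keeping the pipeline full is worth explicit programmer control.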
Journal of Signal Processing Systems, 2017
Current tools for High-Level Synthesis (HLS) excel at exploiting Instruction-Level Parallelism (ILP). The support for Data-Level Parallelism (DLP), one of the key advantages of Field Programmable Gate Arrays (FPGAs), is in contrast very limited. This work examines the exploitation of DLP on FPGAs using code generation for C-based HLS of image filters and streaming pipelines. In addition to well-known loop tiling techniques, we propose loop coarsening, which delivers superior performance and scalability. Loop tiling corresponds to splitting an image into separate regions, which are then processed in parallel by replicated accelerators. For data streaming, this also requires the generation of glue logic for the distribution of image data. Conversely, loop coarsening allows processing multiple pixels in parallel, whereby only the kernel operator is replicated within a single accelerator. We present concrete implementations of tiling and coarsening for Vivado HLS and Altera OpenCL. Furthermore, we present a comparison of our implementations to the keyword-driven parallelization support provided by the Altera Offline Compiler. We augment the FPGA back end of the heterogeneous Domain-Specific Language (DSL) framework Hipacc to generate loop coarsening implementations for Vivado HLS and Altera OpenCL. Moreover, we compare the resulting FPGA accelerators to highly optimized software implementations for Graphics Processing Units (GPUs), all generated from exactly the same code base.
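The distinction between tiling and coarsening can be illustrated on a 1-D 3-tap blur (a software sketch with our own naming, not code from the paper): tiling splits the image into regions, each processed by a replicated kernel with a small halo, while coarsening emits several output pixels per loop iteration inside one kernel.

```python
# (a) the reference filter
def blur(img):
    return [(img[i] + img[i+1] + img[i+2]) // 3 for i in range(len(img) - 2)]

# (b) "tiling": the image is split into regions processed by replicated
# accelerators; each tile needs a 2-pixel halo, which is what the glue
# logic distributing streamed image data must provide.
def blur_tiled(img, tile=4):
    out = []
    n = len(img) - 2                     # number of output pixels
    for start in range(0, n, tile):
        stop = min(start + tile, n)
        out.extend(blur(img[start:stop + 2]))  # region + halo
    return out

# (c) "coarsening": one iteration emits `factor` pixels; only the kernel
# operator is replicated within a single accelerator.
def blur_coarsened(img, factor=2):
    out = []
    n = len(img) - 2
    i = 0
    while i + factor <= n:
        out.extend((img[i+k] + img[i+k+1] + img[i+k+2]) // 3
                   for k in range(factor))
        i += factor
    out.extend(blur(img)[i:])            # leftover pixels (sketch only)
    return out

img = list(range(16))
print(blur_tiled(img) == blur(img), blur_coarsened(img) == blur(img))
```

Both variants reproduce the reference output; they differ in where replication happens (across accelerators vs. inside one), which is what drives the resource and scalability trade-off the paper measures.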
2013 23rd International Conference on Field Programmable Logic and Applications, 2013
Whether for use as the final target or simply a rapid prototyping platform, programming systems containing FPGAs is challenging. Some of the difficulty is due to the difference between the models used to program hardware and software, but great effort is also required to coordinate the simultaneous execution of the application running on the microprocessor with the accelerated kernel(s) running on the FPGA. In this paper we present a new methodology and programming model for introducing hardware-acceleration to an application running in software. The application is represented as a data-flow graph and the computation at each node in the graph is specified for execution either in software or on the FPGA using the programmer's language of choice. We have implemented an interface compiler which takes as its input the FIFO edges of the graph and generates code to connect all the different parts of the program, including those which communicate across the hardware/software boundary. Our methodology and compiler enable programmers to effectively exploit FPGA acceleration without ever leaving the application space.
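A software-only sketch of this model (our naming, not the paper's toolchain): nodes are plain functions, edges are FIFOs, and the interface compiler's role — connecting the parts, including across the hardware/software boundary — is played here by ordinary threads and queues.

```python
import queue, threading

def producer(out_fifo):
    for x in range(5):
        out_fifo.put(x)
    out_fifo.put(None)                 # end-of-stream token

def square(in_fifo, out_fifo):         # this node could run in SW or as an FPGA kernel
    while (x := in_fifo.get()) is not None:
        out_fifo.put(x * x)
    out_fifo.put(None)

def consumer(in_fifo, results):
    while (x := in_fifo.get()) is not None:
        results.append(x)

# Wire the graph: producer -> square -> consumer over FIFO edges.
q1, q2, results = queue.Queue(), queue.Queue(), []
threads = [threading.Thread(target=f, args=a) for f, a in
           [(producer, (q1,)), (square, (q1, q2)), (consumer, (q2, results))]]
for t in threads: t.start()
for t in threads: t.join()
print(results)   # [0, 1, 4, 9, 16]
```

Because each node touches only its FIFOs, moving `square` from software to an FPGA kernel changes the glue, not the application code — which is the methodology's point.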
Research efforts have shown the strength of FPGA-based acceleration in a wide range of application domains where compute kernels can execute efficiently on an FPGA device. Due to the complex process of FPGA-based accelerator design, design productivity is a major issue, restricting the effective use of these accelerators to niche disciplines involving highly skilled hardware engineers. Coarse-grained FPGA overlays, such as VectorBlox MXP and DSP-block-based overlays, have been shown to be effective when paired with general-purpose processors, offering software-like programmability, fast compilation, application portability, and improved design productivity. These architectures enable general-purpose hardware accelerators, allowing hardware design at a higher level of abstraction. This report presents an analysis of compute kernels (extracted from compute-intensive applications) and their implementation on multiple hardware accelerators, such as GPUs, Altera OpenCL (AOCL)-generated hardware accelerators, and FPGA-based overlays. We experiment with simple programming models such as OpenCL and overlay APIs to produce hardware-accelerated designs with software-like abstractions. We first analyze two existing use cases of hardware acceleration, one of which highlights the performance benefits of compiler optimizations: they provide almost a 16× improvement in execution time on the ARM processor of the Zedboard, owing to the SIMD NEON engine that accelerates execution. Using the MXP overlay for the same application improves execution time even beyond the SIMD NEON engine. The other case study analyzes the feasibility of dynamically loading tasks onto the FPGA fabric and the effect on execution time. We use AOCL to create accelerators for multiple tasks and then perform dynamic reconfiguration through the software API.
We show that overlays are preferable in such a scenario because of their ease of use, simple programming model, and dynamic task loading without actual reconfiguration of the FPGA fabric: the same task executed on an overlay runs much faster because no new bitstream is needed. We present experiments comparing naive implementations of several compute kernels with hardware-accelerated versions built using either OpenCL or overlay APIs. We observe up to a 10× timing improvement in applications such as 12-tap FIR filtering and almost 100× in applications such as 2D convolution. These improvements were obtained with very basic, naive hardware accelerators generated at a high level of programming abstraction (OpenCL/overlay APIs); with optimizations, performance can surely be improved further, a key area for future research beyond this thesis. Finally, we make the case for hardware virtualization through the cloud and demonstrate how a simple web browser can program remote computing platforms connected to cloud servers. Such virtualization methods could be used in teaching labs and for hardware evaluations.
2012
The development of applications for high-performance Field Programmable Gate Array (FPGA)-based embedded systems is a long and error-prone process. Typically, developers need to be deeply involved in all the stages of the translation and optimization of an application described in a high-level programming language to a lower-level design description to ensure the solution meets the required functionality and performance.
Proceedings of the 50th Annual Design Automation Conference on - DAC '13, 2013
FPGA-based accelerators have repeatedly demonstrated superior speed-ups on an ever-widening spectrum of applications. However, their use remains beyond the reach of traditionally trained application developers because of the complexity of their programming toolchain. Compilers for high-level languages targeting FPGAs have to bridge a huge abstraction gap between two divergent computational models: a temporal, sequentially consistent, control-driven execution in the stored-program model versus a spatial, parallel, data-flow-driven execution in the spatial hardware model. In this paper we discuss these challenges to the compiler designer and report on our experience with the ROCCC toolset.
Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, 2021
Modern field-programmable gate arrays (FPGAs) have recently powered high-profile efficiency gains in systems from datacenters to embedded devices by offering ensembles of heterogeneous, reconfigurable hardware units. Programming stacks for FPGAs, however, are stuck in the past: they are based on traditional hardware languages, which were appropriate when FPGAs were simple, homogeneous fabrics of basic programmable primitives. We describe Reticle, a new low-level abstraction for FPGA programming that, unlike existing languages, explicitly represents the special-purpose units available on a particular FPGA device. Reticle has two levels: a portable intermediate language and a target-specific assembly language. We show how to use a standard instruction selection approach to lower intermediate programs to assembly programs, which can be both faster and more effective than the complex metaheuristics that existing FPGA toolchains use. We use Reticle to implement linear algebra operators and coroutines and find that Reticle compilation runs up to 100 times faster than current approaches while producing comparable or better run-time and utilization.
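In the spirit of the instruction-selection approach described (a generic sketch with made-up rules and names, not Reticle's actual languages), lowering can be as simple as pattern-matching each intermediate op to a target-specific unit, e.g. wide arithmetic to DSP blocks and narrow arithmetic to LUTs:

```python
# Toy lowering table: (op, width class) -> target-specific unit.
RULES = {("mul", "wide"): "dsp.mul", ("mul", "narrow"): "lut.mul",
         ("add", "wide"): "dsp.add", ("add", "narrow"): "lut.add"}

def lower(ir):
    """Map a list of (op, bit-width) intermediate instructions to
    target 'assembly' strings by rule lookup."""
    asm = []
    for op, width in ir:
        unit = RULES[(op, "wide" if width > 8 else "narrow")]
        asm.append(f"{unit} w{width}")
    return asm

print(lower([("mul", 16), ("add", 4)]))  # ['dsp.mul w16', 'lut.add w4']
```

A table-driven pass like this is deterministic and fast, which hints at why instruction selection can outrun the metaheuristic placement-style mapping of conventional toolchains.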
Lecture Notes in Computer Science, 2011
The recent increase in circuit complexity has made high-level synthesis tools a must in digital circuit design. However, these tools come with several limitations, one of which is the efficient use of pipelined arithmetic operators. This paper explains how to generate efficient hardware with pipelined operators for regular codes with perfect loop nests. The part to be mapped to the operator is identified, then the program is scheduled so that each operator result is available exactly when it is needed by the operator, keeping the operator busy and avoiding the use of a temporary buffer. Finally, we show how to generate the VHDL code for the control unit and how to link it with specialized pipelined floating-point operators generated using the open-source FloPoCo tool. The method has been implemented in the Bee research compiler, and experimental results on DSP kernels are promising, with at least 94% utilization of the pipelined operators for a complex kernel.
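The scheduling constraint can be illustrated with a classic reduction (a toy model of ours, not Bee or FloPoCo output): a pipelined adder of latency L delivers each result L cycles after it is issued, so to keep the adder busy without a temporary buffer, the schedule interleaves L independent partial sums — each result comes back exactly when its partial sum is next needed.

```python
# Model: a pipelined adder with latency `latency` accepts one operation
# per cycle but returns each result `latency` cycles later.  Interleaving
# `latency` partial sums means slot k's previous result is always ready
# by the time slot k is scheduled again.

def reduce_with_pipelined_adder(xs, latency):
    partial = [0] * latency            # one running sum per pipeline slot
    for i, x in enumerate(xs):
        slot = i % latency             # slot k recurs every `latency` cycles,
        partial[slot] += x             # exactly when its result is available
    return sum(partial)                # short final combine of the partials

print(reduce_with_pipelined_adder(range(10), 3))  # 45
```

The adder issues one operation every cycle with no stalls and no buffering of in-flight results, which is the "result available exactly when needed" property the paper's scheduler enforces.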
The Art, Science, and Engineering of Programming
In recent years, heterogeneous computing has emerged as the vital way to increase computers' performance and energy efficiency by combining diverse hardware devices, such as Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs). The rationale behind this trend is that different parts of an application can be offloaded from the main CPU to diverse devices, which can efficiently execute these parts as co-processors. FPGAs are among the most widely used co-processors, typically employed to accelerate specific workloads thanks to their flexible hardware and energy-efficient characteristics. These characteristics have made them prevalent in a broad spectrum of computing systems, ranging from low-power embedded systems to high-end data centers and cloud infrastructures. However, these hardware characteristics come at the cost of programmability. Developers who create their applications in high-level programming languages (e.g., Java, Python) must familiarize themselves with a hardware description language (e.g., VHDL, Verilog) or with recent heterogeneous programming models (e.g., OpenCL, HLS) in order to exploit the co-processors' capacity and tune the performance of their applications. Currently, the above-mentioned heterogeneous programming models exclusively support compilation from languages such as C and C++, so the transparent integration of heterogeneous co-processors into the software ecosystem of managed programming languages (e.g., Java, Python) is not seamless. In this paper we rethink the engineering trade-offs that we encountered, in terms of transparency and compilation overheads, while integrating FPGAs into high-level managed programming languages. We present a novel approach that enables runtime code specialization techniques for seamless and high-performance execution of Java programs on FPGAs.
The proposed solution is prototyped in the context of the Java programming language and TornadoVM, an open-source programming framework for Java execution on heterogeneous hardware. Finally, we evaluate the proposed solution for FPGA execution against both sequential and multithreaded Java implementations, showcasing up to 224× and 19.8× performance speedups, respectively, and up to 13.82× compared to TornadoVM running on an Intel integrated GPU. We also provide a breakdown analysis of the proposed compiler optimizations for FPGA execution, as a means to project their impact on the applications' characteristics.
2015 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), 2015
Hardware acceleration is often used to address the need for speed and computing power in embedded systems. FPGAs have always represented a good solution for HW acceleration and, recently, new SoC platforms have extended the flexibility of FPGAs by combining both high-performance CPUs and FPGA fabric on a single chip. The aim of this work is the implementation of hardware accelerators for these new SoCs. The innovative feature of these accelerators is on-the-fly reconfiguration of the hardware to dynamically adapt the accelerator's functionality to the current CPU workload. Realizing the accelerators also requires preliminary profiling of both SW (ARM CPU + NEON units) and HW (FPGA) performance, an evaluation of partial reconfiguration times, and the development of an application-specific IP-core library. This paper focuses on profiling both the SW and HW implementations of the same operations, using arithmetic routines (BLAS) as the benchmarking reference, and presents a comparison of the results in terms of speed, power consumption, and resource utilization.
ACM Transactions on Architecture and Code Optimization, 2008
The wider acceptance of FPGAs as computing devices requires a higher level of programming abstraction. ROCCC is an optimizing C-to-HDL compiler, and we describe its code generation approach. The smart buffer is a component that reuses input data between adjacent iterations; it significantly improves the performance of the circuit and simplifies loop control. The ROCCC-generated datapath can execute one loop iteration per clock cycle when there is no loop dependency or only a scalar recurrence-variable dependency. ROCCC's approach to supporting while-loops operating on scalars enables the compiler to move scalar iterative computation into hardware.
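The smart-buffer idea can be sketched in software (our illustration, not ROCCC output): for a 3-tap filter, adjacent iterations share two of their three inputs, so each "cycle" fetches only one new word and shifts it into a small window, instead of re-reading all three.

```python
from collections import deque

def fir3(samples, taps=(1, 2, 1)):
    window = deque([0, 0, 0], maxlen=3)   # the smart buffer: 3-word window
    out = []
    for s in samples:                     # one new memory read per iteration;
        window.append(s)                  # the other two values are reused
        out.append(sum(t * w for t, w in zip(taps, window)))
    return out

print(fir3([1, 0, 0, 1]))   # [1, 2, 1, 1]
```

With the window always full after the first few iterations, the datapath can accept one input and produce one output per clock cycle — the one-iteration-per-cycle behavior the abstract describes.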
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2008
Proceedings of the 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (Cat. No. 97TB100186)
Lecture Notes in Computer Science, 2002
The Computer Journal, 2011
SIGPLAN Notices, 2002
2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2011
2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2015
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2011