2011, Lecture Notes in Computer Science
The recent increase in circuit complexity has made high-level synthesis tools a necessity in digital circuit design. However, these tools come with several limitations, one of which is the inefficient use of pipelined arithmetic operators. This paper explains how to generate efficient hardware with pipelined operators for regular codes with perfect loop nests. The part to be mapped to the operator is identified, then the program is scheduled so that each operator result is available exactly when it is needed by the operator, keeping the operator busy and avoiding the use of a temporary buffer. Finally, we show how to generate the VHDL code for the control unit and how to link it with specialized pipelined floating-point operators generated by the open-source FloPoCo tool. The method has been implemented in the Bee research compiler, and experiments on DSP kernels are promising, with a minimum of 94% utilization of the pipelined operators for a complex kernel.
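The scheduling idea described in this abstract can be illustrated with a small software model (a sketch only; the function name and Python setting are illustrative, not the paper's implementation). For a reduction mapped onto a pipelined floating-point adder of depth D, one classic way to keep the pipeline busy without a temporary buffer is to keep D independent partial sums in flight, so the adder accepts a new operand every cycle:

```python
def pipelined_reduction(values, depth):
    """Model a sum reduction through a pipelined adder of depth `depth`.

    Instead of stalling `depth` cycles after each addition, we rotate
    through `depth` independent partial sums: the slot used at cycle i
    has finished its previous addition by cycle i + depth, so the
    operator stays fully utilized.
    """
    partial = [0.0] * depth           # one running sum per pipeline slot
    for i, v in enumerate(values):
        partial[i % depth] += v       # slot i % depth is free again now
    return sum(partial)               # short final combination step
```

With `depth` independent accumulators the adder never stalls; the only extra cost is the short final combination of the partial sums.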
Microprocessors and Microsystems, 2012
The increased capacity and enhanced features of modern FPGAs open new opportunities for their use as application accelerators. However, for FPGAs to be accepted as mainstream acceleration solutions, long design cycles must be shortened by using high-level synthesis tools in the design process. Current HLS tools targeting FPGAs have several limitations, including the inefficient use of deeply pipelined arithmetic operators, commonly encountered in high-throughput FPGA designs. We focus here on the efficient generation of FPGA-specific hardware accelerators for regular codes with perfect loop nests whose inner statements are implemented as a pipelined arithmetic operator, which is often the case for scientific codes using floating-point arithmetic. We propose a semi-automatic code generation process in which the arithmetic operator is identified and generated. Its pipeline information is used to reschedule the initial program execution so as to keep the operator's pipeline as busy as possible while minimizing memory accesses. Next, we show how our method can be used as a tool to generate control FSMs for multiple parallel computing cores. Finally, we show that accounting for the application's accuracy needs allows designing smaller and faster operators.
2009
Custom operators, working at custom precisions, are a key ingredient to fully exploit the FPGA flexibility advantage for high-performance computing. Unfortunately, such operators are costly to design, and application designers tend to rely on less efficient off-the-shelf operators. To address this issue, an open-source architecture generator framework is introduced. Its salient features are an easy learning curve from VHDL, the ability to embed arbitrary synthesizable VHDL code, portability to mainstream FPGA targets from Xilinx and Altera, automatic management of complex pipelines with support for frequency-directed pipelining, and automatic test-bench generation. This generator is presented around the simple example of a collision detector, which it significantly improves in accuracy, DSP count, logic usage, frequency and latency with respect to an implementation using standard floating-point operators.
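The "architecture generator" concept above (a program that emits parameterized, pipelined VHDL rather than fixed IP) can be sketched in a few lines of Python. This is a toy illustration of the idea, not FloPoCo's actual API; the entity name and structure are invented for the example:

```python
def gen_pipelined_adder(width, stages, name="padd"):
    """Emit VHDL for a toy integer adder followed by `stages` pipeline
    registers. Real generators such as FloPoCo go much further: they
    pick the stage count from a target frequency, support floating
    point, and emit test benches automatically."""
    regs = "\n".join(
        f"      r{s} <= r{s-1};" if s > 0 else
        "      r0 <= std_logic_vector(unsigned(a) + unsigned(b));"
        for s in range(stages))
    signals = ", ".join(f"r{s}" for s in range(stages))
    return f"""library ieee; use ieee.std_logic_1164.all; use ieee.numeric_std.all;
entity {name} is port(
  clk  : in  std_logic;
  a, b : in  std_logic_vector({width-1} downto 0);
  s    : out std_logic_vector({width-1} downto 0));
end entity;
architecture rtl of {name} is
  signal {signals} : std_logic_vector({width-1} downto 0);
begin
  process(clk) begin
    if rising_edge(clk) then
{regs}
    end if;
  end process;
  s <= r{stages-1};
end architecture;"""
```

Calling `gen_pipelined_adder(32, 3)` yields a 32-bit adder with a 3-cycle latency; changing the two parameters regenerates the whole architecture, which is the productivity argument such generators make.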
2008
Complex algorithms and increased functionality are expanding the computation demands of embedded systems. Hardware accelerators are commonly used to meet these demands by executing critical application loop nests in custom logic, achieving performance requirements while minimizing hardware cost. Traditionally, these loop accelerators are designed in a single-function manner, wherein each loop nest is implemented as dedicated hardware.
2003
Most modern processors rely on pipelining techniques to achieve high throughput. This work reports the development of scalable floating-point (FP) arithmetic operators with a variable number of pipeline stages. A new algorithm for pipeline insertion was developed and used for FP multiplication and FP addition. The use of this algorithm enables operating frequencies of up to 175 MHz when implemented on a Xilinx Virtex II FPGA. Future work includes the automation of the process and the inclusion of the algorithm in FP square root and division units.
ACM Transactions on Architecture and Code Optimization, 2008
The wider acceptance of FPGAs as computing devices requires a higher level of programming abstraction. ROCCC is an optimizing C-to-HDL compiler. We describe the code generation approach in ROCCC. The smart buffer is a component that reuses input data between adjacent iterations; it significantly improves the performance of the circuit and simplifies loop control. The ROCCC-generated datapath can execute one loop iteration per clock cycle when there is no loop dependency, or when there are only scalar recurrence variable dependencies. ROCCC's approach to supporting while-loops operating on scalars enables the compiler to move scalar iterative computation into hardware.
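The smart-buffer idea (reusing input data between adjacent loop iterations instead of re-reading it from memory) can be modeled with a simple sliding window. This is a behavioral sketch, not ROCCC's generated hardware; the function name is invented for illustration:

```python
def smart_buffer(stream, window):
    """Model input-data reuse between adjacent iterations.

    A window of `window` values slides over the input stream. Each
    "cycle" fetches exactly ONE new element from memory and reuses the
    window-1 elements already buffered, instead of issuing `window`
    memory reads per iteration.
    """
    buf = []
    for x in stream:
        buf.append(x)              # single new memory read this cycle
        if len(buf) > window:
            buf.pop(0)             # oldest element retires from buffer
        if len(buf) == window:
            yield tuple(buf)       # full window available to datapath
```

For an N-element stream and a W-wide window this issues N reads instead of roughly N*W, which is where the circuit-level speedup comes from.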
Research efforts have shown the strength of FPGA-based acceleration in a wide range of application domains where compute kernels can execute efficiently on an FPGA device. Because of the complex process of FPGA-based accelerator design, design productivity is a major issue, restricting the effective use of these accelerators to niche disciplines involving highly skilled hardware engineers. Coarse-grained FPGA overlays, such as VectorBlox MXP and DSP-block-based overlays, have been shown to be effective when paired with general-purpose processors, offering software-like programmability, fast compilation, application portability and improved design productivity. These architectures enable general-purpose hardware accelerators, allowing hardware design at a higher level of abstraction. This report presents an analysis of compute kernels (extracted from compute-intensive applications) and their implementation on multiple hardware accelerators, such as GPUs, Altera OpenCL (AOCL) generated hardware accelerators and FPGA-based overlays. We experiment with simple programming models such as OpenCL and overlay APIs and produce hardware-accelerated designs with software-like abstractions. To begin, we analyze two existing use cases of hardware acceleration, one of which highlights the performance benefits obtained through compiler optimizations. We see that compiler optimizations can provide almost a 16× improvement in execution time on the ARM processor of the ZedBoard, owing to the SIMD NEON engine that accelerated the execution. Using the MXP overlay for the same application provides an even larger improvement in execution time than the SIMD NEON engine. The other hardware acceleration case study analyses the feasibility of dynamically loading tasks onto the FPGA fabric and the effect on execution time. We use AOCL to create accelerators for multiple tasks and then, using the software API, perform the dynamic reconfiguration.
We show that the use of overlays is preferable in such a scenario due to their ease of use, simple programming model and dynamic task loading without actual reconfiguration of the FPGA fabric. When the same task was executed on an overlay, it ran much faster because there is no need to reconfigure the FPGA fabric with a new bitstream. We present experiments comparing the performance of naive implementations of a few compute kernels with their hardware-accelerated versions, built either with OpenCL or with overlay APIs. We observe up to a 10× improvement in timing performance in applications such as 12-tap FIR filtering when accelerated in hardware, and almost 100× in applications such as 2D convolution. These improvements were obtained with very basic, naive implementations of hardware accelerators generated at a high level of programming abstraction (OpenCL/overlay APIs); with optimizations, the performance can surely be improved further, which is one of the key areas of future research beyond this thesis. Finally, we make the case for hardware virtualization using the cloud and demonstrate how, by means of a simple web browser, we can program remote computing platforms connected to cloud servers. Such virtualization methods could be used in teaching labs and for hardware evaluations.
… Automation and Test …, 2004
This work investigates the use of very deep pipelines for implementing circuits in FPGAs, where each pipeline stage is limited to a single FPGA logic element (LE). The architecture and VHDL design of a parameterized integer array multiplier are presented, along with an IEEE 754 compliant 32-bit floating-point multiplier. We show how to write VHDL cells that implement such an approach, and how the array multiplier architecture was adapted. Synthesis and simulation were performed for Altera Apex20KE devices, although the VHDL code should be portable to other devices. For this family, a 16-bit integer multiplier achieves a frequency of 266 MHz, while the floating-point unit reaches 235 MHz, performing 235 MFLOPS in an FPGA. Additional cells are inserted to synchronize data, which imposes significant area penalties. This and other considerations for applying the technique in real designs are also addressed.
2006
In this paper, we present a methodology for designing a pipeline of accelerators for an application. The application is modeled using sequential C language with simple stylizations. The synthesis of the accelerator pipeline involves designing loop accelerators for individual kernels, instantiating buffers for arrays used in the application, and hooking up these building blocks to form a pipeline. A compiler-based system automatically synthesizes loop accelerators for individual kernels at varying performance levels.
Proceedings of the 50th Annual Design Automation Conference on - DAC '13, 2013
FPGA-based accelerators have repeatedly demonstrated superior speed-ups on an ever-widening spectrum of applications. However, their use remains beyond the reach of traditionally trained applications code developers because of the complexity of their programming tool-chain. Compilers for high-level languages targeting FPGAs have to bridge a huge abstraction gap between two divergent computational models: a temporal, sequentially consistent, control driven execution in the stored program model versus a spatial, parallel, data-flow driven execution in the spatial hardware model. In this paper we discuss these challenges to the compiler designer and report on our experience with the ROCCC toolset.
2003
In this paper, we describe a set of compiler analyses and an implementation that automatically map a sequential and un-annotated C program into a pipelined implementation, targeted for an FPGA with multiple external memories. For this purpose, we extend array data-flow analysis techniques from parallelizing compilers to identify pipeline stages, required inter-pipeline stage communication, and opportunities to find a minimal program execution time by trading communication overhead with the amount of computation overlap in different stages. Using the results of this analysis, we automatically generate application-specific pipelined FPGA hardware designs. We use a sample image processing kernel to illustrate these concepts. Our algorithm finds a solution in which transmitting a row of an array between pipeline stages per communication instance leads to a speedup of 1.76 over an implementation that communicates the entire array at once.
2010
Looping operations impose a significant bottleneck to achieving better computational efficiency for embedded applications. To confront this problem in embedded computation, whether in the form of programmable processors or FSMD (finite-state machine with datapath) architectures, the use of customized loop controllers has been suggested. In this paper, a thorough examination of zero-cycle-overhead loop controllers applicable to perfect loop nests operating on multi-dimensional data is presented. The design of such loop controllers is formalized by the introduction of a hardware algorithm that fully automates this task for the spectrum of behavioral as well as generated register-transfer-level architectures. The presented algorithm should prove beneficial in the field of high-level synthesis of architectures for data-intensive processing. It is also shown that the proposed loop controllers can be efficiently utilized to support generalized loop structures such as imperfect loop nests. The performance characteristics (cycle time, chip area) of the proposed architectures have been evaluated for FPGA target implementations. It is shown that maximum clock frequencies above 230 MHz, with low logic footprints of about 1.4% of the overall logic resources, can be achieved for supporting up to 8 nested loops with 16-bit indices on a modestly sized Xilinx Virtex-5 device.
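The behavior of a zero-cycle-overhead loop controller for a perfect nest can be sketched as a counter chain: all indices update on the same clock edge, with the innermost counter's carry-out rippling outward, so no cycles are ever spent on loop bookkeeping. A minimal behavioral model (illustrative only, not the paper's hardware algorithm):

```python
def loop_controller(bounds):
    """Model a zero-cycle-overhead controller for a perfect loop nest.

    Yields one full index vector per simulated clock cycle. In hardware
    this is a chain of counters: the innermost increments every cycle
    and each wrap-around carries into the next outer counter, all
    within the same clock edge.
    """
    idx = [0] * len(bounds)
    while True:
        yield tuple(idx)                    # one index vector per cycle
        d = len(bounds) - 1
        while d >= 0 and idx[d] == bounds[d] - 1:
            idx[d] = 0                      # carry-out: wrap this counter
            d -= 1
        if d < 0:
            return                          # whole nest finished
        idx[d] += 1
```

For bounds (N1, ..., Nk) the controller emits exactly N1*...*Nk index vectors in N1*...*Nk cycles, which is what "zero-cycle overhead" means.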
2018
Multi-processor system-on-chip FPGAs can utilize programmable logic for compute-intensive functions using so-called accelerators, implementing a heterogeneous computing architecture. Embedded systems can thereby benefit from the computing power of programmable logic while still maintaining the software flexibility of a CPU. As a design option alongside the well-established RTL design process, accelerators can be designed using high-level synthesis. The abstraction level of the functionality description can be raised to the algorithm level by a tool generating HDL code from a high-level language like C/C++. The Xilinx tool Vivado HLS allows the user to guide the generated RTL implementation by inserting compiler pragmas into the C/C++ source code. This paper analyzes the possibilities for improving the performance of an FPGA accelerator generated with Vivado HLS and integrated into a Vivado block design. It investigates how much the pragmas affect performance and resource cost and shows pro...
Journal of Signal Processing Systems, 2017
Current tools for High-Level Synthesis (HLS) excel at exploiting Instruction-Level Parallelism (ILP). The support for Data-Level Parallelism (DLP), one of the key advantages of Field Programmable Gate Arrays (FPGAs), is in contrast very limited. This work examines the exploitation of DLP on FPGAs using code generation for C-based HLS of image filters and streaming pipelines. In addition to well-known loop tiling techniques, we propose loop coarsening, which delivers superior performance and scalability. Loop tiling corresponds to splitting an image into separate regions, which are then processed in parallel by replicated accelerators. For data streaming, this also requires the generation of glue logic for the distribution of image data. Conversely, loop coarsening allows processing multiple pixels in parallel, whereby only the kernel operator is replicated within a single accelerator. We present concrete implementations of tiling and coarsening for Vivado HLS and Altera OpenCL. Furthermore, we present a comparison of our implementations to the keyword-driven parallelization support provided by the Altera Offline Compiler. We augment the FPGA back end of the heterogeneous Domain-Specific Language (DSL) framework Hipacc to generate loop coarsening implementations for Vivado HLS and Altera OpenCL. Moreover, we compare the resulting FPGA accelerators to highly optimized software implementations for Graphics Processing Units (GPUs), all generated from exactly the same code base.
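The distinction the abstract draws between tiling and coarsening can be made concrete with a 1-D filter sketch (illustrative Python; the function name and 3-tap blur are invented for the example). Under coarsening, only the kernel operator is replicated inside one accelerator, and the surrounding loop runs `factor` times fewer iterations:

```python
def blur3_coarsened(row, factor=4):
    """1-D 3-tap blur with loop coarsening: `factor` pixels per iteration.

    The inner `for p` loop corresponds to `factor` replicated kernel
    operators inside a single accelerator; the outer loop is the
    coarsened iteration space. Borders clamp to the edge pixel.
    """
    n = len(row)
    out = [0] * n
    for base in range(0, n, factor):                    # coarsened loop
        for p in range(base, min(base + factor, n)):    # replicated kernel
            lo, hi = max(p - 1, 0), min(p + 1, n - 1)
            out[p] = (row[lo] + row[p] + row[hi]) // 3
    return out
```

Tiling, by contrast, would split `row` into separate regions fed to replicated whole accelerators, which is why it needs extra glue logic to distribute the stream.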
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2017
Many embedded applications process large amounts of data using regular computational kernels, amenable to acceleration by specialized hardware coprocessors. To reduce the significant design effort, the dedicated hardware may be generated automatically, usually starting from the application's source or binary code. This paper presents a modulo-scheduled loop accelerator capable of executing multiple loops, and a supporting toolchain. A generation/scheduling procedure, which relies entirely on MicroBlaze instruction traces, produces accelerator instances customized in terms of functional units and interconnections. The accelerators support integer and single-precision floating-point arithmetic, and exploit instruction-level parallelism, loop pipelining, and memory access parallelism via two read/write ports. A complete implementation of the proposed architecture is evaluated on a Virtex-7 device. Augmenting a MicroBlaze processor with a tailored accelerator achieves a geometric mean speedup over software-only execution of 6.61× for 13 floating-point kernels from the Livermore Loops set, and of 4.08× for 11 integer kernels from Texas Instruments' IMGLIB. The proposed customized accelerators are compared with ALU-based ones. The average specialized accelerator requires only 0.47× the number of field-programmable gate array slices of an accelerator with four ALUs. A geometric mean speedup of 1.78× over a four-issue very long instruction word processor (without floating-point support) was obtained for the integer kernels.
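Modulo scheduling, the technique named in the abstract above, overlaps loop iterations at a fixed initiation interval (II). A standard lower bound on the II combines a resource bound and a recurrence bound; the sketch below computes it for a single dependence cycle (hedged: real schedulers consider all recurrence cycles and then search upward from this bound):

```python
import math

def min_initiation_interval(op_counts, fu_counts, rec_latency, rec_distance):
    """Lower bound on the II for a modulo-scheduled loop.

    ResMII: each functional-unit class starts at most one op per cycle,
    so a class with N ops and F units needs ceil(N / F) cycles per
    iteration. RecMII: a dependence cycle of total latency L spanning
    `rec_distance` iterations forces II >= ceil(L / rec_distance).
    """
    res_mii = max(math.ceil(op_counts[k] / fu_counts[k]) for k in op_counts)
    rec_mii = math.ceil(rec_latency / rec_distance)
    return max(res_mii, rec_mii)
```

For example, 3 additions on 2 adders and 2 multiplies on 1 multiplier give ResMII = 2; a 4-cycle recurrence at distance 2 gives RecMII = 2, so the loop can start a new iteration every 2 cycles at best.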
2016
Author(s): Cheng, Shaoyi | Advisor(s): Wawrzynek, John | Abstract: As the scaling down of transistor size no longer provides a boost to processor clock frequency, there has been a move towards parallel computers and, more recently, heterogeneous computing platforms. To target the FPGA component in these systems, high-level synthesis (HLS) tools were developed to facilitate hardware generation from higher-level algorithmic descriptions. Despite being an effective method for rapid hardware generation, in the context of offloading compute-intensive software kernels to FPGA accelerators, current HLS tools do not always take full advantage of the hardware platforms. Processor-centric software implementations often have to be rewritten if good quality of results is desired. In this work, we present a framework to refactor and restructure compute-intensive software kernels, making them better suited for FPGA platforms. An algorithm was proposed to decouple memory operations and computation, ge...
This article reports work done on the optimization of scalable floating-point addition and multiplication operators. Both operators had been implemented previously, but some of their characteristics offered room for improvement. Their structure and main components are discussed and characterized in order to quantify the improvements achieved.
International Conference on Field Programmable Logic and Applications, 2005., 2005
Trident is a compiler for floating point algorithms written in C, producing circuits in reconfigurable logic that exploit the parallelism available in the input description. Trident automatically extracts parallelism and pipelines loop bodies using conventional compiler optimizations and scheduling techniques. Trident also provides an open framework for experimentation, analysis, and optimization of floating point algorithms on FPGAs and the flexibility to easily integrate custom floating point libraries.
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2000
To construct complete systems on silicon, application-specific DSP accelerators are needed to speed up the execution of high-throughput DSP algorithms. In this paper, a methodology is presented to synthesize high-throughput DSP functions into accelerator processors containing a datapath of highly pipelined, bit-parallel hardware units. Emphasis is put on the definition of a controller architecture that allows efficient run-time schedules of these DSP algorithms on such highly pipelined datapaths. The methodology is illustrated by means of an FFT butterfly accelerator block.
Proceedings. The 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (Cat. No. 97TB100186)
2014
In order for FPGAs to be successful outside traditional markets, tools are needed that enable software programmers to achieve high levels of system performance while abstracting away FPGA-specific details. DSP Builder Advanced (DSPBA) is one such tool. DSPBA provides a model-based design environment using MATLAB's Simulink front end that decouples the fully algorithmic design description from the details of FPGA system generation. DSPBA offers several levels of debugging, from Simulink scopes to bit-accurate simulation and silver reference models. It also offers the most comprehensive set of fixed-point, floating-point and signal-processing IPs available today. The combination of 7 floating-point precisions, fused-datapath support, custom operator support and automated folding allows exploring the best tradeoffs between accuracy, size and throughput. The DSPBA back end protects users from the details of device-dependent operator mapping, offering both efficiency and prompt support for new devices and features such as the Arria 10 floating-point cores. The collection of features available in DSPBA allows both inexperienced and expert users to efficiently migrate performance-critical systems to the FPGA architecture.