2008, ACM Transactions on Architecture and Code Optimization
The wider acceptance of FPGAs as computing devices requires a higher level of programming abstraction. ROCCC is an optimizing C-to-HDL compiler, and we describe its code generation approach. The smart buffer is a component that reuses input data between adjacent loop iterations; it significantly improves circuit performance and simplifies loop control. The ROCCC-generated datapath can execute one loop iteration per clock cycle when the loop carries no dependency, or when its only dependency is through a scalar recurrence variable. ROCCC's support for while-loops operating on scalars enables the compiler to move scalar iterative computation into hardware.
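The kind of loop the smart buffer targets is a window (stencil) computation in which adjacent iterations share most of their input elements. The plain-C sketch below is only an illustration of such a kernel under assumed names and sizes; it is not ROCCC input syntax or generated code.

    /* Illustrative 1-D window (FIR-like) kernel: adjacent iterations of the
     * outer loop reuse TAPS-1 of the TAPS input samples they read, which is
     * the reuse a smart buffer exploits instead of re-fetching from memory. */
    #define N    256
    #define TAPS 5

    void fir(const int x[N], const int c[TAPS], int y[N - TAPS + 1])
    {
        for (int i = 0; i < N - TAPS + 1; i++) {  /* one iteration per clock in hardware */
            int acc = 0;
            for (int j = 0; j < TAPS; j++)
                acc += c[j] * x[i + j];           /* window x[i..i+TAPS-1] */
            y[i] = acc;
        }
    }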
Design, Automation and Test in Europe, 2005
FPGAs, as computing devices, offer significant speedup over microprocessors. Furthermore, their configurability offers an advantage over traditional ASICs. However, they do not yet enjoy high-level language programmability, as microprocessors do. This has become the main obstacle for their wider acceptance by application designers.
Lecture Notes in Computer Science, 2006
Loop unrolling is the main compiler technique that allows reconfigurable architectures to achieve large degrees of parallelism. However, loop unrolling increases area and can potentially have a negative impact on clock cycle time. In most embedded applications the critical parameter is throughput, on which loop unrolling can therefore have contradictory effects. As a consequence there exists, in general, a degree of unrolling that maximizes the throughput per unit area. This paper studies the effect of loop unrolling on area, clock speed, and throughput within ROCCC, a C-to-VHDL compilation framework. Our results indicate that, due to the unique design of the ROCCC compilation framework, FPGA area either shrinks or grows at a very low rate for the first few unrollings of a loop; this reduced area in turn lets the clock cycle time decrease, yielding a large gain in throughput. Our results also show that different programs have different optimal unrolling factors.
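As a rough illustration of the trade-off studied here, unrolling replicates the loop body so that several iterations are computed per trip of the loop, at the cost of extra datapath area. The plain-C sketch below shows a factor-2 unroll of a simple accumulation loop; the names and the assumption of an even trip count are for illustration only and do not reproduce ROCCC's transformation.

    /* Original loop: one multiply-accumulate per iteration. */
    int dot(const int a[], const int b[], int n)
    {
        int acc = 0;
        for (int i = 0; i < n; i++)
            acc += a[i] * b[i];
        return acc;
    }

    /* Unrolled by a factor of 2: two multiply-accumulates per iteration,
     * roughly doubling the datapath; n is assumed even for brevity. */
    int dot_unroll2(const int a[], const int b[], int n)
    {
        int acc = 0;
        for (int i = 0; i < n; i += 2)
            acc += a[i] * b[i] + a[i + 1] * b[i + 1];
        return acc;
    }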
Proceedings of the 50th Annual Design Automation Conference (DAC '13), 2013
FPGA-based accelerators have repeatedly demonstrated superior speed-ups on an ever-widening spectrum of applications. However, their use remains beyond the reach of traditionally trained application code developers because of the complexity of their programming tool-chain. Compilers for high-level languages targeting FPGAs must bridge a huge abstraction gap between two divergent computational models: a temporal, sequentially consistent, control-driven execution in the stored-program model versus a spatial, parallel, data-flow-driven execution in the spatial hardware model. In this paper we discuss the challenges these divergent models pose to the compiler designer and report on our experience with the ROCCC toolset.
FPGA computing has long been regarded as a means to dramatically improve computational performance. The real obstacle to its widespread adoption is primarily the lack of compilation tools that accept common specification languages (such as ANSI C); instead, FPGAs must be programmed either in very low-level hardware description languages or in non-standard dialects of C that are far removed from the standard language. To overcome these drawbacks, Ylichron developed a compilation chain, the HARWEST Compiling Environment (HCE), which allows algorithms to be specified as standard C programs and mapped onto FPGAs: as a consequence, no special skills are required to access the power of FPGA computing and no effort has to be spent learning proprietary languages. The HCE design flow and some performance figures are presented in the paper.
2010 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, 2010
While FPGA-based hardware accelerators have repeatedly been demonstrated as a viable option, their programmability remains a major barrier to their wider acceptance by application code developers. These platforms are typically programmed in a low-level hardware description language, a skill not common among application developers and a process that is often tedious and error-prone. Programming FPGAs from high-level languages would provide easier integration with software systems and open up hardware acceleration to a wider spectrum of application developers. In this paper, we present a major revision to the Riverside Optimizing Compiler for Configurable Circuits (ROCCC), designed to create hardware accelerators from C programs. Novel additions to ROCCC include (1) intuitive, modular, bottom-up design of circuits from C, and (2) separation of code generation from specific FPGA platforms. These additions introduce no new syntax to the C code and retain the high-level optimizations of the ROCCC system that generate efficient code. The modular code we support functions identically as software or as hardware. Additionally, we enable user control of hardware optimizations such as systolic array generation and temporal common subexpression elimination. We evaluate the quality of the ROCCC 2.0 tool by comparing it to hand-written VHDL code, showing comparable clock frequencies and an 18% higher throughput. The productivity advantages of ROCCC 2.0 are evaluated using the metrics of lines of code and programming time, showing an average 15x improvement over hand-written VHDL.
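As an illustration of bottom-up, modular construction from unmodified C, the sketch below defines a small function that could serve as a reusable block and composes it inside a larger loop; it compiles and runs as ordinary software. The names are hypothetical and the example does not depict ROCCC 2.0's actual module interface.

    /* Hypothetical reusable block: a 3-tap weighted sum. Written as plain C,
     * it behaves the same whether executed as software or instantiated as a
     * hardware module by a modular C-to-hardware flow. */
    static int weighted3(int a, int b, int c)
    {
        return 2 * a + 3 * b + 2 * c;
    }

    /* Larger computation composed from the block above. */
    void smooth(const int in[], int out[], int n)
    {
        for (int i = 1; i < n - 1; i++)
            out[i] = weighted3(in[i - 1], in[i], in[i + 1]) / 7;
    }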
Proceedings of the 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (Cat. No. 97TB100186)
Journal of Signal Processing Systems, 2017
Current tools for High-Level Synthesis (HLS) excel at exploiting Instruction-Level Parallelism (ILP). In contrast, support for Data-Level Parallelism (DLP), one of the key advantages of Field Programmable Gate Arrays (FPGAs), is very limited. This work examines the exploitation of DLP on FPGAs using code generation for C-based HLS of image filters and streaming pipelines. In addition to well-known loop tiling techniques, we propose loop coarsening, which delivers superior performance and scalability. Loop tiling corresponds to splitting an image into separate regions, which are then processed in parallel by replicated accelerators; for data streaming, this also requires generating glue logic to distribute the image data. Conversely, loop coarsening allows processing multiple pixels in parallel, whereby only the kernel operator is replicated within a single accelerator. We present concrete implementations of tiling and coarsening for Vivado HLS and Altera OpenCL, and compare them to the keyword-driven parallelization support provided by the Altera Offline Compiler. We augment the FPGA back end of the heterogeneous Domain-Specific Language (DSL) framework Hipacc to generate loop coarsening implementations for Vivado HLS and Altera OpenCL. Moreover, we compare the resulting FPGA accelerators to highly optimized software implementations for Graphics Processing Units (GPUs), all generated from exactly the same code base.
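To make the tiling/coarsening distinction concrete: tiling splits the image into regions handled by replicated accelerators, whereas coarsening keeps a single accelerator but produces several output pixels per loop iteration by replicating only the kernel operator. The plain-C sketch below illustrates coarsening by a factor of 2 on a 1-D three-point blur; the names, the factor, and the even image width are assumptions, and the code is not generated Hipacc, Vivado HLS, or Altera OpenCL output.

    /* Coarsened 1-D 3-point blur: each iteration produces two adjacent output
     * pixels, so only the kernel operator is duplicated inside one accelerator
     * (tiling would instead replicate the whole accelerator per image region).
     * Width w is assumed even; boundary pixels are skipped for brevity. */
    void blur_coarsen2(const unsigned char in[], unsigned char out[], int w)
    {
        for (int x = 2; x < w - 2; x += 2) {
            out[x]     = (in[x - 1] + in[x]     + in[x + 1]) / 3;
            out[x + 1] = (in[x]     + in[x + 1] + in[x + 2]) / 3;
        }
    }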
2008
This paper describes our approaches to raise the level of abstraction at which hardware suitable for accelerating computationally-intensive applications can be specified. Field-Programmable Gate Arrays (FPGAs) are becoming adopted as a computational platform by the high-performance computing community, but there are challenges to extract maximum performance from these devices. Unlike other approaches, our focus is on data memory organisation and input-output bandwidth considerations, which are the typical stumbling block of existing hardware compilation schemes. We describe our approaches, which are based on formal optimization techniques, and present some results showing the advantage of exposing the interaction between data memory system design and parallelism extraction to the compiler.
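To make the interaction between memory organisation and parallelism concrete, the sketch below shows the kind of transformation such approaches reason about: splitting an array across independent memory banks so that the accesses issued in the same cycle hit different banks. The two-bank split and all names are illustrative assumptions, not the paper's formal method.

    /* Two-bank partitioning of a reduction: even-indexed elements live in
     * bank0 and odd-indexed elements in bank1, so the two reads in the loop
     * body target different memories and can be scheduled in the same cycle. */
    #define N 1024

    int sum_banked(const int bank0[N / 2], const int bank1[N / 2])
    {
        int acc0 = 0, acc1 = 0;
        for (int i = 0; i < N / 2; i++) {
            acc0 += bank0[i];   /* originally a[2*i]   */
            acc1 += bank1[i];   /* originally a[2*i+1] */
        }
        return acc0 + acc1;
    }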