2010
While ASIC design and manufacturing costs are soaring with each new technology node, the computing power and logic capacity of modern FPGAs steadily advance. Therefore, high-performance computing with FPGA-based systems becomes increasingly attractive and viable. Unfortunately, truly unleashing the computing potential of FPGAs often requires cumbersome HDL programming and laborious manual optimization. To circumvent such challenges, we propose a Many-core Approach to Reconfigurable Computing (MARC) that (i) allows programmers to easily express parallelism through a high-level programming language, (ii) supports coarse-grain multithreading and dataflow-style fine-grain threading while permitting bit-level resource control, and (iii) greatly reduces the effort required to repurpose the hardware system for different algorithms or different applications. Leveraging a many-core architectural template, sophisticated logic synthesis techniques, and state-of-the-art compiler optimization te...
ABSTRACT Reconfigurable computing systems provide the capability for spatial/parallel computation and can therefore achieve significant speed-ups in program execution. Compilers that exploit the full potential of the available parallelism, and that account for the wire- and gate-level flexibility of commercial FPGAs, the memory hierarchy offered, and the reconfiguration facility, are still required and remain an important focus of research.
International Journal of Reconfigurable Computing, 2012
We present a highly productive approach to hardware design based on a many-core microarchitectural template used to implement compute-bound applications expressed in a high-level data-parallel language such as OpenCL. The template is customized on a per-application basis via a range of high-level parameters such as the interconnect topology or processing element architecture.
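As an illustration of the kind of data-parallel input such a template-based flow consumes, the following OpenCL C kernel is a minimal sketch: SAXPY is my own stand-in example rather than an application from the paper, and the template would then be customized (number of processing elements, interconnect topology, and so on) to run many such work-items in parallel.

/* Illustrative data-parallel kernel in OpenCL C (a C dialect); SAXPY is a
 * stand-in example, not an application named in the paper.  Each work-item
 * computes one element, and a many-core FPGA template maps work-items onto
 * its processing elements. */
__kernel void saxpy(const float a,
                    __global const float *x,
                    __global float *y)
{
    size_t i = get_global_id(0);   /* global index of this work-item */
    y[i] = a * x[i] + y[i];
}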
2013
— Reconfigurable systems can offer the high spatial parallelism and fine-grained, bit-level resource control traditionally associated with hardware implementations, along with the flexibility and adaptability characteristic of software. While reconfigurable systems create new opportunities for engineering and delivering high-performance programmable systems, the traditional approaches to programming and managing computations used for hardware systems (e.g. Verilog, VHDL) and software systems (e.g. C, Fortran, Java) are inappropriate and inadequate for exploiting reconfigurable platforms. To address this need, we develop a stream-oriented compute model, system architecture, and execution patterns which can capture and exploit the parallelism of spatial computations while simultaneously abstracting software applications from hardware details (e.g., timing, device capacity, microarchitectural implementation details) and consequently allowing applications to scale to exploit newer, larg...
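As a loose sketch of what a stream-oriented operator looks like (not the authors' actual compute model or API), the C fragment below consumes tokens from an input stream and produces tokens on an output stream; the "streams" are modeled as arrays so the example is self-contained, whereas a spatial implementation would connect such operators with FIFO channels and hide timing and capacity behind them.

#include <stdio.h>

/* A stream operator in the abstract: one token in, one token out, with no
 * knowledge of device timing or capacity.  Arrays stand in for the FIFO
 * channels a spatial implementation would use between operators. */
static void scale_and_offset(const int *in, int *out, size_t n,
                             int scale, int offset)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] * scale + offset;
}

int main(void)
{
    int in[4] = {1, 2, 3, 4}, out[4];
    scale_and_offset(in, out, 4, 3, 1);
    for (int i = 0; i < 4; ++i)
        printf("%d ", out[i]);   /* prints: 4 7 10 13 */
    printf("\n");
    return 0;
}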
To accelerate the execution of an application, repetitive logic and arithmetic computation tasks may be mapped to reconfigurable hardware, since dedicated hardware can deliver much higher speeds than those of a general-purpose processor. However, this is only feasible if the run-time reconfiguration of new tasks is fast enough, so as not to delay application execution. Currently, this is opposed by architectural constraints intrinsic to current Field-Programmable Logic Array (FPGA) architectures. Despite all new features exhibited by current FPGAs, architecturally they are still largely based on general-purpose architectures that are inadequate for the demands of reconfigurable computing. Large configuration file sizes and poor hardware and software support for partial and dynamic reconfiguration limits the acceleration that reconfigurable computing may bring to applications. The objective of this work is the identification of the architectural limitations exhibited by current FPGAs...
Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, 2021
Modern field-programmable gate arrays (FPGAs) have recently powered high-profile efficiency gains in systems from datacenters to embedded devices by offering ensembles of heterogeneous, reconfigurable hardware units. Programming stacks for FPGAs, however, are stuck in the past: they are based on traditional hardware languages, which were appropriate when FPGAs were simple, homogeneous fabrics of basic programmable primitives. We describe Reticle, a new low-level abstraction for FPGA programming that, unlike existing languages, explicitly represents the special-purpose units available on a particular FPGA device. Reticle has two levels: a portable intermediate language and a target-specific assembly language. We show how to use a standard instruction selection approach to lower intermediate programs to assembly programs, which can be both faster and more effective than the complex metaheuristics that existing FPGA toolchains use. We use Reticle to implement linear algebra operators and coroutines and find that Reticle compilation runs up to 100 times faster than current approaches while producing comparable or better run-time and utilization.
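The instruction-selection idea can be pictured in ordinary C; this is a toy illustration with hypothetical IR node and unit names, not Reticle's actual IR or assembly syntax. The point is simply that intermediate operations are matched onto the special-purpose units a device offers, e.g. a fused multiply-add onto a DSP block rather than generic LUT logic.

#include <stdio.h>

/* Toy IR: one operation per node (hypothetical, not Reticle's IR). */
typedef enum { OP_ADD, OP_MUL, OP_MULADD } ir_op;
typedef struct { ir_op op; } ir_node;

/* Target "assembly": which kind of device unit implements the node
 * (hypothetical unit names). */
typedef enum { UNIT_LUT_ADD, UNIT_DSP_MUL, UNIT_DSP_MULADD } target_unit;

/* Simple instruction selection: map each IR node (or fused pattern)
 * to a unit that can implement it. */
static target_unit select_unit(ir_node n)
{
    switch (n.op) {
    case OP_MULADD: return UNIT_DSP_MULADD;  /* a*b+c fits a DSP block        */
    case OP_MUL:    return UNIT_DSP_MUL;
    case OP_ADD:    return UNIT_LUT_ADD;     /* cheap enough in LUT fabric    */
    }
    return UNIT_LUT_ADD;
}

int main(void)
{
    ir_node prog[] = { {OP_MUL}, {OP_ADD}, {OP_MULADD} };
    const char *names[] = { "lut.add", "dsp.mul", "dsp.muladd" };
    for (int i = 0; i < 3; ++i)
        printf("node %d -> %s\n", i, names[select_unit(prog[i])]);
    return 0;
}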
Second NASA/ESA Conference on Adaptive Hardware and Systems (AHS 2007), 2007
We describe the FPGA HPC Alliance's Parallel Toolkit (PTK), an initial step towards the standardization of high-level configuration and APIs for high-performance reconfigurable computing (HPRC). We discuss the motivation and challenges of reaping the performance benefits of FPGAs for memory-bound HPC codes and describe the approach we have taken on the FHPCA supercomputer Maxwell.
2008 Canadian Conference on Electrical and Computer Engineering, 2008
In this paper, a novel approach for compiling parallel applications to a target Coarse-Grained Reconfigurable Architecture (CGRA) is presented. We give a formal definition of the compilation problem for the CGRA. Applications are written in HARPO/L, a parallel object-oriented language suitable for hardware. HARPO/L is first compiled to a Data Flow Graph (DFG) representation. The remaining compilation steps are a combination of three tasks: scheduling, placement, and routing. For compiling cyclic portions of the application, we adapt a modulo scheduling algorithm: modulo scheduling with integrated register spilling. For scheduling, the nodes of the DFG are ordered using the hypernode reduction modulo scheduling (HRMS) method. Placement and routing are done using the neighborhood relations of the PEs.
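For context, modulo scheduling repeats the loop body every II (initiation interval) cycles, and a lower bound on II is normally derived from resource pressure and recurrences before scheduling. The sketch below computes that standard textbook bound; it is background material, not the paper's HRMS ordering or register-spilling algorithm.

#include <stdio.h>

/* Lower bound on the initiation interval (II) for modulo scheduling:
 *   ResMII = ceil(#operations / #processing elements)
 *   RecMII = max over recurrences of ceil(latency / distance)
 *   MII    = max(ResMII, RecMII)
 * Standard textbook bound, not the paper's specific algorithm. */
static int ceil_div(int a, int b) { return (a + b - 1) / b; }

typedef struct { int latency; int distance; } recurrence;

static int min_ii(int num_ops, int num_pes,
                  const recurrence *recs, int num_recs)
{
    int res_mii = ceil_div(num_ops, num_pes);
    int rec_mii = 1;
    for (int i = 0; i < num_recs; ++i) {
        int r = ceil_div(recs[i].latency, recs[i].distance);
        if (r > rec_mii) rec_mii = r;
    }
    return res_mii > rec_mii ? res_mii : rec_mii;
}

int main(void)
{
    recurrence recs[] = { {4, 1}, {6, 2} };       /* e.g. a[i] depends on a[i-1], a[i-2] */
    printf("MII = %d\n", min_ii(10, 4, recs, 2)); /* ResMII = 3, RecMII = 4 -> MII = 4   */
    return 0;
}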
Computer, 2000
Initial performance results with FPGAs were impressive. However, commercial FPGAs have inherent shortcomings that have heretofore made reconfigurable computing impractical for mainstream computing; one such shortcoming is logic granularity: FPGAs are designed for logic replacement, so the granularity of their functional units is optimized to replace random logic, not to perform multimedia computations. Reconfigurable computing will change the way computing systems are designed, built, and used. PipeRench, a new reconfigurable fabric, combines the flexibility of general-purpose processors with the efficiency of customized hardware to achieve extreme performance speedup.
International Journal of Reconfigurable Computing, 2012
Partial reconfiguration (PR) is an FPGA feature that allows the modification of certain parts of an FPGA while the rest of the system continues to operate without disruption. This distinctive characteristic of FPGAs has many potential benefits but also challenges. The lack of good CAD tools and the deep hardware knowledge requirement result in a hard-to-use feature. In this paper, the new partition-based Xilinx PR flow is used to incorporate PR within our MPI-based message-passing framework to allow hardware designers to create template bitstreams, which are predesigned, prerouted, generic bitstreams that can be reused for multiple applications. As an example of the generality of this approach, four different applications that use the same template bitstream are run consecutively, with a PR operation performed at the beginning of each application to instantiate the desired application engine. We demonstrate a simplified, reusable, high-level, and portable PR interface for X86-FPGA hybrid machines. PR issues such as local resets of reconfigurable modules and context saving and restoring are addressed in this paper followed by some examples and preliminary PR overhead measurements.
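The host-side flow can be pictured roughly as below. The function names are hypothetical stubs that only print the step, not the paper's MPI-based framework or the Xilinx PR API; they convey the sequence of loading the static template bitstream once and then partially reconfiguring the region with the desired application engine, with a local reset, before each run.

#include <stdio.h>

/* Hypothetical host-side steps for template-bitstream PR.  These stubs only
 * name and print the steps; they are not a real PR or message-passing API. */
static int load_template_bitstream(const char *path)
{ printf("load static template bitstream: %s\n", path); return 0; }

static int partial_reconfigure(int region, const char *partial)
{ printf("partially reconfigure region %d with %s\n", region, partial); return 0; }

static int local_reset(int region)
{ printf("local reset of reconfigurable module in region %d\n", region); return 0; }

/* Several applications reuse the same prerouted template: one PR operation at
 * the start of each application swaps in the desired engine. */
int main(void)
{
    const char *engines[] = { "engine_a.bit", "engine_b.bit",
                              "engine_c.bit", "engine_d.bit" };   /* illustrative names */
    load_template_bitstream("template.bit");
    for (int i = 0; i < 4; ++i) {
        partial_reconfigure(0, engines[i]);
        local_reset(0);
        /* ... exchange messages with the engine and run the application ... */
    }
    return 0;
}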
2008
Dynamic hardware generation reduces the number of FPGA resources needed and speeds up the application by optimizing the configuration for the exact problem at hand at run-time. If the problem changes, the system needs to be reconfigured. When this occurs too often, the total reconfiguration overhead is too high and the benefit of using dynamic hardware generation vanishes. Hence, it is important to minimize the number of reconfigurations. We propose a novel technique to reduce the number of reconfigurations by using loop transformations. Our approach is similar to temporal data locality optimizations. By applying our technique, we can drastically reduce the number of reconfigurations, as indicated by the matrix multiplication example. After applying the loop transformations, the number of reconfigurations decreases by an order of magnitude. Combined with a dynamic hardware generation technique with a very low overhead, our technique obtains a significant speedup over generic circuits.
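To make the idea concrete, the toy model below (my own illustration, not the paper's experiment) counts how often a circuit specialized for the current coefficient b[k][j] would have to be regenerated during a matrix multiplication. Interchanging the loops so that the coefficient stays invariant in the innermost loop reduces the count from N^3 to N^2, which is exactly the temporal-locality-style effect the abstract describes.

#include <stdio.h>

#define N 16

/* Toy model: assume the dynamically generated circuit is specialized for the
 * current coefficient b[k][j], so any change of (k, j) between consecutive
 * multiply-accumulate operations counts as one reconfiguration. */

static long multiply_ijk(double c[N][N], double a[N][N], double b[N][N])
{
    long reconfigs = 0;
    int last_k = -1, last_j = -1;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < N; ++k) {
                if (k != last_k || j != last_j) { ++reconfigs; last_k = k; last_j = j; }
                c[i][j] += a[i][k] * b[k][j];   /* b[k][j] changes every iteration */
            }
    return reconfigs;                           /* N*N*N reconfigurations */
}

static long multiply_kji(double c[N][N], double a[N][N], double b[N][N])
{
    long reconfigs = 0;
    int last_k = -1, last_j = -1;
    for (int k = 0; k < N; ++k)
        for (int j = 0; j < N; ++j) {
            if (k != last_k || j != last_j) { ++reconfigs; last_k = k; last_j = j; }
            for (int i = 0; i < N; ++i)
                c[i][j] += a[i][k] * b[k][j];   /* b[k][j] fixed for the whole i loop */
        }
    return reconfigs;                           /* N*N reconfigurations */
}

int main(void)
{
    static double a[N][N], b[N][N], c1[N][N], c2[N][N];
    printf("reconfigurations, original loop order : %ld\n", multiply_ijk(c1, a, b));
    printf("reconfigurations, interchanged order  : %ld\n", multiply_kji(c2, a, b));
    return 0;
}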
2007
Over the years, reconfigurable computing devices such as FPGAs have evolved from gate-level glue logic to complex reprogrammable processing architectures. However, the tools used for mapping computations to such architectures still require knowledge of the architectural details of the target device to extract efficiency.
Proceedings of the IEEE, 2015
This paper provides a focused survey of five tools to improve productivity in developing code for FPGAs.
Advances in Cyber-Physical Systems, 2016
FPGA-based accelerators and the reconfigurable computer systems built on them require application-specific processor soft-cores to be designed, and they are effective only for those classes of problems for which such soft-cores have previously been developed. In self-configurable FPGA-based computer systems, the problem of designing application-specific processor soft-cores is solved with C2HDL tools, which allow the cores to be generated automatically. In this paper, we study how to increase the efficiency of self-configurable computer systems by using partially reconfigurable FPGAs and the Chameleon© C2HDL design tool. One of the features of the Chameleon© C2HDL design tool is its ability to generate a number of application-specific processor soft-cores that execute the same algorithm but differ in the amount of FPGA resources required for their implementation. When a self-configurable computer system is based on a partially reconfigurable FPGA, this feature allows it to adopt, at every moment of its operation, the configuration that makes optimal use of its reconfigurable logic at a given level of hardware multitasking.
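A runtime policy of that kind can be sketched as follows; the structures, names, and numbers are hypothetical and are not the Chameleon tool's interface. The idea is simply to pick, among the generated soft-core variants for an algorithm, the fastest one that still fits the currently free partially reconfigurable area.

#include <stdio.h>

/* Hypothetical description of generated soft-core variants for one algorithm:
 * same function, different FPGA resource footprints. */
typedef struct {
    const char *name;
    int slices_required;   /* illustrative resource metric      */
    int relative_speed;    /* larger variant -> higher speed    */
} core_variant;

/* Pick the fastest variant that fits the free reconfigurable area. */
static const core_variant *select_variant(const core_variant *v, int n, int free_slices)
{
    const core_variant *best = NULL;
    for (int i = 0; i < n; ++i)
        if (v[i].slices_required <= free_slices &&
            (best == NULL || v[i].relative_speed > best->relative_speed))
            best = &v[i];
    return best;
}

int main(void)
{
    core_variant fft_cores[] = {
        { "fft_small",   800, 1 },
        { "fft_medium", 1600, 2 },
        { "fft_large",  3200, 4 },
    };
    const core_variant *pick = select_variant(fft_cores, 3, 2000);
    printf("selected: %s\n", pick ? pick->name : "none fits");   /* fft_medium */
    return 0;
}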
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2008
This paper introduces hthreads, a unifying programming model for specifying application threads running within a hybrid central processing unit (CPU)/field-programmable gate array (FPGA) system. Presently accepted hybrid CPU/FPGA computational models, and access to these computational models via high-level languages, focus on programming language extensions to increase accessibility and portability. However, this paper argues that new high-level programming models built on common software abstractions better address these goals. The hthreads system, in general, is unique within the reconfigurable computing community as it includes operating system and middleware layer abstractions that extend across the CPU/FPGA boundary. This enables all platform components to be abstracted into a unified multiprocessor architecture platform. Application programmers can then express their computations using threads specified from a single POSIX threads (pthreads) multithreaded application program and can then compile the threads to either run on the CPU or synthesize them to run within an FPGA. To enable this seamless framework, we have created the hardware thread interface (HWTI) component to provide an abstract, platform-independent compilation target for hardware-resident computations. The HWTI enables the use of standard thread communication and synchronization operations across the software/hardware boundary. Key operating system primitives have been mapped into hardware to provide threads running in both hardware and software uniform access to a set of sub-microsecond, minimal-jitter services. Migrating the operating system into hardware removes the potential bottleneck of routing all system service requests through a central CPU.
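The programming model builds on standard POSIX threads. As a reminder of that baseline (plain pthreads here, not hthreads' own hardware-thread API), the example below expresses a worker thread once with ordinary thread creation, joining, and mutex synchronization; in the hthreads system a thread written in this style could be compiled for the CPU or synthesized for the FPGA.

#include <pthread.h>
#include <stdio.h>

/* Standard pthreads baseline: one thread function plus mutex-protected shared
 * state.  This example itself is plain software; hthreads extends the same
 * abstractions (via the HWTI) to hardware-resident threads. */
static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    long increments = (long)(size_t)arg;
    for (long i = 0; i < increments; ++i) {
        pthread_mutex_lock(&lock);     /* synchronization primitive provided   */
        counter++;                     /* by the OS (in hthreads, by hardware) */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; ++i)
        pthread_create(&t[i], NULL, worker, (void *)(size_t)1000);
    for (int i = 0; i < 4; ++i)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);   /* 4000 */
    return 0;
}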
Lecture Notes in Computer Science, 2009
Reconfigurable computing is an emerging paradigm enabled by the growth in size and speed of FPGAs. In this paper we discuss its place in the evolution of computing as a technology as well as the role it can play in the current technology outlook. We discuss the evolution of ROCCC (Riverside Optimizing Compiler for Configurable Computing) in this context.
In this work, we propose a configurable many-core overlay for high-performance embedded computing. The size of the internal memory, the supported operations, and the number of ports can be configured independently for each core of the overlay. The overlay was evaluated with matrix multiplication, LU decomposition, and the Fast Fourier Transform (FFT) on a ZYNQ-7020 FPGA platform. The results show that using a system-level many-core overlay avoids complex hardware design and still provides good performance results.
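The per-core configuration described above can be pictured as a small parameter record like the one below; the field names and values are illustrative, not the overlay's actual configuration format.

#include <stdio.h>

/* Illustrative per-core configuration record for a many-core overlay: each
 * core gets its own local memory size, port count, and enabled operations.
 * The names and encoding are hypothetical. */
typedef struct {
    unsigned local_mem_words;   /* size of the core's internal memory     */
    unsigned num_ports;         /* ports to the interconnect              */
    unsigned has_multiplier;    /* 1 if multiply is supported             */
    unsigned has_float;         /* 1 if floating-point ops are enabled    */
} core_config;

int main(void)
{
    /* e.g. cores doing FFT butterflies get multipliers and more memory,
     * while control-oriented cores stay minimal. */
    core_config cores[4] = {
        { 2048, 2, 1, 1 }, { 2048, 2, 1, 1 },
        {  512, 1, 0, 0 }, {  512, 1, 0, 0 },
    };
    for (int i = 0; i < 4; ++i)
        printf("core %d: %u words, %u ports, mul=%u, float=%u\n",
               i, cores[i].local_mem_words, cores[i].num_ports,
               cores[i].has_multiplier, cores[i].has_float);
    return 0;
}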
Proceedings of the Design Automation & Test in Europe Conference, 2006
In this paper, we propose two FPGA-area allocation algorithms based on profiling results for reducing the impact on performance of dynamic reconfiguration overheads. The problem of FPGA-area allocation is presented as a 0-1 integer linear programming problem and efficient solvers are incorporated for finding the optimal solutions. Additionally, we discuss the FPGA-area allocation problem in two scenarios. In the first
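In generic form (a simplified sketch, not necessarily the paper's exact formulation), such a 0-1 integer linear program assigns each candidate task either to reconfigurable hardware or to software and maximizes the profiled benefit of hardware execution subject to the available FPGA area:

\max \sum_{i=1}^{n} b_i x_i
\quad \text{subject to} \quad
\sum_{i=1}^{n} a_i x_i \le A, \qquad x_i \in \{0, 1\},

where x_i = 1 if candidate task i is allocated FPGA area, a_i is its area requirement, b_i is its profiled benefit (e.g. execution cycles saved net of reconfiguration overhead), and A is the total reconfigurable area available.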
Implementing an application on a FPGA remains a difficult, non-intuitive task that often requires hardware design expertise in a hardware description language (HDL). High-level synthesis (HLS) raises the design abstraction from HDL to languages such as C/C++/Scala/Java. Despite this, in order to get a good quality of result (QoR), a designer must carefully craft the HLS code. In other words, HLS designers must implement the application using an abstract language in a manner that generates an efficient micro-architecture; we call this process writing restructured code. This reduces the benefits of implementing the application at a higher level of abstraction and limits the impact of HLS by requiring explicit knowledge of the underlying hardware architecture. Developers must know how to write code that reflects low level implementation details of the application at hand as it is interpreted by HLS tools. As a result, FPGA design still largely remains job of either hardware engineers or expert HLS designers. In this work, we aim to take a step towards making HLS tools useful for a broader set of programmers.
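As a small example of what "restructured code" means in practice (generic HLS-style C; the pragma spellings follow Vivado HLS conventions and are an assumption, since the abstract does not fix a tool), a naive accumulation loop is rewritten so the tool can pipeline it and access several array banks in parallel.

/* Naive HLS input: a single accumulation chain and one memory port limit
 * the achievable parallelism. */
float dot_naive(const float a[256], const float b[256])
{
    float sum = 0.0f;
    for (int i = 0; i < 256; ++i)
        sum += a[i] * b[i];
    return sum;
}

/* Restructured HLS input: partial sums break the loop-carried dependence and
 * directives expose the intended micro-architecture.  Pragma names follow
 * Vivado HLS style and may differ in other tools. */
float dot_restructured(const float a[256], const float b[256])
{
#pragma HLS ARRAY_PARTITION variable=a cyclic factor=4
#pragma HLS ARRAY_PARTITION variable=b cyclic factor=4
    float partial[4] = {0.0f, 0.0f, 0.0f, 0.0f};
#pragma HLS ARRAY_PARTITION variable=partial complete
    for (int i = 0; i < 256; i += 4) {
#pragma HLS PIPELINE II=1
        for (int j = 0; j < 4; ++j)
            partial[j] += a[i + j] * b[i + j];
    }
    return partial[0] + partial[1] + partial[2] + partial[3];
}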
ACS/IEEE International Conference on Computer Systems and Applications, 2003. Book of Abstracts., 2003
The main focus of this paper is on implementing high-level functional algorithms in reconfigurable hardware. The approach adopts the transformational programming paradigm for deriving massively parallel algorithms from functional specifications. It extends previous work by systematically generating efficient circuits and mapping them into reconfigurable hardware. The massive parallelisation of the algorithm works by carefully composing "off the shelf" highly parallel implementations of each of the basic building blocks involved in the algorithm. These basic building blocks are a small collection of well-known higher-order functions such as map, fold, and zipwith. By using function decomposition and data refinement techniques, these powerful functions are refined into highly parallel implementations described in Hoare's CSP. The CSP descriptions are very closely associated with Handel-C program fragments. Handel-C is a programming language based on C and extended with parallelism and communication primitives taken from CSP. In the final stage, the circuit description is generated by compiling the Handel-C programs and mapping them onto the targeted reconfigurable hardware, such as the Celoxica RC-1000 FPGA system. This approach is illustrated by a case study involving the generation of several versions of the matrix multiplication algorithm.
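The decomposition into higher-order building blocks can be sketched in plain C (this mirrors the functional specification only loosely; the paper's actual derivation targets CSP/Handel-C, which is not reproduced here): a dot product is zipwith (*) followed by fold (+), and matrix multiplication maps that dot product over the rows of one matrix and the columns of the other.

#include <stdio.h>

#define N 3

/* dot(u, v) = fold (+) 0 (zipwith (*) u v); each stage corresponds to a
 * well-known parallel hardware form (elementwise multipliers, adder tree). */
static int dot(const int u[N], const int v[N])
{
    int prod[N], sum = 0;
    for (int k = 0; k < N; ++k) prod[k] = u[k] * v[k];  /* zipwith (*) */
    for (int k = 0; k < N; ++k) sum += prod[k];         /* fold (+)    */
    return sum;
}

/* matmul = map over rows of A of (map over columns of B of dot);
 * bt holds B transposed so each column is available as a row. */
static void matmul(const int a[N][N], const int bt[N][N], int c[N][N])
{
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            c[i][j] = dot(a[i], bt[j]);
}

int main(void)
{
    int a[N][N]  = {{1,0,0},{0,1,0},{0,0,1}};   /* identity matrix */
    int bt[N][N] = {{1,4,7},{2,5,8},{3,6,9}};   /* B transposed    */
    int c[N][N];
    matmul(a, bt, c);
    printf("c[1][2] = %d\n", c[1][2]);          /* equals B[1][2] = 6 */
    return 0;
}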