2006, Microprocessors and Microsystems
Reconfigurable systems can offer the high spatial parallelism and fine-grained, bit-level resource control traditionally associated with hardware implementations, along with the flexibility and adaptability characteristic of software. While reconfigurable systems create new opportunities for engineering and delivering high-performance programmable systems, the traditional approaches to programming and managing computations used for hardware systems (e.g., Verilog, VHDL) and software systems (e.g., C, Fortran, Java) are inappropriate and inadequate for exploiting reconfigurable platforms. To address this need, we develop a stream-oriented compute model, system architecture, and execution patterns which can capture and exploit the parallelism of spatial computations while simultaneously abstracting software applications from hardware details (e.g., timing, device capacity, and microarchitectural implementation details) and consequently allowing applications to scale to exploit newer, larger, and faster hardware platforms. Further, we describe hardware and software techniques that make this late-bound platform mapping viable and efficient.
2000
A primary impediment to widespread exploitation of reconfigurable computing is the lack of a unifying computational model which allows application portability and longevity without sacrificing a substantial fraction of the raw capabilities. We introduce SCORE (Stream Computation Organized for Reconfigurable Execution), a stream-based compute model which virtualizes reconfigurable computing resources (compute, storage, and communication) by dividing a computation up into fixed-size "pages" and time-multiplexing the virtual pages on available physical hardware. Consequently, SCORE applications can scale up or down automatically to exploit a wide range of hardware sizes. We hypothesize that the SCORE model will ease development and deployment of reconfigurable applications and expand the range of applications which can benefit from reconfigurable execution. Further, we believe that a well-engineered SCORE implementation can be efficient, wasting little of the capabilities of the raw hardware. In this paper, we introduce the key components of the SCORE system.
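The paging idea at the heart of the abstract above can be illustrated with a small scheduling sketch. The names here (`schedule_round_robin`, the round-robin policy itself) are illustrative assumptions, not SCORE's actual scheduler; the sketch only models time-multiplexing many virtual pages onto fewer physical pages so the same application scales down to smaller hardware.

```python
# Illustrative sketch (not SCORE code): time-multiplex N virtual
# compute "pages" onto P physical pages, one round-robin group per
# timestep, so every virtual page eventually gets hardware time.

def schedule_round_robin(num_virtual, num_physical, steps):
    """Return, for each timestep, the list of virtual pages resident
    in physical hardware during that step."""
    timeline = []
    cursor = 0
    for _ in range(steps):
        resident = [(cursor + i) % num_virtual
                    for i in range(min(num_physical, num_virtual))]
        timeline.append(resident)
        cursor = (cursor + num_physical) % num_virtual
    return timeline

# 6 virtual pages on only 2 physical pages: the application still
# runs, just more slowly -- the same binary fits smaller hardware.
print(schedule_round_robin(6, 2, 3))  # [[0, 1], [2, 3], [4, 5]]
```

On larger hardware (say, 6 physical pages) every virtual page would be resident simultaneously, which is the automatic scale-up the abstract describes.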
New parallel architectures are emerging to meet the increased computational demands of streaming applications. This creates a need for high-level, architecture-independent languages. One such language is StreamIt, designed around the notions of streams and stream transformers, which allows efficient mapping to a variety of architectures. This paper presents our approach of compiling StreamIt applications to the XPP reconfigurable array architecture. We focus mainly on the compiler back end. Although StreamIt exposes the parallelism in the stream program, still a thorough analysis is needed to adapt it to the target architecture. A code generator has been designed for the XPP. It has been demonstrated that by applying optimizations, performance comparable to the low level NML implementation can be achieved. Moreover, the construction of the compiler makes it possible to port StreamIt applications to multiprocessor architectures by doing some architecture specific modifications in the back end.
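The stream-and-transformer structure that StreamIt is built around can be sketched with ordinary generators. This is an illustrative analogy, not StreamIt syntax: the filter names (`source`, `scale`, `moving_sum`) are made up for the example.

```python
# Sketch of the stream-transformer idea (Python generators standing
# in for StreamIt filters; not actual StreamIt code).

def source(values):
    yield from values

def scale(stream, k):
    # a stateless filter: one item in, one item out
    for x in stream:
        yield x * k

def moving_sum(stream, window):
    # a stateful filter holding a sliding window of recent items
    buf = []
    for x in stream:
        buf.append(x)
        if len(buf) > window:
            buf.pop(0)
        yield sum(buf)

# A pipeline is just filter composition; a back end like the one the
# paper describes is free to map each stage to its own region of a
# reconfigurable array.
pipeline = moving_sum(scale(source(range(5)), 2), window=2)
print(list(pipeline))  # [0, 2, 6, 10, 14]
```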
2007
This work deals with reconfigurable computing platforms for high-speed simulation of physical phenomena, based on numerical models of linear algebraic systems. This type of simulation is extremely important in research centers such as CENPES/Petrobras, which develops geophysical-processing applications for oil and gas prospecting. Currently, these applications are implemented on conventional PC clusters. A new approach to this type of problem is presented here, based on reconfigurable computer systems using Field Programmable Gate Array (FPGA) technology, along with its implications for hardware/software partitioning, the operating system, memory connections, communication, and device drivers. Such technologies make appreciable gains possible in terms of performance, electric power, and processing speed when compared to conventional clusters. This solution also promotes cost reduction when applied to the massive-computation, high-complexity, large-data applications common in scientific computing.
Computer, 2000
Initial performance results with FPGAs were impressive. However, commercial FPGAs have inherent shortcomings that heretofore made reconfigurable computing impractical for mainstream computing. One is logic granularity: FPGAs are designed for logic replacement, so the granularity of their functional units is optimized to replace random logic, not to perform multimedia computations. Reconfigurable computing will change the way computing systems are designed, built, and used. PipeRench, a new reconfigurable fabric, combines the flexibility of general-purpose processors with the efficiency of customized hardware to achieve extreme performance speedup.
MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021
Coarse-grain reconfigurable arrays (CGRAs) can achieve much higher performance and efficiency than general-purpose cores, approaching the performance of a specialized design while retaining programmability. Unfortunately, CGRAs have so far only been effective on applications with regular compute patterns. However, many important workloads like graph analytics, sparse linear algebra, and databases are irregular applications with unpredictable access patterns and control flow. Since CGRAs map computation statically to a spatial fabric of functional units, irregular memory accesses and control flow cause frequent stalls and load imbalance. We present Fifer, an architecture and compilation technique that makes irregular applications efficient on CGRAs. Fifer first decouples irregular applications into a feed-forward network of pipeline stages. Each resulting stage is regular and can efficiently use the CGRA fabric. However, irregularity causes stages to have widely varying loads, resulting in high load imbalance if they execute spatially in a conventional CGRA. Fifer solves this by introducing dynamic temporal pipelining: it time-multiplexes multiple stages onto the same CGRA, and dynamically schedules stages to avoid load imbalance. Fifer makes time-multiplexing fast and cheap to quickly respond to load imbalance while retaining the efficiency and simplicity of a CGRA design. We show that Fifer improves performance by gmean 2.8× (and up to 5.5×) over a conventional CGRA architecture (and by gmean 17× over an out-of-order multicore) on a variety of challenging irregular applications.
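The dynamic temporal pipelining idea can be sketched in software. This is an illustrative model, not Fifer's actual scheduler: stages share one fabric, and at each step the scheduler runs whichever stage has the most pending work, which is what counters the load imbalance the abstract describes.

```python
# Sketch of dynamic temporal pipelining (illustrative only): stages
# of a feed-forward pipeline share one fabric; at each step we run
# the stage whose input queue is fullest.

from collections import deque

def run_pipeline(stages, inputs):
    """stages: list of (name, fn), where fn maps one input item to a
    list of items for the next stage (irregular fan-out allowed)."""
    queues = [deque(inputs)] + [deque() for _ in stages]
    trace = []
    while any(queues[:-1]):
        # pick the non-empty stage with the most pending work
        i = max((i for i in range(len(stages)) if queues[i]),
                key=lambda i: len(queues[i]))
        name, fn = stages[i]
        queues[i + 1].extend(fn(queues[i].popleft()))
        trace.append(name)
    return list(queues[-1]), trace

out, trace = run_pipeline(
    [("expand", lambda x: [x, x]),      # irregular stage: fan-out of 2
     ("square", lambda x: [x * x])],
    [1, 2])
print(out)  # [1, 1, 4, 4]
```

A static spatial mapping would give "expand" and "square" fixed resources even while one of them sits idle; the dynamic schedule above redirects the shared fabric to whichever stage is backed up.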
2001
Reconfigurable computing devices such as Field Programmable Gate Arrays (FPGAs) have demonstrated 10x–100x gains over conventional microprocessors in performance and functional density (operations per area-time) for a variety of applications [6]. The strength of reconfigurable computing comes from its combination of spatial execution with programmability: the former allows computational data paths to be highly parallel, while the latter allows data paths to be highly specialized to the application at hand. The commercial marketplace has relegated reconfigurable devices to use primarily as ASIC replacements, executing only a single static configuration. This usage ignores many of the key performance benefits of reconfigurable technology, for instance dynamic reconfiguration and run-time specialization. The Berkeley SCORE project [4] contends that the present underuse of reconfigurable technology is due in great part to a lack of any unifying compute model to support its key technolo...
Zenodo (CERN European Organization for Nuclear Research), 2022
Reconfigurable heterogeneous computing systems (RHCS) have been used to exploit parallelism by means of coupled and coordinated processing between FPGAs and different microprogrammable computing devices. However, these systems have high programming complexity due to the details associated with parallelism and FPGA logic design. Therefore, developing the hardware and software components of an application at the same level of abstraction has been difficult to achieve. Several techniques described in the literature attempt to reduce this complexity for the programmer, but without achieving sufficient transparency and abstraction. In this paper we introduce a reconfigurable pattern of parallel 'pipeline' computing, called PipeSkeleton. It is an algorithmic skeleton provided as a high-level template in OpenCL code. As a demonstration, a test suite with different configurations integrating hardware and software components is described. It was shown that configurations with hardware-implemented kernels and software-implemented data input/output run faster and consume fewer resources. In conclusion, the tool provides the programmer with the level of abstraction needed to easily move functionality between software and hardware during the design-space exploration stage of an application.
ACS/IEEE International Conference on Computer Systems and Applications, 2003. Book of Abstracts., 2003
The main focus of this paper is on implementing high-level functional algorithms in reconfigurable hardware. The approach adopts the transformational programming paradigm for deriving massively parallel algorithms from functional specifications. It extends previous work by systematically generating efficient circuits and mapping them into reconfigurable hardware. The massive parallelisation of the algorithm works by carefully composing "off the shelf" highly parallel implementations of each of the basic building blocks involved in the algorithm. These basic building blocks are a small collection of well-known higher-order functions such as map, fold, and zipWith. By using function decomposition and data refinement techniques, these powerful functions are refined into highly parallel implementations described in Hoare's CSP. The CSP descriptions are very closely associated with Handel-C program fragments. Handel-C is a programming language based on C and extended with parallelism and communication primitives taken from CSP. In the final stage the circuit description is generated by compiling Handel-C programs and mapping them onto the targeted reconfigurable hardware, such as the Celoxica RC-1000 FPGA system. This approach is illustrated by a case study involving the generation of several versions of the matrix multiplication algorithm.
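The building-block style described above can be sketched by composing matrix multiplication, the paper's own case study, from map, fold, and zipWith. This is plain Python standing in for the functional specification, not the paper's CSP/Handel-C derivation; the helper names are ours.

```python
# Sketch: matrix multiplication composed from the higher-order
# building blocks named in the abstract (map, fold, zipWith). Each
# of these combinators has a well-known highly parallel hardware
# implementation (e.g. fold as a reduction tree).

from functools import reduce

def zip_with(f, xs, ys):
    return [f(x, y) for x, y in zip(xs, ys)]

def fold(f, init, xs):
    return reduce(f, xs, init)

def dot(row, col):
    # zipWith (*) then fold (+): a multiply-accumulate tree in hardware
    return fold(lambda a, b: a + b, 0,
                zip_with(lambda x, y: x * y, row, col))

def matmul(a, b):
    cols = [list(c) for c in zip(*b)]          # transpose b
    return [list(map(lambda c: dot(row, c), cols)) for row in a]

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Because every operator is one of the stock combinators, refining the whole algorithm reduces to substituting each combinator's parallel circuit, which is the derivation strategy the paper describes.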
Lecture Notes in Computer Science, 2009
Reconfigurable computing is an emerging paradigm enabled by the growth in size and speed of FPGAs. In this paper we discuss its place in the evolution of computing as a technology as well as the role it can play in the current technology outlook. We discuss the evolution of ROCCC (Riverside Optimizing Compiler for Configurable Computing) in this context.
arXiv: Programming Languages, 2017
This methodology paper addresses high-performance, high-productivity programming on spatial architectures. Spatial architectures are efficient for executing dataflow algorithms, yet for high-performance programming, productivity is low and verification is painful. We show that coding and verification are the biggest obstacles to the wide adoption of spatial architectures. We propose a new programming methodology, T2S (Temporal to Spatial), to remove this obstacle. A programmer specifies a temporal definition and a spatial mapping. The temporal definition defines the functionality to compute, while the spatial mapping defines how to decompose the functionality and map the decomposed pieces onto a spatial architecture. The specification precisely controls a compiler to actually implement the loop and data transformations specified in the mapping. The specification is loop-nest- and matrix-oriented, and thus lends itself to the compiler for automatic, static verification. Many generic, strategic loop and data optimizations can be systematically expressed. Consequently, high performance is expected with substantially higher productivity: compared with high-performance programming in today's high-level synthesis (HLS) languages or hardware description languages (HDLs), the engineering effort on coding and verification is expected to be reduced from months to hours, a reduction of 2 or 3 orders of magnitude.
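The temporal/spatial separation can be sketched as follows. This is an illustrative analogy in plain Python, not T2S syntax; the function names and the particular decomposition are assumptions for the example.

```python
# Sketch of the T2S separation (illustrative, not T2S syntax): the
# temporal definition says WHAT to compute; the spatial mapping says
# HOW to decompose it into pieces for a spatial fabric.

def temporal_definition(a, b):
    # what to compute: a dot product
    return sum(x * y for x, y in zip(a, b))

def spatial_mapping(a, b, pieces):
    # how to compute it: split into `pieces` partial dot products
    # (each piece could occupy its own region of the fabric), then
    # reduce the partial results.
    n = len(a)
    chunk = (n + pieces - 1) // pieces
    partials = [temporal_definition(a[i:i + chunk], b[i:i + chunk])
                for i in range(0, n, chunk)]
    return sum(partials)

a, b = [1, 2, 3, 4], [5, 6, 7, 8]
print(spatial_mapping(a, b, 2))  # 70, same as temporal_definition(a, b)
```

The point the abstract makes is that because the mapping is expressed against the unchanged temporal definition, the compiler can statically check that the decomposed form computes the same function.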
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2008
This paper introduces hthreads, a unifying programming model for specifying application threads running within a hybrid computer processing unit (CPU)/field-programmable gate array (FPGA) system. Presently accepted hybrid CPU/FPGA computational models, and access to these computational models via high-level languages, focus on programming-language extensions to increase accessibility and portability. However, this paper argues that new high-level programming models built on common software abstractions better address these goals. The hthreads system, in general, is unique within the reconfigurable computing community as it includes operating system and middleware layer abstractions that extend across the CPU/FPGA boundary. This enables all platform components to be abstracted into a unified multiprocessor architecture platform. Application programmers can then express their computations using threads specified from a single POSIX threads (pthreads) multithreaded application program and can then compile the threads to either run on the CPU or synthesize them to run within an FPGA. To enable this seamless framework, we have created the hardware thread interface (HWTI) component to provide an abstract, platform-independent compilation target for hardware-resident computations. The HWTI enables the use of standard thread communication and synchronization operations across the software/hardware boundary. Key operating system primitives have been mapped into hardware to provide threads running in both hardware and software uniform access to a set of sub-microsecond, minimal-jitter services. Migrating the operating system into hardware removes the potential bottleneck of routing all system service requests through a central CPU.
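The programming model, as opposed to the hardware mechanism, can be sketched in software. Here Python's `threading` stands in for POSIX pthreads purely as an analogy: all computations are ordinary threads of one program sharing standard synchronization primitives. In hthreads, some of these threads would be compiled for the CPU and others synthesized to the FPGA behind the same thread interface.

```python
# Sketch of the single-program, uniform-threads model (Python
# threading standing in for pthreads; illustrative only). The worker
# below is written once; in hthreads it could run on the CPU or be
# synthesized to hardware without changing the program structure.

import threading

def worker(data, lock, total):
    s = sum(data)
    with lock:          # a standard mutex, usable from any thread
        total[0] += s

lock = threading.Lock()
total = [0]
threads = [threading.Thread(target=worker,
                            args=(range(i + 1), lock, total))
           for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(total[0])  # 0 + 1 + 3 = 4
```

The HWTI described in the abstract is what lets a hardware-resident thread participate in exactly this kind of lock/join protocol alongside software threads.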
Design, Automation & Test in Europe Conference & Exhibition (DATE), 2015, 2015
The acceleration of applications running on a general-purpose processor (GPP), by mapping parts of their execution to reconfigurable hardware, is an approach which does not require the program's source code and still ensures program portability across different target reconfigurable fabrics. However, the problem is very challenging, as suitable sequences of GPP instructions need to be translated/mapped to hardware, possibly at runtime. Thus, all mapping steps, from compiler analysis and optimizations to hardware generation, need to be both efficient and fast. This paper introduces some of the most representative approaches for binary acceleration using reconfigurable hardware, and presents our binary acceleration approach and the latest results. Our approach extends a GPP with a Reconfigurable Processing Unit (RPU), both sharing the data memory. Repeating sequences of GPP instructions are migrated to an RPU composed of functional units and interconnect resources, able to exploit instruction-level parallelism, e.g., via loop pipelining. Although we envision a fully dynamic system, currently the RPU resources are selected and organized offline using execution trace information. We present implementation prototypes of the system on a Spartan-6 FPGA with a MicroBlaze as the GPP, and the very encouraging results achieved with a number of benchmarks.
IEEE Design and Test of Computers, 2005
Proceedings of the international conference on Compilers, architecture, and synthesis for embedded systems - CASES '01, 2001
The rapid growth of silicon densities has made it feasible to deploy reconfigurable hardware as a highly parallel computing platform. However, in most cases, the application needs to be programmed in hardware description or assembly languages, whereas most application programmers are familiar with the algorithmic programming paradigm. SA-C has been proposed as an expression-oriented language designed to implicitly express data-parallel operations. Morphosys is a reconfigurable system-on-chip architecture that supports a data-parallel, SIMD computational model. This paper describes a compiler framework to analyze SA-C programs, perform optimizations, and map the application onto the Morphosys architecture. The mapping process involves operation scheduling, resource allocation and binding, and register allocation in the context of the Morphosys architecture. The execution times of some compiled image-processing kernels can achieve up to 42x speed-up over an 800 MHz Pentium III machine.