ACACES Poster Abstracts, L'Aquila, Italy
This paper presents a method for optimizing loops in applications targeting reconfigurable architectures, specifically addressing loop transformations, including loop unrolling and loop shifting. These transformations are shown to enhance performance by maximizing parallelism within loops, particularly when applied to kernel functions in embedded systems such as video encoders. The proposed approach reduces design-space exploration time and is applicable to a wide range of kernel hardware implementations.
Field Programmable Logic …, 2008
ACM Transactions on Reconfigurable Technology and Systems, 2009
In this paper, we present a new technique for optimizing loops that contain kernels mapped on a reconfigurable fabric. We assume the Molen machine organization as our framework. We propose combining loop unrolling with loop shifting, which is used to relocate the function calls contained in the loop body such that in every iteration of the transformed loop, software functions (running on GPP) execute in parallel with multiple instances of the kernel (running on FPGA). The algorithm computes the optimal unroll factor and determines the most appropriate transformation (which can be the combination of unrolling plus shifting or either of the two). This method is based on profiling information about the kernel's execution times on GPP and FPGA, memory transfers and area utilization. In the experimental part, we apply this method to several kernels from loop nests extracted from real-life applications (DCT and SAD from MPEG2 encoder, Quantizer from JPEG, and Sobel's Convolution) and perform an analysis of the results, comparing them with the theoretical maximum speedup by Amdahl's Law and showing when and how our transformations are beneficial.
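The unroll-plus-shift transformation described above can be sketched functionally. The sketch below is only an illustration of the restructuring, not the paper's algorithm: the kernel, the software function, the unroll factor U, and all constants are invented stand-ins, and both loops run sequentially here (on the Molen machine, each group of U kernel calls would execute on the FPGA while the GPP computes the next software parts).

```python
N, U = 8, 2  # trip count and a hypothetical unroll factor

def sw_part(i):    # software stage (would run on the GPP)
    return 3 * i + 1

def hw_kernel(x):  # kernel (would run as U parallel FPGA instances)
    return x * x

# original K-loop: each iteration runs the software part, then the kernel
ref = [hw_kernel(sw_part(i)) for i in range(N)]

# shifted + unrolled by U: kernel calls are relocated one group later,
# so the software parts of the next group sit next to the kernel calls
# of the current group and the two could overlap on the target machine
buf, out = [0] * N, [0] * N
for i in range(U):                          # prologue: first software parts
    buf[i] = sw_part(i)
for i in range(0, N - U, U):                # steady state
    for j in range(U):
        buf[i + U + j] = sw_part(i + U + j)   # GPP work for the next group
    for j in range(U):
        out[i + j] = hw_kernel(buf[i + j])    # U kernel instances
for i in range(N - U, N):                   # epilogue: remaining kernel calls
    out[i] = hw_kernel(buf[i])

assert out == ref  # the transformation preserves the loop's results
```

The prologue/epilogue pair is what loop shifting introduces: it pays one extra software group up front so that every steady-state iteration has independent software and hardware work.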
2009 International Conference on Field-Programmable Technology, 2009
In this paper, we propose new techniques for improving the performance of applications running on a reconfigurable platform supporting the Molen programming paradigm. We focus on parallelizing loops that contain hardware-mapped kernels in the loop body (called K-loops) with wavefront-like dependencies. For this purpose, we use traditional transformations, such as loop skewing for eliminating the dependencies and loop unrolling for parallelization. The first technique presented in this paper improves the application performance by running multiple instances of the kernel in parallel on the reconfigurable hardware. The second technique extends the first one and determines how many kernel instances should be scheduled for software execution in each iteration, concurrently with the hardware execution, such that the hardware and software times are balanced. In the experimental part, we present results when parallelizing the Deblocking Filter (DF), which is part of the H.264 encoder and decoder, after skewing the main DF loop to eliminate the data dependencies. For an unroll factor of 8, we report a loop speedup of up to 4.78.
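Loop skewing on a wavefront dependence pattern can be illustrated with a small sketch; the recurrence below is a generic stand-in, not the Deblocking Filter itself. Skewing reindexes the iteration space along anti-diagonals d = i + j: every point on one diagonal depends only on the previous diagonal, so the inner loop carries no dependencies and its iterations could be unrolled into parallel kernel instances.

```python
N = 6

def wavefront():
    # each cell depends on its north and west neighbors (wavefront pattern)
    a = [[1] * N for _ in range(N)]
    for i in range(1, N):
        for j in range(1, N):
            a[i][j] = a[i - 1][j] + a[i][j - 1]
    return a

def skewed():
    a = [[1] * N for _ in range(N)]
    # skewed: walk the anti-diagonals; inner-loop iterations are independent
    for d in range(2, 2 * N - 1):
        for i in range(max(1, d - N + 1), min(d, N)):
            j = d - i
            a[i][j] = a[i - 1][j] + a[i][j - 1]
    return a

assert skewed() == wavefront()  # skewing preserves the computation
```

The trade-off is that diagonal lengths vary, so the number of independent inner iterations (and thus usable kernel instances) grows and shrinks across the outer loop.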
2007 International Conference on Field-Programmable Technology, 2007
Loops are an important source of performance improvement, for which there exists a large number of compiler based optimizations. Few optimizations assume that the loop will be fully mapped on hardware. In this paper, we discuss a loop transformation called Recursive Variable Expansion, which can be efficiently implemented in hardware. It removes all the data dependencies from the program and then the parallelism is only bounded by the amount of resources one has. To show the performance improvement and the utilization of resources, we have chosen four kernels from widely used applications (FIR, DCT, Sobel edge detection algorithm and matrix multiplication). The hardware implementation of these kernels proved to be 1.5 to 77 times faster (depending on application) than the code compiled and run on PowerPC.
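The dependence-removal idea behind Recursive Variable Expansion can be shown on a FIR-like accumulation; the numbers below are arbitrary illustration values. Expanding the recurrence s[i] = s[i-1] + b[i]*c[i] all the way down yields a closed expression over the inputs only, so in hardware every product can be computed in parallel and the sum collapsed by a balanced adder tree.

```python
b = [1, 2, 3, 4]
c = [5, 6, 7, 8]

# original recurrence: a loop-carried dependence forces sequential execution
s = 0
for i in range(4):
    s = s + b[i] * c[i]

# after recursive expansion: all products are independent (parallel in
# hardware), then reduced by a tree of adders of depth log2(n)
prods = [b[i] * c[i] for i in range(4)]
expanded = (prods[0] + prods[1]) + (prods[2] + prods[3])

assert expanded == s  # expansion preserves the result
```

This is why, as the abstract notes, parallelism after the transformation is bounded only by available resources: the expression grows with the trip count, trading area for the removed dependencies.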
VLSI: Systems on a Chip, 2000
Currently, multi-FPGA reconfigurable computing systems are still commonly used for accelerating algorithms. This technology, where acceleration is achieved by spatial implementation of an algorithm in reconfigurable hardware, has proven to be feasible. However, the best-suited algorithms are those that are very structured, can benefit from deep pipelining, and need only local communication resources. Many algorithms cannot fulfil the third requirement once the problem size grows and multi-FPGA systems become necessary. In this paper we address the emulation of a run-time reconfigurable processor architecture, which scales better for this kind of computing problem.
ACM Transactions on Architecture and Code Optimization, 2012
Pipelining algorithms are typically concerned with improving only the steady-state performance, or the kernel time. The pipeline setup time happens only once and therefore can be negligible compared to the kernel time. However, for Coarse-Grained Reconfigurable Architectures (CGRAs) used as a coprocessor to a main processor, pipeline setup can take much longer due to the communication delay between the two processors, and can become significant if it is repeated in an outer loop of a loop nest. In this paper we evaluate the overhead of such non-kernel execution times when mapping nested loops for CGRAs, and propose a novel architecture-compiler cooperative scheme to reduce the overhead, while also minimizing the number of extra configurations required. Our experimental results using loops from multimedia and scientific domains demonstrate that our proposed techniques can greatly increase the performance of nested loops by up to 2.87 times compared to the conventional approach of acc...
ACM Transactions on Reconfigurable Technology and Systems, 2014
This article presents a reconfigurable hardware/software architecture for binary acceleration of embedded applications. A Reconfigurable Processing Unit (RPU) is used as a coprocessor of the General Purpose Processor (GPP) to accelerate the execution of repetitive instruction sequences called Megablocks. A toolchain detects Megablocks from instruction traces and generates customized RPU implementations. The implementation of Megablocks with memory accesses uses a memory-sharing mechanism to support concurrent accesses to the entire address space of the GPP's data memory. The scheduling of load/store operations and memory access handling have been optimized to minimize the latency introduced by memory accesses. The system is able to dynamically switch the execution between the GPP and the RPU when executing the original binaries of the input application. Our proof-of-concept prototype achieved geometric mean speedups of 1.60× and 1.18× for, respectively, a set of 37 benchmarks and a subset considering the 9 most complex benchmarks. With respect to a previous version of our approach, we achieved geometric mean speedup improvements from 1.22 to 1.53 for the 10 benchmarks previously used.
To accelerate the execution of an application, repetitive logic and arithmetic computation tasks may be mapped to reconfigurable hardware, since dedicated hardware can deliver much higher speeds than those of a general-purpose processor. However, this is only feasible if the run-time reconfiguration of new tasks is fast enough, so as not to delay application execution. Currently, this is opposed by architectural constraints intrinsic to current Field-Programmable Gate Array (FPGA) architectures. Despite all new features exhibited by current FPGAs, architecturally they are still largely based on general-purpose architectures that are inadequate for the demands of reconfigurable computing. Large configuration file sizes and poor hardware and software support for partial and dynamic reconfiguration limit the acceleration that reconfigurable computing may bring to applications. The objective of this work is the identification of the architectural limitations exhibited by current FPGAs...
2008
Dynamic hardware generation reduces the number of FPGA resources needed and speeds up the application by optimizing the configuration for the exact problem at hand at run-time. If the problem changes, the system needs to be reconfigured. When this occurs too often, the total reconfiguration overhead is too high and the benefit of using dynamic hardware generation vanishes. Hence, it is important to minimize the number of reconfigurations. We propose a novel technique to reduce the number of reconfigurations by using loop transformations. Our approach is similar to temporal data locality optimizations. By applying our technique, we can drastically reduce the number of reconfigurations, as indicated by the matrix multiplication example. After applying the loop transformations, the number of reconfigurations decreases by an order of magnitude. Combined with a dynamic hardware generation technique with a very low overhead, our technique obtains a significant speedup over generic circuits.
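The analogy with temporal locality can be made concrete with a toy model (the two-configuration schedule and the counting function below are invented for illustration, not the paper's matrix multiplication example): a loop transformation such as interchange groups together all iterations that need the same generated circuit, so the configuration changes once per circuit instead of once per iteration.

```python
def reconfigs(schedule):
    """Count how often the required configuration changes along a schedule."""
    n, current = 0, None
    for cfg in schedule:
        if cfg != current:
            n, current = n + 1, cfg
    return n

modes, data = ["A", "B"], range(100)

# original nesting: the needed configuration alternates every iteration
orig = [m for d in data for m in modes]
# after interchange: all iterations using one configuration are contiguous,
# analogous to improving temporal locality for a cached datum
swapped = [m for m in modes for d in data]

assert reconfigs(orig) == 200
assert reconfigs(swapped) == 2   # two orders of magnitude fewer
```

The same iterations execute in both versions; only their order changes, which is exactly what makes loop transformations a legal way to amortize reconfiguration cost.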
2005
The X4CP32 is an architecture that combines the parallel and reconfigurable paradigms. It consists of a grid of Reconfigurable and Programming Units (RPUs), each one containing 4 Cells (including a microprocessor in each Cell), responsible for all the processing and program flow. This paper presents architectural modifications in the X4CP32 in order to increase its performance. The RPU was implemented according to the VLIW (Very Long Instruction Word) methodology, and the Cells were redesigned with a pipelined implementation. These improvements raised the maximum IPC of the RPU from 0.5 to 4 with an area overhead of 26%. To evaluate the new architecture, versions of the 2D Discrete Cosine Transform, Montgomery Modular Multiplication and Color Space Conversion were mapped, using the baseline architecture and the pipelined VLIW architecture.
2011
In this paper we present "Snake", a novel technique for allocating and executing hardware tasks onto partially reconfigurable Xilinx FPGAs. Snake permits to alleviate the bottleneck introduced by the Internal Configuration Access Port (ICAP) in Xilinx FPGAs, by reusing both intermediate partial results and previously allocated pieces of circuitry. Moreover, Snake considers often neglected aspects in previous approaches when making allocation decisions, such as the technological constraints introduced by reconfigurable technology and inter-task communication issues. As a result of being a realistic solution its implementation using real FPGA hardware has been successful. We have checked its ability to reduce not only the overall execution time of a wide range of synthetic reconfigurable applications, but also time overheads in making allocation decisions in the first place.
International Journal of Reconfigurable Computing, 2012
We propose a fast data relay (FDR) mechanism to enhance existing CGRAs (coarse-grained reconfigurable architectures). FDR can not only provide multicycle data transmission concurrently with computations but also convert resource-demanding inter-processing-element global data accesses into local data accesses to avoid communication congestion. We also propose the supporting compiler techniques that can efficiently utilize the FDR feature to achieve higher performance for a variety of applications. Our results on FDR-based CGRA are compared with two other works in this field: ADRES and RCP. Experimental results for various multimedia applications show that FDR combined with the new compiler delivers up to 29% and 21% higher performance than ADRES and RCP, respectively.
ACM Transactions on Embedded Computing Systems, 2007
In this paper, we describe the compiler developed to target the Molen reconfigurable processor and programming paradigm. The compiler automatically generates optimized binary code for C applications, based on pragma annotation of the code executed on the reconfigurable hardware. For the IBM PowerPC 405 processor included in the Virtex II Pro platform FPGA, we implemented code generation, register and stack frame allocation following the PowerPC EABI (Embedded Application Binary Interface). The PowerPC backend has been extended to generate the appropriate instructions for the reconfigurable hardware and data transfer, taking into account the information of the specific hardware implementations and system. Starting with an annotated C application, a complete design flow has been integrated to generate the executable bitstream for the reconfigurable processor. The flexible design of the proposed infrastructure allows the special features of reconfigurable architectures to be taken into account. In order to hide the reconfiguration latencies, we implemented an instruction scheduling algorithm for the dynamic hardware configuration instructions. The algorithm schedules the hardware configuration instructions in advance, taking into account the conflicts for the reconfigurable hardware resources (FPGA area) between the hardware operations. To verify the Molen compiler, we used the M-JPEG video encoder, in which the extended Discrete Cosine Transform (DCT*) function was mapped on the FPGA. We obtained an overall speedup of 2.5 (about 84% efficiency over the maximal theoretical speedup of 2.96). This performance efficiency is achieved using an automatically generated, non-optimized DCT* hardware implementation. The instruction scheduling algorithm has been tested for DCT, Quantization and VLC operations.
Based on simulation results, we determine that, while a simple scheduling produces a significant performance decrease, our proposed scheduling contributes up to a 16x M-JPEG encoder speedup.
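The configuration-prefetch idea behind such scheduling can be illustrated with a toy timing model. Everything below (instruction names, latencies, the two schedules) is invented for illustration and is not the Molen compiler's actual algorithm: hoisting the configure ("set") instruction ahead of the matching "execute" lets the reconfiguration proceed in the background while other work runs.

```python
latency = {"set": 10, "execute": 2, "op": 1}  # hypothetical cycle counts

def run(schedule):
    """Return the finish time of a schedule; 'set' loads in the background."""
    t, ready = 0, {}          # ready[k] = cycle at which configuration k is loaded
    for instr, k in schedule:
        if instr == "set":
            ready[k] = t + latency["set"]   # reconfiguration runs in background
            t += 1                          # only the issue cycle is paid here
        elif instr == "execute":
            t = max(t, ready[k])            # stall until the fabric is configured
            t += latency["execute"]
        else:
            t += latency["op"]              # unrelated GPP work
    return t

naive   = [("op", None)] * 12 + [("set", "DCT"), ("execute", "DCT")]
hoisted = [("set", "DCT")] + [("op", None)] * 12 + [("execute", "DCT")]

assert run(naive) > run(hoisted)  # hoisting hides the reconfiguration latency
```

In the hoisted schedule the 10-cycle configuration overlaps the 12 cycles of ordinary work, so the execute instruction finds the fabric already configured and pays no stall.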
Lecture Notes in Computer Science, 2008
Loops are an important source of optimization. In this paper, we address such optimizations for those cases when loops contain kernels mapped on reconfigurable fabric. We assume the Molen machine organization and Molen programming paradigm as our framework. The proposed algorithm computes the optimal unroll factor u for a loop that contains a hardware kernel K such that u instances of K run in parallel on the reconfigurable hardware, and the targeted balance between performance and resource usage is achieved. The parameters of the algorithm consist of profiling information about the execution times for running K in both hardware and software, the memory transfers and the utilized area. In the experimental part, we illustrate this method by applying it to a loop nest from a real-life application (MPEG2), containing the DCT kernel.
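The shape of such an unroll-factor search can be sketched with a toy cost model. All numbers and the cost formula below are hypothetical (the paper's algorithm uses real profiling data and a different balance criterion): u kernel instances run in parallel on the FPGA, the per-iteration software part stays sequential, and the utilized area bounds how many instances fit.

```python
# hypothetical profiling inputs (cycles / area units)
T_sw_kernel = 100   # kernel time when run in software
T_hw_kernel = 20    # one hardware instance, incl. memory transfers
T_rest = 30         # software part of each iteration
area_kernel, area_total = 15, 100
N = 64              # loop trip count

def loop_time(u):
    """Time for the unrolled loop: ceil(N/u) groups of u parallel instances."""
    groups = -(-N // u)                    # ceiling division
    return groups * (T_hw_kernel + u * T_rest)

max_u = area_total // area_kernel          # area bound on parallel instances
best = min(range(1, max_u + 1), key=loop_time)
speedup = N * (T_sw_kernel + T_rest) / loop_time(best)
```

With these made-up numbers the search settles on the largest area-feasible factor, but when memory-transfer or software time grows with u the optimum can move to an interior value, which is the balance the algorithm targets.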
Design, Automation & Test in Europe Conference & Exhibition (DATE), 2015, 2015
The acceleration of applications running on a general purpose processor (GPP), by mapping parts of their execution to reconfigurable hardware, is an approach that does not require the program's source code and still ensures program portability over different target reconfigurable fabrics. However, the problem is very challenging, as suitable sequences of GPP instructions need to be translated/mapped to hardware, possibly at runtime. Thus, all mapping steps, from compiler analysis and optimizations to hardware generation, need to be both efficient and fast. This paper introduces some of the most representative approaches for binary acceleration using reconfigurable hardware, and presents our binary acceleration approach and the latest results. Our approach extends a GPP with a Reconfigurable Processing Unit (RPU), both sharing the data memory. Repeating sequences of GPP instructions are migrated to an RPU composed of functional units and interconnect resources, able to exploit instruction-level parallelism, e.g., via loop pipelining. Although we envision a fully dynamic system, currently the RPU resources are selected and organized offline using execution trace information. We present implementation prototypes of the system on a Spartan-6 FPGA with a MicroBlaze as GPP and the very encouraging results achieved with a number of benchmarks.
2013
Reconfigurable systems can offer the high spatial parallelism and fine-grained, bit-level resource control traditionally associated with hardware implementations, along with the flexibility and adaptability characteristic of software. While reconfigurable systems create new opportunities for engineering and delivering high-performance programmable systems, the traditional approaches to programming and managing computations used for hardware systems (e.g. Verilog, VHDL) and software systems (e.g. C, Fortran, Java) are inappropriate and inadequate for exploiting reconfigurable platforms. To address this need, we develop a stream-oriented compute model, system architecture, and execution patterns which can capture and exploit the parallelism of spatial computations while simultaneously abstracting software applications from hardware details (e.g., timing, device capacity, microarchitectural implementation details) and consequently allowing applications to scale to exploit newer, larg...
2010
In this paper we present a novel technique to accelerate reconfigurable video coding with parallel architectures. We focus on the use of the Graphics Processing Unit (GPU) as our platform for parallel processing, but the algorithm can be implemented on other parallel architectures. Implementation of the solution shows that execution time is reduced by 16-60%, depending on the decoder module implemented on the GPU.
International Journal of Electronics, 2007
In this paper, we target a Reconfigurable Instruction Set Processor (RISP), which tightly couples a coarse-grain Reconfigurable Functional Unit (RFU) to a RISC processor. Furthermore, the architecture is supported by a flexible development framework. By allowing the definition of alternate architectural parameters, the framework can be used to explore the design space and fine-tune the architecture at design time. Initially, two architectural enhancements, namely partial predicated execution and virtual opcode, are proposed, and the extensions performed in the architecture and the framework to support them are presented. To evaluate these enhancements, kernels from the multimedia domain are considered and an exploration to derive an appropriate instance of the architecture is performed. The efficiency of the derived instance and the proposed enhancements is evaluated using an MPEG-2 encoder application.
2010
The advantage in multiprocessors is the performance speedup obtained with processor-level parallelism. Similarly, the flexibility for application-specific adaptability is the advantage in reconfigurable architectures. To benefit from both these architectures, we present a reconfigurable multiprocessor template that combines parallelism in multiprocessors and flexibility in reconfigurable architectures. A fast, single cycle, resource-efficient, run-time reconfiguration scheme accelerates customisations in the reconfigurable multiprocessor template. Based on this methodology, a four-core multiprocessor called QuadroCore has been implemented on UMC's 90nm standard cells and on Xilinx's FPGA. QuadroCore is customisable and adapts to variations in the granularity of parallelism, the amount of communication between tasks, and the frequency of synchronisation. To validate the advantages of this approach, a diverse set of applications has been mapped onto the QuadroCore multiprocessor. Experimental results show speedups in the range of 3 to 11 in comparison to a single processor. In addition, energy savings of up to 30% were noted on account of reconfiguration. Furthermore, to steer application mapping based on power considerations, an instruction-level power model has been developed. Using this model, power-driven instruction selection introduces energy savings of up to 70% in the QuadroCore multiprocessor.
3rd IEEE International Symposium on Industrial Embedded Systems (SIES'2008), Montpellier (France), pp. 11-18, 11-13 June 2008
Dynamic scheduling algorithms have been successfully used for parallel computations of nested loops in traditional parallel computers and clusters. In this paper we propose a new architecture, implementing coarse-grain dynamic loop scheduling, suitable for reconfigurable hardware platforms. We use an analytical model and a case study to evaluate the performance of the proposed architecture. This approach makes efficient use of memory and processing elements and thus gives better results than previous approaches.