ACACES Poster Abstracts, L'Aquila, Italy
Loops are an important source of optimization. In this paper, we show how traditional loop transformations such as loop unrolling and loop shifting can be applied in the context of reconfigurable computing. By applying unrolling and shifting to a loop containing a hardware kernel, we ...
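As a minimal illustration of the unrolling half of this transformation (shifting is sketched after the fuller abstract below), consider a K-loop whose body calls a hardware-mapped kernel. run_kernel_hw() and N are illustrative placeholders, not names from the paper:

```c
/* Minimal sketch of unrolling a K-loop: a loop whose body calls a
 * hardware-mapped kernel. With enough area on the fabric, the unrolled
 * kernel calls can execute as concurrent instances. */

#define N 64  /* assumed iteration count, a multiple of the unroll factor */

static void run_kernel_hw(int block) { (void)block; /* stand-in for the FPGA kernel */ }

void k_loop_original(void) {
    for (int i = 0; i < N; i++)
        run_kernel_hw(i);          /* one kernel instance per iteration */
}

void k_loop_unrolled_by_2(void) {
    for (int i = 0; i < N; i += 2) {
        run_kernel_hw(i);          /* with sufficient area, these two   */
        run_kernel_hw(i + 1);      /* instances run concurrently in HW  */
    }
}
```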
Field Programmable Logic …, 2008
ACM Transactions on Reconfigurable Technology and Systems, 2009
In this paper, we present a new technique for optimizing loops that contain kernels mapped on a reconfigurable fabric. We assume the Molen machine organization as our framework. We propose combining loop unrolling with loop shifting, which is used to relocate the function calls contained in the loop body such that in every iteration of the transformed loop, software functions (running on GPP) execute in parallel with multiple instances of the kernel (running on FPGA). The algorithm computes the optimal unroll factor and determines the most appropriate transformation (which can be the combination of unrolling plus shifting or either of the two). This method is based on profiling information about the kernel's execution times on GPP and FPGA, memory transfers and area utilization. In the experimental part, we apply this method to several kernels from loop nests extracted from real-life applications (DCT and SAD from MPEG2 encoder, Quantizer from JPEG, and Sobel's Convolution) and perform an analysis of the results, comparing them with the theoretical maximum speedup by Amdahl's Law and showing when and how our transformations are beneficial.
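The overlap described above can be sketched as follows, assuming an asynchronous start/wait interface to the FPGA kernel; sw_part(), hw_kernel_start(), and hw_kernel_wait() are hypothetical names, and the unroll factor 2 is arbitrary. After shifting, every steady-state iteration keeps the GPP and the FPGA busy at the same time:

```c
/* Hedged sketch of unrolling (factor 2) combined with loop shifting.
 * The software part of iterations i+2, i+3 overlaps with the hardware
 * kernels of iterations i, i+1. All names are illustrative stubs. */

#define N 64  /* assumed to be a multiple of the unroll factor */

static void sw_part(int i)         { (void)i; }  /* GPP work (stub)          */
static void hw_kernel_start(int i) { (void)i; }  /* launch FPGA kernel (stub)*/
static void hw_kernel_wait(void)   { }           /* wait for kernels (stub)  */

void k_loop_shifted_unrolled(void) {
    sw_part(0);                        /* prologue: peeled software parts */
    sw_part(1);
    for (int i = 0; i < N - 2; i += 2) {
        hw_kernel_start(i);            /* two kernel instances in HW ...  */
        hw_kernel_start(i + 1);
        sw_part(i + 2);                /* ... overlapped with GPP work    */
        sw_part(i + 3);
        hw_kernel_wait();
    }
    hw_kernel_start(N - 2);            /* epilogue: remaining kernels */
    hw_kernel_start(N - 1);
    hw_kernel_wait();
}
```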
2009 International Conference on Field-Programmable Technology, 2009
In this paper, we propose new techniques for improving the performance of applications running on a reconfigurable platform supporting the Molen programming paradigm. We focus on parallelizing loops that contain hardware-mapped kernels in the loop body (called K-loops) with wavefront-like dependencies. For this purpose, we use traditional transformations, such as loop skewing for eliminating the dependencies and loop unrolling for parallelization. The first technique presented in this paper improves the application performance by running multiple instances of the kernel in parallel on the reconfigurable hardware. The second technique extends the first one and determines how many kernel instances should be scheduled for software execution in each iteration, concurrently with the hardware execution, such that the hardware and software times are balanced. In the experimental part, we present results when parallelizing the Deblocking Filter (DF), which is part of the H.264 encoder and decoder, after skewing the main DF loop to eliminate the data dependencies. For unroll factor 8, we report a loop speedup of up to 4.78.
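Loop skewing itself is a standard transformation; a hedged sketch on a generic wavefront stencil (a stand-in for the Deblocking Filter loop, which is more involved) shows why it exposes parallelism: after skewing, all points on an anti-diagonal w = i + j are mutually independent and can be unrolled or dispatched as parallel kernel instances.

```c
/* Loop skewing on a wavefront-dependent loop nest. Each point depends
 * on its north and west neighbors, so rows cannot run in parallel; but
 * all points with the same w = i + j are independent of each other.
 * The stencil is an illustrative stand-in, not the DF kernel. */

#define NI 16
#define NJ 16

void skewed_wavefront(int a[NI][NJ]) {
    for (int w = 2; w <= (NI - 1) + (NJ - 1); w++) {   /* wavefront index  */
        int i_lo = (w - (NJ - 1) > 1) ? w - (NJ - 1) : 1;
        int i_hi = (w - 1 < NI - 1) ? w - 1 : NI - 1;
        for (int i = i_lo; i <= i_hi; i++) {           /* parallel within w */
            int j = w - i;
            a[i][j] = a[i - 1][j] + a[i][j - 1];       /* wavefront dep.    */
        }
    }
}
```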
Lecture Notes in Computer Science, 2008
Loops are an important source of optimization. In this paper, we address such optimizations for those cases when loops contain kernels mapped on reconfigurable fabric. We assume the Molen machine organization and Molen programming paradigm as our framework. The proposed algorithm computes the optimal unroll factor u for a loop that contains a hardware kernel K such that u instances of K run in parallel on the reconfigurable hardware, and the targeted balance between performance and resource usage is achieved. The parameters of the algorithm consist of profiling information about the execution times for running K in both hardware and software, the memory transfers and the utilized area. In the experimental part, we illustrate this method by applying it to a loop nest from a real-life application (MPEG2), containing the DCT kernel.
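The selection the abstract describes might look roughly like the following; the cost model (u software iterations versus u parallel hardware instances plus per-instance transfers) and all numbers are our illustrative assumptions, not the paper's exact formulation:

```c
/* Hedged sketch of unroll-factor selection from profiling data: given
 * profiled SW/HW kernel times, memory-transfer time, and per-instance
 * area, pick the factor with the best speedup that still fits. */

#include <stdio.h>

int best_unroll_factor(double t_sw, double t_hw, double t_mem,
                       double area_per_inst, double area_total,
                       int max_u) {
    int best_u = 1;
    double best_speedup = 0.0;
    for (int u = 1; u <= max_u; u++) {
        if (u * area_per_inst > area_total)
            break;                              /* no longer fits on fabric */
        /* u iterations in SW vs. u parallel HW instances + u transfers */
        double speedup = (u * t_sw) / (t_hw + u * t_mem);
        if (speedup > best_speedup) {
            best_speedup = speedup;
            best_u = u;
        }
    }
    return best_u;
}

int main(void) {
    /* e.g. a DCT-like kernel: SW 100us, HW 20us, 5us transfers, 10% area */
    printf("u = %d\n", best_unroll_factor(100.0, 20.0, 5.0, 0.10, 1.0, 16));
    return 0;
}
```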
2007 International Conference on Field-Programmable Technology, 2007
Loops are an important source of performance improvement, for which there exists a large number of compiler-based optimizations. Few optimizations assume that the loop will be fully mapped on hardware. In this paper, we discuss a loop transformation called Recursive Variable Expansion, which can be efficiently implemented in hardware. It removes all data dependencies from the program, so that the parallelism is bounded only by the amount of available resources. To show the performance improvement and the utilization of resources, we have chosen four kernels from widely used applications (FIR, DCT, the Sobel edge detection algorithm, and matrix multiplication). The hardware implementations of these kernels proved to be 1.5 to 77 times faster (depending on the application) than the code compiled and run on a PowerPC.
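The idea can be illustrated on a simple recurrence: expanding a[i] = a[i-1] + b[i] recursively expresses every a[i] directly in terms of the inputs, trading extra operations for a dependence-free form (in hardware, each expanded sum can become an adder tree of logarithmic depth). This is only a schematic example, not the paper's expansion algorithm:

```c
/* Illustration of the idea behind Recursive Variable Expansion: the
 * recurrence below is expanded so that every a[i] is expressed
 * directly in terms of the inputs b[0..i], removing the loop-carried
 * dependence; the expanded expressions can then be evaluated in
 * parallel, bounded only by available hardware resources. */

#define N 8

void with_dependence(int a[N], const int b[N]) {
    a[0] = b[0];
    for (int i = 1; i < N; i++)
        a[i] = a[i - 1] + b[i];      /* serial dependence chain of length N */
}

void after_expansion(int a[N], const int b[N]) {
    for (int i = 0; i < N; i++) {    /* iterations are now independent */
        int acc = 0;
        for (int j = 0; j <= i; j++)
            acc += b[j];             /* a[i] = b[0] + b[1] + ... + b[i] */
        a[i] = acc;
    }
}
```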
Workshop on …, 2004
There is a growing number of reconfigurable architectures that combine the advantages of a hardwired implementation (performance, power consumption) with the advantages of a software solution (flexibility, time to market). Today, there are devices on the market that can be dynamically reconfigured at run-time within one clock cycle. But the benefits of these architectures can only be realized if applications can be mapped efficiently. In this paper, we describe a design approach for reconfigurable architectures that takes into account three aspects: architecture, compiler, and applications. To realize the proposed design flow, we developed a synthesizable architecture model. From this model we obtain estimates for speed, area, and power that are used to provide the compiler with the necessary timing information and to optimize the architecture.
ACM Transactions on Architecture and Code Optimization, 2012
Pipelining algorithms are typically concerned with improving only the steady-state performance, or the kernel time. The pipeline setup time happens only once and therefore can be negligible compared to the kernel time. However, for Coarse-Grained Reconfigurable Architectures (CGRAs) used as a coprocessor to a main processor, pipeline setup can take much longer due to the communication delay between the two processors, and can become significant if it is repeated in an outer loop of a loop nest. In this paper we evaluate the overhead of such non-kernel execution times when mapping nested loops for CGRAs, and propose a novel architecture-compiler cooperative scheme to reduce the overhead, while also minimizing the number of extra configurations required. Our experimental results using loops from multimedia and scientific domains demonstrate that our proposed techniques can greatly increase the performance of nested loops by up to 2.87 times compared to the conventional approach of acc...
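A back-of-the-envelope model of the effect (our assumption, not the paper's formulation) makes the point: when the setup cost is paid once per outer-loop iteration, short inner loops let the non-kernel time dominate.

```c
/* Sketch of why repeated pipeline setup hurts on a coprocessor CGRA.
 * All parameters (setup latency, per-iteration kernel time, trip
 * counts) are illustrative inputs, e.g. from profiling. */

double repeated_setup_time(double t_setup, double t_iter,
                           long n_outer, long n_inner) {
    /* setup re-done for every outer iteration + steady-state kernel time */
    return (double)n_outer * (t_setup + (double)n_inner * t_iter);
}

double hoisted_setup_time(double t_setup, double t_iter,
                          long n_outer, long n_inner) {
    /* setup paid once, as an architecture/compiler co-optimization aims for */
    return t_setup + (double)(n_outer * n_inner) * t_iter;
}
```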
2004
This paper considers the problem of compiling programs, written in a high-level programming language, into hardware circuits executed by a Field Programmable Gate Array (FPGA). In particular, we consider the problem of synthesising nested loops that frequently access array elements stored in an external memory (outside the FPGA). We propose an aggressive profile-based compilation scheme, based on loop unrolling and code flattening techniques, where array references from/to the external memory are overlapped with an uninterrupted hardware evaluation of the synthesised loop's circuit. Experimental results show that large code segments can be compiled into circuits by using the proposed scheme.
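The overlap scheme can be pictured as double buffering: while the synthesized circuit evaluates tile k, tile k+1 is fetched from external memory. prefetch_tile(), process_tile(), and the tile sizes are illustrative assumptions; on the FPGA the two calls in the loop body proceed concurrently, rather than sequentially as in this software rendering:

```c
/* Double-buffered overlap of external-memory transfers with hardware
 * evaluation. In hardware the fetch of the next tile and the
 * processing of the current one run at the same time. */

#define TILES 16
#define TILE  256

static void prefetch_tile(int k, int buf[TILE])   { (void)k; (void)buf; } /* ext. mem -> buffer */
static void process_tile(const int buf[TILE])     { (void)buf; }          /* circuit evaluation */

void overlapped_loop(void) {
    static int buf[2][TILE];
    prefetch_tile(0, buf[0]);                        /* prologue fetch */
    for (int k = 0; k < TILES; k++) {
        if (k + 1 < TILES)
            prefetch_tile(k + 1, buf[(k + 1) & 1]);  /* fetch next tile ...   */
        process_tile(buf[k & 1]);                    /* ... while computing   */
    }
}
```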
2011
In this paper we present "Snake", a novel technique for allocating and executing hardware tasks on partially reconfigurable Xilinx FPGAs. Snake alleviates the bottleneck introduced by the Internal Configuration Access Port (ICAP) in Xilinx FPGAs by reusing both intermediate partial results and previously allocated pieces of circuitry. Moreover, when making allocation decisions, Snake considers aspects often neglected in previous approaches, such as the technological constraints introduced by reconfigurable technology and inter-task communication issues. Being a realistic solution, it has been implemented successfully on real FPGA hardware. We have verified its ability to reduce not only the overall execution time of a wide range of synthetic reconfigurable applications, but also the time overhead of making the allocation decisions in the first place.
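Without claiming to reproduce Snake's actual algorithm, the reuse-first flavor of such an allocation decision can be caricatured as: prefer a region that already holds the task's circuitry, since that avoids another pass through the ICAP. A rough sketch under our own assumptions:

```c
/* Reuse-aware allocation caricature (our assumption, not Snake itself):
 * reusing an already-configured region skips the ICAP entirely. */

#include <string.h>

#define REGIONS 4

static char loaded[REGIONS][32];   /* bitstream name per region; "" = free */

int allocate(const char *task) {
    for (int r = 0; r < REGIONS; r++)      /* 1) reuse existing circuitry */
        if (strcmp(loaded[r], task) == 0)
            return r;                      /* no ICAP traffic needed      */
    for (int r = 0; r < REGIONS; r++)      /* 2) otherwise take a free slot */
        if (loaded[r][0] == '\0') {
            strncpy(loaded[r], task, sizeof loaded[r] - 1);
            return r;                      /* pays one ICAP reconfiguration */
        }
    return -1;                             /* 3) else evict (policy omitted) */
}
```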
2008
The ever-increasing intricacy of systems and the growing use of reconfigurable heterogeneous devices significantly enlarge the design complexity of modern embedded systems. As a result, to create a good design, it is essential to perform Design Space Exploration (DSE) at various design levels in order to evaluate several design choices. DSE at early design stages helps designers to systematically explore trade-offs between various design goals and to make design decisions such as hardware/software partitioning, architecture-to-application mapping, task scheduling and allocation, and performance evaluation. As the design progresses, the design space can be gradually pruned of unsuitable design alternatives at different design levels until a final optimal solution is reached. In this work, we present a system-level framework for high-level runtime design space exploration of reconfigurable architectures.
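A toy rendering of early DSE with pruning (our illustration, not the presented framework): enumerate hardware/software partitionings of a few profiled tasks, discard those exceeding the area budget, and keep the fastest feasible mapping.

```c
/* Exhaustive DSE over HW/SW partitionings of TASKS tasks, with area
 * pruning. All profiled costs and the budget are illustrative. */

#define TASKS 4

int best_partitioning(void) {
    const double t_sw[TASKS] = {40, 25, 60, 10};   /* SW execution times */
    const double t_hw[TASKS] = { 8,  9, 15,  4};   /* HW execution times */
    const double area[TASKS] = {0.3, 0.2, 0.5, 0.1}; /* HW area fractions */
    const double budget = 0.8;                     /* available fabric   */

    int best = -1;
    double best_time = 1e30;
    for (int m = 0; m < (1 << TASKS); m++) {       /* bit i set => task i in HW */
        double time = 0.0, a = 0.0;
        for (int i = 0; i < TASKS; i++) {
            if (m & (1 << i)) { time += t_hw[i]; a += area[i]; }
            else              { time += t_sw[i]; }
        }
        if (a > budget)
            continue;                              /* prune infeasible points */
        if (time < best_time) { best_time = time; best = m; }
    }
    return best;                                   /* best mapping as bitmask */
}
```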