2010
Looping operations impose a significant bottleneck to achieving better computational efficiency for embedded applications. To confront this problem in embedded computation either in the form of programmable processors or FSMD (Finite-State Machine with Datapath) architectures, the use of customized loop controllers has been suggested. In this paper, a thorough examination of zero-cycle overhead loop controllers applicable to perfect loop nests operating on multi-dimensional data is presented. The design of such loop controllers is formalized by the introduction of a hardware algorithm that fully automates this task for the spectrum of behavioral as well as generated register-transfer level architectures. The presented algorithm would prove beneficial in the field of high-level synthesis of architectures for data-intensive processing. It is also shown that the proposed loop controllers can be efficiently utilized for supporting generalized loop structures such as imperfect loop nests. The performance characteristics (cycle time, chip area) of the proposed architectures have been evaluated for FPGA target implementations. It is shown that maximum clock frequencies of above 230MHz with low logic footprints of about 1.4% of the overall logic resources can be achieved for supporting up to 8 nested loops with 16-bit indices on a modestly-sized Xilinx Virtex-5 device.
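The index-update behavior that a loop controller for a perfect loop nest has to provide can be sketched in software. This is only a behavioral model under stated assumptions, not the hardware algorithm from the paper: the names `next_indices` and `run_nest` are hypothetical, every loop is assumed to count from 0 to a fixed bound, and the "zero-cycle" property is modeled simply as computing the whole next index vector in one step.

```python
from itertools import product


def next_indices(indices, bounds):
    """Advance a perfect-loop-nest index vector by one iteration.

    indices[0] is the outermost loop, indices[-1] the innermost.
    Returns (new_indices, done); done is True once the whole nest has finished.
    """
    new = list(indices)
    # Walk from the innermost loop outwards, like an odometer.
    for level in range(len(bounds) - 1, -1, -1):
        if new[level] + 1 < bounds[level]:
            new[level] += 1      # this loop still has iterations left
            return new, False
        new[level] = 0           # this loop wraps; carry into the next outer loop
    return new, True             # every loop wrapped: the nest is complete


def run_nest(bounds):
    """Enumerate all index vectors of the nest using next_indices."""
    if any(b <= 0 for b in bounds):
        return []
    indices, done, trace = [0] * len(bounds), False, []
    while not done:
        trace.append(tuple(indices))
        indices, done = next_indices(indices, bounds)
    return trace


# The model visits the same iterations, in the same order, as nested for-loops.
assert run_nest([2, 3]) == list(product(range(2), range(3)))
```

In hardware, the carry chain in `next_indices` would be evaluated combinationally, which is why the depth of the nest and the index width influence the achievable clock frequency.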
Microprocessors and Microsystems, 2012
The increased capacity and enhanced features of modern FPGAs open new opportunities for their use as application accelerators. However, for FPGAs to be accepted as mainstream acceleration solutions, long design cycles must be shortened by using high-level synthesis tools in the design process. Current HLS tools targeting FPGAs have several limitations, including the inefficient use of deeply pipelined arithmetic operators, commonly encountered in high-throughput FPGA designs. We focus here on the efficient generation of FPGA-specific hardware accelerators for regular codes with perfect loop nests where inner statements are implemented as a pipelined arithmetic operator, which is often the case for scientific codes using floating-point arithmetic. We propose a semi-automatic code generation process where the arithmetic operator is identified and generated. Its pipeline information is used to reschedule the initial program execution in order to keep the operator's pipeline as "busy" as possible, while minimizing memory access. Next, we show how our method can be used as a tool to generate control FSMs for multiple parallel computing cores. Finally, we show that accounting for the application's accuracy needs allows designing smaller and faster operators.
2007
This paper presents a framework for automatic mapping of perfectly nested loops with constant dependences onto regular processor arrays, suitable for direct implementation on Field Programmable Gate Arrays (FPGAs). The problem is modeled as that of finding a suitable completion procedure for a full-rank linear transformation on the iteration space. The approach enables extraction of necessary degrees of communication-free and pipelined parallelism to optimize performance under the resource constraints of limited logic resources and I/O bandwidth available on an FPGA. The generation of control signals for the custom processing elements is also addressed. Examples of automatic derivation of parallel designs for some common nested loops are provided. Experimental results on the Cray XD1 show that an FPGA-based matrix-multiplication design obtained using the framework attains significant speedup on the XD1's attached FPGA, when compared to execution on the XD1 CPU.
IEEE Transactions on Computers, 2008
Looping operations impose a significant bottleneck to achieving better computational efficiency for embedded applications. In this paper, a novel zero-overhead loop controller (ZOLC) supporting arbitrary loop structures with multiple-entry and multiple-exit nodes is described and utilized to enhance embedded RISC processors. A graph formalism is introduced for representing the loop structure of application programs, which can assist in ZOLC code synthesis. Also, a portable description of a ZOLC component is given in detail, which can be exploited in the scope of RTL synthesis for enabling its utilization. This description is designed to be easily retargetable to single-issue RISC processors, requiring only minimal effort for this task. The ZOLC unit has been incorporated into different RISC processor models and research ASIPs at different abstraction levels (RTL VHDL and ArchC) to provide effective means for low-overhead looping without negative impact on the processor cycle time. Average performance improvements of 25.5% and 44% are feasible for a set of kernel benchmarks on an embedded RISC and an application-specific processor, respectively. A corresponding 10% speedup is achieved on the same RISC for a subset of MiBench applications, not necessarily featuring the examined performance-critical kernels.
2014 IEEE 32nd International Conference on Computer Design (ICCD), 2014
Real-world applications such as image processing, signal processing, and others often contain a sequence of computation-intensive kernels, each represented in the form of a nested loop. High-level synthesis (HLS) enables efficient hardware implementation of these loops using high-level programming languages. HLS tools also allow the designers to evaluate design choices with different trade-offs through pragmas/directives. Prior design space exploration techniques for HLS primarily focus on either a single nested loop or multiple loops without considering the data dependencies among them. In this paper, we propose efficient design space exploration techniques for applications that consist of multiple nested loops with or without data dependencies. In particular, we develop an algorithm to derive the Pareto-optimal curve (performance versus area) of the application when mapped onto FPGAs using HLS. Our algorithm is efficient as it effectively prunes the dominated points in the design space. We also develop accurate performance and area models to assist the design space exploration process. Experiments on various scientific kernels and real-world applications demonstrate that our design space exploration technique is accurate and efficient.
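To make "pruning the dominated points" concrete, the following is a minimal sketch of a generic Pareto filter over (latency, area) pairs. It is not the algorithm proposed in the paper; the function name `pareto_front` and the sample design points are hypothetical, and both metrics are assumed to be minimized.

```python
def pareto_front(points):
    """Return the non-dominated (latency, area) points; both values are minimized.

    A point is dominated if another point is at least as good in both
    metrics and strictly better in at least one.
    """
    front = []
    best_area = float("inf")
    # After sorting by latency (ties broken by area), a point survives only
    # if its area is strictly smaller than that of every point before it.
    for latency, area in sorted(set(points)):
        if area < best_area:
            front.append((latency, area))
            best_area = area
    return front


# Hypothetical design points: (latency in cycles, area in LUTs).
designs = [(10, 500), (12, 300), (11, 600), (20, 300), (15, 100)]
front = pareto_front(designs)
```

The sort makes this an O(n log n) sweep for two objectives; exploring multiple dependent loop nests, as the abstract describes, requires more than this single-pass filter.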
2014
In order for FPGAs to be successful outside traditional markets, tools which enable software programmers to achieve high levels of system performance while abstracting away the FPGA-specific details are needed. DSP Builder Advanced (DSPBA) is one such tool. DSPBA provides a model-based design environment using Matlab's Simulink frontend that decouples the fully-algorithmic design description from the details of FPGA system generation. DSPBA offers several levels of debugging: from Simulink scopes to bit-accurate simulation and silver reference models. It also offers the most comprehensive set of fixed-point, floating-point and signal-processing IPs available today. The combination of 7 floating-point precisions, fused-datapath support, custom operator support and automated folding allows exploring the best tradeoffs between accuracy, size and throughput. The DSPBA backend protects users from the details of device-dependent operator mapping, offering both efficiency and prompt support for new devices and features such as the Arria 10 floating-point cores. The collection of features available in DSPBA allows both inexperienced and expert users to efficiently migrate performance-crucial systems to the FPGA architecture.
2008
Dynamic hardware generation reduces the number of FPGA resources needed and speeds up the application by optimizing the configuration for the exact problem at hand at run-time. If the problem changes, the system needs to be reconfigured. When this occurs too often, the total reconfiguration overhead is too high and the benefit of using dynamic hardware generation vanishes. Hence, it is important to minimize the number of reconfigurations. We propose a novel technique to reduce the number of reconfigurations by using loop transformations. Our approach is similar to temporal data locality optimizations. By applying our technique, we can drastically reduce the number of reconfigurations, as indicated by the matrix multiplication example. After applying the loop transformations, the number of reconfigurations decreases by an order of magnitude. Combined with a dynamic hardware generation technique with a very low overhead, our technique obtains a significant speedup over generic circuits.
Journal of Signal Processing Systems, 2017
Current tools for High-Level Synthesis (HLS) excel at exploiting Instruction-Level Parallelism (ILP). The support for Data-Level Parallelism (DLP), one of the key advantages of Field Programmable Gate Arrays (FPGAs), is in contrast very limited. This work examines the exploitation of DLP on FPGAs using code generation for C-based HLS of image filters and streaming pipelines. In addition to well-known loop tiling techniques, we propose loop coarsening, which delivers superior performance and scalability. Loop tiling corresponds to splitting an image into separate regions, which are then processed in parallel by replicated accelerators. For data streaming, this also requires the generation of glue logic for the distribution of image data. Conversely, loop coarsening allows processing multiple pixels in parallel, whereby only the kernel operator is replicated within a single accelerator. We present concrete implementations of tiling and coarsening for Vivado HLS and Altera OpenCL. Furthermore, we present a comparison of our implementations to the keyword-driven parallelization support provided by the Altera Offline Compiler. We augment the FPGA back end of the heterogeneous Domain-Specific Language (DSL) framework Hipacc to generate loop coarsening implementations for Vivado HLS and Altera OpenCL. Moreover, we compare the resulting FPGA accelerators to highly optimized software implementations for Graphics Processing Units (GPUs), all generated from exactly the same code base.
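The difference between the two strategies can be illustrated with a small software model. This is a sketch under stated assumptions, not the code generated by Hipacc: `kernel`, `process_tiled` and `process_coarsened` are hypothetical names, the operator is a simple point operator (so no border handling is needed), and parallel hardware is modeled by ordinary sequential loops.

```python
def kernel(pixel):
    """Hypothetical point operator: invert an 8-bit pixel."""
    return 255 - pixel


def process_tiled(image, tiles):
    """Loop tiling: split the image rows into regions.

    Each region models the work of one replicated accelerator.
    """
    height = len(image)
    rows_per_tile = (height + tiles - 1) // tiles
    out = []
    for t in range(tiles):  # in hardware, each tile runs on its own accelerator
        region = image[t * rows_per_tile:(t + 1) * rows_per_tile]
        out.extend([[kernel(p) for p in row] for row in region])
    return out


def process_coarsened(image, factor):
    """Loop coarsening: one accelerator, `factor` adjacent pixels per iteration."""
    out = []
    for row in image:
        out_row = []
        for x in range(0, len(row), factor):
            # only the kernel operator is replicated, inside a single loop body
            out_row.extend(kernel(p) for p in row[x:x + factor])
        out.append(out_row)
    return out


image = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
reference = [[kernel(p) for p in row] for row in image]
```

Both variants compute the same result as `reference`; they differ in what is replicated. Tiling replicates the whole accelerator and needs logic to distribute regions, whereas coarsening widens the data path of a single accelerator.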
ACM Transactions on Architecture and Code Optimization, 2008
The wider acceptance of FPGAs as a computing device requires a higher level of programming abstraction. ROCCC is an optimizing C-to-HDL compiler. We describe the code generation approach in ROCCC. The smart buffer is a component that reuses input data between adjacent iterations. It significantly improves the performance of the circuit and simplifies loop control. The ROCCC-generated datapath can execute one loop iteration per clock cycle when there is no loop dependency or there is only scalar recurrence variable dependency. ROCCC's approach to supporting while-loops operating on scalars makes the compiler able to move scalar iterative computation into hardware.
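The idea of reusing input data between adjacent iterations can be modeled as a sliding window that fetches each input sample exactly once. This is an illustrative sketch only, not ROCCC's smart buffer implementation; `windowed_sums`, the 3-tap window and the fetch counter are assumptions made for the example.

```python
from collections import deque


def windowed_sums(stream, taps=3):
    """Sum every window of `taps` adjacent samples, fetching each sample once.

    Returns (results, fetches), where fetches counts reads from the input.
    """
    window = deque(maxlen=taps)  # the oldest sample drops out automatically
    fetches = 0
    results = []
    for sample in stream:
        window.append(sample)    # exactly one new fetch per iteration
        fetches += 1
        if len(window) == taps:
            results.append(sum(window))
    return results, fetches


results, fetches = windowed_sums([1, 2, 3, 4, 5])
```

For five samples and a 3-tap window, the buffered version performs 5 fetches, whereas re-reading every window from memory would take 3 windows × 3 samples = 9 fetches.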
2006
To meet the conflicting goals of high-performance low-cost embedded systems, critical application loop nests are commonly executed on specialized hardware accelerators. These loop accelerators are traditionally designed in a single-function manner, wherein each loop nest is implemented as a dedicated hardware block. This paper focuses on hardware sharing across loop nests by creating multifunction loop accelerators, or accelerators capable of executing multiple algorithms.
High-Performance …, 1997
This paper considers the automatic synthesis of systolic architectures from nested loop algorithmic specifications. The high-level input is given in the form of uniform dependence loops with unit dependencies and the target architecture is a multidimensional systolic array with an unbounded number of cells. A complete methodology for the hardware synthesis of the resulting architecture, based on VHDL specifications, is presented. This methodology automatically detects all necessary computation and communication elements and produces optimal layouts. The theoretical framework of our method is based on the properties of the generalized UET grids. First, we calculate the optimal makespan for the generalized UET grids and then we establish the minimum number of systolic cells required to achieve the optimal makespan. The complexity of the proposed scheduling algorithm is completely independent of the size of the nested loop and depends only on its dimension, thus being the most efficient (in terms of complexity) known to us. All these methods were implemented and incorporated in an integrated software package which provides the designer with a powerful parallel design environment, from high-level algorithmic specifications to low-level (i.e., actual layouts) optimal implementation. Index terms: UET grid index space, optimal makespan, optimal mapping, number of systolic cells, uniform unit dependence vectors, VHDL based design automation.
2008
Complex algorithms and increased functionality are expanding the computation demands of embedded systems. Hardware accelerators are commonly used to meet these demands by executing critical application loop nests in custom logic, achieving performance requirements while minimizing hardware cost. Traditionally, these loop accelerators are designed in a single-function manner, wherein each loop nest is implemented as dedicated hardware.
Multimedia applications are examples of a class of algorithms that are both calculation and data intensive and have real-time requirements. As a result dedicated hardware acceleration is often needed.
ACACES Poster Abstracts, L'Aquila, Italy
2012 International Conference on Embedded Computer Systems (SAMOS), 2012
Application-specific MPSoCs are often used to implement high-performance data-intensive applications. MPSoC design requires a rapid and efficient exploration of the hardware architecture possibilities to adequately orchestrate the data distribution and architecture of parallel MPSoC computing resources. Behavioral specifications of data-intensive applications are usually given in the form of loop-based sequential code, which requires parallelization and task scheduling for an efficient MPSoC implementation. Existing approaches in application-specific hardware synthesis use loop transformations to efficiently parallelize single nested loops and use Synchronous Data Flows to statically schedule and balance the data production and consumption of multiple communicating loops. This creates a separation between data and task parallelism analyses, which can reduce the possibilities for throughput optimization in high-performance data-intensive applications. This paper proposes a method for a concurrent exploration of data and task parallelism when using loop transformations to optimize data transfer and storage mechanisms for both single and multiple communicating nested loops. This method provides orchestrated application-specific decisions on communication architecture, memory hierarchy and computing resource parallelism. It is computationally efficient and produces high-performance architectures.
2006
Looping operations impose a significant bottleneck to achieving better computational efficiency for embedded applications. To confront this problem on embedded RISC processors, an architectural modification involving the integration of a zero-overhead loop controller (ZOLC) has been suggested, supporting arbitrary loop structures with multiple-entry and multiple-exit nodes. In this paper, a graph formalism is introduced for representing the loop structure of application programs, which can assist in ZOLC code synthesis. Also, a portable description of a ZOLC component is given in detail, which can be exploited in the scope of RTL synthesis, compiler optimizations or assembly level transformations for enabling its utilization. This description is designed to be easily retargetable to single-issue RISC processors, requiring only minimal effort for this task.
High-level synthesis overcomes the high design effort required by using an FPGA by moving the hardware design to a higher abstraction level. At this higher level, loop transformations are used to improve the characteristics of the program. These transformations have a large impact on the resulting hardware, but their impact is only known after the time-consuming synthesis steps. This hinders a fast design space exploration. In this paper, we tackle this issue by estimating the performance of the hardware loop controller, an often overlooked component in other approaches. We present an equation-based model to estimate the area and clock frequency of the loop controller during high-level synthesis. In our approach, we manage to keep estimation errors reasonably low, so our estimation model can be used during design space exploration. Due to its simplicity, the overhead is minimal, which is critical when lots of design variants need to be estimated.
Proceedings of the ACM/SIGDA …, 2012
Memory bandwidth is critical to achieving high performance in many FPGA applications. The bandwidth of SDRAM memories is, however, highly dependent upon the order in which addresses are presented on the SDRAM interface. We present an automated tool for constructing an application-specific on-chip memory address sequencer which presents requests to the external memory with an ordering that optimizes off-chip memory bandwidth for a fixed on-chip memory resource. Within a class of algorithms described by affine loop nests, this approach can be shown to reduce both the number of requests made to external memory and the overhead associated with those requests. The data presented show a trade-off between the use of on-chip resources and achievable off-chip memory bandwidth, where efficiency gains of 3.6× to 4× on the external memory interface can be obtained at a cost of up to a 1.4× increase in the ALUTs dedicated to address generation circuits in an Altera Stratix III device.
Digital signal processors play a noteworthy role in electronic devices, biomedical applications, and communication protocols. Efficient IC design is a key factor in achieving low-power, high-throughput IP core development for portable devices. Digital signal processors play a major role in real-time computing and processing, but area overhead and power consumption are major drawbacks in meeting efficient design requirements. A flexible DSP architecture using a loop back algorithm is proposed to overcome existing design limitations. For example, the design of an 8-point FFT architecture requires 3 stages of butterfly computation units, whose 48 adders and 12 multipliers lead to high power and area consumption. To reduce area and power, the loop back algorithm is proposed; it requires 16 adders and 4 multipliers for the overall design. The design of various DSP templates, such as the FFT, a first-order FIR filter, and a second-order FIR filter, is also introduced; these are mapped into the architecture as processing elements and the loop back algorithm is applied. In the design of the FFT and FIR templates, components such as a parallel prefix adder, a shift-and-add multiplier, and a Baugh-Wooley multiplier are used to analyze the efficient design of the DSP architecture. Simulation and analysis of latency, area, and power efficiency against existing architectures are carried out using ModelSim 6.4a, with synthesis using Xilinx ISE 14.3. Keywords: Baugh-Wooley Multiplier, Loop Back Algorithm, Parallel Prefix Adders.
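The adder and multiplier counts quoted for the 8-point FFT can be checked with a short calculation. The per-butterfly costs used here (4 adders and 1 multiplier) are not stated in the abstract; they are an assumption inferred from the quoted totals, as is the reading that the loop back scheme reuses the hardware of a single butterfly stage. The function name `fft_resources` is hypothetical.

```python
def fft_resources(n_points, adders_per_butterfly=4, multipliers_per_butterfly=1,
                  reuse_one_stage=False):
    """Count adders and multipliers for a radix-2 FFT of n_points (a power of two).

    With reuse_one_stage=True, only one stage of butterflies is instantiated
    and its hardware is assumed to be reused for every stage.
    """
    stages = n_points.bit_length() - 1        # log2(n_points) for powers of two
    butterflies_per_stage = n_points // 2
    instantiated_stages = 1 if reuse_one_stage else stages
    butterflies = butterflies_per_stage * instantiated_stages
    return {
        "stages": stages,
        "adders": butterflies * adders_per_butterfly,
        "multipliers": butterflies * multipliers_per_butterfly,
    }


full = fft_resources(8)                          # all 3 stages instantiated
looped = fft_resources(8, reuse_one_stage=True)  # one stage, reused 3 times
```

Under these assumptions, `full` gives 3 stages, 48 adders and 12 multipliers, and `looped` gives 16 adders and 4 multipliers, matching the figures in the abstract. The saving in area comes at the cost of passing data through the same stage three times.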