Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments 2013, 2013
Parallel computing is emerging as an important area of research in computer architectures and software systems. Many algorithms can be greatly accelerated using parallel computing techniques. Specialized parallel computer architectures are used to accelerate specific tasks. Measurement systems in high-energy physics experiments often use FPGAs for fine-grained computation. An FPGA combines many benefits of both software and ASIC implementations: like software, the mapped circuit is flexible and can be reconfigured over the lifetime of the system. FPGAs therefore have the potential to achieve far greater performance than software by bypassing the fetch-decode-execute cycle of traditional processors and possibly exploiting a greater level of parallelism. Creating parallel programs implemented in FPGAs is, however, not trivial. This paper presents existing methods and tools for fine-grained computation implemented in FPGAs using behavioral description and high-level programming languages.
The Mini-Symposium "Parallel computing with FPGAs" aimed at exploring the many ways in which field programmable gate arrays can be arranged into high-performance computing blocks. Examples include high-speed operations obtained by sheer parallelism, numerical algorithms mapped into hardware, co-processing time critical sections and the development of powerful programming environments for hardware software co-design.
Current high-performance computing (HPC) applications are found in many consumer, industrial and research fields. From web searches to auto crash simulations to weather prediction, these applications demand large amounts of power for the compute farms and supercomputers required to run them. The demand for more and faster computation continues to increase, along with an even sharper increase in the cost of the power required to operate and cool these installations. The ability of standard processor-based systems to address these needs, in both speed of computation and power consumption, has declined over the past few years. This paper presents a new method of computation based on programmable logic, as represented by Field Programmable Gate Arrays (FPGAs), that addresses these needs while requiring only minimal changes to the current software design environment.
High-Performance Computing with FPGA-Based Parallel Data Processing Systems, 2024
Traditional Central Processing Unit (CPU) and Graphics Processing Unit (GPU) architectures are becoming unsuitable for High-Performance Computing (HPC) due to their high power consumption and inability to process data in real time. This study introduces a novel FPGA-based parallel data processing system that capitalizes on the remarkable reconfigurability of FPGAs to enhance the speed and efficiency of computation. The system's modular architecture combines pipeline parallelism with dataflow computation, enabling continuous, concurrent execution of tasks with substantially reduced latency. Throughput and scalability are improved by dynamic task scheduling, enhanced resource allocation, hardware-accelerated compute cores, and other critical features. Experimental results show that the proposed FPGA-based system achieves up to five times higher throughput, 60% lower latency, and 40% lower power consumption than conventional CPU- and GPU-based architectures.
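The latency and throughput benefits of pipeline parallelism follow from standard pipeline arithmetic. The sketch below is our own illustrative model with hypothetical cycle counts, not the authors' system; it contrasts a sequential design, where each item passes through every stage before the next item starts, with a pipelined dataflow design in which stages run concurrently:

```python
# Illustrative pipeline-arithmetic model (hypothetical stage cycle counts).

def sequential_metrics(stage_cycles, n_items):
    """Each item traverses all stages before the next item begins."""
    per_item = sum(stage_cycles)
    return {"latency": per_item, "total_cycles": per_item * n_items}

def pipelined_metrics(stage_cycles, n_items):
    """Stages run concurrently; throughput is limited by the slowest stage."""
    ii = max(stage_cycles)                 # initiation interval
    latency = sum(stage_cycles)            # fill time for one item
    total = latency + (n_items - 1) * ii   # classic pipeline cycle count
    return {"latency": latency, "total_cycles": total}

stages = [4, 2, 3]   # hypothetical per-stage cycle counts
seq = sequential_metrics(stages, 1000)
pipe = pipelined_metrics(stages, 1000)
print(seq["total_cycles"], pipe["total_cycles"])  # 9000 4005
```

The model makes the general point behind such designs: once the pipeline is full, one result emerges per initiation interval rather than per full pass, so total cycles drop from 9000 to 4005 in this toy configuration.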
2011
Field-Programmable Gate Arrays (FPGAs) are becoming increasingly popular as computing platforms for high-performance embedded systems. Their flexibility and customization capabilities allow them to achieve orders-of-magnitude better performance than conventional embedded computing systems. Programming FPGAs is, however, cumbersome and error-prone, and as a result their true potential is often achieved only at unreasonably high design effort.
2006
It has been shown that a small number of FPGAs can significantly accelerate certain computing tasks by up to two or three orders of magnitude. However, particularly intensive large-scale computing applications, such as molecular dynamics simulations of biological systems, underscore the need for even greater speedups to address relevant length and time scales.
Second NASA/ESA Conference on Adaptive Hardware and Systems (AHS 2007), 2007
We describe the FPGA HPC Alliance's Parallel Toolkit (PTK), an initial step towards the standardization of high-level configuration and APIs for high-performance reconfigurable computing (HPRC). We discuss the motivation and challenges of reaping the performance benefits of FPGAs for memory-bound HPC codes and describe the approach we have taken on the FHPCA supercomputer Maxwell.
2013 23rd International Conference on Field programmable Logic and Applications, 2013
Whether for use as the final target or simply a rapid prototyping platform, programming systems containing FPGAs is challenging. Some of the difficulty is due to the difference between the models used to program hardware and software, but great effort is also required to coordinate the simultaneous execution of the application running on the microprocessor with the accelerated kernel(s) running on the FPGA. In this paper we present a new methodology and programming model for introducing hardware-acceleration to an application running in software. The application is represented as a data-flow graph and the computation at each node in the graph is specified for execution either in software or on the FPGA using the programmer's language of choice. We have implemented an interface compiler which takes as its input the FIFO edges of the graph and generates code to connect all the different parts of the program, including those which communicate across the hardware/software boundary. Our methodology and compiler enable programmers to effectively exploit FPGA acceleration without ever leaving the application space.
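As a rough illustration of that programming model, the sketch below uses our own names, not the paper's interface-compiler API. Nodes are plain functions that communicate only through FIFO edges; because a node sees nothing but its FIFOs, the same graph description could bind a node to software or to an FPGA kernel:

```python
# Toy data-flow graph with FIFO edges (illustrative names, not the paper's API).
from collections import deque

class FifoEdge:
    """A FIFO channel; in hardware this would be an on-chip FIFO."""
    def __init__(self):
        self.q = deque()
    def push(self, v):
        self.q.append(v)
    def pop(self):
        return self.q.popleft()
    def empty(self):
        return not self.q

def fire(node_fn, in_edges, out_edge):
    """Fire a node repeatedly while all of its input FIFOs hold data."""
    while all(not e.empty() for e in in_edges):
        out_edge.push(node_fn(*(e.pop() for e in in_edges)))

# Graph: source FIFO -> scale node (could run on the FPGA) -> sink FIFO
src, dst = FifoEdge(), FifoEdge()
for x in range(5):
    src.push(x)
fire(lambda x: 2 * x, [src], dst)
result = []
while not dst.empty():
    result.append(dst.pop())
print(result)  # [0, 2, 4, 6, 8]
```

The point of the FIFO discipline is that the scale node's implementation can be swapped between software and a hardware kernel without touching the rest of the graph, which is the hardware/software boundary the paper's interface compiler generates code for.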
2013
Future computing systems will require dedicated accelerators to achieve high-performance. The mini-symposium ParaFPGA explores parallel computing with FPGAs as an interesting avenue to reduce the gap between the architecture and the application. Topics discussed are the power of functional and dataflow languages, the performance of high-level synthesis tools, the automatic creation of hardware multi-cores using C-slow retiming, dynamic power management to control the energy consumption, real-time reconfiguration of streaming image processing filters and memory optimized event image segmentation.
… , Signal Processing and …, 2009
Algorithms used in signal and image processing applications are computationally intensive. For optimized hardware realization of such algorithms with efficient utilization of available resources, an in-depth knowledge of the targeted field programmable gate array (FPGA) technology is required. This paper presents an overview of the architectures and technologies used in modern FPGAs. A case study of the most popular and widely used state-of-the-art commercial FPGA technologies from Xilinx and Altera is also presented. Three-dimensional (3D) FPGA architecture is discussed as well.
IEEE Design & Test of Computers, 2011
As part of their ongoing work with the National Science Foundation (NSF) Center for High-Performance Reconfigurable Computing (CHREC), the authors are developing a complete tool chain for FPGA-based acceleration of scientific computing, from early-stage assessment of applications down to rapid routing. This article provides an overview of this tool chain.
MASAUM Journal of Computing, 2009
Algorithms used in signal processing, image processing and high-performance computing applications are computationally intensive. For efficient implementation of such algorithms with efficient utilization of available resources, an in-depth knowledge of the targeted field programmable gate array (FPGA) technology is required. This paper presents a state-of-the-art review of the architectures and technologies used in modern FPGAs. A case study of the most popular and widely used state-of-the-art commercial FPGA technologies from Xilinx and Altera is also presented. The upcoming three-dimensional (3D) FPGA architecture is also discussed.
2010
While ASIC design and manufacturing costs are soaring with each new technology node, the computing power and logic capacity of modern FPGAs steadily advance. High-performance computing with FPGA-based systems therefore becomes increasingly attractive and viable. Unfortunately, truly unleashing the computing potential of FPGAs often requires cumbersome HDL programming and laborious manual optimization. To circumvent such challenges, we propose a Many-core Approach to Reconfigurable Computing (MARC) that (i) allows programmers to easily express parallelism through a high-level programming language, (ii) supports coarse-grain multithreading and dataflow-style fine-grain threading while permitting bit-level resource control, and (iii) greatly reduces the effort required to repurpose the hardware system for different algorithms or different applications. Leveraging a many-core architectural template, sophisticated logic synthesis techniques, and state-of-the-art compiler optimization techniques...
Computing Research Repository, 2007
This paper describes JANUS, a modular, massively parallel and reconfigurable FPGA-based computing system. Each JANUS module has a computational core and a host. The computational core is a 4x4 array of FPGA-based processing elements with nearest-neighbor data links. Processors are also directly connected to an I/O node attached to the JANUS host, a conventional PC. JANUS is tailored for, but not limited to, the requirements of a class of hard scientific applications characterized by regular code structure, unconventional data-manipulation instructions and a not-too-large database size.
Parallel Computing, 2007
High-performance computing using accelerators. A recent trend in high-performance computing is the development and use of heterogeneous architectures that combine fine-grain and coarse-grain parallelism using tens or hundreds of disparate processing cores. These processing cores are available as accelerators or many-core processors, which are designed with the goal of achieving higher parallel-code performance. This is in contrast with traditional multicore CPUs that effectively replicate serial CPU cores. The recent demand for these accelerators comes primarily from consumer applications, including computer gaming and multimedia. Examples of such accelerators include graphics processing units (GPUs), Cell Broadband Engines (Cell BEs), field-programmable gate arrays (FPGAs), and other data-parallel or streaming processors. Compared to conventional CPUs, accelerators can offer an order-of-magnitude improvement in performance per dollar as well as per watt. Moreover, some recent industry announcements point towards the design of heterogeneous processors and computing environments, which are scalable from a system with a single homogeneous processor to a high-end computing platform with tens, or even hundreds, of thousands of heterogeneous processors. This special issue on "High-Performance Computing Using Accelerators" includes many papers on such commodity many-core processors, including GPUs, Cell BEs, and FPGAs. GPGPUs: Current top-of-the-line GPUs have tens or hundreds of fragment processors and high memory bandwidth, i.e., 10x more than current CPUs. This processing power of GPUs has been successfully exploited for scientific, database, geometric and imaging applications (hence GPGPU, short for General-Purpose computation on GPUs). The significant increase in parallelism within a processor can also lead to other benefits, including higher power efficiency and better memory-latency tolerance.
In many cases, an order-of-magnitude performance improvement was shown compared to top-of-the-line CPUs. For example, GPUTeraSort used the GPU interface to drive memory more efficiently and achieved a threefold improvement in records/second/CPU. Similarly, some of the fastest algorithms for many numerical computations, including FFT, dense matrix multiplication, linear solvers, and collision and proximity computations, use GPUs to achieve tremendous speed-ups. Cell Broadband Engines: The Cell Broadband Engine is a joint venture between Sony, Toshiba, and IBM. It appears in consumer products such as Sony's PlayStation 3 computer entertainment system and Toshiba's Cell Reference Set, a development tool for Cell Broadband Engine applications. When viewed as a processor, the Cell can exploit the orthogonal dimensions of task and data parallelism on a single chip. The Cell processor consists of a symmetric multi-threaded (SMT) Power Processing Element (PPE) and eight Synergistic Processing Elements (SPEs) with pipelined SIMD capabilities. The processor achieves a theoretical peak performance of over 200 Gflops for single-precision floating-point calculations and has a peak memory bandwidth of over 25 GB/s. Actual speed-up factors achieved when automatically parallelizing sequential code kernels via the Cell's pipelined SIMD capabilities reach as high as 26-fold. Field-Programmable Gate Arrays (FPGAs): FPGAs support the notion of reconfigurable computing and offer a high degree of on-chip parallelism that can be mapped directly from the dataflow characteristics of an application's parallel algorithm. Their recent emergence in the high-performance computing arena can be attributed to a hybrid approach that combines the logic blocks and interconnects of traditional FPGAs with
Journal of Signal Processing Systems, 2017
Current tools for High-Level Synthesis (HLS) excel at exploiting Instruction-Level Parallelism (ILP). The support for Data-Level Parallelism (DLP), one of the key advantages of Field Programmable Gate Arrays (FPGAs), is in contrast very limited. This work examines the exploitation of DLP on FPGAs using code generation for C-based HLS of image filters and streaming pipelines. In addition to well-known loop tiling techniques, we propose loop coarsening, which delivers superior performance and scalability. Loop tiling corresponds to splitting an image into separate regions, which are then processed in parallel by replicated accelerators. For data streaming, this also requires the generation of glue logic for the distribution of image data. Conversely, loop coarsening allows processing multiple pixels in parallel, whereby only the kernel operator is replicated within a single accelerator. We present concrete implementations of tiling and coarsening for Vivado HLS and Altera OpenCL. Furthermore, we present a comparison of our implementations to the keyword-driven parallelization support provided by the Altera Offline Compiler. We augment the FPGA back end of the heterogeneous Domain-Specific Language (DSL) framework Hipacc to generate loop coarsening implementations for Vivado HLS and Altera OpenCL. Moreover, we compare the resulting FPGA accelerators to highly optimized software implementations for Graphics Processing Units (GPUs), all generated from exactly the same code base.
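The distinction between the two schemes can be sketched in plain Python. This is our own toy code using a pointwise kernel, not the paper's generated Vivado HLS or Altera OpenCL output: tiling replicates the whole accelerator across image regions, while coarsening widens one accelerator so the kernel operator is applied to several pixels per loop iteration:

```python
# Toy comparison of loop tiling vs. loop coarsening (illustrative only).

def kernel(px):
    """Stand-in for the replicated kernel operator (a pointwise square)."""
    return px * px

def tiled(image, n_tiles):
    """Each tile would map to its own replicated accelerator instance.
    Assumes the image divides evenly into tiles."""
    tile = len(image) // n_tiles
    out = []
    for t in range(n_tiles):                # tiles run in parallel on hardware
        out.extend(kernel(p) for p in image[t * tile:(t + 1) * tile])
    return out

def coarsened(image, factor):
    """One accelerator; `factor` kernel operators fire per iteration."""
    out = []
    for i in range(0, len(image), factor):  # wider datapath per iteration
        out.extend(kernel(p) for p in image[i:i + factor])
    return out

img = list(range(8))
print(tiled(img, 4) == coarsened(img, 2) == [p * p for p in img])  # True
```

For a pointwise kernel both schemes compute the same result; the paper's observation is about hardware cost: tiling duplicates the entire accelerator (including the glue logic that distributes streamed image data), whereas coarsening duplicates only the kernel operator inside a single accelerator.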
IEEE Transactions on Computers, 2021
This paper presents the new features of the OmpSs@FPGA framework. OmpSs is a data-flow programming model that supports task nesting and dependencies to target asynchronous parallelism and heterogeneity. OmpSs@FPGA is the extension of the programming model addressed specifically to FPGAs. The OmpSs environment is built on top of the Mercurium source-to-source compiler and the Nanos++ runtime system. To address FPGA specifics, the Mercurium compiler implements several FPGA-related features, such as local variable caching, wide memory accesses and accelerator replication. In addition, part of the Nanos++ runtime has been ported to hardware. Driven by the compiler, this new hardware runtime adds new features to FPGA codes, such as task creation and dependence management, providing both performance increases and ease of programming. To demonstrate these new capabilities, different high-performance benchmarks have been evaluated on different FPGA platforms using the OmpSs programming model. The results show that programs using the OmpSs programming model achieve very competitive performance with low to moderate porting effort compared to other FPGA implementations.
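OmpSs itself expresses tasks through compiler pragmas; as a language-neutral illustration of the dependence management the hardware runtime performs, the toy scheduler below (our own model, not OmpSs syntax) releases a task only once every datum it reads has been produced:

```python
# Toy dependence-driven task scheduler (not OmpSs syntax, which uses pragmas).

def schedule(tasks, initially_ready):
    """Run tasks in an order that satisfies their in/out data dependences."""
    produced = set(initially_ready)
    order, pending = [], list(tasks)
    while pending:
        for t in pending:
            if all(d in produced for d in t["in"]):  # all inputs available?
                order.append(t["name"])
                produced.update(t["out"])            # publish outputs
                pending.remove(t)
                break
        else:
            raise RuntimeError("deadlock: unsatisfiable dependences")
    return order

tasks = [
    {"name": "gemm",   "in": ["A", "B"], "out": ["C"]},  # must wait for loads
    {"name": "load_A", "in": [],         "out": ["A"]},
    {"name": "load_B", "in": [],         "out": ["B"]},
]
print(schedule(tasks, []))  # ['load_A', 'load_B', 'gemm']
```

This is the essence of what the paper describes moving into hardware: tracking which data each FPGA task reads and writes, and releasing tasks as their dependences are satisfied, without a round-trip to the host runtime.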
International Journal of …, 2010
Field-Programmable Gate Arrays (FPGAs) are becoming increasingly important in embedded and high-performance computing systems. They allow performance levels close to those obtained with Application-Specific Integrated Circuits (ASICs), while still keeping design and implementation flexibility. However, to efficiently program FPGAs, one needs the expertise of hardware developers to master hardware description languages (HDLs) such as VHDL or Verilog. Attempts to provide a high-level compilation flow (e.g., from C programs) must still address open issues before broadly efficient results can be obtained. Bearing in mind the hardware resources available in contemporary FPGAs, we developed LALP (Language for Aggressive Loop Pipelining), a novel language for programming FPGA-based accelerators, and its compilation framework. The main ideas behind LALP are to provide a higher abstraction level than HDLs, to exploit the intrinsic parallelism of hardware resources, and to allow the programmer to control execution stages whenever compiler techniques are unable to generate efficient implementations. Those features are particularly useful for implementing loop pipelining, a well-regarded technique used to accelerate computations in several application domains. This paper describes LALP and shows how it can be used to achieve high-performance embedded computing solutions.
Computing in Science & Engineering, 2000
This paper describes JANUS, a modular, massively parallel and reconfigurable FPGA-based computing system. Each JANUS module has a computational core and a host. The computational core is a 4x4 array of FPGA-based processing elements with nearest-neighbor data links. Processors are also directly connected to an I/O node attached to the JANUS host, a conventional PC. JANUS is tailored for, but not limited to, the requirements of a class of hard scientific applications characterized by regular code structure, unconventional data-manipulation instructions and a not-too-large database size. We discuss the architecture of this configurable machine and focus on its use for Monte Carlo simulations of statistical mechanics. On this class of application JANUS achieves impressive performance: in some cases one JANUS processing element outperforms a high-end PC by a factor of 1000. We also discuss the role of JANUS in other classes of scientific applications.