2012, International Journal of Reconfigurable Computing
We present a highly productive approach to hardware design based on a many-core microarchitectural template used to implement compute-bound applications expressed in a high-level data-parallel language such as OpenCL. The template is customized on a per-application basis via a range of high-level parameters such as the interconnect topology or processing element architecture.
2010
While ASIC design and manufacturing costs soar with each new technology node, the computing power and logic capacity of modern FPGAs steadily advance. High-performance computing with FPGA-based systems therefore becomes increasingly attractive and viable. Unfortunately, truly unleashing the computing potential of FPGAs often requires cumbersome HDL programming and laborious manual optimization. To circumvent these challenges, we propose a Many-core Approach to Reconfigurable Computing (MARC) that (i) allows programmers to easily express parallelism through a high-level programming language, (ii) supports coarse-grain multithreading and dataflow-style fine-grain threading while permitting bit-level resource control, and (iii) greatly reduces the effort required to repurpose the hardware system for different algorithms or different applications. Leveraging a many-core architectural template, sophisticated logic synthesis techniques, and state-of-the-art compiler optimization te...
2011 IEEE 19th Annual International Symposium on Field-Programmable Custom Computing Machines, 2011
The problem of automatically generating hardware modules from a high-level representation of an application has been at the research forefront in recent years. In this paper, we use OpenCL, an industry-supported standard for writing programs that execute on multicore platforms and accelerators such as GPUs. Our architectural synthesis tool, SOpenCL (Silicon-OpenCL), adapts OpenCL into a novel hardware design flow which efficiently maps the coarse- and fine-grained parallelism of an application onto an FPGA reconfigurable fabric. SOpenCL is based on a source-to-source code transformation step that coarsens the OpenCL fine-grained parallelism into a series of nested loops, and on a template-based hardware generation back-end that configures the accelerator based on the functionality of the application and its performance and area requirements. Our experimentation with a variety of OpenCL and C kernel benchmarks reveals that area-, throughput-, and frequency-optimized hardware implementations are attainable using SOpenCL.
2014 International Conference on ReConFigurable Computing and FPGAs (ReConFig14), 2014
Hardware accelerators are capable of achieving significant performance improvements. However, hardware accelerator design lacks flexibility and productivity. Combining hardware accelerators with a multiprocessor system-on-chip (MPSoC) is an alternative way to balance flexibility, productivity, and performance. In this work, we present a unified hybrid OpenCL-flavor (HOpenCL) parallel programming model on an MPSoC supporting both hardware and software kernels. By integrating the HOpenCL hardware IPs and software libraries, the same kernel function can execute either as a hardware kernel on the dedicated hardware accelerators or as a software kernel on the general-purpose processors. Using the automatic design flow, the corresponding hybrid hardware platform is generated along with the executable. We use a 512×512 matrix multiplication to examine the potential of our hybrid system in terms of performance, scalability, and productivity. The results show that hardware kernels reach more than 10 times speedup compared with software kernels. Our prototype platform also demonstrates good performance scalability as the number of group computation units (GCUs) increases from 1 to 6, until the problem becomes memory bound. Compared with the hard ARM core on the Zynq 7045 device, we find that the performance of one ARM core is equivalent to 2 or 3 GCUs with software kernel implementations. On the other hand, a single GCU with a hardware kernel implementation is 5 times faster than the ARM core.
2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2011
The problem of automatically generating hardware modules from high-level application representations has been at the forefront of EDA research during the last few years. In this paper, we introduce a methodology to automatically synthesize hardware accelerators from OpenCL applications. OpenCL is a recent industry-supported standard for writing programs that execute on multicore platforms and accelerators such as GPUs. Our methodology maps OpenCL kernels into hardware accelerators, based on architectural templates that explicitly decouple computation from memory communication whenever this is possible. The templates can be tuned to provide a wide repertoire of accelerators that meet user performance requirements and FPGA device characteristics. Furthermore, a set of high- and low-level compiler optimizations is applied to generate optimized accelerators. Our experimental evaluation shows that the generated accelerators are tuned efficiently to match each application's memory access pattern and computational complexity, and to achieve user performance requirements. An important objective of our tool is to expand the FPGA development user base to software engineers, thereby expanding the scope of FPGAs beyond the realm of hardware design.
2010
OpenCL is a programming language standard which enables the programmer to express the application by structuring its computation as kernels. The OpenCL compiler is given the explicit freedom to parallelize the execution of kernel instances at all the levels of parallelism. In comparison to the traditional C programming language which is sequential in nature, OpenCL enables higher utilization of parallelism naturally available in hardware constructs while still having a feasible learning curve for engineers familiar with the C language.
ACM Transactions on Embedded Computing Systems, 2015
The design cycle for complex special-purpose compute systems is extremely costly and time-consuming. It involves a multi-parametric design space exploration for optimization, followed by design verification. Designers of special-purpose VLSI implementations often need to explore parameters, such as optimal bitwidth and data representation, through time-consuming Monte-Carlo simulations. A prominent example of this simulation-based exploration process is the design of decoders for error correcting systems, such as Low-Density Parity-Check (LDPC) codes, adopted by modern communication standards, which involves thousands of Monte-Carlo runs for each design point. Currently, high-performance computing offers a wide set of acceleration options that range from multicore CPUs to graphics processing units (GPUs) and FPGAs. The exploitation of diverse target architectures is typically associated with developing multiple code versions, often using distinct programming paradigms. In this context we evaluate the concept of retargeting a single OpenCL program to multiple platforms, thereby significantly reducing design time. A single OpenCL-based parallel kernel is used without modifications or code tuning on multicore CPUs, GPUs, and FPGAs. We use SOpenCL (Silicon to OpenCL), a tool that automatically converts OpenCL kernels to RTL, in order to introduce FPGAs as a potential platform to efficiently execute simulations coded in OpenCL. We use LDPC decoding simulations as a case study. Experimental results were obtained by testing a variety of regular and irregular LDPC codes that range from short/medium (e.g., 8000-bit) to large-length (e.g., 64800-bit) DVB-S2 codes. We observe that, depending on the design parameters to be simulated and on the dimension and phase of the design, the GPU or the FPGA may suit different purposes more conveniently, providing different acceleration factors over conventional multicore CPUs.
Nowadays, multi-core architectures have become mainstream in the microprocessor industry. However, as the number of cores integrated on a single chip grows, the need for an adequate programming model becomes more important. In recent years, the OpenCL programming model has attracted the attention of the multi-core designer community. This paper presents an OpenCL-compliant architecture and demonstrates that this programming model can be successfully used for general-purpose multi-core architectures.
2012
The capacity of FPGA devices has reached the 1-million-LUT level, which provides space to accommodate a complete Multiprocessor System-on-Chip on the programmable device. In this work we propose a design flow to create an application-specific computation architecture from an application described in OpenCL. The application-specific system is based on a distributed memory hierarchy in which each processor is equipped with its own local memories for data and instructions. The multiple kernels in an application are mapped to individual processors to achieve processing parallelism. In order to efficiently use the limited resources on FPGA devices, the sizes of the local data and instruction memories are determined by analyzing the application source code. Compared with traditional shared memory architectures, the application-specific platform is capable of reducing the resource requirement, providing more processing parallelism, and achieving higher performance.
Lecture Notes in Computer Science, 2018
Heterogeneous computing systems with multiple CPUs and GPUs are increasingly popular. Today, heterogeneous platforms are deployed in many setups, ranging from low-power mobile systems to high performance computing systems. Such platforms are usually programmed using OpenCL, which allows the same program to execute on different types of device. Nevertheless, programming such platforms is challenging for most non-expert programmers. To achieve efficient application execution on heterogeneous platforms, programmers require an efficient workload distribution to the available compute devices. Deciding how the application should be mapped is non-trivial. In this paper, we present a new approach to building accurate predictive models for OpenCL programs. We use a machine learning-based predictive model to estimate which device yields the best application speed-up. Using the LLVM compiler framework, we developed a tool for dynamic code-feature extraction. We demonstrate the effectiveness of our novel approach by applying it to different prediction schemes. Using our dynamic feature extraction techniques, we are able to build accurate predictive models, with accuracies varying between 77% and 90%, depending on the prediction mechanism and the scenario. We evaluated our method on an extensive set of parallel applications. One of our findings is that dynamically extracted code features improve the accuracy of the predictive models by 6.1% on average (maximum 9.5%) as compared to the state of the art.
Journal of Signal Processing Systems
While the data-parallel aspects of OpenCL have been of primary interest due to the focus on massively data-parallel GPUs, OpenCL also provides powerful capabilities for describing task parallelism. In this article we study the task-parallel concepts available in OpenCL and examine how well the different vendor-specific implementations can exploit task parallelism when the parallelism is described in various ways using the command queues. We show that the vendor implementations are not yet capable of extracting kernel-level task parallelism from in-order queues automatically. To assess the potential performance benefits of in-order queue parallelization, we implemented such capabilities in an open-source implementation of OpenCL. The evaluation was conducted by means of a case study of an advanced noise reduction algorithm described as a multi-kernel OpenCL application.
Many data-intensive tasks can be accelerated significantly by switching to accelerators such as GPUs and coprocessors. Alas, large portions of the algorithms often need to be re-implemented, as different accelerators typically require different programming interfaces. The heterogeneous programming model of OpenCL was designed as a standard to provide functional portability across a variety of supported devices. However, OpenCL does not guarantee performance portability. The same algorithm may need to be implemented differently, albeit using the same language and interfaces, to achieve optimal performance on different devices. We demonstrate how OpenCL facilitates different device-specific optimizations while maintaining the same kernel codebase. We also show how to optimize execution parameters based on runtime inputs such as dataset size and target device, which is made possible with OpenCL. We employ a grid-based data mining regression and classification application as an example. Our approach optimizes code on the fly, with only minimal modifications applied for a target device. We focus on general performance portability across the NVIDIA Tesla K20X GPU and the new many-core Intel Xeon Phi coprocessor. Finally, we report and analyze performance measurements for both types of devices.
International journal of engineering research and technology, 2013
With the rapid development of smart phones, tablets, and pads, there has been widespread adoption of Graphics Processing Units (GPUs). The hand-held market is now seeing an ever-increasing rate of development of computationally intensive applications, which require significant amounts of processing resources. To meet this challenge, GPUs can be used for general-purpose processing. We are moving towards a future where devices will be more connected and integrated. This will allow applications to run on handheld devices while offloading computationally intensive tasks to other available compute units. There is a growing need for a general programming framework which can utilize heterogeneous processing units such as GPUs and DSPs. OpenCL, a widely used programming framework, has been a step towards integrating these different processing units on desktop platforms. We present a thorough study of the factors that impact the performance of OpenCL on CPUs. Specifically, starting from the two main arc...
Many-core architectures are now a reality, but programming them remains a challenge.
Sigplan Notices, 2002
This paper describes an automated approach to hardware design space exploration, through a collaboration between parallelizing compiler technology and high-level synthesis tools. We present a compiler algorithm that automatically explores the large design spaces resulting from the application of several program transformations commonly used in application-specific hardware designs. Our approach uses synthesis estimation techniques to quantitatively evaluate alternate designs for a loop nest computation. We have implemented this design space exploration algorithm in the context of a compilation and synthesis system called DE-FACTO, and present results of this implementation on five multimedia kernels. Our algorithm derives an implementation that closely matches the performance of the fastest design in the design space, and among implementations with comparable performance, selects the smallest design. We search on average only 0.3% of the design space. This technology thus significantly raises the level of abstraction for hardware design and explores a design space much larger than is feasible for a human designer.
2015
Modern platform FPGAs are sufficiently dense to allow the assembly of a complete chip heterogeneous multiprocessor system-on-chip (CHMP) within a single die. Every research group that sets out to explore how an application can be accelerated on an FPGA platform must first integrate processors, buses, memories, and support IP components into a base architecture before beginning its application-specific analysis. Lacking a computer architecture background, many application developers spend significant amounts of valuable research time creating and debugging the base CHMP system. In this paper, we present Archborn, an open-source tool that automates the generation of modifiable CHMP architectures. Archborn can be used by application developers to create a base CHMP system that can be synthesized in seconds. The systems created by Archborn can also be modified by those who wish to create fully customized CHMP systems. We show the ease and flexibility of Archborn by automatically generating unique systems, including a NUMA architecture that is modified with an Hthreads hw/sw co-designed microkernel for multithreading, and a NUMA architecture to support the OpenCL programming model. We present analysis of resource utilization and scalability for creating systems with up to 64 processing elements (PEs).
This work is motivated by the desire to maximize use of cheap GPU processing power by broadening the kinds of algorithms that can profit from it. Currently, only data-parallel algorithms profit from running their code on GPUs. One way forward is a programming language whose code can be treated as data and streamed through an OpenCL kernel. Herein is the design of such a language, so that not only data-parallel algorithms can reap the benefits of the GPU, but any referentially transparent code can.
When adapting GPU-specific OpenCL kernels to run on multi-core/many-core CPUs, coarsening the thread granularity is necessary and thus extensively used. However, locality optimizations expressed in GPU-specific OpenCL code are usually inherited without analysis, which may degrade CPU performance. When executing GPU-specific kernels on CPUs, local-memory arrays no longer match the hardware well, and the associated synchronizations are costly. To solve this dilemma, we actively analyze the memory access patterns using array-access descriptors derived from GPU-specific kernels, which can then be adapted for CPUs by removing all the unwanted local-memory arrays together with the obsolete barrier statements. Experiments show that the automated transformation satisfactorily improves OpenCL kernel performance on a Sandy Bridge CPU and on Intel's Many-Integrated-Core coprocessor.
In modern mobile embedded systems, various energy-efficient hardware acceleration units are employed in addition to a multi-core CPU. To fully utilize the computational power of such heterogeneous systems, the Open Computing Language (OpenCL) has been proposed. A key benefit of OpenCL is that it works on various computing platforms. However, most vendors offer OpenCL software development kits (SDKs) that support only their own computing platforms. The study of the OpenCL framework for embedded multi-core CPUs is in a rudimentary stage. In this paper, an OpenCL framework for embedded multi-core CPUs that dynamically redistributes the time-varying workload to CPU cores in real time is proposed. A compilation environment for both host programs and OpenCL kernel programs was developed, and OpenCL libraries were implemented. A performance evaluation was carried out with respect to various definitions of the device architecture and the execution model. When running on embedded multi-core CPUs, applications parallelized with OpenCL C showed much better performance than applications written in C without parallelization. Furthermore, since programmers can manage hardware resources and threads through OpenCL application programming interfaces (APIs), highly efficient computing, in terms of both performance and energy consumption, on a heterogeneous computing platform can be easily achieved.
2012
Modeling emerging multicore architectures is challenging and imposes a tradeoff between simulation speed and accuracy. An effective practice that balances both targets well is to map the target architecture onto FPGA platforms. We find that accurate prototyping of hundreds of cores on existing FPGA boards faces at least one of the following problems: (i) limited fast memory resources (SRAM) to model caches, (ii) insufficient inter-board connectivity for scaling the design, or (iii) excessive board cost.
2013 23rd International Conference on Field programmable Logic and Applications, 2013
Whether for use as the final target or simply a rapid prototyping platform, programming systems containing FPGAs is challenging. Some of the difficulty is due to the difference between the models used to program hardware and software, but great effort is also required to coordinate the simultaneous execution of the application running on the microprocessor with the accelerated kernel(s) running on the FPGA. In this paper we present a new methodology and programming model for introducing hardware-acceleration to an application running in software. The application is represented as a data-flow graph and the computation at each node in the graph is specified for execution either in software or on the FPGA using the programmer's language of choice. We have implemented an interface compiler which takes as its input the FIFO edges of the graph and generates code to connect all the different parts of the program, including those which communicate across the hardware/software boundary. Our methodology and compiler enable programmers to effectively exploit FPGA acceleration without ever leaving the application space.