2017 25th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), 2017
OpenACC is a parallel programming model for hardware accelerators, such as GPUs or the Xeon Phi, which has been under development for several years now. During this time, different compilers have appeared, both commercial and open source, which are still in the development stage. Because both the OpenACC standard and its implementations are relatively recent, we propose a benchmark suite specifically designed to check the performance of OpenACC features in the code generated by different compilers on different architectures. Our benchmark suite is named TORMENT OpenACC2016. Along with this tool we have developed a metric suited to comparing performance among different machine-compiler pairs, which we have named the TORMENT ACC2016 Score. Version 1 of TORMENT OpenACC2016, presented in this paper, contains six benchmarks and is available online.
Lecture Notes in Computer Science, 2016
OpenACC has been under development for a few years now. The OpenACC 2.5 specification was recently made public, and there are initiatives for developing full implementations of the standard to exploit accelerator capabilities. Much remains to be done, but OpenACC for GPUs is currently reaching a good level of maturity in various implementations of the standard, using CUDA and OpenCL as backends. Nvidia is investing in this project and has released an OpenACC Toolkit that includes the PGI compiler. There are, however, more developments out there. In this work, we analyze the different available OpenACC compilers that have been developed by companies and universities in recent years. We check their performance and maturity, keeping in mind that OpenACC is designed to be used without extensive knowledge of parallel programming. Our results show that the compilers are on their way to a reasonable maturity, presenting different strengths and weaknesses.
2016 45th International Conference on Parallel Processing (ICPP), 2016
We evaluate SAFARA and the new clauses using the SPEC and NAS OpenACC benchmarks; our results suggest that these approaches will be effective for improving the overall performance of code executing on GPUs. We achieved speedups of up to 2.5× on the NAS benchmarks and 2.08× on the SPEC benchmarks.
International Journal of Parallel Programming, 2014
This paper describes PETRA: a portable performance evaluation tool for parallelizing compilers and their individual techniques. Automatic parallelization of sequential programs combined with performance tuning is an important alternative to manual parallelization for exploiting the performance potential of today's multicores. Given the renewed interest in autoparallelization, this paper aims at a comprehensive evaluation, identifying strengths and weaknesses in the underlying techniques. The findings allow engineers to make informed decisions about techniques to include in industrial products and direct researchers to potential improvements. We present an experimental methodology and a fully automated implementation for comprehensively evaluating the effectiveness of parallelizing compilers and their underlying optimization techniques. The methodology is the first to (1) include automatic tuning, (2) measure the performance contributions of individual techniques at multiple optimization levels, and (3) quantify the interactions of compiler optimizations. The results will also help close the gap between research compilers and industrial compilers, which are still far behind. We applied the proposed methodology using PETRA on five modern parallelizing compilers and their tuning capabilities, illustrating several use cases and applications for the evaluation tool. We report speedups, parallel coverage, and the number of parallel loops, using the NAS Benchmarks as a program suite. We found parallelizers to be reasonably successful in about half of the given science-engineering programs. An important finding is also that some techniques
2014
Performance portability means a single program gives good performance across a variety of systems, without modifying the program. OpenACC is designed to offer performance portability across CPUs with SIMD extensions and accelerators based on GPU or many-core architectures. Using a sequence of examples, we explore the aspects of performance portability that are well-addressed by OpenACC itself and those that require underlying compiler optimization techniques. We introduce the concepts of forward and backward performance portability, where the former means legacy codes optimized for SIMD-capable CPUs can be compiled for optimal execution on accelerators and the latter means the opposite. The goal of an OpenACC compiler should be to provide both, and we uncover some interesting opportunities as we explore the concept of backward performance portability.
Proceedings of the 2017 Workshop on Adaptive Resource Management and Scheduling for Cloud Computing - ARMS-CC '17
Many modern parallel computing systems are heterogeneous at the node level. Such nodes may comprise general-purpose CPUs and accelerators (such as GPUs or the Intel Xeon Phi) that provide high performance with suitable energy-consumption characteristics. However, exploiting the available performance of heterogeneous architectures may be challenging. There are various parallel programming frameworks (such as OpenMP, OpenCL, OpenACC, and CUDA), and selecting the one that is suitable for a target context is not straightforward. In this paper, we study empirically the characteristics of OpenMP, OpenACC, OpenCL, and CUDA with respect to programming productivity, performance, and energy. To evaluate programming productivity we use our homegrown tool CodeStat, which enables us to determine the percentage of code lines that were required to parallelize the code using a specific framework. We use our tool x-MeterPU to evaluate energy consumption and performance. Experiments are conducted using the industry-standard SPEC benchmark suite and the Rodinia benchmark suite for accelerated computing on heterogeneous systems that combine Intel Xeon E5 processors with a GPU accelerator or an Intel Xeon Phi co-processor.
2012
The irruption of hardware accelerators, like GPUs, into the HPC scene has made unprecedented performance available to developers. However, even expert developers may not be ready to exploit the new, complex processor hierarchies. We need to find a way to leverage the programming effort on these devices at the programming-language level; otherwise, developers will spend most of their time focusing on device-specific code instead of implementing algorithmic enhancements. The recent advent of the OpenACC standard for heterogeneous computing represents an effort in this direction. This initiative, combined with future releases of the OpenMP standard, will converge into a fully heterogeneous framework that will cope with the programming requirements of future computer architectures. In this work we present accULL, a novel implementation of the OpenACC standard, based on the combination of a source-to-source compiler and a runtime library. To our knowledge, our approach is the first to provide support for both OpenCL and CUDA platforms under this new standard.
GPUs and other accelerators are available on many different devices, and GPGPU has been massively adopted by the HPC research community. Although a plethora of libraries and applications providing GPU support are available, the need to implement new algorithms from scratch, or to adapt sequential programs to accelerators, will always exist. Writing CUDA or OpenCL code, although an easier task than using their predecessors, is not trivial. Obtaining performance is even harder, as it requires a deep understanding of the underlying architecture. Some efforts have been directed toward automatic code generation for GPU devices, with varying results. In this work, we present a comparison between three directive-based programming models: hiCUDA, PGI Accelerator, and OpenACC, using our novel accULL implementation for the last.
Future Generation Computer Systems, 2006
Unified Parallel C (UPC) is an explicit parallel extension to ISO C which follows the Partitioned Global Address Space (PGAS) programming model. UPC therefore combines the ability to express parallelism with the ability to exploit locality. To do so, compilers must embody effective UPC-specific optimizations. In this paper we present a strategy for evaluating the performance of PGAS compilers. It is based on emulating possible optimizations and comparing the resulting performance to the raw compiler performance. It will be shown that this technique uncovers missed optimization opportunities. The results also demonstrate that, with such automatic optimizations, UPC performance compares favorably with other paradigms.
The Journal of Supercomputing, 2010
The role of the compiler is fundamental to exploiting the hardware capabilities of a system running a particular application, minimizing the sequential execution time and, in some cases, offering the possibility of parallelizing part of the code automatically. This paper relies on the SPEC CPU2006 v1.1 benchmark suite to evaluate the performance of the code generated by three widely used compilers (Intel C++/Fortran Compiler 11.0, Sun Studio 12 and GCC 4.3.2). Performance is measured in terms of base speed for the reference problem sizes. Both the sequential and the automatic parallel performance obtained are analyzed, using different hardware architectures and configurations. The study includes a detailed description of the different problems that arise while compiling the SPEC CPU2006 benchmarks with these tools, information that is difficult to obtain elsewhere. Keeping in mind that performance is a moving target in the field of compilers, our evaluation shows that the sequential code generated by the Sun and Intel compilers for the SPEC CPU2006 integer benchmarks presents similar performance, while the floating-point code generated by the Intel compiler is faster than its competitors'. With respect to the auto-parallelization options offered by the Intel and Sun compilers, our study shows that their benefits only apply to some floating-point benchmarks, with an average speedup of 1.2× with four processors. Meanwhile, the GCC suite evaluated is not capable of compiling the SPEC CPU2006 benchmarks with auto-parallelization options enabled.
2005
ROCCC (Riverside Optimizing Configurable Computing Compiler) is an optimizing C-to-HDL compiler targeting FPGA and CSoC (Configurable System on a Chip) architectures. The ROCCC system is built on the SUIF-MACHSUIF compiler infrastructure. Our system first identifies frequently executed kernel loops inside programs and then compiles them to VHDL, after optimizing the kernels to make the best use of FPGA resources. This paper presents an overview of the ROCCC project as well as the optimizations performed inside the ROCCC compiler.