2015, Proceedings of the 2015 ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences
Using GPUs as general-purpose processors has revolutionized parallel computing by offering, for a large and growing set of algorithms, massive data-parallelization on desktop machines. An obstacle to widespread adoption, however, is the difficulty of programming them and the low-level control of the hardware required to achieve good performance. This paper suggests a programming library, SafeGPU, that aims at striking a balance between programmer productivity and performance, by making GPU data-parallel operations accessible from within a classical object-oriented programming language. The solution is integrated with the design-by-contract approach, which increases confidence in functional program correctness by embedding executable program specifications into the program text. We show that our library leads to modular and maintainable code that is accessible to GPGPU non-experts, while providing performance that is comparable with handwritten CUDA code. Furthermore, runtime contract checking turns out to be feasible, as the contracts can be executed on the GPU.
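As a rough illustration of the idea of executing contracts on the GPU (a minimal sketch in CUDA C++ with Thrust, not SafeGPU's actual object-oriented API), a precondition can be evaluated as a parallel reduction on the device before the data-parallel operation it guards is executed:

```cuda
// Hypothetical sketch only -- not SafeGPU's API. A precondition ("all
// elements positive") is checked on the GPU with a parallel reduction
// before the guarded data-parallel operation runs, also on the GPU.
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/logical.h>
#include <thrust/transform.h>
#include <cassert>

int main() {
  thrust::device_vector<float> v(1 << 20, 2.0f);

  // Contract check, executed on the GPU as a parallel reduction.
  using namespace thrust::placeholders;
  bool precondition_holds = thrust::all_of(v.begin(), v.end(), _1 > 0.0f);
  assert(precondition_holds);

  // Guarded data-parallel operation: elementwise square, also on the GPU.
  thrust::transform(v.begin(), v.end(), v.begin(), v.begin(),
                    thrust::multiplies<float>());
  return 0;
}
```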
2011
The recent rise in the popularity of Graphics Processing Units (GPUs) has been fueled by software frameworks, such as NVIDIA’s Compute Unified Device Architecture (CUDA) and Khronos Group’s OpenCL that make GPUs available for general purpose computing. However, CUDA and OpenCL are still lowlevel approaches that require users to handle details about data layout and movement across levels of memory hierarchy. We propose a declarative approach to coordinating computation and data movement between CPU and GPU, through a domain-specific language that we called Harlan. Not only does a declarative language obviate the need for the programmer to write low-level error-prone boilerplate code, by raising the abstraction of specifying GPU computation it also allows the compiler to optimize data movement and overlap between CPU and GPU computation. By focusing on the “what”, and not the “how”, of data layout, data movement, and computation scheduling, the language eliminates the sources of many ...
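For contrast, the low-level boilerplate the abstract refers to looks roughly like the following in plain CUDA C++ (the Harlan language itself is not shown; error checking is omitted for brevity). Every allocation, transfer, and launch configuration is spelled out by hand:

```cuda
// Plain CUDA C++: explicit allocation, data movement, and kernel launch.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void scale(float *x, float a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= a;                            // one element per thread
}

int main() {
  const int n = 1 << 20;
  float *h = (float *)malloc(n * sizeof(float));
  for (int i = 0; i < n; ++i) h[i] = 1.0f;

  float *d;
  cudaMalloc(&d, n * sizeof(float));                            // explicit allocation
  cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // explicit copy in
  scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);                  // manual launch config
  cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // explicit copy out
  cudaFree(d);

  printf("h[0] = %f\n", h[0]);
  free(h);
  return 0;
}
```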
Proceedings of the 2011 ACM SIGPLAN X10 Workshop, 2011
GPU architectures have emerged as a viable way of considerably improving performance for appropriate applications. Program fragments (kernels) appropriate for GPU execution can be implemented in CUDA or OpenCL and glued into an application via an API. While there is plenty of evidence of performance improvements using this approach, there are many issues with productivity. Programmers must understand an additional programming model and API to program the accelerator; concurrency and synchronization in this programming model are typically expressed differently from the programming model for the host. On top of this, the languages used to write kernels are very low-level and thus prone to the kinds of errors that one does not encounter in higher-level languages. Programmers must explicitly deal with moving data back and forth between the host and the accelerator. These problems are compounded when the user code must be run across a cluster of accelerated nodes. Now the host programming model must further be extended with constructs to deal with scale-out and remote accelerators. We believe there is a critical need for a single-source programming model that can be used to write clean, efficient code for heterogeneous, multi-core, and scale-out architectures. The APGAS programming model has been developed for such architectures over the past six years. APGAS is based on four fundamental (and architecture-independent) notions: locality, asynchrony, conditional atomicity, and order. X10 is an instantiation of the APGAS programming model on top of a base sequential language with Java-style productivity. Earlier work has shown that X10 can be used to write clean and efficient code for homogeneous multi-cores, SMPs, Cell-accelerated nodes, and clusters of such nodes. In this paper we show how X10 programmers can write code that can be compiled and run on GPUs. GPU programming idioms such as threads, blocks, barriers, constant memory, local registers, shared memory variables, etc. can be directly expressed in X10 and do not require new language extensions. We present the design of an extension of the X10-to-C++ compiler which recognizes such idioms and produces CUDA kernel code. We show several benchmarks written in this style. The performance of these kernels is within 80% of handwritten CUDA kernels. We believe these results establish X10 as a single-source programming language in which clean, efficient programs can be written for GPU-accelerated clusters.
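The idioms listed above (thread and block indexing, shared memory variables, barriers) appear as follows in plain CUDA C++; the corresponding X10 surface syntax from the paper is not reproduced here.

```cuda
// Plain CUDA C++ showing the idioms: block/thread indexing, a shared
// memory buffer, and __syncthreads() barriers. Launch with 256 threads
// per block; each block writes one partial sum.
__global__ void block_sum(const float *in, float *out, int n) {
  __shared__ float buf[256];                      // shared memory variable
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // block/thread indexing
  buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
  __syncthreads();                                // barrier within the block

  // Tree reduction in shared memory; each step halves the active threads.
  for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
    if (threadIdx.x < stride)
      buf[threadIdx.x] += buf[threadIdx.x + stride];
    __syncthreads();
  }
  if (threadIdx.x == 0) out[blockIdx.x] = buf[0]; // one partial sum per block
}
```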
2010
General-purpose computing on GPUs (graphics processing units) has received much attention lately due to the benefits of stream processing for exploiting parallelism. However, programming GPUs poses several challenges with respect to the amount of effort spent in combining the functional kernel code of an application with the parallelism concerns exposed by the APIs of various GPUs. This paper introduces our approach for raising the level of abstraction for programming GPUs.
Springer eBooks, 2020
Over the years, researchers have developed many formal method tools to support software development. However, hardly any studies have been conducted to determine whether the actual problems developers encounter are sufficiently addressed. For the relatively young field of GPU programming, we would like to know whether the tools developed so far are sufficient, or whether some problems still need attention. To this end, we first look at what kinds of problems programmers encounter in OpenCL and CUDA. We gather problems from Stack Overflow and categorise them with card sorting. We find that problems related to memory, synchronisation of threads, threads in general, and performance are essential topics. Next, we look at (verification) tools in industry and research to see how these tools address the problems we discovered. We think many problems are already properly addressed, but there is still a need for easy-to-use sound tools. Alternatively, languages or programming styles could be created that allow for easier soundness checking. Keywords: GPU • GPGPU • Formal methods • Verification • Bugs • CUDA • OpenCL
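As a concrete illustration of the memory- and synchronisation-related problem categories mentioned above (our own example, not taken from the paper), the following CUDA C++ kernel contains two classic defects of the kind that GPU verification tools aim to detect:

```cuda
// Illustrative buggy kernel (assumed to be launched with 256 threads per
// block); both defects below are well-known CUDA pitfalls.
__global__ void buggy(float *data, int n) {
  __shared__ float tile[256];
  int i = blockIdx.x * blockDim.x + threadIdx.x;

  tile[threadIdx.x] = (i < n) ? data[i] : 0.0f;
  // BUG 1: missing __syncthreads() here -- the read below may race with
  // the shared-memory write performed by the neighbouring thread.
  float left = tile[(threadIdx.x + blockDim.x - 1) % blockDim.x];

  if (threadIdx.x < 128) {
    // BUG 2: barrier divergence -- only part of the block reaches this
    // __syncthreads(), which is undefined behaviour.
    __syncthreads();
  }
  if (i < n) data[i] = left;
}
```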
2012
Graphics processing units (GPUs) are powerful devices capable of rapid parallel computation. GPU programming, however, can be quite difficult, limiting its use to experienced programmers and keeping it out of reach of a large number of potential users. We present Chestnut, a domain-specific GPU parallel programming language for parallel multidimensional grid applications. Chestnut is designed to greatly simplify the process of programming on the GPU, making GPU computing accessible to computational scientists who have little or no parallel programming experience, as well as a useful and powerful language for more experienced programmers. In addition, Chestnut has an optional GUI programming interface that makes GPU computing accessible to even novice programmers. Chestnut is intuitive and easy to use, while still powerful in the types of parallelism it can express. The language provides a single simple parallel construct that allows a Chestnut programmer to “think sequentially” in e…
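For comparison, the kind of multidimensional grid update Chestnut targets can be written as a plain 2D CUDA C++ kernel (this is not Chestnut syntax, and the kernel name is ours); each thread handles a single grid point, which is the "think sequentially" view the language aims to preserve:

```cuda
// Simple Jacobi-style 2D grid update: each thread averages the four
// neighbours of one interior point of a w-by-h row-major grid.
__global__ void grid_step(const float *in, float *out, int w, int h) {
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x > 0 && x < w - 1 && y > 0 && y < h - 1) {
    out[y * w + x] = 0.25f * (in[y * w + (x - 1)] + in[y * w + (x + 1)] +
                              in[(y - 1) * w + x] + in[(y + 1) * w + x]);
  }
}
// A matching launch might use 16x16 thread blocks:
// grid_step<<<dim3((w + 15) / 16, (h + 15) / 16), dim3(16, 16)>>>(d_in, d_out, w, h);
```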
2014 IEEE Intl Conf on High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS), 2014
We present the design and implementation of a generic annotation-based compiler framework, JolokiaC++, which generates high-quality CUDA (Compute Unified Device Architecture) code for GPUs. The framework abstracts the details of the underlying hardware using annotations, thus allowing an end-user to write parallel programs without detailed knowledge about the hardware. The end-user can extract an acceptable level of performance from GPU hardware without worrying about low-level details of the hardware such as data allocation, memory organization, and communication overhead. The ultimate goal of the framework is to increase productivity without compromising performance. The proposed key ingredients to achieve the goals of productivity and performance are implicit and explicit annotations supported by task-level data-flow analysis and operation-level data-flow analysis. JolokiaC++ can also optimize irregular data applications on GPUs. We developed extensions for the generic parallel constructs that allow portable and efficient programming of codes with irregular accesses on the GPU. We evaluate and show the effectiveness of our framework on kernels with regular and irregular accesses. The regular-access kernels include Blackscholes, Matrix-Vector multiplication, Matrix-Matrix multiplication, Jacobi 1D & 2D, Heat 2D, Vector Addition, and Convolution. We evaluated the performance of the regular kernels on Nvidia's GeForce 770 using CUDA version 5.5. The inspector-executor composition for irregular accesses in our framework is evaluated by generating synthetic data for the aggregation benchmarks MOLDYN, IRREG, and NBF. We present experimental results from compiling the irregular …
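One of the regular-access kernels listed above, matrix-vector multiplication, looks as follows when written directly in CUDA C++; the JolokiaC++ annotations that would generate this kind of code are not reproduced here.

```cuda
// Hand-written CUDA matrix-vector multiply: y = A * x, with A stored
// row-major (rows x cols). One thread computes one row of the result.
__global__ void matvec(const float *A, const float *x, float *y,
                       int rows, int cols) {
  int r = blockIdx.x * blockDim.x + threadIdx.x;   // one row per thread
  if (r < rows) {
    float acc = 0.0f;
    for (int c = 0; c < cols; ++c)
      acc += A[r * cols + c] * x[c];
    y[r] = acc;
  }
}
```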
Lecture Notes in Computer Science, 2013
As a consequence of the immense computational power available in GPUs, the usage of these platforms for running data-intensive general-purpose programs has been increasing. Since the memory and processor architectures of CPUs and GPUs are substantially different, programs designed for each platform are also very different and often resort to very distinct sets of algorithms and data structures. Selecting between the CPU and the GPU for a given program is not easy, as there are variations in the hardware of the GPU, in the amount of data, and in several other performance factors. AEminiumGPU is a new data-parallel framework for developing and running parallel programs on CPUs and GPUs. AEminiumGPU programs are written in Java using Map-Reduce primitives and are compiled into hybrid executables which can run on either platform. Thus, the decision of which platform is going to be used for executing a program is delayed until run-time and automatically performed by the system using machine-learning techniques. Our tests show that AEminiumGPU is able to achieve speedups of up to 65x and that the average accuracy of the platform-selection algorithm, in choosing the best platform for executing a program, is above 92%.
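AEminiumGPU itself is a Java framework, so its actual API is not shown; as a rough CUDA-side analogue of its Map-Reduce primitives, Thrust's transform_reduce fuses a map with a reduction on the GPU:

```cuda
// Illustrative analogue only (CUDA C++ with Thrust, not AEminiumGPU's Java
// API): map each element to its square, then reduce with addition.
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/transform_reduce.h>
#include <cstdio>

struct square {
  __host__ __device__ float operator()(float x) const { return x * x; }
};

int main() {
  thrust::device_vector<float> v(1 << 20, 3.0f);
  float sum_of_squares = thrust::transform_reduce(
      v.begin(), v.end(), square(), 0.0f, thrust::plus<float>());
  printf("sum of squares = %f\n", sum_of_squares);
  return 0;
}
```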
2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum, 2013
Graphics processing units (GPUs) have the potential to greatly accelerate many applications, yet programming models remain too low level. Many language-based solutions to date have addressed this problem by creating embedded domain-specific languages that compile to CUDA or OpenCL. These targets are meant for human programmers and thus are less than ideal compilation targets. LLVM recently gained a compilation target for PTX, NVIDIA's low-level virtual instruction set for GPUs. This lower-level representation is more expressive than CUDA and OpenCL, making it easier to support advanced language features such as abstract data types or even certain closures. We demonstrate the effectiveness of this approach by extending the Rust programming language with support for GPU kernels. At the most basic level, our extensions provide functionality that is similar to that of CUDA. However, our approach seamlessly integrates with many of Rust's features, making it easy to build a library of ergonomic abstractions for data parallel computing. This approach provides the expressiveness of a high level GPU language like Copperhead or Accelerate, yet also provides the programmer the power needed to create new abstractions when those we have provided are insufficient.
Parallel and Distributed Computing and Networks, 2013
Although general-purpose computation on GPUs (GPGPU) seems to be a promising method for high-performance computing, current programming frameworks such as CUDA and OpenCL are difficult to use and not portable enough. Therefore, we propose a new framework, MESI-CUDA, for easier GPGPU programming. MESI-CUDA provides shared variables which can be accessed from both the CPU and the GPU. Our compiler translates the user's shared-memory-based program into a CUDA program, automatically generating the memory allocation and data transfer code. The compiler also overlaps kernel executions and data transfers by optimizing the scheduling. The evaluation results show that programs using MESI-CUDA can achieve performance close to hand-optimized CUDA programs, while greatly reducing the user's coding cost.
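MESI-CUDA's own shared variables are not shown here; as a point of comparison, CUDA's managed (unified) memory gives a similar single-pointer view of data accessible from both CPU and GPU, with the runtime handling the transfers:

```cuda
// Plain CUDA C++ using managed memory: one allocation visible to both
// host and device, no explicit cudaMemcpy calls.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void increment(int *v, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) v[i] += 1;
}

int main() {
  const int n = 1024;
  int *v;
  cudaMallocManaged(&v, n * sizeof(int));   // accessible from host and device
  for (int i = 0; i < n; ++i) v[i] = i;     // host writes directly
  increment<<<(n + 255) / 256, 256>>>(v, n);
  cudaDeviceSynchronize();                  // wait before the host reads
  printf("v[0] = %d\n", v[0]);
  cudaFree(v);
  return 0;
}
```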
The Journal of Supercomputing, 2011
llc is a C-based language where parallelism is expressed using compiler directives. In this paper, we present a new backend of an llc compiler that produces code for GPUs. We have also implemented a software architecture that eases the development of new backends. Our design represents an intermediate layer between a high-level parallel language and different hardware architectures.