2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, 2011
Building an effective programming model for manycore processors is challenging. On the one hand, the increasing variety of platforms and their specific programming models force users to take a hardware-centric approach not only for implementing parallel applications, but also for designing them. This approach diminishes portability and, eventually, limits performance. On the other hand, to effectively cope with the increased number of large-scale workloads that require parallelization, a portable, application-centric programming model is desirable. Such a model enables programmers to focus first on extracting and exploiting parallelism from their applications, as opposed to generating parallelism for specific hardware, and only second on platform-specific implementation and optimizations.
2013
Abstract—Building an effective programming model for manycore processors is challenging. On the one hand, the increasing variety of platforms and their specific programming models force users to take a hardware-centric approach not only for implementing parallel applications, but also for designing them. This approach diminishes portability and, eventually, limits performance. On the other hand, to effectively cope with the increased number of large-scale workloads that require parallelization, a portable, application-centric programming model is desirable. Such a model enables programmers to focus first on extracting and exploiting parallelism from their applications, as opposed to generating parallelism for specific hardware, and only second on platform-specific implementation and optimizations. In this paper, we first present a survey of programming models designed for programming three families of many-cores: general purpose many-cores (GPMCs), graphics processing units (GPUs), ...
Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores - PMAM'16, 2016
In this work we directly evaluate several emerging parallel programming models: Kokkos, RAJA, OpenACC, and OpenMP 4.0, against the mature CUDA and OpenCL APIs. Each model has been used to port TeaLeaf, a miniature proxy application (miniapp) from the Mantevo suite that solves the heat conduction equation. We find that the best performance is achieved with device-tuned implementations but that, in many cases, the performance-portable models are able to solve the same problems to within a 5-20% performance penalty. The models expose varying levels of complexity to the developer, and they all deliver reasonable performance. We believe that complexity will become the major factor in the long-term adoption of such models.
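As a hedged illustration of what these models look like side by side (the axpy-style kernel, names, and sizes below are assumptions, not code from TeaLeaf), the sketch writes the same loop once with Kokkos and once with an OpenMP 4.x target directive; a single toolchain supporting both is assumed.

    // Sketch: the same vector update in two of the compared models.
    // Assumes Kokkos is installed and the compiler supports OpenMP 4.x offload;
    // in practice the two variants would live in separately built binaries.
    #include <Kokkos_Core.hpp>
    #include <vector>

    // (1) Kokkos: portable parallel_for over a 1-D index range.
    void axpy_kokkos(double a, Kokkos::View<double*> x, Kokkos::View<double*> y) {
      const int n = static_cast<int>(y.extent(0));
      Kokkos::parallel_for("axpy", n, KOKKOS_LAMBDA(const int i) {
        y(i) = a * x(i) + y(i);
      });
      Kokkos::fence();
    }

    // (2) OpenMP 4.x: directive-based offload of the same loop.
    void axpy_omp_target(double a, const double* x, double* y, int n) {
      #pragma omp target teams distribute parallel for \
              map(to: x[0:n]) map(tofrom: y[0:n])
      for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
    }

    int main(int argc, char* argv[]) {
      Kokkos::initialize(argc, argv);
      {
        const int n = 1 << 20;
        Kokkos::View<double*> x("x", n), y("y", n);
        Kokkos::deep_copy(x, 1.0);
        Kokkos::deep_copy(y, 2.0);
        axpy_kokkos(3.0, x, y);

        std::vector<double> xs(n, 1.0), ys(n, 2.0);
        axpy_omp_target(3.0, xs.data(), ys.data(), n);
      }
      Kokkos::finalize();
      return 0;
    }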
Proceedings of the 2015 International Workshop on Code Optimisation for Multi and Many Cores - COSMIC '15, 2015
Many modern high-end embedded systems are designed as heterogeneous systems-on-chip (SoCs), where a powerful general purpose multicore host processor is coupled to a manycore accelerator. The host executes legacy applications on top of standard operating systems, while the accelerator runs highly parallel code kernels within those applications. Several programming models are currently being proposed to program such accelerator-based systems, OpenCL and OpenMP being the most relevant examples. In the near future it will be common to have multiple applications, coded with different programming models, concurrently requiring the use of the manycore accelerator. In this paper we present a runtime system for a cluster-based manycore accelerator, optimized for the concurrent execution of OpenMP and OpenCL kernels. The runtime supports spatial partitioning of the manycore, where clusters can be grouped into several "virtual" accelerator instances. Our runtime design is modular and relies on a "generic" component for resource (cluster) scheduling, plus "specialized" components which efficiently deploy generic offload requests into an implementation of the target programming model's semantics. We evaluate the proposed runtime system on a real heterogeneous system, the STMicroelectronics STHORM development board.
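The STHORM runtime's actual API is not given in the abstract; as a loose, hedged analogue of grouping clusters into "virtual" accelerator instances, the sketch below uses the standard OpenCL 1.2 device-fission call clCreateSubDevices to split one device into fixed-size sub-devices (device choice, partition size, and the omitted error handling are assumptions, and not every device supports fission).

    // Sketch: OpenCL 1.2 device fission as an analogue of spatial partitioning.
    #include <CL/cl.h>
    #include <cstdio>
    #include <vector>

    int main() {
      cl_platform_id platform;
      cl_device_id device;
      clGetPlatformIDs(1, &platform, nullptr);
      clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);

      // Ask for sub-devices with 4 compute units each (size is an assumption).
      const cl_device_partition_property props[] = {
          CL_DEVICE_PARTITION_EQUALLY, 4, 0};

      cl_uint count = 0;
      clCreateSubDevices(device, props, 0, nullptr, &count);      // query count
      std::vector<cl_device_id> sub(count);
      clCreateSubDevices(device, props, count, sub.data(), nullptr);
      std::printf("created %u sub-devices\n", count);

      // Each sub-device could now back its own context and command queue, so
      // that independently offloaded kernels run on disjoint parts of the chip.
      return 0;
    }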
Many-core architectures are now a reality and programming them is still a challenge.
2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2015
Heterogeneous platforms integrating different types of processing units (such as multi-core CPUs and GPUs) are in high demand in high performance computing. Existing studies have shown that using heterogeneous platforms can improve application performance and hardware utilization. However, systematic methods to design, implement, and map applications so that they use heterogeneous computing resources efficiently remain scarce. The goal of my PhD research is therefore to study such heterogeneous systems and propose systematic methods that allow many (classes of) applications to use them efficiently. After 3.5 years of PhD study, my contributions are (1) a thorough evaluation of a suitable programming model for heterogeneous computing; (2) a workload partitioning framework to accelerate parallel applications on heterogeneous platforms; (3) a modeling-based prediction method to determine the optimal workload partitioning; (4) a systematic approach to decide the best mapping between the application and the platform by choosing the best performing hardware configuration (Only-CPU, Only-GPU, or CPU+GPU with workload partitioning). In the near future, I plan to apply my approach to large-scale applications and platforms to expand its usability and applicability.
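The thesis' actual prediction model is not reproduced here; the following minimal sketch only illustrates the workload-partitioning idea behind contributions (2) and (3), with the split fraction beta treated as an assumed output of some performance model.

    // Sketch (names and the constant beta are assumptions, not the thesis' model):
    // split a 1-D workload between GPU and CPU by a predicted fraction beta,
    // chosen so that both partitions are expected to finish at the same time.
    #include <algorithm>
    #include <cstddef>
    #include <cstdio>

    struct Partition {
      std::size_t gpu_items;  // first chunk, offloaded to the GPU
      std::size_t cpu_items;  // remainder, kept on the CPU
    };

    Partition split_workload(std::size_t n, double beta) {
      beta = std::min(1.0, std::max(0.0, beta));   // beta = predicted GPU share
      const std::size_t gpu = static_cast<std::size_t>(beta * n);
      return {gpu, n - gpu};
    }

    int main() {
      // Assume the model predicts the GPU should take 80% of the work.
      const Partition p = split_workload(1000000, 0.8);
      std::printf("GPU: %zu items, CPU: %zu items\n", p.gpu_items, p.cpu_items);
      return 0;
    }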
2012 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2012
We discuss three complementary approaches that can provide both portability and an increased level of abstraction for the programming of heterogeneous multicore systems. Together, these approaches also support performance portability, as currently investigated in the EU FP7 project PEPPHER. In particular, we consider (1) a library-based approach, here represented by the integration of the SkePU C++ skeleton programming library with the StarPU runtime system for dynamic scheduling and dynamic selection of suitable execution units for parallel tasks; (2) a language-based approach, here represented by the Offload-C++ high-level language extensions and Offload compiler to generate platform-specific code; and (3) a component-based approach, specifically the PEPPHER component system for annotating user-level application components with performance metadata, thereby preparing them for performance-aware composition. We discuss the strengths and weaknesses of these approaches and show how they could complement each other in an integrated programming framework for heterogeneous multicore systems.
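As a generic sketch of the library-based (skeleton) approach, and explicitly not the real SkePU API, the following map skeleton shows the division of labour: the user supplies only the element-wise function, while the library owns iteration and, in a real implementation, backend selection.

    // Generic skeleton sketch (not SkePU): the library controls how the map is
    // executed; here it simply runs sequentially for illustration.
    #include <cstddef>
    #include <functional>
    #include <vector>

    template <typename T>
    std::vector<T> map_skeleton(const std::vector<T>& in, std::function<T(T)> f) {
      std::vector<T> out(in.size());
      // A real skeleton library would pick a backend (OpenMP, CUDA, OpenCL, ...)
      // here, possibly guided by a runtime system such as StarPU.
      for (std::size_t i = 0; i < in.size(); ++i) out[i] = f(in[i]);
      return out;
    }

    int main() {
      std::vector<double> v = {1.0, 2.0, 3.0};
      auto squared = map_skeleton<double>(v, [](double x) { return x * x; });
      return squared.size() == v.size() ? 0 : 1;
    }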
Parallel Computing, 2012
Heterogeneous many-core systems constitute a viable approach for coping with power constraints in modern computer architectures and can now be found across the whole computing landscape, ranging from mobile devices, to desktop systems and servers, all the way to high-end supercomputers and large-scale data centers. While these systems promise to offer superior performance-power ratios, programming heterogeneous many-core architectures efficiently has been shown to be notoriously difficult. Programmers are typically forced to take into account a plethora of low-level architectural details and usually have to resort to a combination of different programming models within a single application. In this paper we propose a platform description language (PDL) that makes it possible to capture key architectural patterns of commonly used heterogeneous computing systems. PDL architecture descriptions support both programmers and toolchains by providing platform-specific information in a well-defined and explicit manner. We have developed a prototype source-to-source compilation framework that utilizes PDL descriptors to transform sequential task-based programs with source code annotations into a form that is convenient for execution on heterogeneous many-core systems. Our framework relies on a component-based approach that accommodates different implementation variants of tasks, customized for different parts of a heterogeneous platform, and utilizes an advanced runtime system for exploiting parallelism through dynamic task scheduling. We show various usage scenarios of our PDL and demonstrate the effectiveness of our framework for a commonly used scientific kernel and a financial application on different configurations of a state-of-the-art CPU/GPU system.
Nowadays, multi-core architectures have become mainstream in the microprocessor industry. However, as the number of cores integrated on a single chip grows, the need for an adequate programming model becomes more pressing. In recent years, the OpenCL programming model has attracted the attention of the multi-core designers' community. This paper presents an OpenCL-compliant architecture and demonstrates that this programming model can be successfully used to program general-purpose multi-core architectures.
OpenCL, along with CUDA, is one of the main tools used to program GPGPUs. However, it allows running the same code on multi-core CPUs too, making it a rival for the long-established OpenMP. In this paper we compare OpenCL and OpenMP when developing and running compute-heavy code on a CPU. Both ease of programming and performance aspects are considered. Since, unlike a GPU, no memory copy operation is involved, our comparisons measure the code generation quality, as well as the thread management efficiency, of OpenCL and OpenMP. We evaluate the performance of these development tools under two conditions: a large number of short-running compute-heavy parallel code executions, when more thread management is performed, and a small number of long-running parallel code executions, when less thread management is required. The results show that OpenCL and OpenMP each win in one of the two conditions. We argue that while using OpenMP requires less setup, OpenCL can be a viable substitute for OpenMP.
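To make the comparison concrete, here is a hedged sketch of the kind of experiment described (the kernel and sizes are illustrative, not the paper's benchmark): the same element-wise computation on the CPU, once with OpenMP and once with OpenCL on a CL_DEVICE_TYPE_CPU device, so no host-device copies are strictly needed and differences mainly reflect code generation and thread management.

    // Sketch: squaring an array with OpenMP and with OpenCL on the CPU device.
    // OpenCL 1.x-style host code; error checking omitted for brevity.
    #include <CL/cl.h>
    #include <cstdio>
    #include <vector>

    static const char* kSrc =
        "__kernel void square(__global float* d) {"
        "  size_t i = get_global_id(0);"
        "  d[i] = d[i] * d[i];"
        "}";

    int main() {
      const size_t n = 1 << 20;
      std::vector<float> a(n, 3.0f), b(n, 3.0f);

      // OpenMP version: a team of CPU threads is forked per parallel region.
      #pragma omp parallel for
      for (long i = 0; i < static_cast<long>(n); ++i)
        a[i] = a[i] * a[i];

      // OpenCL version on the CPU device.
      cl_platform_id plat;  cl_device_id dev;  cl_int err;
      clGetPlatformIDs(1, &plat, nullptr);
      clGetDeviceIDs(plat, CL_DEVICE_TYPE_CPU, 1, &dev, nullptr);
      cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, &err);
      cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);
      cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, nullptr, &err);
      clBuildProgram(prog, 1, &dev, nullptr, nullptr, nullptr);
      cl_kernel k = clCreateKernel(prog, "square", &err);
      cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                  n * sizeof(float), b.data(), &err);
      clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
      clEnqueueNDRangeKernel(q, k, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
      clEnqueueReadBuffer(q, buf, CL_TRUE, 0, n * sizeof(float), b.data(),
                          0, nullptr, nullptr);
      std::printf("OpenMP: %f  OpenCL: %f\n", a[0], b[0]);
      return 0;
    }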
International Journal of Computing, 2014
This paper examines the computational programming issues that arise from the introduction of GPUs and multi-core computer systems. The discussion and analysis examine the implications of two principles, spatial and temporal locality, that provide useful metrics to guide programmers in designing and implementing efficient sequential and parallel application programs. Spatial and temporal locality represent a science of information flow and are relevant to the development of highly efficient computational programs. The art of high performance programming is to take combinations of these principles, unravel the bottlenecks and latencies associated with each manufacturer's computer architecture, and develop appropriate coding and/or task scheduling schemes to mitigate or eliminate these latencies.
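A standard concrete instance of these two principles is loop tiling (cache blocking); the sketch below, with arbitrary matrix and tile sizes, keeps reused values in cache (temporal locality) and streams through rows contiguously (spatial locality).

    // Sketch: cache-blocked matrix multiply. Sizes are arbitrary assumptions.
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    void matmul_tiled(const std::vector<double>& A, const std::vector<double>& B,
                      std::vector<double>& C, std::size_t n, std::size_t T) {
      for (std::size_t ii = 0; ii < n; ii += T)
        for (std::size_t kk = 0; kk < n; kk += T)
          for (std::size_t jj = 0; jj < n; jj += T)
            for (std::size_t i = ii; i < std::min(ii + T, n); ++i)
              for (std::size_t k = kk; k < std::min(kk + T, n); ++k) {
                const double a = A[i * n + k];       // reused across the j loop
                for (std::size_t j = jj; j < std::min(jj + T, n); ++j)
                  C[i * n + j] += a * B[k * n + j];  // unit-stride accesses
              }
    }

    int main() {
      const std::size_t n = 512, tile = 64;
      std::vector<double> A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);
      matmul_tiled(A, B, C, n, tile);
      return C[0] == 2.0 * n ? 0 : 1;
    }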
2011
Faced with nearly stagnant clock speed advances, chip manufacturers have turned to parallelism as the source of continuing performance improvements. But even though numerous parallel architectures have already been brought to market, a universally accepted methodology for programming them for general-purpose applications has yet to emerge. Existing solutions tend to be hardware-specific, rendering them difficult to use for the majority of application programmers and domain experts, and providing no scalability guarantees for future generations of the hardware. This dissertation advances the validation of the following thesis: it is possible to develop efficient general-purpose programs for a many-core platform using a model recognized for its simplicity. To prove this thesis, we refer to the eXplicit Multi-Threading (XMT) architecture designed and built at the University of Maryland. XMT is an attempt at re-inventing parallel computing with a solid theoretical foundation and an aggressively scalable design. Algorithmically, XMT is inspired by the PRAM (Parallel Random Access Machine) model, and the architecture design is focused on reducing inter-task communication and synchronization overheads and providing an easy-to-program parallel model. This thesis builds upon the existing XMT infrastructure to improve support for efficient execution with a focus on ease of programming. Our contributions aim at reducing the programmer's effort in developing XMT applications and improving overall performance. More concretely, we: (1) present a work-flow guiding programmers to produce efficient parallel solutions starting from a high-level problem; (2) introduce an analytical performance model for XMT programs and provide a methodology to project running time from an implementation; (3) propose and evaluate RAP, an improved resource-aware compiler loop prefetching algorithm targeted at fine-grained many-core architectures, demonstrating performance improvements of up to 34.79% on average over the GCC loop prefetching implementation and up to 24.61% on average over a simple hardware prefetching scheme; and (4) implement a number of parallel benchmarks and evaluate the overall performance of XMT relative to existing serial and parallel solutions, showing speedups of up to 13.89x vs. a serial processor and 8.10x vs. parallel code optimized for an existing many-core (GPU). We also discuss the implementation and optimization of the Max-Flow algorithm on XMT, a problem which is among the more advanced in terms of complexity, benchmarking, and research interest in the parallel algorithms community. We demonstrate better speedups over the best serial solution than previous attempts achieved on other parallel platforms.
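RAP itself is not described in enough detail here to reproduce; as a generic illustration of software loop prefetching of the kind being compared against (GCC-style, not XMT-specific), the sketch below issues explicit prefetches a fixed, assumed distance ahead of the current element.

    // Generic software-prefetching sketch (not XMT's RAP algorithm). The
    // prefetch distance of 16 elements is an arbitrary assumption that a real
    // compiler pass would tune to the target architecture.
    #include <cstddef>
    #include <vector>

    double sum_with_prefetch(const std::vector<double>& x) {
      const std::size_t dist = 16;  // assumed prefetch distance, in elements
      double s = 0.0;
      for (std::size_t i = 0; i < x.size(); ++i) {
        if (i + dist < x.size())
          __builtin_prefetch(&x[i + dist], /*rw=*/0, /*locality=*/1);
        s += x[i];
      }
      return s;
    }

    int main() {
      std::vector<double> x(1 << 20, 1.0);
      return sum_with_prefetch(x) > 0.0 ? 0 : 1;
    }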
Parallel Computing, 2012
The European FP7 project PEPPHER is addressing programmability and performance portability for current and emerging heterogeneous many-core architectures. As its main idea, the project proposes a multi-level parallel execution model comprising potentially parallelized components that exist in variants suitable for different types of cores, memory configurations, input characteristics, and optimization criteria, and couples this with dynamic and static resource- and architecture-aware scheduling mechanisms. Crucial to PEPPHER is that components can be made performance aware, allowing for more efficient dynamic and static scheduling on the concrete, available resources. The flexibility provided in the software model, combined with a customizable, heterogeneous, memory- and topology-aware run-time system, is key to efficiently exploiting the resources of each concrete hardware configuration. The project takes a holistic approach, relying on existing paradigms, interfaces, and languages for the parallelization of components, and develops a prototype framework, a methodology for extending the framework, and guidelines for constructing performance-portable software and systems (including paths for migrating existing software) for heterogeneous many-core processors. This paper gives a high-level project overview, and presents a specific example showing how the PEPPHER component variant model and resource-aware run-time system enable performance portability of a numerical kernel.
2010
In this paper, we present the design and implementation of an Open Computing Language (OpenCL) framework that targets heterogeneous accelerator multicore architectures with local memory. The architecture consists of a general-purpose processor core and multiple accelerator cores that typically do not have any cache. Each accelerator core, instead, has a small internal local memory. Our OpenCL runtime is based on software-managed caches and coherence protocols that guarantee OpenCL memory consistency to overcome the limited size of the local memory. To boost performance, the runtime relies on three source-code transformation techniques, work-item coalescing, web-based variable expansion and preload-poststore buffering, performed by our OpenCL C source-to-source translator. Work-item coalescing is a procedure to serialize multiple SPMD-like tasks that execute concurrently in the presence of barriers and to sequentially run them on a single accelerator core. It requires the web-based variable expansion technique to allocate local memory for private variables. Preload-poststore buffering is a buffering technique that eliminates the overhead of software cache accesses. Together with work-item coalescing, it has a synergistic effect on boosting performance. We show the effectiveness of our OpenCL framework, evaluating its performance with a system that consists of two Cell BE processors. The experimental result shows that our approach is promising.
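Work-item coalescing can be pictured on a toy kernel (the kernel, names, and work-group size below are illustrative assumptions, not the translator's actual output): a barrier splits the kernel into two loops over work-items, and a private variable that lives across the barrier is expanded into a per-work-item array.

    // Sketch of work-item coalescing: a kernel of the form
    //   stage1(i); barrier(); stage2(i);
    // executed by a work-group becomes two sequential loops on one core, with
    // the per-work-item private value expanded into an array.
    #include <cstddef>
    #include <vector>

    void coalesced_work_group(std::vector<float>& data, std::size_t wg_size) {
      std::vector<float> priv(wg_size);            // expanded private variable

      // Loop 1: everything before the barrier, for every work-item in the group.
      for (std::size_t i = 0; i < wg_size; ++i)
        priv[i] = data[i] * 2.0f;                  // stage1

      // The barrier is now implicit: loop 1 finishes before loop 2 starts.

      // Loop 2: everything after the barrier.
      for (std::size_t i = 0; i < wg_size; ++i)
        data[i] = priv[i] + priv[(i + 1) % wg_size];  // stage2 reads a neighbour
    }

    int main() {
      std::vector<float> data(64, 1.0f);
      coalesced_work_group(data, data.size());
      return 0;
    }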
2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014
Many of the heterogeneous resources available to modern computers are designed for different workloads. In order to use GPU resources efficiently, the workload must have a greater degree of parallelism than a workload designed for multicore CPUs, and, conceptually, the Intel Xeon Phi coprocessors are capable of handling workloads somewhere in between the two. This multitude of applicable workloads will likely lead to mixing multicore CPUs, GPUs, and Intel coprocessors in multi-user environments that must offer adequate computing facilities for a wide range of workloads. In this work, we use a lightweight runtime environment to manage the resource-specific workload and to control the dataflow and parallel execution in two-way hybrid systems. The lightweight runtime environment uses task superscalar concepts to enable the developer to write serial code while providing parallel execution. In addition, our task abstractions enable unified algorithmic development across all the heterogeneous resources. We provide performance results for dense linear algebra applications, demonstrating the effectiveness of our approach and full utilization of a wide variety of accelerator hardware.
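The abstract does not name the runtime's API; as a loosely analogous, hedged sketch of "write serial code, get parallel execution", the example below uses standard OpenMP task dependences so that the runtime derives the dataflow from the serial ordering of the task bodies.

    // Hedged analogue (standard OpenMP tasks, not this paper's runtime): the
    // code is written as a serial sequence; depend clauses let the runtime
    // discover the dataflow and run independent tasks in parallel.
    #include <cstdio>

    int main() {
      double a = 0.0, b = 0.0, c = 0.0;
      #pragma omp parallel
      #pragma omp single
      {
        #pragma omp task depend(out: a)
        a = 1.0;                       // task 1: produces a

        #pragma omp task depend(out: b)
        b = 2.0;                       // task 2: independent of task 1

        #pragma omp task depend(in: a, b) depend(out: c)
        c = a + b;                     // task 3: waits for tasks 1 and 2

        #pragma omp taskwait
        std::printf("c = %f\n", c);
      }
      return 0;
    }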
2009
Work in the field of application acceleration devices is showing great promise, but still remains a tool largely for computer scientists with domain knowledge, given the complexity of porting existing algorithms to new architectures or environments. Such porting is hindered by the lack of abstraction available.
Lecture Notes in Electrical Engineering, 2011
The 2PARMA project focuses on the development of parallel programming models and run-time resource management techniques to exploit the features of many-core processor architectures.