2013, Lecture Notes in Computer Science
OpenMP is an explicit parallel programming model that offers reasonable productivity. Its memory model assumes a shared address space, and hence the direct translation, as done by common OpenMP compilers, requires an underlying shared-memory architecture. Many lab machines comprise tens of processors built from commodity components and thus have distributed address spaces. Despite many efforts to provide higher productivity for these platforms, the most common programming model remains message passing, which is substantially more tedious to program than shared-address-space models. This paper presents a compiler/runtime system that translates OpenMP programs into message-passing variants and executes them on clusters of up to 64 processors. We build on previous work that provided a proof of concept of such translation. The present paper describes compiler algorithms and runtime techniques that provide the automatic translation of a first class of OpenMP applications: those that exhibit regular write array subscripts and repetitive communication. We evaluate the translator on representative benchmarks of this class and compare their performance against hand-written MPI variants. In all but one case, our translated versions perform close to the hand-written variants.
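As a rough illustration of the class of codes this translation targets, the sketch below hand-codes the kind of MPI variant such a translator might emit from a regular OpenMP loop. It is written under simplifying assumptions (a 1-D array whose size divides evenly among the ranks, fully replicated arrays, a regular write pattern); the function name and constants are invented for the example and are not taken from the paper.

    /* Sketch only: hand-written MPI variant of a regular OpenMP loop, assuming
     * N is divisible by the number of ranks and every rank holds a full
     * (replicated) copy of a[] and b[]. */
    #include <mpi.h>

    #define N 1024

    void relax(double *a, const double *b)
    {
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Original OpenMP form:
         *   #pragma omp parallel for
         *   for (int i = 1; i < N - 1; i++) a[i] = 0.5 * (b[i-1] + b[i+1]);
         * Each rank executes one contiguous block of the iteration space. */
        int chunk = N / nprocs;
        int lo = rank * chunk, hi = lo + chunk;
        if (lo < 1) lo = 1;                 /* keep the original loop bounds */
        if (hi > N - 1) hi = N - 1;

        for (int i = lo; i < hi; i++)
            a[i] = 0.5 * (b[i - 1] + b[i + 1]);

        /* The regular, repetitive write pattern lets a single collective make
         * every rank's replicated copy of a[] consistent again. */
        MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                      a, chunk, MPI_DOUBLE, MPI_COMM_WORLD);
    }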
2008
Abstract. OpenMP is a portable shared-memory programming interface that promises high programmer productivity for multithreaded applications. It is designed for small and medium-sized shared-memory systems. We have developed strategies to extend OpenMP to clusters via compiler translation to a Global Arrays program. In this paper, we describe our implementation of the translation in the Open64 compiler, focusing on strategies to improve the translation of sequential regions. Our work is based upon the open-source Open64 compiler suite for C, C++, and Fortran 90/95.
International Journal of Parallel Programming - IJPP, 2003
This paper describes compiler techniques that can translate standard OpenMP applications into code for distributed computer systems. OpenMP has emerged as an important model and language extension for shared-memory parallel programming. However, despite OpenMP's success on these platforms, it is not currently being used on distributed systems. The long-term goal of our project is to quantify the degree to which such use is possible and to develop supporting compiler techniques. Our present compiler techniques translate OpenMP programs into a form suitable for execution on a Software DSM system. We have implemented a compiler that performs this basic translation, and we have studied a number of hand optimizations that improve the baseline performance. Our approach complements related efforts that have proposed language extensions for efficient execution of OpenMP programs on distributed systems. Our results show that, while kernel benchmarks can show high efficiency of OpenMP prog...
Concurrency and Computation: Practice and Experience, 2004
The rapid rise of OpenMP as the preferred parallel programming paradigm for small-to-medium scale parallelism could slow unless OpenMP can show capabilities for becoming the model-of-choice for large scale high-performance parallel computing in the coming decade.
Scientific Programming, 2000
This paper makes two important contributions. First, the paper investigates the performance implications of data placement in OpenMP programs running on modern NUMA multiprocessors. Data locality and minimization of the rate of remote memory accesses are critical for sustaining high performance on these systems. We show that, due to the low remote-to-local memory access latency ratio of contemporary NUMA architectures, reasonably balanced page placement schemes, such as round-robin or random distribution, incur modest performance losses. Second, the paper presents a transparent, user-level page migration engine with an ability to gain back any performance loss that stems from suboptimal placement of pages in iterative OpenMP programs. The main body of the paper describes how our OpenMP runtime environment uses page migration for implementing implicit data distribution and redistribution schemes without programmer intervention. Our experimental results verify the effectiveness of the proposed framework and provide a proof of concept that it is not necessary to introduce data distribution directives in OpenMP and compromise the simplicity or portability of the programming model. An earlier version of this paper [29] appeared in the IEEE/ACM Supercomputing 2000 conference and won the Best Technical Paper Award.
Scientific Programming, 2001
OpenMP is attracting widespread interest because of its easy-to-use parallel programming model for shared-memory multiprocessors. We have implemented a "cluster-enabled" OpenMP compiler for a page-based software distributed shared memory system, SCASH, which works on a cluster of PCs. It allows OpenMP programs to run transparently in a distributed-memory environment. The compiler transforms OpenMP programs into parallel programs using SCASH so that shared global variables are allocated at run time in the shared address space of SCASH. A set of directives is added to specify data mapping and a loop scheduling method that schedules iterations onto the threads associated with the data mapping. Our experimental results show that data mapping may greatly impact the performance of OpenMP programs in a software distributed shared memory system. The performance of some NAS Parallel Benchmark programs in OpenMP is improved by using our extended directives.
Lecture Notes in Computer Science, 2002
In this paper, we examine some of the challenges involved in supporting OpenMP applications on a Software Distributed Shared Memory (DSM) based cluster system. We present detailed measurements of the performance characteristics of realistic OpenMP applications from the SPEC OMP2001 benchmarks. Based on these measurements, we discuss application and system characteristics that impede the efficient execution of these programs on a Software DSM system. We point out pitfalls of a naive translation approach from OpenMP into the API provided by a Software DSM system, and we discuss a set of possible program optimization techniques.
International Journal of Innovative Technology and Exploring Engineering, 2019
With the increasing adoption of parallel computing, it has become necessary to write OpenMP programs to achieve better speedup and to exploit parallel hardware efficiently. However, to achieve this, programmers are required to understand OpenMP directives and clauses, the dependencies in their code, and so on. A small mistake, such as misanalysing a dependency or incorrectly scoping a variable, can result in an incorrect or inefficient program. In this paper, we propose a system which can automate the parallelization of serial C code. The system accepts a serial program as input and generates the corresponding parallel code in OpenMP without altering the core logic of the program. The system uses the different data scoping and work-sharing constructs available in OpenMP. The system designed here aims at parallelizing "for" loops, "while" loops, nested "for" loops and recursive structures. The system has parallelized "for" loops by considering ...
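For illustration only, the hypothetical before/after pair below shows the kind of transformation such a tool performs on a simple reduction loop; the variable names and scoping choices are ours, not output of the described system.

    /* Hypothetical before/after: a serial loop and the OpenMP version a
     * parallelizer of this kind could emit; names and values are illustrative. */
    #include <stdio.h>

    #define N 1000

    int main(void)
    {
        double x[N], sum = 0.0;

        for (int i = 0; i < N; i++)          /* serial initialization loop */
            x[i] = i * 0.5;

        /* Generated parallel form: the loop index is implicitly private,
         * x is shared, and sum requires a reduction to remain correct. */
        #pragma omp parallel for shared(x) reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += x[i] * x[i];

        printf("sum = %f\n", sum);
        return 0;
    }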
Lecture Notes in Computer Science, 2000
We present the design and implementation of UPMLIB, a runtime system that provides transparent facilities for dynamically tuning the memory performance of OpenMP programs on scalable shared-memory multiprocessors with hardware cache coherence. UPMLIB integrates information from the compiler and the operating system to implement algorithms that perform accurate and timely page migrations. The algorithms and the associated mechanisms correlate memory reference information with the semantics of parallel programs and with scheduling events that break the association between threads and the data for which threads have memory affinity at runtime. Our experimental evidence shows that UPMLIB makes OpenMP programs immune to the page placement strategy of the operating system, thus obviating the need for introducing data placement directives in OpenMP. Furthermore, UPMLIB provides solid throughput improvements in multiprogrammed execution environments.
2004
In this paper, we present an alternative implementation of the NANOS OpenMP runtime library (NthLib) that targets portability and efficient support of multiple levels of parallelism. We have implemented the runtime libraries of available open-source OpenMP compilers on top of NthLib, thus reducing their overheads and providing them with inherent support for nested parallelism. In addition, we present an experimental implementation of the workqueuing model and the parallelization of a data clustering algorithm using OpenMP directives. The asymmetry and non-determinism of this algorithm necessitate the exploitation of its nested loop-level parallelism. The experimental results on an SMP server with four processors demonstrate our efficient OpenMP runtime support.
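As a minimal sketch of the nested loop-level parallelism such a runtime must support, the following fragment uses standard OpenMP calls (not NthLib's own interface) to spawn an inner team inside each thread of an outer parallel region.

    /* Minimal nested-parallelism sketch using standard OpenMP calls. */
    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        omp_set_nested(1);                       /* allow nested parallel regions */

        #pragma omp parallel num_threads(2)      /* outer level of parallelism */
        {
            int outer = omp_get_thread_num();

            #pragma omp parallel num_threads(2)  /* inner team inside each thread */
            {
                printf("outer thread %d, inner thread %d\n",
                       outer, omp_get_thread_num());
            }
        }
        return 0;
    }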
Proceedings of the fifteenth annual ACM symposium on Parallel algorithms and architectures - SPAA '03, 2003
When using a shared-memory multiprocessor, the programmer faces the selection of the portable programming model that will deliver the best performance. Even if the choice is restricted to the standard programming environments (MPI and OpenMP), a broad range of programming approaches remains. To help the programmer in this selection, we compare MPI with three OpenMP programming styles (loop level, loop level with large parallel sections, SPMD) using a subset of the NAS benchmarks (CG, MG, FT, LU), two dataset sizes (A and B) and two shared-memory multiprocessors (IBM SP3 NightHawk II, SGI Origin 3800). We also present a path from MPI to OpenMP SPMD to guide programmers starting from an existing MPI code. We present the first SPMD OpenMP version of the NAS benchmarks and compare it with other OpenMP versions from independent sources (PBN, SDSC and RWCP). Experimental results demonstrate that OpenMP provides competitive performance compared to MPI for a large set of experimental conditions. However, the price of this performance is a strong programming effort on dataset adaptation and inter-thread communications. MPI still provides the best performance under some conditions. We present breakdowns of the execution times and measurements of hardware performance counters to explain the performance differences.
Concurrency and Computation: Practice and Experience, 2007
OpenMP has gained wide popularity as an API for parallel programming on shared memory and distributed shared memory platforms. Despite its broad availability, there remains a need for a portable, robust, open source, optimizing OpenMP compiler for C/C++/Fortran 90, especially for teaching and research, e.g. into its use on new target architectures, such as SMPs with chip multithreading, as well as learning how to translate for clusters of SMPs. In this paper, we present our efforts to design and implement such an OpenMP compiler on top of Open64, an open source compiler framework, by extending its existing analysis and optimization and adopting a source-to-source translator approach where a native back end is not available. The compilation strategy we have adopted and the corresponding runtime support are described. The OpenMP validation suite is used to determine the correctness of the translation. The compiler's behavior is evaluated using benchmark tests from the EPCC microbenchmarks and the NAS parallel benchmark.
In this paper, we present OMP2MPI, a tool that automatically generates MPI source code from OpenMP. With this transformation, the original program can be adapted to exploit a larger number of processors by surpassing the limits of the node level on large HPC clusters. The transformation can also be useful for adapting source code to run on distributed-memory many-cores with message-passing support. In addition, the resulting MPI code can be used as a starting point that can still be further optimized by software engineers. The transformation process focuses on detecting OpenMP parallel loops and distributing them in a master/worker pattern. A set of micro-benchmarks has been used to verify the correctness of the transformation and to measure the resulting performance. Surprisingly, the automatically generated code is not only correct by construction, but also often performs faster even when executed with MPI.
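The sketch below, written for illustration rather than taken from OMP2MPI's actual output, shows a master/worker distribution of loop iterations of the kind the abstract describes: rank 0 hands out chunk start indices, workers compute their chunk of an element-wise loop and send the results back. It assumes at least two MPI ranks and an iteration count that is a multiple of the chunk size; the function name is invented for the example.

    /* Illustrative master/worker pattern (not OMP2MPI output).
     * Assumes nprocs >= 2 and N a multiple of CHUNK. */
    #include <mpi.h>
    #include <string.h>

    #define N     1000
    #define CHUNK 100

    void square_all(double *a)        /* conceptually: #pragma omp parallel for */
    {
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        if (rank == 0) {              /* master: distribute work, collect results */
            int next = 0, active = 0, stop = -1;
            for (int w = 1; w < nprocs && next < N; w++, next += CHUNK) {
                MPI_Send(&next, 1, MPI_INT, w, 0, MPI_COMM_WORLD);
                active++;
            }
            for (int w = active + 1; w < nprocs; w++)      /* idle workers: stop */
                MPI_Send(&stop, 1, MPI_INT, w, 0, MPI_COMM_WORLD);

            while (active > 0) {
                double buf[CHUNK + 1];        /* buf[0] carries the start index */
                MPI_Status st;
                MPI_Recv(buf, CHUNK + 1, MPI_DOUBLE, MPI_ANY_SOURCE, 1,
                         MPI_COMM_WORLD, &st);
                memcpy(&a[(int)buf[0]], &buf[1], CHUNK * sizeof(double));
                if (next < N) {               /* hand the worker another chunk */
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, 0, MPI_COMM_WORLD);
                    next += CHUNK;
                } else {                      /* no work left: tell it to stop */
                    MPI_Send(&stop, 1, MPI_INT, st.MPI_SOURCE, 0, MPI_COMM_WORLD);
                    active--;
                }
            }
        } else {                      /* worker: run the loop body on each chunk */
            for (;;) {
                int start;
                MPI_Recv(&start, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                if (start < 0)
                    break;
                double buf[CHUNK + 1];
                buf[0] = start;
                for (int i = 0; i < CHUNK; i++)
                    buf[i + 1] = (double)(start + i) * (start + i);
                MPI_Send(buf, CHUNK + 1, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
            }
        }
    }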
2007
OpenMP has gained wide popularity as an API for parallel programming on shared memory and distributed shared memory platforms. It is also a promising candidate to exploit the emerging multicore, multithreaded processors. In addition, there is an increasing trend to combine OpenMP with MPI to take full advantage of mainstream supercomputers consisting of clustered SMPs. All of these require that attention be paid to the quality of the compiler's translation of OpenMP and the flexibility of runtime support. Many compilers and runtime libraries have an internal cost model that helps evaluate compiler transformations, guides adaptive runtime systems, and helps achieve load balancing. But existing models are not sufficient to support OpenMP, especially on new platforms. In this paper, we present our experience adapting the cost models in OpenUH, a branch of Open64, to estimate the execution cycles of parallel OpenMP regions using knowledge of both software and hardware. Our OpenMP cost model reuses major components from Open64, along with extensions to consider more OpenMP details. Preliminary evaluations of the model are presented using kernel benchmarks. The challenges and possible extensions for modeling OpenMP on multicore platforms are also discussed.
International Journal of Parallel Programming, 2010
This paper advances the state of the art in programming models for exploiting task-level parallelism on heterogeneous many-core systems, presenting a number of extensions to the OpenMP language inspired by the StarSs programming model. The proposed extensions allow the programmer to easily write portable code for a number of different platforms, relieving them of developing the specific code to off-load tasks to the accelerators and to synchronize tasks. Our results obtained from the StarSs instantiations for SMPs, the Cell, and GPUs report reasonable parallel performance. However, the real impact of our approach is in the productivity gains it yields for the programmer.
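Standard OpenMP later gained task dependences that serve a purpose similar to the StarSs-inspired extensions discussed here; the fragment below uses those standard depend clauses purely as an illustration of dataflow-style tasking, not the paper's proposed syntax.

    /* Dataflow-style tasking shown with standard OpenMP depend clauses, as an
     * illustration only; the paper's StarSs-inspired syntax differs. */
    #include <stdio.h>

    int main(void)
    {
        int x = 0, y = 0;

        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task depend(out: x)               /* producer task       */
            x = 41;

            #pragma omp task depend(in: x) depend(out: y) /* runs after producer */
            y = x + 1;

            #pragma omp taskwait
            printf("y = %d\n", y);
        }
        return 0;
    }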
Lecture Notes in Computer Science, 2001
The recent parallel language standard for shared-memory multiprocessor (SMP) machines, OpenMP, promises a simple and portable interface for programmers who wish to exploit parallelism explicitly. In this paper, we present our effort to develop portable compilers for the OpenMP parallel directive language. Our compiler consists of two parts. Part one is an OpenMP parallelizer, which transforms sequential programs into OpenMP. Part two transforms programs written in OpenMP into thread-based form and links them with our runtime library. Both compilers are built on the Polaris compiler infrastructure. We present performance measurements showing that our compiler yields results comparable to those of commercial OpenMP compilers. Our infrastructure is freely available with the intent to enable research projects on OpenMP-related language development and compiler techniques.
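A hand-written sketch of the thread-based form such a translation produces is shown below: the body of a parallel region is outlined into a function and handed to a fork/join entry point. The runtime call and its name are invented for illustration; real OpenMP runtimes use their own internal interfaces.

    /* Hand-written sketch of thread-based output for a parallel region; the
     * fork/join entry point and its name are invented for illustration. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    /* The body of "#pragma omp parallel { ... }" outlined into a microtask. */
    static void *outlined_region(void *arg)
    {
        int tid = *(int *)arg;
        printf("hello from outlined region, thread %d\n", tid);
        return NULL;
    }

    /* Stand-in for the runtime call the compiler emits in place of the region. */
    static void runtime_fork_join(void *(*microtask)(void *))
    {
        pthread_t th[NTHREADS];
        int ids[NTHREADS];

        for (int i = 0; i < NTHREADS; i++) {
            ids[i] = i;
            pthread_create(&th[i], NULL, microtask, &ids[i]);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(th[i], NULL);    /* implicit barrier at the region end */
    }

    int main(void)
    {
        runtime_fork_join(outlined_region);
        return 0;
    }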
2002
The scalability of an OpenMP program on a ccNUMA system with a large number of processors suffers from remote memory accesses, cache misses and false sharing. Good data locality is needed to overcome these problems, whereas OpenMP offers limited capabilities to control it on ccNUMA architectures. A so-called SPMD-style OpenMP program can achieve data locality by means of array privatization, and this approach has shown good performance in previous research. It is hard to write SPMD OpenMP code; therefore we are building a tool to relieve users of this task by automatically converting OpenMP programs into equivalent SPMD-style OpenMP. We describe the translation process by considering how to modify array declarations and parallel loops, and by showing how to handle a variety of OpenMP constructs, including REDUCTION and ORDERED clauses and synchronization. We are currently implementing these translations in an interactive tool based on the Open64 compiler.
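The contrast below, written for illustration and not produced by the described tool, shows the same computation in loop-level style and in SPMD style with a privatized block of the array; it assumes the array size divides evenly among the threads.

    /* Illustrative contrast (not the tool's output): loop-level style versus
     * SPMD style with a privatized block; assumes N is a multiple of the
     * thread count. */
    #include <omp.h>
    #include <string.h>

    #define N 1024
    double a[N], b[N];

    void loop_level(void)
    {
        #pragma omp parallel for      /* shared arrays, iterations split per loop */
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * b[i];
    }

    void spmd_style(void)
    {
        #pragma omp parallel          /* one long-lived region, explicit blocking */
        {
            int tid   = omp_get_thread_num();
            int nth   = omp_get_num_threads();
            int chunk = N / nth;
            int lo    = tid * chunk;

            double priv[chunk];       /* privatized copy of this thread's block */
            memcpy(priv, b + lo, chunk * sizeof(double));

            for (int i = 0; i < chunk; i++)
                priv[i] = 2.0 * priv[i];   /* all accesses stay thread-local */

            memcpy(a + lo, priv, chunk * sizeof(double));
        }
    }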
Concurrency and Computation: Practice and Experience, 2005
When using a shared memory multiprocessor, the programmer faces the issue of selecting the portable programming model which will provide the best performance. Even if they restrict their choice to the standard programming environments (MPI and OpenMP), they have to select a programming approach among MPI and the variety of OpenMP programming styles. To help the programmer in their decision, we compare MPI with three OpenMP programming styles (loop level, loop level with large parallel sections, SPMD) using a subset of the NAS benchmark (CG, MG, FT, LU), two dataset sizes (A and B), and two shared memory multiprocessors (IBM SP3 NightHawk II, SGI Origin 3800). We have developed the first SPMD OpenMP version of the NAS benchmark and gathered other OpenMP versions from independent sources (PBN, SDSC and RWCP). Experimental results demonstrate that OpenMP provides competitive performance compared with MPI for a large set of experimental conditions. Not surprisingly, the two best OpenMP versions are those requiring the strongest programming effort. MPI still provides the best performance under some conditions. We present breakdowns of the execution times and measurements of hardware performance counters to explain the performance differences.