2009
Most client-side applications running on multicore processors are likely to be irregular programs that deal with complex, pointer-based data structures such as large sparse graphs and trees. However, we understand very little about the nature of parallelism in irregular algorithms, let alone how to exploit it effectively on multicore processors.
2009
Until recently, parallel programming has largely focused on the exploitation of data-parallelism in dense matrix programs. However, many important application domains, including meshing, clustering, simulation, and machine learning, have very different algorithmic foundations: they require building, computing with, and modifying large sparse graphs. In the parallel programming literature, these types of applications are usually classified as irregular applications, and relatively little attention has been paid to them. To study and understand the patterns of parallelism and locality in sparse graph computations better, we are in the process of building the Lonestar benchmark suite. In this paper, we characterize the first five programs from this suite, which target domains like data mining, survey propagation, and design automation. We show that even such irregular applications often expose large amounts of parallelism in the form of amorphous data-parallelism. Our speedup numbers demonstrate that this new type of parallelism can successfully be exploited on modern multi-core machines.
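Concretely, amorphous data-parallelism takes the shape of a worklist loop in which processing one active element touches only a small neighborhood of a graph and may create new active elements; a runtime can then execute non-overlapping activities concurrently. The following is a minimal sequential sketch of that pattern, using label-propagation connected components as an illustrative example (the graph representation and the operator are assumptions for illustration, not code from the Lonestar suite):

    #include <deque>
    #include <vector>
    #include <cstddef>

    // Illustrative worklist loop: label-propagation connected components.
    // Processing one active node touches only its neighborhood and may
    // activate neighbors, which is the amorphous data-parallel pattern.
    void connected_components(const std::vector<std::vector<std::size_t>>& adj,
                              std::vector<std::size_t>& label) {
        label.resize(adj.size());
        std::deque<std::size_t> worklist;
        for (std::size_t v = 0; v < adj.size(); ++v) {
            label[v] = v;                      // every node starts as its own component
            worklist.push_back(v);             // and starts out active
        }
        while (!worklist.empty()) {            // a runtime such as Galois would hand
            std::size_t u = worklist.front();  // independent items to different threads
            worklist.pop_front();
            for (std::size_t v : adj[u]) {
                if (label[u] < label[v]) {     // propagate the smaller label
                    label[v] = label[u];
                    worklist.push_back(v);     // newly created work
                }
            }
        }
    }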
2009
Irregular programs are programs organized around pointer-based data structures such as trees and graphs. Recent investigations by the Galois project have shown that many irregular programs have a generalized form of data-parallelism called amorphous data-parallelism. However, in many programs, amorphous data-parallelism cannot be uncovered using static techniques, and its exploitation requires runtime strategies such as optimistic parallel execution. This raises a natural question: how much amorphous data-parallelism actually exists in irregular programs?
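One way to make this question concrete, under simplifying assumptions, is to replay a fixed worklist and greedily schedule it in rounds: each round executes a maximal set of active elements whose neighborhoods do not overlap, and the number of elements executed per round is a rough measure of the parallelism available at that point. The sketch below is an illustration of that idea, not the measurement tool used by the Galois project; the neighborhood function is a placeholder supplied by the caller, and newly created work is ignored for simplicity.

    #include <cstddef>
    #include <deque>
    #include <unordered_set>
    #include <vector>

    // Hypothetical profiling sketch: how many active elements could run per round
    // if only neighborhood overlaps (not machine limits) constrained execution?
    // neighborhood(id) must return the node ids that processing 'id' would touch.
    template <typename NeighborhoodFn>
    std::vector<std::size_t> parallelism_profile(std::deque<std::size_t> worklist,
                                                 NeighborhoodFn neighborhood) {
        std::vector<std::size_t> executed_per_round;
        while (!worklist.empty()) {
            std::unordered_set<std::size_t> locked;   // nodes claimed in this round
            std::deque<std::size_t> deferred;
            std::size_t executed = 0;
            for (std::size_t item : worklist) {
                auto touched = neighborhood(item);
                bool conflict = false;
                for (std::size_t n : touched)
                    if (locked.count(n)) { conflict = true; break; }
                if (conflict) {
                    deferred.push_back(item);         // retry in a later round
                } else {
                    for (std::size_t n : touched) locked.insert(n);
                    ++executed;                       // "runs" in this round
                }
            }
            executed_per_round.push_back(executed);
            worklist.swap(deferred);
        }
        return executed_per_round;
    }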
2010
Irregular algorithms are organized around pointer-based data structures such as graphs and trees, and they are ubiquitous in applications. Recent work by the Galois project has provided a systematic approach for parallelizing irregular applications based on the idea of optimistic or speculative execution of programs. However, the overhead of optimistic parallel execution can be substantial. In this paper, we show that many irregular algorithms have structure that can be exploited and present three key optimizations that take advantage of algorithmic structure to reduce speculative overheads. We describe the implementation of these optimizations in the Galois system and present experimental results to demonstrate their benefits. To the best of our knowledge, this is the first system to exploit algorithmic structure to optimize the execution of irregular programs.
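One frequently cited kind of algorithmic structure is a cautious operator, i.e., one that reads its entire neighborhood before writing any part of it. Under that assumption, optimistic execution reduces to acquiring per-node locks during the read phase: once all locks are held, the writes can no longer be rolled back, so no undo log is needed. The sketch below illustrates the principle with ordinary mutexes; it is not the Galois implementation.

    #include <mutex>
    #include <vector>
    #include <algorithm>

    // Sketch: executing a "cautious" operator without undo logs.
    // Assumption: the operator reads every node it will later write before
    // performing any write, so all locks can be taken up front.
    struct Node { std::mutex lock; int value = 0; };

    bool try_execute_cautious(std::vector<Node*>& neighborhood) {
        // Acquire locks in a canonical (address) order to avoid deadlock.
        std::sort(neighborhood.begin(), neighborhood.end());
        std::vector<std::unique_lock<std::mutex>> held;
        for (Node* n : neighborhood) {
            std::unique_lock<std::mutex> l(n->lock, std::try_to_lock);
            if (!l.owns_lock()) return false;   // conflict: caller retries later
            held.push_back(std::move(l));
        }
        // Write phase: conflicts are now impossible, so no rollback state is kept.
        for (Node* n : neighborhood) n->value += 1;
        return true;                            // locks released when 'held' is destroyed
    }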
International Journal of High Speed Computing, 1994
Advances in parallel computing have made it clear that the ability to express computations in a machine-independent manner and the ability to handle dynamic and irregular computations are two necessary features of future programming systems. In this ...
Parallel Computing, 2005
Current compilers prove ineffective when optimizing complex applications, both in analyzing dependences and in exploiting data locality and extracting parallelism. Complex applications may be characterized as irregular and dynamic. Irregular applications arrange data as multidimensional arrays, and memory is referenced through array indirections. Dynamic applications organize data as pointer-based structures (lists, trees, etc.), and memory is referenced through pointers. In this paper we discuss a methodology we designed to develop efficient parallelization techniques for irregular and dynamic applications, which proceeds in three stages: recognizing the complex program structure, data analysis, and program parallelization based on code/data transformations. Two case examples are analyzed in detail in the context of this methodology: irregular reductions and shape analysis for dynamic data structures.
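For the irregular-reduction case study mentioned above, a common parallelization is privatization: each thread accumulates into its own copy of the reduction array, and the copies are combined at the end. A minimal OpenMP sketch under that assumption (compile with -fopenmp):

    #include <vector>
    #include <cstddef>

    // Irregular reduction: hist[idx[i]] += val[i], where idx is an indirection array.
    // Privatization gives each thread its own histogram, avoiding atomic updates.
    std::vector<double> irregular_reduction(const std::vector<std::size_t>& idx,
                                            const std::vector<double>& val,
                                            std::size_t bins) {
        std::vector<double> hist(bins, 0.0);
        #pragma omp parallel
        {
            std::vector<double> local(bins, 0.0);          // private copy per thread
            #pragma omp for nowait
            for (long i = 0; i < static_cast<long>(idx.size()); ++i)
                local[idx[i]] += val[i];
            #pragma omp critical                           // combine partial results
            for (std::size_t b = 0; b < bins; ++b)
                hist[b] += local[b];
        }
        return hist;
    }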
Memory bounds for different betweenness centrality algorithms. P denotes the number of cores used by each algorithm. K denotes the number of roots used in the approximation. The dynamic algorithms do not depend on the number of cores, as their data structures are maintained as part of the computation.
2000
Traditional parallel compilers do not effectively parallelize irregular applications because they contain little loop-level parallelism. We explore Speculative Task Parallelism (STP), where tasks are full procedures and entire natural loops. Through profiling and compiler analysis, we find tasks that are speculatively memory- and control-independent of their neighboring code. Via speculative futures, these tasks may be executed in parallel with preceding code when there is a high probability of independence.
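The futures idea can be illustrated without speculation support: when analysis or profiling indicates that a task is almost certainly independent of the code that precedes its original call site, it can be spawned early and joined at the first use of its result. The sketch below uses a plain std::async future and simply assumes independence; real STP additionally detects and squashes mis-speculated tasks.

    #include <future>
    #include <functional>
    #include <vector>
    #include <numeric>

    // Illustrative task: by assumption, independent of the code it overlaps with.
    int expensive_task(const std::vector<int>& v) {
        return std::accumulate(v.begin(), v.end(), 0);
    }

    int pipeline(std::vector<int> data) {
        // Spawn the task early so it runs concurrently with the preceding code.
        auto fut = std::async(std::launch::async, expensive_task, std::cref(data));
        // ... preceding code that (by assumption) does not touch 'data' ...
        int other_work = 42;
        return other_work + fut.get();   // join at the first real use of the result
    }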
Fine-grained data parallelism is increasingly common in mainstream processors in the form of longer vectors and on-chip GPUs. This paper develops support for exploiting such data parallelism for a class of non-numeric, non-graphic applications, which perform computations while traversing many independent, irregular data structures. While the traversal of any one irregular data structure does not give opportunity for parallelization, traversing a set of these does. However, mapping such parallelism to SIMD units is non-trivial and not addressed in prior work. We address this problem by developing an intermediate language for specifying such traversals, followed by a run-time scheduler that maps traversals to SIMD units. A key idea in our run-time scheme is converting branches to arithmetic operations, which then allows us to use SIMD hardware. In order to make our approach fast, we demonstrate several optimizations including a stream compaction method that aids with control flow in SIMD, a set of layouts that reduce memory latency, and a tiling approach that enables more effective prefetching. Using our approach, we demonstrate significant increases in single-core performance over optimized baselines for two applications.
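The branch-to-arithmetic conversion can be illustrated with a batch of lookups, one per SIMD lane, into independent flattened trees: instead of branching left or right, each lane indexes a child array with the 0/1 result of a comparison, so all lanes follow the same instruction stream. The node layout and fixed-depth loop below are illustrative assumptions, not the paper's intermediate language or scheduler.

    #include <vector>
    #include <cstddef>

    // Each tree is a flat segment of 'nodes'; every query (one per lane) starts at
    // its own root and takes exactly 'depth' steps, so the inner loop is branch-free
    // and amenable to vectorization.
    struct FlatNode {
        float threshold;
        int   child[2];      // indices of the left/right children in 'nodes'
        float leaf_value;    // meaningful at the final level
    };

    void batch_lookup(const std::vector<FlatNode>& nodes,
                      const std::vector<int>& roots,     // one independent tree per lane
                      int depth,
                      const std::vector<float>& queries,
                      std::vector<float>& out) {
        out.resize(queries.size());
        for (std::size_t lane = 0; lane < queries.size(); ++lane) {
            int cur = roots[lane];
            for (int d = 0; d < depth; ++d) {
                // Branch converted to arithmetic: the comparison result (0 or 1)
                // selects the child index; no data-dependent control flow.
                int go_right = queries[lane] > nodes[cur].threshold;
                cur = nodes[cur].child[go_right];
            }
            out[lane] = nodes[cur].leaf_value;
        }
    }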
Proceedings of the 48th International Symposium on Microarchitecture, 2015
We present Swarm, a novel architecture that exploits ordered irregular parallelism, which is abundant but hard to mine with current software and hardware techniques. In this architecture, programs consist of short tasks with programmer-specified timestamps. Swarm executes tasks speculatively and out of order, and efficiently speculates thousands of tasks ahead of the earliest active task to uncover ordered parallelism. Swarm builds on prior TLS and HTM schemes, and contributes several new techniques that allow it to scale to large core counts and speculation windows, including a new execution model, speculation-aware hardware task management, selective aborts, and scalable ordered commits. We evaluate Swarm on graph analytics, simulation, and database benchmarks. At 64 cores, Swarm achieves 51-122× speedups over a single-core system, and outperforms software-only parallel algorithms by 3-18×.
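Swarm's programming model of timestamped tasks can be illustrated with shortest paths, where a task's timestamp is the tentative distance of the node it relaxes. The sketch below is a sequential stand-in that drains a priority queue in timestamp order; the point of the architecture is that hardware would instead execute many such tasks speculatively, far ahead of the earliest active one, and commit them in timestamp order.

    #include <queue>
    #include <vector>
    #include <limits>
    #include <functional>
    #include <utility>
    #include <cstddef>

    // Tasks ordered by timestamp: (timestamp = tentative distance, node to relax).
    using Task = std::pair<int, std::size_t>;
    struct Edge { std::size_t to; int weight; };

    std::vector<int> ordered_sssp(const std::vector<std::vector<Edge>>& g, std::size_t src) {
        std::vector<int> dist(g.size(), std::numeric_limits<int>::max());
        std::priority_queue<Task, std::vector<Task>, std::greater<Task>> tasks;
        dist[src] = 0;
        tasks.push({0, src});
        while (!tasks.empty()) {
            auto [ts, u] = tasks.top();            // earliest-timestamp task
            tasks.pop();
            if (ts != dist[u]) continue;           // stale task, skip
            for (const Edge& e : g[u])
                if (dist[u] + e.weight < dist[e.to]) {
                    dist[e.to] = dist[u] + e.weight;
                    tasks.push({dist[e.to], e.to}); // child task with a later timestamp
                }
        }
        return dist;
    }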
1994
This paper describes how a runtime support library can be used as compiler runtime support in irregular applications. The CHAOS runtime support library carries out optimizations designed to reduce communication costs by performing software caching, communication coalescing, and inspector-executor preprocessing. CHAOS also supplies special-purpose routines to support specific types of irregular reduction and runtime support for partitioning data and work between processors. A number of adaptive irregular codes have been parallelized using the CHAOS library, and performance results from these codes are also presented in this paper.
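The inspector-executor preprocessing mentioned here splits an irregular loop into two phases: an inspector scans the indirection array once and builds a reusable gather schedule for the non-local elements, and an executor applies that schedule on every time step until the access pattern changes. The following is a simplified, single-process sketch of the pattern; the types and functions are illustrative and are not the CHAOS API.

    #include <vector>
    #include <unordered_map>
    #include <cstddef>

    // Simplified inspector-executor sketch for y[i] += x[idx[i]], where some idx[i]
    // refer to "remote" elements owned by another processor.
    struct Schedule {
        std::vector<std::size_t> remote_ids;    // unique remote indices to fetch
        std::vector<std::size_t> local_slot;    // for each iteration, where to read
    };

    // Inspector: run once per indirection pattern.
    Schedule inspect(const std::vector<std::size_t>& idx, std::size_t num_local) {
        Schedule s;
        std::unordered_map<std::size_t, std::size_t> remote_pos;
        for (std::size_t i : idx) {
            if (i < num_local) {
                s.local_slot.push_back(i);                       // local access
            } else {
                auto [it, inserted] = remote_pos.try_emplace(i, s.remote_ids.size());
                if (inserted) s.remote_ids.push_back(i);         // fetch it only once
                s.local_slot.push_back(num_local + it->second);  // ghost-cell slot
            }
        }
        return s;
    }

    // Executor: reused every time step while the indirection pattern is unchanged.
    // Assumes x[0 .. num_local) already holds this process's own values and that
    // y.size() == s.local_slot.size(); the gather stands in for communication.
    void execute(const Schedule& s, std::vector<double>& x,
                 const std::vector<double>& global_x, std::size_t num_local,
                 std::vector<double>& y) {
        x.resize(num_local + s.remote_ids.size());
        for (std::size_t k = 0; k < s.remote_ids.size(); ++k)
            x[num_local + k] = global_x[s.remote_ids[k]];        // fill ghost cells
        for (std::size_t i = 0; i < s.local_slot.size(); ++i)
            y[i] += x[s.local_slot[i]];
    }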
Lecture Notes in Computer Science, 2008
Irregular applications, i.e., programs that manipulate pointer-based data structures such as graphs and trees, constitute a challenging target for parallelization because the amount of parallelism is input dependent and changes dynamically. Traditional dependence analysis techniques are too conservative to expose this parallelism. Even manual parallelization is difficult, time consuming, and error prone. The Galois system parallelizes such applications using an optimistic approach that exploits higher-level semantics of abstract data types.
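Exploiting higher-level semantics means detecting conflicts between speculative iterations at the level of method calls on abstract data types rather than at the level of memory words: two calls conflict only if they do not commute. A toy sketch of such a check for a set ADT (illustrative only, not the Galois conflict detector):

    // Toy semantic conflict check for a set ADT: two method calls issued by
    // different speculative iterations conflict only if they do not commute.
    enum class Op { Insert, Contains };
    struct Call { Op op; int key; };

    bool commutes(const Call& a, const Call& b) {
        if (a.key != b.key) return true;                     // distinct keys always commute
        return a.op == Op::Contains && b.op == Op::Contains; // same key: only reads commute
    }

    bool conflict(const Call& a, const Call& b) { return !commutes(a, b); }

    // Memory-word-level detection would flag insert(3) vs insert(7) as a conflict
    // whenever they happen to touch the same internal nodes of the set; the
    // ADT-level check above does not, exposing more parallelism.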
Sigplan Notices, 2007
Irregular applications, which manipulate large, pointer-based data structures like graphs, are difficult to parallelize manually. Automatic tools and techniques such as restructuring compilers and runtime speculative execution have failed to uncover much parallelism in these applications, in spite of a lot of effort by the research community. These difficulties have even led some researchers to wonder if there is any coarse-grain parallelism worth exploiting in irregular applications.
Parallel Computing, 2000
Parallel computers are present in a variety of fields, having reached a high degree of architectural maturity. However, there is still a lack of convenient software support for implementing efficient parallel applications. This is especially true for the class of irregular applications, whose computational constructs hardly fit current parallel architectures. In fact, contemporary automatic parallelizers produce, in general, poor parallel code from these applications. This paper discusses techniques and methods to help improve the quality of automatic parallel programs. We focus on two issues: parallelism detection and parallelism implementation. The first issue refers to the detection of specific irregular computation constructs or data access patterns. The second issue considers the case that some frequent construct has been detected but has been sub-optimally parallelized. Both issues are dealt with in depth and in the context of sparse computations (for the first issue) and irregular histogram reductions (for the second issue).
2011
For more than thirty years, the parallel programming community has used the dependence graph as the main abstraction for reasoning about and exploiting parallelism in "regular" algorithms that use dense arrays, such as finite-differences and FFTs. In this paper, we argue that the dependence graph is not a suitable abstraction for algorithms in new application areas like machine learning and network analysis in which the key data structures are "irregular" data structures like graphs, trees, and sets.
Lecture Notes in Computer Science, 1996
Data-parallel languages support a single instruction flow; the parallelism is expressed at the instruction level. In practice, data-parallel languages have chosen arrays to support the parallelism. This regular data structure allows a natural development of regular parallel algorithms. The implementation of irregular algorithms requires a programming effort to project the irregular data structures onto regular structures. In this article we present the different techniques used to manage the irregularity in data-parallel languages. Each of them is illustrated with standard or experimental data-parallel language constructions.
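A concrete instance of projecting an irregular structure onto a regular one is storing a graph in compressed sparse row (CSR) form: two flat arrays replace pointer-based adjacency lists, so a data-parallel operation over vertices or edges becomes a regular loop over array segments. A small sketch of the layout (written in C++ purely for illustration, not in a data-parallel language):

    #include <vector>
    #include <cstddef>

    // CSR layout: the edges of vertex v occupy neighbors[row_ptr[v] .. row_ptr[v+1]).
    struct CsrGraph {
        std::vector<std::size_t> row_ptr;     // size = num_vertices + 1
        std::vector<std::size_t> neighbors;   // size = num_edges
    };

    CsrGraph to_csr(const std::vector<std::vector<std::size_t>>& adj) {
        CsrGraph g;
        g.row_ptr.push_back(0);
        for (const auto& nbrs : adj) {
            g.neighbors.insert(g.neighbors.end(), nbrs.begin(), nbrs.end());
            g.row_ptr.push_back(g.neighbors.size());
        }
        return g;
    }

    // A "regular" data-parallel step: the out-degree of every vertex in one flat loop.
    std::vector<std::size_t> out_degrees(const CsrGraph& g) {
        std::vector<std::size_t> deg(g.row_ptr.size() - 1);
        for (std::size_t v = 0; v + 1 < g.row_ptr.size(); ++v)
            deg[v] = g.row_ptr[v + 1] - g.row_ptr[v];
        return deg;
    }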
1993
Parallelization of any large application can be a difficult task, but when the application contains irregular patterns of communication and control, the parallelization effort is higher and the likelihood of producing an efficient implementation is lower. The kinds of data structures that appear in irregular applications, for example, trees, graphs, and sets, do not have simple mappings onto distributed memory machines. We are building a library of such distributed data structures that use a combination of replication and partitioning to achieve high performance. Operations on these structures cannot be efficiently implemented as atomic operations, because of the latency of interprocessor communication. We propose a relaxed consistency model for these data structures, which is analogous to a weak consistency model on shared memory. This allows for clean, simple interfaces on the objects, but admits low-latency, high-throughput implementations. We demonstrate these ideas using a few data structure examples, each of which is being used in at least one application.
2002
promises several orders of magnitude increase in our ability to solve realistic computationally-intensive problems, but relies on their efficient mapping and execution on large-
Lecture Notes in Computer Science, 1998
High Performance Fortran (HPF) is the de facto standard language for writing data parallel programs. In case of applications that use indirect addressing on distributed arrays, HPF compilers have limited capabilities for optimizing such codes on distributed memory architectures, especially for optimizing communication and reusing communication schedules between subroutine boundaries. This paper describes a dynamic approach for optimizing unstructured communication in codes with indirect addressing. The basic idea is that runtime data reflecting the communication patterns will be reused if possible. The user has only to specify which data in the program has to be traced for modifications. The experiments and results show the effectiveness of the chosen approach.
2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, 2015
Although skeletons largely facilitate the parallelization of algorithms, they often provide little support for the work decomposition. Also, while they have been widely applied to regular computations, this has not been the case for irregular algorithms that can exploit amorphous data-parallelism, whose parallelization in fact requires much more effort from programmers and thus benefits more from a structured approach. In this paper we improve and evaluate the configurability of a recently proposed skeleton that allows this latter kind of algorithm to be parallelized. Namely, the skeleton makes it easy to change critical details such as the data structures, the work partitioning algorithm, or the task granularity to use. The simple procedures to choose among these possibilities and their influence on performance are described and evaluated. We conclude that the skeleton makes it convenient to explore different possibilities for the parallelization of irregular applications, which can result in substantial performance improvements.
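As an illustration of this kind of configurability, a skeleton can expose the worklist container as a template parameter and the task granularity as a run-time parameter, while the algorithm itself is supplied as an operator that may create new work. The interface below is hypothetical and sequential, meant only to show the configuration points, not the skeleton evaluated in the paper.

    #include <deque>
    #include <vector>
    #include <utility>
    #include <cstddef>

    // Hypothetical amorphous data-parallel skeleton: the worklist container and the
    // chunk size (task granularity) are configuration points; the operator is a
    // callable that may push new work. Shown sequentially for clarity.
    template <typename Item, typename Worklist = std::deque<Item>>
    struct ParallelWorklistSkeleton {
        std::size_t chunk_size = 16;    // granularity of the tasks handed to workers

        template <typename Operator>
        void run(Worklist initial, Operator op) {
            Worklist wl = std::move(initial);
            while (!wl.empty()) {
                std::vector<Item> chunk;              // a chunk would become one task
                while (!wl.empty() && chunk.size() < chunk_size) {
                    chunk.push_back(wl.front());
                    wl.pop_front();
                }
                for (Item& it : chunk)
                    op(it, [&wl](Item created) { wl.push_back(created); });
            }
        }
    };

Swapping std::deque for a chunked or priority-ordered container with the same front/push interface would change the scheduling policy without touching the operator, which is the sort of configurability the abstract describes.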