Academia.edu no longer supports Internet Explorer.
To browse Academia.edu and the wider internet faster and more securely, please take a few seconds to upgrade your browser.
2010
Computations on unstructured graphs are challenging to parallelize because dependences in the underlying algorithms are usually complex functions of runtime data values, thwarting static parallelization. One promising general-purpose parallelization strategy for these algorithms is optimistic parallelization.
ACM SIGPLAN Notices, 2012
ABSTRACT Algorithms in new application areas like machine learning and network analysis use "irregular" data structures such as graphs, trees and sets. Writing efficient parallel code in these problem domains is very challenging because it requires the programmer to make many choices: a given problem can usually be solved by several algorithms, each algorithm may have many implementations, and the best choice of algorithm and implementation can depend not only on the characteristics of the parallel platform but also on properties of the input data such as the structure of the graph. One solution is to permit the application programmer to experiment with different algorithms and implementations without writing every variant from scratch. Auto-tuning to find the best variant is a more ambitious solution. These solutions require a system for automatically producing efficient parallel implementations from high-level specifications. Elixir, the system described in this paper, is the first step towards this ambitious goal. Application programmers write specifications that consist of an operator, which describes the computations to be performed, and a schedule for performing these computations. Elixir uses sophisticated inference techniques to produce efficient parallel code from such specifications. We used Elixir to automatically generate many parallel implementations for three irregular problems: breadth-first search, single source shortest path, and betweenness-centrality computation. Our experiments show that the best generated variants can be competitive with handwritten code for these problems from other research groups; for some inputs, they even outperform the handwritten versions.
Combinatorial problems such as those from graph theory pose serious challenges for parallel machines due to non-contiguous, concurrent accesses to global data structures with low degrees of locality. The hierarchical memory systems of symmetric multiprocessor (SMP) clusters optimize for local, contiguous memory accesses, and so are inefficient platforms for such algorithms. Few parallel graph algorithms outperform their best sequential implementation on SMP clusters due to long memory latencies and high synchronization costs. In this paper, we consider the performance and scalability of two graph algorithms, list ranking and connected components, on two classes of sharedmemory computers: symmetric multiprocessors such as the Sun Enterprise servers and multithreaded architectures (MTA) such as the Cray MTA-2. While previous studies have shown that parallel graph algorithms can speedup on SMPs, the systems' reliance on cache microprocessors limits performance. The MTA's latency tolerant processors and hardware support for fine-grain synchronization makes performance a function of parallelism. Since parallel graph algorithms have an abundance of parallelism, they perform and scale significantly better on the MTA. We describe and give a performance model for each architecture. We analyze the performance of the two algorithms and discuss how the features of each architecture affects algorithm development, ease of programming, performance, and scalability.
Sigplan Notices, 2007
Irregular applications, which manipulate large, pointer-based data structures like graphs, are difficult to parallelize manually. Automatic tools and techniques such as restructuring compilers and runtime speculative execution have failed to uncover much parallelism in these applications, in spite of a lot of effort by the research community. These difficulties have even led some researchers to wonder if there is any coarse-grain parallelism worth exploiting in irregular applications.
We describe a system that uses automated planning to synthesize correct and efficient parallel graph programs from high-level algorithmic specifications. Automated planning allows us to use constraints to declaratively encode program transformations such as scheduling, implementation selection, and insertion of synchronization. Each plan emitted by the planner satisfies all constraints simultaneously, and corresponds to a composition of these transformations. In this way, we obtain an integrated compilation approach for a very challenging problem domain. We have used this system to synthesize parallel programs for four graph problems: triangle counting, maximal independent set computation, preflow-push maxflow, and connected components. Experiments on a variety of inputs show that the synthesized implementations perform competitively with handwritten , highly-tuned code.
The focus of traditional scientific computing has been in solving large systems of PDEs (and the corresponding linear algebra problems that they induce). Hardware architectures, computer systems, and software platforms have evolved together to efficiently support solving these kinds of problems. Similar attention has not been devoted to solving large-scale graph problems. Recently this class of applications has seen increased attention. The irregular, nonlocal, and dynamic characteristics of these problems require new programming techniques to adapt them to modern HPC systems offering multiple levels of parallelism. We describe a library for implementing graph algorithms based on asynchronous execution of fine-grained, concurrent operations. Prototype implementations of two graph kernels which combine lightweight graph metadata transactions with generalized active messages demonstrate that it is possible to implement graph applications which efficiently leverage both shared-and distributed-memory parallelism.
Parallel Processing …, 2007
Graph algorithms are becoming increasingly important for solving many problems in scientific computing, data mining and other domains. As these problems grow in scale, parallel computing resources are required to meet their computational and memory requirements. Unfortunately, the algorithms, software, and hardware that have worked well for developing mainstream parallel scientific applications are not necessarily effective for large-scale graph problems. In this paper we present the interrelationships between graph problems, software, and parallel hardware in the current state of the art and discuss how those issues present inherent challenges in solving large-scale graph problems. The range of these challenges suggests a research agenda for the development of scalable high-performance software for graph problems.
Science Journal of University of Zakho, 2021
Programs that manipulate heaps such as singlylinked lists, doublylinked lists, skiplists, and treesare ubiquitous, and hence ensuring their correctness is of utmost importance. Analysing correctness properties for such programs is not trivial since they induce dynamic data structures, leading to unbounded state spaces with intricate patterns. One approach that has been adopted to tackle this problem is the use of symbolic searching techniques. The state space is encoded using graphs where the nodes represent memory cells, and the edges represent pointers between the cells. It is necessary to prune the search to avoid generating massive numbers of graphs, thus making the procedure unpractical. Pruning strategies are defined based on operations such as graph matching and inclusion. In this paper, a set of algorithms for performing these operations are presented. It is demonstrated that the proposed algorithms can handle typical graphs that arise in the ver...
2009
Irregular applications, which manipulate complex, pointer-based data structures, are a promising target for parallelization. Recent studies have shown that these programs exhibit a kind of parallelism called amorphous data-parallelism. Prior approaches to parallelizing these applications, such as thread-level speculation and transactional memory, often obscure parallelism because they do not distinguish between the concrete representation of a data structure and its semantic state; they conflate metadata and data.
Lecture Notes in Computer Science, 1991
Graph rewriting models are very suited to serve as the basic computational model for functional languages and their implementation. Graphs are used to share computations which is needed to make efficient implementations of functional languages on sequential hardware possible. When graphs are rewritten (reduced) on parallel loosely coupled machine architectures, subgraphs have to be copied from one processor to another such that sharing is lost. In this paper we introduce the notion of lazy copying. With lazy copying it is possible to duplicate a graph without duplicating work. Lazy copying can be combined with simple mmotations which control the order of reduction. In principle, only interleaved execution of the individual reduction steps is possible. However, a condition is deduced under which parallel execution is allowed. When only certain combinations of lazy copying and annotations are used it is guarantied that this so-called non-interference condition is fulfilled. Abbreviations for these combinations are introduced. Now complex process behavlours, such as process communication on a loosely coupled parallel machine architecture, can be modelled. This also includes a special case: modelling mnltlprocessing on a single processor. Arbitrary process topologies can be created. Synchronous and asyncbronons process communication can be modelled. The implementation of the language Concurrent Clean, which is based on the proposed graph rewriting model, has shown that complicated parallel algorithms which can go far beyond divide-and-conquar like applications can be expressed.
Extracting information from graphs, from finding shortest paths to complex graph mining, is essential for many ap-plications. Due to the shear size of modern graphs (e.g., social networks), processing must be done on large paral-lel computing infrastructures (e.g., the cloud). Earlier ap-proaches relied on the MapReduce framework, which was proved inadequate for graph algorithms. More recently, the message passing model (e.g., Pregel) has emerged. Although the Pregel model has many advantages, it is agnostic to the graph properties and the architecture of the underlying com-puting infrastructure, leading to suboptimal performance. In this paper, we propose Mizan, a layer between the users' code and the computing infrastructure. Mizan considers the structure of the input graph and the architecture of the in-frastructure in order to: (i) decide whether it is beneficial to generate a near-optimal partitioning of the graph in a pre-processing step, and (ii) choose between typical po...
International Journal of Computer and Information Engineering, 2021
Recently, graph-based computations have become more important in large-scale scientific computing as they can provide a methodology to model many types of relations between independent objects. They are being actively used in fields as varied as biology, social networks, cybersecurity, and computer networks. At the same time, graph problems have some properties such as irregularity and poor locality that make their performance different than regular applications performance. Therefore, parallelizing graph algorithms is a hard and challenging task. Initial evidence is that standard computer architectures do not perform very well on graph algorithms. Little is known exactly what causes this. The Graph500 benchmark is a representative application for parallel graph-based computations, which have highly irregular data access and are driven more by traversing connected data than by computation. In this paper, we present results from analyzing the performance of various example implementations of Graph500, including a shared memory (OpenMP) version, a distributed (MPI) version, and a hybrid version. We measured and analyzed all the factors that affect its performance in order to identify possible changes that would improve its performance. Results are discussed in relation to what factors contribute to performance degradation.
2017 IEEE International Conference on Cluster Computing (CLUSTER), 2017
The rapidly growing number of large network analysis problems has led to the emergence of many parallel and distributed graph processing systems-one survey in 2014 identified over 80. Since then, the landscape has evolved; some packages have become inactive while more are being developed. Determining the best approach for a given problem is infeasible for most developers. To enable easy, rigorous, and repeatable comparison of the capabilities of such systems, we present an approach and associated software for analyzing the performance and scalability of parallel, open-source graph libraries. We demonstrate our approach on five graph processing packages: Graph-Mat, the Graph500, the Graph Algorithm Platform Benchmark Suite, GraphBIG, and PowerGraph using synthetic and real-world datasets. We examine previously overlooked aspects of parallel graph processing performance such as phases of execution and energy usage for three algorithms: breadth first search, single source shortest paths, and PageRank and compare our results to Graphalytics.
2010
Irregular algorithms are organized around pointer-based data structures such as graphs and trees, and they are ubiquitous in applications. Recent work by the Galois project has provided a systematic approach for parallelizing irregular applications based on the idea of optimistic or speculative execution of programs. However, the overhead of optimistic parallel execution can be substantial. In this paper, we show that many irregular algorithms have structure that can be exploited and present three key optimizations that take advantage of algorithmic structure to reduce speculative overheads. We describe the implementation of these optimizations in the Galois system and present experimental results to demonstrate their benefits. To the best of our knowledge, this is the first system to exploit algorithmic structure to optimize the execution of irregular programs.
In this paper, we describe our experience with Autographa compiler for synthesizing concurrent graph data structure implementations in Java. The input to Autograph is a high-level declarative specification of an abstract data type (ADT), expressed in terms of relations and relational operations. The output is a concurrent implementation of that ADT with conflict detection and rollback baked in. Autograph also provides application programmers with a high-level constraint language for tuning the performance of the generated implementation. We present a case study in which we use Autograph to synthesize a parallel graph data structure for a parallel implementation of Boruvka's minimal spanning tree algorithm taken from the Lonestar benchmark suite. In particular, we discuss a number of techniques used to specialize the graph data structure implementation for the application and the performance benefit from these specializations. Our results show that these techniques improve overall performance by 38% when compared to a library implementation in the Lonestar benchmark suite, and produce an implementation that runs as fast as a handwritten version of this benchmark that uses fine-grain locking.
Communications in Computer and Information Science, 2017
Pattern matching on large graphs is the foundation for a variety of application domains. Strict latency requirements and continuously increasing graph sizes demand the usage of highly parallel in-memory graph processing engines that need to consider non-uniform memory access (NUMA) and concurrency issues to scale up on modern multiprocessor systems. To tackle these aspects, graph partitioning becomes increasingly important. Hence, we present a technique to process graph pattern matching on NUMA systems in this paper. As a scalable pattern matching processing infrastructure, we leverage a data-oriented architecture that preserves data locality and minimizes concurrency-related bottlenecks on NUMA systems. We show in detail, how graph pattern matching can be asynchronously processed on a multiprocessor system.
ACM SIGPLAN Notices, 2005
This paper describes the process used to extend the Boost Graph Library (BGL) for parallel operation with distributed memory. The BGL consists of a rich set of generic graph algorithms and supporting data structures, but it was not originally designed with parallelism in mind. In this paper, we revisit the abstractions comprising the BGL in the context of distributed-memory parallelism, lifting away the implicit requirements of sequential execution and a single shared address space. We illustrate our approach by describing the process as applied to one of the core algorithms in the BGL, breadth-first search. The result is a generic algorithm that is unchanged from the sequential algorithm, requiring only the introduction of external (distributed) data structures for parallel execution. More importantly, the generic implementation retains its interface and semantics, such that other distributed algorithms can be built upon it, just as algorithms are layered in the sequential case. By...
14th Twente Student conference on IT January 21st, 2011
Despite the wide availability of multi-core processors and the popularity of Java, there is currently no library for parallel graph algorithms in Java available. Such a library would enable all Java programmers, especially those who work on the verification of programs and model checking to efficiently use graph algorithms. To find out what would be a good design for this library, we have made a start on it. We have created a general design for algorithms and graphs in the library. We have also implemented the reachability and connected components algorithms, both sequential and parallel. In this paper we describe the algorithms we have implemented and our design of the library, focusing on the design choices we made.
Graph theoretic problems are representative of fundamental computations in traditional and emerging scientific disciplines like scientific computing and computational biology, as well as applications in national security. We present our design and implementation of a graph theory application that supports the kernels from the Scalable Synthetic Compact Applications (SSCA) benchmark suite, developed under the DARPA High Productivity Computing Systems (HPCS) program. This synthetic benchmark consists of four kernels that require irregular access to a large, directed, weighted multi-graph. We have developed a parallel implementation of this benchmark in C using the POSIX thread library for commodity symmetric multiprocessors (SMPs). In this paper, we primarily discuss the data layout choices and algorithmic design issues for each kernel, and also present execution time and benchmark validation results.
2007 IEEE International Parallel and Distributed Processing Symposium, 2007
Search-based graph queries, such as finding short paths and isomorphic subgraphs, are dominated by memory latency. If input graphs can be partitioned appropriately, large cluster-based computing platforms can run these queries. However, the lack of compute-bound processing at each vertex of the input graph and the constant need to retrieve neighbors implies low processor utilization. Furthermore, graph classes such as scale-free social networks lack the locality to make partitioning clearly effective. Massive multithreading is an alternative architectural paradigm, in which a large shared memory is combined with processors that have extra hardware to support many thread contexts. The processor speed is typically slower than normal, and there is no data cache. Rather than mitigating memory latency, multithreaded machines tolerate it. This paradigm is well aligned with the problem of graph search, as the high ratio of memory requests to computation can be tolerated via multithreading. In this paper, we introduce the MultiThreaded Graph Library (MTGL), generic graph query software for processing semantic graphs on multithreaded computers. This library currently runs on serial machines and the Cray MTA-2, but Sandia is developing a run-time system that will make it possible to run MTGL-based code on Symmetric MultiProcessors. We also introduce a multithreaded algorithm for connected
In Memory Data Management and Analysis, 2015
Dissatisfaction with relational databases for large-scale graph processing has motivated a new class of graph databases that offer fast graph processing but sacrifice the ability to express basic relational idioms. However, we hypothesize that the performance benefits amount to implementation details, not a fundamental limitation of the relational model. To evaluate this hypothesis, we are exploring code-generation to produce fast in-memory algorithms and data structures for graph patterns that are inaccessible to conventional relational optimizers. In this paper, we present preliminary results for this approach on path-counting queries, which includes triangle counting as a special case. We compile Datalog queries into main-memory pipelined hash-join plans in C++, and show that the resulting programs easily outperform PostgreSQL on real graphs with different degrees of skew. We then produce analogous parallel programs for Grappa, a runtime system for distributed memory architectures. Grappa is a good target for building a parallel query system as its shared memory programming model and communication mechanisms provide productivity and performance when building communication-intensive applications. Our experiments suggest that Grappa programs using hash joins have competitive performance with queries executed on Greenplum, a commercial parallel database. We find preliminary evidence that a code generation approach simplifies the design of a query engine for graph analysis and improves performance over conventional relational databases.
Loading Preview
Sorry, preview is currently unavailable. You can download the paper by clicking the button above.