2009
Graphics Processing Units (GPUs) provide high computation power at low cost and are important compute accelerators with a massively multithreaded architecture. In this paper, we present fast implementations of common graph operations like breadth-first search, st-connectivity, single-source shortest path, all-pairs shortest path, minimum spanning tree, and maximum flow for undirected graphs on the GPU using the CUDA programming model. Our implementations exhibit high performance, especially on large graphs. We use two data-parallel programming methodologies for these algorithms. One is an iterative, mask-based approach that processes valid data elements like vertices and edges using independent threads for each. The other is a divide-and-conquer approach that reduces the problem into smaller problems that are handled later using the same approach. Parallel algorithms for such problems have been reported in the literature before, especially on supercomputers. The massively mul...
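A minimal sketch of what the mask-based, thread-per-vertex methodology can look like for BFS, assuming a CSR graph; the array names and host loop are illustrative, not the authors' exact code:

    // One BFS level: each thread owns a vertex; threads whose vertex is in
    // the current frontier relax its edges. CSR arrays are assumed.
    __global__ void bfs_level_kernel(const int *row_offsets,
                                     const int *col_indices,
                                     int *levels,        // -1 = unvisited
                                     int *frontier_mask, // 1 = current frontier
                                     int *next_mask,     // built for next level
                                     int num_vertices,
                                     int current_level,
                                     int *changed)
    {
        int v = blockIdx.x * blockDim.x + threadIdx.x;
        if (v >= num_vertices || !frontier_mask[v]) return;
        frontier_mask[v] = 0;
        for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
            int u = col_indices[e];
            if (levels[u] == -1) {        // benign race: all writers store the same level
                levels[u] = current_level + 1;
                next_mask[u] = 1;
                *changed = 1;
            }
        }
    }
    // Host side: launch once per level, swap frontier_mask/next_mask,
    // and stop when no thread sets *changed.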
Large graphs involving millions of vertices are common in many practical applications and are challenging to process. Practical-time implementations using high-end computers are reported but are accessible only to a few. Graphics Processing Units (GPUs) of today have high computation power and low price. They have a restrictive programming model and are tricky to use. The G80 line of Nvidia GPUs can be treated as a SIMD processor array using the CUDA programming model. We present a few fundamental algorithms, including breadth-first search, single-source shortest path, and all-pairs shortest path, using CUDA on large graphs. We can compute the single-source shortest path on a 10 million vertex graph in 1.5 seconds using the Nvidia 8800GTX GPU costing $600. In some cases the optimal sequential algorithm is not the fastest on the GPU architecture. GPUs have great potential as high-performance co-processors.
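A sketch of one parallel edge-relaxation pass of the kind such an SSSP implementation iterates; the atomicMin formulation here is a simplification for clarity, not necessarily the paper's exact scheme:

    // Bellman-Ford-style relaxation: iterate this kernel until no distance
    // improves. Distances are ints so atomicMin can be used directly.
    #include <climits>

    __global__ void relax_kernel(const int *row_offsets, const int *col_indices,
                                 const int *weights, int *dist,
                                 int num_vertices, int *changed)
    {
        int v = blockIdx.x * blockDim.x + threadIdx.x;
        if (v >= num_vertices || dist[v] == INT_MAX) return;
        for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
            int cand = dist[v] + weights[e];
            // atomicMin returns the old value; cand < old means we improved it
            if (cand < atomicMin(&dist[col_indices[e]], cand)) *changed = 1;
        }
    }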
2010
Basic operations on graphs with millions of vertices are common in various applications, and faster execution of such operations is essential to reducing overall computation time. Today's graphics processing units (GPUs) have high computation power and low price. Such a device can be treated as an array of Single Instruction Multiple Data (SIMD) processors using Nvidia's CUDA software interface. The massively multithreaded architecture of a CUDA device runs many threads in parallel, making optimal use of the GPU's available computation power. In graph algorithms, the vertices of the graph are processed in parallel by mapping them to threads on the device. With thousands of threads running in parallel, the computation time required for these algorithms decreases drastically compared to their CPU implementations.
Many practical applications, including image processing, space searching, network analysis, and graph partitioning, commonly use large graphs with millions of vertices, and processing such graphs is a difficult task. Practical-time implementations using high-end computers have been reported but are accessible only to a few. Efficient performance in these applications requires fast graph processing, and today's Graphics Processing Units (GPUs), with their high computational power, are deployed as accelerators. The NVIDIA GPU can be treated as a SIMD processor array using the CUDA programming model. In this paper, the Breadth-First Search, All-Pairs Shortest Path, and Travelling Salesman Problem graph algorithms are implemented on the GPU. The algorithms are adapted so that they can efficiently exploit the GPU. An optimization technique that reduces CPU-to-GPU data transfer and global memory accesses is also designed to reduce latency. An analysis of the all-pairs shortest path algorithm across the different GPU memories shows that using shared memory and coalesced data access reduces execution time and increases speedup over the CPU compared to global memory. The TSP experiments show that increasing the number of blocks and iterations yields a more optimized tour length.
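As an illustration of the shared-memory point, a sketch of one Floyd-Warshall all-pairs step that stages the pivot row in shared memory; the tile size and names are assumptions, and the paper's APSP formulation may differ:

    // One k-iteration of Floyd-Warshall: dist[i][j] = min(dist[i][j],
    // dist[i][k] + dist[k][j]). Each block stages its slice of row k in
    // shared memory instead of re-reading it from global memory.
    // Entries are assumed initialized with a large finite sentinel
    // (e.g., 1 << 29) so the addition cannot overflow.
    #define TILE 256

    __global__ void fw_step(int *dist, int n, int k)
    {
        __shared__ int row_k[TILE];
        int j = blockIdx.x * TILE + threadIdx.x;          // column this thread updates
        int i = blockIdx.y;                               // row this block works on
        if (j < n) row_k[threadIdx.x] = dist[k * n + j];  // stage dist[k][j]
        __syncthreads();
        if (j < n) {
            int via = dist[i * n + k] + row_k[threadIdx.x];
            if (via < dist[i * n + j]) dist[i * n + j] = via;
        }
    }
    // Host: for (int k = 0; k < n; ++k)
    //           fw_step<<<dim3((n + TILE - 1) / TILE, n), TILE>>>(d_dist, n, k);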
Proceedings of the 1st ACM conference on High Performance Graphics - HPG '09, 2009
Graphics Processor Units are used for much general-purpose processing due to the high compute power available on them. Regular, data-parallel algorithms map well to the SIMD architecture of current GPUs. Irregular algorithms on discrete structures like graphs are harder to map to them. Efficient data-mapping primitives can play a crucial role in mapping such algorithms onto the GPU. In this paper, we present a minimum spanning tree algorithm on Nvidia GPUs under CUDA, as a recursive formulation of Borůvka's approach for undirected graphs. We implement it using scalable primitives such as scan, segmented scan, and split. The irregular steps of supervertex formation and recursive graph construction are mapped to primitives like split on categories involving vertex IDs and edge weights. We obtain 30 to 50 times speedup over the CPU implementation on most graphs and 3 to 10 times speedup over our previous GPU implementation. We construct the minimum spanning tree on a 5 million node, 30 million edge graph in under 1 second on one quarter of a Tesla S1070 GPU.
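A sketch of how one Borůvka step's segmented minimum can be phrased with off-the-shelf primitives, here Thrust's sort_by_key and reduce_by_key standing in for split and segmented scan; a full version would also track which destination vertex attains the minimum (e.g., by packing weight and destination into one key):

    // Find each vertex's minimum outgoing edge weight from an edge list of
    // (src, weight, dst) triples: sort by source (the "split"), then take a
    // per-segment minimum (the "segmented" reduction).
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/reduce.h>
    #include <thrust/functional.h>
    #include <thrust/iterator/zip_iterator.h>

    void min_edge_per_vertex(thrust::device_vector<int> &src,
                             thrust::device_vector<int> &weight,
                             thrust::device_vector<int> &dst,
                             thrust::device_vector<int> &out_src,
                             thrust::device_vector<int> &out_min_w)
    {
        thrust::sort_by_key(src.begin(), src.end(),
                            thrust::make_zip_iterator(
                                thrust::make_tuple(weight.begin(), dst.begin())));
        out_src.resize(src.size());
        out_min_w.resize(src.size());
        auto ends = thrust::reduce_by_key(src.begin(), src.end(), weight.begin(),
                                          out_src.begin(), out_min_w.begin(),
                                          thrust::equal_to<int>(),
                                          thrust::minimum<int>());
        out_src.resize(ends.first - out_src.begin());
        out_min_w.resize(ends.second - out_min_w.begin());
    }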
There is significant interest nowadays in developing frameworks for parallelizing the processing of large graphs such as social networks, Web graphs, etc. Most parallel graph processing frameworks employ an iterative processing model. However, by benchmarking state-of-the-art GPU-based graph processing frameworks, we observed that the performance of iterative traversal-based graph algorithms (such as Breadth-First Search, Single-Source Shortest Path, and so on) on the GPU is limited by the frequent data exchange between the host and the GPU. To tackle this problem, we develop a GPU-based graph framework called WolfPath to accelerate iterative traversal-based graph processing algorithms. In WolfPath, the iterative process is guided by the graph diameter to eliminate the frequent data exchange between host and GPU. To accomplish this goal, WolfPath proposes a data structure called the Layered Edge list to represent the graph, from which the graph diameter is known before graph processing starts. To enhance the applicability of the WolfPath framework, a graph preprocessing algorithm is also developed in this work to convert any graph into the Layered Edge list format. We conducted extensive experiments to verify the effectiveness of WolfPath.
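The paper's exact layout is not reproduced in this abstract, so the following is only a speculative sketch of what a level-partitioned edge list could look like, with all levels resident on the GPU so the host loop needs no per-iteration transfers:

    // Hypothetical layered edge list: edges grouped by BFS level from the
    // source, with per-level offsets. Because the number of levels (the
    // diameter) is known up front, the host can launch exactly that many
    // kernels without reading convergence flags back from the device.
    struct LayeredEdgeList {
        int  num_levels;     // graph diameter, known before processing starts
        int *level_offsets;  // device array, size num_levels + 1
        int *edge_src;       // device arrays, edges sorted by level
        int *edge_dst;
    };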
The increasing scale and wealth of inter-connected data, such as those accrued by social network applications, demand the design of new techniques and platforms to efficiently derive actionable knowledge from large-scale graphs. However, large real-world graphs are famously difficult to process efficiently. Not only do they have a large memory footprint, but most graph algorithms also entail memory access patterns with poor locality, data-dependent parallelism, and a low compute-to-memory-access ratio. To complicate matters further, most real-world graphs have a highly heterogeneous node degree distribution; hence, partitioning these graphs for parallel processing while simultaneously achieving access locality and load balancing is difficult if not impossible.
Computation, 2020
Many modern applications are modeled using graphs of some kind. Given a graph, reachability, that is, discovering whether there is a path between two given nodes, is a fundamental problem as well as one of the most important steps of many other algorithms. The rapid accumulation of very large graphs (up to tens of millions of vertices and edges) from a diversity of disciplines demands efficient and scalable solutions to the reachability problem. General-purpose computing has been successfully used on Graphics Processing Units (GPUs) to parallelize algorithms that present a high degree of regularity. In this paper, we extend the applicability of GPU processing to graph-based manipulation, by re-designing a simple but efficient state-of-the-art graph-labeling method, namely the GRAIL (Graph Reachability Indexing via RAndomized Interval) algorithm, to many-core CUDA-based GPUs. This algorithm firstly generates a label for each vertex of the graph, then it exploits these labels to answer...
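A sketch of the GRAIL label test, assuming d interval labels per vertex produced by randomized traversals (the struct and function names are illustrative): if any of u's intervals fails to contain the corresponding interval of v, u provably cannot reach v; otherwise the answer is inconclusive and a guided search is needed.

    struct Label { int begin, end; };   // one [begin, end] interval per traversal

    __host__ __device__
    bool may_reach(const Label *labels_u, const Label *labels_v, int d)
    {
        for (int i = 0; i < d; ++i)
            if (labels_u[i].begin > labels_v[i].begin ||
                labels_u[i].end   < labels_v[i].end)
                return false;   // definite "not reachable"
        return true;            // inconclusive: fall back to a search
    }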
Algorithmica, 2002
In this paper, we present deterministic parallel algorithms for the coarse grained multicomputer (CGM) and bulk-synchronous parallel computer (BSP) models which solve the following well known graph problems: (1) list ranking, (2) Euler tour construction, (3) computing the connected components and spanning forest, (4) lowest common ancestor preprocessing, (5) tree contraction and expression tree evaluation, (6) computing an ear decomposition or open ear decomposition, (7) 2-edge connectivity and biconnectivity (testing and component computation), and (8) chordal graph recognition (finding a perfect elimination ordering). The algorithms for Problems 1-7 require O(log p) communication rounds and linear sequential work per round. Our results for Problems 1 and 2 hold for arbitrary ratios n/p, i.e. they are fully scalable, and for Problems 3-8 it is assumed that n/p ≥ p^ε, ε > 0, which is true for all commercially available multiprocessors. We view the algorithms presented as an important step towards the final goal of O(1) communication rounds. Note that the number of communication rounds obtained in this paper is independent of n and grows only very slowly with respect to p. Hence, for most practical purposes, the number of communication rounds can be considered as constant. The result for Problem 1 is a considerable improvement over those previously reported. The algorithms for Problems 2-7 are the first practically relevant deterministic parallel algorithms for these problems to be used for commercially available coarse grained parallel machines. Research partially supported by the Natural Sciences and Engineering Research Council of Canada, FAPESP (Brasil), CNPq (Brasil), PROTEM-2-TCPAC (Brasil), the Commission of the European Communities (ESPRIT Long Term Research Project 20244, ALCOM-IT), DFG-SFB 376 "Massive Parallelität" (Germany), and the Région Rhône-Alpes (France).
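For intuition about Problem 1, here is the classic PRAM-style pointer-jumping round for list ranking as a CUDA sketch. This is the textbook formulation with O(log n) fully parallel rounds, not the CGM/BSP algorithm of the paper, which is designed to use far fewer communication rounds:

    // One pointer-jumping round. rank[v] is initialized to 1 (0 at the list
    // tail, whose successor is itself); after ceil(log2 n) rounds with
    // double-buffered arrays, rank[v] is v's distance to the tail.
    __global__ void pointer_jump(const int *succ_in, const int *rank_in,
                                 int *succ_out, int *rank_out, int n)
    {
        int v = blockIdx.x * blockDim.x + threadIdx.x;
        if (v >= n) return;
        int s = succ_in[v];
        if (s == v) {                         // tail: already a fixed point
            rank_out[v] = rank_in[v];
            succ_out[v] = v;
        } else {                              // accumulate and skip ahead
            rank_out[v] = rank_in[v] + rank_in[s];
            succ_out[v] = succ_in[s];
        }
    }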
Proceedings of the VLDB Endowment
Due to the limited capacity of GPU memory, the majority of prior work on graph applications on GPUs has been restricted to graphs of modest sizes that fit in memory. Recent hardware and software advances make it possible to address much larger host memory transparently as a part of a feature known as unified virtual memory. While accessing host memory over an interconnect is understandably slower, the problem space has not been sufficiently explored in the context of a challenging workload with low computational intensity and an irregular data access pattern such as graph traversal. We analyse the performance of breadth first search (BFS) for several large graphs in the context of unified memory and identify the key factors that contribute to slowdowns. Next, we propose a lightweight offline graph reordering algorithm, HALO (Harmonic Locality Ordering), that can be used as a pre-processing step for static graphs. HALO yields speedups of 1.5x-1.9x over baseline in subsequent traversa...
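A minimal sketch of the unified-memory setup such a study relies on, using the standard cudaMallocManaged and cudaMemAdvise calls; the helper function itself is hypothetical, and a reordering such as HALO would be applied to the CSR arrays before traversal:

    // Place a CSR graph, possibly larger than device memory, in unified
    // memory so BFS kernels can access it transparently over the interconnect.
    #include <cuda_runtime.h>

    void alloc_unified_csr(size_t num_vertices, size_t num_edges,
                           int **row_offsets, int **col_indices)
    {
        cudaMallocManaged(row_offsets, (num_vertices + 1) * sizeof(int));
        cudaMallocManaged(col_indices, num_edges * sizeof(int));
        // Hint: the adjacency data is read-mostly during traversal.
        cudaMemAdvise(*col_indices, num_edges * sizeof(int),
                      cudaMemAdviseSetReadMostly, 0);
    }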
SIAM Journal on Computing, 1984
In this paper, we present efficient parallel algorithms for the following graph problems: finding the lowest common ancestors for vertex pairs of a directed tree; finding all fundamental cycles, a directed spanning forest, all bridges, all bridge-connected components, all separation vertices, all biconnected components, and testing the biconnectivity of an undirected graph. All these algorithms achieve the O(lg n) time bound, with the first two algorithms using n⌈n/lg n⌉ processors and the remaining algorithms using n⌈n/lg n⌉ processors. In all cases, our algorithms are better than the previously known algorithms and in most cases reduce the number of processors used by a factor of n lg n. Moreover, our algorithms are optimal with respect to the time-processor product for dense graphs, with the exception of the first two algorithms. The machine model we use is the PRAM which is a SIMD model allowing simultaneous reads but not simultaneous writes to the same memory location.
2007 Proceedings of the Ninth Workshop on Algorithm Engineering and Experiments (ALENEX), 2007
We present an experimental study of the single source shortest path problem with non-negative edge weights (NSSP) on large-scale graphs using the ∆-stepping parallel algorithm. We report performance results on the Cray MTA-2, a multithreaded parallel computer. The MTA-2 is a high-end shared memory system offering two unique features that aid the efficient parallel implementation of irregular algorithms: the ability to exploit fine-grained parallelism, and low-overhead synchronization primitives. Our implementation exhibits remarkable parallel speedup when compared with competitive sequential algorithms, for low-diameter sparse graphs. For instance, ∆-stepping on a directed scale-free graph of 100 million vertices and 1 billion edges takes less than ten seconds on 40 processors of the MTA-2, with a relative speedup of close to 30. To our knowledge, these are the first performance results of a shortest path problem on realistic graph instances on the order of billions of vertices and edges.
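A sequential sketch of the ∆-stepping bucket structure for reference; it omits the light/heavy edge split and the parallel per-bucket processing that the MTA-2 implementation depends on, so it behaves like Dijkstra with buckets of width delta:

    #include <vector>
    #include <limits>

    void delta_stepping(const std::vector<std::vector<std::pair<int,int>>> &adj,
                        int source, int delta, std::vector<int> &dist)
    {
        const int INF = std::numeric_limits<int>::max();
        dist.assign(adj.size(), INF);
        dist[source] = 0;
        std::vector<std::vector<int>> buckets(1, std::vector<int>{source});
        for (size_t b = 0; b < buckets.size(); ++b) {
            while (!buckets[b].empty()) {      // relaxations may refill bucket b
                std::vector<int> frontier;
                frontier.swap(buckets[b]);
                for (int v : frontier) {
                    if (dist[v] / delta != (int)b) continue;  // stale entry
                    for (auto [u, w] : adj[v]) {
                        int nd = dist[v] + w;
                        if (nd < dist[u]) {
                            dist[u] = nd;
                            size_t nb = nd / delta;           // re-bucket u
                            if (nb >= buckets.size()) buckets.resize(nb + 1);
                            buckets[nb].push_back(u);
                        }
                    }
                }
            }
        }
    }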
2011 International Conference on Parallel Architectures and Compilation Techniques, 2011
Graphs are a fundamental data representation that have been used extensively in various domains. In graph-based applications, a systematic exploration of the graph such as a breadth-first search (BFS) often serves as a key component in the processing of their massive data sets. In this paper, we present a new method for implementing the parallel BFS algorithm on multi-core CPUs which exploits a fundamental property of randomly shaped real-world graph instances. By utilizing memory bandwidth more efficiently, our method shows improved performance over the current state-of-the-art implementation and increases its advantage as the size of the graph increases. We then propose a hybrid method which, for each level of the BFS algorithm, dynamically chooses the best implementation from: a sequential execution, two different methods of multicore execution, and a GPU execution. Such a hybrid approach provides the best performance for each graph size while avoiding poor worst-case performance on high-diameter graphs. Finally, we study the effects of the underlying architecture on BFS performance by comparing multiple CPU and GPU systems; a high-end GPU system performed as well as a quad-socket high-end CPU system.
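A sketch of the per-level dispatch idea; the cutoff parameters and names are placeholders, since the paper selects among implementations empirically per machine:

    // Choose an implementation for one BFS level from the amount of work
    // (edges incident to the frontier): tiny levels stay sequential, medium
    // levels go multicore, massive levels go to the GPU.
    enum class Impl { Sequential, Multicore, Gpu };

    Impl choose_impl(size_t frontier_edges, size_t seq_cutoff, size_t gpu_cutoff)
    {
        if (frontier_edges < seq_cutoff) return Impl::Sequential;
        if (frontier_edges < gpu_cutoff) return Impl::Multicore;
        return Impl::Gpu;
    }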
ACM Transactions on Parallel Computing, 2015
Breadth-First Search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with nontrivial diameter. We present a BFS parallelization focused on fine-grained task management constructed from efficient prefix sum computations that achieves an asymptotically optimal O(|V| + |E|) work complexity. Our implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of 3.3 billion and 8.3 billion traversed edges per second using single- and quad-GPU configurations, respectively. This level of performance is several times faster than state-of-the-art implementations on both CPU and GPU p...
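A sketch of the prefix-sum idea in its simplest form: scan the frontier vertices' degrees so each vertex writes its neighbors to a pre-reserved contiguous range, keeping the work compact without atomics in the hot path (the paper's fine-grained version additionally balances this gathering across threads):

    // out_offsets[i] is the exclusive prefix sum of the frontier vertices'
    // degrees, computed on the host side of the launch (e.g., with
    // thrust::exclusive_scan); thread i then owns a disjoint output slot.
    __global__ void gather_neighbors(const int *frontier, const int *row_offsets,
                                     const int *col_indices, const int *out_offsets,
                                     int *out_edges, int frontier_size)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= frontier_size) return;
        int v = frontier[i];
        int o = out_offsets[i];              // start of this vertex's slot
        for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e)
            out_edges[o++] = col_indices[e]; // contiguous, race-free writes
    }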
ACM SIGPLAN Notices, 2012
Breadth-first search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data-dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with non-trivial diameter. We present a BFS parallelization focused on fine-grained task management constructed from efficient prefix sum that achieves an asymptotically optimal O(|V| + |E|) work complexity. Our implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of 3.3 billion and 8.3 billion traversed edges per second using single- and quad-GPU configurations, respectively. This level of performance is several times faster than state-of-the-art implementations on both CPU and GPU platforms.
Graph processing has always been a challenge, as there are inherent complexities in it. These include scalability to larger data sets and clusters, dependencies between vertices in the graph, irregular memory accesses during processing and traversals, minimal locality of reference, etc. In the literature, there are several implementations for parallel graph processing on single-GPU systems but only a few for single- and multi-node multi-GPU systems. In this paper, the prospects of improving large graph traversals by utilizing a multi-GPU cluster for the Breadth-First Search algorithm have been studied. In this regard, DiGPU, a CUDA-based implementation for graph traversal on shared-memory and distributed-memory multi-GPU systems, has been proposed. In this work, an open source software module has also been developed and verified through a set of experiments. Further, evaluations have been demonstrated on a local cluster as well as on the CDER cluster. Finally, experimental analysis has been performed on several graph data sets using different system configurations to study the impact of load distribution with respect to GPU specification on the performance of our implementation.
2017 IEEE International Conference on Cluster Computing (CLUSTER), 2017
The rapidly growing number of large network analysis problems has led to the emergence of many parallel and distributed graph processing systems; one survey in 2014 identified over 80. Since then, the landscape has evolved; some packages have become inactive while more are being developed. Determining the best approach for a given problem is infeasible for most developers. To enable easy, rigorous, and repeatable comparison of the capabilities of such systems, we present an approach and associated software for analyzing the performance and scalability of parallel, open-source graph libraries. We demonstrate our approach on five graph processing packages: GraphMat, the Graph500, the Graph Algorithm Platform Benchmark Suite, GraphBIG, and PowerGraph using synthetic and real-world datasets. We examine previously overlooked aspects of parallel graph processing performance such as phases of execution and energy usage for three algorithms: breadth-first search, single-source shortest paths, and PageRank, and compare our results to Graphalytics.
Parallel Processing …, 2007
Graph algorithms are becoming increasingly important for solving many problems in scientific computing, data mining and other domains. As these problems grow in scale, parallel computing resources are required to meet their computational and memory requirements. Unfortunately, the algorithms, software, and hardware that have worked well for developing mainstream parallel scientific applications are not necessarily effective for large-scale graph problems. In this paper we present the interrelationships between graph problems, software, and parallel hardware in the current state of the art and discuss how those issues present inherent challenges in solving large-scale graph problems. The range of these challenges suggests a research agenda for the development of scalable high-performance software for graph problems.
2020
When working on graphs, reachability is among the most common problems to address, since it is the base for many other algorithms. With the advent of distributed systems that process large amounts of data, many applications must quickly explore graphs with millions of vertices, so scalable solutions have become of paramount importance. Modern GPUs provide highly parallel systems based on many-core architectures and have gained popularity in parallelizing algorithms that run on large data sets. In this paper, we extend a very efficient state-of-the-art graph-labeling method, namely the GRAIL algorithm, to architectures which exhibit a great amount of data parallelism, i.e., many-core CUDA-based GPUs. GRAIL creates a scalable index for answering reachability queries, and it heavily relies on depth-first searches. As depth-first visits are intrinsically recursive and they cannot be efficiently implemented on parallel systems, we devise an alternative approach based on a sequence of breadth-first visits. The paper explores our efforts in this direction, and it analyzes the difficulties encountered and the solutions chosen to overcome them. It also presents a comparison (in terms of times to create the index and to use it for reachability queries) between the CPU and the GPU-based versions.
2016
GraphBLAS is an emerging paradigm for graph computation that makes it easy to program new graph algorithms in a highly abstract language of linear algebra. The promise of GraphBLAS is that an abstract graph program will execute in a wide variety of programming environments, ranging from embedded environments to distributed memory computers. In this paper we present our initial implementation of GraphBLAS primitives for graphics processing unit (GPU) systems called the GraphBLAS Template Library (GBTL). Our implementation is an ongoing effort in the context of GraphBLAS standardization efforts by a diverse group of academics and industry representatives. Our implementation consists of a high-level C++ frontend, and the GPU functionality is implemented with a combination of the CUSP library for sparse-matrix computation on the GPU and the NVIDIA Thrust framework for abstract GPU programs. We give initial performance results of our implementations, and we discuss solutions to the problems we encountered when providing a low-level implementation for a high-level generic interface.
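A small host-side sketch of the linear-algebra view GraphBLAS encourages: one BFS level as a Boolean-semiring matrix-vector product masked by the visited set. This is plain C++ for clarity; a library like GBTL would lower the same operation to sparse GPU kernels:

    // next = (A^T * frontier) over the Boolean semiring, masked by !visited.
    // row_offsets/col_indices hold A^T in CSR, i.e., incoming edges of u
    // (identical to A for undirected graphs).
    #include <vector>

    void bfs_spmv_level(const std::vector<int> &row_offsets,
                        const std::vector<int> &col_indices,
                        const std::vector<bool> &frontier,
                        std::vector<bool> &visited,
                        std::vector<bool> &next)
    {
        int n = (int)frontier.size();
        next.assign(n, false);
        for (int u = 0; u < n; ++u)          // Boolean dot product per row
            for (int e = row_offsets[u]; e < row_offsets[u + 1]; ++e)
                if (frontier[col_indices[e]]) { next[u] = true; break; }
        for (int u = 0; u < n; ++u)          // apply the complement mask
            if (visited[u]) next[u] = false;
            else if (next[u]) visited[u] = true;
    }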
2007 IEEE International Parallel and Distributed Processing Symposium, 2007
We present a study of multithreaded implementations of Thorup's algorithm for solving the Single Source Shortest Path (SSSP) problem for undirected graphs. Our implementations leverage the fledgling MultiThreaded Graph Library (MTGL) to perform operations such as finding connected components and extracting induced subgraphs. To achieve good parallel performance from this algorithm, we deviate from several theoretically optimal algorithmic steps. In this paper, we present simplifications that perform better in practice, and we describe details of the multithreaded implementation that were necessary for scalability. We study synthetic graphs that model unstructured networks, such as social networks and economic transaction networks. Most of the recent progress in shortest path algorithms relies on structure that these networks do not have. In this work, we take a step back and explore the synergy between an elegant theoretical algorithm and an elegant computer architecture. Finally, we conclude with a prediction that this work will become relevant to shortest path computation on structured networks.