2020 Proceedings of the SIAM Workshop on Combinatorial Scientific Computing, 2020
SuiteSparse:GraphBLAS is a complete implementation of the GraphBLAS standard. It provides a powerful and expressive framework for creating graph algorithms based on the elegant mathematics of sparse matrix operations on a semiring. Algorithms written with the GraphBLAS achieve high performance with minimal development time. Multithreaded parallelism through OpenMP provides additional speedup, which we illustrate on a 20-core Intel® Xeon® E5-2698 CPU system when solving various problems (triangle counting, k-truss, breadth-first search, Bellman-Ford, local clustering coefficient, and a sparse deep neural network problem). This wide variety of algorithms illustrates the expressiveness of the GraphBLAS API to create new graph algorithms. We present performance results with these algorithms on a set of large real-world graphs, using the newly developed SuiteSparse:GraphBLAS v3.0.1.
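As a rough illustration of what "sparse matrix operations" means for one of the problems listed above, triangle counting can be written as a masked matrix product: with L the strictly lower-triangular part of the adjacency matrix, the triangle count is the sum of the entries of (L·Lᵀ) restricted to the nonzero pattern of L. The sketch below is plain Python over sets, not the SuiteSparse:GraphBLAS API; the function name `count_triangles` and the edge-list input format are assumptions made for this example.

```python
from collections import defaultdict

def count_triangles(edges):
    """Count triangles in an undirected simple graph given as (u, v) pairs."""
    # Build L, the strictly lower-triangular part of the adjacency matrix,
    # stored as row index -> set of column indices (one entry per edge).
    lower = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # self-loops cannot be part of a triangle
        i, j = (u, v) if u > v else (v, u)
        lower[i].add(j)

    # sum((L * L^T) .* L): for each stored entry L[i][j], the masked product
    # entry counts the columns k present in both row i and row j of L.
    total = 0
    for i, row in lower.items():
        for j in row:
            total += len(row & lower.get(j, set()))
    return total

if __name__ == "__main__":
    k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    print(count_triangles(k4))
```

Each triangle is counted exactly once, at the edge joining its two largest vertices, because only the smallest vertex of the triangle appears in both of their lower-triangular rows.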
Field-Programmable Custom Computing Machines, 2006
Many important applications are organized around long-lived, irregular sparse graphs (e.g., data and knowledge bases, CAD optimization, numerical problems, simulations). The graph structures are large, and the applications need regular access to a large, data-dependent portion of the graph for each operation (e.g., the algorithm may need to walk the graph, visiting all nodes, or propagate changes through many nodes in the graph). On conventional microprocessors, the graph structures exceed on-chip cache capacities, making main-memory bandwidth and latency the key performance limiters. To avoid this "memory wall," we introduce a concurrent system architecture for sparse graph algorithms that places graph nodes in small distributed memories paired with specialized graph processing nodes interconnected by a lightweight network. This gives us a scalable way to map these applications so that they can exploit the high-bandwidth and low-latency capabilities of embedded memories (e.g., FPGA Block RAMs). On typical spreading-activation queries on the ConceptNet Knowledge Base, a sample application, this translates into an order of magnitude speedup per FPGA compared to a state-of-the-art Pentium processor.
ACM Transactions on Mathematical Software
High-performance implementations of graph algorithms are challenging to implement on new parallel hardware such as GPUs because of three challenges: (1) the difficulty of coming up with graph building blocks, (2) load imbalance on parallel hardware, and (3) graph problems having low arithmetic intensity. To address some of these challenges, GraphBLAS is an innovative, ongoing effort by the graph analytics community to propose building blocks based on sparse linear algebra, which allow graph algorithms to be expressed in a performant, succinct, composable, and portable manner. In this paper, we examine the performance challenges of a linear-algebra-based approach to building graph frameworks and describe new design principles for overcoming these bottlenecks. Among the new design principles is exploiting input sparsity, which allows users to write graph algorithms without specifying push and pull direction. Exploiting output sparsity allows users to tell the backend which values of...
Given the growing importance of large-scale graph analytics, there is a need to improve the performance of graph analysis frameworks without compromising on productivity. GraphMat is our solution to bridge this gap between a user-friendly graph analytics framework and native, hand-optimized code. GraphMat functions by taking vertex programs and mapping them to high performance sparse matrix operations in the backend. We get the productivity benefits of a vertex programming framework without sacrificing performance. GraphMat is in C++, and we have been able to write a diverse set of graph algorithms in this framework with the same effort compared to other vertex programming frameworks. GraphMat performs 1.2-7X faster than high performance frameworks such as GraphLab, CombBLAS and Galois. It achieves better multicore scalability (13-15X on 24 cores) than other frameworks and is 1.2X off native, hand-optimized code on a variety of different graph algorithms. Since GraphMat performance ...
Graph problems are significantly harder to solve with large graphs residing on disk compared to main memory only. In this work, we study how to solve four important graph problems: reachability from a source vertex, single source shortest path, weakly connected components, and PageRank. It is well known that the aforementioned algorithms can be expressed as an iteration of matrix-vector multiplications under different semi-rings. Based on this mathematical foundation, we show how to express the computation with standard relational queries and then we study how to efficiently evaluate them in parallel in a shared-nothing architecture. We identify a common algorithmic pattern that unifies the four graph algorithms, considering a common mathematical foundation based on sparse matrix-vector multiplication. The net gain is that our SQL-based approach enables solving "big data" graph problems on parallel database systems, debunking common wisdom that they are cumbersome and slow. Using lar...
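To make the "iteration of matrix-vector multiplications under different semi-rings" concrete, here is a minimal sketch of single source shortest path over the (min, +) semiring, where "addition" is min and "multiplication" is +. It is plain Python, not the relational-query formulation the abstract describes; the names `min_plus_matvec` and `sssp` and the dictionary-of-edges matrix format are assumptions for illustration. It assumes no negative-weight cycles and does not detect them.

```python
import math

def min_plus_matvec(weights, dist):
    """One step d'[v] = min(d[v], min over u of (d[u] + w(u, v)))."""
    new_dist = list(dist)
    # weights maps a directed edge (u, v) to its weight: the sparse matrix.
    for (u, v), w in weights.items():
        candidate = dist[u] + w  # semiring "multiply" is ordinary addition
        if candidate < new_dist[v]:  # semiring "add" is min
            new_dist[v] = candidate
    return new_dist

def sssp(num_vertices, weights, source):
    """Iterate the (min, +) product until the distance vector stops changing."""
    dist = [math.inf] * num_vertices
    dist[source] = 0.0
    # At most num_vertices - 1 rounds are needed without negative cycles.
    for _ in range(num_vertices - 1):
        updated = min_plus_matvec(weights, dist)
        if updated == dist:
            break
        dist = updated
    return dist

if __name__ == "__main__":
    w = {(0, 1): 4, (0, 2): 1, (2, 1): 2, (1, 3): 5}
    print(sssp(4, w, source=0))
```

Swapping the pair (min, +) for (or, and) over booleans turns the same loop into reachability from the source, which is the sense in which one pattern covers several of the four problems.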
The focus of traditional scientific computing has been in solving large systems of PDEs (and the corresponding linear algebra problems that they induce). Hardware architectures, computer systems, and software platforms have evolved together to efficiently support solving these kinds of problems. Similar attention has not been devoted to solving large-scale graph problems. Recently this class of applications has seen increased attention. The irregular, nonlocal, and dynamic characteristics of these problems require new programming techniques to adapt them to modern HPC systems offering multiple levels of parallelism. We describe a library for implementing graph algorithms based on asynchronous execution of fine-grained, concurrent operations. Prototype implementations of two graph kernels which combine lightweight graph metadata transactions with generalized active messages demonstrate that it is possible to implement graph applications which efficiently leverage both shared- and distributed-memory parallelism.
In many application domains, data are represented using large graphs involving millions of vertices and edges. Graph analysis algorithms, such as finding short paths and isomorphic subgraphs, are largely dominated by memory latency. Large cluster-based computing platforms can process graphs efficiently if the graph data can be partitioned, and on a smaller scale partitioning can be used to allocate graphs to low-latency on-chip RAMs in reconfigurable devices.
GraphBLAS is an emerging paradigm for graph computation that makes it easy to program new graph algorithms in a highly abstract language of linear algebra. The promise of GraphBLAS is that an abstract graph program will execute in a wide variety of programming environments, ranging from embedded environments to distributed memory computers. In this paper we present our initial implementation of GraphBLAS primitives for graphics processing unit (GPU) systems called GraphBLAS Template Library (GBTL). Our implementation is an ongoing effort in the context of GraphBLAS standardization efforts by a diverse group of academics and representatives of the industry. Our implementation consists of a high-level C++ frontend, and the GPU functionality is implemented with a combination of the CUSP library for sparse-matrix computation on GPU and the NVIDIA Thrust framework for abstract GPU programs. We give initial performance results of our implementations, and we discuss solutions to the problems we encountered when providing a low-level implementation for a high-level generic interface.
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '13, 2013
Scalable parallel computing is essential for processing large scale-free (power-law) graphs. The distribution of data across processes becomes important on distributed-memory computers with thousands of cores. It has been shown that two-dimensional layouts (edge partitioning) can have significant advantages over traditional one-dimensional layouts. However, simple 2D block distribution does not use the structure of the graph, and more advanced 2D partitioning methods are too expensive for large graphs. We propose a new two-dimensional partitioning algorithm that combines graph partitioning with 2D block distribution. The computational cost of the algorithm is essentially the same as 1D graph partitioning. We study the performance of sparse matrix-vector multiplication (SpMV) for scale-free graphs from the web and social networks using several different partitioners and both 1D and 2D data layouts. We show that SpMV run time is reduced by exploiting the graph's structure. Contrary to popular belief, we observe that current graph and hypergraph partitioners often yield relatively good partitions on scale-free graphs. We demonstrate that our new 2D partitioning method consistently outperforms the other methods considered, for both SpMV and an eigensolver, on matrices with up to 1.6 billion nonzeros using up to 16,384 cores.
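The difference between the 1D and 2D layouts mentioned above can be sketched as two ownership functions for a nonzero at position (i, j) of an n-by-n matrix. This shows only the simple block distributions that the abstract uses as a baseline, not the proposed partitioning algorithm; the function names and the row-major numbering of the pr-by-pc process grid are assumptions made for this example.

```python
def block_index(index, n, parts):
    """Map an index in [0, n) to one of `parts` contiguous, near-equal blocks."""
    return index * parts // n

def owner_1d(i, j, n, p):
    """1D (row) layout: the row of a nonzero alone decides its process."""
    return block_index(i, n, p)

def owner_2d(i, j, n, pr, pc):
    """2D block layout on a pr-by-pc grid, processes numbered row-major."""
    return block_index(i, n, pr) * pc + block_index(j, n, pc)

def nonzeros_per_process(nonzeros, n, pr, pc):
    """Count how many nonzeros each process owns under the 2D block layout."""
    counts = [0] * (pr * pc)
    for i, j in nonzeros:
        counts[owner_2d(i, j, n, pr, pc)] += 1
    return counts

if __name__ == "__main__":
    entries = [(0, 0), (0, 7), (7, 0), (7, 7), (1, 1)]
    print(nonzeros_per_process(entries, n=8, pr=2, pc=2))
```

In the 1D layout a high-degree vertex places its entire row on one process, whereas the 2D layout splits that row across pc processes, which is one reason edge partitioning can help on power-law graphs.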
arXiv (Cornell University), 2016
Sparse matrix multiplication is traditionally performed in memory and scales to large matrices using the distributed memory of multiple nodes. In contrast, we scale sparse matrix multiplication beyond memory capacity by implementing sparse matrix dense matrix multiplication (SpMM) in a semi-external memory (SEM) fashion; i.e., we keep the sparse matrix on commodity SSDs and dense matrices in memory. Our SEM-SpMM incorporates many in-memory optimizations for large power-law graphs. It outperforms the in-memory implementations of Trilinos and Intel MKL and scales to billion-node graphs, far beyond the limitations of memory. Furthermore, on a single large parallel machine, our SEM-SpMM operates as fast as the distributed implementations of Trilinos using five times as much processing power. We also run our implementation in memory (IM-SpMM) to quantify the overhead of keeping data on SSDs. SEM-SpMM achieves almost 100% performance of IM-SpMM on graphs when the dense matrix has more than four columns; it achieves at least 65% performance of IM-SpMM on all inputs. We apply our SpMM to three important data analysis tasks (PageRank, eigensolving, and non-negative matrix factorization) and show that our SEM implementations significantly advance the state of the art.
Finding the number of triangles in a network is an important problem in the analysis of complex networks. The number of triangles also has important applications in data mining. Existing distributed memory parallel algorithms for counting triangles are either Map-Reduce based or message passing interface (MPI) based and work with overlapping partitions of the given network. These algorithms are designed for very sparse networks and do not work well when the degrees of the nodes are relatively large. For networks with larger degrees, Map-Reduce based algorithms generate prohibitively large intermediate data, and in MPI based algorithms with overlapping partitions, each partition can grow as large as the original network, wiping out the benefit of partitioning the network. In this paper, we present two efficient MPI-based parallel algorithms for counting triangles in massive networks with large degrees. The first algorithm is a space-efficient algorithm for networks that do not fit in the main memory of a single compute node. This algorithm divides the network into non-overlapping partitions. The second algorithm is for the case where the main memory of each node is large enough to contain the entire network. We observe that for such a case, computation load can be balanced dynamically and present a dynamic load balancing scheme which improves the performance significantly. Both of our algorithms scale well to large networks and to a large number of processors.
IEEE Transactions on Parallel and Distributed Systems, 2017
Sparse matrix multiplication is traditionally performed in memory and scales to large matrices using the distributed memory of multiple nodes. In contrast, we scale sparse matrix multiplication beyond memory capacity by implementing sparse matrix dense matrix multiplication (SpMM) in a semi-external memory (SEM) fashion; i.e., we keep the sparse matrix on commodity SSDs and dense matrices in memory. Our SEM-SpMM incorporates many in-memory optimizations for large power-law graphs. It outperforms the in-memory implementations of Trilinos and Intel MKL and scales to billion-node graphs, far beyond the limitations of memory. Furthermore, on a single large parallel machine, our SEM-SpMM operates as fast as the distributed implementations of Trilinos using five times as much processing power. We also run our implementation in memory (IM-SpMM) to quantify the overhead of keeping data on SSDs. SEM-SpMM achieves almost 100% performance of IM-SpMM on graphs when the dense matrix has more than four columns; it achieves at least 65% performance of IM-SpMM on all inputs. We apply our SpMM to three important data analysis tasks (PageRank, eigensolving, and non-negative matrix factorization) and show that our SEM implementations significantly advance the state of the art.
Graphics Processing Units (GPUs) provide high computation power at a low cost and are important compute accelerators with a massively multithreaded architecture. In this paper, we present fast implementations of common graph operations like breadth-first search, st-connectivity, single-source shortest path, all-pairs shortest path, minimum spanning tree, and maximum flow for undirected graphs on the GPU using the CUDA programming model. Our implementations exhibit high performance, especially on large graphs. We use two data-parallel programming methodologies for these algorithms. One is an iterative, mask-based approach that processes valid data elements like vertices and edges using independent threads for each. The other is a divide-and-conquer approach that reduces the problem into smaller problems that are handled later using the same approach. Parallel algorithms for such problems have been reported in the literature before, especially on supercomputers. The massively mul...
Procedia Computer Science, 2015
The analysis of graphs has become increasingly important to a wide range of applications. Graph analysis presents a number of unique challenges in the areas of (1) software complexity, (2) data complexity, (3) security, (4) mathematical complexity, (5) theoretical analysis, (6) serial performance, and (7) parallel performance. Implementing graph algorithms using matrix-based approaches provides a number of promising solutions to these challenges. The GraphBLAS standard is being developed to bring the potential of matrix-based graph algorithms to the broadest possible audience. The GraphBLAS mathematically defines a core set of matrix-based graph operations that can be used to implement a wide class of graph algorithms in a wide range of programming environments. This paper provides an introduction to the GraphBLAS and describes how the GraphBLAS can be used to address many of the challenges associated with analysis of graphs.
The availability and utility of large numbers of Graphical Processing Units (GPUs) have enabled parallel computations using extensive multi-threading. Sequential access to global memory and contention at the size-limited shared memory have been the main impediments to fully exploiting potential performance in architectures having a massive number of GPUs. After performing an extensive study of data structures and a complexity analysis of various data access methodologies, we propose novel memory storage and retrieval techniques that enable parallel graph computations to overcome the above issues. More specifically, given a graph G = (V, E) and an integer k <= |V|, we provide both storage techniques and algorithms to count the number of: a) connected subgraphs of size k; b) k cliques; and c) k independent sets, all of which can be exponential in number. Our storage techniques are based on creating a breadth-first search tree and storing it along with non-tree edges in a novel way. Our experiments solve the above-mentioned problems by using both naïve and advanced data structures on the CPU and GPU. Speedup is achieved by solving the problems on the GPU even using a brute-force approach as compared to the implementations on the CPU. Utilizing the knowledge of BFS-tree properties, the performance gain on the GPU increases and ultimately outperforms the CPU by a factor of at least 5 for graphs that completely fit in the shared memory and by a factor of 10 for larger graphs stored using the global memory. The counting problems mentioned above have many uses, including the analysis of social networks.
SIAM Journal on Scientific Computing, 2012
Generalized sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many high performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. Here we show that SpGEMM also yields efficient algorithms for general sparse-matrix indexing in distributed memory, provided that the underlying SpGEMM implementation is sufficiently flexible and scalable. We demonstrate that our parallel SpGEMM methods, which use two-dimensional block data distributions with serial hypersparse kernels, are indeed highly flexible, scalable, and memory-efficient in the general case. This algorithm is the first to yield increasing speedup on an unbounded number of processors; our experiments show scaling up to thousands of processors in a variety of test scenarios.
ACM Transactions on Knowledge Discovery from Data, 2020
Big graphs (networks) arising in numerous application areas pose significant challenges for graph analysts as these graphs grow to billions of nodes and edges and are prohibitively large to fit in the main memory. Finding the number of triangles in a graph is an important problem in the mining and analysis of graphs. In this article, we present two efficient MPI-based distributed memory parallel algorithms for counting triangles in big graphs. The first algorithm employs overlapping partitioning and efficient load balancing schemes to provide a very fast parallel algorithm. The algorithm scales well to networks with billions of nodes and can compute the exact number of triangles in a network with 10 billion edges in 16 minutes. The second algorithm divides the network into non-overlapping partitions leading to a space-efficient algorithm. Our results on both artificial and real-world networks demonstrate a significant space saving with this algorithm. We also present a novel approach...
ArXiv, 2017
Big graphs (networks) arising in numerous application areas pose significant challenges for graph analysts as these graphs grow to billions of nodes and edges and are prohibitively large to fit in the main memory. Finding the number of triangles in a graph is an important problem in the mining and analysis of graphs. In this paper, we present two efficient MPI-based distributed memory parallel algorithms for counting triangles in big graphs. The first algorithm employs overlapping partitioning and efficient load balancing schemes to provide a very fast parallel algorithm. The algorithm scales well to networks with billions of nodes and can compute the exact number of triangles in a network with 10 billion edges in 16 minutes. The second algorithm divides the network into non-overlapping partitions leading to a space-efficient algorithm. Our results on both artificial and real-world networks demonstrate a significant space saving with this algorithm. We also present a novel approach ...
2017 IEEE International Conference on Cluster Computing (CLUSTER), 2017
The rapidly growing number of large network analysis problems has led to the emergence of many parallel and distributed graph processing systems; one survey in 2014 identified over 80. Since then, the landscape has evolved; some packages have become inactive while more are being developed. Determining the best approach for a given problem is infeasible for most developers. To enable easy, rigorous, and repeatable comparison of the capabilities of such systems, we present an approach and associated software for analyzing the performance and scalability of parallel, open-source graph libraries. We demonstrate our approach on five graph processing packages (GraphMat, the Graph500, the Graph Algorithm Platform Benchmark Suite, GraphBIG, and PowerGraph) using synthetic and real-world datasets. We examine previously overlooked aspects of parallel graph processing performance, such as phases of execution and energy usage, for three algorithms (breadth first search, single source shortest paths, and PageRank) and compare our results to Graphalytics.
2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012
The availability and utility of large numbers of Graphical Processing Units (GPUs) have enabled parallel computations using extensive multi-threading. Sequential access to global memory and contention at the size-limited shared memory have been the main impediments to fully exploiting potential performance in architectures having a massive number of GPUs. We propose novel memory storage and retrieval techniques that enable parallel graph computations to overcome the above issues. More specifically, given a graph G = (V, E) and an integer k <= |V|, we provide both storage techniques and algorithms to count the number of: a) connected subgraphs of size k; b) k cliques; and c) k independent sets, all of which can be exponential in number. Our storage technique is based on creating a breadth-first search tree and storing it along with non-tree edges in a novel way. The counting problems mentioned above have many uses, including the analysis of social networks.
The increasing scale and wealth of inter-connected data, such as those accrued by social network applications, demand the design of new techniques and platforms to efficiently derive actionable knowledge from large-scale graphs. However, large real-world graphs are famously difficult to process efficiently. Not only do they have a large memory footprint, but most graph algorithms also entail memory access patterns with poor locality, data-dependent parallelism, and a low compute-to-memory access ratio. To complicate matters further, most real-world graphs have a highly heterogeneous node degree distribution; hence partitioning these graphs for parallel processing while simultaneously achieving access locality and load balancing is difficult, if not impossible.