Studying properties of graphs is essential to various applications, and the recent growth of online social networks has spurred interest in analyzing their structures using Graphical Processing Units (GPUs). Utilizing the faster shared memory available on GPUs has provided tremendous speed-ups for many general-purpose problems. However, when the data required for processing is large and must be stored in global memory instead of shared memory, simultaneous memory accesses by executing threads become the bottleneck for achieving higher throughput. In this paper, for storing large graphs, we propose and evaluate techniques to efficiently utilize the different levels of the GPU memory hierarchy, with the focus on the larger global memory. Given a graph G = (V, E), we provide an algorithm to count the number of triangles in G while storing the adjacency information in global memory. Our computation techniques and the data structure for retrieving the adjacency information are derived from processing the breadth-first-search tree of the input graph. Techniques to generate combinations of nodes for testing properties of the subgraphs they induce are also discussed in detail. Our methods can be extended to solve other combinatorial counting problems on graphs, such as finding the number of connected subgraphs of size k, the number of cliques (resp. independent sets) of size k, and related problems for large data sets. In the context of the triangle counting algorithm, we analyze and utilize primitives such as memory access coalescing and avoidance of partition camping that offset the increased access latency of the slower but larger global memory. Our experimental results for the GPU implementation show at least a 10x speedup for triangle counting over the CPU counterpart. A further 6-8% increase in performance is obtained by utilizing the above-mentioned primitives as compared to a naïve GPU implementation of the program.
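To make the access pattern concrete, the following is a minimal CUDA sketch of edge-centric triangle counting over a CSR graph kept in global memory, assuming sorted neighbor lists; the names (row_offsets, col_indices, edge_u, edge_v) and the per-edge intersection loop are illustrative assumptions, not the authors' actual code. Consecutive threads read consecutive edge-array entries, which is the kind of coalesced access the abstract refers to.

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical CSR layout in GPU global memory:
//   row_offsets[v] .. row_offsets[v+1] indexes the sorted neighbor list of v in col_indices.
// Each thread owns one edge (u, v) and counts common neighbors by merging the two
// sorted lists; consecutive threads load consecutive entries of edge_u/edge_v, so
// those global-memory reads coalesce.
__global__ void count_triangles(const int *row_offsets, const int *col_indices,
                                const int *edge_u, const int *edge_v,
                                int num_edges, unsigned long long *count) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= num_edges) return;

    int u = edge_u[e], v = edge_v[e];
    int i = row_offsets[u], i_end = row_offsets[u + 1];
    int j = row_offsets[v], j_end = row_offsets[v + 1];
    unsigned long long local = 0;

    // Sorted-list intersection: every common neighbor w closes a triangle u-v-w.
    while (i < i_end && j < j_end) {
        int a = col_indices[i], b = col_indices[j];
        if (a == b)      { ++local; ++i; ++j; }
        else if (a < b)  { ++i; }
        else             { ++j; }
    }
    atomicAdd(count, local);   // one atomic per thread keeps contention low
}

int main() {
    // Tiny example: vertices {0,1,2,3}, edges 0-1, 0-2, 0-3, 1-2, 2-3 -> 2 triangles.
    int h_off[] = {0, 3, 5, 8, 10};
    int h_col[] = {1, 2, 3, 0, 2, 0, 1, 3, 0, 2};
    int h_eu[]  = {0, 0, 0, 1, 2};
    int h_ev[]  = {1, 2, 3, 2, 3};
    int *d_off, *d_col, *d_eu, *d_ev;
    unsigned long long *d_cnt, h_cnt = 0;
    cudaMalloc(&d_off, sizeof(h_off)); cudaMalloc(&d_col, sizeof(h_col));
    cudaMalloc(&d_eu, sizeof(h_eu));   cudaMalloc(&d_ev, sizeof(h_ev));
    cudaMalloc(&d_cnt, sizeof(h_cnt));
    cudaMemcpy(d_off, h_off, sizeof(h_off), cudaMemcpyHostToDevice);
    cudaMemcpy(d_col, h_col, sizeof(h_col), cudaMemcpyHostToDevice);
    cudaMemcpy(d_eu, h_eu, sizeof(h_eu), cudaMemcpyHostToDevice);
    cudaMemcpy(d_ev, h_ev, sizeof(h_ev), cudaMemcpyHostToDevice);
    cudaMemcpy(d_cnt, &h_cnt, sizeof(h_cnt), cudaMemcpyHostToDevice);
    count_triangles<<<1, 32>>>(d_off, d_col, d_eu, d_ev, 5, d_cnt);
    cudaMemcpy(&h_cnt, d_cnt, sizeof(h_cnt), cudaMemcpyDeviceToHost);
    printf("triangles: %llu\n", h_cnt / 3);   // each triangle is counted once per edge
    return 0;
}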
The availability and utility of large numbers of Graphical Processing Units (GPUs) have enabled parallel computations using extensive multi-threading. Sequential access to global memory and contention at the size-limited shared memory have been the main impediments to fully exploiting the potential performance in architectures with a massive number of GPUs. After an extensive study of data structures and a complexity analysis of various data-access methodologies, we propose novel memory storage and retrieval techniques that enable parallel graph computations to overcome the above issues. More specifically, given a graph G = (V, E) and an integer k <= |V|, we provide both storage techniques and algorithms to count the number of: a) connected subgraphs of size k; b) k-cliques; and c) k-independent sets, all of which can be exponential in number. Our storage techniques are based on creating a breadth-first search tree and storing it along with the non-tree edges in a novel way. Our experiments solve the above-mentioned problems using both naïve and advanced data structures on the CPU and GPU. Even a brute-force GPU approach achieves speedup over the corresponding CPU implementations. By exploiting properties of the BFS tree, the performance gain on the GPU increases further, ultimately outperforming the CPU by a factor of at least 5 for graphs that fit entirely in shared memory and by a factor of 10 for larger graphs stored in global memory. The counting problems mentioned above have many uses, including the analysis of social networks.
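The abstract does not spell out the BFS-based storage, so the following host-side sketch is only an assumed, simplified stand-in: it runs a BFS over a CSR graph and splits the edges into a parent array (the tree) and a compact list of non-tree edges. Names such as BfsStorage and build_bfs_storage are invented for illustration.

#include <vector>
#include <queue>
#include <utility>

// Hypothetical preprocessing step: separate BFS-tree edges (encoded by a parent
// array) from non-tree edges.  Assumes an undirected, connected graph in CSR
// form; a real implementation would restart from each unvisited vertex.
struct BfsStorage {
    std::vector<int> parent;                    // parent[v] in the BFS tree, -1 for the root
    std::vector<std::pair<int,int>> nontree;    // edges not used by the BFS tree, stored once
};

BfsStorage build_bfs_storage(const std::vector<int>& off,
                             const std::vector<int>& col, int root) {
    int n = (int)off.size() - 1;
    BfsStorage s;
    s.parent.assign(n, -2);                     // -2 means "not visited yet"
    s.parent[root] = -1;
    std::queue<int> q;
    q.push(root);
    while (!q.empty()) {
        int u = q.front(); q.pop();
        for (int i = off[u]; i < off[u + 1]; ++i) {
            int v = col[i];
            if (s.parent[v] == -2) {            // tree edge: first time we reach v
                s.parent[v] = u;
                q.push(v);
            } else if (u < v && s.parent[u] != v && s.parent[v] != u) {
                s.nontree.push_back({u, v});    // non-tree edge, recorded exactly once
            }
        }
    }
    return s;
}

The parent array and the non-tree edge list can then be copied to the GPU as two flat arrays, which is the kind of compact, traversal-derived representation the abstract alludes to.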
Triangle counting in a graph is a building block for computing clustering coefficients, a widely used social network analysis metric for finding key players in a network based on their local connectivity. In this paper we present the first scalable GPU implementation of triangle counting. Our approach uses a new list-intersection algorithm called Intersect Path (named after the Merge Path algorithm). This algorithm has two levels of parallelism. The first level partitions the vertices across the streaming multiprocessors of the GPU. The second level parallelizes the work across each multiprocessor's streaming processors and utilizes different block sizes. For testing purposes, we used graphs taken from the DIMACS 10 Graph Challenge. Our experiments were conducted on NVIDIA's K40 GPU. Our GPU triangle counting implementation achieves speedups in the range of 9x-32x over a sequential CPU implementation.
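The Intersect Path algorithm itself is not described in the abstract; the CUDA sketch below only illustrates the two-level idea in a simplified form, with level 1 assigning one vertex per thread block and level 2 letting the block's threads stride over that vertex's neighbor list, using binary search rather than the authors' merge-path-style partitioning. All identifiers are placeholders.

// Level 1: one vertex u per block.  Level 2: the block's threads share u's
// neighbor list and test candidate third vertices by binary search in adj(u).
__device__ bool binary_search(const int *arr, int lo, int hi, int key) {
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (arr[mid] == key) return true;
        if (arr[mid] < key) lo = mid + 1; else hi = mid;
    }
    return false;
}

__global__ void triangles_per_vertex(const int *off, const int *col,
                                     int n, unsigned long long *count) {
    int u = blockIdx.x;                          // level 1: one vertex per block
    if (u >= n) return;
    unsigned long long local = 0;
    // level 2: threads of the block stride over u's neighbor list
    for (int i = off[u] + threadIdx.x; i < off[u + 1]; i += blockDim.x) {
        int v = col[i];
        if (v <= u) continue;                    // enforce u < v
        for (int j = off[v]; j < off[v + 1]; ++j) {
            int w = col[j];
            if (w > v && binary_search(col, off[u], off[u + 1], w))
                ++local;                         // u < v < w and all three edges exist
        }
    }
    if (local) atomicAdd(count, local);          // each triangle counted exactly once
}

A launch such as triangles_per_vertex<<<num_vertices, 128>>>(off, col, num_vertices, count) assigns blocks to vertices and block threads to neighbor-list entries, mirroring the two levels of parallelism described above.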
2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012
The availability and utility of large numbers of Graphical Processing Units (GPUs) have enabled parallel computations using extensive multi-threading. Sequential access to global memory and contention at the size-limited shared memory have been the main impediments to fully exploiting the potential performance in architectures with a massive number of GPUs. We propose novel memory storage and retrieval techniques that enable parallel graph computations to overcome the above issues. More specifically, given a graph G = (V, E) and an integer k <= |V|, we provide both storage techniques and algorithms to count the number of: a) connected subgraphs of size k; b) k-cliques; and c) k-independent sets, all of which can be exponential in number. Our storage technique is based on creating a breadth-first search tree and storing it along with the non-tree edges in a novel way. The counting problems mentioned above have many uses, including the analysis of social networks.
ACM SIGGRAPH 2005 Courses on - SIGGRAPH '05, 2005
We present new algorithms on commodity graphics processors for performing fast computation of several common database operations. Specifically, we consider operations such as conjunctive selections, aggregations, and semi-linear queries, which are essential computational components of typical database, data warehousing, and data mining applications. While graphics processing units (GPUs) have been designed for fast display of geometric primitives, we utilize the inherent pipelining and parallelism, single-instruction, multiple-data (SIMD) capabilities, and vector processing functionality of GPUs for evaluating boolean predicate combinations and semi-linear queries on attributes and executing database operations efficiently. Our algorithms take into account some of the limitations of the programming model of current GPUs and perform no data rearrangements. Our algorithms have been implemented on a programmable GPU (e.g., NVIDIA's GeForce FX 5900) and applied to databases consisting of up to a million records. We have compared their performance with an optimized implementation of CPU-based algorithms. Our experiments indicate that the graphics processor available on commodity computer systems is an effective co-processor for performing database operations. While CPUs are used for general-purpose computation, GPUs have been primarily designed for transforming, rendering, and texturing geometric primitives, such as triangles. The driving application of GPUs has been fast rendering for visual simulation, virtual reality, and computer gaming. GPUs are increasingly being used as co-processors to CPUs. GPUs are extremely fast and are capable of processing tens of millions of geometric primitives per second. The peak performance of GPUs has been increasing at the rate of 2.5-3.0 times a year, much faster than Moore's law growth for CPUs. At this rate, the GPU's peak performance may move into the teraflop range by 2005 [19]. Most of this performance arises from multiple processing units and stream processing. The GPU treats the vertices and pixels constituting graphics primitives as streams. Multiple vertex and pixel processing engines on a GPU are connected via data flows. These processing engines perform simple operations in parallel.
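This paper predates CUDA and evaluates predicates with fragment programs; purely to illustrate the data-parallel idea of a conjunctive selection, a present-day CUDA equivalent might look like the kernel below. The record layout, attribute names, and thresholds are all invented for the example.

// Illustrative only: marks every record satisfying (age > 30 AND salary < 100000),
// one record per thread.  The attribute arrays and the predicate are made up.
__global__ void conjunctive_select(const int *age, const float *salary,
                                   int num_records, unsigned char *match) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < num_records)
        match[r] = (age[r] > 30) && (salary[r] < 100000.0f);
}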
Proceedings of the ACM Workshop on High Performance Graph Processing, 2016
We implement exact triangle counting in graphs on the GPU using three different methodologies: subgraph matching to a triangle pattern; programmable graph analytics, with a set-intersection approach; and a matrix formulation based on sparse matrix-matrix multiplies. All three deliver best-of-class performance over CPU implementations and over comparable GPU implementations, with the graph-analytic approach achieving the best performance due to its ability to exploit efficient filtering steps to remove unnecessary work and its high-performance set-intersection core.
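For reference, the matrix formulation mentioned above rests on the standard identity relating triangles to powers of the symmetric 0/1 adjacency matrix A (textbook material, not a description of the paper's exact kernel):

    Δ = tr(A³) / 6 = (1/6) Σ_{i,j} A_ij · (A²)_ij

so a masked sparse matrix-matrix multiply that evaluates (A²)_ij only where A_ij = 1 yields the triangle count directly.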
2009
The success of the gaming industry is now pushing processor technology like we have never seen before. Since recent graphics processors (GPUs) have been improving both their programmability and their floating-point processing capability, they have become very appealing as accelerators for general-purpose computing. This minisymposium gives an overview of some of these advancements by bringing together experts working on the development of techniques and tools that improve the programmability of GPUs as well as experts interested in utilizing the computational power of GPUs for scientific applications. This first EuroGPU Minisymposium brought together several such experts, and this short summary gives a quick but useful overview of some of the major recent advancements.
Pollack Periodica, 2008
The evolution of GPUs (graphics processing units) has been enormous in the past few years. Their computational power has improved exponentially, while the range of tasks computable on GPUs has grown significantly wider. The milestone of recent GPU development is the appearance of unified-architecture devices. These GPUs implement a massively parallel design, which makes them capable not only of processing common computer graphics tasks but also of performing highly parallel mathematical algorithms effectively. Recognizing this, GPU vendors have issued developer platforms that let programmers manage computations on the GPU as a data-parallel computing device without the need to map them to a graphics API. Researchers have welcomed this initiative, and the application of the new technology is quickly spreading in various branches of science.
Journal of Parallel and Distributed Computing, 2008
We report on our experience with integrating and using graphics processing units (GPUs) as fast parallel floating-point co-processors to accelerate two fundamental computational kernels from scientific computing on the GPU: sparse direct factorization and nonlinear interior-point optimization. Since a full re-implementation of these complex kernels is typically not feasible, we identify matrix-matrix multiplication as a natural first entry point for a minimally invasive integration of GPUs. We investigate the performance of the NVIDIA GeForce 8800 multicore chip, originally architected for intensive gaming applications. We exploit the architectural features of the GeForce 8800 GPU to design an efficient GPU-parallel sparse matrix solver. A prototype approach to leveraging the bandwidth and computing power of GPUs for these matrix kernel operations is demonstrated, resulting in an overall performance of over 110 GFlop/s on the desktop for large matrices and over 38 GFlop/s for sparse matrices arising in real applications. We use our GPU algorithm for PDE-constrained optimization problems and demonstrate that the commodity GPU is a useful co-processor for scientific applications.
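As a rough illustration of the "matrix-matrix multiplication as entry point" idea (not the paper's solver), a dense single-precision multiply can be offloaded to the GPU through cuBLAS as sketched below; the matrix size, fill values, and column-major layout are arbitrary assumptions for the example.

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

// Offload C = A * B (column-major, single precision) to the GPU via cuBLAS.
// This only illustrates a minimally invasive GPU entry point; the paper's
// actual kernels are sparse factorizations and interior-point optimization.
int main() {
    const int n = 512;                                   // arbitrary square size
    std::vector<float> A(n * n, 1.0f), B(n * n, 2.0f), C(n * n, 0.0f);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, n * n * sizeof(float));
    cudaMalloc(&dB, n * n * sizeof(float));
    cudaMalloc(&dC, n * n * sizeof(float));
    cudaMemcpy(dA, A.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    cudaMemcpy(C.data(), dC, n * n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("C[0] = %f (expect %f)\n", C[0], 2.0f * n);   // all-ones times all-twos
    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}

Compiled with nvcc and linked against cuBLAS (nvcc example.cu -lcublas), this keeps the host code otherwise unchanged, which is the "minimally invasive" integration style described above.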
Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, 2020
Subgraph enumeration is important for many applications such as network motif discovery and community detection. Recent works utilize graphics processing units (GPUs) to parallelize subgraph enumeration, but they can only handle graphs that fit into GPU memory. In this paper, we propose a new approach for GPU-accelerated subgraph enumeration that efficiently scales to large graphs beyond the GPU memory. Our approach divides the graph into partitions, each of which fits into GPU memory. The GPU processes one partition at a time and searches for the matched subgraphs of a given pattern (i.e., instances) within the partition, just as it would for a small graph. The key challenge is enumerating the instances that span different partitions, because a naïve search would enumerate considerable numbers of redundant subgraphs and incur expensive data transfers over the PCI-e bus. Therefore, we propose a novel shared-execution approach that eliminates the redundant subgraph searches and correctly generates all the instances across different partitions. The experimental evaluation shows that our approach scales to large graphs and achieves significantly better performance than existing single-machine solutions.
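A host-side CUDA skeleton of the partition-at-a-time pattern described above is sketched below; the partition format, the enumeration kernel (here a dummy that only counts entries), and the handling of cross-partition instances are all placeholders rather than the paper's shared-execution method.

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Placeholder standing in for per-partition subgraph enumeration; a real
// implementation would search instances of the query pattern inside the partition.
__global__ void enumerate_in_partition(const int *col, int nnz,
                                       unsigned long long *num_instances) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nnz) atomicAdd(num_instances, 1ULL);   // dummy work for the sketch
}

// Stream one graph partition at a time through GPU memory: copy in, process, evict.
int main() {
    std::vector<std::vector<int>> partitions = {{0, 1, 2}, {3, 4}, {5}};  // fake edge data
    unsigned long long *d_count, h_count = 0;
    cudaMalloc(&d_count, sizeof(h_count));
    cudaMemcpy(d_count, &h_count, sizeof(h_count), cudaMemcpyHostToDevice);

    for (const auto &p : partitions) {
        int *d_col;
        cudaMalloc(&d_col, p.size() * sizeof(int));
        cudaMemcpy(d_col, p.data(), p.size() * sizeof(int), cudaMemcpyHostToDevice);
        enumerate_in_partition<<<1, 64>>>(d_col, (int)p.size(), d_count);
        cudaDeviceSynchronize();                   // finish before freeing the partition
        cudaFree(d_col);                           // make room for the next partition
    }
    cudaMemcpy(&h_count, d_count, sizeof(h_count), cudaMemcpyDeviceToHost);
    printf("instances (dummy count): %llu\n", h_count);
    cudaFree(d_count);
    return 0;
}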
2015 IEEE International Symposium on Workload Characterization, 2015
We identify several factors that are critical to high-performance GPU graph analytics: efficient building block operators, synchronization and data movement, workload distribution and load balancing, and memory access patterns. We analyze the impact of these critical factors through three GPU graph analytic frameworks, Gunrock, MapGraph, and VertexAPI2. We also examine their effect on different workloads: four common graph primitives from multiple graph application domains, evaluated through real-world and synthetic graphs. We show that efficient building block operators enable more powerful operations for fast information propagation and result in fewer device kernel invocations, less data movement, and fewer global synchronizations, and thus are key focus areas for efficient large-scale graph analytics on the GPU.