2020, Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
Subgraph enumeration is important for many applications such as network motif discovery and community detection. Recent works utilize graphics processing units (GPUs) to parallelize subgraph enumeration, but they can only handle graphs that fit into GPU memory. In this paper, we propose a new approach for GPU-accelerated subgraph enumeration that scales efficiently to large graphs beyond the GPU memory. Our approach divides the graph into partitions, each of which fits into the GPU memory. The GPU processes one partition at a time and searches for the matched subgraphs of a given pattern (i.e., instances) within the partition, as it would in a small graph. The key challenge lies in enumerating the instances that span different partitions, because this search would enumerate a considerable number of redundant subgraphs and incur expensive data transfers over the PCI-e bus. Therefore, we propose a novel shared execution approach that eliminates the redundant subgraph searches and correctly generates all the instances across different partitions. The experimental evaluation shows that our approach scales to large graphs and achieves significantly better performance than existing single-machine solutions.
2022 IEEE 34th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)
Graph Pattern Mining (GPM) is an important, rapidly evolving, and computationally demanding area. GPM computation relies on subgraph enumeration, which consists of extracting subgraphs that match a given property from an input graph. Graphics Processing Units (GPUs) have been an effective platform to accelerate applications in many areas. However, the irregularity of subgraph enumeration makes efficient execution on GPUs challenging because of uncoalesced memory accesses, divergence, and load imbalance. Unfortunately, these aspects have not been fully addressed in previous work. Thus, this work proposes novel strategies to design and implement subgraph enumeration efficiently on GPUs. We support a depth-first-search-style exploration (DFS-wide) that maximizes memory performance while providing enough parallelism to be exploited by the GPU, along with a warp-centric design that minimizes execution divergence and improves utilization of the computing capabilities. We also propose a low-cost load balancing layer to avoid idleness and redistribute work among thread warps in a GPU. Our strategies have been deployed in a system named DuMato, which provides a simple programming interface to allow efficient implementation of GPM algorithms. Our evaluation has shown that DuMato is often an order of magnitude faster than state-of-the-art GPM systems and can mine larger subgraphs (up to 12 vertices).
2022
Estimating the frequency of sub-graphs is of importance for many tasks, including sub-graph isomorphism, kernel-based anomaly detection, and network structure analysis. While multiple algorithms were proposed for full enumeration or sampling-based estimates, these methods fail in very large graphs. Recent advances in parallelization allow for estimates of total sub-graphs counts in very large graphs. The task of counting the frequency of each sub-graph associated with each vertex also received excellent solutions for undirected graphs. However, there is currently no good solution for very large directed graphs. We here propose VDMC (Vertex specific Distributed Motif Counting) -- a fully distributed algorithm to optimally count all the 3 and 4 vertices connected directed graphs (sub-graph motifs) associated with each vertex of a graph. VDMC counts each motif only once and its efficacy is linear in the number of counted motifs. It is fully parallelized to be efficient in GPU-based com...
The availability and utility of large numbers of Graphical Processing Units (GPUs) have enabled parallel computations using extensive multi-threading. Sequential access to global memory and contention at the size-limited shared memory have been the main impediments to fully exploiting potential performance in architectures having a massive number of GPUs. After an extensive study of data structures and a complexity analysis of various data access methodologies, we propose novel memory storage and retrieval techniques that enable parallel graph computations to overcome the above issues. More specifically, given a graph G = (V, E) and an integer k <= |V|, we provide both storage techniques and algorithms to count the number of: a) connected subgraphs of size k; b) k-cliques; and c) independent sets of size k, all of which can be exponential in number. Our storage techniques are based on creating a breadth-first search tree and storing it along with non-tree edges in a novel way. Our experiments solve the above-mentioned problems using both naïve and advanced data structures on the CPU and GPU. Even a brute-force approach on the GPU achieves a speedup over the CPU implementations. Utilizing the knowledge of BFS-tree properties, the performance gain on the GPU increases and ultimately outperforms the CPU by a factor of at least 5 for graphs that completely fit in the shared memory and by a factor of 10 for larger graphs stored using the global memory. The counting problems mentioned above have many uses, including the analysis of social networks.
2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012
The availability and utility of large numbers of Graphical Processing Units (GPUs) have enabled parallel computations using extensive multi-threading. Sequential access to global memory and contention at the size-limited shared memory have been the main impediments to fully exploiting potential performance in architectures having a massive number of GPUs. We propose novel memory storage and retrieval techniques that enable parallel graph computations to overcome the above issues. More specifically, given a graph G = (V, E) and an integer k <= |V|, we provide both storage techniques and algorithms to count the number of: a) connected subgraphs of size k; b) k-cliques; and c) independent sets of size k, all of which can be exponential in number. Our storage technique is based on creating a breadth-first search tree and storing it along with non-tree edges in a novel way. The counting problems mentioned above have many uses, including the analysis of social networks.
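The three counting problems named in this abstract can be illustrated with a brute-force CPU baseline. The following is a minimal Python sketch, not the BFS-tree storage technique the authors describe; the function names `is_connected` and `count_k_structures` are hypothetical, and "connected subgraphs of size k" is assumed here to mean connected vertex-induced subgraphs on k vertices.

```python
from itertools import combinations


def is_connected(nodes, adj):
    """Return True if the subgraph induced by `nodes` is connected."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        # Only follow edges that stay inside the candidate vertex set.
        for v in adj[u] & nodes:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == nodes


def count_k_structures(adj, k):
    """Brute-force counts over all k-subsets of vertices (k >= 1).

    `adj` maps each vertex to the set of its neighbours (undirected graph).
    Returns (connected induced subgraphs, k-cliques, independent sets of size k).
    """
    connected = cliques = independent = 0
    for subset in combinations(sorted(adj), k):
        pairs = list(combinations(subset, 2))
        edges = sum(1 for u, v in pairs if v in adj[u])
        if edges == len(pairs):   # every pair is adjacent
            cliques += 1
        if edges == 0:            # no pair is adjacent
            independent += 1
        if is_connected(subset, adj):
            connected += 1
    return connected, cliques, independent
```

Because every k-subset is tested independently, the loop body is the kind of work that can be distributed across many threads; the number of subsets still grows combinatorially with k, which is why the storage layout matters in practice.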
Studying properties of graphs is essential to various applications, and the recent growth of online social networks has spurred interest in analyzing their structures using Graphical Processing Units (GPUs). Utilizing the faster shared memory available on GPUs has provided tremendous speed-ups for solving many general-purpose problems. However, when the data required for processing is large and needs to be stored in the global memory instead of the shared memory, simultaneous memory accesses by executing threads become the bottleneck for achieving higher throughput. In this paper, for storing large graphs, we propose and evaluate techniques to efficiently utilize the different levels of the memory hierarchy of GPUs, with the focus being on the larger global memory. Given a graph G = (V, E), we provide an algorithm to count the number of triangles in G while storing the adjacency information in the global memory. Our computation techniques and data structure for retrieving the adjacency information are derived from processing the breadth-first-search tree of the input graph. Techniques to generate combinations of nodes, for testing the properties of the graphs they induce, are also discussed in detail. Our methods can be extended to solve other combinatorial counting problems on graphs, such as finding the number of connected subgraphs of size k, the number of cliques (resp. independent sets) of size k, and related problems for large data sets. In the context of the triangle counting algorithm, we analyze and utilize primitives such as memory access coalescing and avoiding partition camping, which offset the higher access latency of the slower but larger global memory. Our experimental results for the GPU implementation show at least a 10-fold speedup for triangle counting over the CPU counterpart. A further 6-8% increase in performance is obtained by utilizing the above-mentioned primitives, compared to the naïve implementation of the program on the GPU.
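For reference, triangle counting itself can be sketched on the CPU in a few lines. This Python sketch uses a common edge-orientation plus set-intersection scheme, which is swapped in for illustration and is not the BFS-tree-based method of the abstract; the name `count_triangles` is hypothetical, and the input is assumed to be a symmetric adjacency map with comparable vertex ids in which every vertex appears as a key.

```python
def count_triangles(adj):
    """Count triangles in an undirected graph.

    `adj` maps each vertex to the set of its neighbours. Each edge is kept
    only in the direction lower id -> higher id, so a triangle a < b < c is
    found exactly once: from edge (a, b) with c in both oriented lists.
    """
    higher = {u: {v for v in nbrs if v > u} for u, nbrs in adj.items()}
    total = 0
    for u, out in higher.items():
        for v in out:
            # Common higher-numbered neighbours of u and v close a triangle.
            total += len(out & higher[v])
    return total
```

Each (u, v) pair is an independent unit of work, which is what makes the problem attractive for GPUs; the irregular sizes of the intersected sets are what make coalesced global-memory access difficult.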
Database Systems for Advanced Applications, 2015
Subgraph matching is the task of finding all matches of a query graph in a large data graph, a problem known to be NP-complete. Many algorithms have been proposed to solve this problem using CPUs. In recent years, Graphics Processing Units (GPUs) have been adopted to accelerate fundamental graph operations such as breadth-first search and shortest path, owing to their parallelism and high data throughput. The existing subgraph matching algorithms, however, face challenges in mapping backtracking problems to GPU architectures. Moreover, the previous GPU-based graph algorithms are not designed to handle intermediate and final outputs. In this paper, we present a simple and GPU-friendly method for subgraph matching, called GpSM, which is designed for massively parallel architectures. We show that GpSM outperforms the state-of-the-art algorithms and efficiently answers subgraph queries on large graphs.
Proceedings of the 36th ACM International Conference on Supercomputing
Counting k-cliques in a graph is an important problem in graph analysis with many applications such as community detection and graph partitioning. Counting k-cliques is typically done by traversing search trees starting at each vertex in the graph. Parallelizing k-clique counting has been well-studied on CPUs and many solutions exist. However, there are no performant solutions for k-clique counting on GPUs. Parallelizing k-clique counting on GPUs comes with numerous challenges such as the need for extracting fine-grain multi-level parallelism, sensitivity to load imbalance, and constrained physical memory capacity. While there has been work on related problems such as finding maximal cliques and generalized sub-graph matching on GPUs, k-clique counting in particular has yet to be explored in depth. In this paper, we present the first parallel GPU solution specialized for the k-clique counting problem. Our solution supports both graph orientation and pivoting for eliminating redundant clique discovery. It incorporates both vertex-centric and edge-centric parallelization schemes for distributing work across thread blocks, and further partitions work within each thread block to extract fine-grain multi-level parallelism while tolerating load imbalance. It also includes optimizations such as binary encoding of induced sub-graphs and sub-warp partitioning to limit memory consumption and improve the utilization of execution resources. Our evaluation shows that our best GPU implementation outperforms the best state-of-the-art parallel CPU implementation by a geometric mean of 12.39×, 6.21×, and 18.99× for k = 4, 7, and 10, respectively. We also perform a detailed evaluation of the trade-offs involved in the choice of parallelization scheme, and the incremental speedup of each optimization to provide an in-depth understanding of the optimization space. The insights from our optimization flow can be useful for optimizing other clique finding and graph mining solutions on GPUs.
Our code will be open-sourced to enable further research on GPU parallelization of k-clique counting and other similar graph mining algorithms.
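The graph-orientation idea mentioned in this abstract can be illustrated with a small sequential sketch. The Python code below is an assumption-laden illustration, not the authors' GPU implementation: it orients edges by a degree-based vertex ranking (one common choice), omits pivoting entirely, and the name `count_k_cliques` is hypothetical.

```python
def count_k_cliques(adj, k):
    """Count k-cliques using edge orientation (no pivoting).

    `adj` maps each vertex to the set of its neighbours (undirected graph,
    comparable vertex ids). Vertices are ranked by (degree, id) and each edge
    is directed from lower to higher rank, so every clique is discovered
    exactly once, in rank order.
    """
    order = sorted(adj, key=lambda v: (len(adj[v]), v))
    rank = {v: i for i, v in enumerate(order)}
    out = {u: {v for v in adj[u] if rank[v] > rank[u]} for u in adj}

    def extend(candidates, remaining):
        # `candidates` are vertices adjacent to, and ranked above, every
        # vertex already chosen for the current partial clique.
        if remaining == 0:
            return 1
        return sum(extend(candidates & out[v], remaining - 1) for v in candidates)

    return extend(set(adj), k)
```

Each top-level vertex (or each oriented edge) roots an independent search tree, which corresponds to the vertex-centric and edge-centric work distributions the abstract refers to; the very different sizes of those trees are the source of the load imbalance it mentions.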
Frontiers of Computer Science in China, 2009
Although several algorithms for searching subgraphs in motif detection have been presented in the literature, no effort has so far been made to characterize their performance. This paper presents a methodology to evaluate the performance of three algorithms: edge sampling algorithm (ESA), enumerate subgraphs (ESU) and randomly enumerate subgraphs (RAND-ESU). A series of experiments is performed to test sampling speed and sampling quality. The results show that RAND-ESU is more efficient and has a lower computational cost than the other algorithms for large-size motif detection, while ESU has its own advantage in small-size motif detection.
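To make the ESU and RAND-ESU names concrete, here is a minimal Python sketch of the enumeration scheme as it is commonly described: each connected vertex set is grown from its lowest-numbered vertex using only "exclusive" neighbours, and RAND-ESU skips each branch at depth d with probability 1 - p_d. The function name `esu`, the `probs` parameter, and the fixed default seed are assumptions made for illustration.

```python
import random


def esu(adj, k, probs=None, rng=None):
    """Enumerate connected induced k-vertex subgraphs (as vertex sets).

    `adj` maps each vertex to the set of its neighbours (undirected graph,
    comparable vertex ids). With probs=None every branch is explored (ESU);
    otherwise probs[d] is the probability of exploring a branch at depth d
    (RAND-ESU), so the result is a random sample.
    """
    rng = rng or random.Random(0)
    probs = probs or [1.0] * k
    found = []

    def extend(sub, ext, root, closed):
        # `closed` holds `sub` together with every neighbour of `sub`.
        if len(sub) == k:
            found.append(frozenset(sub))
            return
        ext = set(ext)
        while ext:
            w = ext.pop()
            if rng.random() >= probs[len(sub)]:
                continue  # RAND-ESU: skip this branch
            # Exclusive neighbours of w: above the root, not yet reachable.
            new = {u for u in adj[w] if u > root and u not in closed}
            extend(sub | {w}, ext | new, root, closed | adj[w])

    for v in adj:
        if rng.random() < probs[0]:
            extend({v}, {u for u in adj[v] if u > v}, v, {v} | adj[v])
    return found
```

With all probabilities equal to 1 the function returns every connected induced k-vertex set exactly once; lowering the probabilities trades completeness for speed, which is the trade-off between sampling speed and sampling quality that the paper evaluates.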
Communications in Computer and Information Science, 2011
Many natural and artificial structures can be represented as complex networks. Computing the frequency of all subgraphs of a certain size can give a very comprehensive structural characterization of these networks. This is known as the subgraph census problem, and it is also important as an intermediate step in the computation of other features of the network, such as network motifs. The subgraph census problem is computationally hard and most associated algorithms for it are sequential. Here we present several increasingly efficient parallel strategies, culminating in a scalable and adaptive parallel algorithm. We applied our strategies to a representative set of biological networks and achieved almost linear speedups up to 128 processors, paving the way for computing the census for bigger networks and larger subgraph sizes.
Lecture Notes in Computer Science, 2014
Counting the occurrences of small subgraphs in large networks is a fundamental graph mining metric with several possible applications. Computing frequencies of those subgraphs is also known as the subgraph census problem, which is a computationally hard task. In this paper we provide a parallel multicore algorithm for this purpose. At its core we use FaSE, an efficient network-centric sequential subgraph census algorithm, which is able to substantially decrease the number of isomorphism tests needed when compared to past approaches. We use one thread per core and employ a dynamic load balancing scheme capable of dealing with the highly unbalanced search tree induced by FaSE and effectively redistributing work during execution. We assessed the scalability of our algorithm on a varied set of representative networks and achieved near linear speedup up to 32 cores while obtaining a high efficiency for the total 64 cores of our machine.
Proceedings of the 2010 ACM Symposium on …, 2010
In this paper we propose a novel specialized data structure that we call g-trie, designed to deal with collections of subgraphs. The main conceptual idea is akin to a prefix tree in the sense that we take advantage of common topology by constructing a multiway tree where the descendants of a node share a common substructure. We give algorithms to construct a g-trie, to list all stored subgraphs, and to find occurrences on another graph of the subgraphs stored in the g-trie. We evaluate the implementation of this structure and its associated algorithms on a set of representative benchmark biological networks in order to find network motifs. To assess the efficiency of our algorithms we compare their performance with other known network motif algorithms also implemented in the same common platform. Our results show that indeed, g-tries are a feasible, adequate and very efficient data structure for network motifs discovery, clearly outperforming previous algorithms and data structures.
IEEE Transactions on Parallel and Distributed Systems, 2015
This paper presents a massively parallel implementation of a prominent network clustering algorithm, the structural clustering algorithm for networks (SCAN), on a graphical processing unit (GPU). SCAN is a fast and efficient clustering technique for finding hidden communities and isolating hubs/outliers within a network. However, for very large networks, it still takes a considerable amount of time. With the introduction of the massively parallel Compute Unified Device Architecture (CUDA) by Nvidia, applications that properly employ GPUs are demonstrating high speed-ups. In the current study, GPUSCAN, a CUDA-based parallel implementation of SCAN, is presented. SCAN's computation steps have been carefully redesigned to run very efficiently on the GPU by transforming SCAN into a series of highly regular and independent concurrent operations. All intermediate data structures are created on the GPU to benefit efficiently from the GPU's memory hierarchy. How these structures are reformed and represented in the GPU memory hierarchy is illustrated. Through GPUSCAN, a large network or a batch of disjoint networks can now be offloaded to the GPU for very fast and equivalent structural clustering. The GPU-accelerated structural clustering has been shown to be much faster than the sequential CPU implementation. Both GPUSCAN and SCAN are tested on artificial and real-world networks of different sizes. Results indicate that as the network becomes larger, GPUSCAN significantly outperforms SCAN. In the tested datasets, a speed-up of over 500-fold is achieved. For instance, calculating structural similarity and clustering for the 5.5 million edges of the California road network is 513-fold faster in GPUSCAN than in the serial version of SCAN.
Lecture Notes in Computer Science, 2007
The study of biological networks and network motifs can yield significant new insights into systems biology. Previous methods of discovering network motifs (network-centric subgraph enumeration and sampling) have been limited to motifs of 6 to 8 nodes, revealing only the smallest network components. New methods are necessary to identify larger network sub-structures and functional motifs.
The increasing scale and wealth of inter-connected data, such as those accrued by social network applications, demand the design of new techniques and platforms to efficiently derive actionable knowledge from large-scale graphs. However, large real-world graphs are famously difficult to process efficiently. Not only do they have a large memory footprint, but most graph algorithms also entail memory access patterns with poor locality, data-dependent parallelism and a low compute-to-memory access ratio. To complicate matters further, most real-world graphs have a highly heterogeneous node degree distribution, hence partitioning these graphs for parallel processing while simultaneously achieving access locality and load balancing is difficult, if not impossible.
Journal of Data Mining in Genomics & Proteomics, 2016
A network motif is a pattern of inter-connections that occurs in a complex network in numbers significantly higher than in similar randomized networks. The basic premise of finding network motifs lies in the ability to compute the frequency of subgraphs. To discover network motifs, one has to compute a subgraph census on the original network, which calculates the frequency of all subgraphs of a certain type. The frequency of the same set of subgraphs must then be computed on the similar randomized networks. The bottleneck of the entire motif discovery process is therefore the computation of subgraph frequencies, and this is the core computational problem. The proposed work presents the Suffix-Graph, a data structure that stores graphs efficiently, and designs an algorithm that retrieves subgraphs efficiently in order to detect network motifs, and applies them to transcriptional interactions in Escherichia coli.
2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020
There are relatively few studies of distributed GPU graph analytics systems in the literature and they are limited in scope since they deal with small data-sets, consider only a few applications, and do not consider the interplay between partitioning policies and optimizations for computation and communication. In this paper, we present the first detailed analysis of graph analytics applications for massive real-world datasets on a distributed multi-GPU platform and the first analysis of strong scaling of smaller real-world datasets. We use D-IrGL, the state-of-the-art distributed GPU graph analytical framework, in our study. Our evaluation shows that (1) the Cartesian vertex-cut partitioning policy is critical to scale computation out on GPUs even at a small scale, (2) static load imbalance is a key factor in performance since memory is limited on GPUs, (3) device-host communication is a significant portion of execution time and should be optimized to gain performance, and (4) asynchronous execution is not always better than bulk-synchronous execution.
Journal of Parallel and Distributed Computing, 2011
Many natural structures can be naturally represented by complex networks. Discovering network motifs, which are overrepresented patterns of inter-connections, is a computationally hard task related to graph isomorphism. Sequential methods are hindered by an exponential execution time growth when we increase the size of motifs and networks. In this article we study the opportunities for parallelism in existing methods and propose new parallel strategies that adapt and extend one of the most efficient serial methods known from the Fanmod tool. We propose both a master-worker strategy and one with distributed control, in which we employ a randomized receiver initiated methodology capable of providing dynamic load balancing during the whole computation process. Our strategies are capable of dealing both with exact and approximate network motifs discovery. We implement and apply our algorithms to a set of representative networks and examine their scalability up to 128 processor cores. We obtain almost linear speedups, showcasing the efficiency of our proposed approach and are able to reach motif sizes that were not previously achievable using conventional serial algorithms.
Proceedings of the VLDB Endowment
Due to the limited capacity of GPU memory, the majority of prior work on graph applications on GPUs has been restricted to graphs of modest sizes that fit in memory. Recent hardware and software advances make it possible to address much larger host memory transparently as a part of a feature known as unified virtual memory. While accessing host memory over an interconnect is understandably slower, the problem space has not been sufficiently explored in the context of a challenging workload with low computational intensity and an irregular data access pattern such as graph traversal. We analyse the performance of breadth first search (BFS) for several large graphs in the context of unified memory and identify the key factors that contribute to slowdowns. Next, we propose a lightweight offline graph reordering algorithm, HALO (Harmonic Locality Ordering), that can be used as a pre-processing step for static graphs. HALO yields speedups of 1.5x-1.9x over baseline in subsequent traversa...
2013
Many structural patterns in the real world, such as molecules, chemical compounds, social networks, and road networks, are represented as graphs. Mining these graphs to extract useful information is of special interest and has many applications, including drug discovery, compound synthesis, anomaly detection in networks, and social network analysis for finding groups. One of the most interesting problems in graph mining is the graph containment problem: given a query graph q, find all graphs in a given graph dataset that contain q as a subgraph. This means finding all graphs that have a subgraph isomorphic to the query graph. Because real-world graph datasets contain a vast number of graphs, subgraph isomorphism testing becomes tedious, complex, and time- and space-consuming. It is therefore necessary to build an index of the graphs in the dataset for cost-efficient query processing. In this paper we proposed a time efficient ...
2011 International Conference on Parallel Architectures and Compilation Techniques, 2011
Graphs are a fundamental data representation that has been used extensively in various domains. In graph-based applications, a systematic exploration of the graph such as a breadth-first search (BFS) often serves as a key component in the processing of their massive data sets. In this paper, we present a new method for implementing the parallel BFS algorithm on multi-core CPUs which exploits a fundamental property of randomly shaped real-world graph instances. By utilizing memory bandwidth more efficiently, our method shows improved performance over the current state-of-the-art implementation and increases its advantage as the size of the graph increases. We then propose a hybrid method which, for each level of the BFS algorithm, dynamically chooses the best implementation from: a sequential execution, two different methods of multi-core execution, and a GPU execution. Such a hybrid approach provides the best performance for each graph size while avoiding poor worst-case performance on high-diameter graphs. Finally, we study the effects of the underlying architecture on BFS performance by comparing multiple CPU and GPU systems; a high-end GPU system performed as well as a quad-socket high-end CPU system.
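The per-level selection idea can be sketched as a level-synchronous BFS that records which backend would be chosen for each frontier. This Python sketch is purely illustrative: the thresholds `small` and `large`, the three backend labels, and the name `hybrid_bfs` are hypothetical, the abstract does not state the actual selection rule, and all levels are in fact expanded sequentially here.

```python
def hybrid_bfs(adj, source, small=2, large=4):
    """Level-synchronous BFS that logs a per-level backend choice.

    `adj` maps each vertex to an iterable of neighbours. Returns the BFS
    level of every reachable vertex and, for each level, the label of the
    backend a hybrid scheme might pick based on frontier size alone.
    """
    level = {source: 0}
    frontier = [source]
    choices = []
    depth = 0
    while frontier:
        size = len(frontier)
        # Hypothetical rule: tiny frontiers run sequentially, mid-sized ones
        # on the multi-core CPU, large ones on the GPU.
        if size <= small:
            choices.append("sequential")
        elif size <= large:
            choices.append("multicore")
        else:
            choices.append("gpu")
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                if v not in level:
                    level[v] = depth + 1
                    next_frontier.append(v)
        frontier = next_frontier
        depth += 1
    return level, choices
```

A high-diameter graph such as a long path keeps every frontier tiny, so a rule of this kind would never pay the launch and transfer overhead of the GPU, which matches the abstract's remark about avoiding poor worst-case performance on such graphs.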