We discuss the problem of extending data mining approaches to cases in which data points arise in the form of individual graphs. Being able to find the intrinsic low-dimensionality in ensembles of graphs can be useful in a variety of modeling contexts, especially when coarse-graining the detailed graph information is of interest. One of the main challenges in mining graph data is the definition of a suitable pairwise similarity metric in the space of graphs. We explore two practical solutions to solving this problem: one based on finding subgraph densities, and one using spectral information. The approach is illustrated on three test data sets (ensembles of graphs); two of these are obtained from standard graph generating algorithms, while the graphs in the third example are sampled as dynamic snapshots from an evolving network simulation.
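The subgraph-density idea mentioned above can be illustrated with a small sketch. This is not the authors' construction: the choice of edges and triangles as the counted subgraphs, the Euclidean distance, and the function names `subgraph_densities` and `density_distance` are all assumptions made for illustration.

```python
from itertools import combinations
from math import comb, sqrt


def subgraph_densities(n, edges):
    """Return (edge density, triangle density) for a simple graph on nodes 0..n-1.

    Assumes n >= 3 so that both normalizing counts are nonzero.
    """
    edge_set = {frozenset(e) for e in edges}
    n_triangles = sum(
        1
        for a, b, c in combinations(range(n), 3)
        if {frozenset((a, b)), frozenset((b, c)), frozenset((a, c))} <= edge_set
    )
    # Normalize by the number of possible edges and possible triangles.
    return (len(edge_set) / comb(n, 2), n_triangles / comb(n, 3))


def density_distance(g1, g2):
    """Euclidean distance between density vectors (an illustrative choice)."""
    d1 = subgraph_densities(*g1)
    d2 = subgraph_densities(*g2)
    return sqrt(sum((x - y) ** 2 for x, y in zip(d1, d2)))


# Two toy graphs on four nodes: the complete graph K4 and the path P4.
complete_k4 = (4, [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)])
path_p4 = (4, [(0, 1), (1, 2), (2, 3)])

print(subgraph_densities(*complete_k4))
print(subgraph_densities(*path_p4))
print(density_distance(complete_k4, path_p4))
```

A matrix of such pairwise distances over an ensemble of graphs is the kind of input that standard dimensionality-reduction methods expect; which subgraphs to count is a modeling decision the sketch does not settle.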
2009
In the early years of data mining and knowledge discovery in databases, method development focused on rigidly and plainly structured data. Most often efforts were even confined to data that can be represented as a simple table, which describes a set of sample cases by attribute-value pairs. Recent years, however, have seen a constantly growing interest in the analysis of more complex data, with a less rigid and/or more sophisticated structure.
Computer Networks, 2011
This paper proposes a novel non-parametric technique for clustering networks based on their structure. Many topological measures have been introduced in the literature to characterize topological properties of networks. These measures provide meaningful information about the structural properties of a network, but many networks share similar values of a given measure. Furthermore, strong correlations between these measures occur on real-world graphs [2], so that using them to distinguish arbitrary graphs is difficult in practice.
2008 IEEE International Conference on Data Mining Workshops, 2008
We propose a dynamic graph-based relational mining approach using graph-rewriting rules to learn patterns in networks that structurally change over time. A dynamic graph containing a sequence of graphs over time represents dynamic properties as well as structural properties of the network. Our approach discovers graph-rewriting rules, which describe the structural transformations between two sequential graphs over time, and also learns description rules that generalize over the discovered graph-rewriting rules. The discovered graph-rewriting rules show how networks change over time, and the description rules in the graph-rewriting rules show temporal patterns in the structural changes. We apply our approach to biological networks to understand how the biosystems change over time. Our compression-based discovery of the description rules is compared with the frequent subgraph mining approach using several evaluation metrics.
Expert Systems with Applications, 2014
2012
This paper proposes a novel non-parametric technique for clustering networks based on their structure. Many topological measures have been introduced in the literature to characterize topological properties of networks. These measures provide meaningful information about the structural properties of a network, but many networks share similar values of a given measure. Furthermore, strong correlations between these measures occur on real-world graphs [2], so that using them to distinguish arbitrary graphs is difficult in practice.
Computer Science Review, Elsevier, Vol. 7, Feb. 2013, pp. 1–34., 2013
In this survey we review the literature and concepts of the data mining of social networks, with special emphasis on their representation as a graph structure. The survey is divided into two principal parts: first we conduct a survey of the literature which forms the ‘basis’ and background for the field; second we define a set of ‘hot topics’ which are currently in vogue in congresses and the literature. The ‘basis’ or background part is divided into four major themes: graph theory, social networks, online social networks and graph mining. The graph mining theme is organized into ten subthemes. The second, ‘hot topic’ part, is divided into five major themes: communities, influence and recommendation, models metrics and dynamics, behaviour and relationships, and information diffusion.
Pattern Recognition Letters
In spite of the simple linear relationship between the adjacency matrix A and the Laplacian matrix L, L = D - A, where D is the degree matrix, these matrices seem to reveal information about the graph in different ways: some details are detected by only one of them, as in the case of cospectral graphs. Based on this observation, a new graph similarity measure, referred to as joint spectral similarity (JSS), is introduced; it incorporates spectral information from both A and L. A weighting parameter controls the relative influence of each matrix. Furthermore, to highlight the overlapping and unequal contributions of these matrices to graph representation, they are compared in terms of the so-called Von Neumann (VN) entropy, connectivity and complexity measures. The graph is viewed as a quantum system, and the calculated VN entropy of its perturbed density matrix thus emphasizes how much of the information in A and L overlaps. The impact of matrix representation is illustrated by classification results on real and conceptual graphs based on the JSS measure. The results show the effectiveness of the JSS measure in terms of graph classification accuracy, highlight varying rates of information overlap between A and L, and point out the different ways in which they recover structural information about the graph.
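The remark about cospectral graphs can be checked on a small, well-known pair: the star K1,4 and the disjoint union of a 4-cycle with an isolated node share an adjacency spectrum but not a Laplacian spectrum. The sketch below is not the JSS measure itself. It compares spectra exactly through the traces tr(M^k) for k = 1..n, which coincide for two symmetric n-by-n matrices precisely when their eigenvalues coincide. The helper names are illustrative.

```python
def adjacency(n, edges):
    """Adjacency matrix A of an undirected simple graph on nodes 0..n-1."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = A[j][i] = 1
    return A


def laplacian(A):
    """Laplacian L = D - A, where D is the diagonal degree matrix."""
    n = len(A)
    return [[(sum(A[i]) if i == j else 0) - A[i][j] for j in range(n)]
            for i in range(n)]


def mat_mul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def spectral_moments(M):
    """Return [tr(M), tr(M^2), ..., tr(M^n)] using exact integer arithmetic."""
    n = len(M)
    power = [row[:] for row in M]
    moments = []
    for _ in range(n):
        moments.append(sum(power[i][i] for i in range(n)))
        power = mat_mul(power, M)
    return moments


# Star K1,4 (node 0 is the hub) and a 4-cycle plus isolated node 4.
A_star = adjacency(5, [(0, 1), (0, 2), (0, 3), (0, 4)])
A_cycle = adjacency(5, [(0, 1), (1, 2), (2, 3), (3, 0)])

same_A = spectral_moments(A_star) == spectral_moments(A_cycle)
same_L = (spectral_moments(laplacian(A_star))
          == spectral_moments(laplacian(A_cycle)))
print("adjacency spectra equal:", same_A)
print("Laplacian spectra equal:", same_L)
```

In this example the adjacency matrix cannot tell the two graphs apart while the Laplacian can, which is the sort of complementary information a combined measure is meant to exploit.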
Journal of Complex Networks, 2021
Based on a large dataset containing thousands of real-world networks ranging from genetic, protein interaction and metabolic networks to brain, language, ecology and social networks, we search for defining structural measures of the different complex network domains (CND). We calculate 208 measures for all networks, and using a comprehensive and scrupulous workflow of statistical and machine learning methods, we investigated the limitations and possibilities of identifying the key graph measures of CNDs. Our approach managed to identify well distinguishable groups of network domains and confer their relevant features. These features turn out to be CND specific and not unique even at the level of individual CNDs. The presented methodology may be applied to other similar scenarios involving highly unbalanced and skewed datasets.
2011
The Eighth Workshop on Mining and Learning with Graphs (MLG) was held at KDD 2010 in Washington DC. It brought together a variety of researchers interested in analyzing data that is best represented as a graph. Examples include the WWW, social networks, biological networks, communication networks, and many others. The importance of being able to effectively mine and learn from such data is growing, as more and more structured and semi-structured data is becoming available.
ACM SIGKDD Explorations Newsletter, 2003
The need for mining structured data has increased in the past few years. Graphs are among the best-studied data structures in computer science and discrete mathematics. It is therefore no surprise that graph-based data mining has become quite popular in the last few years. This article introduces the theoretical basis of graph-based data mining and surveys the state of the art. Brief descriptions of some representative approaches are provided as well.
Correlation mining is recognized as one of the most important data mining tasks for its capability to identify underlying dependencies between objects. Nowadays, data mining techniques are increasingly applied to non-traditional domains where existing approaches to obtaining knowledge from large volumes of data cannot be used, as they are not capable of modeling the requirements of those domains. In particular, graph-modeling-based data mining techniques are advantageous in modeling various complex real-life scenarios. However, existing graph-based data mining techniques cannot efficiently capture actual correlations and behave like a search algorithm driven by a user-provided query. Eventually, correlation measures are used to extract useful knowledge from a large number of spurious patterns. Hence, we focus on correlation mining in graph databases, and this paper proposes a new graph correlation measure, gConfidence, to efficiently extract useful graph patterns, along with a method, CGM (Correlated Graph Mining), to find the underlying correlations among graphs in graph databases using the proposed measure. Finally, extensive performance analysis of our scheme showed a twofold improvement in speed and efficiency in mining correlations compared to existing algorithms.
Lecture Notes in Computer Science, 2011
Graph classification has become an increasingly important research topic in recent years due to its wide applications. However, the problem of how to classify graphs based on their implicit properties has not yet been studied. To address it, this paper first conducts an extensive study of existing graph-theoretical metrics and also proposes various novel metrics to discover implicit graph properties. We then apply feature selection techniques to discover a subset of discriminative metrics by considering domain knowledge. Two classifiers are proposed to classify the graphs based on the subset of features. The feasibility of graph classification based on the proposed graph metrics and techniques has been experimentally studied.
IEEE Intelligent Systems, 2000
The large amount of data collected today is quickly overwhelming researchers' abilities to interpret the data and discover interesting patterns in it. In response to this problem, researchers have developed techniques and systems for discovering concepts in databases [1-3]. Much of the collected data, however, has an explicit or implicit structural component (spatial or temporal), which few discovery systems are designed to handle [4]. So, in addition to the need to accelerate data mining of large databases, there is an urgent need to develop scalable tools for discovering concepts in structural databases. One method for discovering knowledge in structural data is the identification of common substructures within the data. Substructure discovery is the process of identifying concepts describing interesting and repetitive substructures within structural data. The discovered substructure concepts allow abstraction from the detailed data structure and provide relevant attributes for interpreting the data. The substructure discovery method is the basis of Subdue, which performs data mining on databases represented as graphs. The system performs two key data-mining techniques: unsupervised pattern discovery and supervised concept learning from examples. Our test applications have demonstrated the scalability and effectiveness of these techniques on a variety of structural databases.
IEEE Transactions on Knowledge and Data Engineering, 2013
In this article, we propose to mine the graph topology of a large attributed graph by finding regularities among vertex descriptors. Such descriptors are of two types: (1) the vertex attributes that correspond to the information conveyed by the vertices themselves and (2) some topological properties, used to describe the connectivity of each vertex in the graph. Such topological properties and attributes are mostly of numerical or ordinal types, and their similarity can be captured by quantifying their co-variation, that is, whether their largest or smallest values are supported mostly by the same set of vertices. A topological pattern is thus defined as a set of vertex attributes and topological properties that strongly co-vary over the vertices of the graph. This pattern mining task relies on frequent pattern mining and graph topology analysis to reveal the links that exist between the relation encoded by the graph and the vertex attributes. For instance, a topological pattern in a co-authorship graph, where vertices represent authors, edges encode co-authorship, and vertex attributes reveal the number of publications in several journals, could be "the higher the number of publications in IEEE TKDE, the higher the closeness centrality of the vertex within the graph". Such a pattern discloses the fact that the number of times an author publishes in IEEE TKDE is positively correlated with her having co-authored papers with other central authors, inducing a rather short distance to other graph vertices. We propose several interestingness measures of topological patterns that differ w.r.t. the pairs of vertices considered while evaluating up and down co-variations between properties and attributes: (1) considering all the pairs of vertices makes it possible to find patterns that are true all over the graph; (2) taking into account only the vertex pairs that are in a specific order w.r.t. a selected attribute reveals the topological patterns that emerge with respect to this attribute; (3) examining the vertex pairs that are connected in the graph makes it possible to identify patterns that are structurally correlated to the relationship encoded by the graph. An efficient algorithm that combines searching and pruning strategies in the identification of the most relevant topological patterns is presented. Besides a classical empirical study, we report case studies on four real-life networks showing that our approach provides valuable knowledge in a feasible time.
Distributed and Parallel Databases, 2015
A graph is an extremely versatile data structure in terms of its expressiveness and flexibility to model a range of real-life phenomena. Various networks such as social networks, sensor networks and computer networks are represented and stored in the form of graphs. The analysis of these kinds of graphs has long been of immense importance. It is performed from various perspectives to get the most out of such a multifaceted information repository. When the analysis is targeted towards finding groups of vertices based on their similarity in a graph, clustering is the most conspicuous option. Previous graph clustering approaches focus either on topological structure or on attribute likeness; however, a few recent methods consider both aspects simultaneously. Due to the enormous computation requirements of similarity estimation, these methods often suffer from scalability issues. In order to overcome this limitation, we introduce the Collaborative Similarity Measure (CSM) for Intra-Graph Clustering (IGC). CSM is based on a shortest-path strategy, instead of all paths, to define structural and semantic relevance among vertices. First, we calculate the pairwise similarity among vertices using CSM. Second, vertices are grouped together based on the calculated similarity under the k-Medoid framework. Empirical analysis, based on density, entropy and F-measure, demonstrates the efficacy of CSM over existing measures. Moreover, CSM becomes a potential candidate for medium-scale graph analysis because it requires an order of magnitude fewer computations.
Techniques and Applications
A graph is a mathematical framework that allows us to represent and manage many kinds of real-world data, such as relational data, multimedia data and biomedical data. When each data point is represented as a graph and we are given a number of graphs, one task is to extract a few common patterns that capture the properties of each population. Frequent graph mining algorithms such as AGM, gSpan and Gaston can enumerate all the frequent patterns in graph data; however, the number of patterns grows exponentially, so it is essential to output only discriminative patterns. There is much existing research on this topic, but this chapter focuses on the use of matrix decomposition techniques and explains the two general cases where either i) no target label is available, or ii) a target label is available for each data point. The resulting method is a branch-and-bound pattern mining algorithm with an efficient pruning condition, and we evaluate its effectiveness on cheminformatics data.
Mathematical Problems in Engineering, 2014
Due to the rapid development of Internet technology and new scientific advances, the number of applications that model data as graphs is increasing, because graphs have high expressive power to model complicated structures. Graph mining is a well-explored area of research which is gaining popularity in the data mining community. A graph is a general model to represent data and has been used in many domains such as cheminformatics, web information management systems, computer networks, and bioinformatics, to name a few. In graph mining, frequent subgraph discovery is a challenging task. Frequent subgraph mining is concerned with the discovery of those subgraphs which have frequent or multiple instances within a given graph dataset. In the literature, a large number of frequent subgraph mining algorithms have been proposed; these include FSG, AGM, gSpan, CloseGraph, SPIN, Gaston, and Mofa. The objective of this research work is to perform a quantitative comparison of the above-listed techniques. The performance of these techniques has been evaluated through a number of experiments based on three different state-of-the-art graph datasets. This work will provide a basis for anyone who is working to design a new frequent subgraph discovery technique.
Data Engineering, 2009. ICDE'09. IEEE …, 2009
Graphs are being increasingly used to model a wide range of scientific data. Such widespread usage of graphs has generated considerable interest in mining patterns from graph databases. While an array of techniques exists to mine frequent patterns, we still lack a scalable approach to mine statistically significant patterns, specifically patterns with low p-values, that occur at low frequencies. We propose a highly scalable technique, called GraphSig, to mine significant subgraphs from large graph databases. We convert each graph into a set of feature vectors where each vector represents a region within the graph. Domain knowledge is used to select a meaningful feature set. Prior probabilities of features are computed empirically to evaluate statistical significance of patterns in the feature space. Following analysis in the feature space, only a small portion of the exponential search space is accessed for further analysis. This enables the use of existing frequent subgraph mining techniques to mine significant patterns in a scalable manner even when they are infrequent. Extensive experiments are carried out on the proposed techniques, and empirical results demonstrate that GraphSig is effective and efficient for mining significant patterns. To further demonstrate the power of significant patterns, we develop a classifier using patterns mined by GraphSig. Experimental results show that the proposed classifier achieves superior performance, both in terms of quality and computation cost, over state-of-the-art classifiers.
2008
Graph clustering methods such as spectral clustering are defined for general weighted graphs. In machine learning, however, data often is not given in form of a graph, but in terms of similarity (or distance) values between points. In this case, first a neighborhood graph is constructed using the similarities between the points and then a graph clustering algorithm is applied to this graph. In this paper we investigate the influence of the construction of the similarity graph on the clustering results. We first study the convergence of graph clustering criteria such as the normalized cut (Ncut) as the sample size tends to infinity. We find that the limit expressions are different for different types of graph, for example the r-neighborhood graph or the k-nearest neighbor graph. In plain words: Ncut on a kNN graph does something systematically different than Ncut on an r-neighborhood graph! This finding shows that graph clustering criteria cannot be studied independently of the kind of graph they are applied to. We also provide examples which show that these differences can be observed for toy and real data already for rather small sample sizes.
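To make the two neighborhood-graph constructions concrete, here is a minimal sketch of both on a toy point set. It only builds the graphs and does not compute Ncut or reproduce the paper's limit analysis. The function names, the symmetrization of the kNN graph by union, and the sample points are assumptions chosen for illustration.

```python
from math import dist


def knn_graph(points, k):
    """Undirected kNN graph: connect i and j if either is among the
    other's k nearest neighbors (symmetrization by union)."""
    edges = set()
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


def r_graph(points, r):
    """r-neighborhood graph: connect i and j if their distance is at most r."""
    n = len(points)
    return {
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if dist(points[i], points[j]) <= r
    }


# Three nearby points on a line and one distant outlier.
points = [(0.0,), (1.0,), (2.5,), (10.0,)]

print(sorted(knn_graph(points, k=1)))
print(sorted(r_graph(points, r=2.0)))
```

On this toy set the outlier is attached to the kNN graph, since every point always has k neighbors, but it is isolated in the r-neighborhood graph. Differences of this kind are one intuitive reason a clustering criterion may behave differently depending on the construction.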
Machine Learning and Knowledge Discovery in Databases, 2021
When searching for interesting structures in graphs, it is often important to take into account not only the graph connectivity, but also the metadata available, such as node and edge labels, or temporal information. In this paper we are interested in settings where such metadata is used to define a similarity between edges. We consider the problem of finding subgraphs that are dense and whose edges are similar to each other with respect to a given similarity function. Depending on the application, this function can be, for example, the Jaccard similarity between the edge label sets, or the temporal correlation of the edge occurrences in a temporal graph. We formulate a Lagrangian relaxation-based optimization problem to search for dense subgraphs with high pairwise edge similarity. We design a novel algorithm to solve the problem through parametric mincut [15, 17], and provide an efficient search scheme to iterate through the values of the Lagrangian multipliers. Our study is complemented by an evaluation on real-world datasets, which demonstrates the usefulness and efficiency of the proposed approach.
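The two quantities being traded off, subgraph density and pairwise edge similarity, can be evaluated for a candidate node set with a short sketch. This is not the Lagrangian or parametric min-cut algorithm from the paper. It assumes density is measured as induced edges per node and that edge similarity is the Jaccard similarity of edge label sets. The graph, labels and function names are made up for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two label sets (defined as 1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def induced_edges(edge_labels, nodes):
    """Edges of the graph with both endpoints inside `nodes`."""
    return [e for e in edge_labels if e[0] in nodes and e[1] in nodes]


def density(edge_labels, nodes):
    """Induced edges per node (assumed density definition)."""
    return len(induced_edges(edge_labels, nodes)) / len(nodes)


def mean_edge_similarity(edge_labels, nodes):
    """Average Jaccard similarity over all pairs of induced edges."""
    pairs = list(combinations(induced_edges(edge_labels, nodes), 2))
    if not pairs:
        return 0.0
    return sum(jaccard(edge_labels[e], edge_labels[f]) for e, f in pairs) / len(pairs)


# Hypothetical labeled graph: each edge maps to its set of labels.
edge_labels = {
    ("a", "b"): {"x", "y"},
    ("b", "c"): {"x", "y"},
    ("a", "c"): {"x"},
    ("c", "d"): {"z"},
}

candidate = {"a", "b", "c"}
print(density(edge_labels, candidate))
print(mean_edge_similarity(edge_labels, candidate))
```

Adding node "d" to the candidate would lower both scores here, because the extra edge carries an unrelated label. A search procedure has to weigh density against similarity in cases where the two disagree.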