2000, IEEE Intelligent Systems
The large amount of data collected today is quickly overwhelming researchers' abilities to interpret the data and discover interesting patterns in it. In response to this problem, researchers have developed techniques and systems for discovering concepts in databases [1-3]. Much of the collected data, however, has an explicit or implicit structural component (spatial or temporal), which few discovery systems are designed to handle [4]. So, in addition to the need to accelerate data mining of large databases, there is an urgent need to develop scalable tools for discovering concepts in structural databases. One method for discovering knowledge in structural data is the identification of common substructures within the data. Substructure discovery is the process of identifying concepts describing interesting and repetitive substructures within structural data. The discovered substructure concepts allow abstraction from the detailed data structure and provide relevant attributes for interpreting the data. The substructure discovery method is the basis of Subdue, which performs data mining on databases represented as graphs. The system performs two key data-mining techniques: unsupervised pattern discovery and supervised concept learning from examples. Our test applications have demonstrated the scalability and effectiveness of these techniques on a variety of structural databases.
International Conference on Artificial Intelligence, 2000
With the increasing amount of structural data being collected, there arises a need to efficiently mine information from this type of data. The goal of this research is to provide a system that performs data mining on structural data represented as a labeled graph. We demonstrate how the graph-based discovery system Subdue can be used to perform
Advanced Information and Knowledge Processing
We describe an approach to learning patterns in relational data represented as a graph. The approach, implemented in the Subdue system, searches for patterns that maximally compress the input graph. Subdue can be used for supervised learning, as well as unsupervised pattern discovery and clustering. Mining graph-based data raises challenges not found in linear attribute-value data. However, additional requirements can further complicate the problem. In particular, we describe how Subdue can incrementally process structured data that arrives as streaming data. We also employ these techniques to learn structural concepts from examples embedded in a single large connected graph.
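The idea of scoring a pattern by how much it compresses the input graph can be illustrated with a small sketch. It assumes a simplified size measure (vertices plus edges) instead of a full description-length encoding, and it assumes the pattern's instances do not overlap; the function names and the scoring formula are illustrative choices, not taken from the Subdue implementation.

```python
def graph_size(num_vertices, num_edges):
    # Simplified size measure (an assumption): vertices plus edges.
    return num_vertices + num_edges


def compression_value(g_vertices, g_edges, p_vertices, p_edges, instances):
    """Ratio of the original graph size to the size after compression.

    Each non-overlapping instance of the pattern is collapsed into a single
    placeholder vertex; edges internal to an instance disappear, while edges
    that connect an instance to the rest of the graph are kept.
    """
    compressed_vertices = g_vertices - instances * p_vertices + instances
    compressed_edges = g_edges - instances * p_edges
    original = graph_size(g_vertices, g_edges)
    # The pattern definition itself must be stored once.
    compressed = graph_size(p_vertices, p_edges) + graph_size(
        compressed_vertices, compressed_edges
    )
    return original / compressed


# A 12-vertex, 15-edge graph containing three disjoint triangles.
print(compression_value(12, 15, 3, 3, 3))
```

A value above 1 means that storing the pattern once and replacing its instances makes the overall description smaller; a pattern with no instances scores below 1 because its definition is pure overhead.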
ACM SIGKDD Explorations Newsletter, 2003
The need for mining structured data has increased in the past few years. Graphs are among the best-studied data structures in computer science and discrete mathematics. It is therefore no surprise that graph-based data mining has become quite popular in the last few years. This article introduces the theoretical basis of graph-based data mining and surveys the state of the art. Brief descriptions of some representative approaches are provided as well.
International Journal on Artificial Intelligence Tools, 2005
Much of current data mining research is focused on discovering sets of attributes that discriminate data entities into classes, such as shopping trends for a particular demographic group. In contrast, we are working to develop data mining techniques to discover patterns consisting of complex relationships between entities. Our research is particularly applicable to domains in which the data is event-driven or relationally structured. In this paper we present approaches to address two related challenges: the need to assimilate incremental data updates and the need to mine monolithic datasets. Many realistic problems are continuous in nature and therefore require a data mining approach that can evolve discovered knowledge over time. Similarly, many problems present data sets that are too large to fit into dynamic memory on conventional computer systems. We address incremental data mining by introducing a mechanism for summarizing discoveries from previous data increments so that the g...
2009
In the early years of data mining and knowledge discovery in databases, method development focused on rigidly and plainly structured data. Most often efforts were even confined to data that can be represented as a simple table, which describes a set of sample cases by attribute-value pairs. Recent years, however, have seen a constantly growing interest in the analysis of more complex data, with a less rigid and/or more sophisticated structure.
Graphs are becoming increasingly important in modeling complicated structures, such as circuits, images, chemical compounds, protein structures, biological networks, social networks, the web, workflows, and XML documents. Many graph search algorithms have been developed in chemical informatics, computer vision, video indexing, and text retrieval. With the increasing demand for the analysis of large amounts of structured data, graph mining has become an active and important theme in data mining.
Journal of Parallel and Distributed Computing, 2001
The large amount of data collected today is quickly overwhelming researchers' abilities to interpret the data and discover interesting patterns. Knowledge discovery and data mining systems contain the potential to automate the interpretation process, but these approaches frequently utilize computationally expensive algorithms. In particular, scientific discovery systems focus on the utilization of richer data representation, sometimes without regard for scalability. This research investigates approaches for scaling a particular knowledge discovery data mining system, Subdue, using parallel and distributed resources. Subdue has been used to discover interesting and repetitive concepts in graph-based databases from a variety of domains, but requires a substantial amount of processing time. Experiments that demonstrate scalability of parallel versions of the Subdue system are performed using CAD circuit databases, satellite images, and artificially-generated databases, and potential achievements and obstacles are discussed.
2008
This paper presents our investigation into graph mining methods to help users understand large graphs. Our approach is a two-step process: first calculate subgraph labels, and then calculate distribution statistics on these labels. Our approach is flexible in that it can identify a range of patterns from very abstract to very specific (e.g., isomorphisms). The statistics that we calculate can be used to find rare and common patterns, patterns that are (dis)similar to the distribution of induced subgraphs of the same size, patterns that are (dis)similar to each other, as well as variance of graph patterns given a specific set of input node types. We also investigate a method to understand structural characteristics by analyzing clusters that are created by "collapsing" overlapping instances of user-specified patterns. We evaluated our approach on two publicly available networks: the Texas CS website from WebKB and the Internet Movie Database.
Journal of Intelligent Information Systems, 1995
Discovering repetitive substructure in a structural database improves the ability to interpret and compress the data. This paper describes the Subdue system that uses domain-independent and domain-dependent heuristics to find interesting and repetitive structures in structural data. This substructure discovery technique can be used to discover fuzzy concepts, compress the data description, and formulate hierarchical substructure definitions. Examples from the domains of scene analysis, chemical compound analysis, computer-aided design, and program analysis demonstrate the benefits of the discovery technique.
Data Mining and Knowledge Discovery, 2008
Graph mining is gaining importance due to the numerous applications that rely on graph-based data. Some example applications are: (i) analysis of microarray data in bioinformatics, (ii) pattern discovery in social networks, (iii) analysis of transportation networks, (iv) community discovery in Web data. Existing pattern discovery approaches operate by using simple constraints on the mined patterns. For example, given a database of graphs, a typical graph mining task is to report all subgraphs that appear in at least s graphs, where s is the frequency support threshold. In other cases, we are interested in the discovery of dense or highly-connected subgraphs. In such a case, a threshold is defined for the density or the connectivity of the returned patterns. Other constraints may be defined as well, towards restricting the number of mined patterns. There are three important limitations with this approach: (i) there is an on-off decision regarding the eligibility of patterns, i.e., a pattern either satisfies the constraints or not, (ii) in the case where the constraints are very strict we risk an empty answer or an answer with only a few patterns, and (iii) in the case where the constraints are too weak the number of patterns may be huge.
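The support-threshold constraint described above can be sketched in a few lines. To stay short, this example restricts patterns to single labeled edges, which avoids a subgraph isomorphism test; the input format (one list of `(label, edge_label, label)` triples per graph) and the function name are assumptions made for illustration, not part of any cited system.

```python
from collections import Counter


def frequent_edge_patterns(graph_db, s):
    """Return single-edge patterns that occur in at least s graphs.

    graph_db is a list of graphs; each graph is a list of undirected edges
    written as (vertex_label, edge_label, vertex_label) triples.
    """
    support = Counter()
    for edges in graph_db:
        # Canonicalize endpoint order and use a set so that a pattern is
        # counted at most once per graph, however often it repeats there.
        patterns = {(min(a, b), label, max(a, b)) for a, label, b in edges}
        support.update(patterns)
    return {pattern: count for pattern, count in support.items() if count >= s}


db = [
    [("C", "single", "O"), ("C", "single", "H")],
    [("O", "single", "C"), ("C", "double", "O")],
    [("C", "single", "H"), ("H", "single", "C")],
]
print(frequent_edge_patterns(db, 2))  # patterns present in at least 2 graphs
print(frequent_edge_patterns(db, 4))  # a threshold that is too strict
```

The second call returns an empty result, which is the on-off behavior the abstract lists as a limitation: a pattern either meets the threshold or is dropped entirely.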
A graph is a basic data structure that can be used to model complex structures and the relationships between them, such as XML documents, social networks, communication networks, chemical informatics, biology networks, and the structure of web pages. Frequent subgraph pattern mining is one of the most important fields in graph mining. In light of its many applications, there is extensive research in this area, such as analysis and processing of XML documents, document clustering and classification, image and video indexing, graph indexing for graph querying, routing in computer networks, web link analysis, drug design, and carcinogenesis. Several frequent pattern mining algorithms have been proposed in recent years, and new ones continue to be introduced. Because these algorithms use various methods on different datasets, pattern mining types, and graph and tree representations, it is not easy to compare them in terms of features and performance. This paper presents a brief report of an intensive investigation of current frequent subgraph and subtree mining algorithms. The algorithms were also categorized based on different features.
Proceedings of the 2004 ACM SIGMOD international conference on Management of data, 2004
Graphs have become increasingly important in modelling complicated structures and schemaless data such as proteins, chemical compounds, and XML documents. Given a graph query, it is desirable to retrieve graphs quickly from a large database via graph-based indices. In this paper, we investigate the issues of indexing graphs and propose a novel solution by applying a graph mining technique. Different from the existing path-based methods, our approach, called gIndex, makes use of frequent substructures as the basic indexing feature. Frequent substructures are ideal candidates since they explore the intrinsic characteristics of the data and are relatively stable to database updates. To reduce the size of the index structure, two techniques, size-increasing support constraint and discriminative fragments, are introduced. Our performance study shows that gIndex has 10 times smaller index size, but achieves 3-10 times better performance in comparison with a typical path-based method, GraphGrep. The gIndex approach not only provides an elegant solution to the graph indexing problem, but also demonstrates how database indexing and query processing can benefit from data mining, especially frequent pattern mining. Furthermore, the concepts developed here can be applied to indexing sequences, trees, and other complicated structures as well.
2009
In graph-based data mining (GBDM) tasks, an accurate data representation is fundamental for finding hidden patterns. However, there is no standard representation for describing structural data, because each domain has its own characteristics. As a result, different graph topologies could be used as the data representation, which is a challenge for GBDM tools. In this paper we explore a methodology for discovering hidden patterns in domains where a graph-based representation is used. Our methodology is divided into three phases: first, we propose a formal graph notation used to symbolize our graphs; second, we perform the data mining phase using the SI-COBRA and the SUBDUE tools; finally, we show how the outputs of these tools can be interpreted. We performed a set of experiments in two different domains where our methodology was applied: the web log and the SAT domains. With these examples we show how it is possible to symbolize our graphs with our notation, and also perform the GBDM task with the selected tools.
International Journal of Computational Intelligence Systems, 2021
Graph mining is a well-established research field that has lately attracted considerable attention from research communities. It makes it possible to process, analyze, and discover significant knowledge from graph data. In graph mining, one of the most challenging tasks is frequent subgraph mining (FSM). FSM consists of applying data mining algorithms to extract interesting, unexpected, and useful graph patterns from the graphs. FSM has been applied to many domains, such as graphical data management and knowledge discovery, social network analysis, bioinformatics, and security. In this context, a large number of techniques have been suggested to deal with graph data. These techniques can be classified into two primary categories: (i) Apriori-based FSM approaches and (ii) pattern growth-based FSM approaches. Extensive research is available in both categories. However, FSM approaches face some challenges, including enormous numbers of frequent subgraph patterns (FSPs); no suitab...
2016
Graph theory is becoming progressively more important as it is applied to other fields of mathematics, science, and technology. It is being actively used in areas as varied as biochemistry, electrical engineering, computer science, and operations research. The main application of graph theory in data mining is graph mining. The need for mining structured data has increased in the past few years. Graphs are one of the best-studied data structures in computer science and discrete mathematics. Graph mining captures the relational aspect of data. The main aim of graph mining is to provide new principles and effective algorithms to mine topological substructures embedded in graph data. This article provides a brief review of four theoretically based approaches to graph-based data mining. A brief description of applications of graph mining is also provided.
Machine Learning - ML, 2003
Basket Analysis, which is a standard method for data mining, derives frequent itemsets from a database. However, its mining ability is limited to transaction data consisting of items. In reality, there are many applications where data are described in a more structural way, e.g. chemical compounds and Web browsing history. There are a few approaches in the field of machine learning that can discover characteristic patterns from graph-structured data. However, almost none of them is suitable for applications that require a complete search for all frequent subgraph patterns in the data. In this paper, we propose a novel principle and its algorithm that derive the characteristic patterns which frequently appear in graph-structured data. Our algorithm can derive all frequent induced subgraphs from both directed and undirected graph structured data having loops (including self-loops) with labeled or unlabeled nodes and links. Its performance is evaluated through the applications t...
Cook/Mining Graph Data, 2006
The success of machine learning and data mining for business and scientific purposes has fueled the expansion of its scope to new representations and techniques. Much collected data is structural in nature, containing entities as well as relationships between these entities. Compelling data in bioinformatics [32], network intrusion detection [15], web analysis [2, 8], and social network analysis [7, 27] has become available that requires effective handling of structural data. The ability to learn
1. This work is partially supported by the National Science Foundation grants IIS-0505819 and IIS-0097517.
Data Engineering, 2009. ICDE'09. IEEE …, 2009
Graphs are being increasingly used to model a wide range of scientific data. Such widespread usage of graphs has generated considerable interest in mining patterns from graph databases. While an array of techniques exists to mine frequent patterns, we still lack a scalable approach to mine statistically significant patterns, specifically patterns with low p-values, that occur at low frequencies. We propose a highly scalable technique, called GraphSig, to mine significant subgraphs from large graph databases. We convert each graph into a set of feature vectors where each vector represents a region within the graph. Domain knowledge is used to select a meaningful feature set. Prior probabilities of features are computed empirically to evaluate statistical significance of patterns in the feature space. Following analysis in the feature space, only a small portion of the exponential search space is accessed for further analysis. This enables the use of existing frequent subgraph mining techniques to mine significant patterns in a scalable manner even when they are infrequent. Extensive experiments are carried out on the proposed techniques, and empirical results demonstrate that GraphSig is effective and efficient for mining significant patterns. To further demonstrate the power of significant patterns, we develop a classifier using patterns mined by GraphSig. Experimental results show that the proposed classifier achieves superior performance, both in terms of quality and computation cost, over state-of-the-art classifiers.
Proceedings of the 3rd workshop on Ph.D. students in information and knowledge management - PIKM '10, 2010
Frequent subgraph mining is an important problem in data mining with wide application in science. For instance, graphs can be used to represent structural relationships in problems related to network topology, chemical compounds, protein structures, and so on. Searching for patterns in graph databases is difficult since graph-related operations generally have higher time complexity than equivalent operations on frequent itemsets. From a practical standpoint, databases keep growing, creating both the opportunity and the need to mine graphs. Even though there is a significant body of work on graph mining, most techniques work outside the database system. Programming frequent graph mining in SQL is more difficult than traditional approaches because the graph must be represented as a table and algorithmic steps must be written as relational queries. In our research, we study three fundamental problems under a database approach: graph storage and indexing, frequent subgraph search, and identifying subgraph isomorphism. We outline the main research issues and our solutions to them. We also present preliminary experimental validation focusing on query optimizations and time complexity.
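To make the phrase "the graph must be represented as a table and algorithmic steps must be written as relational queries" concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `edge` table layout, its column names, and the `frequent_label_pairs` function are hypothetical; the query counts support only for single directed edges identified by their endpoint labels, and is not the authors' schema or method.

```python
import sqlite3


def frequent_label_pairs(edges, min_support):
    """Find (src_label, dst_label) edge patterns present in enough graphs.

    edges is an iterable of (graph_id, src_label, dst_label) rows, one row
    per directed edge. Support is the number of distinct graphs containing
    the pattern, computed entirely inside the database.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE edge (graph_id INTEGER, src_label TEXT, dst_label TEXT)"
    )
    conn.executemany("INSERT INTO edge VALUES (?, ?, ?)", edges)
    rows = conn.execute(
        """
        SELECT src_label, dst_label, COUNT(DISTINCT graph_id) AS support
        FROM edge
        GROUP BY src_label, dst_label
        HAVING COUNT(DISTINCT graph_id) >= ?
        ORDER BY src_label, dst_label
        """,
        (min_support,),
    ).fetchall()
    conn.close()
    return rows


edge_rows = [
    (1, "A", "B"), (1, "B", "C"),
    (2, "A", "B"), (2, "A", "B"),
    (3, "B", "C"), (3, "A", "C"),
]
print(frequent_label_pairs(edge_rows, 2))
```

Larger patterns would need self-joins of the edge table on shared vertex identifiers, which is where the query optimization concerns mentioned in the abstract come into play.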