1994
Because many databases contain or can be embellished with structural information, a method for identifying interesting and repetitive substructures is an essential component to discovering knowledge in such databases. This paper describes the SUBDUE system, which uses the minimum description length (MDL) principle to discover substructures that compress the database and represent structural concepts in the data. By replacing previously-discovered substructures in the data, multiple passes of SUBDUE produce a hierarchical description of the structural regularities in the data. Inclusion of background knowledge guides SUBDUE toward appropriate substructures for a particular domain or discovery goal, and the use of an inexact graph match allows a controlled amount of deviations in the instance of a substructure concept. We describe the application of SUBDUE to a variety of domains. We also discuss approaches to combining SUBDUE with non-structural discovery systems.
Knowledge Discovery and Data Mining, 1994
Because many databases contain or can be embellished with structural information, a method for identifying interesting and repetitive substructures is an essential component to discovering knowledge in such databases. This paper describes the SUBDUE system, which uses the minimum description length (MDL) principle to discover substructures that compress the database and represent structural concepts in the data. By replacing previously-discovered substructures in the data, multiple passes of SUBDUE produce a hierarchical description of the structural regularities in the data. Inclusion of background knowledge guides SUBDUE toward appropriate substructures for a particular domain or discovery goal, and the use of an inexact graph match allows a controlled amount of deviations in the instance of a substructure concept. We describe the application of SUBDUE to a variety of domains. We also discuss approaches to combining SUBDUE with non-structural discovery systems.
Journal of Artificial Intelligence Research, 1994
The ability to identify interesting and repetitive substructures is an essential component to discovering knowledge in structural data. We describe a new version of our SUBDUE substructure discovery system based on the minimum description length principle. The SUBDUE system discovers substructures that compress the original data and represent structural concepts in the data. By replacing previously-discovered substructures in the data, multiple passes of SUBDUE produce a hierarchical description of the structural regularities in the data. SUBDUE uses a computationally-bounded inexact graph match that identifies similar, but not identical, instances of a substructure and finds an approximate measure of closeness of two substructures when under computational constraints. In addition to the minimum description length principle, other background knowledge can be used by SUBDUE to guide the search towards more appropriate substructures. Experiments in a variety of domains demonstrate SUB...
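As a rough illustration of the compression idea in the abstract above, the sketch below scores a candidate substructure by how much it shortens a description of the graph. The bit-count formula, the function names, and the assumption that instances do not overlap are simplifications introduced here for illustration; they are not SUBDUE's actual graph encoding.

```python
import math


def description_length(n_vertices, n_edges, n_vertex_labels, n_edge_labels):
    """Rough bit count for a labeled graph (illustrative proxy only)."""
    # Each vertex stores one label; each edge stores two endpoints and one label.
    vertex_bits = n_vertices * math.log2(max(n_vertex_labels, 2))
    endpoint_bits = 2 * math.log2(max(n_vertices, 2))
    edge_bits = n_edges * (endpoint_bits + math.log2(max(n_edge_labels, 2)))
    return vertex_bits + edge_bits


def compression_value(graph, sub, n_instances, n_vertex_labels, n_edge_labels):
    """Ratio DL(G) / (DL(S) + DL(G|S)); values above 1 mean S compresses G.

    `graph` and `sub` are (vertex_count, edge_count) pairs. Instances of the
    substructure are assumed not to overlap (a simplifying assumption).
    """
    g_v, g_e = graph
    s_v, s_e = sub
    dl_graph = description_length(g_v, g_e, n_vertex_labels, n_edge_labels)
    dl_sub = description_length(s_v, s_e, n_vertex_labels, n_edge_labels)
    # Replace every instance by a single vertex carrying one new label.
    c_v = g_v - n_instances * s_v + n_instances
    c_e = g_e - n_instances * s_e
    dl_compressed = description_length(c_v, c_e, n_vertex_labels + 1, n_edge_labels)
    return dl_graph / (dl_sub + dl_compressed)


# A 4-vertex, 4-edge pattern occurring 10 times in a 100-vertex graph.
frequent = compression_value((100, 150), (4, 4), 10, 4, 2)
# The same pattern with no instances only adds the cost of describing it.
absent = compression_value((100, 150), (4, 4), 0, 4, 2)
```

Under this proxy, a substructure that occurs often yields a value above 1, while one with no instances yields a value below 1, which matches the intuition that only repetitive substructures pay for their own description.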
Journal of Intelligent Information Systems, 1995
Discovering repetitive substructure in a structural database improves the ability to interpret and compress the data. This paper describes the Subdue system that uses domain-independent and domain-dependent heuristics to find interesting and repetitive structures in structural data. This substructure discovery technique can be used to discover fuzzy concepts, compress the data description, and formulate hierarchical substructure definitions. Examples from the domains of scene analysis, chemical compound analysis, computer-aided design, and program analysis demonstrate the benefits of the discovery technique.
IEEE Expert / IEEE Intelligent Systems, 1996
Discovering repetitive and functional substructures in large structural databases improves the ability to interpret and compress the data. However, scientists working with a database in their area of expertise often search for predetermined types of structures, or for structures exhibiting characteristics specific to the domain. This paper presents a method for guiding the discovery process with domain-specific knowledge. In this paper, the Subdue discovery system is used to evaluate the benefits of using domain knowledge to guide the discovery process. Results show that domain-specific knowledge improves the search for substructures which are useful to the domain, and leads to greater compression of the data. Empirical and theoretical results also indicate the scalability of the algorithm to increasingly large structural databases.
IEEE Transactions on Knowledge and Data Engineering, 1999
Discovering repetitive, interesting, and functional substructures in a structural database improves the ability to interpret and compress the data. However, scientists working with a database in their area of expertise often search for predetermined types of structures or for structures exhibiting characteristics specific to the domain. This paper presents a method for guiding the discovery process with domain-specific knowledge. In this paper, the SUBDUE discovery system is used to evaluate the benefits of using domain knowledge to guide the discovery process. Domain knowledge is incorporated into SUBDUE following a single general methodology to guide the discovery process. Results show that domain-specific knowledge improves the search for substructures that are useful to the domain and leads to greater compression of the data. To illustrate these benefits, examples and experiments from the computer programming, computer-aided design circuit, and artificially generated domains are presented.
2000
Abstract. Recently, there have been several proposals of formalisms for modeling semistructured data, which is data that is neither raw, nor strictly typed as in conventional database systems. Semistructured data models are graph-based models, where graphs are used to represent both databases and schemas.
Proceedings of the 1998 ACM SIGMOD international conference on Management of data - SIGMOD '98, 1998
Semistructured data is characterized by the lack of any fixed and rigid schema, although typically the data has some implicit structure.
1992
Knowledge discovery in databases, or data mining, is an important issue in the development of data- and knowledge-base systems. An attribute-oriented induction method has been developed for knowledge discovery in databases. The method integrates a machine learning paradigm, especially learning-from-examples techniques, with set-oriented database operations and extracts generalized data from actual data in databases. An attribute-oriented concept tree ascension technique is applied in generalization, which substantially reduces the computational complexity of database learning processes. Different kinds of knowledge rules, including characteristic rules, discrimination rules, quantitative rules, and data evolution regularities can be discovered efficiently using the attribute-oriented approach. In addition to learning in relational databases, the approach can be applied to knowledge discovery in nested relational and deductive databases. Learning can also be performed with databases containing noisy data and exceptional cases using database statistics. Furthermore, the rules discovered can be used to query database knowledge, answer cooperative queries and facilitate semantic query optimization. Based upon these principles, a prototyped database learning system, DBLEARN, has been constructed for experimentation.
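The concept tree ascension step mentioned in this abstract can be pictured with a small sketch. Everything here is a hypothetical illustration: the function name `generalize`, the `parent_of` mapping, and the stopping threshold `max_distinct` are assumptions made for this example, and the code is not taken from DBLEARN. It climbs every value of one attribute one level at a time until few enough distinct values remain, then merges identical tuples and counts them.

```python
from collections import Counter


def generalize(rows, attribute, parent_of, max_distinct):
    """Climb a concept tree for one attribute, then merge identical tuples."""
    rows = [dict(r) for r in rows]  # work on copies, leave the input untouched
    while len({r[attribute] for r in rows}) > max_distinct:
        climbed = False
        for r in rows:
            parent = parent_of.get(r[attribute])
            if parent is not None:
                r[attribute] = parent  # replace the value by its parent concept
                climbed = True
        if not climbed:
            break  # every value is already at the root of the tree
    # Identical generalized tuples collapse into one entry with a count.
    return Counter(tuple(sorted(r.items())) for r in rows)


# Hypothetical concept tree for a "major" attribute.
parent_of = {
    "physics": "science", "chemistry": "science",
    "history": "arts", "music": "arts",
    "science": "any", "arts": "any",
}
rows = [{"major": m} for m in ("physics", "chemistry", "history", "music")]
summary = generalize(rows, "major", parent_of, max_distinct=2)
```

Climbing all values in lockstep is a simplification chosen to keep the sketch short; a fuller treatment would handle values sitting at different depths of the tree.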
International Conference on Artificial Intelligence, 2000
With the increasing amount of structural data being collected, there arises a need to efficiently mine information from this type of data. The goal of this research is to provide a system that performs data mining on structural data represented as a labeled graph. We demonstrate how the graph-based discovery system Subdue can be used to perform
1997
We develop a new schema for unstructured data. Traditional schemas resemble the type systems of programming languages. For unstructured data, however, the underlying type may be much less constrained and hence an alternative way of expressing constraints on the data is needed. Here, we propose that both data and schema be represented as edge-labeled graphs. We develop notions of conformance between a graph database and a graph schema and show that there is a natural and efficiently computable ordering on graph schemas. We then examine certain subclasses of schemas and show that schemas are closed under query applications. Finally, we discuss how they may be used in query decomposition and optimization.
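One way to make the notion of conformance concrete is to treat it as a simulation between two edge-labeled graphs: every labeled edge leaving a data node must be matched by an equally labeled edge leaving the corresponding schema node. The sketch below assumes that reading and assumes plain string labels on schema edges; the function name `conforms` and the triple-based edge encoding are choices made for this illustration rather than the paper's definitions.

```python
def conforms(data_edges, schema_edges, data_root, schema_root):
    """Check whether the schema simulates the data, starting from the roots.

    Edges are (source, label, target) triples.
    """
    data_nodes = {data_root} | {n for (u, _, v) in data_edges for n in (u, v)}
    schema_nodes = {schema_root} | {n for (u, _, v) in schema_edges for n in (u, v)}
    # Start from all pairs and delete pairs that violate the simulation condition.
    relation = {(d, s) for d in data_nodes for s in schema_nodes}
    changed = True
    while changed:
        changed = False
        for (d, s) in list(relation):
            for (u, label, v) in data_edges:
                if u != d:
                    continue
                matched = any(
                    su == s and sl == label and (v, sv) in relation
                    for (su, sl, sv) in schema_edges
                )
                if not matched:
                    relation.discard((d, s))
                    changed = True
                    break
    return (data_root, schema_root) in relation


schema = [("R", "person", "P"), ("P", "name", "N")]
good_data = [("root", "person", "p1"), ("p1", "name", "n1")]
bad_data = good_data + [("p1", "salary", "x")]

ok = conforms(good_data, schema, "root", "R")
not_ok = conforms(bad_data, schema, "root", "R")
```

The loop computes the largest relation satisfying the condition by repeated deletion, so its running time is polynomial in the sizes of the two graphs, in line with the abstract's emphasis on efficient computability.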
Knowledge Discovery and Data Mining, 1995
Discovering repetitive, interesting, and functional substructures in a structural database improves the ability to interpret and compress the data. However, scientists working with a database in their area of expertise often search for predetermined types of structures, or for structures exhibiting characteristics specific to the domain. This paper presents a method for guiding the discovery process with domain-specific knowledge. In this paper, the SUBDUE discovery system is used to evaluate the benefits of using domain knowledge to guide the discovery process. The domain knowledge is incorporated into SUBDUE following a single general methodology to guide the discovery process. Results show that domain-specific knowledge improves the search for substructures which are useful to the domain, and leads to greater compression of the data. To illustrate these benefits, examples and experiments from the computer programming, computer aided design circuit, and artificially-generated domains are presented.
IEEE Transactions on Knowledge and Data Engineering, 1997
Discovering repetitive, interesting, and functional substructures in a structural database improves the ability to interpret and compress the data. However, scientists working with a database in their area of expertise often search for predetermined types of structures or for structures exhibiting characteristics specific to the domain. This paper presents a method for guiding the discovery process with domain-specific knowledge. In this paper, the SUBDUE discovery system is used to evaluate the benefits of using domain knowledge to guide the discovery process. Domain knowledge is incorporated into SUBDUE following a single general methodology to guide the discovery process. Results show that domain-specific knowledge improves the search for substructures that are useful to the domain and leads to greater compression of the data. To illustrate these benefits, examples and experiments from the computer programming, computer-aided design circuit, and artificially generated domains are presented.
2008
The increasing complexity of enterprise databases and the prevalent lack of documentation incur significant cost in both understanding and integrating the databases. Existing solutions addressed mining for keys and foreign keys, but paid little attention to more high-level structures of databases. In this paper, we consider the problem of discovering topical structures of databases to support semantic browsing and large-scale data integration. We describe iDisc, a novel discovery system based on a multi-strategy learning framework. iDisc exploits varied evidence in database schema and instance values to construct multiple kinds of database representations. It employs a set of base clusterers to discover preliminary topical clusters of tables from database representations, and then aggregate them into final clusters via meta-clustering. To further improve the accuracy, we extend iDisc with novel multiple-level aggregation and clusterer boosting techniques. We introduce a new measure on table importance and propose an approach to discovering cluster representatives to facilitate semantic browsing. An important feature of our framework is that it is highly extensible, where additional database representations and base clusterers may be easily incorporated into the framework. We have extensively evaluated iDisc using large real-world databases and results show that it discovers topical structures with a high degree of accuracy.
IEEE Intelligent Systems, 2000
The large amount of data collected today is quickly overwhelming researchers' abilities to interpret the data and discover interesting patterns in it. In response to this problem, researchers have developed techniques and systems for discovering concepts in databases. 1-3 Much of the collected data, however, has an explicit or implicit structural component (spatial or temporal), which few discovery systems are designed to handle. 4 So, in addition to the need to accelerate data mining of large databases, there is an urgent need to develop scalable tools for discovering concepts in structural databases. One method for discovering knowledge in structural data is the identification of common substructures within the data. Substructure discovery is the process of identifying concepts describing interesting and repetitive substructures within structural data. The discovered substructure concepts allow abstraction from the detailed data structure and provide relevant attributes for interpreting the data. The substructure discovery method is the basis of Subdue, which performs data mining on databases represented as graphs. The system performs two key data-mining techniques: unsupervised pattern discovery and supervised concept learning from examples. Our test applications have demonstrated the scalability and effectiveness of these techniques on a variety of structural databases.
Annals of Mathematics and Artificial Intelligence, 2007
Relational datasets, i.e., datasets in which individuals are described both by their own features and by their relations to other individuals, arise from various sources such as databases, both relational and object-oriented, knowledge bases, or software models, e.g., UML class diagrams. When processing such complex datasets, it is of prime importance for an analysis tool to hold as much as possible to the initial format so that the semantics is preserved and the interpretation of the final results eased. Therefore, several attempts have been made to introduce relations into the formal concept analysis field which otherwise generated a large number of knowledge discovery methods and tools. However, the proposed approaches invariably look at relations as an intra-concept construct, typically relating two parts of the concept description, and therefore can only lead to the discovery of coarse-grained patterns. As an approach towards the discovery of finer-grain relational concepts, we propose to enhance the classical (object × attribute) data representations with a new dimension that is made out of inter-object links (e.g., spouse, friend, manager-of, etc.). Consequently, the discovered concepts are linked by relations which, like associations in conceptual data models such as the entity-relation diagrams, abstract from existing links between concept instances. The borders for the application of the relational mining task are provided by what we call a relational context family, a set of binary data tables representing individuals of various sorts (e.g., human beings, companies, vehicles, etc.) related by additional binary relations. As we impose no restrictions on the relations in the dataset, a major challenge is the processing of relational loops among data items. We present a method for constructing concepts on top of circular descriptions which is based on an iterative approximation of the final solution. 
The underlying construction methods are illustrated through their application to the restructuring of class hierarchies in object-oriented software engineering, which are described in UML.
… in Medicine and …, 2001
Artificial Intelligence, 1992
This paper presents several investigations into the prospects for identifying meaningful structures in empirical data, namely, structures permitting effective organization of the data to meet requirements of future queries. We propose a general framework whereby the notion of identifiability is given a precise formal definition similar to that of learnability. Using this framework, we then explore if a tractable procedure exists for deciding whether a given relation is decomposable into a constraint network or a CNF theory with desirable topology and, if the answer is positive, identifying the desired decomposition.
1997
When learning from very large databases, the reduction of complexity is of highest importance. Two extremes of making knowledge discovery in databases (KDD) feasible have been put forward. One extreme is to choose a most simple hypothesis language and so to be capable of very fast learning on real-world databases. The opposite extreme is to select a small data set and be capable of learning very expressive (first-order logic) hypotheses. A multistrategy approach allows combining most of the advantages and excluding most of the disadvantages. Simpler learning algorithms detect hierarchies that are used in order to structure the hypothesis space for a more complex learning algorithm. The better structured the hypothesis space is, the better learning can prune away uninteresting or losing hypotheses and the faster it becomes. We have combined inductive logic programming (ILP) directly with a relational database. The ILP algorithm is controlled in a model-driven way by the user and in a data-driven way by structures that are induced by three simple learning algorithms.
… Workshops 2006. Sixth …, 2006
The general idea of discovering knowledge in large amounts of data is both appealing and intuitive. Typically we focus our attention on learning algorithms, which provide the core capability of generalizing from large numbers of small, very specific facts to useful high-level rules; these learning techniques seem to hold the most excitement and perhaps the most substantive scientific content in the knowledge discovery in databases (KDD) enterprise. However, when we engage in real-world discovery tasks, we find that they can be extremely complex, and that induction of rules is only one small part of the overall process. While others have written overviews of "the concept of KDD," and even provided block diagrams for "knowledge discovery systems," no one has begun to identify all of the building blocks in a realistic KDD process. This is what we attempt to do here. Besides bringing into the discussion several parts of the process that have received inadequate attenti...