With the rapid growth of data resources on the Internet, XML has become one of the most important standards for document management. Not only does it improve document exchange and storage, but it is also helpful in a variety of information retrieval tasks. Document clustering is one of the most interesting research areas that exploit XML's semi-structured nature. In this paper, we put forward a new XML clustering algorithm that relies solely on document structure. We propose the use of maximal frequent subtrees and an operator called Satisfy/Violate to divide documents into groups. The algorithm is experimentally evaluated on real and synthetic data sets with promising results.
Lecture Notes in Computer Science, 2009
This paper presents an experimental study conducted over the INEX 2008 Document Mining Challenge corpus using both the structure and the content of XML documents for clustering them. The concise common substructures known as the closed frequent subtrees are generated using the structural information of the XML documents. The closed frequent subtrees are then used to extract the constrained content from the documents. A matrix containing the term distribution of the documents in the dataset is developed using the extracted constrained content. The k-way clustering algorithm is applied to the matrix to obtain the required clusters. In spite of the large number of documents in the INEX 2008 Wikipedia dataset, the proposed frequent subtree-based clustering approach was successful in clustering the documents. This approach significantly reduces the dimensionality of the terms used for clustering without much loss in accuracy.
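The step of building a term matrix from constrained content can be illustrated with a minimal sketch. It does not mine closed frequent subtrees; instead it assumes a ready-made set of allowed tag paths (the hypothetical `allowed_paths` argument) as a stand-in for the structure that the mining step would produce. The function names are illustrative and not taken from the paper.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def path_texts(xml_text):
    """Yield (tag path, text) for every element that carries non-empty text."""
    root = ET.fromstring(xml_text)
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        text = (node.text or "").strip()
        if text:
            yield path, text
        for child in node:
            stack.append((child, path + "/" + child.tag))

def constrained_term_matrix(docs, allowed_paths):
    """Count terms only under allowed paths (assumed stand-in for frequent subtrees)."""
    counts = []
    for doc in docs:
        terms = Counter()
        for path, text in path_texts(doc):
            if path in allowed_paths:
                terms.update(text.lower().split())
        counts.append(terms)
    # One column per term seen in the constrained content, in sorted order.
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix
```

Terms that occur only outside the allowed paths never enter the vocabulary, which is how a restriction of this kind can reduce dimensionality before a clustering algorithm is applied to the matrix.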
Information Systems, 2006
The processing and management of XML data are popular research issues. However, operations based on the structure of XML data have not received strong attention. These operations involve, among others, the grouping of structurally similar XML documents. Such grouping results from the application of clustering methods with distances that estimate the similarity between tree structures. This paper presents a framework for clustering XML documents by structure. Modeling the XML documents as rooted ordered labeled trees, we study the usage of structural distance metrics in hierarchical clustering algorithms to detect groups of structurally similar XML documents. We suggest the usage of structural summaries for trees to improve the performance of the distance calculation and at the same time to maintain or even improve its quality. Our approach is tested using a prototype testbed.
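As a rough illustration of clustering by structural distance, the sketch below swaps in a much simpler measure than the tree distances studied in the paper: the Jaccard distance between sets of root-to-node tag paths. It then applies a plain single-link agglomerative procedure. The function names and the `threshold` parameter are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

def tag_paths(xml_text):
    """Return the set of root-to-node tag paths of an XML document."""
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths

def jaccard_distance(a, b):
    """Jaccard distance between two path sets (assumed stand-in for a tree distance)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def single_link_clusters(docs, threshold):
    """Merge the two closest clusters until their distance exceeds the threshold."""
    path_sets = [tag_paths(doc) for doc in docs]
    clusters = [[i] for i in range(len(docs))]

    def link(c1, c2):
        # Single link: distance between the closest pair of members.
        return min(jaccard_distance(path_sets[i], path_sets[j])
                   for i in c1 for j in c2)

    while len(clusters) > 1:
        dist, a, b = min(
            ((link(clusters[a], clusters[b]), a, b)
             for a in range(len(clusters))
             for b in range(a + 1, len(clusters))),
            key=lambda item: item[0],
        )
        if dist > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(cluster) for cluster in clusters]
```

Reducing each document to a set of distinct paths discards repetition and sibling order, loosely in the spirit of a structural summary, at the cost of ignoring some of the structure that a tree edit distance would capture.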
Lecture Notes in Computer Science, 2007
This paper reports the results and experiments performed on the INEX 2006 Document Mining Challenge Corpus with the PCXSS clustering method. The PCXSS method is a progressive clustering method that computes the similarity between a new XML document and existing clusters by considering the structures within documents. We conducted the clustering task on the INEX and Wikipedia data sets.
2007
XML has become a popular method of data representation both on the web and in databases in recent years. One of the reasons for the popularity of XML has been its ability to encode structural information about data records. However, this structural characteristic also makes such data sets challenging for a variety of data mining problems. One such problem is clustering, in which the structural aspects of the data result in a high implicit dimensionality of the data representation. As a result, it becomes more difficult to cluster the data in a meaningful way. In this paper, we propose an effective clustering algorithm for XML data which uses substructures of the documents in order to gain insights about the important underlying structures. We propose new ways of using multiple sub-structural information in XML documents to evaluate the quality of intermediate cluster solutions, and guide the algorithms to a final solution which reflects the true structural behavior in individual partitions. We test the algorithm on a variety of real and synthetic data sets.
2013
With the growing number of XML documents on the Web it becomes essential to effectively organize these XML documents in order to retrieve useful information from them. A possible solution is to apply clustering on the XML documents to discover knowledge that promotes effective data management, information retrieval and query processing. This paper presents a framework for clustering XML documents by structure. Modelling the XML documents as rooted ordered labeled trees, we study the usage of structural distance metrics in hierarchical clustering algorithms to detect groups of structurally similar XML documents. We suggest the usage of structural summaries for trees to improve the performance of the distance calculation and at the same time to maintain or even improve its quality.
Advances in Knowledge Discovery and Data Mining, 2006
This paper presents the incremental clustering algorithm XML documents Clustering with Level Similarity (XCLS), which groups XML documents according to structural similarity. A level structure format is introduced to represent the structure of XML documents for efficient processing. A global criterion function that measures the similarity between a new document and existing clusters is developed. It avoids the need to compute the pair-wise similarity between individual documents and hence saves a large amount of computing effort. XCLS is further modified to incorporate the semantic meanings of XML tags in order to investigate the trade-offs between accuracy and efficiency. The empirical analysis shows that structural similarity outweighs semantic similarity in the clustering of structured data such as XML. The experimental analysis shows that the XCLS method is fast and accurate in clustering heterogeneous documents by structure.
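The idea of comparing a new document with whole clusters instead of with individual documents can be sketched as follows. This is a simplified illustration, not the XCLS criterion function: the level structure is taken to be the set of distinct tags at each depth, and the similarity is an assumed per-level Jaccard overlap weighted by a hypothetical `decay` factor so that shallow levels count more. All function names are made up for this example.

```python
import xml.etree.ElementTree as ET

def level_structure(xml_text):
    """Return a list whose i-th entry is the set of distinct tags at depth i."""
    root = ET.fromstring(xml_text)
    levels = []
    frontier = [root]
    while frontier:
        levels.append({node.tag for node in frontier})
        frontier = [child for node in frontier for child in node]
    return levels

def level_similarity(a, b, decay=0.5):
    """Assumed measure: per-level Jaccard overlap, weighted by decay ** depth."""
    score = total = 0.0
    for i in range(max(len(a), len(b))):
        tags_a = a[i] if i < len(a) else set()
        tags_b = b[i] if i < len(b) else set()
        union = tags_a | tags_b
        if union:
            weight = decay ** i
            score += weight * len(tags_a & tags_b) / len(union)
            total += weight
    return score / total if total else 0.0

def merge_levels(a, b):
    """Union two level structures depth by depth to represent a cluster."""
    return [(a[i] if i < len(a) else set()) | (b[i] if i < len(b) else set())
            for i in range(max(len(a), len(b)))]

def incremental_cluster(docs, threshold=0.5):
    """Assign each document to its most similar cluster, or open a new one."""
    clusters = []  # each cluster: {"levels": [...], "members": [...]}
    for index, doc in enumerate(docs):
        levels = level_structure(doc)
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = level_similarity(levels, cluster["levels"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["levels"] = merge_levels(best["levels"], levels)
            best["members"].append(index)
        else:
            clusters.append({"levels": levels, "members": [index]})
    return [cluster["members"] for cluster in clusters]
```

Each document is compared once with every existing cluster representative, so the number of comparisons grows with the number of clusters and not with the number of document pairs.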
Knowledge and Information Systems, 2008
This paper presents the incremental clustering algorithm XCLS, which groups XML documents according to structural similarity. A level structure format is introduced to represent the structure of XML documents for efficient processing. A global criterion function that measures the similarity between a new document and existing clusters is developed. It avoids the need to compute the pair-wise similarity between individual documents and hence saves a large amount of computing effort. XCLS is further modified to incorporate the semantic meanings of XML tags in order to investigate the trade-offs between accuracy and efficiency. The empirical analysis shows that structural similarity outweighs semantic similarity in the clustering of structured data such as XML. The experimental analysis shows that the XCLS method is fast and accurate in clustering heterogeneous documents by structure.
Data & Knowledge Engineering, 2013
Clustering XML documents by structure is the task of grouping them by common structural components. Hitherto, this has been accomplished by looking at the occurrence of one pre-established type of structural component in the structures of the XML documents. However, structural components chosen a priori may not be the most appropriate for effective clustering. Moreover, the resulting clusters are likely to exhibit some internal structural inhomogeneity, because differences in the structures of the XML documents go uncaught when other forms of structural components are neglected.
Knowledge and Information Systems, 2015
Now that the use of XML is prevalent, methods for mining semi-structured documents have become even more important. In particular, one of the areas that could greatly benefit from in-depth analysis of XML's semi-structured nature is cluster analysis. Most of the XML clustering approaches developed so far employ pairwise similarity measures. In this paper, we study clustering algorithms that use patterns to cluster documents without the need for pairwise comparisons. We investigate the shortcomings of existing approaches and establish a new pattern-based clustering framework called XPattern, which tries to address these shortcomings. The proposed framework consists of four steps: choosing a pattern definition, pattern mining, pattern clustering, and document assignment. The framework's distinguishing feature is the combination of pattern clustering and document-cluster assignment, which allows documents to be grouped according to their characteristic features rather than their direct similarity. We experimentally evaluate the proposed approach by implementing an algorithm called PathXP, which mines maximal frequent paths and groups them into profiles. PathXP was found to match other XML clustering approaches in terms of accuracy, while requiring less parametrization and providing easily interpretable cluster representatives. Additionally, the results of an in-depth experimental study lead to general suggestions concerning pattern-based XML clustering.
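The pattern mining step can be illustrated with a small sketch of maximal frequent path mining. The definitions used here are assumptions made for the example and may differ from those in PathXP: a path is a root-anchored sequence of tags, its support is the number of documents that contain it, and a frequent path is maximal if no other frequent path extends it. The pattern clustering and document assignment steps are not shown.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def root_paths(xml_text):
    """Return the set of root-anchored tag sequences found in a document."""
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, (root.tag,))]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + (child.tag,)))
    return paths

def maximal_frequent_paths(docs, min_support):
    """Return frequent paths that are not a proper prefix of another frequent path."""
    support = Counter()
    for doc in docs:
        # A set per document, so each document counts at most once per path.
        support.update(root_paths(doc))
    frequent = {path for path, count in support.items() if count >= min_support}
    maximal = {
        path for path in frequent
        if not any(len(other) > len(path) and other[:len(path)] == path
                   for other in frequent)
    }
    return sorted(maximal)
```

Because every prefix of a frequent path is itself frequent under these definitions, keeping only the maximal paths gives a compact set of patterns without losing information about which paths are frequent.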
The Knowledge Engineering Review, 2014
With its presence in data integration, chemistry, biological, and geographic systems, eXtensible Markup Language (XML) has become an important standard not only in computer science. A common problem among the mentioned applications involves structural clustering of XML documents—an issue that has been thoroughly studied and led to the creation of a myriad of approaches. In this paper, we present a comprehensive review of structural XML clustering. First, we provide a basic introduction to the problem and highlight the main challenges in this research area. Subsequently, we divide the problem into three subtasks and discuss the most common document representations, structural similarity measures, and clustering algorithms. In addition, we present the most popular evaluation measures, which can be used to estimate clustering quality. Finally, we analyze and compare 23 state-of-the-art approaches and arrange them in an original taxonomy. By providing an up-to-date analysis of existing ...
International Journal of Pattern Recognition and Artificial Intelligence, 2007
International Journal of Computer and Electrical Engineering, 2010
Successes and New Directions in Data Mining, 2008
Malaysian Journal of Computer Science, 2023
International Journal on Artificial Intelligence Tools, 2008
Computación y Sistemas, 2015
19th IEEE International Conference on Tools with Artificial Intelligence(ICTAI 2007), 2007
Lecture Notes in Computer Science, 2010
Machine Learning, 2006
ACM Symposium on Applied Computing, 2008
Lecture Notes in Computer Science, 2011
Lecture Notes in Computer Science, 2006