2003
A novel methodology for clustering XML documents is discussed. The underlying idea is to group documents that exhibit structural similarities. To this end, a suitable technique for identifying meaningful matchings among the nodes of two XML document trees is investigated. The proposed technique also makes it possible to associate with each set of related documents a single prototype XML document, i.e. a representative subsuming the most relevant features of the documents in the set. Suitable techniques for both building and refining cluster-specific representatives are analyzed. Some initial experimental results show the effectiveness of our approach.
Information Systems, 2006
The processing and management of XML data are popular research issues. However, operations based on the structure of XML data have not received strong attention. These operations involve, among others, the grouping of structurally similar XML documents. Such grouping results from the application of clustering methods with distances that estimate the similarity between tree structures. This paper presents a framework for clustering XML documents by structure. Modeling the XML documents as rooted ordered labeled trees, we study the usage of structural distance metrics in hierarchical clustering algorithms to detect groups of structurally similar XML documents. We suggest the usage of structural summaries for trees to improve the performance of the distance calculation and at the same time to maintain or even improve its quality. Our approach is tested using a prototype testbed.
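The idea of clustering by structure can be illustrated with a small sketch. It is not the framework from the abstract above: instead of a tree edit distance over rooted ordered labeled trees, it swaps in a Jaccard distance over sets of root-to-node tag paths (used here as a crude stand-in for a structural summary), and it uses a plain single-link agglomerative loop. The function names `path_set`, `jaccard_distance`, and `single_link_clusters`, and the `threshold` parameter, are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from itertools import combinations


def path_set(xml_text):
    """Return the distinct root-to-node tag paths of a document.

    Repeated siblings collapse into one path, so this acts as a
    crude structural summary (an assumption, not the paper's method).
    """
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def single_link_clusters(docs, threshold):
    """Merge the two closest clusters until their distance exceeds threshold."""
    summaries = [path_set(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]

    def dist(c1, c2):
        # Single link: distance between the closest pair of members.
        return min(jaccard_distance(summaries[i], summaries[j])
                   for i in c1 for j in c2)

    while len(clusters) > 1:
        x, y = min(combinations(range(len(clusters)), 2),
                   key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
        if dist(clusters[x], clusters[y]) > threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # x < y, so index x is unaffected
    return sorted(sorted(c) for c in clusters)


if __name__ == "__main__":
    docs = [
        "<book><title/><author/></book>",
        "<book><title/><author/><author/></book>",
        "<order><item><sku/></item></order>",
    ]
    print(single_link_clusters(docs, threshold=0.5))
```

With these three toy documents, the two `book` documents share the same path set and end up in one cluster, while the `order` document stays on its own. A path-set distance ignores sibling order and repetition, which a tree edit distance would take into account.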
2007
The large number and heterogeneity of XML documents on the Web require the development of clustering techniques to group together similar documents. Documents can be grouped together according to their content, their structure, and the links inside and among the documents.
The Knowledge Engineering Review, 2014
With its presence in data integration, chemistry, biological, and geographic systems, eXtensible Markup Language (XML) has become an important standard not only in computer science. A common problem among the mentioned applications involves structural clustering of XML documents—an issue that has been thoroughly studied and led to the creation of a myriad of approaches. In this paper, we present a comprehensive review of structural XML clustering. First, we provide a basic introduction to the problem and highlight the main challenges in this research area. Subsequently, we divide the problem into three subtasks and discuss the most common document representations, structural similarity measures, and clustering algorithms. In addition, we present the most popular evaluation measures, which can be used to estimate clustering quality. Finally, we analyze and compare 23 state-of-the-art approaches and arrange them in an original taxonomy. By providing an up-to-date analysis of existing ...
Knowledge and Information Systems, 2008
This paper presents the incremental clustering algorithm XCLS that groups the XML documents according to structural similarity. A Level structure format is introduced to represent the structure of XML documents for efficient processing. A global criterion function that measures the similarity between the new document and existing clusters is developed. It avoids the need to compute the pair-wise similarity between two individual documents and hence saves a huge amount of computing effort. XCLS is further modified to incorporate the semantic meanings of XML tags for investigating the trade-offs between accuracy and efficiency. The empirical analysis shows that the structural similarity overplays the semantic similarity in the clustering process of the structured data such as XML. The experimental analysis shows that the XCLS method is fast and accurate in clustering the heterogeneous documents by structures.
Advances in Knowledge Discovery and Data Mining, 2006
This paper presents the incremental clustering algorithm, XML documents Clustering with Level Similarity (XCLS), that groups the XML documents according to structural similarity. A level structure format is introduced to represent the structure of XML documents for efficient processing. A global criterion function that measures the similarity between the new document and existing clusters is developed. It avoids the need to compute the pair-wise similarity between two individual documents and hence saves a huge amount of computing effort. XCLS is further modified to incorporate the semantic meanings of XML tags for investigating the trade-offs between accuracy and efficiency. The empirical analysis shows that the structural similarity overplays the semantic similarity in the clustering process of the structured data such as XML. The experimental analysis shows that the XCLS method is fast and accurate in clustering the heterogeneous documents by structures.
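A rough sketch of the incremental idea follows: each document is reduced to one set of tags per depth level, and every new document is compared against cluster-level summaries rather than against individual documents. The similarity used here (shared `(level, tag)` pairs over their union) is a simplified stand-in and not the XCLS criterion function, whose definition the abstract does not give. The names `level_structure`, `level_similarity`, `merge_levels`, `incremental_cluster`, and the `threshold` parameter are hypothetical.

```python
import xml.etree.ElementTree as ET


def level_structure(xml_text):
    """Return a list of tag sets, one per depth level (level 0 is the root)."""
    root = ET.fromstring(xml_text)
    levels = []
    frontier = [root]
    while frontier:
        levels.append({node.tag for node in frontier})
        frontier = [child for node in frontier for child in node]
    return levels


def level_similarity(a, b):
    """Simplified stand-in: Jaccard overlap of (level, tag) pairs."""
    pairs_a = {(i, t) for i, tags in enumerate(a) for t in tags}
    pairs_b = {(i, t) for i, tags in enumerate(b) for t in tags}
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 1.0


def merge_levels(a, b):
    """Union two level structures level by level."""
    depth = max(len(a), len(b))
    return [(a[i] if i < len(a) else set()) | (b[i] if i < len(b) else set())
            for i in range(depth)]


def incremental_cluster(docs, threshold):
    """Assign each document to the most similar cluster, or open a new one."""
    clusters = []  # each entry: {"levels": [...], "members": [...]}
    for idx, text in enumerate(docs):
        doc = level_structure(text)
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = level_similarity(doc, cluster["levels"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["levels"] = merge_levels(best["levels"], doc)
            best["members"].append(idx)
        else:
            clusters.append({"levels": doc, "members": [idx]})
    return [c["members"] for c in clusters]


if __name__ == "__main__":
    docs = [
        "<book><title/><author/></book>",
        "<order><item><sku/></item></order>",
        "<book><title/><author/><year/></book>",
    ]
    print(incremental_cluster(docs, threshold=0.5))
```

Because each document is compared with at most one summary per cluster, the number of comparisons grows with the number of clusters rather than with the number of document pairs. The result of such a single pass can depend on the order in which documents arrive.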
Knowledge and Information Systems, 2015
Now that the use of XML is prevalent, methods for mining semi-structured documents have become even more important. In particular, one of the areas that could greatly benefit from in-depth analysis of XML's semi-structured nature is cluster analysis. Most of the XML clustering approaches developed so far employ pairwise similarity measures. In this paper, we study clustering algorithms that use patterns to cluster documents without the need for pairwise comparisons. We investigate the shortcomings of existing approaches and establish a new pattern-based clustering framework called XPattern, which tries to address these shortcomings. The proposed framework consists of four steps: choosing a pattern definition, pattern mining, pattern clustering, and document assignment. The framework's distinguishing feature is the combination of pattern clustering and document-cluster assignment, which allows documents to be grouped according to their characteristic features rather than their direct similarity. We experimentally evaluate the proposed approach by implementing an algorithm called PathXP, which mines maximal frequent paths and groups them into profiles. PathXP was found to match, in terms of accuracy, other XML clustering approaches, while requiring less parametrization and providing easily interpretable cluster representatives. Additionally, the results of an in-depth experimental study lead to general suggestions concerning pattern-based XML clustering.
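The pattern-mining step can be pictured with a minimal sketch, under assumptions the abstract does not confirm: a "path" is taken to be a root-to-node sequence of tags, its support is the number of documents containing it, and a frequent path is called maximal when it is not a proper prefix of another frequent path. This is not the PathXP implementation, and the pattern clustering and document assignment steps are omitted. The names `root_paths`, `maximal_frequent_paths`, and `min_support` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def root_paths(xml_text):
    """Return the set of root-to-node tag paths, each as a tuple of tags."""
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, (root.tag,))]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + (child.tag,)))
    return paths


def maximal_frequent_paths(docs, min_support):
    """Paths present in at least min_support documents that no longer
    frequent path extends (assumed definition of 'maximal')."""
    support = Counter()
    for text in docs:
        # root_paths returns a set, so each document counts once per path.
        support.update(root_paths(text))
    frequent = {p for p, count in support.items() if count >= min_support}

    def has_frequent_extension(p):
        return any(len(q) > len(p) and q[:len(p)] == p for q in frequent)

    return sorted(p for p in frequent if not has_frequent_extension(p))


if __name__ == "__main__":
    docs = [
        "<book><title/><author><name/></author></book>",
        "<book><title/><author><name/></author><year/></book>",
        "<book><title/></book>",
    ]
    for path in maximal_frequent_paths(docs, min_support=2):
        print("/".join(path))
```

On the toy input, `book/year` is dropped because it occurs in only one document, and `book` and `book/author` are dropped because longer frequent paths extend them.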
2011
In the last few years we have observed a proliferation of approaches for clustering XML documents and schemas based on their structure and content. This large number of approaches is due to the different applications that require the clustering of XML data. These applications need data in the form of similar contents, tags, paths, structures, and semantics.
Data & Knowledge Engineering, 2013
Clustering XML documents by structure is the task of grouping them by common structural components. Hitherto, this has been accomplished by looking at the occurrence of one preestablished type of structural components in the structures of the XML documents. However, the a-priori chosen structural components may not be the most appropriate for effective clustering. Moreover, it is likely the resulting clusters exhibit a certain extent of inner structural inhomogeneity, because of uncaught differences in the structures of the XML documents, due to further neglected forms of structural components.
International Journal of Pattern Recognition and Artificial Intelligence, 2007
Since XML became popular for data representation and exchange over the Web, the distribution of XML documents has rapidly increased. It is therefore a new challenge for the field of data mining to turn these documents into a more useful information utility. We present a novel clustering algorithm, PCXSS, that places heterogeneous XML documents into groups according to their similar structural and semantic representations. We introduce a global criterion function, CPSim, that progressively measures the similarity between an XML document and existing clusters, removing the need to compute the similarity between two individual documents. The experimental analysis shows the method to be fast and accurate.
With the vastly growing data resources on the Internet, XML is one of the most important standards for document management. Not only does it provide enhancements to document exchange and storage, but it is also helpful in a variety of information retrieval tasks. Document clustering is one of the most interesting research areas that utilize XML's semi-structural nature. In this paper, we put forward a new XML clustering algorithm that relies solely on document structure. We propose the use of maximal frequent subtrees and an operator called Satisfy/Violate to divide documents into groups. The algorithm is experimentally evaluated on real and synthetic data sets with promising results.
2007
XML has become a popular method of data representation both on the web and in databases in recent years. One of the reasons for the popularity of XML has been its ability to encode structural information about data records. However, this structural characteristic also makes XML data sets challenging for a variety of data mining problems. One such problem is that of clustering, in which the structural aspects of the data result in a high implicit dimensionality of the data representation. As a result, it becomes more difficult to cluster the data in a meaningful way. In this paper, we propose an effective clustering algorithm for XML data which uses substructures of the documents in order to gain insights about the important underlying structures. We propose new ways of using multiple sub-structural information in XML documents to evaluate the quality of intermediate cluster solutions, and guide the algorithms to a final solution which reflects the true structural behavior in individual partitions. We test the algorithm on a variety of real and synthetic data sets.
Lecture Notes in Computer Science, 2007
This paper reports the results and experiments performed on the INEX 2006 Document Mining Challenge Corpus with the PCXSS clustering method. The PCXSS method is a progressive clustering method that computes the similarity between a new XML document and existing clusters by considering the structures within documents. We conducted the clustering task on the INEX and Wikipedia data sets.
2000
Self-organization or clustering of data objects can be a powerful aid towards knowledge discovery in distributed databases. The web presents opportunities for such clustering of documents and other data objects. This potential will be even more pronounced when XML becomes widely used over the next few years. Based on clustering of XML links, we explore a visualization approach for discovering knowledge on the web. © 2000 Elsevier Science BV. All rights reserved.
Computación y Sistemas, 2015
Every day more digital data in semi-structured format are available on the World Wide Web, corporate intranets, and other media. Knowledge management using information search and processing is essential in the field of academic writing. This task becomes increasingly complex and challenging, mainly because collections of documents are usually heterogeneous, big, diverse, and dynamic. To resolve these challenges it is essential to improve the management of the time needed to process scientific information. In this paper, we propose a new method of automatic clustering of XML documents based on their content and structure, as well as on a new similarity function, OverallSimSUX, which facilitates capturing the degree of similarity among documents. Evaluation of our proposal by means of experiments with data sets showed better results than those in previous work.
Malaysian Journal of Computer Science, 2023
As textually published information is increasing in digital libraries, efficient retrieval methods are required. Textual documents in a digital library are available in various structures and contents. When these documents are organized in an XML structure, they can be represented with hierarchical levels of granularity, which improves precision through focused retrieval. By this means, contextual elements of each document can be retrieved from a known structure. One solution for retrieving these elements is clustering based on a combination of Content and Structural similarities. To achieve this, a novel two-level clustering framework based on Content and Structure is proposed. The framework decomposes a document into meaningful structural units and analyzes all its rich text in its own structure. The quality of the proposed framework was evaluated on a heterogeneous XML document collection with a variety of data sources, structures, and content, assembled to represent a sample of a real digital library. This collection was built so that all of our objectives could be tested. The clustering results were evaluated by the Entropy criterion. Finally, the Content and Structure clustering was compared with the usual clustering based on Content Only, to show the benefit of considering structural features over the existing Content Only methods in the retrieval process. The total Entropy results of the two-level Content and Structural clustering are almost twice as good as those of the Content Only clustering approach. Consequently, the proposed framework can improve Information Retrieval systems from two points of view: i) considering the structural aspect of text-rich documents in the retrieval process, and ii) replacing document-level retrieval with element-level retrieval.
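For readers unfamiliar with entropy as a clustering quality measure, the sketch below computes one common form of it: the class entropy inside each cluster, averaged with weights proportional to cluster size, where lower values mean purer clusters. The abstract does not state which variant or normalization was used (some definitions divide by the logarithm of the number of classes), so this formula and the function name `clustering_entropy` are assumptions.

```python
import math
from collections import Counter


def clustering_entropy(labels_true, labels_pred):
    """Size-weighted average of per-cluster class entropy (base 2).

    0.0 means every cluster contains a single class; larger values
    mean the clusters mix classes more.
    """
    n = len(labels_true)
    clusters = {}
    for truth, cluster_id in zip(labels_true, labels_pred):
        clusters.setdefault(cluster_id, []).append(truth)

    total = 0.0
    for members in clusters.values():
        size = len(members)
        counts = Counter(members)
        # Entropy of the class distribution inside this cluster.
        h = -sum((c / size) * math.log2(c / size) for c in counts.values())
        total += (size / n) * h
    return total


if __name__ == "__main__":
    truth = ["a", "a", "b", "b"]
    print(clustering_entropy(truth, [0, 0, 1, 1]))  # pure clusters
    print(clustering_entropy(truth, [0, 0, 0, 0]))  # one mixed cluster
```

Under this definition, a clustering that separates the two classes perfectly scores 0, and a single cluster holding both classes in equal proportion scores 1 bit.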
2013
With the growing number of XML documents on the Web it becomes essential to effectively organize these XML documents in order to retrieve useful information from them. A possible solution is to apply clustering on the XML documents to discover knowledge that promotes effective data management, information retrieval and query processing. This paper presents a framework for clustering XML documents by structure. Modelling the XML documents as rooted ordered labeled trees, we study the usage of structural distance metrics in hierarchical clustering algorithms to detect groups of structurally similar XML documents. We suggest the usage of structural summaries for trees to improve the performance of the distance calculation and at the same time to maintain or even improve its quality.
As Extensible Markup Language (XML) documents are now widely used on the Web, improving the speed and accuracy of search engines based on these documents is important. Clustering can be an effective way to improve the speed of a search engine. Clustering algorithms for XML documents can be divided into pairwise and incremental algorithms. The main challenge in the class of incremental algorithms such as Level Structure (XCLS), XCLS+ and XCLS++ is that the order of the input XML documents influences the clustering. In this paper, the order sensitivity of incremental XML clustering algorithms is illustrated using a representative algorithm, XCLS+. A solution to this problem is proposed, which includes two interleaved phases: online and semi-offline. Experimental results show that, for a large number of documents, the proposed algorithm has a higher speed and a relatively higher precision than previous incremental algorithms such as XCLS+.
International Journal of Computer and Electrical Engineering, 2010
The concern of this paper is to extract knowledge from XML documents. The motivation is the existence of a large and rapidly growing number of XML documents and the need to summarize this information into more abstract and usable documentation. Data mining is one of the most effective ways of knowledge extraction and has received fresh attention. Many algorithms have been developed for the clustering of XML documents. We have focused on one of the most promising algorithms, called XCLS, and have directed our efforts at improving its clustering quality and performance. An improved adaptation of the algorithm, called XCLS+, is devised. Both algorithms are implemented and evaluated. It is shown that the performance of the new algorithm is enhanced in comparison to the previous one.
Lecture Notes in Computer Science, 2009
This paper presents an experimental study conducted over the INEX 2008 Document Mining Challenge corpus using both the structure and the content of XML documents for clustering them. The concise common substructures known as the closed frequent subtrees are generated using the structural information of the XML documents. The closed frequent subtrees are then used to extract the constrained content from the documents. A matrix containing the term distribution of the documents in the dataset is developed using the extracted constrained content. The k-way clustering algorithm is applied to the matrix to obtain the required clusters. In spite of the large number of documents in the INEX 2008 Wikipedia dataset, the proposed frequent subtree-based clustering approach was successful in clustering the documents. This approach significantly reduces the dimensionality of the terms used for clustering without much loss in accuracy.
Proceedings of the 10th …, 2008