A general framework for XML Document Clustering

Giuseppe Manco

2003

Abstract

A novel methodology for clustering XML documents is discussed. The underlying idea is to group documents that exhibit structural similarities. To this end, a suitable technique for identifying meaningful matchings among the nodes of two XML document trees is investigated. The proposed technique also makes it possible to associate with each set of related documents a single prototype XML document, i.e., a representative subsuming the most relevant features of the documents in the set. Suitable techniques for both building and refining cluster-specific representatives are analyzed. Some initial experimental results show the effectiveness of our approach.
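The abstract does not spell out the matching technique or how representatives are built, so the following is only an illustrative sketch under simplifying assumptions, not the paper's method. It approximates structural similarity by comparing the sets of root-to-node tag paths of two documents (Jaccard similarity), and it approximates a cluster representative as the set of paths found in at least a given fraction of the cluster's documents. The function names `tag_paths`, `similarity`, and `representative`, and the `min_support` parameter, are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_paths(xml_text):
    """Return the set of root-to-node tag paths of an XML document."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


def similarity(paths_a, paths_b):
    """Jaccard similarity of two path sets (1.0 means identical path sets)."""
    union = paths_a | paths_b
    return len(paths_a & paths_b) / len(union) if union else 1.0


def representative(path_sets, min_support=0.5):
    """Assumed stand-in for a prototype: paths that occur in at least
    min_support of the documents in the cluster."""
    counts = Counter(p for paths in path_sets for p in paths)
    threshold = min_support * len(path_sets)
    return {p for p, c in counts.items() if c >= threshold}


if __name__ == "__main__":
    docs = [
        "<book><title/><author/></book>",
        "<book><title/><author/><year/></book>",
        "<book><title/></book>",
    ]
    sets = [tag_paths(d) for d in docs]
    print(similarity(sets[0], sets[1]))
    print(sorted(representative(sets)))
```

Path sets ignore sibling order and repetition, so this is a coarser notion of structure than a node-level tree matching; it is meant only to make the idea of "structural similarity plus a per-cluster prototype" concrete.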

Vijay Sonawane

Information Systems, 2006

The processing and management of XML data are popular research issues. However, operations based on the structure of XML data have received little attention. These operations include, among others, the grouping of structurally similar XML documents. Such grouping results from applying clustering methods with distances that estimate the similarity between tree structures. This paper presents a framework for clustering XML documents by structure. Modeling the XML documents as rooted ordered labeled trees, we study the use of structural distance metrics in hierarchical clustering algorithms to detect groups of structurally similar XML documents. We suggest using structural summaries of trees to speed up the distance calculation while maintaining or even improving its quality. Our approach is tested using a prototype testbed.
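The pipeline described here (ordered labeled trees, a structural distance, hierarchical clustering, and structural summaries) can be sketched as follows. This is a minimal illustration under assumptions, not the framework from the paper: the distance is a simple top-down edit distance that relabels nodes and inserts or deletes whole subtrees, the summary only collapses adjacent identical sibling subtrees, and the clustering is plain single linkage with a distance cutoff. All names (`size`, `summarize`, `tree_distance`, `single_linkage`, `max_distance`) are hypothetical.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.


def size(tree):
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree[1])


def summarize(tree):
    """Simplified structural summary: collapse adjacent identical siblings."""
    label, children = tree
    kept = []
    for child in map(summarize, children):
        if not kept or kept[-1] != child:
            kept.append(child)
    return (label, tuple(kept))


@lru_cache(maxsize=None)
def tree_distance(a, b):
    """Top-down edit distance: relabel a node (cost 1) or insert/delete a
    whole subtree (cost = its size), preserving sibling order."""
    (label_a, kids_a), (label_b, kids_b) = a, b
    root_cost = 0 if label_a == label_b else 1
    # Sequence edit distance over the two ordered lists of child subtrees.
    prev = [0]
    for y in kids_b:
        prev.append(prev[-1] + size(y))
    for x in kids_a:
        cur = [prev[0] + size(x)]
        for j, y in enumerate(kids_b, 1):
            cur.append(min(prev[j] + size(x),                  # delete x
                           cur[j - 1] + size(y),               # insert y
                           prev[j - 1] + tree_distance(x, y)))  # match x, y
        prev = cur
    return root_cost + prev[-1]


def single_linkage(trees, max_distance):
    """Merge the two closest clusters until the closest pair is farther
    apart than max_distance. Returns lists of indices into trees."""
    clusters = [[i] for i in range(len(trees))]

    def link(c1, c2):
        return min(tree_distance(trees[i], trees[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        d, i, j = min((link(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > max_distance:
            break
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters


if __name__ == "__main__":
    t1 = ("book", (("title", ()), ("author", ()), ("author", ())))
    t2 = ("book", (("title", ()), ("author", ())))
    t3 = ("movie", (("cast", (("actor", ()),)),))
    print(single_linkage([t1, t2, t3], max_distance=2))
```

Running the distance on summarized trees (`summarize(t)`) instead of the raw trees shrinks the inputs, which is the intuition behind using summaries to reduce the cost of the distance calculation.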

Related papers