Models, Methods, and Applications
XML similarity detection plays an important role in facilitating many applications such as data integration, document classification/clustering, querying, and change management. In this chapter, we present an overview of syntactic and semantic similarity/distance measures for XML documents, along with existing research related to XML similarity detection. The measures are classified into two main categories: structural similarity, and structural and content similarity. We review similarity detection approaches proposed in the literature and discuss some of the challenges and future directions for research on XML similarity detection and related fields.
Lecture Notes in Computer Science, 2010
XML has become a standard for data representation and exchange over the Internet. Due to the widespread use of XML, XML similarity detection plays an important role in facilitating many applications such as data integration, document classification/clustering, XML querying, and change management. In this paper we present a discussion of syntactic and semantic similarity measures for XML documents, along with existing research related to XML similarity detection. XML similarity measures can broadly be classified into two main categories: (1) structural similarity and (2) structural and content similarity. We review similarity detection approaches proposed in the literature and discuss some of the challenges and future directions for research on XML similarity detection and measurements.
Computer Science Review, 2009
In recent years, XML has been established as a major means for information management, and has been broadly utilized for complex data representation (e.g. multimedia objects). Owing to the unparalleled growth in use of the XML standard, developing efficient techniques for comparing XML-based documents has become essential in the database and information retrieval communities. In this paper, we provide an overview of XML similarity/comparison by presenting existing research related to XML similarity. We also detail the possible applications of XML comparison processes in various fields, including data warehousing, data integration, classification/clustering and XML querying, and discuss some required and emerging future research directions.
Lecture Notes in Computer Science, 2007
In the past few years, XML has been established as an effective means for information management, and has been widely exploited for complex data representation. Owing to the unparalleled growth in use of the XML standard, developing efficient techniques for comparing XML-based documents has become essential in information retrieval (IR) research. Various algorithms for comparing hierarchically structured data, e.g. XML documents, have been proposed in the literature. However, to our knowledge, most of them focus exclusively on comparing documents based on structural features, overlooking the semantics involved. In this paper, we integrate IR semantic similarity assessment into an edit distance algorithm, seeking to improve similarity judgments when comparing XML-based documents. Our approach comprises an original edit distance operation cost model that introduces the semantic relatedness of XML element/attribute labels into traditional edit distance computations. A prototype has been developed to evaluate our model's performance. Experiments yielded notable results.
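The idea of folding label relatedness into edit costs can be illustrated with a minimal sketch. It is not the authors' cost model: it assumes a simplified setting in which each document is flattened to its preorder sequence of tag names and compared with a sequence edit distance rather than a tree edit distance. The `RELATEDNESS` table and its scores are hypothetical stand-ins for a real semantic similarity measure.

```python
import xml.etree.ElementTree as ET

# Hypothetical relatedness scores in [0, 1]. A real system would derive
# them from a knowledge base or corpus statistics; these are made up.
RELATEDNESS = {
    frozenset(("author", "writer")): 0.9,
    frozenset(("book", "novel")): 0.8,
}


def relatedness(label_a, label_b):
    """Return 1.0 for identical labels, a table score otherwise (default 0.0)."""
    if label_a == label_b:
        return 1.0
    return RELATEDNESS.get(frozenset((label_a, label_b)), 0.0)


def tag_sequence(xml_text):
    """Flatten a document to its tag names in document (preorder) order."""
    return [element.tag for element in ET.fromstring(xml_text).iter()]


def semantic_edit_distance(seq_a, seq_b):
    """Sequence edit distance where substitution costs 1 - relatedness."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = float(i)  # delete every label of seq_a
    for j in range(1, cols):
        dist[0][j] = float(j)  # insert every label of seq_b
    for i in range(1, rows):
        for j in range(1, cols):
            substitute = 1.0 - relatedness(seq_a[i - 1], seq_b[j - 1])
            dist[i][j] = min(
                dist[i - 1][j] + 1.0,              # deletion
                dist[i][j - 1] + 1.0,              # insertion
                dist[i - 1][j - 1] + substitute,   # (semantic) substitution
            )
    return dist[-1][-1]


doc_a = "<book><author/><title/></book>"
doc_b = "<novel><writer/><title/></novel>"
distance = semantic_edit_distance(tag_sequence(doc_a), tag_sequence(doc_b))
```

With unit substitution costs the two documents above would be two edits apart; with the hypothetical relatedness scores, the distance drops because `book`/`novel` and `author`/`writer` are treated as near-synonyms.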
2013
Over the last decade, XML has gained growing importance as a major means for information management, and has become indispensable for complex data representation. Due to the unprecedented growth in use of the XML standard, developing efficient techniques for comparing XML-...
Information Systems, 2018
XML documents are widely used to interchange information among heterogeneous systems, ranging from office applications to scientific experiments. Independently of the domain, XML documents may evolve, so identifying and understanding the changes they undergo becomes crucial. Some syntactic diff approaches have been proposed to address this problem. They are mainly designed to compare revisions of XML documents using explicit IDs to match elements. However, elements in different revisions may not share IDs due to tool incompatibility or even divergent or missing schemas. In this paper, we present Phoenix, a similarity-based approach for comparing revisions of XML documents that does not rely on explicit IDs. Phoenix uses dynamic programming and optimization algorithms to compare different features (e.g., element name, content, attributes, and sub-elements) of XML documents and calculate the similarity degree between them. We compared Phoenix with X-Diff and XyDiff, two state-of-the-art XML diff algorithms. XyDiff was the fastest approach but failed in providing precise matching results. X-Diff presented higher efficacy in 30 of the 56 scenarios but was slow. Phoenix executed in a fraction of the running time required by X-Diff and achieved the best results in terms of efficacy in 26 of 56 tested scenarios. In our evaluations, Phoenix was by far the most efficient approach to match elements across revisions of the same XML document.
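As a rough illustration of comparing elements by several features without relying on IDs, the sketch below scores two elements by name, text content, attributes, and sub-element names. It is not the Phoenix algorithm: the equal weights, the Jaccard and `difflib` measures, and the function names are all assumptions, and the dynamic programming and optimization steps mentioned in the abstract are not reproduced.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher

# Hypothetical feature weights (they sum to 1.0); not taken from the paper.
WEIGHTS = {"name": 0.25, "content": 0.25, "attributes": 0.25, "children": 0.25}


def jaccard(set_a, set_b):
    """Jaccard overlap of two sets; two empty sets count as identical."""
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def element_similarity(elem_a, elem_b):
    """Weighted similarity in [0, 1] over four features of two elements."""
    name = 1.0 if elem_a.tag == elem_b.tag else 0.0
    text_a = (elem_a.text or "").strip()
    text_b = (elem_b.text or "").strip()
    content = SequenceMatcher(None, text_a, text_b).ratio()
    attributes = jaccard(set(elem_a.attrib.items()), set(elem_b.attrib.items()))
    children = jaccard({c.tag for c in elem_a}, {c.tag for c in elem_b})
    return (
        WEIGHTS["name"] * name
        + WEIGHTS["content"] * content
        + WEIGHTS["attributes"] * attributes
        + WEIGHTS["children"] * children
    )


old = ET.fromstring('<person id="1"><name>Ann</name></person>')
same = ET.fromstring('<person id="1"><name>Ann</name></person>')
renamed = ET.fromstring('<company id="1"><name>Ann</name></company>')

score_same = element_similarity(old, same)
score_renamed = element_similarity(old, renamed)
```

A revision-matching tool would compute such scores for candidate pairs of elements and then pick a matching that maximizes total similarity; that assignment step is left out here.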
Proc. of 5th SIGMOD …, 2002
In this paper we propose a technique for detecting the similarity in the structure of XML documents. The technique is based on the idea of representing the structure of an XML document as a time series in which each occurrence of a tag corresponds to a given impulse. By analyzing the frequencies of the corresponding Fourier transform, we can hence state the degree of similarity between documents. The efficiency and effectiveness of this approach are compelling when compared with traditional ones.
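The time-series idea can be sketched as follows. The encoding here is an assumption, not the one used in the paper: each tag occurrence becomes an impulse whose height is an integer code assigned per distinct tag name, the shorter series is zero-padded, and documents are compared by the Euclidean distance between the magnitudes of their discrete Fourier transforms. The DFT is computed directly with the standard library to keep the example dependency-free.

```python
import cmath
import math
import xml.etree.ElementTree as ET


def tag_series(xml_text, codes):
    """Encode a document as a series of impulses, one per tag occurrence.

    `codes` maps tag names to integer codes and is shared between documents
    so that the same tag always produces the same impulse height.
    """
    series = []
    for element in ET.fromstring(xml_text).iter():
        code = codes.setdefault(element.tag, len(codes) + 1)
        series.append(float(code))
    return series


def dft_magnitudes(series):
    """Magnitudes of the discrete Fourier transform (naive O(n^2) version)."""
    n = len(series)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(series)))
        for k in range(n)
    ]


def spectral_distance(xml_a, xml_b):
    """Euclidean distance between the magnitude spectra of two documents."""
    codes = {}
    series_a = tag_series(xml_a, codes)
    series_b = tag_series(xml_b, codes)
    n = max(len(series_a), len(series_b))
    series_a += [0.0] * (n - len(series_a))  # zero-pad to a common length
    series_b += [0.0] * (n - len(series_b))
    mags_a, mags_b = dft_magnitudes(series_a), dft_magnitudes(series_b)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mags_a, mags_b)))


doc_a = "<book><author/><title/></book>"
doc_b = "<book><author/><author/><author/></book>"
same_distance = spectral_distance(doc_a, doc_a)
diff_distance = spectral_distance(doc_a, doc_b)
```

One caveat of this simplified comparison: magnitude spectra discard phase, so a distance of zero does not guarantee that two documents have identical structure.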
Proceedings of the Fifth International Conference on Management of Emergent Digital EcoSystems - MEDES '13, 2013
XML has experienced rapid growth, mostly because of its application on the Web. Applications range from version control management and data storage to clustering and information retrieval. In this context, it is necessary to develop efficient techniques for comparing XML documents. Many proposed methods are based only on structural commonalities, ignoring semantics. In this paper, we propose a new method for comparing XML documents, based on LevelEdge, that combines tag structural and semantic similarities.
… on Management of Data, Delhi, India, 2006
Due to the ever-increasing web availability of XML-based data, an efficient approach to compare XML documents becomes crucial in information retrieval. Such comparison of XML documents has applications in version control (finding, scoring and browsing changes between ...
Lecture Notes in Computer Science, 2007
The automatic processing and management of XML-based data are increasingly popular research issues due to the increasingly abundant use of XML, especially on the Web. Nonetheless, several operations based on the structure of XML data have not yet received strong attention. Among these is the process of matching XML documents and XML grammars, which is useful in various applications such as document classification, retrieval and selective dissemination of information. In this paper, we propose an algorithm for measuring the structural similarity between an XML document and a Document Type Definition (DTD), considered the simplest way of specifying structural constraints on XML documents. We consider the various DTD operators that designate constraints on the existence, repeatability and alternativeness of XML elements/attributes. Our approach is based on the concept of tree edit distance as an effective and efficient means for comparing tree structures, with XML documents and DTDs modeled as ordered labeled trees. It is of polynomial complexity, in contrast to existing exponential algorithms. Classification experiments, conducted on large sets of real and synthetic XML documents, underline our approach's effectiveness, as well as its applicability to large XML repositories and databases.