A Quantitative Summary of XML Structures

Lin, Zi; He, Bingsheng; Choi, Byron

2006, Lecture Notes in Computer Science

Abstract

Statistical summaries in relational databases mainly focus on the distribution of data values and have been found useful for various applications, such as query evaluation and data storage. As XML has been widely used, e.g., for online data exchange, the need for corresponding statistical summaries for XML has become evident. While relational techniques may be applicable to the data values in XML documents, novel techniques are required for summarizing the structures of XML documents. In this paper, we propose metrics for major structural properties of XML documents, in particular nestings of entities and one-to-many relationships. Our technique differs from existing ones in that we generate a quantitative summary of an XML structure. Using our approach, we illustrate that some popular real-world and synthetic XML benchmark datasets are indeed highly skewed, hardly hierarchical, and contain few recursions. We hope this preliminary finding sheds light on improving the design of XML benchmarks and experiments.

Key takeaways

Computing our metrics for an XML document relies on constructing the prefix tree of the document.
In our XML benchmark datasets, the distribution of the number of star edges of a particular kind per node (the previous metric) has a large variance.
However, these datasets are insufficient to show the benefits of algorithms for recursive XML datasets.
Although the XMark and DBLP datasets appear popular in the XML research community, we found that the majority of these XML datasets are mild generalizations of relations, not "tree-like".
We derived statistics from a prefix tree of XML structures and used simple paths and star edges as the basis of our metrics.
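
As a rough illustration of the prefix-tree idea in the takeaways above, the following Python sketch collects every root-to-element label path of a document and records, for each parent instance, how many children of each tag it has. It is not the authors' implementation. The function names `summarize_structure` and `star_edges` are hypothetical, and the working definition of a "star edge" used here (a parent-path/child-tag pair in which some parent instance has more than one child of that tag, i.e. a one-to-many relationship) is an assumption, since this page does not define the term.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_structure(xml_text):
    """Return (path_counts, fanouts) for an XML string.

    path_counts: simple path (tuple of tags from the root) -> number of
                 element instances reached by that path. The keys form the
                 prefix tree of the document's label paths.
    fanouts:     (parent_path, child_tag) -> list with one entry per parent
                 instance that has such children, giving the child count.
    """
    root = ET.fromstring(xml_text)
    path_counts = Counter()
    fanouts = {}

    # Recursive walk; very deep documents would need an explicit stack.
    def visit(elem, path):
        path_counts[path] += 1
        per_tag = Counter(child.tag for child in elem)
        for tag, n in per_tag.items():
            fanouts.setdefault((path, tag), []).append(n)
        for child in elem:
            visit(child, path + (child.tag,))

    visit(root, (root.tag,))
    return path_counts, fanouts


def star_edges(fanouts):
    """Assumed definition: edges where some parent has more than one such child."""
    return {edge for edge, counts in fanouts.items() if max(counts) > 1}


if __name__ == "__main__":
    doc = (
        "<db>"
        "<book><author/><author/><title/></book>"
        "<book><author/><title/></book>"
        "</db>"
    )
    paths, fanouts = summarize_structure(doc)
    for path, count in sorted(paths.items()):
        print("/".join(path), count)
    for parent_path, tag in sorted(star_edges(fanouts)):
        print("star edge:", "/".join(parent_path), "->", tag)
```

The per-edge lists in `fanouts` are the raw material for distribution statistics such as the variance mentioned above; how the paper aggregates them into its metrics is not stated on this page.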

Sofia Barahona

Proceedings of the 2008 EDBT Ph.D. workshop, 2008

The importance of performing efficient XML query processing increases along with its usage and pervasiveness. Studying the properties of important fragments of XML query languages and designing accurate structural summaries (including indexes and statistical summaries) are critical ingredients in solving this problem. However, up to this point there has been a gap between the theoretical and engineering efforts taken in the context of XML. We draw from research methodologies used in relational query languages and database design and apply them to the study of XPath and the design of structural summaries for XML. In particular, we study the roles that various fragments of the XPath algebra play in distinguishing data components in an XML document, and leverage the results in designing novel structural indexes and statistical summaries for more efficient XML query processing and more accurate result size estimation.
