Towards Special-Purpose Indexes and Statistics for Uncertain Data

2008

Abstract

The Trio project at Stanford for managing data, uncertainty, and lineage is developed on top of a conventional DBMS. Uncertain data with lineage is encoded in relational tables, and Trio queries are translated to SQL queries on the encoding. Such a layered approach reaps significant benefits in architectural simplicity and in the ability to use an off-the-shelf query processing engine. In this paper, we present special-purpose indexes and statistics that complement the layered approach to further enhance its performance. First, we identify a well-defined structure of Trio queries, relations, and their encoding that the underlying query optimizer can exploit to improve performance under Trio's layered approach. We propose several mechanisms for indexing Trio's uncertain relations and study when these indexes are useful. We then present an interesting order, and an associated operator, which are especially useful to consider when composing query plans. The choice of query plan for a Trio query is dictated by various statistical properties of the input data. We identify the statistical data that can guide the underlying optimizer, and design histograms that enable these statistics to be estimated accurately.

Key takeaways

  • We then show that query plans executing queries over Trio relations need to consider a special interesting order, namely that of grouping alternatives based on the tuple they are a part of.
  • As described in Section 2, there is a special attribute xid in all Trio encoded relations, and all query translations involve a group by on xid.
  • Operators and Indexes: An interesting follow-up to our work would be to study whether any other specialized operators and indexes would be useful for query execution in Trio.
  • We have proposed new indexing mechanisms specific to the ULDB data model and its relational encoding, which are more useful for Trio query processing.
  • To guide the query optimizer in choosing the optimal query plan, we designed histograms that enable estimating various useful statistical properties of uncertain data.
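
The role of xid in the takeaways above can be illustrated with a minimal sketch. It is not Trio's actual encoding or query translation: only the xid attribute comes from the text, while the table name `sightings_enc`, the columns `aid`, `name`, `speed`, and `conf`, the index name, and the helper function are hypothetical names chosen for illustration. The sketch stores each alternative of an uncertain tuple as one ordinary row, then runs a SQL query that groups the qualifying alternatives by the tuple they belong to.

```python
import sqlite3


def qualifying_alternatives_by_xid(rows, min_speed):
    """Group the alternatives that satisfy a predicate by their xid.

    rows: iterable of (xid, aid, name, speed, conf) tuples, one per alternative.
    Returns a list of (xid, number_of_qualifying_alternatives, summed_conf).
    All column names except xid are assumptions made for this sketch.
    """
    conn = sqlite3.connect(":memory:")
    # Hypothetical relational encoding: one row per alternative,
    # with xid identifying the uncertain tuple the alternative is part of.
    conn.execute(
        "CREATE TABLE sightings_enc ("
        "xid INTEGER, aid INTEGER, name TEXT, speed INTEGER, conf REAL)"
    )
    conn.executemany("INSERT INTO sightings_enc VALUES (?, ?, ?, ?, ?)", rows)
    # A conventional index on xid; the paper's special-purpose indexes
    # are not reproduced here.
    conn.execute("CREATE INDEX idx_sightings_xid ON sightings_enc (xid)")
    result = conn.execute(
        "SELECT xid, COUNT(*), SUM(conf) "
        "FROM sightings_enc "
        "WHERE speed >= ? "
        "GROUP BY xid "
        "ORDER BY xid",
        (min_speed,),
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    data = [
        (1, 1, "Honda", 60, 0.5),   # tuple 1, first alternative
        (1, 2, "Toyota", 80, 0.5),  # tuple 1, second alternative
        (2, 1, "Mazda", 40, 1.0),   # tuple 2, single alternative
    ]
    for row in qualifying_alternatives_by_xid(data, 50):
        print(row)
```

Because every translated query groups on xid, a plan that already delivers rows clustered by xid can skip a separate sort or hash step for the grouping, which is the intuition behind treating that grouping as an interesting order.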