Very Large Data Bases, 2005
The largest databases in use today are so large that answering a query exactly can take minutes, hours, or even days. One way to address this problem is to make use of approximation algorithms. Previous work on online aggregation has considered how to give online estimates with ever-increasing accuracy for aggregate functions over relational join and selection queries. However, no existing work is applicable to online estimation over subset-based SQL queries: those queries with a correlated subquery linked to an outer query via a NOT EXISTS, NOT IN, EXISTS, or IN clause (other queries, such as EXCEPT and INTERSECT, can also be seen as subset-based queries). In this paper we develop algorithms for online estimation over such queries, and consider the difficult problem of providing probabilistic accuracy guarantees at all times during query execution.
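To make the flavor of such an estimator concrete, here is a minimal sketch of online estimation for a SUM over a NOT EXISTS query, assuming the outer relation is a list of (key, value) pairs and the inner relation's keys fit in memory as a set; the paper's harder setting, where the inner relation must itself be sampled, is not shown, and all names here are illustrative.

```python
import random

def online_not_exists_sum(outer, inner_keys, batch_size=100):
    """Sketch: refine a SUM estimate over outer tuples with no inner match.
    `outer` is a list of (key, value) pairs; `inner_keys` is an in-memory
    set (an assumption that sidesteps the paper's hard case)."""
    n = len(outer)
    order = random.sample(range(n), n)     # scan the outer relation in random order
    running_sum = 0.0
    for seen, idx in enumerate(order, start=1):
        key, value = outer[idx]
        if key not in inner_keys:          # the NOT EXISTS test for this tuple
            running_sum += value
        if seen % batch_size == 0 or seen == n:
            yield running_sum * n / seen   # scale the partial sum to the full relation
```

Each yielded value is an ever-refining estimate of the true SUM; attaching probabilistic accuracy guarantees to such a stream of estimates is the problem the paper addresses.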
The VLDB Journal, 2008
We consider the problem of using sampling to estimate the result of an aggregation operation over a subset-based SQL query, where a subquery is correlated to an outer query by a NOT EXISTS, NOT IN, EXISTS, or IN clause. We design an unbiased estimator for our query and prove that it is indeed unbiased. We then provide a second, biased estimator that makes use of the superpopulation concept from statistics to minimize the mean squared error of the resulting estimate. The two estimators are evaluated in an extensive set of experiments.
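As a hedged illustration of the unbiasedness argument under simple random sampling (the symbols N, n, S, R, q, and v belong to this sketch, not the paper's notation):

```latex
% Simple random sample S, |S| = n, from the N tuples of the outer relation R;
% q(t) = 1 if tuple t survives the NOT EXISTS / NOT IN test, else 0;
% v(t) is the value being aggregated.
\hat{Y} = \frac{N}{n} \sum_{t \in S} q(t)\, v(t),
\qquad
\mathbb{E}\bigl[\hat{Y}\bigr]
  = \frac{N}{n}\cdot n \cdot \frac{1}{N} \sum_{t \in R} q(t)\, v(t)
  = \sum_{t \in R} q(t)\, v(t) = Y .
```

When the subquery side is also sampled, q(t) must itself be estimated; that is where the paper's second, superpopulation-based estimator trades a little bias for lower mean squared error.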
Journal of Database Management, 1997
CASE-DB is a relational database management system that allows users to specify time constraints in queries. For an aggregate query AGG(E) where AGG is one of COUNT, SUM and AVERAGE, and E is a relational algebra expression, CASE-DB uses statistical estimators to approximate the query. This paper extends our earlier work on statistical estimators of CASE-DB with the following features: (a) New statistical estimators for COUNT queries with projection. (b) Extending the methodology for SUM and AVERAGE aggregate queries. (c) New sampling plans based on systematic sampling and stratified sampling. We also present performance evaluation experiments of the estimators with the above extensions using correlated and uncorrelated database instances.
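A minimal sketch of the stratified-sampling idea for a COUNT query with selection (the per-stratum scale-up below is textbook stratified estimation; the interfaces are illustrative, not CASE-DB's, and the paper's estimators for COUNT with projection are more involved):

```python
import random

def stratified_count(strata, predicate, sample_frac=0.05):
    """Estimate COUNT(*) WHERE predicate over a relation partitioned into
    strata (a list of lists of tuples): scale each stratum's sample hits
    up by the stratum size, then sum across strata."""
    estimate = 0.0
    for stratum in strata:
        n_h = len(stratum)
        k = max(1, int(n_h * sample_frac))
        sample = random.sample(stratum, k)
        hits = sum(1 for t in sample if predicate(t))
        estimate += n_h * hits / k          # per-stratum scale-up
    return estimate
```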
Proceedings of the 2021 International Conference on Management of Data, 2021
Sample-based approximate query processing (AQP) suffers from many pitfalls, such as the inability to answer very selective queries and unreliable confidence intervals when sample sizes are small. Recent research presented an intriguing solution of combining materialized, pre-computed aggregates with sampling for accurate and more reliable AQP. We explore this solution in detail in this work and propose an AQP physical design called PASS, or Precomputation-Assisted Stratified Sampling. PASS builds a tree of partial aggregates that cover different partitions of the dataset. The leaf nodes of this tree form the strata for stratified samples. Aggregate queries whose predicates align with the partitions (or unions of partitions) are answered exactly with a depth-first search, and any partial overlaps are approximated with the stratified samples. We propose an algorithm for optimally partitioning the data into such a data structure, along with various practical approximation techniques. (A version of this paper has been accepted to SIGMOD'21; this document is its associated technical report. This work was mainly done while Zechao was at the University of Chicago.)
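A minimal sketch of the exact-plus-approximate split that PASS performs, assuming a flat list of leaf partitions with precomputed min/max/count/sum and one stratified sample per leaf (the real design maintains partial aggregates in a tree and answers covered partitions via depth-first search):

```python
def pass_style_sum(partitions, samples, query_range):
    """Answer SUM(val) over a one-dimensional range predicate: fully covered
    partitions contribute their precomputed sum exactly; partially covered
    ones are approximated from their (assumed non-empty) stratified sample."""
    lo, hi = query_range
    total = 0.0
    for p in partitions:
        if lo <= p["min"] and p["max"] <= hi:
            total += p["sum"]               # exact: predicate covers the stratum
        elif p["max"] < lo or hi < p["min"]:
            continue                        # disjoint: contributes nothing
        else:
            s = samples[p["id"]]            # stratified sample for this leaf
            in_range = [v for v in s if lo <= v <= hi]
            total += p["count"] * sum(in_range) / len(s)
    return total
```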
Fast and accurate estimations for complex queries are profoundly beneficial for large databases with heavy workloads. In this research, we propose a statistical summary for a database, called CS2 (Correlated Sample Synopsis), to provide rapid and accurate result size estimations for all queries with joins and arbitrary selections. Unlike the state-of-the-art techniques, CS2 does not completely rely on simple random samples, but mainly consists of correlated sample tuples that retain join relationships with less storage. We introduce a statistical technique, called reverse sample, and design a powerful estimator, called reverse estimator, to fully utilize correlated sample tuples for query estimation. We prove both theoretically and empirically that the reverse estimator is unbiased and accurate using CS2. Extensive experiments on multiple datasets show that CS2 is fast to construct and derives more accurate estimations than existing methods with the same space budget.
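To convey the flavor of correlated sampling for join-size estimation, here is a hedged sketch: sample the source relation, preserve join relationships to the other relation, and scale by the sampling rate. This simplification is not CS2's reverse estimator, and the dictionaries used here stand in for real synopses.

```python
import random
from collections import defaultdict

def correlated_join_size(R, S, frac=0.01):
    """Estimate |R JOIN S| on key "k": include each R tuple with probability
    `frac`, count its joining S tuples (the correlated part of the sample),
    and divide by the inclusion probability."""
    r_sample = [r for r in R if random.random() < frac]
    fanout = defaultdict(int)
    for s in S:
        fanout[s["k"]] += 1                 # join fan-out per key in S
    joined = sum(fanout[r["k"]] for r in r_sample)
    return joined / frac                    # unbiased scale-up
```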
Aggregation in traditional database systems is performed in batch mode: a query is submitted, the system processes a large volume of data over a long period of time, and, eventually, the final answer is returned. This archaic approach is frustrating to users and has been abandoned in most other areas of computing. In this paper we propose a new online aggregation interface that permits users to both observe the progress of their aggregation queries and control execution on the fly. After outlining usability and performance requirements for a system supporting online aggregation, we present a suite of techniques that extend a database system to meet these requirements. These include methods for returning the output in random order, for providing control over the relative rate at which different aggregates are computed, and for computing running confidence intervals. Finally, we report on an initial implementation of online aggregation in POSTGRES.
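A minimal sketch of the running-estimate interface for AVG, using Welford's online moments and a CLT-style interval (the function and parameter names are this sketch's, and the real system also randomizes the scan order feeding the stream):

```python
import math

def running_avg_with_ci(stream, z=1.96, report_every=1000):
    """Yield (mean, half_width) pairs as tuples stream in, where the true
    average lies in mean +/- half_width with roughly 95% confidence under
    CLT assumptions."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)            # Welford update of squared deviations
        if n % report_every == 0 and n > 1:
            half_width = z * math.sqrt(m2 / (n - 1)) / math.sqrt(n)
            yield mean, half_width
```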
Proceedings of the VLDB Endowment, 2021
Researchers and industry analysts are increasingly interested in computing aggregation queries over large, unstructured datasets with selective predicates that are computed using expensive deep neural networks (DNNs). As these DNNs are expensive and because many applications can tolerate approximate answers, analysts are interested in accelerating these queries via approximations. Unfortunately, standard approximate query processing techniques to accelerate such queries are not applicable, because they assume the results of the predicates are available ahead of time. Furthermore, recent work using cheap approximations (i.e., proxies) does not support aggregation queries with predicates. To accelerate aggregation queries with expensive predicates, we develop and analyze a query processing algorithm that leverages proxies (ABAE). ABAE must account for the key challenge that it may sample records that do not satisfy the predicate. To address this challenge, we first use the proxy to group ...
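A hedged sketch of the proxy-guided idea: order records by a cheap proxy score, stratify, and spend the expensive-oracle budget per stratum. ABAE's actual allocation uses pilot sampling and provably good budget splits, none of which is shown; all names here are illustrative.

```python
import random

def proxy_stratified_avg(records, proxy, oracle, n_strata=4, budget=400):
    """Approximate AVG(val) over records satisfying an expensive predicate:
    `proxy` is a cheap score correlated with the predicate, `oracle` is the
    expensive DNN predicate applied only to sampled records."""
    records = sorted(records, key=proxy)
    size = max(1, len(records) // n_strata)
    strata = [records[i:i + size] for i in range(0, len(records), size)]
    per = max(1, budget // len(strata))
    weighted_sum, weighted_count = 0.0, 0.0
    for stratum in strata:
        sample = random.sample(stratum, min(per, len(stratum)))
        vals = [r["val"] for r in sample if oracle(r)]  # expensive calls happen here
        frac = len(vals) / len(sample)                  # estimated selectivity in stratum
        mean = sum(vals) / len(vals) if vals else 0.0
        weighted_sum += len(stratum) * frac * mean
        weighted_count += len(stratum) * frac
    return weighted_sum / weighted_count if weighted_count else 0.0
```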
ACM Transactions on Database Systems, 2018
It is common practice for data scientists to acquire and integrate disparate data sources to achieve higher quality results. But even with a perfectly cleaned and merged data set, two fundamental questions remain: (1) Is the integrated data set complete? and (2) What is the impact of any unknown (i.e., unobserved) data on query results? In this work, we develop and analyze techniques to estimate the impact of the unknown data (a.k.a. unknown unknowns) on simple aggregate queries. The key idea is that the overlap between different data sources enables us to estimate the number and values of the missing data items. Our main techniques are parameter-free and do not assume prior knowledge about the distribution; we also propose a parametric model that can be used instead when the data sources are imbalanced. Through a series of experiments, we show that estimating the impact of unknown unknowns is invaluable to better assess the results of aggregate queries over integrated data sources.
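One classic parameter-free tool in this space is species estimation from overlap frequencies. As a hedged sketch, the Chao1 estimator below infers how many distinct items were never observed from how many were seen exactly once or twice across the pooled sources; the paper's estimators of aggregate-query impact build well beyond this.

```python
from collections import Counter

def chao1_total_items(observations):
    """Estimate the total number of distinct items from a multiset of
    observations pooled across overlapping sources: items seen once (f1)
    and twice (f2) bound the unseen mass."""
    freq_of_freq = Counter(Counter(observations).values())
    f1, f2 = freq_of_freq.get(1, 0), freq_of_freq.get(2, 0)
    observed = len(set(observations))
    unseen = f1 * f1 / (2 * f2) if f2 else f1 * (f1 - 1) / 2
    return observed + unseen
```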
ACM SIGMOD Record, 1989
We consider those database environments in which queries have strict timing constraints, and develop a time-constrained query evaluation methodology. For aggregate relational algebra queries, we describe a time-constrained query evaluation algorithm. The algorithm, which is implemented in our prototype DBMS, iteratively samples from input relations, and evaluates the associated estimators developed in our previous work, until a stopping criterion (e.g., a time quota or a desired error range) is satisfied. To determine sample sizes at each stage of the iteration (so that the time quota will not be overspent) we need to have (a) accurate sample selectivity estimations of the RA operators in the query, (b) precise time cost formulas, and (c) good time-control strategies. To estimate the sample selectivities of RA operators, we use a flexible runtime sample-selectivity estimation and improvement approach. For query time estimations, we use time-cost formulas which are adaptive ...
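A minimal sketch of the iterate-until-quota loop the abstract describes, for a COUNT estimate (selectivity estimation, time-cost formulas, and time-control strategies, the paper's actual contributions, are elided; batches are drawn with replacement across iterations for simplicity):

```python
import random
import time

def time_constrained_count(relation, predicate, quota_s=1.0, batch=500):
    """Draw sample batches and refine a COUNT estimate until either the
    time quota is spent or the whole relation has been consumed."""
    deadline = time.monotonic() + quota_s
    n, hits, drawn = len(relation), 0, 0
    estimate = 0.0
    while drawn < n and time.monotonic() < deadline:
        step = min(batch, n - drawn)
        for t in random.sample(relation, step):
            hits += predicate(t)            # predicate returns True/False
        drawn += step
        estimate = hits * n / drawn         # current scaled-up COUNT
    return estimate
```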
Modern data analytics applications typically process massive amounts of data on clusters of tens, hundreds, or thousands of machines to support near-real-time decisions. The quantity of data and limitations of disk and memory bandwidth often make it infeasible to deliver answers at interactive speeds. However, it has been widely observed that many applications can tolerate some degree of inaccuracy. This is especially true for exploratory queries on data, where users are satisfied with "close-enough" answers if they can come quickly. A popular technique for speeding up queries at the cost of accuracy is to execute each query on a sample of data, rather than the whole dataset. To ensure that the returned result is not too inaccurate, past work on approximate query processing has used statistical techniques to estimate "error bars" on returned results. However, existing work in the sampling-based approximate query processing (S-AQP) community has not validated whether these techniques actually generate accurate error bars for real query workloads. In fact, we find that error bar estimation often fails on real world production workloads. Fortunately, it is possible to quickly and accurately diagnose the failure of error estimation for a query. In this paper, we show that it is possible to implement a query approximation pipeline that produces approximate answers and reliable error bars at interactive speeds.
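For context, a hedged sketch of the kind of resampling-based error bar that S-AQP systems attach to an answer; the paper's contribution, a fast diagnostic that detects when such bars are unreliable, is not reproduced here.

```python
import random
import statistics

def bootstrap_ci(sample, stat=statistics.mean, trials=200, alpha=0.05):
    """Percentile-bootstrap confidence interval for `stat` of the data the
    sample was drawn from: resample with replacement, recompute the
    statistic, and take the central 1 - alpha mass."""
    boots = sorted(
        stat(random.choices(sample, k=len(sample))) for _ in range(trials)
    )
    lo = boots[int(trials * alpha / 2)]
    hi = boots[int(trials * (1 - alpha / 2)) - 1]
    return lo, hi
```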
2013
Continuous queries are used to monitor changes to time-varying data and to provide results useful for online decision making. In such queries, a client specifies a coherency requirement as part of the query. A network of data aggregators is a low-cost, scalable technique for executing continuous queries; because an individual node cannot determine by itself whether it is included in the query result, these queries pose algorithmic challenges different from those of aggregation and selection queries. Each data aggregator can serve a set of data items at specific coherencies. The technique involves dividing the query into sub-queries and executing the sub-queries on chosen data aggregators. We build a query cost model that can be used to estimate the number of refresh messages required to satisfy the client-specified incoherency bound. Performance results show that, using our query cost model, queries can be executed with less than one third of the number of refresh messages required by existing schemes. In our adaptive strategy, distributed decisions are made independently by the distributed servers, based on localized statistics collected by each server at runtime.
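A heavily hedged sketch of the planning step the abstract outlines: split the query's incoherency bound across data items, assign each item to an aggregator that can serve it, and accumulate an assumed refresh-cost shape (item dynamics divided by allotted incoherency). The equal split, the load-based tie-break, and the cost shape are this sketch's assumptions, not the paper's model.

```python
def plan_subqueries(items, aggregators, incoherency_bound):
    """Assign each data item to a serving aggregator and estimate the total
    number of refresh messages. Assumes every item is served by at least
    one aggregator at the required coherency."""
    per_item = incoherency_bound / max(1, len(items))   # naive equal split
    plan, est_refreshes = {}, 0.0
    for item in items:
        candidates = [a for a in aggregators if item["id"] in a["serves"]]
        chosen = min(candidates, key=lambda a: len(plan.get(a["name"], [])))
        plan.setdefault(chosen["name"], []).append(item["id"])
        est_refreshes += item["dynamics"] / per_item    # assumed cost shape
    return plan, est_refreshes
```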