2010, Proceedings of the VLDB Endowment
…
12 pages
Uncertain data streams are increasingly common in real-world deployments and monitoring applications require the evaluation of complex queries on such streams. In this paper, we consider complex queries involving conditioning (e.g., selections and group by's) and aggregation operations on uncertain data streams. To characterize the uncertainty of answers to these queries, one generally has to compute the full probability distribution of each operation used in the query. Computing distributions of aggregates given conditioned tuple distributions is a hard, unsolved problem. Our work employs a new evaluation framework that includes a general data model, approximation metrics, and approximate representations. Within this framework we design fast data-stream algorithms, both deterministic and randomized, for returning approximate distributions with bounded errors as answers to those complex queries. Our experimental results demonstrate the accuracy and efficiency of our approximatio...
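To make concrete what "the full probability distribution" of an aggregate means, here is a minimal sketch, not the paper's algorithm. It assumes each tuple carries an integer value and an independent probability of passing a selection, and it computes the exact distribution of SUM by dynamic programming. The function name `sum_distribution` and the `(value, p)` tuple layout are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def sum_distribution(tuples):
    """Exact distribution of SUM over independent uncertain tuples.

    tuples: list of (value, p) pairs, where p is the (assumed independent)
    probability that the tuple satisfies the selection predicate.
    Returns a dict mapping each possible sum to its probability.
    """
    dist = {0: 1.0}  # before any tuple is seen, the sum is 0 with certainty
    for value, p in tuples:
        nxt = defaultdict(float)
        for partial_sum, prob in dist.items():
            nxt[partial_sum] += prob * (1.0 - p)      # tuple filtered out
            nxt[partial_sum + value] += prob * p      # tuple contributes
        dist = dict(nxt)
    return dist


if __name__ == "__main__":
    for total, prob in sorted(sum_distribution([(1, 0.5), (2, 0.5)]).items()):
        print(total, prob)
```

The support of the result can grow with every tuple, which suggests why approximate representations with bounded error are attractive on a stream.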
2002
Recent years have witnessed an increasing interest in designing algorithms for querying and analyzing streaming data (i.e., data that is seen only once in a fixed order) with only limited memory. Providing (perhaps approximate) answers to queries over such continuous data streams is a crucial requirement for many application environments; examples include large telecom and IP network installations where performance data from different parts of the network needs to be continuously collected and analyzed.
The VLDB Journal, 2011
Uncertain data streams, where data are incomplete and imprecise, have been observed in many environments. Feeding such data streams to existing stream systems produces results of unknown quality, which is of paramount concern to monitoring applications. In this paper, we present the Claro system that supports stream processing for uncertain data naturally captured using continuous random variables. Claro employs a unique data model that is flexible and allows efficient computation. Built on this model, we develop evaluation techniques for relational operators by exploring statistical theory and approximation. We also consider query planning for complex queries given an accuracy requirement. Evaluation results show that our techniques can achieve high performance while satisfying accuracy requirements, and outperform state-of-the-art sampling methods.
Computing Research Repository, 2009
We present the design and development of a data stream system that captures data uncertainty from data collection to query processing to final result generation. Our system focuses on data that is naturally modeled as continuous random variables such as many types of sensor data. To provide an end-to-end solution, our system employs probabilistic modeling and inference to generate
Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004.
We present a novel approach to approximate evaluation of standing aggregate queries over streaming data, subject to user-specified error bounds. Our method models the behavior of aggregates as Brownian motions, and adaptively updates the model according to stream characteristics. This approach has two advantages. First, it greatly improves system scalability since we can defer query evaluation as long as the difference between the returned and true aggregate values remains within user-specified bounds. Second, we are able to provide approximate answers during stream interruptions by estimating the rate at which the streams and the aggregate drift during the blackout periods. We also study processor allocation issues in such approximate aggregate evaluation systems. Our experiments show that our model captures the behavior of real-world streams such as sensor data and stock traces with excellent fidelity, and scales very well for large numbers of standing queries.
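The idea of deferring evaluation while the drift stays within a bound can be illustrated with a simplified sketch. It assumes the aggregate's deviation since the last evaluation behaves like driftless Brownian motion with a known volatility `sigma`, and it looks only at the deviation at a single time `t`, not at the running maximum over the interval. The function names and parameters are hypothetical, and this is not the paper's adaptive model.

```python
import math
from statistics import NormalDist


def prob_within_bound(sigma, t, eps):
    """P(|deviation at time t| <= eps), where deviation ~ Normal(0, sigma^2 * t)."""
    return math.erf(eps / (sigma * math.sqrt(2.0 * t)))


def max_defer_time(sigma, eps, confidence):
    """Largest t such that the deviation at time t lies within eps
    with probability at least `confidence` (0 < confidence < 1)."""
    z = NormalDist().inv_cdf((1.0 + confidence) / 2.0)  # two-sided quantile
    return (eps / (sigma * z)) ** 2


if __name__ == "__main__":
    t = max_defer_time(sigma=0.5, eps=2.0, confidence=0.95)
    print(t, prob_within_bound(0.5, t, 2.0))
```

A tighter error bound or a more volatile stream shortens the time for which re-evaluation can be postponed, which is the trade-off the abstract describes.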
2010
Query processing on uncertain data streams has attracted a lot of attention lately, due to the imprecise nature of the data generated by a variety of streaming applications, such as readings from a sensor network. However, all existing work on uncertain data streams studies unbounded streams. This paper takes the first step towards the important and challenging problem of answering sliding-window queries on uncertain data streams, with a focus on arguably one of the most important types of queries: top-k queries. The challenge of answering sliding-window top-k queries on uncertain data streams stems from the strict space and time requirements of processing both arriving and expiring tuples in high-speed streams, combined with the difficulty of coping with the exponential blowup in the number of possible worlds induced by the uncertain data model. In this paper, we design a unified framework for processing sliding-window top-k queries on uncertain streams. We show that all the existing top-k definitions in the literature can be plugged into our framework, resulting in several succinct synopses that use space much smaller than the window size while also being highly efficient in terms of processing time. In addition to the theoretical space and time bounds that we prove for these synopses, we also present a thorough experimental report to verify their practical efficiency on both synthetic and real data.
Proceedings of the 2003 ACM SIGMOD international conference on Management of data - SIGMOD '03, 2003
Many applications employ sensors for monitoring entities such as temperature and wind speed. A centralized database tracks these entities to enable query processing. Due to continuous changes in these values and limited resources (e.g., network bandwidth and battery power), it is often infeasible to store the exact values at all times. A similar situation exists for moving object environments that track the constantly changing locations of objects. In this environment, it is possible for database queries to produce incorrect or invalid results based upon old data. However, if the degree of error (or uncertainty) between the actual value and the database value is controlled, one can place more confidence in the answers to queries. More generally, query answers can be augmented with probabilistic estimates of the validity of the answers. In this paper we study probabilistic query evaluation based upon uncertain data. A classification of queries is made based upon the nature of the result set. For each class, we develop algorithms for computing probabilistic answers. We address the important issue of measuring the quality of the answers to these queries, and provide algorithms for efficiently pulling data from relevant sensors or moving objects in order to improve the quality of the executing queries. Extensive experiments are performed to examine the effectiveness of several data update policies.
2006
Motivated by the increasing need to analyze complex, uncertain multidimensional data, this paper proposes probabilistic OLAP queries that are computed using probability distributions rather than atomic values. The paper describes how to create probability distributions from base data, and how the distributions can be subsequently used in pre-aggregation. Since the probability distributions can become large, we show how to achieve good time and space efficiency by approximating the distributions. We present the results of several experiments that demonstrate the effectiveness of our methods. The work is motivated with a real-world case study, based on our collaboration with a leading Danish vendor of location-based services. This paper is the first to consider the approximate processing of probabilistic OLAP queries over probability distributions.
28th IEEE International Real-Time Systems Symposium (RTSS 2007), 2007
Data uncertainty is a common problem for the real-time monitoring of data streams. In this paper, we address the issue of efficiently monitoring the satisfaction/violation of user-defined constraints over data streams where the data uncertainty can be probabilistically characterized. We propose a monitoring architecture SPMON that can incorporate probabilistic models of uncertainty in constraint monitoring. We adapt the concept of data similarity in real-time databases to the processing of uncertain data streams. In doing so, we generalize the data similarity by a new concept psr (probabilistic similarity region) that allows us to define similarity relations for probabilistic data with respect to the set of constraints being monitored. This enables the construction of lightweight filters for saving bandwidth. We also show how to efficiently update the filter conditions at run-time.
International Journal of Business Intelligence and Data Mining, 2008
Random samples are common in data streams applications due to limitations in data sources and transmission lines, or to load-shedding policies. Here we introduce a formal error model and show that, besides providing accurate estimates, it improves query answer accuracy by exploiting past statistics. The method is general, robust in the presence of concept drift, and minimises uncertainties due to sampling with negligible time and space overhead. We describe the application of the method, and the results obtained for SQL window aggregates, statistical aggregates such as quantiles, and data mining functions such as k-means clustering and naive Bayesian classifiers.
Information Sciences an International Journal, 1996
Various extended relational data models were proposed to handle uncertain data including possibilistic and probabilistic data. Query processing involving aggregate functions over uncertain data is rarely considered. In this paper, we define a set of extended aggregate functions over probabilistic data. The time complexity of the computations for these extended aggregate functions is, in general, exponential. We develop two efficient algorithms for the computation of the maximum and minimum aggregate functions. The worst-case time complexity of the algorithms is O(n^2). These algorithms can be extended to handle possibilistic data. That is, our work is devoted to the accommodation of uncertain data in database systems with an elaboration on speeding up the processing efficiency of the aggregate functions.
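As an illustration of why MAX avoids the exponential enumeration of possible worlds, the following sketch computes the distribution of the maximum of independent discrete random variables from the product of their cumulative distributions. It is a standard textbook construction under an independence assumption, not the algorithm from the paper. The name `max_distribution` and the dict-based representation are hypothetical.

```python
def max_distribution(variables):
    """Distribution of max(X_1, ..., X_n) for independent discrete variables.

    variables: list of dicts, each mapping a value to its probability
    (each dict is assumed to sum to 1).
    Uses P(max <= v) = product over i of P(X_i <= v).
    """
    support = sorted({v for var in variables for v in var})
    result = {}
    prev_cdf = 0.0
    for v in support:
        cdf = 1.0
        for var in variables:
            cdf *= sum(p for x, p in var.items() if x <= v)
        result[v] = cdf - prev_cdf  # P(max == v)
        prev_cdf = cdf
    return result


if __name__ == "__main__":
    x = {1: 0.5, 2: 0.5}
    y = {1: 0.5, 3: 0.5}
    print(max_distribution([x, y]))
```

The minimum can be handled symmetrically by working with P(X_i >= v) instead of P(X_i <= v).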