2002
Recent years have witnessed an increasing interest in designing algorithms for querying and analyzing streaming data (i.e., data that is seen only once in a fixed order) with only limited memory. Providing (perhaps approximate) answers to queries over such continuous data streams is a crucial requirement for many application environments; examples include large telecom and IP network installations where performance data from different parts of the network needs to be continuously collected and analyzed.
Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004.
We present a novel approach to approximate evaluation of standing aggregate queries over streaming data, subject to user-specified error bounds. Our method models the behavior of aggregates as Brownian motions, and adaptively updates the model according to stream characteristics. This approach has two advantages. First, it greatly improves system scalability since we can defer query evaluation as long as the difference between the returned and true aggregate values remains within user-specified bounds. Second, we are able to provide approximate answers during stream interruptions by estimating the rate at which the streams and the aggregate drift during the blackout periods. We also study processor allocation issues in such approximate aggregate evaluation systems. Our experiments show that our model captures the behavior of real-world streams such as sensor data and stock traces with excellent fidelity, and scales very well for large numbers of standing queries.
… , 2006. ICDE'06. Proceedings of the …, 2006
Aggregate monitoring over data streams is attracting increasing attention in the research community due to its broad potential applications. Existing methods suffer from two problems: 1) the aggregate functions that can be monitored are restricted to first-order statistics or to functions that are monotonic with respect to the window size; 2) only a limited number of granularities and time scales can be monitored over a stream, so some interesting patterns might be missed, and users might be misled by an incomplete profile of how the current data streams are changing. These two problems impede the development of online mining techniques over data streams, and a breakthrough is needed. In this paper, we employ fractal analysis to enable the monitoring of both monotonic and non-monotonic aggregates on time-changing data streams. The monotonicity property of aggregate monitoring is revealed, and a monotonic search space is built to decrease the time overhead for accessing the synopsis from O(m) to O(log m), where m is the number of windows to be monitored. With the help of a novel inverted histogram, the statistical summary is compressed to fit in limited main memory, so that high aggregates on windows of any length can be detected accurately and efficiently online. Theoretical analysis shows that the space and time complexity bounds of this method are relatively low, while experimental results demonstrate the applicability and efficiency of the proposed algorithm in different application settings.
Lecture Notes in Computer Science, 2004
Recent years have witnessed an increasing interest in designing algorithms for querying and analyzing streaming data (i.e., data that is seen only once in a fixed order) with only limited memory. Providing (perhaps approximate) answers to queries over such continuous data streams is a crucial requirement for many application environments; examples include large telecom and IP network installations where performance data from different parts of the network needs to be continuously collected and analyzed. Randomized techniques, based on computing small "sketch" synopses for each stream, have recently been shown to be a very effective tool for approximating the result of a single SQL query over streaming data tuples. In this paper, we investigate the problems arising when data-stream sketches are used to process multiple such queries concurrently. We demonstrate that, in the presence of multiple query expressions, intelligently sharing sketches among concurrent query evaluations can result in substantial improvements in the utilization of the available sketching space and the quality of the resulting approximation error guarantees. We provide necessary and sufficient conditions for multi-query sketch sharing that guarantee the correctness of the result-estimation process. We also prove that optimal sketch sharing typically gives rise to NP-hard questions, and we propose novel heuristic algorithms for finding good sketch-sharing configurations in practice. Results from our experimental study with realistic workloads verify the effectiveness of our approach, clearly demonstrating the benefits of our sketch-sharing methodology.
Proceedings of the ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems - SIGMETRICS '11, 2011
The massive data streams observed in network monitoring, data processing and scientific studies are typically too large to store. For many applications over such data, we must obtain compact summaries of the stream. These summaries should allow accurate answering of post hoc queries with estimates which approximate the true answers over the original stream. The data often has an underlying structure which makes certain subset queries, in particular range queries, more relevant than arbitrary subsets. Applications such as access control, change detection, and heavy hitters typically involve subsets that are ranges or unions thereof.
2010
Data streams constitute the core of many traditional (e.g. financial) and emerging (e.g. environmental) applications. The sources of streams are ubiquitous in daily life (e.g. web clicks). One feature of these data is the high speed of their arrival, which places a special constraint on their processing. Despite the exponential growth in the capacity of storage devices, it is very expensive, or even impossible, to store a data stream in its entirety. Consequently, queries are evaluated only on the recent data of the stream, and older data expire. However, some applications need to query the whole data stream. The inability to store a complete stream therefore suggests storing compact representations of its data, called summaries. These structures allow users to query the past without an explosion of the required storage space, to provide historical aggregated information, to perform data mining tasks or to detect anomalous behavior in computer systems. The side effect of using summaries is that queries over historical data may not return exact answers, but only approximate ones. This paper introduces a new approach which is a trade-off between the accuracy of query results and the time consumed in building summaries.
International Journal of Business Intelligence and Data Mining, 2008
Random samples are common in data streams applications due to limitations in data sources and transmission lines, or to load-shedding policies. Here we introduce a formal error model and show that, besides providing accurate estimates, it improves query answer accuracy by exploiting past statistics. The method is general, robust in the presence of concept drift, and minimises uncertainties due to sampling with negligible time and space overhead. We describe the application of the method, and the results obtained for SQL window aggregates, statistical aggregates such as quantiles, and data mining functions such as k-means clustering and naive Bayesian classifiers.
Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems - PODS '07, 2007
IP packet streams consist of multiple interleaving IP flows. Statistical summaries of these streams, collected for different measurement periods, are used for characterization of traffic, billing, anomaly detection, inferring traffic demands, configuring packet filters and routing protocols, and more. While queries are posed over the set of flows, the summarization algorithm is applied to the stream
Proceedings of the VLDB Endowment, 2012
While traditional data-management systems focus on evaluating single, ad-hoc queries over static data sets in a centralized setting, several emerging applications require (possibly, continuous) answers to queries on dynamic data that is widely distributed and constantly updated. Furthermore, such query answers often need to discount data that is "stale", and operate solely on a sliding window of recent data arrivals (e.g., data updates occurring over the last 24 hours). Such distributed data streaming applications mandate novel algorithmic solutions that are both time- and space-efficient (to manage high-speed data streams), and also communication-efficient (to deal with physical data distribution). In this paper, we consider the problem of complex query answering over distributed, high-dimensional data streams in the sliding-window model. We introduce a novel sketching technique (termed ECM-sketch) that allows effective summarization of streaming data over both time-based and count-based sliding windows with probabilistic accuracy guarantees. Our sketch structure enables point as well as inner-product queries, and can be employed to address a broad range of problems, such as maintaining frequency statistics, finding heavy hitters, and computing quantiles in the sliding-window model. Focusing on distributed environments, we demonstrate how ECM-sketches of individual, local streams can be composed to generate a (low-error) ECM-sketch summary of the order-preserving aggregation of all streams; furthermore, we show how ECM-sketches can be exploited for continuous monitoring of sliding-window queries over distributed streams. Our extensive experimental study with two real-life data sets validates our theoretical claims and verifies the effectiveness of our techniques. To the best of our knowledge, ours is the first work to address efficient, guaranteed-error complex query answering over distributed data streams in the sliding-window model.
2002
This article deals with continuous conjunctive queries with arithmetic comparisons and optional aggregation over multiple data streams. An algorithm is presented for determining whether or not any given query can be evaluated using a bounded amount of memory for all possible instances of the data streams. For queries that can be evaluated using bounded memory, an execution strategy based on constant-sized synopses of the data streams is proposed. For queries that cannot be evaluated using bounded memory, data stream scenarios are identified in which evaluating the queries requires memory linear in the size of the unbounded streams.
Query estimation plays an important role in query optimization by choosing a particular query plan. Performing query estimation becomes quite challenging in case of fast, continuous, online data streams. Different summarization methods like sampling, histograms, wavelets, sketches, discrete cosine series etc. are used to store data distribution for query estimation. In this paper a brief survey of query estimation techniques in view of data streams is presented.
Proceedings of the 2021 International Conference on Management of Data, 2021
Sample-based approximate query processing (AQP) suffers from many pitfalls such as the inability to answer very selective queries and unreliable confidence intervals when sample sizes are small. Recent research presented an intriguing solution of combining materialized, pre-computed aggregates with sampling for accurate and more reliable AQP. We explore this solution in detail in this work and propose an AQP physical design called PASS, or Precomputation-Assisted Stratified Sampling. PASS builds a tree of partial aggregates that cover different partitions of the dataset. The leaf nodes of this tree form the strata for stratified samples. Aggregate queries whose predicates align with the partitions (or unions of partitions) are exactly answered with a depth-first search, and any partial overlaps are approximated with the stratified samples. We propose an algorithm for optimally partitioning the data into such a data structure with various practical approximation techniques. * A version of this paper has been accepted to SIGMOD'21. This document is its associated technical report. This work is mainly done when Zechao was at the University of Chicago.
Proceedings of the VLDB Endowment, 2010
Uncertain data streams are increasingly common in real-world deployments and monitoring applications require the evaluation of complex queries on such streams. In this paper, we consider complex queries involving conditioning (e.g., selections and group by's) and aggregation operations on uncertain data streams. To characterize the uncertainty of answers to these queries, one generally has to compute the full probability distribution of each operation used in the query. Computing distributions of aggregates given conditioned tuple distributions is a hard, unsolved problem. Our work employs a new evaluation framework that includes a general data model, approximation metrics, and approximate representations. Within this framework we design fast data-stream algorithms, both deterministic and randomized, for returning approximate distributions with bounded errors as answers to those complex queries. Our experimental results demonstrate the accuracy and efficiency of our approximation techniques and offer insights into the strengths and limitations of deterministic and randomized algorithms.
2005
In many data streaming applications, streams may contain data tuples that are either redundant, repetitive, or that are not "interesting" to any of the standing continuous queries. Processing such tuples may waste system resources without producing useful answers. To the contrary, some other tuples can be categorized as promising. This paper proposes that stream query engines can have the option to execute on promising tuples only and not on all tuples. We propose to maintain intermediate stream summaries and indices that can direct the stream query engine to detect and operate on promising tuples. As an illustration, the proposed intermediate stream summaries are tuned towards capturing promising tuples that (1) maximize the number of output tuples, (2) contribute to producing a faithful representative sample of the output tuples (compared to the output produced when assuming infinite resources), or (3) produce the outlier or deviant results. Experiments are conducted in the context of Nile [24], a prototype stream query processing engine developed at Purdue University.
2015
Estimating the frequency of any piece of information in large-scale distributed data streams became of utmost importance in the last decade (e.g., in the context of network monitoring, big data, etc.). Although some elegant solutions have been proposed recently, their approximation is computed from the inception of the stream. In a runtime distributed context, one would prefer to gather information only about the recent past. This may be motivated by the need to save resources or by the fact that recent information is more relevant. In this paper, we consider the sliding window model and propose two different (on-line) algorithms that approximate the item frequencies in the active window. More precisely, we determine an (ε, δ)-additive-approximation, meaning that the error is greater than ε only with probability δ. These solutions use a very small amount of memory with respect to the size N of the window and the number n of distinct items of the stream, namely, O((1/ε) log(1/δ) (log N + log n)) and O((1/(τε)) log(1/δ) (log N + log n)) bits of space, where τ is a parameter limiting memory usage. We also provide their distributed variant, i.e., considering the sliding window functional monitoring model. We compared the proposed algorithms to each other and also to the state of the art through extensive experiments on synthetic traces and real data sets that validate the robustness and accuracy of our algorithms.
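To make the (ε, δ) guarantee concrete, here is a minimal sketch of a plain Count-Min sketch, a standard frequency-estimation summary. It is not the sliding-window algorithm of the paper above: it counts from the inception of the stream, which is exactly the limitation the abstract sets out to remove. The class name `CountMinSketch`, its methods, and the use of salted BLAKE2b as a stand-in for pairwise-independent hash functions are all assumptions made for illustration.

```python
import hashlib
import math


class CountMinSketch:
    """Whole-stream frequency summary (illustrative, not sliding-window)."""

    def __init__(self, epsilon, delta):
        # Standard sizing: width = ceil(e / epsilon), depth = ceil(ln(1 / delta)).
        self.width = math.ceil(math.e / epsilon)
        self.depth = math.ceil(math.log(1.0 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.total = 0  # number of items seen so far

    def _index(self, item, row):
        # Assumption: a salted cryptographic hash stands in for the
        # pairwise-independent hash family used in the analysis.
        digest = hashlib.blake2b(
            repr(item).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        self.total += count
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Each row can only overcount (collisions add), so the minimum
        # over rows is the tightest available estimate.
        return min(
            self.table[row][self._index(item, row)]
            for row in range(self.depth)
        )
```

The estimate never falls below the true count, and under the hashing assumption it exceeds the true count by more than ε times the stream length only with probability at most δ.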
Proceedings of the 2005 ACM SIGMOD international conference on Management of data - SIGMOD '05, 2005
A windowed query operator breaks a data stream into possibly overlapping subsets of data and computes results over each. Many stream systems can evaluate window aggregate queries. However, current stream systems suffer from a lack of an explicit definition of window semantics. As a result, their implementations unnecessarily confuse window definition with physical stream properties. This confusion complicates the stream system, and even worse, can hurt performance both in terms of memory usage and execution time. To address this problem, we propose a framework for defining window semantics, which can be used to express almost all types of windows of which we are aware, and which is easily extensible to other types of windows that may occur in the future. Based on this definition, we explore a one-pass query evaluation strategy, the Window-ID (WID) approach, for various types of window aggregate queries. WID significantly reduces both required memory space and execution time for a large class of window definitions. In addition, WID can leverage punctuations to gracefully handle disorder. Our experimental study shows that WID has better execution-time performance than existing window aggregate query evaluation options that retain and reprocess tuples, and has better latency-accuracy tradeoff performance for disordered input streams compared to using a fixed delay for disorder handling.
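The idea of keying aggregation on window identifiers rather than buffering tuples can be illustrated with a small sketch. This is a simplified reading of the approach, not the authors' implementation: it assumes integer timestamps, time-based windows of the form [w * slide, w * slide + range), and a SUM aggregate, and the function names `window_ids` and `windowed_sum` are hypothetical.

```python
from collections import defaultdict


def window_ids(t, range_, slide):
    """Ids of the windows [w*slide, w*slide + range_) that contain timestamp t."""
    last = t // slide                          # latest window starting at or before t
    first = max(0, (t - range_) // slide + 1)  # earliest window that has not yet ended at t
    return range(first, last + 1)


def windowed_sum(stream, range_, slide):
    """One pass over (timestamp, value) pairs; no tuple is retained or reprocessed."""
    sums = defaultdict(int)
    for t, value in stream:
        for w in window_ids(t, range_, slide):
            sums[w] += value
    return dict(sums)
```

For example, with a range of 4 and a slide of 2, a tuple at timestamp 3 falls into windows 0 and 1 (covering [0, 4) and [2, 6)), so its value is added to both partial sums as it arrives. Because the state is keyed by window id rather than by arrival position, a tuple that arrives late is still credited to the correct windows in this simplified setting.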
Proceedings of the …, 2001
We present techniques for computing small space representations of massive data streams. These are inspired by traditional wavelet-based approximations that consist of specific linear projections of the underlying data. We present general sketch based methods for ...
The VLDB Journal, 2015
While traditional data-management systems focus on evaluating single, ad-hoc queries over static data sets in a centralized setting, several emerging applications require (possibly, continuous) answers to queries on dynamic data that is widely distributed and constantly updated. Furthermore, such query answers often need to discount data that is "stale", and operate solely on a sliding window of recent data arrivals (e.g., data updates occurring over the last 24 hours). Such distributed data streaming applications mandate novel algorithmic solutions that are both time- and space-efficient (to manage high-speed data streams), and also communication-efficient (to deal with physical data distribution). In this paper, we consider the problem of complex query answering over distributed, high-dimensional data streams in the sliding-window model. We introduce a novel sketching technique (termed ECM-sketch) that allows effective summarization of streaming data over both time-based and count-based sliding windows with probabilistic accuracy guarantees. Our sketch structure enables point as well as inner-product queries, and can be employed to address a broad range of problems, such as maintaining frequency statistics, finding heavy hitters, and computing quantiles in the sliding-window model. Focusing on distributed environments, we demonstrate how ECM-sketches of individual, local streams can be composed to generate a (low-error) ECM-sketch summary of the order-preserving merging of all streams; furthermore, we show how ECM-sketches can
ACM Transactions on Database Systems, 2004
Continuous queries often require significant run-time state over arbitrary data streams. However, streams may exhibit certain data or arrival patterns, or constraints, that can be detected and exploited to reduce state considerably without compromising correctness. Rather than requiring constraints to be satisfied precisely, which can be unrealistic in a data streams environment, we introduce k-constraints, where k is an adherence parameter specifying how closely a stream adheres to the constraint. (Smaller k's are closer to strict adherence and offer better memory reduction.) We present a query processing architecture, called k-Mon, that detects useful k-constraints automatically and exploits the constraints to reduce run-time state for a wide range of continuous queries. Experimental results showed dramatic state reduction, while only modest computational overhead was incurred for our constraint monitoring and query execution algorithms.
A histogram is a piecewise-constant approximation of an observed data distribution. A histogram is used as a small-space, approximate synopsis of the underlying data distribution, which is often too large to be stored precisely. Histograms have found many applications in database management systems, perhaps most commonly for query selectivity estimation in query optimizers [1], but have also found applications in approximate query answering [2], load balancing in parallel join execution [3], mining time-series data [4], partition-based temporal join execution, query profiling for user feedback, etc. Ioannidis has a nice overview of the history of histograms, their applications, and their use in commercial DBMSs [5]. Also, Poosala’s thesis provides a systematic treatment of different types of histograms [3].
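As a rough illustration of the piecewise-constant idea, the sketch below builds an equi-width histogram and uses it to estimate how many values fall in a range, which is the core of selectivity estimation. Equi-width bucketing is only one of the histogram types the entry alludes to and is chosen here for brevity; the function names `build_equi_width` and `estimate_range_count`, and the assumption that all values lie in a known domain [lo, hi], are illustrative.

```python
def build_equi_width(values, lo, hi, buckets):
    """Count values per bucket; every bucket spans (hi - lo) / buckets."""
    width = (hi - lo) / buckets
    counts = [0] * buckets
    for v in values:
        # Clamp so that v == hi lands in the last bucket.
        i = min(int((v - lo) / width), buckets - 1)
        counts[i] += 1
    return counts


def estimate_range_count(counts, lo, hi, a, b):
    """Estimate how many values lie in [a, b), assuming values are
    spread uniformly inside each bucket (the piecewise-constant model)."""
    width = (hi - lo) / len(counts)
    total = 0.0
    for i, c in enumerate(counts):
        b_lo = lo + i * width
        b_hi = b_lo + width
        overlap = max(0.0, min(b, b_hi) - max(a, b_lo))
        total += c * overlap / width
    return total
```

A query range that aligns with bucket boundaries is answered exactly; a range that cuts through a bucket is charged a proportional share of that bucket's count, which is where the approximation error comes from.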
Proceedings of the 20th ACM international conference on Information and knowledge management, 2011
Data Streams Management Systems are designed to support monitoring applications which require the processing of hundreds of Aggregate Continuous Queries (ACQs). These ACQs typically have different time granularities, with possibly different selection predicates and group-by attributes. In order to achieve scalability in the presence of heavy workloads, in this paper, we introduce the concept of 'Weaveability' as an indicator of the potential gains of sharing the processing of ACQs. We then propose Weave Share, a cost-based optimizer that exploits weaveability to optimize the shared processing of ACQs. Our experimental analysis shows that Weave Share outperforms the alternative sharing schemes generating up to four orders of magnitude better quality plans. Finally, we describe a practical implementation of the Weave Share optimizer.