2011, Proceedings of the ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems - SIGMETRICS '11
The massive data streams observed in network monitoring, data processing and scientific studies are typically too large to store. For many applications over such data, we must obtain compact summaries of the stream. These summaries should allow accurate answering of post hoc queries with estimates which approximate the true answers over the original stream. The data often has an underlying structure which makes certain subset queries, in particular range queries, more relevant than arbitrary subsets. Applications such as access control, change detection, and heavy hitters typically involve subsets that are ranges or unions thereof.
2010
Data streams constitute the core of many traditional (e.g. financial) and emerging (e.g. environmental) applications. The sources of streams are ubiquitous in daily life (e.g. web clicks). One feature of these data is the high speed of their arrival, so their processing entails a special constraint. Despite the exponential growth in the capacity of storage devices, it is very expensive - even impossible - to store a data stream in its entirety. Consequently, queries are evaluated only on the recent data of the stream, while older data is expired. However, some applications need to query the whole data stream. The inability to store a complete stream therefore suggests storing compact representations of its data, called summaries. These structures allow users to query the past without an explosion of the required storage space, to provide historical aggregated information, to perform data mining tasks, or to detect anomalous behavior in computer systems. The side effect of using summaries is that queries over historical data may not return exact answers, but only approximate ones. This paper introduces a new approach which is a trade-off between the accuracy of query results and the time consumed in building summaries.
Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405)
The problem of statistics and aggregate maintenance over data streams has gained popularity in recent years, especially in telecommunications network monitoring, trend-related analysis, web-click streams, stock tickers, and other time-variant data. The amount of data generated in such applications can become too large to store, or if stored, too large to scan multiple times. We consider queries over data streams that are biased towards the more recent values. We develop a technique that summarizes a dynamic stream incrementally at multiple resolutions. This approximation can be used to answer point queries, range queries, and inner product queries. Moreover, the precision of answers can be changed adaptively by a client. Later, we extend the above technique to work in a distributed setting, specifically in a large network where a central site summarizes the stream and clients ask queries. We minimize the message overhead by deciding what and where to replicate by using an adaptive replication scheme. We maintain a hierarchy of approximations that change adaptively based on the query and update rates. We show experimentally that our technique performs better than existing techniques: substantially better in terms of approximation quality, up to four orders of magnitude better in response time, and up to five times better in terms of message complexity.
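A multi-resolution summary in this spirit can be pictured as a dyadic hierarchy: one counter per dyadic interval at each of log N levels, so that any range decomposes into O(log N) blocks. Below is a minimal sketch of that decomposition in Python, with exact counters for clarity; the paper's structure stores approximations and replicates levels adaptively, which this sketch does not attempt.

```python
# Minimal dyadic-interval summary over a domain of size N = 2**LEVELS.
# A range [lo, hi) decomposes into O(log N) dyadic blocks, so point and
# range queries touch only logarithmically many counters.
# (Illustrative sketch with exact counters, not the paper's structure.)

LEVELS = 16                                    # domain size N = 2**LEVELS
counts = [dict() for _ in range(LEVELS + 1)]   # counts[l][i]: weight of block i at level l

def update(x, w=1):
    """Add weight w at point x: one counter is touched per level."""
    for level in range(LEVELS + 1):
        block = x >> level
        counts[level][block] = counts[level].get(block, 0) + w

def range_sum(lo, hi):
    """Sum of weights in [lo, hi), covered greedily by dyadic blocks."""
    total, level = 0, 0
    while lo < hi:
        # Grow to the largest aligned dyadic block that fits in [lo, hi)...
        while level < LEVELS and lo % (1 << (level + 1)) == 0 and lo + (1 << (level + 1)) <= hi:
            level += 1
        # ...then shrink if the current block is misaligned or overshoots.
        while lo % (1 << level) != 0 or lo + (1 << level) > hi:
            level -= 1
        total += counts[level].get(lo >> level, 0)
        lo += 1 << level
    return total
```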
2011
In processing large quantities of data, a fundamental problem is to obtain a summary which supports approximate query answering. Random sampling yields flexible summaries which naturally support subset-sum queries with unbiased estimators and well-understood confidence bounds.
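A minimal sketch of that recipe, assuming a classic size-k uniform reservoir (Vitter's Algorithm R) and a Horvitz-Thompson-style scale-up in which each sampled item stands in for n/k stream items:

```python
import random

def reservoir_sample(stream, k):
    """Uniform sample of k items from a stream of unknown length (Algorithm R)."""
    sample, n = [], 0
    for x in stream:
        n += 1
        if len(sample) < k:
            sample.append(x)
        else:
            j = random.randrange(n)   # item n is kept with probability k/n
            if j < k:
                sample[j] = x
    return sample, n

def subset_sum_estimate(sample, n, predicate, value=lambda x: 1):
    """Unbiased estimate of sum(value(x) for x in stream if predicate(x)).
    Each sampled item represents n/len(sample) stream items."""
    k = len(sample)
    return (n / k) * sum(value(x) for x in sample if predicate(x))

# e.g. estimate how many stream values exceed 100:
# sample, n = reservoir_sample(data, 1000)
# est = subset_sum_estimate(sample, n, lambda x: x > 100)
```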
2002
Recent years have witnessed an increasing interest in designing algorithms for querying and analyzing streaming data (i.e., data that is seen only once in a fixed order) with only limited memory. Providing (perhaps approximate) answers to queries over such continuous data streams is a crucial requirement for many application environments; examples include large telecom and IP network installations where performance data from different parts of the network needs to be continuously collected and analyzed.
2005
In many data streaming applications, streams may contain data tuples that are either redundant, repetitive, or not "interesting" to any of the standing continuous queries. Processing such tuples may waste system resources without producing useful answers. To the contrary, some other tuples can be categorized as promising. This paper proposes that stream query engines can have the option to execute on promising tuples only and not on all tuples. We propose to maintain intermediate stream summaries and indices that can direct the stream query engine to detect and operate on promising tuples. As an illustration, the proposed intermediate stream summaries are tuned towards capturing promising tuples that (1) maximize the number of output tuples, (2) contribute to producing a faithful representative sample of the output tuples (compared to the output produced when assuming infinite resources), or (3) produce the outlier or deviant results. Experiments are conducted in the context of Nile [24], a prototype stream query processing engine developed at Purdue University.
2015
Estimating the frequency of any piece of information in large-scale distributed data streams became of utmost importance in the last decade (e.g., in the context of network monitoring, big data, etc.). While some elegant solutions have been proposed recently, their approximation is computed from the inception of the stream. In a runtime distributed context, one would prefer to gather information only about the recent past. This may be led by the need to save resources or by the fact that recent information is more relevant. In this paper, we consider the sliding window model and propose two different (on-line) algorithms that approximate the item frequencies in the active window. More precisely, we determine an (ε, δ)-additive-approximation, meaning that the error is greater than ε only with probability δ. These solutions use a very small amount of memory with respect to the size N of the window and the number n of distinct items of the stream, namely O((1/ε) log(1/δ) (log N + log n)) and O((1/(τε)) log(1/δ) (log N + log n)) bits of space, where τ is a parameter limiting memory usage. We also provide their distributed variant, i.e., considering the sliding window functional monitoring model. We compared the proposed algorithms to each other and also to the state of the art through extensive experiments on synthetic traces and real data sets that validate the robustness and accuracy of our algorithms.
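The sliding-window flavor can be conveyed with a common bucketed construction (not the authors' algorithm): split the window into b sub-windows, summarize each with a small Count-Min sketch, and drop the oldest sketch as the window slides. All parameter names below are illustrative.

```python
import random
from collections import deque

class WindowedCountMin:
    """Approximate per-item counts over the last `window` updates.
    The window is split into `nbuckets` sub-windows, each summarized by a
    small Count-Min sketch; the oldest sketch is expired as time advances.
    (Illustrative sketch only; the paper's error/memory bounds differ.)"""

    def __init__(self, window, nbuckets=8, width=272, depth=4, seed=42):
        self.sub = window // nbuckets          # updates per sub-window
        self.width, self.depth = width, depth
        rng = random.Random(seed)
        self.salts = [rng.randrange(1 << 30) for _ in range(depth)]
        self.buckets = deque([self._empty()])
        self.max_buckets = nbuckets
        self.count = 0

    def _empty(self):
        return [[0] * self.width for _ in range(self.depth)]

    def _cells(self, item):
        for row, salt in enumerate(self.salts):
            yield row, hash((salt, item)) % self.width

    def add(self, item):
        self.count += 1
        if self.count % self.sub == 0:         # roll to a new sub-window
            self.buckets.append(self._empty())
            if len(self.buckets) > self.max_buckets:
                self.buckets.popleft()         # expire the oldest sub-window
        sketch = self.buckets[-1]
        for row, col in self._cells(item):
            sketch[row][col] += 1

    def estimate(self, item):
        # Sum the per-row minima over all live sub-windows (an overestimate).
        return sum(min(s[row][col] for row, col in self._cells(item))
                   for s in self.buckets)
```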
Journal of Computer and System Sciences, 2014
Statistical summaries of IP traffic are at the heart of network operation and are used to recover information on arbitrary subpopulations of flows. It is therefore of great importance to collect the most accurate and informative summaries given the router's resource constraints. IP packet streams consist of multiple interleaving IP flows. While queries are posed over the set of flows, the summarization algorithm is applied to the stream of packets. Aggregation of traffic into flows before summarization is often infeasible and therefore the summary has to be produced over the unaggregated stream. Cisco's sampled NetFlow, based on aggregating a sampled packet stream into flows, is the most widely deployed such system.
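The estimation step behind packet-sampled flow summaries is simple to state: keep each packet independently with probability p, aggregate the kept packets into flows, and scale the per-flow totals by 1/p to obtain unbiased estimates. A minimal sketch, with a hypothetical `packets` iterable of (flow_key, bytes) pairs:

```python
import random
from collections import defaultdict

def sampled_netflow(packets, p=0.01):
    """Packet-sampled flow summary: each packet (flow_key, nbytes) is kept
    with probability p; per-flow packet and byte totals are estimated by
    scaling the sampled totals by 1/p. The estimates are unbiased, with
    higher relative variance for small flows."""
    flows = defaultdict(lambda: [0, 0])        # flow_key -> [packets, bytes]
    for key, nbytes in packets:
        if random.random() < p:
            flows[key][0] += 1
            flows[key][1] += nbytes
    return {k: (pkts / p, byts / p) for k, (pkts, byts) in flows.items()}
```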
2009 International Conference on Advanced Information Networking and Applications Workshops, 2009
Nowadays, servers register more and more log entries. Monitoring, analyzing and extracting knowledge from networks and web servers is crucial for many applications. Indeed, logs can be useful for describing activity along several dimensions. But logs arrive at an intensive rate and are observable at a low level of granularity, which makes it unrealistic to store the whole log history and leads us to consider logs as a data stream. Moreover, as logs are composed of several fields which can be considered as multiple levels of granularity, it would be interesting to provide on-line analytical processing on such a data stream. So, a natural question is: "is it possible to perform a multi-level and multidimensional analysis by building a cube supplied by a data stream?". A choice has to be made in order to select the most useful information. We tackle this problem by exploiting users' preferences. Generally, users consult the recent history at fine levels of granularity; this need for precision then decreases as the age of the data increases. To this end, we introduce precision functions. Their combination leads to a compact data cube framework which can answer most queries. Experiments conducted on both synthetic and real data sets show that our approach can be applied in a data stream context.
We present techniques for computing small space representations of massive data streams. These are inspired by traditional wavelet-based approximations that consist of specific linear projections of the underlying data. We present general "sketch" based methods for capturing various linear projections of the data and use them to provide pointwise and range-sum estimation of data streams. These methods use small amounts of space and per-item time while streaming through the data, and provide accurate representation as our experiments with real data streams show.
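One concrete sketch of this kind is the Count-Sketch, a linear projection of the frequency vector that supports pointwise estimates; because it is linear, the sketches of two streams add to the sketch of their union. A minimal version follows (illustrative parameters, with Python's built-in hash standing in for proper pairwise-independent hash families):

```python
import random
import statistics

class CountSketch:
    """Linear-projection summary: a depth x width counter array. Each
    update adds +/-weight to one counter per row; the median of the
    signed row readings is the pointwise frequency estimate."""

    def __init__(self, depth=5, width=1024, seed=7):
        rng = random.Random(seed)
        self.salts = [(rng.randrange(1 << 30), rng.randrange(1 << 30))
                      for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]
        self.width = width

    def _pos_sign(self, item):
        for row, (s1, s2) in enumerate(self.salts):
            col = hash((s1, item)) % self.width
            sign = 1 if hash((s2, item)) & 1 else -1
            yield row, col, sign

    def update(self, item, weight=1):
        for row, col, sign in self._pos_sign(item):
            self.table[row][col] += sign * weight

    def estimate(self, item):
        return statistics.median(sign * self.table[row][col]
                                 for row, col, sign in self._pos_sign(item))
```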
Query estimation plays an important role in query optimization by helping choose a particular query plan. Query estimation becomes quite challenging for fast, continuous, online data streams. Different summarization methods such as sampling, histograms, wavelets, sketches, and discrete cosine series are used to store the data distribution for query estimation. In this paper, a brief survey of query estimation techniques for data streams is presented.
2009
Computer systems generate a large amount of data that, in terms of space and time, is very expensive - even impossible - to store. Besides this, many applications need to keep an historical view of such data in order to provide historical aggregated information, perform data mining tasks or detect anomalous behavior in computer systems. One solution is to treat the data as streams being processed on the fly in order to build historical summaries. Many data summarizing techniques have already been developed such as sampling, clustering, histograms, etc. Some of them have been extended to be applied directly to data streams. This chapter presents a new approach to build such historical summaries of data streams. It is based on a combination of two existing algorithms: StreamSamp and CluStream. The combination takes advantages of the benefits of each algorithm and avoids their drawbacks. Some experiments are presented both on real and synthetic data. These experiments show that the new approach gives better results than using any one of the two mentioned algorithms.
Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004.
We present a novel approach to approximate evaluation of standing aggregate queries over streaming data, subject to user-specified error bounds. Our method models the behavior of aggregates as Brownian motions, and adaptively updates the model according to stream characteristics. This approach has two advantages. First, it greatly improves system scalability since we can defer query evaluation as long as the difference between the returned and true aggregate values remains within user-specified bounds. Second, we are able to provide approximate answers during stream interruptions by estimating the rate at which the streams and the aggregate drift during the blackout periods. We also study processor allocation issues in such approximate aggregate evaluation systems. Our experiments show that our model captures the behavior of real-world streams such as sensor data and stock traces with excellent fidelity, and scales very well for large numbers of standing queries.
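The deferral logic can be illustrated with a toy calculation: if the aggregate drifts like a Brownian motion with estimated per-second standard deviation sigma, its expected excursion after t seconds scales as sigma * sqrt(t), so re-evaluation can wait until that bound meets the user's tolerance. The function below is a minimal sketch of this reasoning, not the paper's adaptive model:

```python
def next_evaluation_delay(sigma, error_bound, confidence_z=2.0):
    """Seconds we can defer re-evaluating an aggregate that drifts like a
    Brownian motion with per-second std dev `sigma`: the drift stays within
    `error_bound` (at roughly confidence_z standard deviations) while
    confidence_z * sigma * sqrt(t) <= error_bound."""
    if sigma == 0:
        return float("inf")
    return (error_bound / (confidence_z * sigma)) ** 2

# e.g. sigma = 0.5 units per sqrt(second), error bound 5 units at ~2 sigma:
# next_evaluation_delay(0.5, 5.0) == 25.0 seconds before the next refresh.
```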
Proceedings of the 2005 ACM SIGMOD international conference on Management of data - SIGMOD '05, 2005
A windowed query operator breaks a data stream into possibly overlapping subsets of data and computes results over each. Many stream systems can evaluate window aggregate queries. However, current stream systems suffer from a lack of an explicit definition of window semantics. As a result, their implementations unnecessarily confuse window definition with physical stream properties. This confusion complicates the stream system, and even worse, can hurt performance both in terms of memory usage and execution time. To address this problem, we propose a framework for defining window semantics, which can be used to express almost all types of windows of which we are aware, and which is easily extensible to other types of windows that may occur in the future. Based on this definition, we explore a one-pass query evaluation strategy, the Window-ID (WID) approach, for various types of window aggregate queries. WID significantly reduces both required memory space and execution time for a large class of window definitions. In addition, WID can leverage punctuations to gracefully handle disorder. Our experimental study shows that WID has better execution-time performance than existing window aggregate query evaluation options that retain and reprocess tuples, and has better latency-accuracy tradeoff performance for disordered input streams compared to using a fixed delay for disorder handling.
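The window-ID idea can be shown in a few lines: for a sliding window with RANGE r and SLIDE s, a tuple with timestamp ts belongs to a contiguous run of window IDs, so one pass over the stream suffices. The sketch below (a count aggregate over hypothetical integer timestamps) expands the run per tuple for clarity; a WID implementation would update per-window partial aggregates instead:

```python
from collections import defaultdict

def window_counts(tuples, rng, slide):
    """One-pass count aggregation for sliding windows: window w covers
    timestamps [w*slide, w*slide + rng). Each tuple is tagged with the IDs
    of every window containing it, so no tuples are buffered or reprocessed."""
    counts = defaultdict(int)
    for ts, _value in tuples:
        first = max(0, (ts - rng) // slide + 1)   # first window containing ts
        last = ts // slide                        # last window containing ts
        for wid in range(first, last + 1):
            counts[wid] += 1
    return dict(counts)

# e.g. RANGE 4, SLIDE 2: the tuple at ts=5 lands in windows 1 ([2,6)) and 2 ([4,8)).
```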
Proceedings of the 22nd International Conference on Data Engineering (ICDE '06), 2006
Aggregate monitoring over data streams is attracting more and more attention in the research community due to its broad potential applications. Existing methods suffer from two problems: 1) the aggregate functions which can be monitored are restricted to be first-order statistics or monotonic with respect to the window size; 2) only a limited number of granularities and time scales can be monitored over a stream, so some interesting patterns might be neglected and users might be misled by an incomplete changing profile of the current data streams. These two problems impede the development of online mining techniques over data streams, and some kind of breakthrough is urged. In this paper, we employ the powerful tool of fractal analysis to enable the monitoring of both monotonic and non-monotonic aggregates on time-changing data streams. The monotonicity property of aggregate monitoring is revealed and a monotonic search space is built to decrease the time overhead for accessing the synopsis from O(m) to O(log m), where m is the number of windows to be monitored. With the help of a novel inverted histogram, the statistical summary is compressed to fit in limited main memory, so that high aggregates on windows of any length can be detected accurately and efficiently on-line. Theoretical analysis shows that the space and time complexity bounds of this method are relatively low, while experimental results prove the applicability and efficiency of the proposed algorithm in different application settings.
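The monotonicity being exploited can be demonstrated for sums of nonnegative values: the sum over a window ending now is nondecreasing in the window length, so the smallest of m monitored window sizes breaching a threshold is found by binary search in O(log m) aggregate evaluations rather than O(m). A minimal sketch (illustrative only; the paper's inverted histogram is not reproduced here):

```python
def smallest_breaching_window(values, window_sizes, threshold):
    """values: nonnegative stream values seen so far; window_sizes: sorted
    list of m monitored lengths, each <= len(values). Suffix sums are
    monotone in the length, so binary search finds the smallest breaching
    size in O(log m) sum evaluations."""
    prefix = [0]
    for v in values:
        prefix.append(prefix[-1] + v)

    def suffix_sum(w):                   # sum of the last w values
        return prefix[-1] - prefix[len(values) - w]

    lo, hi = 0, len(window_sizes)        # answer lies in window_sizes[lo:hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if suffix_sum(window_sizes[mid]) >= threshold:
            hi = mid
        else:
            lo = mid + 1
    return window_sizes[lo] if lo < len(window_sizes) else None
```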
The VLDB Journal, 2004
There is growing interest in algorithms for processing and querying continuous data streams (i.e., data that is seen only once in a fixed order) with limited memory resources. In its most general form, a data stream is actually an update stream, i.e., comprising data-item deletions as well as insertions. Such massive update streams arise naturally in several application domains (e.g., monitoring of large IP network installations, or processing of retail-chain transactions). Estimating the cardinality of set expressions defined over several (perhaps, distributed) update streams is perhaps one of the most fundamental query classes of interest; as an example, such a query may ask "what is the number of distinct IP source addresses seen in passing packets from both router R 1 and R 2 but not router R 3 ?". Earlier work has only addressed very restricted forms of this problem, focusing solely on the special case of insert-only streams and specific operators (e.g., union). In this paper, we propose the first space-efficient algorithmic solution for estimating the cardinality of full-fledged set expressions over general update streams. Our estimation algorithms are probabilistic in nature and rely on a novel, hash-based synopsis data structure, termed "2-level hash sketch". We demonstrate how our 2-level hash sketch synopses can be used to provide low-error, high-confidence estimates for the cardinality of set expressions (including operators such as set union, intersection, and difference) over continuous update streams, using only space that is significantly sublinear in the sizes of the streaming input (multi-)sets. Furthermore, our estimators never require rescanning or resampling of past stream items, regardless of the number of deletions in the stream. We also present lower bounds for the problem, demonstrating that the space usage of our estimation algorithms is within small factors of the optimal. Finally, we propose an optimized, time-efficient stream synopsis (based on 2-level hash sketches) that provides similar, strong accuracy-space guarantees while requiring only logarithmic maintenance time per update, thus making our methods applicable for truly rapid-rate data streams.
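The flavor of hash-based cardinality synopses can be conveyed with the classic Flajolet-Martin sketch, shown below for insert-only streams; the paper's 2-level hash sketches extend this style of synopsis to deletions and full set expressions, which this minimal sketch does not:

```python
import random

class FMSketch:
    """Flajolet-Martin distinct-count estimate for an insert-only stream:
    per hash function, track the maximum number of trailing zero bits seen;
    2**max_zeros tracks the number of distinct items. (Classic swapped-in
    technique, not the paper's 2-level hash sketch.)"""

    def __init__(self, copies=32, seed=1):
        rng = random.Random(seed)
        self.salts = [rng.randrange(1 << 30) for _ in range(copies)]
        self.max_zeros = [0] * copies

    @staticmethod
    def _trailing_zeros(h):
        return (h & -h).bit_length() - 1 if h else 32

    def add(self, item):
        for i, salt in enumerate(self.salts):
            h = hash((salt, item)) & 0xFFFFFFFF
            self.max_zeros[i] = max(self.max_zeros[i], self._trailing_zeros(h))

    def estimate(self):
        # Average the per-copy exponents, exponentiate, and apply the
        # classic FM bias-correction constant.
        avg = sum(self.max_zeros) / len(self.max_zeros)
        return 2 ** avg / 0.77351
```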
Computing Research Repository, 2010
In this work, we present a comprehensive treatment of weighted random sampling (WRS) over data streams. More precisely, we examine two natural interpretations of the item weights, describe an existing algorithm for each case ([2, 4]), discuss sampling with and without replacement and show adaptations of the algorithms for several WRS problems and evolving data streams.
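One of the referenced algorithms, Efraimidis and Spirakis' A-Res for the probability-proportional-to-weight interpretation, fits in a few lines: assign each item the key u^(1/w) with u uniform in (0, 1) and keep the k largest keys:

```python
import heapq
import random

def weighted_reservoir(stream, k):
    """Weighted random sampling without replacement (A-Res, Efraimidis and
    Spirakis): each (item, weight) gets key u**(1/weight) with u ~ U(0,1);
    the k items with the largest keys form the sample."""
    heap = []                                   # min-heap of (key, item)
    for item, weight in stream:
        key = random.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]
```

A single pass, O(k) memory, and no need to know the total weight in advance, which is what makes the scheme suitable for evolving streams.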
Lecture Notes in Computer Science, 2010
With the rapid development of information technology, many applications have to deal with potentially infinite data streams. In such a dynamic context, storing the whole data stream history is unfeasible, and providing a high-quality summary is required for decision makers. In this paper, we propose a summarization method for multidimensional data streams based on a graph structure and taking advantage of the data hierarchies. The summarization method we propose takes the data distribution into account and thus overcomes a major drawback of the common Tilted Time Window framework. Finally, we adapt this structure for synthesizing frequent itemsets extracted on temporal windows. Thanks to our approach, users no longer have to analyze numerous extraction results, and result processing is improved. Experiments conducted on both synthetic and real datasets show that our approach can be applied to data streams.
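For context, the standard logarithmic Tilted Time Window keeps aggregates at exponentially coarser granularities as data ages, merging counts upward over time; the sketch below shows that baseline framework (which the paper's graph structure replaces), not the authors' method:

```python
class TiltedTimeWindow:
    """Logarithmic tilted time windows: level i holds counts for spans of
    2**i time units; when a level accumulates more than two spans, the two
    oldest merge into one span at the next level, so old data survives
    only at coarse granularity."""

    def __init__(self, levels=8):
        self.levels = [[] for _ in range(levels)]   # each: list of span counts

    def add(self, count):
        self.levels[0].append(count)
        for i in range(len(self.levels) - 1):
            if len(self.levels[i]) > 2:             # keep at most 2 spans per level
                merged = self.levels[i].pop(0) + self.levels[i].pop(0)
                self.levels[i + 1].append(merged)

    def total(self):
        return sum(sum(level) for level in self.levels)
```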
Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems - PODS '07, 2007
IP packet streams consist of multiple interleaving IP flows. Statistical summaries of these streams, collected for different measurement periods, are used for characterization of traffic, billing, anomaly detection, inferring traffic demands, configuring packet filters and routing protocols, and more. While queries are posed over the set of flows, the summarization algorithm is applied to the stream of packets.
Theoretical Computer Science, 1984
The inference control technique called auditing is discussed in this paper. Auditing is in many ways better than the previously known techniques. Auditing should log all answered queries, and use this information to decide whether answering a new query could lead to compromise. Unfortunately, except for small databases, auditing may not be readily usable in practice because of its excessive time and space complexity in processing a new query. In this paper we restrict our study to SUM queries: if there are n records in the database, the problem of determining whether or not answering a new SUM query could lead to compromise may take O(n²) time and space. Furthermore, it is unrealistic to assume that the user can obtain statistical information on any subset of the records in the database: we assume that statistical information is only available for those subsets of records in which one of their attribute values lies within a certain range (range queries). With the proper data structure, the time and space complexity for checking whether a new range query can be answered is reduced to O(n) time and space, or O(t log n) time with O(n) space for t new range queries with t ≥ n.
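The O(n)-per-query check alluded to can be realized with the standard prefix-sum reduction: a range SUM over records i..j reveals P[j] - P[i-1], so answered queries form edges between prefix nodes, and record k is compromised exactly when nodes k-1 and k become connected. The union-find sketch below is an illustrative reconstruction of that textbook technique, not necessarily the paper's exact procedure:

```python
class RangeSumAuditor:
    """Audit range-SUM queries over records 1..n. A query over [i, j]
    reveals P[j] - P[i-1], i.e. an edge between prefix nodes i-1 and j.
    Record k is compromised iff prefix nodes k-1 and k are connected."""

    def __init__(self, n):
        self.parent = list(range(n + 1))     # prefix nodes 0..n

    def _find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def safe_to_answer(self, i, j):
        """Answer and record SUM over records i..j if it compromises no
        single record; otherwise refuse. O(n) time per query."""
        a, b = self._find(i - 1), self._find(j)
        if a == b:
            return True   # already implied by past answers; reveals nothing new
        # Merging components a and b compromises record k exactly when the
        # prefix nodes k-1 and k sit one in each of the two components.
        comp = [self._find(x) for x in range(len(self.parent))]
        for k in range(1, len(self.parent)):
            if {comp[k - 1], comp[k]} == {a, b}:
                return False                  # answering would expose record k
        self.parent[a] = b
        return True
```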
2002
This article deals with continuous conjunctive queries with arithmetic comparisons and optional aggregation over multiple data streams. An algorithm is presented for determining whether or not any given query can be evaluated using a bounded amount of memory for all possible instances of the data streams. For queries that can be evaluated using bounded memory, an execution strategy based on constant-sized synopses of the data streams is proposed. For queries that cannot be evaluated using bounded memory, data stream scenarios are identified in which evaluating the queries requires memory linear in the size of the unbounded streams.
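To illustrate the bounded-memory side of this dichotomy: some comparison queries need only a constant-size synopsis. The example below is hypothetical (not drawn from the article): the continuous query "is there an a in S1 and a b in S2 with a > b?" is answered exactly by tracking one maximum and one minimum:

```python
class ExistsGreaterQuery:
    """Continuous query: does there exist a in S1 and b in S2 with a > b?
    A constant-size synopsis (max of S1.A, min of S2.B) answers it exactly,
    no matter how long the streams grow. (Hypothetical query, illustrating
    the bounded-memory case.)"""

    def __init__(self):
        self.max_a = None
        self.min_b = None

    def on_s1(self, a):
        self.max_a = a if self.max_a is None else max(self.max_a, a)

    def on_s2(self, b):
        self.min_b = b if self.min_b is None else min(self.min_b, b)

    def answer(self):
        if self.max_a is None or self.min_b is None:
            return False
        return self.max_a > self.min_b
```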