2003, IEEE Data(base) Engineering Bulletin
We propose to demonstrate a Data Stream Management System (DSMS) called STREAM, for STanford stREam datA Manager. The challenges in building a DSMS instead of a traditional DBMS arise from two fundamental differences:
- In addition to managing traditional stored data such as relations, a DSMS must handle multiple continuous, unbounded, possibly rapid and time-varying data streams.
- Due to the continuous nature of the data, a DSMS typically supports long-running continuous queries, which are expected to produce answers in a continuous and timely fashion.
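A minimal way to picture the second difference: a continuous query stays registered and produces an answer every time a tuple arrives, rather than running once over stored data. The sketch below is illustrative only (the function name and window style are not STREAM's API):

```python
from collections import deque

def continuous_max(stream, window_size):
    """Continuously report the max over a sliding window of the last
    `window_size` tuples -- one answer per arriving tuple, forever."""
    window = deque(maxlen=window_size)
    for item in stream:
        window.append(item)
        yield max(window)  # answers are produced continuously, not once

# A finite list stands in for an unbounded, time-varying stream.
answers = list(continuous_max(iter([3, 1, 4, 1, 5, 9, 2]), window_size=3))
```

The bounded deque also illustrates why windows matter in a DSMS: they keep state finite even though the input is unbounded.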
a book on data stream …, 2004
VLDB '02: Proceedings of the 28th International Conference on Very Large Databases, 2002
This paper introduces monitoring applications, which we will show differ substantially from conventional business data processing. The fact that a software system must process and react to continual inputs from many sources (e.g., sensors) rather than from human operators requires one to rethink the fundamental architecture of a DBMS for this application area. In this paper, we present Aurora, a new DBMS that is currently under construction at Brandeis University, Brown University, and M.I.T. We describe the basic system architecture, a stream-oriented set of operators, optimization tactics, and support for real-time operation.
2006
By providing integrated and optimized support for user-defined aggregates (UDAs), data stream management systems (DSMS) can achieve superior power and generality while preserving compatibility with current SQL standards. This is demonstrated by the Stream Mill system that, through its Expressive Stream Language (ESL), efficiently supports a wide range of applications, including very advanced ones such as data stream mining, streaming XML processing, time-series queries, and RFID event processing. ESL supports physical and logical windows (with optional slides and tumbles) on both built-in aggregates and UDAs, using a simple framework that applies uniformly to aggregate functions written in external procedural languages and those natively written in ESL. The constructs introduced in ESL extend the power and generality of DSMS, and are conducive to UDA-specific optimization and efficient execution as demonstrated by several experiments.
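The uniform treatment of built-in aggregates and UDAs over windows can be sketched as follows. The initialize/iterate/terminate structure is the classical UDA pattern; the class and function names here are illustrative, not ESL syntax:

```python
import math

class GeometricMean:
    """A user-defined aggregate in the initialize/iterate/terminate style
    that window constructs can apply just like a built-in aggregate."""
    def initialize(self, v):
        self.log_sum, self.count = math.log(v), 1
    def iterate(self, v):
        self.log_sum += math.log(v)
        self.count += 1
    def terminate(self):
        return math.exp(self.log_sum / self.count)

def windowed(values, size, slide, uda_cls):
    """Apply a UDA over physical windows of `size` rows advancing by
    `slide` rows (slide == size gives a tumbling window)."""
    out = []
    for start in range(0, len(values) - size + 1, slide):
        uda = uda_cls()
        win = values[start:start + size]
        uda.initialize(win[0])
        for v in win[1:]:
            uda.iterate(v)
        out.append(uda.terminate())
    return out
```

Because the window machinery only calls the three methods, it applies identically whether the aggregate is native or externally defined.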
2002
In this overview paper we motivate the need for and research issues arising from a new model of data processing. In this model, data does not take the form of persistent relations, but rather arrives in multiple, continuous, rapid, time-varying data streams. In addition to reviewing past work relevant to data stream systems and current projects in the area, the paper explores topics in stream query languages, new requirements and challenges in query processing, and algorithmic issues. Related prior work spans view management, sequence databases, and other areas. Although much of this work clearly has applications to data stream processing, we hope to show in this paper that there are many new problems to address in realizing a complete DSMS.
2003
This paper describes our ongoing work developing the Stanford Stream Data Manager (STREAM), a system for executing continuous queries over multiple continuous data streams. The STREAM system supports a declarative query language, and it copes with high data rates and query workloads by providing approximate answers when resources are limited. This paper describes specific contributions made so far and enumerates our next steps in developing a general-purpose Data Stream Management System:
- An extension of SQL suitable for a general-purpose DSMS with a precisely-defined semantics (Section 2)
- Structure of query plans, accounting for plan sharing and approximation techniques (Section 3)
- An algorithm for exploiting constraints on data streams to reduce memory overhead during query processing (Section 4.1)
- A near-optimal scheduling algorithm for reducing inter-operator queue sizes (Section 4.2)
- A set of techniques for static and dynamic approximation to cope with limited resources (Section 5)
- An algorithm for allocating resources to queries (in a limited environment) that maximizes query result precision (Section 5.3)
- A software architecture designed for extensibility and for easy experimentation with DSMS query processing techniques (Section 6)
Some current limitations are:
- Our DSMS is centralized and based on the relational model. We believe that distributed query processing will be essential for many data stream applications, and we are designing our query processor with a mi-
2008
Many modern applications need to process queries over potentially infinite data streams to provide answers in real-time. This dissertation proposes novel techniques to optimize CPU and memory utilization in stream processing by exploiting metadata on streaming data or queries. It focuses on four topics: 1) exploiting stream metadata to optimize SPJ query operators via operator configuration, 2) exploiting stream metadata to optimize SPJ query plans via query-rewriting, 3) exploiting workload metadata to optimize parameterized queries via indexing, and 4) exploiting event constraints to optimize event stream processing via run-time early termination. The first part of this dissertation proposes algorithms for one of the most common and expensive query operators, namely join, to identify and purge, at runtime, no-longer-needed data from the state based on punctuations. Exploitation of the combination of punctuations and commonly-used window constraints is also studied. Extensive experimental evaluations demonstrate both reductions in memory usage and improvements in execution time due to the proposed strategies. The second part proposes herald-driven runtime query plan optimiza-
Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, 2004
Continuous queries in a Data Stream Management System (DSMS) rely on time as a basis for windows on streams and for defining a consistent semantics for multiple streams and updatable relations. The system clock in a centralized DSMS provides a convenient and well-behaved notion of time, but often it is more appropriate for a DSMS application to define its own notion of time-its own clock(s), sequence numbers, or other forms of ordering and timestamping. Flexible application-defined time poses challenges to the DSMS, since streams may be out of order and uncoordinated with each other, they may incur latency reaching the DSMS, and they may pause or stop. We formalize these challenges and specify how to generate heartbeats so that queries can be evaluated correctly and continuously in an application-defined time domain. Our heartbeat generation algorithm is based on parameters capturing skew between streams, unordering within streams, and latency in streams reaching the DSMS. We also describe how to estimate these parameters at run-time, and we discuss how heartbeats can be used for processing continuous queries.
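The core idea can be sketched as follows. This is a simplified illustration, not the paper's algorithm (which also estimates the parameters at run-time): each stream yields a bound below which no future timestamp can fall, and the heartbeat is the minimum of those bounds, so queries may safely process everything at or before it. All names are hypothetical:

```python
def heartbeat(max_seen, unordering, skew, latency, clock):
    """Simplified heartbeat generation for application-defined time.
    For each stream s: tuples may be out of order by up to unordering[s],
    so max_seen[s] - unordering[s] is a safe bound; a paused stream still
    advances via the clock, discounted by its skew and delivery latency.
    The heartbeat is the minimum per-stream bound."""
    bounds = []
    for s in max_seen:
        from_data = max_seen[s] - unordering[s]
        from_clock = clock - skew[s] - latency[s]
        bounds.append(max(from_data, from_clock))
    return min(bounds)
```

With this value in hand, window operators can emit results for all timestamps up to the heartbeat without risk of a late tuple invalidating them.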
Proceedings of the 2006 ACM SIGMOD international conference on Management of data - SIGMOD '06, 2006
Recent data stream systems such as TelegraphCQ have employed the well-known property of duality between data and queries. In these systems, query processing methods are classified into two dual categories, data-initiative and query-initiative, depending on whether query processing is initiated by selecting a data element or a query. Although the duality property has been widely recognized, previous data stream systems do not fully take advantage of this property since they use the two dual methods independently: data-initiative methods only for continuous queries and query-initiative methods only for ad-hoc queries. We contend that continuous query processing can be better optimized by adopting an approach that integrates the two dual methods. Our primary contribution is based on the observation that spatial join is a powerful tool for achieving this objective. In this paper, we first present a new viewpoint of transforming the continuous query processing problem into a multi-dimensional spatial join problem. We then present a continuous query processing algorithm based on spatial join, which we name Spatial Join CQ. This algorithm processes continuous queries by finding the pairs of overlapping regions from a set of data elements and a set of queries, both defined as regions in the multi-dimensional space. The algorithm achieves the advantages of the two dual methods simultaneously. Experimental results show that the proposed algorithm outperforms earlier algorithms by up to 36 times for simple selection continuous queries and by up to 7 times for sliding window join queries.
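The transformation can be illustrated in miniature: treat each data element and each registered query as a region in the same space, and matching becomes a search for overlapping pairs. The nested loop below is only a toy stand-in for the paper's actual spatial join algorithm; all names are illustrative:

```python
def overlaps(a, b):
    """Axis-aligned rectangle overlap test; rect = (x1, y1, x2, y2)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def spatial_join_cq(data_regions, query_regions):
    """Toy version of the idea behind Spatial Join CQ: both data elements
    and continuous queries are regions in one multi-dimensional space,
    and query matching = finding all overlapping (data, query) pairs.
    (The paper uses a proper spatial join, not this O(n*m) loop.)"""
    return [(d, q)
            for d in data_regions for q in query_regions
            if overlaps(data_regions[d], query_regions[q])]
```

A selection predicate such as `10 <= price <= 20` is simply a degenerate region on the price axis, which is why both dual methods fall out of the one join.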
IEEE Transactions on Knowledge and Data Engineering, 1990
Data streams are long, relatively unstructured sequences of characters that contain information such as electronic mail or a tape backup of various documents and reports created in an office. This paper deals with a conceptual framework, using relational algebra and relational databases, within which data streams may be queried. As information is extracted from the data stream, it is put into a relational database that may be queried in the usual manner. The database schema evolves as the user's knowledge of the content of the data stream changes. Operators are defined in terms of relational algebra that can be used to extract data from a specially defined relation that contains all or part of the data stream. This approach to querying data streams permits the integration of unstructured data with structured data. The operators defined extend the functionality of relational algebra, in much the same way that the join does relative to the basic operators: select, project, union, difference, and Cartesian product.
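A toy version of the extraction step makes the framework concrete: pull tuples out of an unstructured character stream into a relation, which can then be queried with ordinary relational operators. The pattern and attribute choice here are illustrative, not the paper's operator syntax:

```python
import re

def extract_relation(char_stream, pattern):
    """Extract structured tuples from an unstructured character stream.
    Each regex group becomes an attribute of the resulting relation,
    so the output can be queried relationally in the usual manner."""
    return [m.groups() for m in re.finditer(pattern, char_stream)]

# The character stream: a fragment of mail-like text.
mail = "From: ann Subject: budget\nFrom: bob Subject: demo\n"
rel = extract_relation(mail, r"From: (\w+) Subject: (\w+)")
```

Once `rel` exists, selections and joins against other tables proceed as with any stored relation, which is the integration of unstructured and structured data the abstract describes.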
The VLDB Journal, 2008
This paper presents the Scalable On-Line Execution algorithm (SOLE, for short) for continuous and on-line evaluation of concurrent continuous spatio-temporal queries over data streams. Incoming spatio-temporal data streams are processed in-memory against a set of outstanding continuous queries. The SOLE algorithm utilizes the scarce memory resource efficiently by keeping track of only the significant objects. In-memory stored objects are expired (i.e., dropped) from memory once they become insignificant. SOLE is a scalable algorithm where all the continuous outstanding queries share the same buffer pool. In addition, SOLE is presented as a spatio-temporal join between two input streams, a stream of spatio-temporal objects and a stream of spatio-temporal queries. To cope with intervals of high arrival rates of objects and/or queries, SOLE utilizes a self-tuning approach based on load-shedding where some of the stored objects are dropped from memory. SOLE is implemented as a pipelined query operator that can be combined with traditional query operators in a query execution plan to support a wide variety of continuous queries. Performance experiments based on a real implementation of SOLE inside a prototype of a data stream management system show the scalability and efficiency of SOLE in highly dynamic environments. This work was supported in part by the National Science Foundation under Grants IIS-0093116, IIS-0209120, and 0010044-CCR.
2003
Traditional databases store sets of relatively static records with no pre-defined notion of time, unless timestamp attributes are explicitly added. While this model adequately represents commercial catalogues or repositories of personal information, many current and emerging applications require support for on-line analysis of rapidly changing data streams.
ACM Transactions on Database Systems, 2010
In relational database management systems, views supplement basic query constructs to cope with the demand for "higher-level" views of data. Moreover, in traditional query optimization, answering a query using a set of existing materialized views can yield a more efficient query execution plan. Due to their effectiveness, views are attractive to data stream management systems.
2010
There are several query languages developed for data stream management systems (DSMS): CQL (Stanford), StreamSQL (StreamBase), WaveScript (MIT), SCSQL (Uppsala University), etc. This thesis is the research phase of a two-phase project where the final goal is to provide CQL support to the Super Computer Stream Query processor (SCSQ), a DSMS developed by the Uppsala DataBase Laboratory. In this paper, the main properties of CQL, the extent to which they are implemented by the Stanford STREAM project, and the expressibility of the Linear Road (LR) benchmark using CQL are investigated. An overview and comparison of SQL, CQL, StreamSQL and WaveScript is also given.
2008
This paper introduces the DataCell, a data stream management system designed as a seamless integration of continuous queries based on bulk event processing into an SQL software stack. The continuous stream queries are based on a predicate window, called a "basket" expression, which supports arbitrarily complex SQL subqueries including, but not limited to, temporal and sequence constraints.
2009
Stream applications have gained significant popularity over the last years, which led to the development of specialized stream engines. These systems are designed from scratch with a different philosophy than today's database engines in order to cope with the requirements of stream applications. However, this means that they lack the power and sophisticated techniques of a full-fledged database system that exploits techniques and algorithms accumulated over many years of database research.
Learning from Data Streams, 2007
The rapid growth in information science and technology in general, and the complexity and volume of data in particular, have introduced new challenges for the research community. Many sources produce data continuously. Examples include sensor networks, wireless networks, radio frequency identification (RFID), customer click streams, telephone records, multimedia data, scientific data, sets of retail chain transactions, etc. These sources are called data streams. A data stream is an ordered sequence of instances that can be read only once or a small number of times using limited computing and storage capabilities. These sources of data are characterized by being open-ended, flowing at high speed, and generated by non-stationary distributions in dynamic environments.
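The read-once, bounded-memory constraint is what forces one-pass algorithms. A classical example of the kind of algorithm this model calls for (not specific to this chapter) is reservoir sampling, which maintains a uniform sample of an arbitrarily long stream in constant space:

```python
import random

def reservoir_sample(stream, k, rng=random.Random(0)):
    """One pass, O(k) memory: keep a uniform random sample of k items
    from a stream whose length is unknown in advance. The i-th item
    replaces a reservoir slot with probability k/i."""
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            sample.append(item)        # fill the reservoir first
        else:
            j = rng.randrange(n)       # uniform in [0, n)
            if j < k:
                sample[j] = item       # replace with probability k/n
    return sample
```

The stream is consumed exactly once and never stored, matching the "read only once with limited storage" definition above.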
The Vldb Journal, 2006
CQL, a Continuous Query Language, is supported by the STREAM prototype Data Stream Management System at Stanford. CQL is an expressive SQL-based declarative language for registering continuous queries against streams and updatable relations. We begin by presenting an abstract semantics that relies only on "black box" mappings among streams and relations. From these mappings we define a precise and general interpretation for continuous queries. CQL is an instantiation of our abstract semantics using SQL to map from relations to relations, window specifications derived from SQL-99 to map from streams to relations, and three new operators to map from relations to streams. Most of the CQL language is operational in the STREAM system. We present the structure of CQL's query execution plans as well as details of the most important components: operators, inter-operator queues, synopses, and sharing of components among multiple operators and queries. Examples throughout the paper are drawn from the Linear Road benchmark recently proposed for Data Stream Management Systems. We also curate a public repository of data stream applications that includes a wide variety of queries expressed in CQL.
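The three relation-to-stream operators in CQL are ISTREAM, DSTREAM, and RSTREAM. A set-semantics Python sketch of their effect at a single time instant (CQL itself works on bags of timestamped tuples, so this is only an approximation of the semantics):

```python
def istream(prev, cur):
    """ISTREAM: emit tuples present in the relation at this instant
    that were absent at the previous instant (insertions)."""
    return sorted(set(cur) - set(prev))

def dstream(prev, cur):
    """DSTREAM: emit tuples deleted from the relation since the
    previous instant (deletions)."""
    return sorted(set(prev) - set(cur))

def rstream(prev, cur):
    """RSTREAM: emit the entire current relation at every instant."""
    return sorted(cur)
```

Together with windows (stream-to-relation) and SQL (relation-to-relation), these close the loop of the abstract semantics: any composition of the three mapping kinds is a well-defined continuous query.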
IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2017
Data Stream Management Systems (DSMSs) are conceived for running continuous queries (CQs) on the most recently streamed data. This model does not completely fit the needs of several modern data-intensive applications that require managing recent, historical, and static data and executing both CQs and one-time queries (OTQs) joining such data. In order to cope with these new needs, some DSMSs have moved towards the integration of DBMS functionalities to augment their capabilities. In this paper we adopt the opposite perspective and lay the groundwork for extending DBMSs to natively support streaming facilities. To this end, we introduce a new kind of table, the streaming table, as a persistent structure where streaming data enters and remains stored for a long period, ideally forever. Streaming tables feature a novel access paradigm: continuous writes and one-time as well as continuous reads. We present a streaming table implementation and two novel types of indices that efficiently support both high update and high scan rates. A detailed experimental evaluation shows the effectiveness of the proposed technology.
2009
Stream applications have gained significant popularity in recent years, which led to the development of specialized data stream engines. They have often been designed from scratch and are tuned towards the specific requirements posed by their initial target applications, e.g., network monitoring and financial services. However, this also means that they lack the power and sophisticated techniques of a full-fledged database system accumulated over many years of database research.
Proceedings of the 2019 International Conference on Management of Data
Real-time data analysis and management are increasingly critical for today's businesses. SQL is the de facto lingua franca for these endeavors, yet support for robust streaming analysis and management with SQL remains limited. Many approaches restrict semantics to a reduced subset of features and/or require a suite of non-standard constructs. Additionally, use of event timestamps to provide native support for analyzing events according to when they actually occurred is not pervasive, and often comes with important limitations. We present a three-part proposal for integrating robust streaming into the SQL standard, namely: (1) time-varying relations as a foundation for classical tables as well as streaming data, (2) event time semantics, and (3) a limited set of optional keyword extensions to control the materialization of time-varying query results. Motivated and illustrated using examples and lessons learned from implementations in Apache Calcite, Apache Flink, and Apache Beam, we show how with these minimal additions it is possible to utilize the complete suite of standard SQL semantics to perform robust stream processing.
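The foundational concept can be sketched directly: a time-varying relation (TVR) is a function from time to relation contents, of which a classical table and a stream are just two renderings. The toy below is append-only and uses illustrative names; the proposal's TVRs are more general (they also support retractions and updates):

```python
def tvr(events):
    """A time-varying relation as a function from time to contents:
    the relation at time t holds every event whose timestamp <= t.
    A point-in-time query reads the TVR at one t; a stream renders
    how the TVR changes as t advances."""
    def at(t):
        return sorted(payload for ts, payload in events if ts <= t)
    return at

# Event-time pairs: (timestamp, row). Querying "the table now" is at(now).
r = tvr([(1, 'a'), (3, 'b')])
```

Because both tables and streams are views of the same TVR, standard SQL semantics apply unchanged; only the choice of how to materialize the result differs.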