The VLDB Journal
The interval join is a popular operation in temporal, spatial, and uncertain databases. The majority of interval join algorithms assume that input data reside on disk, so their focus is to minimize I/O accesses. Recently, an in-memory approach based on plane sweep (PS) for modern hardware was proposed which greatly outperforms previous work. However, this approach relies on a complex data structure and its parallelization has not been adequately studied. In this article, we investigate in-memory interval joins in two directions. First, we explore the applicability of a largely ignored forward scan (FS)-based plane sweep algorithm for single-threaded join evaluation. We propose four optimizations for FS that greatly reduce its cost, making it competitive with, or even faster than, the state-of-the-art. Second, we study in depth the parallel computation of interval joins. We design a non-partitioning-based approach that determines independent tasks of the join algorithm to run in pa...
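As a rough illustration of the forward scan idea named in the abstract above, the following is a minimal sketch of a textbook forward-scan plane sweep over two lists of closed intervals. It does not include the four optimizations or the parallel design the article proposes, and the function name `forward_scan_join` and the tuple representation `(start, end)` are assumptions made for this example only.

```python
def forward_scan_join(R, S):
    """Return all pairs (r, s) of closed intervals (start, end) that overlap.

    Hypothetical helper for illustration: plain forward scan, no optimizations.
    """
    R = sorted(R)  # sort both inputs by start point
    S = sorted(S)
    out = []
    i = j = 0
    while i < len(R) and j < len(S):
        if R[i][0] <= S[j][0]:
            # r starts first: scan forward in S while s starts before r ends
            r = R[i]
            k = j
            while k < len(S) and S[k][0] <= r[1]:
                out.append((r, S[k]))
                k += 1
            i += 1
        else:
            # s starts first: scan forward in R while r starts before s ends
            s = S[j]
            k = i
            while k < len(R) and R[k][0] <= s[1]:
                out.append((R[k], s))
                k += 1
            j += 1
    return out


if __name__ == "__main__":
    R = [(1, 5), (6, 8)]
    S = [(2, 3), (5, 7), (9, 10)]
    for pair in forward_scan_join(R, S):
        print(pair)
```

Each overlapping pair is reported exactly once, at the moment the interval with the smaller start point is processed, so no active-set data structure is needed.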
Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems
The spatial join is a popular operation in spatial database systems and its evaluation is a well-studied problem. As main memories become bigger and faster and commodity hardware supports parallel processing, there is a need to revamp classic join algorithms which have been designed for I/O-bound processing. In view of this, we study the in-memory and parallel evaluation of spatial joins, by redesigning a classic partitioning-based algorithm to consider alternative approaches for space partitioning. Our study shows that, compared to a straightforward implementation of the algorithm, our tuning can improve performance significantly. We also show how to select appropriate partitioning parameters based on data statistics, in order to tune the algorithm for the given join inputs. Our parallel implementation scales gracefully with the number of threads reducing the cost of the join to at most one second even for join inputs with tens of millions of rectangles.
The VLDB Journal, 2021
Joins are essential and potentially expensive operations in database management systems. When data is associated with time periods, joins commonly include predicates that require pairs of argument tuples to overlap in order to qualify for the result. Our goal is to enable built-in systems support for such joins. In particular, we present an approach where overlap joins are formulated as unions of range joins, which are more general purpose joins compared to overlap joins, i.e., are useful in their own right, and are supported well by B+-trees. The approach is sufficiently flexible that it also supports joins with additional equality predicates, as well as open, closed, and half-open time periods over discrete and continuous domains, thus offering both generality and simplicity, which is important in a system setting. We provide both a stand-alone solution that performs on par with the state-of-the-art and a DBMS embedded solution that is able to exploit standard indexing and clearly...
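The reformulation of an overlap join as a union of range joins can be sketched as follows. This is a minimal illustration assuming closed intervals and the standard decomposition (either `s` starts inside `r`, or `r` starts strictly inside `s`); it is not taken from the paper, and sorted Python lists with `bisect` merely stand in for the B+-tree range lookups the abstract mentions. The name `overlap_join` is hypothetical.

```python
from bisect import bisect_left, bisect_right


def overlap_join(R, S):
    """Overlap join of closed intervals (start, end) as two disjoint range joins."""
    R_sorted = sorted(R)
    S_sorted = sorted(S)
    R_starts = [r[0] for r in R_sorted]
    S_starts = [s[0] for s in S_sorted]
    out = []

    # Range join 1: r.start <= s.start <= r.end  (s starts inside r)
    for r in R_sorted:
        lo = bisect_left(S_starts, r[0])
        hi = bisect_right(S_starts, r[1])
        for s in S_sorted[lo:hi]:
            out.append((r, s))

    # Range join 2: s.start < r.start <= s.end  (r starts strictly inside s)
    for s in S_sorted:
        lo = bisect_right(R_starts, s[0])
        hi = bisect_right(R_starts, s[1])
        for r in R_sorted[lo:hi]:
            out.append((r, s))

    return out


if __name__ == "__main__":
    print(overlap_join([(1, 5), (6, 8)], [(2, 3), (5, 7), (9, 10)]))
```

Because the two range predicates are mutually exclusive, the union needs no duplicate elimination. Open or half-open periods would change only which of the `bisect_left`/`bisect_right` bounds are used.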
Proceedings of the 26th International Conference on Scientific and Statistical Database Management - SSDBM '14, 2014
Spatial join is a crucial operation in many spatial analysis applications in scientific and geographical information systems. Due to the compute-intensive nature of spatial predicate evaluation, spatial join queries can be slow even with a moderately sized dataset. Efficient parallelization of spatial join is therefore essential to achieve acceptable performance for many spatial applications. Technological trends, including the rising core count and increasingly large main memory, hold great promise in this regard. Previous parallel spatial join approaches tried to partition the dataset so that the number of spatial objects in each partition was as equal as possible. They also focused only on the filter step. However, when the more compute-intensive refinement step is included, significant processing skew may arise due to the uneven size of the objects. This processing skew significantly limits the achievable parallel performance of the spatial join queries, as the longest-running spatial partition determines the overall query execution time.
Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, 2014
Each tuple in a valid-time relation includes an interval attribute T that represents the tuple's valid time. The overlap join between two valid-time relations determines all pairs of tuples with overlapping intervals. Although overlap joins are common, existing partitioning and indexing schemes are inefficient if the data includes long-lived tuples or if intervals intersect partition boundaries. We propose Overlap Interval Partitioning (OIP), a new partitioning approach for data with an interval. OIP divides the time range of a relation into k base granules and defines overlapping partitions for sequences of contiguous granules. OIP is the first partitioning method for interval data that gives a constant clustering guarantee: the difference in duration between the interval of a tuple and the interval of its partition is independent of the duration of the tuple's interval. We offer a detailed analysis of the average false hit ratio and the average number of partition accesses for queries with overlap predicates, and we prove that the average false hit ratio is independent of the number of short- and long-lived tuples. To compute the overlap join, we propose the Overlap Interval Partition Join (OIPJOIN), which uses OIP to partition the input relations on-the-fly. Only the tuples from overlapping partitions have to be joined to compute the result. We analytically derive the optimal number of granules, k, for partitioning the two input relations, from the size of the data, the cost of CPU operations, and the cost of main memory or disk IOs. Our experiments confirm the analytical results and show that the OIPJOIN outperforms state-of-the-art techniques for the overlap join.
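The granule-based partitioning described above can be illustrated with a small sketch. It assumes equal-width granules, closed intervals given as `(start, end)`, and that each tuple is placed in the partition identified by the granules of its start and end points; these details, and the names `oip_partition` and `oip_join`, are assumptions for illustration and not the paper's implementation or its cost-based choice of k.

```python
from collections import defaultdict


def oip_partition(intervals, t_min, t_max, k):
    """Group intervals by (first granule, last granule); assumes t_max > t_min."""
    width = (t_max - t_min) / k

    def granule(t):
        # Clamp so that t == t_max falls into the last granule
        return min(int((t - t_min) / width), k - 1)

    parts = defaultdict(list)
    for iv in intervals:
        parts[(granule(iv[0]), granule(iv[1]))].append(iv)
    return parts


def oip_join(R, S, t_min, t_max, k):
    """Join only tuples from partitions whose granule ranges overlap."""
    parts_r = oip_partition(R, t_min, t_max, k)
    parts_s = oip_partition(S, t_min, t_max, k)
    out = []
    for (i1, j1), rs in parts_r.items():
        for (i2, j2), ss in parts_s.items():
            if i1 <= j2 and i2 <= j1:  # partitions share at least one granule
                for r in rs:
                    for s in ss:
                        # Discard false hits inside overlapping partitions
                        if r[0] <= s[1] and s[0] <= r[1]:
                            out.append((r, s))
    return out


if __name__ == "__main__":
    R = [(0, 10), (30, 90)]
    S = [(5, 20), (95, 100)]
    print(oip_join(R, S, t_min=0, t_max=100, k=4))
```

A long-lived tuple such as `(30, 90)` lands in a partition spanning several granules instead of being replicated into every granule it crosses, which is the situation the abstract says boundary-based schemes handle poorly.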
IEEE Transactions on Knowledge and Data Engineering, 2002
In the past decade, the exponential growth in commodity CPU speed has far outpaced advances in memory latency. A second trend is that CPU performance advances are not only brought by increased clock rate, but also by increasing parallelism inside the CPU. Current database systems have not yet adapted to these trends and show poor utilization of both CPU and memory resources on current hardware. In this paper, we show how these resources can be optimized for large joins and translate these insights into guidelines for future database architectures, encompassing data structures, algorithms, cost modeling, and implementation. In particular, we discuss how vertically fragmented data structures optimize cache performance on sequential data access. On the algorithmic side, we refine the partitioned hash-join with a new partitioning algorithm called radix-cluster, which is specifically designed to optimize memory access. The performance of this algorithm is quantified using a detailed analytical model that incorporates memory access costs in terms of a limited number of parameters, such as cache sizes and miss penalties. We also present a calibration tool that extracts such parameters automatically from any computer hardware. The accuracy of our models is proven by exhaustive experiments conducted with the Monet database system on three different hardware platforms. Finally, we investigate the effect of implementation techniques that optimize CPU resource usage. Our experiments show that large joins can be accelerated almost an order of magnitude on modern RISC hardware when both memory and CPU resources are optimized.
Information Sciences, 2010
A spatial join is a query that searches a database for the set of object pairs satisfying a given spatial relationship. It is one of the most costly queries, and thus requires an efficient processing algorithm that fully exploits the features of the underlying spatial indexes. In our earlier work, we devised a fairly effective algorithm for processing spatial joins with double transformation (DOT) indexing, which is one of several spatial indexing schemes. However, the algorithm is restricted to the one-dimensional case. In this paper, we extend the algorithm to the two-dimensional case, which is common in Geographic Information Systems (GIS) applications. We first extend DOT to two-dimensional original space. Next, we propose an efficient algorithm for processing range queries using extended DOT. This algorithm employs the quarter division technique and the tri-quarter division technique devised by analyzing the regularity of the space-filling curve used in DOT. This greatly reduces the number of space transformation operations. We then propose a novel spatial join algorithm based on this range query processing algorithm. In processing a spatial join, we determine the access order of disk pages so that we can minimize the number of disk accesses. We show the superiority of the proposed method by extensive experiments using data sets of various distributions and sizes. The experimental results reveal that the proposed method improves the performance of spatial join processing up to three times in comparison with the widely-used R-tree-based spatial join method.
International Journal of Geographical Information Science, 1999
Spatial joins are join operations that involve spatial data types and operators. Spatial access methods are often used to speed up the computation of spatial joins. This paper addresses the issue of benchmarking spatial join operations. For this purpose, we first present a WWW-based benchmark generator to produce sets of rectangles. Using a Web browser, experimenters can specify the number of rectangles in a sample, as well as the statistical distributions of their sizes, shapes, and locations. Second, using the generator and a well-defined set of statistical models we define several tests to compare the performance of three spatial join algorithms: nested loop, scan-and-index, and synchronized tree traversal. We also added a real-life data set from the Sequoia 2000 storage benchmark. Our results show that the relative performance of the different techniques mainly depends on two parameters: sample size, and selectivity of the join predicate. All of the statistical models and algorithms are available on the Web, which allows for easy verification and modification of our experiments.
International Conference on Management of Data, 2005
One of the most common operations in analytic query processing is the application of an aggregate function to the result of a relational join. We describe an algorithm for computing the answer to such a query over large, disk-based input tables. The key innovation of our algorithm is that at all times, it provides an online, statistical estimator for the eventual answer to the query, as well as probabilistic confidence bounds. Thus, a user can monitor the progress of the join throughout its execution and stop the join when satisfied with the estimate's accuracy, or run the algorithm to completion with a total time requirement that is not much longer than other common join algorithms. This contrasts with other online join algorithms, which either do not offer such statistical guarantees or can only offer guarantees so long as the input data can fit into core memory.
Proceedings. Eleventh International Conference on Scientific and Statistical Database Management, 1999
Data warehouses as well as a wide range of other databases exhibit a strong temporal orientation: it is important to track the temporal variation of data over several months or years. In addition, databases often exhibit append-only characteristics where old data is retained while new data is appended. Performing joins efficiently on large databases such as these is essential to obtain good overall query processing performance. This paper presents a sort-merge-based incremental algorithm for time-oriented data. While incremental computation techniques have proven competitive in many settings, they also introduce a space overhead in the form of differential files. For the temporal data explored here, this overhead is avoided because the differential files are already part of the database. In addition, data is naturally sorted, leaving only merging. The incremental algorithm works in a partitioned storage environment and does not assume the availability of indices, making it a competitor to sort-based and nested-loop joins. The paper presents analytical as well as simulation-based characterizations of the performance of the join.
2005
With the increasing occurrence of temporal and spatial data in present-day database applications, the interval data type is adopted by more and more database systems. To efficiently support queries that contain selections on interval attributes as well as simple-valued attributes (e.g., numbers, strings) at the same time, special index structures are required that support both types of predicates in combination. Based on the Relational Interval Tree, we present various indexing schemes that support such combined queries and can be integrated in relational database systems with minimum effort. Experiments on different query types show superior performance for the new techniques in comparison to competing access methods.
2013
Efficient spatial joins are pivotal for many applications and particularly important for geographical information systems or for the simulation sciences where scientists work with spatial models. Past research has primarily focused on disk-based spatial joins; efficient in-memory approaches, however, are important for two reasons: a) main memory has grown so large that many datasets fit in it and b) the in-memory join is a very time-consuming part of all disk-based spatial joins. In this paper we develop TOUCH, a novel in-memory spatial join algorithm that uses hierarchical data-oriented space partitioning, thereby keeping both its memory footprint and the number of comparisons low. Our results show that TOUCH outperforms known in-memory spatial-join algorithms as well as in-memory implementations of disk-based join approaches. In particular, it has a one order of magnitude advantage over the memory-demanding state of the art in terms of number of comparisons (i.e., pairwise object co...
Proc. 26th Int. Conf. on Very Large …, 2000
Modern database applications show a growing demand for efficient and dynamic management of intervals, particularly for temporal and spatial data or for constraint handling. Common approaches require the augmentation of index structures which, however, is not supported by existing relational database systems. By design, the new Relational Interval Tree (RI-tree) employs built-in indexes on an as-they-are basis and is easy to implement. Whereas the functionality and efficiency of the RI-tree is supported by any off-the-shelf relational DBMS, it is perfectly encapsulated by the object-relational data model. The RI-tree requires $O(n/b)$ disk blocks of size $b$ to store $n$ intervals, $O(\log_b n)$ I/O operations for insertion or deletion, and $O(h \cdot \log_b n + r/b)$ I/Os for an intersection query producing $r$ results. The height $h$ of the virtual backbone tree corresponds to the current expansion and granularity of the data space but does not depend on $n$. As demonstrated by our experimental evaluation on an Oracle8i server, competing dynamic interval access methods are outperformed by factors of up to 42 for disk accesses and 4.9 for query response time.
Lecture Notes in Computer Science, 2004
Among other operations, a spatiotemporal DBMS should efficiently answer the spatiotemporal join. This paper presents an evaluation of spatiotemporal join algorithms using these new structures, particularly a partially persistent R-Tree called Temporal R-Tree and the 2+3D R-Tree. Starting from spatial join algorithms, we present algorithms for processing spatiotemporal joins over time instants and intervals on both spatiotemporal data structures. Finally, we implement and test these new algorithms with a couple of generated spatiotemporal data sets. Our experiments show that our algorithms' performance is good even in extreme cases, showing its good scalability, especially for the TR-Tree.
IEEE Transactions on Knowledge and Data Engineering, 1998
Existing methods for spatial joins require pre-existing spatial indices or other precomputation, but such approaches are inefficient and limited in generality. Operand data sets of spatial joins may not all have precomputed indices, particularly when they are dynamically generated by other selection or join operations. Also, existing spatial indices are mostly designed for spatial selections, and are not always efficient for joins. This paper explores the design and implementation of seeded trees [1], which are effective for spatial joins and efficient to construct at join time. Seeded trees are R-tree-like structures, but divided into seed levels and grown levels. This structure facilitates using information regarding the join to accelerate the join process, and allows efficient buffer management. In addition to the basic structure and behavior of seeded trees, we present techniques for efficient seeded tree construction, a new buffer management strategy to lower I/O costs, and theoretical analysis for choosing algorithmic parameters. We also present methods for reducing space requirements and improving the stability of seeded tree performance with no additional I/O costs. Our performance studies show that the seeded tree method outperforms other tree-based methods by far, both in terms of the number of disk pages accessed and weighted I/O costs. Further, its performance gain is stable across different input data, and its incurred CPU penalties are also lower.
Proceedings of the ACM SIGMOD 39th International Conference on Management of Data
There exists a need for high performance, read-only main-memory database systems for OLAP-style application scenarios. Most of the existing works in this area are centered around the domain of column-store databases, which are particularly well suited to OLAP-style scenarios and have been shown to overcome the memory bottleneck issues that have been found to hinder the more traditional row-store database systems. One of the main database operations these systems are focused on optimizing is the JOIN operation. However, all these existing systems use join algorithms that are designed with the unrealistic assumption that there is unlimited temporary memory available to perform the join. In contrast, we propose a Memory Constrained Join algorithm (MCJoin) which is both high performing and also performs all of its operations within a tight given memory constraint. Extensive experimental results show that MCJoin outperforms a naive memory constrained version of the state-of-the-art Radix-Clustered Hash Join algorithm in all of the situations tested, with margins of up to almost 500%.
New Generation Computing, 1988
In recent investigations into reducing the complexity of the relational join operation, several hash-based partitioned-join strategies have been introduced. All of these strategies depend upon the costly operation of data space partitioning before the join can be carried out. We had previously introduced a partitioned-join based on a dynamic and order preserving multidimensional data organization called DYOP. The present study extends the earlier research on DYOP and constructs a simulation model. The simulation studies on DYOP and subsequent comparisons of all the partitioned-join methodologies including DYOP have proven that space utilization of DYOP improves with the increasing number of attributes. Furthermore, the DYOP based join outperforms all the hash-based methodologies by greatly reducing the total I/O bandwidth required for the entire partitioned-join operation. The comparison model is independent of the architectural issues such as multiprocessing, multiple disk usage, and large memory availability, all of which help to further increase the efficiency of the operation.
Transactions in GIS, 2005
Many applications capture, or make use of, spatial data that changes over time. This requirement for effective and efficient spatio-temporal data management has given rise to a range of research activities relating to spatio-temporal data management. Such work has sought to understand, for example, the requirements of different categories of application, and the modelling facilities that are most effective for these applications. However, at present, there are few systems with fully integrated support for spatio-temporal data, and thus developers must often construct custom solutions for their applications. Developers of both bespoke solutions and of generic spatio-temporal platforms will often need to support the inter-relating of large spatio-temporal data sets. Supporting such requests in a database setting involves the use of join operations with both spatial and temporal conditions: spatio-temporal joins. However, there has been little work to date on spatio-temporal join algorithms or their evaluation. This paper presents an evaluation of several approaches to the implementation of spatio-temporal joins that build upon widely available indexing techniques. The evaluation explores how several algorithms perform for databases with different spatial and temporal characteristics, with a view to helping developers of generic infrastructures or custom solutions in the selection and development of appropriate spatio-temporal join strategies.
2002
A join-index is a data structure used for processing join queries in databases. Join-indices use precomputation techniques to speed up online query processing and are useful for data sets which are updated infrequently. The I/O cost of join computation using a join-index with limited buffer space depends primarily on the page-access sequence used to fetch the pages of the base relations. Given a join-index, we introduce a suite of methods based on clustering to compute the joins. We derive upper bounds on the length of the page-access sequences. Experimental results with Sequoia 2000 data sets show that the clustering method outperforms existing methods based on sorting and online-clustering heuristics.
Lecture Notes in Computer Science, 2004
The efficient management of interval sequences represents a core requirement for many temporal and spatial database applications. With the Relational Interval Tree (RI-tree), an efficient access method has been proposed to process intersection queries of spatial objects encoded by interval sequences on top of existing object-relational database systems. This paper complements that approach by effective and efficient models to estimate the selectivity and the I/O cost of interval sequence intersection queries in order to guide the cost-based optimizer whether and how to include the RI-tree into the execution plan. By design, the models immediately fit to common extensible indexing/optimization frameworks, and their implementations exploit the built-in statistics facilities of the database server. According to our experimental evaluation on an Oracle database, the average relative error of the estimated query results and costs lies in the range of 0% to 32%, depending on the size and the structural complexity of the query objects.