2003, Eighth International Conference on Database Systems for Advanced Applications, 2003. (DASFAA 2003). Proceedings.
The efficient processing of similarity joins is important for a large class of applications. The dimensionality of the data for these applications ranges from low to high. Most existing methods have focused on the execution of high-dimensional joins over large amounts of disk-based data. The increasing sizes of main memory available on current computers, and the need for efficient processing of spatial joins, suggest that spatial joins for a large class of problems can be processed in main memory. In this paper we develop two new spatial join algorithms, the Grid-join and EGO*-join, and study their performance in comparison to the state-of-the-art algorithm, EGO-join, and the RSJ algorithm. Through evaluation we explore the domain of applicability of each algorithm and provide recommendations for the choice of join algorithm depending upon the dimensionality of the data as well as the critical ε parameter. We also point out the significance of the choice of this parameter for ensuring that the selectivity achieved is reasonable. The proposed EGO*-join algorithm always, often significantly, outperforms the EGO-join. For low-dimensional data the Grid-join outperforms both the EGO- and EGO*-joins. An analysis of the cost of the Grid-join is presented and highly accurate cost estimator functions are developed. These are used to choose an appropriate grid size for optimal performance and can also be used by a query optimizer to compute the estimated cost of the Grid-join.
Information Systems, 2007
The efficient processing of multidimensional similarity joins is important for a large class of applications. The dimensionality of the data for these applications ranges from low to high. Most existing methods have focused on the execution of high-dimensional joins over large amounts of disk-based data. The increasing sizes of main memory available on current computers, and the need for efficient processing of spatial joins suggest that spatial joins for a large class of problems can be processed in main memory. In this paper, we develop two new in-memory spatial join algorithms, the Grid-join and EGO*-join, and study their performance. Through evaluation, we explore the domain of applicability of each approach and provide recommendations for the choice of a join algorithm depending upon the dimensionality of the data as well as the expected selectivity of the join. We show that the two new proposed join techniques substantially outperform the state-of-the-art join algorithm, the EGO-join.
IEEE Transactions on Knowledge and Data Engineering, 2000
Current data repositories include a variety of data types, including audio, images, and time series. State-of-the-art techniques for indexing such data and doing query processing rely on a transformation of data elements into points in a multidimensional feature space. Indexing and query processing then take place in the feature space. In this paper, we study algorithms for finding relationships among points in multidimensional feature spaces, specifically algorithms for multidimensional joins. Like joins of conventional relations, correlations between multidimensional feature spaces can offer valuable information about the data sets involved. We present several algorithmic paradigms for solving the multidimensional join problem and we discuss their features and limitations. We propose a generalization of the Size Separation Spatial Join algorithm, named Multidimensional Spatial Join (MSJ), to solve the multidimensional join problem. We evaluate MSJ along with several other specific algorithms, comparing their performance for various dimensionalities on both real and synthetic multidimensional data sets. Our experimental results indicate that MSJ, which is based on space filling curves, consistently yields good performance across a wide range of dimensionalities. Index Terms: Spatial join, sort merge joins, multiple-key indexes, data structures.
1997
Multidimensional similarity join finds pairs of multidimensional points that are within some small distance of each other. The ε-k-d-B tree has been proposed as a data structure that scales better as the number of dimensions increases compared to previous data structures. We present a cost model of the ε-k-d-B tree and use it to optimize the leaf size.
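To make the operation concrete, the following is a minimal sketch of a similarity self-join: it returns every pair of points whose Euclidean distance is at most a threshold. It buckets points into a uniform grid whose cells have side length equal to the threshold, so each point only needs to be compared with points in its own and adjacent cells. This is an illustrative assumption, not the tree-based method or any other algorithm from the papers listed here; the function name `epsilon_self_join` and the parameter `eps` are hypothetical.

```python
from collections import defaultdict
from itertools import product
from math import dist, floor


def epsilon_self_join(points, eps):
    """Return sorted index pairs (i, j), i < j, with Euclidean distance <= eps."""
    # Bucket every point by the grid cell (side length eps) that contains it.
    grid = defaultdict(list)
    for idx, p in enumerate(points):
        grid[tuple(floor(c / eps) for c in p)].append(idx)

    # Two points within eps of each other differ by at most one cell per axis,
    # so only the 3^d cells around a point can hold its join partners.
    dims = len(points[0]) if points else 0
    offsets = list(product((-1, 0, 1), repeat=dims))

    result = set()
    for cell, members in grid.items():
        for off in offsets:
            neighbour = tuple(c + o for c, o in zip(cell, off))
            # grid.get avoids creating empty cells while iterating.
            for i in members:
                for j in grid.get(neighbour, ()):
                    if i < j and dist(points[i], points[j]) <= eps:
                        result.add((i, j))
    return sorted(result)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.05, 0.0), (1.0, 1.0), (1.0, 1.08)]
    print(epsilon_self_join(pts, 0.1))
```

Because the number of neighbouring cells grows as 3^d, a plain grid of this kind is only practical for low-dimensional data.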
Proceedings of the 29th International Conference on Advances in Geographic Information Systems, 2021
The importance and complexity of spatial join have resulted in many join algorithms, some of which run on big-data platforms such as Hadoop and Spark. This paper proposes the first machine-learning-based query optimizer for the spatial join operation that can accommodate the skewness of spatial datasets and the complexity of the different algorithms. The main challenge is how to develop portable cost models that take into account important input characteristics such as data distribution, spatial partitioning, the logic of spatial join algorithms, and the relationship between the two datasets. The proposed system defines a set of features that can all be computed efficiently for the data to capture the intricate aspects of spatial join. Then, it uses these features to train three machine learning models that capture several metrics to estimate the cost of four spatial join algorithms according to user requirements. The first model can estimate the cardinality of the spatial join. The second model can predict the number of rough comparisons for a specific join algorithm. Finally, the third model is a classification model that can choose the best join algorithm to run. Experiments on large-scale synthetic and real data show the efficiency of the proposed models over baseline methods. CCS Concepts: Information systems → Database management system engines; Computing methodologies → Machine learning approaches.
2002
A join-index is a data structure used for processing join queries in databases. Join-indices use precomputation techniques to speed up online query processing and are useful for data sets which are updated infrequently. The I/O cost of join computation using a join-index with limited buffer space depends primarily on the page-access sequence used to fetch the pages of the base relations. Given a join-index, we introduce a suite of methods based on clustering to compute the joins. We derive upper bounds on the length of the page-access sequences. Experimental results with Sequoia 2000 data sets show that the clustering method outperforms existing methods based on sorting and online-clustering heuristics.
Information Sciences, 2010
A spatial join is a query that searches for a set of object pairs satisfying a given spatial relationship from a database. It is one of the most costly queries, and thus requires an efficient processing algorithm that fully exploits the features of the underlying spatial indexes. In our earlier work, we devised a fairly effective algorithm for processing spatial joins with double transformation (DOT) indexing, which is one of several spatial indexing schemes. However, the algorithm is restricted to only the one-dimensional cases. In this paper, we extend the algorithm for the two-dimensional cases, which are general in Geographic Information Systems (GIS) applications. We first extend DOT to two-dimensional original space. Next, we propose an efficient algorithm for processing range queries using extended DOT. This algorithm employs the quarter division technique and the tri-quarter division technique devised by analyzing the regularity of the space-filling curve used in DOT. This greatly reduces the number of space transformation operations. We then propose a novel spatial join algorithm based on this range query processing algorithm. In processing a spatial join, we determine the access order of disk pages so that we can minimize the number of disk accesses. We show the superiority of the proposed method by extensive experiments using data sets of various distributions and sizes. The experimental results reveal that the proposed method improves the performance of spatial join processing up to three times in comparison with the widely-used R-tree-based spatial join method.
International Journal of Geographical Information Science, 1999
Spatial joins are join operations that involve spatial data types and operators. Spatial access methods are often used to speed up the computation of spatial joins. This paper addresses the issue of benchmarking spatial join operations. For this purpose, we first present a WWW-based benchmark generator to produce sets of rectangles. Using a Web browser, experimenters can specify the number of rectangles in a sample, as well as the statistical distributions of their sizes, shapes, and locations. Second, using the generator and a well-defined set of statistical models we define several tests to compare the performance of three spatial join algorithms: nested loop, scan-and-index, and synchronized tree traversal. We also added a real-life data set from the Sequoia 2000 storage benchmark. Our results show that the relative performance of the different techniques mainly depends on two parameters: sample size, and selectivity of the join predicate. All of the statistical models and algorithms are available on the Web, which allows for easy verification and modification of our experiments.
2010 IEEE 26th International Conference on Data Engineering (ICDE 2010), 2010
Similarity joins have been studied as key operations in multiple application domains, e.g., record linkage, data cleaning, multimedia and video applications, and phenomena detection on sensor networks. Multiple similarity join algorithms and implementation techniques have been proposed. They range from out-of-database approaches for only in-memory and external memory data to techniques that make use of standard database operators to answer similarity joins. Unfortunately, there has not been much study on the role and implementation of similarity joins as database physical operators. In this paper, we focus on the study of similarity joins as first-class database operators. We present the definition of several similarity join operators and study the way they interact among themselves, with other standard database operators, and with other previously proposed similarity-aware operators. In particular, we present multiple transformation rules that enable similarity query optimization through the generation of equivalent similarity query execution plans. We then describe an efficient implementation of two similarity join operators, Ɛ-Join and Join-Around, as core DBMS operators. The performance evaluation of the implemented operators in PostgreSQL shows that they have good execution time and scalability properties. The execution time of Join-Around is less than 5% of the one of the equivalent query that uses only regular operators while Ɛ-Join's execution time is 20 to 90% of the one of its equivalent regular operators based query for the useful case of small Ɛ (0.01% to 10% of the domain range). We also show experimentally that the proposed transformation rules can generate plans with execution times that are only 10% to 70% of the ones of the initial query plans.
Lecture Notes in Computer Science, 2002
Existing work on multiway spatial joins focuses on the retrieval of all exact solutions with no time limit for query processing. Depending on the query and data properties, however, exhaustive processing of multiway spatial joins can be prohibitively expensive due to the exponential nature of the problem. Furthermore, if there do not exist any exact solutions, the result will be empty even though there may exist solutions that match the query very closely. These shortcomings motivate the current work, which aims at the retrieval of the best possible (exact or approximate) solutions within a time threshold, since fast retrieval of approximate matches is the only way to deal with the ever increasing amounts of multimedia information in several real time systems. We propose various techniques that combine local and evolutionary search with underlying indexes to prune the search space. In addition to their usefulness as standalone methods for approximate query processing, the techniques can be combined with systematic search to enhance performance when the goal is retrieval of the best solutions.
Proceedings 14th International Conference on Data Engineering, 1998
The join query is one of the fundamental operations in Data Base Management Systems (DBMSs). Modern DBMSs should be able to support non-traditional data, including spatial objects, in an efficient manner. Towards this goal, spatial data structures can be adopted in order to support the execution of join queries on sets of multidimensional data. This paper introduces analytical models that estimate the cost (in terms of node or disk accesses) of join queries involving two multidimensional indexed data sets using R-tree-based structures. In addition, experimental results are presented, which show the accuracy of the analytical estimations when compared to actual runs on both synthetic and real data sets. It turns out that the relative error rarely exceeds 15% for all combinations, a fact that makes the proposed cost models useful tools for efficient spatial query optimization.
Proceedings of the 14th annual ACM international symposium on Advances in geographic information systems - GIS '06, 2006
This paper presents a query optimizer module based on cost estimation that chooses the best filtering step algorithm to perform a specific spatial join operation. A set of expressions to predict the number of I/O operations and the response time of each algorithm is first presented and later refined considering a given hardware configuration. The query optimizer chooses the algorithm that returns the smaller estimated response time. In order to evaluate the query optimizer, we carried out a set of tests with synthetic and real data sets, in a significant number of different scenarios. The query optimizer correctly chooses the fastest algorithm in almost 90% of submitted operations, with minimal overhead.
2010 18th Euromicro Conference on Parallel, …, 2010
Efficient processing of similarity joins is important for a large class of data analysis and data-mining applications. This primitive finds all pairs of records within a predefined distance threshold of each other. However, most of the existing approaches have been based on spatial join techniques designed primarily for data in a vector space. Treating data collections as metric objects brings a great advantage in generality, because a single metric technique can be applied to many specific search problems quite different in nature. In this ...
IEEE Transactions on Knowledge and Data Engineering, 1998
Existing methods for spatial joins require pre-existing spatial indices or other precomputation, but such approaches are inefficient and limited in generality. Operand data sets of spatial joins may not all have precomputed indices, particularly when they are dynamically generated by other selection or join operations. Also, existing spatial indices are mostly designed for spatial selections, and are not always efficient for joins. This paper explores the design and implementation of seeded trees [1], which are effective for spatial joins and efficient to construct at join time. Seeded trees are R-tree-like structures, but divided into seed levels and grown levels. This structure facilitates using information regarding the join to accelerate the join process, and allows efficient buffer management. In addition to the basic structure and behavior of seeded trees, we present techniques for efficient seeded tree construction, a new buffer management strategy to lower I/O costs, and theoretical analysis for choosing algorithmic parameters. We also present methods for reducing space requirements and improving the stability of seeded tree performance with no additional I/O costs. Our performance studies show that the seeded tree method outperforms other tree-based methods by far, both in terms of the number of disk pages accessed and weighted I/O costs. Further, its performance gain is stable across different input data, and its incurred CPU penalties are also lower.
Information Systems, 2020
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
2018 IEEE 34th International Conference on Data Engineering (ICDE), 2018
Set similarity join is a fundamental and well-studied database operator. It is usually studied in the exact setting where the goal is to compute all pairs of sets that exceed a given similarity threshold (measured e.g. as Jaccard similarity). But set similarity join is often used in settings where 100% recall may not be important; indeed, where the exact set similarity join is itself only an approximation of the desired result set. We present a new randomized algorithm for set similarity join that can achieve any desired recall up to 100%, and show theoretically and empirically that it significantly improves on existing methods. The present state-of-the-art exact methods are based on prefix-filtering, the performance of which depends on the data set having many rare tokens. Our method is robust against the absence of such structure in the data. At 90% recall our algorithm is often more than an order of magnitude faster than state-of-the-art exact methods, depending on how well a data set lends itself to prefix filtering. Our experiments on benchmark data sets also show that the method is several times faster than comparable approximate methods. Our algorithm makes use of recent theoretical advances in high-dimensional sketching and indexing that we believe to be of wider relevance to the data engineering community.
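For reference, the exact setting described above can be written as a brute-force baseline that compares every pair of sets by Jaccard similarity. This is only a sketch of the problem definition, assuming a threshold that pairs must meet or exceed; it is neither the randomized algorithm nor the prefix-filtering methods the abstract mentions, and the names `jaccard` and `set_similarity_join` are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def set_similarity_join(sets, threshold):
    """Return index pairs (i, j), i < j, whose Jaccard similarity is >= threshold."""
    # Quadratic scan over all pairs: exact (100% recall) but slow on large inputs.
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(sets), 2)
        if jaccard(a, b) >= threshold
    ]


if __name__ == "__main__":
    data = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
    print(set_similarity_join(data, 0.5))
```

The quadratic pair scan is what filtering and approximate techniques aim to avoid; it is useful mainly as a correctness check on small inputs.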
Synthesis Lectures on Data Management, 2013
Synthesis Lectures on Data Management is edited by Tamer Özsu of the University of Waterloo. The series publishes 50- to 125-page publications on topics pertaining to data management. The scope will largely follow the purview of premier information and computer science conferences, such as ACM SIGMOD, VLDB, ICDE, PODS, ICDT, and ACM KDD. Potential topics include, but are not limited to: query languages, database system architectures, transaction management, data warehousing, XML and databases, data stream systems, wide scale data distribution, multimedia data management, data mining, and related subjects.
2008
The spatial join operation is one of the most important and most expensive operations in Geographic Database Management Systems (GDBMS). This paper presents a set of rules to optimize the performance of the filtering step of spatial join operations. First, a set of expressions to predict the number of I/O operations and CPU performance is presented. The rules are based on expressions to predict the performance of algorithms and on tests performed with synthetic and real data sets.
New Generation Computing, 1988
In recent investigations into reducing the complexity of the relational join operation, several hash-based partitioned-join strategies have been introduced. All of these strategies depend upon the costly operation of data space partitioning before the join can be carried out. We had previously introduced a partitioned-join based on a dynamic and order preserving multidimensional data organization called DYOP. The present study extends the earlier research on DYOP and constructs a simulation model. The simulation studies on DYOP and subsequent comparisons of all the partitioned-join methodologies including DYOP have proven that space utilization of DYOP improves with the increasing number of attributes. Furthermore, the DYOP-based join outperforms all the hash-based methodologies by greatly reducing the total I/O bandwidth required for the entire partitioned-join operation. The comparison model is independent of architectural issues such as multiprocessing, multiple disk usage, and large memory availability, all of which help to further increase the efficiency of the operation.
Journal of King Saud University - Computer and Information Sciences, 2010
The performance of large database systems depends heavily on the cost of performing join operations. Optimizing the join of two very large tables is an active research topic, especially when both tables are too large to fit in main memory. In such cases, the join is usually performed by a method other than hash join algorithms. In this paper, a novel join algorithm based on quadtrees is introduced. Applying the proposed algorithm to two very large tables that are too large to fit in main memory is proven to be fast and efficient. In the proposed algorithm, both tables are represented by a storage-efficient quadtree that is designed to handle one-dimensional arrays (1-D arrays). The algorithm works on the two 1-D arrays of the two tables to perform join operations. Time and space complexities of the new algorithm are studied. Experimental studies show the efficiency and superiority of this algorithm. The proposed join algorithm requires a minimal number of I/O operations and operates in main memory with O(n log (n/k)) time complexity, where k is the number of key groups with the same first letter, and (n/k) is much smaller than n.
This paper proposes an efficient similarity join method using unsupervised learning, for use when no labeled data is available. In our previous work, we showed that the performance of similarity join could improve when long string attributes, such as paper abstracts, movie summaries, product descriptions, and user feedback, are used under supervised learning, where a training set exists. In this work, we use long string attributes during the similarity join under unsupervised learning. Besides being necessary when no labeled data exists, unsupervised learning also acts as a quick preprocessing method for huge datasets. Here, we show that using long attributes during unsupervised learning can further enhance the performance. Moreover, we provide an efficient, dynamically expandable algorithm for databases with frequent transactions.