2001, Algorithmica
Partial match queries arise frequently in the context of large databases, where each record contains a distinct multidimensional key, that is, the key of each record is a K-tuple of values. The components of a key are called the coordinates or attributes of the key. In a partial match query we specify the value of s attributes, 0 < s < K, and leave the remaining K − s attributes unspecified. The goal is to retrieve all the records in the database that match the specified attributes. In this paper we present several results about the average performance and variance of partial matches in relaxed K-dimensional trees (search trees and digital tries). These data structures are variants of the well-known Kd-trees and Kd-tries. In relaxed trees the sequence of attributes used to guide a query is explicitly stored at the nodes of the tree and randomly generated and, in general, will be different for different search paths. In the standard variants, the sequence of attributes that guides a query examines the attributes in a cyclic fashion, fixed and identical for all search paths. We show that the probabilistic analysis of the relaxed multidimensional trees is very similar to that of standard Kd-trees and Kd-tries, and also to the analysis of quadtrees. In fact, besides the average cost and variance of partial match in relaxed Kd-trees and Kd-tries, we also obtain the variance of partial matches in two-dimensional quadtrees. We also compute the average cost of partial matches in other relaxed multidimensional digital tries, namely, relaxed Kd-Patricia and relaxed Kd-digital search trees.
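As an illustration of the structure described in this abstract, the following is a minimal Python sketch of a relaxed Kd-tree with a partial match query. It is not the authors' code: the names Node, insert and partial_match are hypothetical, ties on the discriminating attribute are sent to the right subtree by assumption, and unspecified attributes are represented by None.

```python
import random


class Node:
    """One record of a relaxed Kd-tree; names are illustrative only."""

    def __init__(self, key, discriminant):
        self.key = key                    # K-tuple of attribute values
        self.discriminant = discriminant  # attribute index stored at the node
        self.left = None
        self.right = None


def insert(root, key, rng=random):
    """Insert a K-tuple and return the root.

    Each new node draws its discriminant uniformly at random, which is
    what distinguishes the relaxed variant from the cyclic standard one.
    """
    new_node = Node(key, rng.randrange(len(key)))
    if root is None:
        return new_node
    node = root
    while True:
        d = node.discriminant
        if key[d] < node.key[d]:
            if node.left is None:
                node.left = new_node
                return root
            node = node.left
        else:  # assumption: equal values go to the right subtree
            if node.right is None:
                node.right = new_node
                return root
            node = node.right


def partial_match(node, query):
    """Return all keys that agree with query on its specified attributes.

    query is a K-tuple in which None marks an unspecified attribute.
    """
    if node is None:
        return []
    found = []
    if all(q is None or q == k for q, k in zip(query, node.key)):
        found.append(node.key)
    q = query[node.discriminant]
    if q is None:
        # Discriminant is unspecified: both subtrees may hold matches.
        found += partial_match(node.left, query)
        found += partial_match(node.right, query)
    elif q < node.key[node.discriminant]:
        found += partial_match(node.left, query)
    else:
        found += partial_match(node.right, query)
    return found
```

The cost that the paper analyses corresponds to the number of nodes visited by partial_match: the search follows one branch when the node's discriminant is specified in the query and both branches otherwise.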
Communications of the ACM, 1975
This paper develops the multidimensional binary search tree (or k-d tree, where k is the dimensionality of the search space) as a data structure for storage of information to be retrieved by associative searches. The k-d tree is defined and examples are given. It is shown to be quite efficient in its storage requirements. A significant advantage of this structure is that a single data structure can handle many types of queries very efficiently. Various utility algorithms are developed; their proven average running times in an n-record file are: insertion, O(log n); deletion of the root, O(n^((k-1)/k)); deletion of a random node, O(log n); and optimization (guarantees logarithmic performance of searches), O(n log n).
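The insertion and exact-match search mentioned in this abstract can be sketched as follows. This is a minimal Python illustration, not the paper's algorithm text: the names KDNode, kd_insert and kd_contains are hypothetical, the discriminating coordinate is assumed to cycle with depth, and ties are sent to the right by assumption.

```python
class KDNode:
    """Node of a standard k-d tree; the axis is implied by the depth."""

    def __init__(self, point):
        self.point = point
        self.left = None
        self.right = None


def kd_insert(root, point):
    """Insert a k-tuple, cycling through the coordinates level by level."""
    if root is None:
        return KDNode(point)
    k = len(point)
    node, depth = root, 0
    while True:
        axis = depth % k  # cyclic discriminator: 0, 1, ..., k-1, 0, ...
        if point[axis] < node.point[axis]:
            if node.left is None:
                node.left = KDNode(point)
                return root
            node = node.left
        else:  # assumption: equal values go to the right subtree
            if node.right is None:
                node.right = KDNode(point)
                return root
            node = node.right
        depth += 1


def kd_contains(root, point):
    """Exact-match search following the same comparisons as kd_insert."""
    k = len(point)
    node, depth = root, 0
    while node is not None:
        if node.point == point:
            return True
        axis = depth % k
        node = node.left if point[axis] < node.point[axis] else node.right
        depth += 1
    return False
```

Both routines follow a single root-to-leaf path, so their cost is proportional to the depth reached in the tree.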
Information Systems, 1982
A new method for multiple attribute indexing, the Multidimensional B-Tree (MDBT), is developed. This method is well suited for dynamic databases, since it handles several types of associative queries efficiently and requires low-cost maintenance. Algorithms and search strategies for exact match, partial match, and range queries are presented and statistical procedures are given to estimate the average and worst case retrieval times. The applicability of our organization to practical databases is discussed and analytical tradeoffs with regard to index organizations based on k-d trees are established.
Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '00, 2000
Affordable, fast computers with large memories have lessened the demand for program efficiency, but applications such as browsing and searching very large databases often have rate-limiting constraints and therefore benefit greatly from improvements in efficiency. This paper empirically evaluates several variants of a common k-dimensional tree technique to demonstrate how different algorithm options influence search cost for nearest neighbors.
Acta Informatica, 1974
The quad tree is a data structure appropriate for storing information to be retrieved on composite keys. We discuss the specific case of two-dimensional retrieval, although the structure is easily generalised to arbitrary dimensions. Algorithms are given both for straightforward insertion and for a type of balanced insertion into quad trees. Empirical analyses show that the average time for insertion is logarithmic with the tree size. An algorithm for retrieval within regions is presented along with data from empirical studies which imply that searching is reasonably efficient. We define an optimized tree and present an algorithm to accomplish optimization in n log n time. Searching is guaranteed to be fast in optimized trees. Remaining problems include those of deletion from quad trees and merging of quad trees, which seem to be inherently difficult operations.
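The straightforward insertion and the retrieval within regions mentioned in this abstract can be sketched for the two-dimensional case as follows. This is a hedged Python illustration rather than the paper's algorithms: the names QuadNode, quad_insert and region_search are hypothetical, and the quadrant numbering and the handling of ties (points on a dividing line go to the higher-coordinate side) are assumptions.

```python
class QuadNode:
    """Node of a two-dimensional point quad tree."""

    def __init__(self, x, y):
        self.x, self.y = x, y
        # Child index: bit 0 set if x >= node.x, bit 1 set if y >= node.y.
        self.children = [None, None, None, None]


def quadrant(node, x, y):
    """Index of the child quadrant of node that contains (x, y)."""
    return (1 if x >= node.x else 0) + (2 if y >= node.y else 0)


def quad_insert(root, x, y):
    """Straightforward (unbalanced) insertion; duplicates are ignored."""
    if root is None:
        return QuadNode(x, y)
    node = root
    while True:
        if (node.x, node.y) == (x, y):
            return root
        q = quadrant(node, x, y)
        if node.children[q] is None:
            node.children[q] = QuadNode(x, y)
            return root
        node = node.children[q]


def region_search(node, x_lo, x_hi, y_lo, y_hi, found=None):
    """Collect all points inside the closed rectangle [x_lo, x_hi] x [y_lo, y_hi]."""
    if found is None:
        found = []
    if node is None:
        return found
    if x_lo <= node.x <= x_hi and y_lo <= node.y <= y_hi:
        found.append((node.x, node.y))
    for q in range(4):
        east, north = q & 1, q & 2
        # Skip quadrants that cannot intersect the query rectangle.
        if east and x_hi < node.x:
            continue
        if not east and x_lo >= node.x:
            continue
        if north and y_hi < node.y:
            continue
        if not north and y_lo >= node.y:
            continue
        region_search(node.children[q], x_lo, x_hi, y_lo, y_hi, found)
    return found
```

The pruning tests in region_search are what keep the search from visiting the whole tree: a subtree is explored only if its quadrant overlaps the query rectangle.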
2001
Feature-vector-based similarity search is increasingly common in database systems. Consequently, many multidimensional index techniques have been introduced to the database research community. These index techniques fall into two main classes: SP (space partitioning)/KD-tree-based and DP (data partitioning)/R-tree-based. Recently, a hybrid index structure has been proposed. It combines both SP/KD-tree-based and DP/R-tree-based techniques to form a new, more efficient index structure. However, weaknesses still exist in the techniques above. In this paper, we introduce a novel and flexible index structure for multidimensional data, the SH-tree (Super Hybrid tree). Theoretical analyses show that the SH-tree is a good combination of both techniques with respect to both presentation and search algorithms. It overcomes their shortcomings and makes use of their positive aspects to facilitate efficient similarity searches.
2017
Efficient evaluation of selection predicates (e.g., range predicates) defined on multiple columns of the same table is a difficult, but nevertheless important task. As we have seen an enormous increase of data within the last decade, efficient multi-dimensional selection predicate evaluation becomes more important. This is especially important for scientific data management tasks, where we often face data sets that need to be filtered based on several dimensions. So far, the state-of-the-art solution strategy is to apply highly optimized sequential scans. However, the intermediate results are often large, while the final query result often only contains a small fraction of the data set. This is due to the combined selectivity of all predicates. We propose Elf, a new tree-based approach to efficiently support such queries. Our structure indexes densely populated sub-spaces, allowing for efficient pruning. Keywords: data analytics, indexing, main-memory databases, storage structures.
2005
In this paper, we propose an efficient access method, named MK-tree, to dynamically index large data sets in high dimensional spaces. It is an extension of the M-tree with a key dimension to improve the efficiency of space partitioning and reduce the response time of similarity search for high dimensional data. The main idea behind the key dimension is to make the fanout of the tree larger by partitioning a subspace further into two subspaces, called a twin-node, according to the key dimension. To get a high space utilization, we conduct data reallocation within a twin-node dynamically, thereby further improving the performance of the MK-tree. Our experimental results show that a higher filtering efficiency can be obtained by using the concept of key dimension for both R-neighbor search and K-nearest neighbor search.
2016
Efficient evaluation of selection predicates (e.g., range predicates) defined on multiple columns of the same table is a difficult, but nevertheless important task. Especially for subsequent join processing or aggregation, we need to reduce the number of tuples to be processed. As we have seen an enormous increase of data within the last decade, this kind of selection predicate became more important. Especially in OLAP scenarios or scientific data management tasks, we often face multi-dimensional data sets that need to be filtered based on several dimensions. So far, the state-of-the-art solution strategy is to apply highly optimized sequential scans. However, the intermediate results are often large, while the final query result often only contains a small fraction of the data set. This is due to the combined selectivity of all predicates. In this report, we propose Elf, a new tree-based approach to efficiently support such queries. In contrast to other tree-based approaches, we do n...
2003
Similarity searches in multidimensional Nonordered Discrete Data Spaces (NDDS) are becoming increasingly important for application areas such as genome sequence databases. Existing indexing methods developed for multidimensional (ordered) Continuous Data Spaces (CDS) such as R-tree cannot be directly applied to an NDDS. This is because some essential geometric concepts/properties such as the minimum bounding region and the area of a region in a CDS are no longer valid in an NDDS. On the other hand, indexing methods based on metric spaces such as M-tree are too general to effectively utilize the data distribution characteristics in an NDDS. Therefore, their retrieval performance is not optimized. To support efficient similarity searches in an NDDS, we propose a new dynamic indexing technique, called the ND-tree. The key idea is to extend the relevant geometric concepts as well as some indexing strategies used in CDSs to NDDSs. Efficient algorithms for ND-tree construction are presented. Our experimental results on synthetic and genomic sequence data demonstrate that the performance of the ND-tree is significantly better than that of the linear scan and M-tree in high dimensional NDDSs.
The problem of Music Information Retrieval can often be formalized as "searching for multidimensional trajectories". It is well known that string-matching techniques provide robust and effective theoretic solutions to this problem. However, for low dimensional searches, especially queries concerning a single vector as opposed to a series of vectors, there are a wide variety of other methods available. In this work we examine and benchmark those methods and attempt to determine if they may be useful in the field of information retrieval. Notably, we propose the use of KD-Trees for multidimensional near-neighbor searching. We show that a KD-Tree is optimized for multidimensional data, and is preferred over other methods that have been suggested, such as the K-Tree, the box-assisted sort and the multidimensional quick-sort.
1997
In this paper we discuss some design issues concerning a semi-dynamic data structure for searching in multidimensional point sets in distributed environments. The data structure is based on an extension of k-d trees and supports exact, partial, and range search queries. We assume multicast is available in our distributed environment, but discuss how to use it only when needed and investigate, through a cost model, the best strategy to deal with range queries.
1998
We investigate the usability and performance of the UB-Tree (universal B-Tree) for multidimensional data, as they arise in all relational databases and in particular in data-warehousing and data-mining applications. The UB-Tree is balanced and has all the guaranteed performance characteristics of B-Trees, i.e., it requires linear space for storage and logarithmic time for the basic operations of insertion, retrieval
1998
In this paper we present a data structure for searching in multi-dimensional point sets in distributed environments and discuss its experimental evaluation, also through a comparison with previous proposals. The data structure is based on an extension of k-d trees. The technological reference context is a distributed environment where multicast (i.e., restricted broadcast) is allowed, but it is also shown how to avoid using it.
Journal of Algorithms, 2002
In this work we present the average-case analysis of orthogonal range search for several multidimensional data structures. We first consider random relaxed K-d trees as a prototypical example. Later we extend these results to many different multidimensional data structures. We show that the performance of range searches is related to the performance of a variant of partial matches using a mixture of geometric and combinatorial arguments. This reduction simplifies the analysis and allows us to give exact upper and lower bounds for the performance of range searches (Theorems 3 and 4) and a useful characterization of the cost of range search as a sum of the costs of partial match-like operations (Theorem 5). Using these results, we can get very precise asymptotic estimates for the expected cost of range searches (Theorem 6).
2007
Over the last two decades, much research effort has been spent on nearest neighbor search in high-dimensional data sets. Most of the approaches published thus far have, however, only been tested on rather small collections. When large collections have been considered, high-performance environments have been used, in particular systems with a large main memory. Accessing data on disk has largely been avoided because disk operations are considered to be too slow. It has been shown, however, that using large amounts of memory is generally not an economic choice. Therefore, we propose the NV-tree, which is a very efficient disk-based data structure that can give good approximate answers to nearest neighbor queries with a single disk operation, even for very large collections of high-dimensional data. Using a single NV-tree, the returned results have high recall but contain a number of false positives. By combining two or three NV-trees, most of those false positives can be avoided while retaining the high recall. Finally, we compare the NV-tree to Locality Sensitive Hashing, a popular method for ε-distance search. We show that they return results of similar quality, but the NV-tree uses many fewer disk reads.
2006 10th International Database Engineering and Applications Symposium (IDEAS'06), 2006
Multi-dimensional data structures are applied in many real index applications, e.g. data mining, indexing multimedia data, indexing unstructured text documents and so on. Many index structures and algorithms have been proposed. There are two major approaches to multi-dimensional indexing: data structures for indexing metric spaces and data structures for indexing vector spaces. The R-tree, R*-tree, and UB-tree are representatives of the vector data structures. These data structures provide efficient processing for many types of queries, e.g. point queries, range queries and so on. In the vector data structures, the range query retrieves all points in a defined hyper-box in an n-dimensional space. The narrow range query is a significant type of range query. Its processing is inefficient in the vector data structures. Moreover, the efficiency decreases as the dimension of the indexed space increases. We describe an application of the signature for more efficient processing of narrow range queries. The approach puts the signature into the R-tree while preserving native functionality, i.e. the range query algorithm for the general range query. The novel data structure is called the Signature R-tree. This data structure is more resistant to the curse of dimensionality.
2003
Relational index structures, such as the Relational Interval Tree, the Relational R-Tree, or the Linear Quadtree, support efficient processing of queries on top of existing object-relational database systems. Furthermore, there exist effective and efficient models to estimate the selectivity and the I/O cost in order to guide the cost-based optimizer on whether and how to include these index structures in the execution plan. By design, the models fit directly into common extensible indexing/optimization frameworks, and their implementations exploit the built-in statistics facilities of the database server. In this paper, we show how these statistics can also be used for accelerating geo-spatial queries using the relational quadtree by reducing the number of generated join partners, which results in fewer logical reads and consequently improves the overall runtime. We cut down on the number of join partners by grouping different join partners together according to a statistics-driven grouping algorithm. Our experiments on an Oracle9i database yield an average speed-up between 30% and 300% for spatial selection queries on the Relational Quadtree.
Distributed and Parallel Databases, 2005
Multidimensional indexing is concerned with the indexing of multi-attributed records, where queries can be applied on some or all of the attributes. Indexing multi-attributed records is referred to by the term multidimensional indexing because each record is viewed as a ...