Academia.edu no longer supports Internet Explorer.
To browse Academia.edu and the wider internet faster and more securely, please take a few seconds to upgrade your browser.
2008
Many computational applications need to look for informa- tion in a database. Nowadays, the predominance of non- conventional databases makes the similarity search (i.e., searching elements of the database that are "similar" to a given query) becomes a preponderant concept. The Spatial Approximation Tree has been shown that it compares favorably against alternative data structures for similarity searching in metric spaces of medium to high di- mensionality ("difficult" spaces) or queries with low selec- tivity. However, for the construction process the tree root has been randomly selected and the tree ,in its shape and performance, is completely determined by this selection. Therefore, we are interested in improve mainly the searches in this data structure trying to select the tree root so to re- flect some of the own characteristics of the metric space to be indexed. We regard that selecting the root in this way it allows a better adaption of the data structure ...
2010
The metric space model allows abstracting many similarity search problems. Similarity search has multiple applications especially in the multimedia databases area. The idea is to index the database so as to accelerate similarity queries. Although there are several promising indices, few of them are dynamic, i.e., once created very few allow to perform insertions and deletions of elements at a reasonable cost.
Similarity Search and Applications, 2009 …, 2009
Metric space searching is an emerging technique to address the problem of efficient similarity searching in many applications, including multimedia databases and other repositories handling complex objects. Although promising, the metric space approach is still immature in several aspects that are well established in traditional databases. In particular, most indexing schemes are not dynamic, that is, few of them tolerate insertion of elements at reasonable cost over an existing index and only a few work efficiently in secondary memory.
2010
Similarity search in high-dimensional metric spaces is a key operation in many applications, such as multimedia databases, image retrieval, object recognition, and others. The high dimensionality of the data requires special index structures to facilitate the search. A problem regarding the creation of suitable index structures for highdimensional data is the relationship between the geometry of the data and the organization of an index structure. In this paper, we study the performance of a new index structure, called Divisive-Agglomerative Hierarchical Clustering tree (DAHC-tree), which reduces the effects imposed by the above liability. DAHC-tree is constructed by dividing and grouping the data set into compact clusters. We perform a rigorous experimental design and analyze the trade-offs involved in building such an index structure. Additionally, we present extensive experiments comparing our method against state-of-the-art of exact and approximate solutions. The conducted analysis and the reported comparative test results demonstrate that our technique significantly improves the performance of similarity queries.
2002
The Spatial Approximation Tree (sa-tree) is a recently proposed data structure for searching in metric spaces. It has been shown that it compares favorably against alternative data structures in spaces of high dimension or queries with low selectivity. The main drawback of the ...
2003
Hybrid dynamic spatial approximation trees are recently proposed data structures for searching in metric spaces, based on combining the concepts of spatial approximation and pivot based algorithms. These data structures are hybrid schemes, with the full features of dynamic spatial approximation trees and able of using the available memory to improve the query time. It has been shown that they compare favorably against alternative data structures in spaces of medium difficulty. In this paper we complete and improve hybrid dynamic spatial approximation trees, by presenting a new search alternative, an algorithm to remove objects from the tree, and an improved way of managing the available memory. The result is a fully dynamic and optimized data structure for similarity searching in metric spaces.
2001
The spatial approximation tree (sa-tree) is a recently proposed data structure for searching in metric spaces. It has been shown to compare favorably against alternative data structures in spaces of high dimension or queries with low selectivity. The main drawback of the sa-tree is that it is a static data structure, that is, once built, it is difficult to add new elements to it. This rules it out for many interesting applications. In this paper we overcome this weakness. We propose and study several methods to handle insertions in the sa-tree. Some are classical solutions well known in the data structures community, while the most promising ones have been specifically developed considering the particular properties of the sa-tree, and involve new algorithmic insights in the behavior of this data structure. As a result, we show that it is viable to modify the sa-tree so as to permit fast insertions while keeping its good search efficiency
IEEE Transactions on Knowledge and Data Engineering, 2017
Spatial queries including similarity search and similarity joins are useful in many areas, such as multimedia retrieval, data integration, and so on. However, they are not supported well by commercial DBMSs. This may be due to the complex data types involved and the needs for flexible similarity criteria seen in real applications. In this paper, we propose a versatile and efficient diskbased index for metric data, the Space-filling curve and Pivot-based B +-tree (SPB-tree). This index leverages the B +-tree, and uses space-filling curve to cluster data into compact regions, thus achieving storage efficiency. It utilizes a small set of so-called pivots to reduce significantly the number of distance computations when using the index. Further, it makes use of a separate random access file to support a broad range of data. By design, it is easy to integrate the SPB-tree into an existing DBMS. We present efficient algorithms for processing similarity search and similarity joins, as well as corresponding cost models based on SPB-trees. Extensive experiments using both real and synthetic data show that, compared with state-of-the-art competitors, the SPB-tree has much lower construction cost, smaller storage size, and supports more efficient similarity search and similarity joins with high accuracy cost models.
ACM Journal of Experimental Algorithmics, 2009
Proximity searching consists of retrieving from a database those elements that are similar to a query object. The usual model for proximity searching is a metric space where the distance, which models the proximity, is expensive to compute. An index uses precomputed distances to speedup query processing. Among all the known indices, the baseline for performance for about 20 years has been AESA. This index uses an iterative procedure, where at each iteration it first chooses the next promising element (“pivot”) to compare to the query, and then it discards database elements that can be proved not relevant to the query using the pivot. The next pivot in AESA is chosen as the one minimizing the sum of lower bounds to the distance to the query proved by previous pivots. In this article, we introduce the new index iAESA , which establishes a new performance baseline for metric space searching. The difference with AESA is the method to select the next pivot. In iAESA, each candidate sorts...
Journal of Computer Science Technology, 2014
Metric space searching is an emerging technique to address the problem of similarity searching in many applications. In order to efficiently answer similarity queries, the database must be indexed. In some interesting real applications dynamism is an indispensable property of the index. There are very few actually dynamic indexes that support not only searches, but also insertions and deletions of elements. The dynamic spatial approximation tree (DSAT) is a data structure specially designed for searching in metric spaces, which compares favorably against other data structures in high dimensional spaces or queries with low selectivity. Insertions are efficient and easily supported in DSAT, but deletions degrade the structure over time. Several methods are proposed to handle deletions over the DSAT. One of them has shown to be superior to the others, in the sense that it permits controlling the expected deletion cost as a proportion of the insertion cost and searches does not overly degrade after several deletions. In this paper we propose and study a new alternative deletion method, based on the better existing strategy. The outcome is a fully dynamic data structure that can be managed through insertions and deletions over arbitrarily long periods of time without any significant reorganization.
Hybrid dynamic spatial approximation trees are recently proposed data structures for searching in metric spaces, based on combining the concepts of spatial approximation and pivot based algorithms. These data structures are hybrid schemes, with the full features of dynamic spatial approximation trees and able of using the available memory to improve the query time. It has been shown that they compare favorably against alternative data structures in spaces of medium difficulty.
String Processing and Information Retrieval, 2002
The Spatial Approximation Tree (sa-tree) is a recently proposed data structure for searching in metric spaces. It has been shown that it compares favorably against alternative data structures in spaces of high dimension or queries with low selectivity. Its main drawbacks are: costly construction time, poor performance in low dimensional spaces or queries with high selectivity, and the fact of being a static data structure, that is, once built, one cannot add or delete elements. These facts rule it out for many interesting applications. In this paper we overcome these weaknesses. We present a dynamic version of the sa-tree that handles insertions and deletions, showing experimentally that the price of adding dynamism is rather low. This is remarkable by itself since very few data structures for metric spaces are fully dynamic. In addition, we show how to obtain large improvements in construction and search time for low dimensional spaces or highly selective queries. The outcome is a much more practical data structure that can be useful in a wide range of applications.
This work deals with the approximate string search in large spatial databases. Specially, I investigate range queries augmented with a string similarity search predicate in both Euclidean space and road networks. I dub this query the Spatial Approximate String (SAS) query. In Euclidean space, it propose an approximate solution, the MHR-tree, which embeds min-wise signatures into an R-tree. The min-wise signature for an index node u keeps a concise representation of the union of q-grams from strings under the sub-tree of u. It analyzes the pruning functionality of such signatures based on the set resemblance between the query string and the q-grams from the sub-trees of index nodes. We analyze the pruning functionality of such signatures based on set resemblance between the query string and the q-grams from the sub-trees of index nodes. MHR-tree supports a wide range of query predicates efficiently, including range and nearest neighbor queries. We also discuss how to estimate range query selectivity accurately. We present a novel adaptive algorithm for finding balanced partitions using both the spatial and string information stored in the tree. Extensive experiments on large real data sets demonstrate the efficiency and effectiveness of our approach.
1997
A new access method, called M-tree, is proposed to organize and search large data sets from a generic "metric space", i.e. where object proximity is only defined by a distance function satisfying the positivity, symmetry, and triangle inequality postulates. We detail algorithms for insertion of objects and split management, which keep the M-tree always balanced -several heuristic split alternatives are considered and experimentally evaluated. Algorithms for similarity (range and k-nearest neighbors) queries are also described.
ACM Transactions on Database Systems, 2002
Metric access methods (MAMs), such as the M-tree, are powerful index structures for supporting similarity queries on metric spaces, which represent a common abstraction for those searching problems that arise in many modern application areas, such as multimedia, data mining, decision support, pattern recognition, and genomic databases. As compared to multi-dimensional (spatial) access methods (SAMs), MAMs are more general, yet they are reputed to lose in flexibility, since it is commonly deemed that they can only answer queries using the same distance function used to build the index. In this paper we show that this limitation is only apparent -thus MAMs are far more flexible than believed -and extend the M-tree so as to be able to support user-defined distance criteria, approximate distance functions to speed up query evaluation, as well as dissimilarity functions which are not metrics. The so-extended M-tree, also called QIC-M-tree, can deal with three distinct distances at a time: 1) a query (user-defined) distance, 2) an index distance (used to build the tree), and 3) a comparison (approximate) distance (used to quickly discard from the search uninteresting parts of the tree). We develop an analytical cost model that accurately characterizes the performance of QIC-M-tree and validate such model through extensive experimentation on real metric data sets. In particular, our analysis is able to predict the best evaluation strategy (i.e. which distances to use) under a variety of configurations, by properly taking into account relevant factors such as the distribution of distances, the cost of computing distances, and the actual index structure. We also prove that the overall saving in CPU search costs when using an approximate distance can be estimated by using information on the data set only -thus such measure is independent of the underlying access method -and show that performance results are closely related to a novel "indexing" error measure. Finally, we show how our results apply to other MAMs and query types.
We consider the computational problem of nding nearest neighbors in general metric spaces. Of particular interest are spaces that may not be conveniently embedded or approximated in Euclidian space, or where the dimensionality of a Euclidian representation is very high. Also relevant are high-dimensional Euclidian settings in which the distribution of data is in some sense of lower dimension and embedded in the space. The vp-tree (vantage point tree) is introduced in several forms, together with associated algorithms, as an improved method for these di cult search problems. Tree construction executes in O(n log(n)) time, and search is under certain circumstances and in the limit, O(log(n)) expected time. The theoretical basis for this approach is developed and the results of several experiments are reported. In Euclidian cases, kd-tree performance is compared.
Journal of Information and Data Management
Many applications rely on spatial information retrieval, which involves costly computational geometric algorithms to process spatial queries. Spatial approximations simplify the geometric shape of complex spatial objects, allowing faster spatial queries at the expense of result accuracy. In this sense, spatial approximations have been employed to efficiently reduce the number of objects under consideration, followed by a refinement step to restore accuracy. For instance, spatial index structures employ spatial approximations to organize spatial objects in hierarchical structures (e.g., the R-tree). It leads to the interest in studying how spatial approximations can be efficiently employed to improve spatial query processing. This article presents a systematic review on this topic. We gather relevant studies by performing a search string on several digital libraries. We further expand the studies under consideration by employing a single iteration of the snowballing approach, where w...
1997
M-tree is a dynamic access method suitable to index generic "metric spaces", where the function used to compute the distance between any two objects satisfies the positivity, symmetry, and triangle inequality postulates. The M-tree design fulfills typical requirements of multimedia applications, where objects are indexed using complex features, and similarity queries can require application of time-consuming distance functions. In this paper we describe the basic search and management algorithms of M-tree, introduce several heuristic split policies, and experimentally evaluate them, considering both I/O and CPU costs. Results also show that M-tree performs better than R * -tree on highdimensional vector spaces. * This work has been partially supported by ESPRIT LTR project no. 9141, HERMES (Foundations of High Performance Multimedia Information Management Systems). P. Zezula has also been supported by Grants GACR No. 102/96/0986 and KON-TAKT No. PM96 S028.
arXiv (Cornell University), 2015
Emerging location-based systems and data analysis frameworks requires efficient management of spatial data for approximate and exact search. Exact similarity search can be done using space partitioning data structures, such as KD-tree, R*-tree, and ball-tree. In this paper, we focus on ball-tree, an efficient search tree that is specific for spatial queries which use euclidean distance. Each node of a ball-tree defines a ball, i.e. a hypersphere that contains a subset of the points to be searched. In this paper, we propose ball*-tree, an improved ball-tree that is more efficient for spatial queries. Ball*-tree enjoys a modified space partitioning algorithm that considers the distribution of the data points in order to find an efficient splitting hyperplane. Also, we propose a new algorithm for KNN queries with restricted range using ball*-tree, which performs better than both KNN and range search for such queries. Results show that ball*-tree performs 39%-57% faster than the original ball-tree algorithm.
Chilean Computer Science Society, …, 2003
The Dynamic Spatial Approximation Tree (dsa-tree) is a recently proposed data structure for searching in metric spaces. It has been shown that it compares favorably against alternative data structures in spaces of high dimension or queries with low selectivity. The dsa-tree supports insertion and deletions of elements. However, it has been noted that deletions degrade the structure over time, so the structure cannot be regarded as fully dynamic in the sense that deletions are not sustainable for long periods of time.
Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '99, 1999
We consider the problem of performing nearest-neighbor queries e ciently over large high-dimensional databases. Assuming that a full database scan to determine the nearest neighborentries is not acceptable, we study the possibility of constructing an index structure over the database. It is well-accepted that traditional database indexing algorithms fail for high-dimensional data (say d > 10 or 20 depending on the scheme). Some arguments have a d v ocated that nearest-neighbor queries do not even make sense for high-dimensional data since the ratio of maximum and minimum distance goes to 1 as dimensionality increases. We show that these arguments are based on over-restrictive assumptions, and that in the general case it is meaningful and possible to perform such queries. We present an approach for deriving a multidimensional index to support approximate nearestneighbor queries over large databases. Our approach, called DBIN, scales to high-dimensional databases by exploiting statistical properties of the data. The approach is based on statistically modeling the density of the content of the data table. DBIN uses the density model to derive a single index over the data table and requires physically rewriting data in a new table sorted by the newly created index (i.e. create what is known as a clustered-index in the database literature). The indexing scheme produces a mapping between a query point (a data record) and an ordering on the clustered index values. Data is then scanned according to the index until the probability that the nearest-neighbor has been found exceeds some threshold. We present theoretical and empirical justi cation for DBIN. The scheme supports a family of distance functions which includes the traditional Euclidean distance measure.
Loading Preview
Sorry, preview is currently unavailable. You can download the paper by clicking the button above.