2009, ACM Journal of Experimental Algorithmics
Proximity searching consists of retrieving from a database those elements that are similar to a query object. The usual model for proximity searching is a metric space where the distance, which models the proximity, is expensive to compute. An index uses precomputed distances to speed up query processing. Among all the known indices, the baseline for performance for about 20 years has been AESA. This index uses an iterative procedure, where at each iteration it first chooses the next promising element (“pivot”) to compare to the query, and then it discards database elements that can be proved not relevant to the query using the pivot. The next pivot in AESA is chosen as the one minimizing the sum of lower bounds to the distance to the query proved by previous pivots. In this article, we introduce the new index iAESA, which establishes a new performance baseline for metric space searching. The difference from AESA is the method to select the next pivot. In iAESA, each candidate sorts...
Lecture Notes in Computer Science, 2004
A number of problems in computer science can be solved efficiently with the so-called memory-based or kernel methods. Among these problems (relevant to the AI community) are multimedia indexing, clustering, unsupervised learning and recommendation systems. The common ground of these problems is satisfying proximity queries with an abstract metric database. In this paper we introduce a new technique for making practical indexes for metric range queries. This technique improves existing algorithms based on pivots and signatures, and introduces a new data structure, the Fixed Queries Trie, to speed up metric range queries. The result is an O(n) construction time index, with query complexity O(n^α), α ≤ 1. The indexing algorithm uses only a few bits of storage for each database element.
ACM Transactions on Database Systems, 2002
Metric access methods (MAMs), such as the M-tree, are powerful index structures for supporting similarity queries on metric spaces, which represent a common abstraction for those searching problems that arise in many modern application areas, such as multimedia, data mining, decision support, pattern recognition, and genomic databases. As compared to multi-dimensional (spatial) access methods (SAMs), MAMs are more general, yet they are reputed to lose in flexibility, since it is commonly deemed that they can only answer queries using the same distance function used to build the index. In this paper we show that this limitation is only apparent (thus MAMs are far more flexible than believed) and extend the M-tree so as to be able to support user-defined distance criteria, approximate distance functions to speed up query evaluation, as well as dissimilarity functions which are not metrics. The so-extended M-tree, also called QIC-M-tree, can deal with three distinct distances at a time: 1) a query (user-defined) distance, 2) an index distance (used to build the tree), and 3) a comparison (approximate) distance (used to quickly discard from the search uninteresting parts of the tree). We develop an analytical cost model that accurately characterizes the performance of QIC-M-tree and validate this model through extensive experimentation on real metric data sets. In particular, our analysis is able to predict the best evaluation strategy (i.e. which distances to use) under a variety of configurations, by properly taking into account relevant factors such as the distribution of distances, the cost of computing distances, and the actual index structure. We also prove that the overall saving in CPU search costs when using an approximate distance can be estimated by using information on the data set only (thus this measure is independent of the underlying access method), and show that performance results are closely related to a novel "indexing" error measure. Finally, we show how our results apply to other MAMs and query types.
Lecture Notes in Computer Science, 2006
Proximity searching consists in retrieving from a database those elements that are similar to a query. As the distance is usually expensive to compute, the goal is to use as few distance computations as possible to satisfy queries. Indexes use precomputed distances among database elements to speed up queries. As such, a baseline is AESA, which stores all the distances among database objects, but has been unbeaten in query performance for 20 years. In this paper we show that it is possible to improve upon AESA by using a radically different method to select promising database elements to compare against the query. Our experiments show improvements of up to 75% in document databases. We also explore the usage of our method as a probabilistic algorithm that may lose relevant answers. On a database of faces where any exact algorithm must examine virtually all elements, our probabilistic version obtains 85% of the correct answers by scanning only 10% of the database.
Multimedia Tools and Applications, 2001
Pivot-based algorithms are effective tools for proximity searching in metric spaces. They allow trading space overhead for number of distance evaluations performed at query time. With additional search structures (that pose extra space overhead) they can also reduce the amount of side computations. We introduce a new data structure, the Fixed Queries Array (FQA), whose novelties are (1) it permits sublinear extra CPU time without any extra data structure; (2) it permits trading number of pivots for their precision so as to make better use ...
Lecture Notes in Computer Science, 2012
Some metric indexes, like the pivot-based family, can natively trade space for query time. Other indexes may have a small memory footprint and still outperform the pivot-based approach, but are unable to increase the memory usage to boost the query time. In this paper we propose a new metric indexing technique with an algorithmic mechanism to lift the performance of otherwise rigid metric indexes. We selected the well-known List of Clusters (LC) as the base data structure, obtaining an index which is orders of magnitude faster to build, with memory usage adaptable to the intrinsic dimension of the data, and faster at query time than the original LC. We also present a nearest neighbor algorithm, of independent interest, which is optimal in the sense that it requires the same number of distance computations as a range query with the radius of the nearest neighbor. We present exhaustive experimental evidence supporting our claims, for both synthetic and real-world datasets.
Pattern Recognition Letters, 2011
This work focuses on fast nearest neighbor (NN) search algorithms that can work in any metric space (not just the Euclidean distance) and where the distance computation is very time consuming. One of the best-known methods in this field is the AESA algorithm, used as a baseline for performance measurement for over twenty years. AESA works in two repeating steps: first it searches for a promising NN candidate and computes its distance to the query (approximation step); next it eliminates all the unsuitable NN candidates in view of the new information acquired in the previous calculation (elimination step).
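The two alternating steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `aesa_nn`, the full in-memory distance matrix, and the choice of the candidate with the smallest accumulated lower bound as the "promising" one are assumptions made for the example.

```python
import math


def aesa_nn(db, dist_matrix, dist, query):
    """Illustrative AESA-style nearest-neighbor search (hypothetical helper).

    db          -- list of database objects
    dist_matrix -- dist_matrix[i][j] = dist(db[i], db[j]), precomputed
    dist        -- the (expensive) metric
    Returns (index of nearest neighbor, its distance, distance evaluations).
    """
    n = len(db)
    alive = set(range(n))
    lower_sum = [0.0] * n  # accumulated lower bounds, used to pick candidates
    lower_max = [0.0] * n  # tightest lower bound so far, used to eliminate
    best_idx, best_d = None, math.inf
    evals = 0

    while alive:
        # Approximation step: take the most promising remaining candidate.
        p = min(alive, key=lambda i: lower_sum[i])
        alive.remove(p)
        d_qp = dist(query, db[p])
        evals += 1
        if d_qp < best_d:
            best_idx, best_d = p, d_qp

        # Elimination step: by the triangle inequality,
        # dist(query, db[i]) >= |d_qp - dist(db[p], db[i])|.
        for i in list(alive):
            lb = abs(d_qp - dist_matrix[p][i])
            lower_sum[i] += lb
            lower_max[i] = max(lower_max[i], lb)
            if lower_max[i] > best_d:
                alive.remove(i)  # provably farther than the current best

    return best_idx, best_d, evals


# Usage on a toy 1-D metric space (absolute difference).
points = [0.0, 1.0, 4.0, 9.0, 10.0]
metric = lambda a, b: abs(a - b)
matrix = [[metric(a, b) for b in points] for a in points]
nearest, distance, used = aesa_nn(points, matrix, metric, 3.2)
```

The quadratic-size matrix is what makes this kind of index expensive in space; the sketch only shows how stored distances replace distance evaluations at query time.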
Lecture Notes in Computer Science, 2005
Proximity searching consists in retrieving from a database objects that are close to a query. For this type of searching problem, the most general model is the metric space, where proximity is defined in terms of a distance function. A solution for this problem consists in building an offline index to quickly satisfy online queries. The ultimate goal is to use as few distance computations as possible to satisfy queries, since the distance is considered expensive to compute. Proximity searching is central to several applications, ranging from multimedia indexing and querying to data compression and clustering. In this paper we present a new approach to solve the proximity searching problem. Our solution is based on indexing the database with the k-nearest neighbor graph (knng), which is a directed graph connecting each element to its k closest neighbors. We present two search algorithms for both range and nearest neighbor queries which use navigational and metrical features of the knng graph. We show that our approach is competitive against current ones. For instance, in the document metric space our nearest neighbor search algorithms perform 30% more distance evaluations than AESA using only 0.25% of its space requirement. In the same space, the pivot-based technique is completely useless.
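To make the graph definition concrete, here is a small brute-force sketch of a k-nearest neighbor graph as a directed adjacency list. The function name `build_knng` and the quadratic construction are assumptions for illustration only; the abstract does not say how the graph is built, and the search algorithms themselves are not reproduced here.

```python
def build_knng(db, dist, k):
    """Brute-force k-nearest neighbor graph (illustrative, O(n^2) distances).

    Returns a dict mapping each object index to the indices of its k closest
    neighbors, ordered from closest to farthest. Ties are broken by index.
    """
    graph = {}
    for i, u in enumerate(db):
        others = sorted((dist(u, v), j) for j, v in enumerate(db) if j != i)
        graph[i] = [j for _, j in others[:k]]
    return graph


# Usage: the graph is directed, so an edge i -> j does not imply j -> i.
points = [0, 1, 2, 10]
knng = build_knng(points, lambda a, b: abs(a - b), k=2)
```

In this toy example the outlier `10` points to its two closest elements, but no element points back to it, which is the asymmetry the word "directed" refers to.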
Pattern Recognition Letters, 2003
With few exceptions, proximity search algorithms in metric spaces based on the use of pivots select them at random among the objects of the metric space. However, it is well known that the way in which the pivots are selected can drastically affect the performance of the algorithm. Between two sets of pivots of the same size, better chosen pivots can largely reduce the search time. Alternatively, a better chosen small set of pivots (requiring much less space) can yield the same efficiency as a larger, randomly chosen, set. We propose an efficiency measure to compare two pivot sets, combined with an optimization technique that allows us to select good sets of pivots. We obtain abundant empirical evidence showing that our technique is effective, and it is the first that we are aware of in producing consistently good results in a wide variety of cases and in being based on a formal theory. We also show that good pivots are outliers, but that selecting outliers does not ensure that good pivots are selected.
1990
In applications such as vision and molecular biology, a common problem is to find the objects similar to a given target (according to some distance measure) in a large database. This paper presents a scheme for query processing in such situations. The basic strategy is to (partially) precompute inter-object distances, and by using the distance information and the triangle inequality, we eliminate the need to calculate certain object distances while evaluating queries. We propose several heuristics that may speed up query evaluation. A series of experiments are then performed to evaluate the effectiveness of our scheme and the relative performance of the heuristics for different queries. Finally we investigate the possibility of parallelizing our scheme through simulation. Our results show that parallelism is best applied in the later stages in evaluating a query.
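The basic strategy of combining precomputed distances with the triangle inequality can be illustrated with a short range-query sketch. It assumes that distances are precomputed only to a few reference objects (called `pivots` here); that restriction, and the names `build_pivot_table` and `range_query`, are choices made for this example rather than details taken from the paper.

```python
def build_pivot_table(db, pivots, dist):
    """Precompute dist(object, pivot) for every object and pivot."""
    return [[dist(obj, p) for p in pivots] for obj in db]


def range_query(db, pivots, table, dist, query, radius):
    """Return (objects within radius of query, number of distance evaluations).

    For any pivot p, the triangle inequality gives
        dist(query, obj) >= |dist(query, p) - dist(obj, p)|,
    so an object whose bound exceeds the radius is discarded without
    ever computing its distance to the query.
    """
    q_to_p = [dist(query, p) for p in pivots]
    evals = len(pivots)
    result = []
    for obj, row in zip(db, table):
        if any(abs(dq - do) > radius for dq, do in zip(q_to_p, row)):
            continue  # eliminated by a precomputed distance
        evals += 1
        if dist(query, obj) <= radius:
            result.append(obj)
    return result, evals


# Usage on a toy 1-D space with two pivots.
points = list(range(0, 100, 5))
metric = lambda a, b: abs(a - b)
refs = [0, 95]
table = build_pivot_table(points, refs, metric)
matches, used = range_query(points, refs, table, metric, 42, 6)
```

How many objects survive the filter depends on the data and on the choice of reference objects; the one-dimensional example is unusually favorable because the lower bound there equals the true distance.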
Iberoamerican Congress on Pattern Recognition CIARP, 2009
Many pattern recognition tasks can be modeled as proximity searching. Here the common task is to quickly find all the elements close to a given query without sequentially scanning a very large database. A recent shift in the searching paradigm has been established by using permutations instead of distances to predict proximity. Every object in the database records how the set of reference objects (the permutants) is seen, i.e. only the relative positions are used. When a query arrives, the relative displacements of the permutants between the query and a particular object are measured. This approach turned out to be the most efficient and scalable, at the expense of losing recall in the answers. The permutation of every object is represented with κ short integers in practice, producing bulky indexes of 16κn bits. In this paper we show how to represent the permutation as a binary vector, using just one bit for each permutant (instead of log κ in the plain representation). The Hamming distance in the binary signature is then used to predict proximity between objects in the database. We tested this approach with many real-life metric databases, obtaining faster queries with a recall close to the Spearman ρ using 16 times less space.
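The plain permutation representation that this abstract starts from can be sketched as below: each object stores the order in which it sees the permutants, and proximity is predicted from how far each permutant is displaced between two such orders. The helper names are hypothetical, Spearman ρ is taken here as the sum of squared position differences, and the paper's one-bit encoding is deliberately not reproduced because the abstract does not specify it.

```python
def permutation(obj, permutants, dist):
    """Indices of the permutants ordered from closest to farthest from obj."""
    return sorted(range(len(permutants)),
                  key=lambda i: (dist(obj, permutants[i]), i))


def positions(perm):
    """Inverse permutation: positions(perm)[p] is the rank of permutant p."""
    pos = [0] * len(perm)
    for rank, pid in enumerate(perm):
        pos[pid] = rank
    return pos


def spearman_rho(perm_a, perm_b):
    """Sum of squared displacements of each permutant between two orders."""
    pa, pb = positions(perm_a), positions(perm_b)
    return sum((x - y) ** 2 for x, y in zip(pa, pb))


# Usage: objects that see the permutants in a similar order are predicted
# to be close, without computing the distance between them.
metric = lambda a, b: abs(a - b)
permutants = [0, 10, 20, 30]
near_a = permutation(1, permutants, metric)
near_b = permutation(2, permutants, metric)
far = permutation(29, permutants, metric)
same_side = spearman_rho(near_a, near_b)
opposite = spearman_rho(near_a, far)
```

Storing each of the κ positions as a 16-bit integer gives the 16κn bits quoted above; one bit per permutant gives κn bits, which is where the factor of 16 comes from.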
Lecture Notes in Computer Science, 2015
Proximity searching consists in retrieving objects out of a database similar to a given query. Nowadays, when multimedia databases are growing, this is an elementary task. The permutation based index (PBI) and its variants are excellent techniques to solve proximity searching in high-dimensional spaces; however, they have been outperformed in low-dimensional ones. Another drawback of the PBI is that the distance between permutations does not allow elements to be discarded safely when solving similarity queries. In the following, we introduce an improvement on the PBI that produces a more promising order using less space than the basic permutation technique and also gives us information to discard some elements. To do so, besides the permutations, we quantize distance information by defining distance rings around each permutant, and we also keep this data. The experimental evaluation shows we can dramatically improve upon specialized techniques in low-dimensional spaces. For instance, in the real-world dataset of NASA images, our boosted PBI uses up to 90% fewer distance evaluations than AESA, the state-of-the-art searching algorithm with the best performance in this particular space.
First DELOS Network of …, 2000
The problem of approximated similarity search for the range and nearest neighbor queries is investigated for generic metric spaces. The search speedup is achieved by ignoring data regions with a small, user defined, proximity with respect to the query. For zero proximity, exact ...
ACM Transactions on Information Systems, 2003
Similarity search structures for metric data typically bound object partitions by ball regions. Since regions can overlap, a relevant issue is to estimate the proximity of regions in order to predict the number of objects in the regions' intersection. This paper analyzes the problem using a probabilistic approach and provides a solution that effectively computes the proximity through realistic heuristics that only require small amounts of auxiliary data. An extensive simulation to validate the technique is provided. An application is developed to demonstrate how the proximity measure can be successfully applied to the approximate similarity search. Search speedup is achieved by ignoring data regions whose proximity to the query region is smaller than a user-defined threshold. This idea is implemented in a metric tree environment for the similarity range and "nearest neighbors" queries. Several measures of efficiency and effectiveness are applied to evaluate proposed approximate search algorithms on real-life data sets. An analytical model is developed to relate proximity parameters and the quality of search. Improvements of two orders of magnitude are achieved for moderately approximated search results. We demonstrate that the precision of proximity measures can significantly influence the quality of approximated algorithms.
Lecture Notes in Computer Science, 2014
The permutation based index has been shown to be very effective in medium and high dimensional metric spaces, even in difficult problems such as solving reverse k-nearest neighbor queries. Nevertheless, there is currently no study of which features are desirable in a permutant set, or of how to select good permutants. Similar to the case of pivots, our experimental results show that, compared with a randomly chosen set, a good permutant set yields faster query response or reduces the amount of space used by the index. In this paper, we start by characterizing permutants and studying their predictive power; then we propose an effective heuristic to select a good set of permutant candidates. We also show empirical evidence that supports our technique.
2003
In order to speed up retrieval in large collections of data, index structures partition the data into subsets so that query requests can be evaluated without examining the entire collection. As the complexity of modern data types grows, metric spaces have become a popular paradigm for similarity retrieval.
2010
Similarity search in high-dimensional metric spaces is a key operation in many applications, such as multimedia databases, image retrieval, object recognition, and others. The high dimensionality of the data requires special index structures to facilitate the search. A problem regarding the creation of suitable index structures for high-dimensional data is the relationship between the geometry of the data and the organization of an index structure. In this paper, we study the performance of a new index structure, called Divisive-Agglomerative Hierarchical Clustering tree (DAHC-tree), which reduces the effects imposed by the above liability. DAHC-tree is constructed by dividing and grouping the data set into compact clusters. We perform a rigorous experimental design and analyze the trade-offs involved in building such an index structure. Additionally, we present extensive experiments comparing our method against state-of-the-art exact and approximate solutions. The conducted analysis and the reported comparative test results demonstrate that our technique significantly improves the performance of similarity queries.
Similarity Search and Applications 12th International Conference, SISAP 2019, 2019
Many approaches for approximate metric search rely on a permutation-based representation of the original data objects. The main advantage of transforming metric objects into permutations is that the latter can be efficiently indexed and searched using data structures such as inverted files and prefix trees. Typically, the permutation is obtained by ordering the identifiers of a set of pivots according to their distances to the object to be represented. In this paper, we present a novel approach to transform metric objects into permutations. It uses the object-pivot distances in combination with a metric transformation, called n-Simplex projection. The resulting permutation-based representation, named SPLX-Perm, is suitable only for the large class of metric spaces satisfying the n-point property. We tested the proposed approach on two benchmarks for similarity search. Our preliminary results are encouraging and open new perspectives for further investigations on the use of the n-Simplex projection for supporting permutation-based indexing.
1999
We investigate the problem of approximate similarity (nearest neighbor) search in high-dimensional metric spaces, and describe how the distance distribution of the query object can be exploited so as to provide probabilistic guarantees on the quality of the result. This leads to a new paradigm for similarity search, called PAC-NN (probably approximately correct nearest neighbor) queries, aiming to break the "dimensionality curse". PAC-NN queries return, with probability at least 1 − δ, a (1 + ε)-approximate NN, that is, an object whose distance from the query q is less than (1 + ε) times the distance between q and its NN. Analytical and experimental results obtained for sequential and index-based algorithms show that PAC-NN queries can be efficiently processed even on very high-dimensional spaces and that control can be exerted in order to trade off the accuracy of the result and the cost.
Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems - PODS '98, 1998
We consider the problem of estimating CPU (distance computations) and I/O costs for processing range and k-nearest neighbors queries over metric spaces. Unlike the specific case of vector spaces, where information on data distribution has been exploited to derive cost models for predicting the performance of multi-dimensional access methods, in a generic metric space there is no such possibility, which makes the problem quite different and requires a novel approach. We insist that the distance distribution of objects can be profitably used to solve the problem, and consequently develop a concrete cost model for the M-tree access method [10]. Our results rely on the assumption that the indexed dataset comes from a metric space which is "homogeneous" enough (in a probabilistic sense) to allow reliable cost estimations even if the distance distribution with respect to a specific query object is unknown. We experimentally validate the model over both real and synthetic datasets, and show how the model can be used to tune the M-tree in order to minimize a combination of CPU and I/O costs. Finally, we sketch how the same approach can be applied to derive a cost model for the vp-tree index structure.
Database and Expert Systems Applications, 2009
Metric access methods (MAMs) serve as a tool for speeding up similarity queries. However, all MAMs developed so far are index-based; they need to build an index on a given database. The indexing itself is either static (the whole database is indexed at once) or dynamic (insertions/deletions are supported), but there is always a preprocessing step needed. In this paper, we propose D-file, the first MAM that requires no indexing at all. This feature is especially beneficial in domains like data mining, streaming databases, etc., where the production of data is much more intensive than querying. Thus, in such environments the indexing is the bottleneck of the entire production/querying scheme. The idea of D-file is an extension of the trivial sequential file (an abstraction over the original database, actually) by the so-called D-cache. The D-cache is a main-memory structure that keeps track of distance computations spent by processing all similarity queries so far (within a runtime session). Based on the distances stored in D-cache, the D-file can cheaply determine lower bounds of some distances without the distances themselves having to be explicitly computed, which results in faster queries. Our experimental evaluation shows that the query efficiency of D-file is comparable to the index-based state-of-the-art MAMs, but at zero indexing cost.