2004, Lecture Notes in Computer Science
A number of problems in computer science can be solved efficiently with the so-called memory-based or kernel methods. Among these problems (relevant to the AI community) are multimedia indexing, clustering, unsupervised learning and recommendation systems. The common ground of these problems is satisfying proximity queries with an abstract metric database. In this paper we introduce a new technique for making practical indexes for metric range queries. This technique improves existing algorithms based on pivots and signatures, and introduces a new data structure, the Fixed Queries Trie, to speed up metric range queries. The result is an O(n) construction time index, with query complexity O(n^α), α ≤ 1. The indexing algorithm uses only a few bits of storage for each database element.
Multimedia Tools and Applications, 2001
Pivot-based algorithms are effective tools for proximity searching in metric spaces. They allow trading space overhead for number of distance evaluations performed at query time. With additional search structures (that pose extra space overhead) they can also reduce the amount of side computations. We introduce a new data structure, the Fixed Queries Array (FQA), whose novelties are (1) it permits sublinear extra CPU time without any extra data structure; (2) it permits trading number of pivots for their precision so as to make better use ...
Lecture Notes in Computer Science, 2005
Proximity searching consists in retrieving from a database the objects that are close to a query. For this type of searching problem, the most general model is the metric space, where proximity is defined in terms of a distance function. A solution for this problem consists in building an offline index to quickly satisfy online queries. The ultimate goal is to use as few distance computations as possible to satisfy queries, since the distance is considered expensive to compute. Proximity searching is central to several applications, ranging from multimedia indexing and querying to data compression and clustering. In this paper we present a new approach to solve the proximity searching problem. Our solution is based on indexing the database with the k-nearest neighbor graph (knng), which is a directed graph connecting each element to its k closest neighbors. We present two search algorithms for both range and nearest neighbor queries which use navigational and metrical features of the knng. We show that our approach is competitive against current ones. For instance, in the document metric space our nearest neighbor search algorithms perform 30% more distance evaluations than AESA using only 0.25% of its space requirement. In the same space, the pivot-based technique is completely useless.
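As an illustration of the knng definition above, here is a minimal sketch that builds the graph by brute force. It is not the construction or search algorithm of the paper: the function name `build_knng` and the quadratic all-pairs scan are assumptions made only to show what the directed graph contains.

```python
import heapq
from typing import Callable, Dict, List, Sequence, TypeVar

T = TypeVar("T")


def build_knng(db: Sequence[T], dist: Callable[[T, T], float], k: int) -> Dict[int, List[int]]:
    """Brute-force k-nearest neighbor graph (hypothetical helper).

    Returns a dict mapping each element index to the indices of its k
    closest neighbors, nearest first. Uses n*(n-1) distance evaluations,
    so it is only meant for small examples.
    """
    graph: Dict[int, List[int]] = {}
    for i, x in enumerate(db):
        # Pair every other element with its distance to x.
        candidates = ((dist(x, y), j) for j, y in enumerate(db) if j != i)
        # Keep the k smallest (distance, index) pairs.
        graph[i] = [j for _, j in heapq.nsmallest(k, candidates)]
    return graph


if __name__ == "__main__":
    points = [0, 1, 2, 10]
    graph = build_knng(points, lambda a, b: abs(a - b), k=2)
    for node, neighbors in graph.items():
        print(points[node], "->", [points[j] for j in neighbors])
```

The graph is directed: in the example, 10 points to 2, but 2 does not point back to 10, because 2 has closer neighbors.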
Lecture Notes in Computer Science, 2006
Proximity searching consists in retrieving from a database those elements that are similar to a query. As the distance is usually expensive to compute, the goal is to use as few distance computations as possible to satisfy queries. Indexes use precomputed distances among database elements to speed up queries. As such, a baseline is AESA, which stores all the distances among database objects, but has been unbeaten in query performance for 20 years. In this paper we show that it is possible to improve upon AESA by using a radically different method to select promising database elements to compare against the query. Our experiments show improvements of up to 75% in document databases. We also explore the usage of our method as a probabilistic algorithm that may lose relevant answers. On a database of faces where any exact algorithm must examine virtually all elements, our probabilistic version obtains 85% of the correct answers by scanning only 10% of the database.
Lecture Notes in Computer Science, 2012
Some metric indexes, like the pivot-based family, can natively trade space for query time. Other indexes may have a small memory footprint and still outperform the pivot-based approach, but are unable to increase the memory usage to boost the query time. In this paper we propose a new metric indexing technique with an algorithmic mechanism to lift the performance of otherwise rigid metric indexes. We selected the well-known List of Clusters (LC) as the base data structure, obtaining an index which is orders of magnitude faster to build, with memory usage adaptable to the intrinsic dimension of the data, and faster at query time than the original LC. We also present a nearest neighbor algorithm, of independent interest, which is optimal in the sense that it requires the same number of distance computations as a range query with the radius of the nearest neighbor. We present exhaustive experimental evidence supporting our claims, for both synthetic and real-world datasets.
2003
In order to speed up retrieval in large collections of data, index structures partition the data into subsets so that query requests can be evaluated without examining the entire collection. As the complexity of modern data types grows, metric spaces have become a popular paradigm for similarity retrieval.
Database and Expert Systems Applications, 2009
Metric access methods (MAMs) serve as a tool for speeding up similarity queries. However, all MAMs developed so far are index-based; they need to build an index on a given database. The indexing itself is either static (the whole database is indexed at once) or dynamic (insertions/deletions are supported), but a preprocessing step is always needed. In this paper, we propose D-file, the first MAM that requires no indexing at all. This feature is especially beneficial in domains like data mining, streaming databases, etc., where the production of data is much more intensive than querying. Thus, in such environments the indexing is the bottleneck of the entire production/querying scheme. The idea of D-file is an extension of the trivial sequential file (an abstraction over the original database, actually) by the so-called D-cache. The D-cache is a main-memory structure that keeps track of distance computations spent by processing all similarity queries so far (within a runtime session). Based on the distances stored in D-cache, the D-file can cheaply determine lower bounds of some distances without computing those distances explicitly, which results in faster queries. Our experimental evaluation shows that the query efficiency of D-file is comparable to the index-based state-of-the-art MAMs, but at zero indexing cost.
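The lower-bounding idea behind a distance cache can be sketched as follows. The abstract does not describe the internal layout of the D-cache, so the class `DCache`, its methods, and the use of object identifiers as keys are all assumptions; only the triangle-inequality bound |d(q, p) - d(p, x)| ≤ d(q, x) is standard.

```python
from typing import Dict, FrozenSet, Hashable, Optional


class DCache:
    """Hypothetical in-memory store of distances computed so far."""

    def __init__(self) -> None:
        self._store: Dict[FrozenSet[Hashable], float] = {}

    def put(self, a_id: Hashable, b_id: Hashable, d: float) -> None:
        # Distances are symmetric, so the key is an unordered pair.
        self._store[frozenset((a_id, b_id))] = d

    def get(self, a_id: Hashable, b_id: Hashable) -> Optional[float]:
        return self._store.get(frozenset((a_id, b_id)))

    def lower_bound(self, q_to_known: Dict[Hashable, float], x_id: Hashable) -> float:
        """Lower bound on d(q, x) from cached distances.

        q_to_known maps identifiers of objects p (for example earlier
        queries) to the already computed distance d(q, p). For each p with
        a cached d(p, x), the triangle inequality gives
        |d(q, p) - d(p, x)| <= d(q, x).
        """
        bound = 0.0
        for p_id, d_qp in q_to_known.items():
            d_px = self.get(p_id, x_id)
            if d_px is not None:
                bound = max(bound, abs(d_qp - d_px))
        return bound


if __name__ == "__main__":
    cache = DCache()
    cache.put("q1", "x", 5.0)  # computed while answering an earlier query q1
    # New query q with d(q, q1) = 2.0: x is at least 3.0 away from q.
    print(cache.lower_bound({"q1": 2.0}, "x"))
```

If the bound exceeds the query radius, the object can be skipped without evaluating the distance; when nothing relevant is cached the bound is 0 and the distance has to be computed.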
Lecture Notes in Computer Science, 2015
Proximity searching consists in retrieving objects from a database that are similar to a given query. Nowadays, as multimedia databases keep growing, this is an elementary task. The permutation based index (PBI) and its variants are excellent techniques to solve proximity searching in high dimensional spaces; however, they have been surpassed in low dimensional ones. Another drawback of the PBI is that the distance between permutations does not allow elements to be discarded safely when solving similarity queries. In the following, we introduce an improvement on the PBI that produces a better promissory order using less space than the basic permutation technique and also gives us information to discard some elements. To do so, besides the permutations, we quantize distance information by defining distance rings around each permutant, and we also keep this data. The experimental evaluation shows we can dramatically improve upon specialized techniques in low dimensional spaces. For instance, in the real-world dataset of NASA images, our boosted PBI uses up to 90% fewer distance evaluations than AESA, the state-of-the-art searching algorithm with the best performance in this particular space.
2009
Mobile query processing is currently a very active research field. Range and nearest neighbor queries are commonly used in spatiotemporal databases and location based services (LBS). In this paper, we focus on finding nearest neighbors of a query point within a certain distance range. We propose a new indexing structure, CN-tree (Compact N-tree), based on a recent indexing technique called N-tree. CN-tree combines the efficiency of N-tree's data partitioning scheme with the approximation of relevant objects by the minimal bounding rectangles of R-trees, which are reported to be the best performing for range search. We show how we use the approximation in constructing CN-tree and, then, how this index can support range queries efficiently by minimizing computation of distances and avoiding overlapping of minimal bounding rectangles. The experimental results, through the comparison with the well-known R*-tree, show that the proposed CN-tree widely outperforms R*-tree as an in-memory index and offers competitive performance when used as an on-disk index.
We present an approximate distance oracle for a point set $S$ with $n$ points and doubling dimension $\lambda$. For every $\varepsilon > 0$, the oracle supports $(1 + \varepsilon)$-approximate distance queries in (universal) constant time, occupies space $[\varepsilon^{-O(\lambda)} + 2^{O(\lambda \log \lambda)}]n$, and can be constructed in $[2^{O(\lambda)} \log^3 n + \varepsilon^{-O(\lambda)} + 2^{O(\lambda \log \lambda)}]n$ expected time. This improves upon the best previously known constructions. Furthermore, the oracle can be made fully dynamic with expected $O(1)$ query time and only $2^{O(\lambda)} \log n + \varepsilon^{-O(\lambda)} + 2^{O(\lambda \log \lambda)}$ update time. This is the first fully dynamic $(1 + \varepsilon)$-distance oracle.
ACM SIGMOD Record, 1997
In many database applications, one of the common queries is to find approximate matches to a given query item from a collection of data items. For example, given an image database, one may want to retrieve all images that are similar to a given query image. Distance based index structures are proposed for applications where the data domain is high dimensional, or the distance function used to compute distances between data objects is non-Euclidean. In this paper, we introduce a distance based index structure called multi-vantage point (mvp) tree for similarity queries on high-dimensional metric spaces. The mvp-tree uses more than one vantage point to partition the space into spherical cuts at each level. It also utilizes the pre-computed (at construction time) distances between the data points and the vantage points. We have done experiments to compare mvp-trees with vp-trees which have a similar partitioning strategy, but use only one vantage point at each level, and do not make use of the pre-computed distances. Empirical studies show that the mvp-tree outperforms the vp-tree by 20% to 80% for varying query ranges and different distance distributions.
1. Introduction
In many database applications, it is desirable to be able to answer queries based on proximity such as asking for data items that are similar to a query item, or that are closest to a query item. We face such queries in the context of many database applications such as genetics, image/picture databases, time series analysis, information retrieval, etc. In genetics, the concern is to find DNA or protein sequences that are similar in a genetic database. In time-series analysis, we would like to find similar patterns among a given collection of sequences. Image databases can be queried to find and retrieve images in the database that are similar to the query image with respect to a specified criterion.
*This research is partially supported by the National Science Foundation grant IRI 92-24660, and the National Science Foundation FAW award IRI-90-24152. Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. SIGMOD '97 AZ, USA
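For readers unfamiliar with the single-vantage-point baseline mentioned in the abstract, the following is a minimal sketch of a vp-tree with one spherical cut per node. It is not the mvp-tree: it uses one vantage point per level and stores no pre-computed distances. Choosing the first element as vantage point and the names `VPNode`, `build_vp_tree` and `range_search` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class VPNode:
    vantage: Any
    radius: float = 0.0               # median distance: the spherical cut
    inside: Optional["VPNode"] = None   # elements with d(vantage, x) <= radius
    outside: Optional["VPNode"] = None  # elements with d(vantage, x) > radius


def build_vp_tree(points: List[Any], dist: Callable[[Any, Any], float]) -> Optional[VPNode]:
    if not points:
        return None
    vantage, rest = points[0], points[1:]  # arbitrary vantage point choice
    if not rest:
        return VPNode(vantage)
    dists = [dist(vantage, p) for p in rest]
    radius = sorted(dists)[len(dists) // 2]
    inside = [p for p, d in zip(rest, dists) if d <= radius]
    outside = [p for p, d in zip(rest, dists) if d > radius]
    return VPNode(vantage, radius, build_vp_tree(inside, dist), build_vp_tree(outside, dist))


def range_search(node: Optional[VPNode], dist, q, r: float, out: Optional[list] = None) -> list:
    """Collect every element within distance r of q."""
    if out is None:
        out = []
    if node is None:
        return out
    d = dist(q, node.vantage)
    if d <= r:
        out.append(node.vantage)
    # Triangle inequality: the query ball can reach the inner region only
    # if d - r <= radius, and the outer region only if d + r > radius.
    if d - r <= node.radius:
        range_search(node.inside, dist, q, r, out)
    if d + r > node.radius:
        range_search(node.outside, dist, q, r, out)
    return out


if __name__ == "__main__":
    tree = build_vp_tree([5, 1, 9, 3, 7, 11, 2], lambda a, b: abs(a - b))
    print(sorted(range_search(tree, lambda a, b: abs(a - b), 4, 1.5)))
```

Whenever only one of the two conditions holds, a whole subtree is skipped, which is where the savings in distance evaluations come from.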
Iberoamerican Congress on Pattern Recognition CIARP, 2009
Many pattern recognition tasks can be modeled as proximity searching. Here the common task is to quickly find all the elements close to a given query without sequentially scanning a very large database. A recent shift in the searching paradigm has been established by using permutations instead of distances to predict proximity. Every object in the database records how the set of reference objects (the permutants) is seen, i.e. only the relative positions are used. When a query arrives, the relative displacements in the permutants between the query and a particular object are measured. This approach turned out to be the most efficient and scalable, at the expense of losing recall in the answers. The permutation of every object is represented with κ short integers in practice, producing bulky indexes of 16κn bits. In this paper we show how to represent the permutation as a binary vector, using just one bit for each permutant (instead of log κ in the plain representation). The Hamming distance in the binary signature is then used to predict proximity between objects in the database. We tested this approach with many real-life metric databases, obtaining faster queries with a recall close to the Spearman ρ using 16 times less space.
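The plain permutation representation described above can be sketched as follows. This shows only the baseline (full permutations compared with a Spearman-style sum of squared displacements), not the one-bit encoding proposed in the paper, whose details the abstract does not give. The function names and the tie-breaking by permutant index are assumptions.

```python
from typing import Any, Callable, List, Sequence


def permutation(obj: Any, permutants: Sequence[Any], dist: Callable[[Any, Any], float]) -> List[int]:
    """Permutant indices ordered from closest to farthest as seen from obj."""
    return sorted(range(len(permutants)), key=lambda i: (dist(obj, permutants[i]), i))


def spearman_rho(perm_a: Sequence[int], perm_b: Sequence[int]) -> int:
    """Sum of squared differences between the positions of each permutant."""
    pos_b = {p: i for i, p in enumerate(perm_b)}
    return sum((i - pos_b[p]) ** 2 for i, p in enumerate(perm_a))


def rank_candidates(db_perms: Sequence[Sequence[int]], query_perm: Sequence[int]) -> List[int]:
    """Database indices ordered from most to least promising."""
    return sorted(range(len(db_perms)), key=lambda i: (spearman_rho(query_perm, db_perms[i]), i))


if __name__ == "__main__":
    d = lambda a, b: abs(a - b)
    permutants = [0, 10, 20]
    db = [19, 2, 11]
    db_perms = [permutation(x, permutants, d) for x in db]  # built offline
    order = rank_candidates(db_perms, permutation(1, permutants, d))
    print([db[i] for i in order])
```

At query time only the distances from the query to the permutants are computed; the database is then reviewed in the predicted order and the scan is cut off after a chosen fraction, which is why recall can be lost.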
ACM Transactions on Database Systems, 2002
Metric access methods (MAMs), such as the M-tree, are powerful index structures for supporting similarity queries on metric spaces, which represent a common abstraction for those searching problems that arise in many modern application areas, such as multimedia, data mining, decision support, pattern recognition, and genomic databases. As compared to multi-dimensional (spatial) access methods (SAMs), MAMs are more general, yet they are reputed to lose in flexibility, since it is commonly deemed that they can only answer queries using the same distance function used to build the index. In this paper we show that this limitation is only apparent (thus MAMs are far more flexible than believed) and extend the M-tree so as to be able to support user-defined distance criteria, approximate distance functions to speed up query evaluation, as well as dissimilarity functions which are not metrics. The so-extended M-tree, also called QIC-M-tree, can deal with three distinct distances at a time: 1) a query (user-defined) distance, 2) an index distance (used to build the tree), and 3) a comparison (approximate) distance (used to quickly discard from the search uninteresting parts of the tree). We develop an analytical cost model that accurately characterizes the performance of QIC-M-tree and validate such model through extensive experimentation on real metric data sets. In particular, our analysis is able to predict the best evaluation strategy (i.e. which distances to use) under a variety of configurations, by properly taking into account relevant factors such as the distribution of distances, the cost of computing distances, and the actual index structure. We also prove that the overall saving in CPU search costs when using an approximate distance can be estimated by using information on the data set only (thus such measure is independent of the underlying access method) and show that performance results are closely related to a novel "indexing" error measure. Finally, we show how our results apply to other MAMs and query types.
Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems - PODS '98, 1998
We consider the problem of estimating CPU (distance computations) and I/O costs for processing range and k-nearest neighbors queries over metric spaces. Unlike the specific case of vector spaces, where information on data distribution has been exploited to derive cost models for predicting the performance of multi-dimensional access methods, in a generic metric space there is no such possibility, which makes the problem quite different and requires a novel approach. We insist that the distance distribution of objects can be profitably used to solve the problem, and consequently develop a concrete cost model for the M-tree access method [10]. Our results rely on the assumption that the indexed dataset comes from a metric space which is "homogeneous" enough (in a probabilistic sense) to allow reliable cost estimations even if the distance distribution with respect to a specific query object is unknown. We experimentally validate the model over both real and synthetic datasets, and show how the model can be used to tune the M-tree in order to minimize a combination of CPU and I/O costs. Finally, we sketch how the same approach can be applied to derive a cost model for the vp-tree index structure.
1990
In applications such as vision and molecular biology, a common problem is to find the objects similar to a given target (according to some distance measure) in a large database. This paper presents a scheme for query processing in such situations. The basic strategy is to (partially) precompute inter-object distances and, by using this distance information and the triangle inequality, eliminate the need to calculate certain object distances while evaluating queries. We propose several heuristics that may speed up query evaluation. A series of experiments are then performed to evaluate the effectiveness of our scheme and the relative performance of the heuristics for different queries. Finally, we investigate the possibility of parallelizing our scheme through simulation. Our results show that parallelism is best applied in the later stages of evaluating a query.
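The basic strategy described above (precomputed distances plus the triangle inequality) can be illustrated with a small pivot table. This is a generic sketch rather than the scheme or heuristics of the paper; the function names and the choice of pivots are assumptions.

```python
from typing import Any, Callable, List, Sequence, Tuple


def build_pivot_table(db: Sequence[Any], pivots: Sequence[Any],
                      dist: Callable[[Any, Any], float]) -> List[List[float]]:
    """Precompute d(x, p) for every database object x and every pivot p."""
    return [[dist(x, p) for p in pivots] for x in db]


def range_query(db: Sequence[Any], pivots: Sequence[Any], table: List[List[float]],
                dist: Callable[[Any, Any], float], q: Any, r: float) -> Tuple[List[int], int]:
    """Return (indices of objects within r of q, distance evaluations used)."""
    d_q = [dist(q, p) for p in pivots]
    evaluations = len(pivots)
    hits = []
    for i, row in enumerate(table):
        # Triangle inequality: |d(q, p) - d(x, p)| <= d(q, x) for every pivot p.
        lower = max(abs(a - b) for a, b in zip(d_q, row))
        if lower > r:
            continue  # x cannot be within r of q, so d(q, x) is never computed
        evaluations += 1
        if dist(q, db[i]) <= r:
            hits.append(i)
    return hits, evaluations


if __name__ == "__main__":
    d = lambda a, b: abs(a - b)
    db = [0, 1, 5, 9, 10, 20]
    pivots = [0, 20]
    table = build_pivot_table(db, pivots, d)
    hits, used = range_query(db, pivots, table, d, q=9.5, r=1)
    print([db[i] for i in hits], "with", used, "distance evaluations")
```

More pivots give tighter lower bounds at the cost of a larger table, which is the space-for-time trade-off that several of the abstracts on this page refer to.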
The metric space model abstracts many proximity or similarity problems, where the most frequently considered primitives are range and k-nearest neighbor search, leaving out the similarity join, an extremely important primitive. In fact, despite the great attention that this primitive has received in traditional and even multidimensional databases, little has been done for general metric databases.
IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics), 2005
In this paper, we present an embedding technique, called MetricMap, which is capable of estimating distances in a pseudometric space. Given a database of objects and a distance function for the objects, which is a pseudometric, we map the objects to vectors in a pseudo-Euclidean space with a reasonably low dimension while preserving the distance between two objects approximately. Such an embedding technique can be used as an approximate oracle to process a broad class of distance-based queries. It is also adaptable to data mining applications such as data clustering and classification. We present the theory underlying MetricMap and conduct experiments to compare MetricMap with other methods including MVP-tree and M-tree in processing the distance-based queries. Experimental results on both protein and RNA data show the good performance and the superiority of MetricMap over the other methods.
Pattern Recognition Letters, 2011
This work focuses on fast nearest neighbor (NN) search algorithms that can work in any metric space (not just the Euclidean distance) and where the distance computation is very time consuming. One of the best-known methods in this field is the AESA algorithm, used as a baseline for performance measurement for over twenty years. AESA works in two steps that repeat: first it searches for a promising NN candidate and computes its distance (approximation step); next it eliminates all the unsuitable NN candidates in view of the new information acquired in the previous calculation (elimination step).
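The two repeating steps can be sketched as below. This is a simplified reading of AESA, not the exact published algorithm: it assumes the full matrix of inter-object distances is available and uses the largest known lower bound as the approximation criterion, which is one common variant. The names `aesa_nn` and `distance_matrix` are hypothetical.

```python
from typing import Any, Callable, List, Sequence, Tuple


def distance_matrix(db: Sequence[Any], dist: Callable[[Any, Any], float]) -> List[List[float]]:
    """All pairwise distances, computed once at indexing time."""
    return [[dist(a, b) for b in db] for a in db]


def aesa_nn(db: Sequence[Any], matrix: List[List[float]],
            dist: Callable[[Any, Any], float], q: Any) -> Tuple[int, float, int]:
    """Return (index of a nearest neighbor, its distance, evaluations used)."""
    alive = set(range(len(db)))
    lower = [0.0] * len(db)  # best known lower bound on d(q, x)
    best, best_d, evaluations = -1, float("inf"), 0
    while alive:
        # Approximation step: pick the candidate with the smallest lower bound.
        c = min(alive, key=lambda i: (lower[i], i))
        alive.remove(c)
        d = dist(q, db[c])
        evaluations += 1
        if d < best_d:
            best, best_d = c, d
        # Elimination step: tighten bounds with the new distance and drop
        # candidates that cannot be strictly closer than the current best.
        for i in list(alive):
            lower[i] = max(lower[i], abs(d - matrix[c][i]))
            if lower[i] >= best_d:
                alive.discard(i)
    return best, best_d, evaluations


if __name__ == "__main__":
    d = lambda a, b: abs(a - b)
    db = [0, 3, 7, 12, 20]
    index, distance, used = aesa_nn(db, distance_matrix(db, d), d, 8)
    print(db[index], distance, used)
```

The quadratic storage of the matrix is the price paid for the small number of distance evaluations at query time.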
Pattern Recognition Letters, 2005
The metric space model abstracts many proximity search problems, from nearest-neighbor classifiers to textual and multimedia information retrieval. In this context, an index is a data structure that speeds up proximity queries. However, indexes lose their efficiency as the intrinsic data dimensionality increases. In this paper we present a simple index called list of clusters (LC), which is based on a compact partitioning of the data set. The LC is shown to require little space, to be suitable both for main and secondary memory implementations, and most importantly, to be very resistant to the intrinsic dimensionality of the data set. In this aspect our structure is unbeaten. We finish with a discussion of the role of unbalancing in metric space searching, and how it permits trading memory space for construction time.
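A compact partitioning of this kind can be sketched as a list of (center, covering radius, bucket) triples. The abstract does not spell out the construction, so the details here are assumptions: fixed-size buckets, the next center taken as the first remaining element, and the names `build_lc` and `lc_range_search`.

```python
from typing import Any, Callable, List, Sequence, Tuple

Cluster = Tuple[Any, float, List[Any]]  # (center, covering radius, bucket)


def build_lc(db: Sequence[Any], dist: Callable[[Any, Any], float], bucket_size: int) -> List[Cluster]:
    """Peel off one cluster at a time: a center plus its nearest remaining elements."""
    clusters: List[Cluster] = []
    remaining = list(db)
    while remaining:
        center, rest = remaining[0], remaining[1:]
        rest.sort(key=lambda x: dist(center, x))
        bucket, remaining = rest[:bucket_size], rest[bucket_size:]
        radius = dist(center, bucket[-1]) if bucket else 0.0
        clusters.append((center, radius, bucket))
    return clusters


def lc_range_search(clusters: List[Cluster], dist: Callable[[Any, Any], float],
                    q: Any, r: float) -> List[Any]:
    hits = []
    for center, radius, bucket in clusters:
        d = dist(q, center)
        if d <= r:
            hits.append(center)
        if d - r <= radius:
            # The query ball intersects this cluster: scan its bucket.
            hits.extend(x for x in bucket if dist(q, x) <= r)
        if d + r < radius:
            # The query ball lies strictly inside this cluster; every later
            # element is at least `radius` away from the center, so stop.
            break
    return hits


if __name__ == "__main__":
    d = lambda a, b: abs(a - b)
    clusters = build_lc([0, 8, 1, 9, 2, 15, 16, 30], d, bucket_size=2)
    print(lc_range_search(clusters, d, q=8.5, r=1))
```

The early exit works only because clusters are built in order and each bucket holds the elements nearest to its center among those still unassigned; that asymmetry between earlier and later clusters is the unbalancing the abstract alludes to.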
Lecture Notes in Computer Science, 2004
Similarity retrieval is an important paradigm for searching in environments where exact match has little meaning. Moreover, in order to enlarge the set of data types for which the similarity search can efficiently be performed, the notion of mathematical metric space provides a useful abstraction for similarity. In this paper we consider the problem of organizing and searching large data-sets from arbitrary metric spaces, and a novel access structure for similarity search in metric data, called D-Index, is discussed. D-Index combines a novel clustering technique and the pivot-based distance searching strategy to speed up execution of similarity range and nearest neighbor queries for large files with objects stored in disk memories. Moreover, we propose an extension of this access structure (eD-Index) which is able to deal with the problem of similarity self join. Though this approach is not able to eliminate the intrinsic quadratic complexity of similarity joins, significant performance improvements are confirmed by experiments.
Pattern Recognition Letters, 2003
With few exceptions, proximity search algorithms in metric spaces based on the use of pivots select them at random among the objects of the metric space. However, it is well known that the way in which the pivots are selected can drastically affect the performance of the algorithm. Between two sets of pivots of the same size, better chosen pivots can largely reduce the search time. Alternatively, a better chosen small set of pivots (requiring much less space) can yield the same efficiency as a larger, randomly chosen, set. We propose an efficiency measure to compare two pivot sets, combined with an optimization technique that allows us to select good sets of pivots. We obtain abundant empirical evidence showing that our technique is effective, and it is the first that we are aware of in producing consistently good results in a wide variety of cases and in being based on a formal theory. We also show that good pivots are outliers, but that selecting outliers does not ensure that good pivots are selected.