1999, Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '99
We consider the problem of performing nearest-neighbor queries efficiently over large high-dimensional databases. Assuming that a full database scan to determine the nearest-neighbor entries is not acceptable, we study the possibility of constructing an index structure over the database. It is well accepted that traditional database indexing algorithms fail for high-dimensional data (say d > 10 or 20, depending on the scheme). Some authors have argued that nearest-neighbor queries do not even make sense for high-dimensional data, since the ratio of the maximum to the minimum distance approaches 1 as dimensionality increases. We show that these arguments are based on over-restrictive assumptions, and that in the general case it is meaningful and possible to perform such queries. We present an approach for deriving a multidimensional index to support approximate nearest-neighbor queries over large databases. Our approach, called DBIN, scales to high-dimensional databases by exploiting statistical properties of the data. The approach is based on statistically modeling the density of the content of the data table. DBIN uses the density model to derive a single index over the data table and requires physically rewriting the data into a new table sorted by the newly created index (i.e., creating what is known as a clustered index in the database literature). The indexing scheme produces a mapping between a query point (a data record) and an ordering on the clustered-index values. Data is then scanned according to the index until the probability that the nearest neighbor has been found exceeds some threshold. We present theoretical and empirical justification for DBIN. The scheme supports a family of distance functions that includes the traditional Euclidean distance measure.
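The abstract leaves DBIN's density model and stopping rule unspecified, so the following is a minimal sketch of the scan-until-confident idea it describes, assuming a Gaussian mixture as the density model, a cluster-visit order given by the query's posterior probabilities, and accumulated posterior mass as the stopping threshold; all three choices are illustrative assumptions, not the paper's exact algorithm.

```python
# Illustrative sketch of a DBIN-style scan. Assumptions (not from the
# paper): Gaussian mixture density model, clusters visited in order of
# the query's posterior probability, scan stops once the visited
# clusters cover a target share of that posterior mass.
import numpy as np
from sklearn.mixture import GaussianMixture

def build_clustered_index(data, n_clusters=16, seed=0):
    """Model the data density, then 'rewrite' the table sorted by cluster id."""
    gmm = GaussianMixture(n_components=n_clusters, random_state=seed).fit(data)
    labels = gmm.predict(data)
    order = np.argsort(labels)              # clustered index: rows grouped by cluster
    return gmm, data[order], labels[order]

def approx_nn(query, gmm, sorted_data, sorted_labels, confidence=0.95):
    """Scan clusters most-probable-first; stop once the visited clusters
    hold at least `confidence` of the query's posterior cluster mass."""
    posterior = gmm.predict_proba(query[None, :])[0]
    best, best_dist, mass = None, np.inf, 0.0
    for c in np.argsort(-posterior):        # most probable clusters first
        members = sorted_data[sorted_labels == c]
        if len(members):
            d = np.linalg.norm(members - query, axis=1)
            i = d.argmin()
            if d[i] < best_dist:
                best_dist, best = d[i], members[i]
        mass += posterior[c]
        if mass >= confidence:              # probable that the NN was seen
            break
    return best, best_dist
```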
1999
We investigate the problem of approximate similarity (nearest neighbor) search in high-dimensional metric spaces, and describe how the distance distribution of the query object can be exploited so as to provide probabilistic guarantees on the quality of the result. This leads to a new paradigm for similarity search, called PAC-NN (probably approximately correct nearest neighbor) queries, aiming to break the "dimensionality curse". PAC-NN queries return, with probability at least 1 − δ, a (1 + ε)-approximate NN: an object whose distance from the query q is less than (1 + ε) times the distance between q and its NN. Analytical and experimental results obtained for sequential and index-based algorithms show that PAC-NN queries can be efficiently processed even in very high-dimensional spaces and that control can be exerted in order to trade off the accuracy of the result against the cost.
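A minimal sequential sketch of the PAC-NN stopping idea, assuming the query's distance distribution is estimated from a random sample and that the δ-quantile of the NN-distance distribution is derived from it; the paper's exact estimator and stopping condition may differ.

```python
# Sequential PAC-NN sketch: stop as soon as the current best is, with
# probability >= 1 - delta, within (1 + eps) of the true NN distance.
# The empirical-quantile estimator below is an illustrative assumption.
import numpy as np

def pac_nn(query, data, eps=0.2, delta=0.05, sample_size=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(data)
    # Empirical distribution of d(query, random object) from a sample.
    sample = data[rng.choice(n, size=min(sample_size, n), replace=False)]
    sample_d = np.sort(np.linalg.norm(sample - query, axis=1))
    # Pr{d_NN < r} = 1 - (1 - F(r))^n, so the delta-quantile of the NN
    # distance is the p-quantile of F with:
    p = 1.0 - (1.0 - delta) ** (1.0 / n)
    r_delta = np.quantile(sample_d, p)
    best, best_d = None, np.inf
    for x in data:                          # sequential scan
        d = np.linalg.norm(x - query)
        if d < best_d:
            best, best_d = x, d
        if best_d <= (1.0 + eps) * r_delta:
            break                           # PAC stopping condition
    return best, best_d
```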
Lecture Notes in Computer Science, 1998
Building an index tree is a common approach to speed up the k nearest neighbour search in large databases of many-dimensional records. Many applications require varying distance metrics by putting a weight on different dimensions. The main problem with k nearest neighbour searches using weighted Euclidean metrics in a high-dimensional space is whether the searches can be done efficiently. We present a solution to this problem which uses the bounding rectangle of the nearest-neighbour disk instead of using the disk directly. The algorithm is able to perform nearest-neighbour searches using distance metrics different from the metric used to build the search tree without having to rebuild the tree. It is efficient for weighted Euclidean distance and extensible to higher dimensions.
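The key geometric step the abstract mentions can be made concrete: under a weighted Euclidean metric d(x, q) = sqrt(sum_i w_i (x_i - q_i)^2), the nearest-neighbour "disk" of radius r is an ellipsoid whose axis-aligned bounding rectangle has half-extent r / sqrt(w_i) along dimension i. Below is a sketch of that computation and the resulting prune test; the tree traversal itself is not reproduced.

```python
# The ball {x : sum_i w_i (x_i - c_i)^2 <= r^2} is an ellipsoid; its
# axis-aligned bounding rectangle has half-extent r / sqrt(w_i) per axis.
import numpy as np

def weighted_ball_bounding_rect(center, radius, weights):
    """Bounding rectangle of the weighted-Euclidean ball around `center`."""
    half = radius / np.sqrt(np.asarray(weights, dtype=float))
    c = np.asarray(center, dtype=float)
    return c - half, c + half               # (lower corner, upper corner)

def rect_intersects(lo1, hi1, lo2, hi2):
    """Prune test: does the search rectangle overlap a tree node's MBR?"""
    return bool(np.all(lo1 <= hi2) and np.all(lo2 <= hi1))
```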
ACM SIGMOD Record, 1997
In many database applications, one of the common queries is to find approximate matches to a given query item from a collection of data items. For example, given an image database, one may want to retrieve all images that are similar to a given query image. Distance-based index structures are proposed for applications where the data domain is high dimensional, or the distance function used to compute distances between data objects is non-Euclidean. In this paper, we introduce a distance-based index structure called the multi-vantage-point (mvp) tree for similarity queries on high-dimensional metric spaces. The mvp-tree uses more than one vantage point to partition the space into spherical cuts at each level. It also utilizes the pre-computed (at construction time) distances between the data points and the vantage points. We have done experiments to compare mvp-trees with vp-trees, which have a similar partitioning strategy but use only one vantage point at each level and do not make use of the pre-computed distances. Empirical studies show that the mvp-tree outperforms the vp-tree by 20% to 80% for varying query ranges and different distance distributions.

1. Introduction. In many database applications, it is desirable to be able to answer queries based on proximity, such as asking for data items that are similar to a query item, or that are closest to a query item. We face such queries in the context of many database applications such as genetics, image/picture databases, time-series analysis, information retrieval, etc. In genetics, the concern is to find DNA or protein sequences that are similar in a genetic database. In time-series analysis, we would like to find similar patterns among a given collection of sequences. Image databases can be queried to find and retrieve images in the database that are similar to the query image with respect to specified criteria.
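A sketch of the vantage-point pruning principle shared by vp- and mvp-trees: a pre-computed distance d(v, x) plus the triangle inequality can discard x without evaluating d(q, x). This shows the principle only, not the mvp-tree's multiple-vantage-point layout.

```python
# Vantage-point pruning: if |d(q,v) - d(v,x)| > r, the triangle
# inequality guarantees d(q,x) > r, so x is discarded using only the
# pre-computed d(v,x).
def range_search_with_vantage(query, radius, objects, vantage, dist,
                              precomputed):  # precomputed[i] = dist(vantage, objects[i])
    d_qv = dist(query, vantage)
    hits = []
    for x, d_vx in zip(objects, precomputed):
        if abs(d_qv - d_vx) > radius:
            continue                        # pruned without computing dist(query, x)
        if dist(query, x) <= radius:
            hits.append(x)
    return hits
```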
Proceedings of the fourth annual ACM-SIAM …, 1993
We consider the computational problem of finding nearest neighbors in general metric spaces. Of particular interest are spaces that may not be conveniently embedded or approximated in Euclidean space, or where the dimensionality of a Euclidean representation is ...
SIAM Journal on Computing, 2013
We study the Approximate Nearest Neighbor problem for metric spaces where the query points are constrained to lie on a subspace of low doubling dimension, while the data is high-dimensional. We show that this problem can be solved efficiently despite the high dimensionality of the data.
2010
Numerous techniques have been proposed in the past for supporting efficient k-nearest neighbor (k-NN) queries in continuous data spaces. Limited work has been reported in the literature for k-NN queries in a non-ordered discrete data space (NDDS). Performing k-NN queries in an NDDS raises new challenges. The Hamming distance is usually used to measure the distance between two vectors (objects) in an NDDS. Due to the coarse granularity of the Hamming distance, a k-NN query in an NDDS may lead to a high degree of non-determinism for the query result. We propose a new distance measure, called Granularity-Enhanced Hamming (GEH) distance, that effectively reduces the number of candidate solutions for a query. We have also implemented k-NN queries using multidimensional database indexing in NDDSs. Further, we use the properties of our multidimensional NDDS index to derive the probability of encountering new neighbors within specific regions of the index. This probability is used to develop a new search ordering heuristic. Our experiments on synthetic and genomic data sets demonstrate that our index-based k-NN algorithm is efficient in finding k-NNs in both uniform and non-uniform data sets in NDDSs and that our heuristics are effective in improving the performance of such queries.
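To illustrate the granularity problem this abstract describes (the GEH measure itself is not defined in the abstract and is not sketched here): over length-d vectors the Hamming distance takes only the d + 1 values 0..d, so k-NN candidate sets contain large tie groups.

```python
# Why plain Hamming distance is coarse in an NDDS: over length-d vectors
# it takes only d+1 integer values, so k-NN candidates tie massively.
from collections import Counter
import random

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

random.seed(0)
d, n = 10, 5000
alphabet = "ACGT"                           # non-ordered discrete domain
data = ["".join(random.choices(alphabet, k=d)) for _ in range(n)]
q = "".join(random.choices(alphabet, k=d))

ties = Counter(hamming(q, x) for x in data)
print(sorted(ties.items()))                 # few distinct values, huge tie groups
```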
Iberoamerican Congress on Pattern Recognition CIARP, 2009
Many pattern recognition tasks can be modeled as proximity searching. Here the common task is to quickly find all the elements close to a given query without sequentially scanning a very large database. A recent shift in the searching paradigm has been established by using permutations instead of distances to predict proximity. Every object in the database records how it sees the set of reference objects (the permutants), i.e., only the relative positions are used. When a query arrives, the relative displacements of the permutants between the query and a particular object are measured. This approach has turned out to be the most efficient and scalable, at the expense of losing recall in the answers. The permutation of every object is represented with κ short integers in practice, producing bulky indexes of 16κn bits. In this paper we show how to represent the permutation as a binary vector, using just one bit for each permutant (instead of log κ bits in the plain representation). The Hamming distance on the binary signature is then used to predict proximity between objects in the database. We tested this approach on many real-life metric databases, obtaining faster queries with a recall close to that of the Spearman ρ while using 16 times less space.
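A hedged sketch of the permutation index and a one-bit-per-permutant signature; the binarization rule used below (bit i set when permutant i falls in the closer half of the object's permutation) is an illustrative assumption, since the abstract does not spell out the encoding.

```python
# Permutation-based proximity prediction with binary signatures.
# The "closer half" binarization is an assumed encoding, not the paper's.
import numpy as np

def permutation(obj, permutants, dist):
    """ranks[i] = rank of permutant i when permutants are sorted by
    distance from obj (the 'view' obj has of the permutants)."""
    order = np.argsort([dist(obj, p) for p in permutants])
    ranks = np.empty(len(permutants), dtype=int)
    ranks[order] = np.arange(len(permutants))
    return ranks

def binary_signature(obj, permutants, dist):
    ranks = permutation(obj, permutants, dist)
    return ranks < len(permutants) // 2     # one bit per permutant

def signature_hamming(a, b):
    return int(np.count_nonzero(a != b))    # proxy for proximity between objects
```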
1999
In this paper we introduce a new paradigm for similarity search, called PAC-NN (probably approximately correct nearest neighbor) queries, aiming to break the "dimensionality curse" which prevents current approaches from being applied in high-dimensional spaces. PAC-NN queries return, with probability at least 1 − δ, a (1 + ε)-approximate NN: an object whose distance from the query q is less than (1 + ε) times the distance between q and its NN. We describe how the distance distribution of the query object can be used to determine a suitable stopping condition with probabilistic guarantees on the quality of the result, and then analyze the performance of both sequential and index-based PAC-NN algorithms. This shows that PAC-NN queries can be efficiently processed even in very high-dimensional spaces and that control can be exerted in order to trade off the accuracy of the result against the cost.
Similarity search in multimedia databases requires efficient support of nearest-neighbor search on a large set of high-dimensional points as a basic operation for query processing. As recent theoretical results show, state-of-the-art approaches to nearest-neighbor search are not efficient in higher dimensions. In our new approach, we therefore pre-compute the result of any nearest-neighbor search, which corresponds to computing the Voronoi cell of each data point. In a second step, we store the Voronoi cells in an index structure efficient for high-dimensional data spaces. As a result, nearest-neighbor search corresponds to a simple point query on the index structure. Although our technique is based on a precomputation of the solution space, it is dynamic, i.e., it supports insertions of new data points. An extensive experimental evaluation of our technique demonstrates its high efficiency for uniformly distributed as well as real data. We obtained a significant reduction of the search time compared to nearest-neighbor search in the X-tree.
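A loose sketch of the precompute-the-solution-space idea, with a cell approximation that is purely illustrative: each data point's Voronoi cell is approximated by a bounding box grown from sample queries whose exact NN is that point, so most searches reduce to a point query against boxes. The paper's cell approximation and its high-dimensional index are considerably more refined.

```python
# Illustrative only: approximate Voronoi cells by per-point bounding
# boxes learned from sampled queries; NN search becomes a point query
# against the boxes, with a brute-force fallback when no box matches.
import numpy as np

def build_cell_boxes(data, n_samples=20000, seed=0, chunk=1000):
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    lo_b, hi_b = data.min(0), data.max(0)
    lo = np.full_like(data, np.inf)         # per-point box lower corners
    hi = np.full_like(data, -np.inf)        # per-point box upper corners
    for _ in range(n_samples // chunk):
        q = rng.uniform(lo_b, hi_b, size=(chunk, data.shape[1]))
        nn = np.linalg.norm(q[:, None] - data[None], axis=2).argmin(1)
        for i in np.unique(nn):
            pts = q[nn == i]
            lo[i] = np.minimum(lo[i], pts.min(0))
            hi[i] = np.maximum(hi[i], pts.max(0))
    return lo, hi

def point_query(q, data, lo, hi):
    inside = np.flatnonzero(np.all((lo <= q) & (q <= hi), axis=1))
    if inside.size == 0:                    # box approximation missed: fall back
        inside = np.arange(len(data))
    d = np.linalg.norm(data[inside] - q, axis=1)
    return inside[d.argmin()]
```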
Proceedings of the 26th Annual International Conference on Machine Learning - ICML '09, 2009
High dimensionality can pose severe difficulties, widely recognized as different aspects of the curse of dimensionality. In this paper we study a new aspect of the curse pertaining to the distribution of k-occurrences, i.e., the number of times a point appears among the k nearest neighbors of other points in a data set. We show that, as dimensionality increases, this distribution becomes considerably skewed and hub points emerge (points with very high k-occurrences). We examine the origin of this phenomenon, showing that it is an inherent property of high-dimensional vector space, and explore its influence on applications based on measuring distances in vector spaces, notably classification, clustering, and information retrieval.
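The k-occurrences statistic is easy to compute directly, and doing so on synthetic data reproduces the skew the abstract describes; the data sizes and dimensions below are arbitrary demo choices.

```python
# N_k(x): how often x appears among the k nearest neighbors of the other
# points. In high dimensions its distribution skews and hubs emerge.
import numpy as np

def k_occurrences(data, k=10):
    sq = (data ** 2).sum(1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * data @ data.T   # squared distances
    np.fill_diagonal(d2, np.inf)            # exclude self-neighborhoods
    knn = np.argsort(d2, axis=1)[:, :k]     # k nearest neighbors of each point
    return np.bincount(knn.ravel(), minlength=len(data))

rng = np.random.default_rng(0)
for dim in (3, 100):
    x = rng.standard_normal((2000, dim))
    nk = k_occurrences(x)
    print(dim, nk.max(), round(float(nk.std()), 2))  # skew grows with dimension
```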
ACM Transactions on Database Systems, 2002
Metric access methods (MAMs), such as the M-tree, are powerful index structures for supporting similarity queries on metric spaces, which represent a common abstraction for those searching problems that arise in many modern application areas, such as multimedia, data mining, decision support, pattern recognition, and genomic databases. As compared to multi-dimensional (spatial) access methods (SAMs), MAMs are more general, yet they are reputed to lose in flexibility, since it is commonly deemed that they can only answer queries using the same distance function used to build the index. In this paper we show that this limitation is only apparent - thus MAMs are far more flexible than believed - and extend the M-tree so as to be able to support user-defined distance criteria, approximate distance functions to speed up query evaluation, as well as dissimilarity functions which are not metrics. The so-extended M-tree, also called QIC-M-tree, can deal with three distinct distances at a time: 1) a query (user-defined) distance, 2) an index distance (used to build the tree), and 3) a comparison (approximate) distance (used to quickly discard uninteresting parts of the tree from the search). We develop an analytical cost model that accurately characterizes the performance of the QIC-M-tree and validate such model through extensive experimentation on real metric data sets. In particular, our analysis is able to predict the best evaluation strategy (i.e., which distances to use) under a variety of configurations, by properly taking into account relevant factors such as the distribution of distances, the cost of computing distances, and the actual index structure. We also prove that the overall saving in CPU search costs when using an approximate distance can be estimated by using information on the data set only - thus such a measure is independent of the underlying access method - and show that performance results are closely related to a novel "indexing" error measure. Finally, we show how our results apply to other MAMs and query types.
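A sketch of the comparison-distance idea in its simplest, index-free form: if a cheap distance lower-bounds the expensive query distance, it can safely discard objects before the expensive distance is ever computed. The M-tree traversal and the paper's scaling details are not reproduced here.

```python
# Filter-and-refine with a lower-bounding comparison distance:
# if d_compare(x, y) <= d_query(x, y) for all x, y, then
# d_compare(q, x) > r already proves x lies outside the range query.
def range_query_filter_refine(query, radius, objects, d_query, d_compare):
    hits = []
    for x in objects:
        if d_compare(query, x) > radius:    # cheap lower bound: safe discard
            continue
        if d_query(query, x) <= radius:     # expensive check only on survivors
            hits.append(x)
    return hits
```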
2003
In order to speed up retrieval in large collections of data, index structures partition the data into subsets so that query requests can be evaluated without examining the entire collection. As the complexity of modern data types grows, metric spaces have become a popular paradigm for similarity retrieval.
Data & Knowledge Engineering, 2010
Similarity search is a very active area of research because of its usefulness in a set of modern applications, such as content-based image retrieval (CBIR), time series, spatial databases, data mining, and multimedia databases in general. The usual way to do a similarity search is to map the objects to feature vectors and to model the search as a nearest-neighbor query in the multidimensional space where the vectors reside. The main critical issues in this process are the distance function used to measure the proximity between vectors and the index method used to accelerate the search. In this paper we propose a formal framework for performing similarity search that provides the user with a high degree of freedom in the choice of both the distance and the index structure used to organize the feature space. More specifically, we introduce a function that can approximate essentially any distance function and that can be used in conjunction with index structures that divide the feature space into multidimensional rectangular regions. Use cases and experimental work are presented to demonstrate the applicability and the overhead of the framework.
Lecture Notes in Computer Science, 2015
Much recent work has been devoted to approximate nearest neighbor queries. Motivated by applications in recommender systems, we consider approximate furthest neighbor (AFN) queries. We present a simple, fast, and highly practical data structure for answering AFN queries in high-dimensional Euclidean space. We build on the technique of Indyk (SODA 2003), storing random projections to provide sublinear query time for AFN. However, we introduce a different query algorithm, improving on Indyk's approximation factor and reducing the running time by a logarithmic factor. We also present a variation based on a query-independent ordering of the database points; while this does not have the provable approximation factor of the query-dependent data structure, it offers significant improvement in time and space complexity. We give a theoretical analysis, and experimental results.
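A hedged sketch in the spirit of the projection-based AFN approach the abstract builds on; the number of projections and the number of candidates kept per projection are illustrative knobs, not the paper's tuned values.

```python
# AFN via random projections: points extreme along random directions are
# good candidates for being far from any query; verify a small candidate
# set with exact distances.
import numpy as np

class AFNIndex:
    def __init__(self, data, n_proj=20, per_proj=10, seed=0):
        rng = np.random.default_rng(seed)
        self.data = data
        dirs = rng.standard_normal((n_proj, data.shape[1]))
        self.dirs = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
        self.order = np.argsort(data @ self.dirs.T, axis=0)  # per-direction ranking
        self.per_proj = per_proj

    def query(self, q):
        cand = set()
        for j in range(self.order.shape[1]):
            # points extreme at either end of projection j
            cand.update(self.order[:self.per_proj, j].tolist())
            cand.update(self.order[-self.per_proj:, j].tolist())
        idx = np.fromiter(cand, dtype=int)
        d = np.linalg.norm(self.data[idx] - q, axis=1)
        return idx[d.argmax()]              # index of approximate furthest neighbor
```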
Similarity searching has a vast range of applications in various fields of computer science. Many methods have been proposed for exact search, but they all suffer from the curse of dimensionality and are, thus, not applicable to high dimensional spaces. Approximate search methods are considerably more efficient in high dimensional spaces. Unfortunately, there are few theoretical results regarding the complexity of these methods and there are no comprehensive empirical evaluations, especially for non-metric spaces. To fill this gap, we present an empirical analysis of data structures for approximate nearest neighbor search in high dimensional spaces. We provide a comparison with recently published algorithms on several data sets. Our results show that small world approaches provide some of the best tradeoffs between efficiency and effectiveness in both metric and non-metric spaces.
Lecture Notes in Computer Science, 2004
In this paper we show the results of a performance comparison between two nearest-neighbour search methods: one, proposed by Arya & Mount, is based on a kd-tree data structure and a branch-and-bound approximate search algorithm [1]; the other is a search method based on dimensionality projections, presented by Nene & Nayar in [5]. A number of experiments have been carried out in order to find the best choice for working with high-dimensional points and large data sets.
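A hedged sketch of the dimensionality-projection search attributed to Nene & Nayar: each coordinate is kept pre-sorted, per-dimension slabs of width 2ε around the query are intersected, and the survivors are checked exhaustively. The parameter handling is simplified relative to the original.

```python
# Projection-based candidate search: intersect per-dimension slabs
# |x_j - q_j| <= eps found by binary search in pre-sorted coordinates,
# then scan the surviving candidates exactly.
import numpy as np

def build(data):
    return np.argsort(data, axis=0)         # sorted index per dimension

def nn_by_projections(q, data, sorted_idx, eps):
    n, dims = data.shape
    alive = np.ones(n, dtype=bool)
    for j in range(dims):
        col = data[sorted_idx[:, j], j]     # j-th coordinates in sorted order
        lo = np.searchsorted(col, q[j] - eps, side="left")
        hi = np.searchsorted(col, q[j] + eps, side="right")
        slab = np.zeros(n, dtype=bool)
        slab[sorted_idx[lo:hi, j]] = True
        alive &= slab                       # intersect dimension slabs
    cand = np.flatnonzero(alive)
    if cand.size == 0:
        return None                         # eps too small: no candidate found
    d = np.linalg.norm(data[cand] - q, axis=1)
    return cand[d.argmin()]
```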
2007 IEEE 23rd International Conference on Data Engineering, 2007
A k-nearest neighbor (k-NN) query retrieves k objects from a database that are considered to be the closest to a given query point. Numerous techniques have been proposed in the past for supporting efficient k-NN searches in continuous data spaces. No such work has been reported in the literature for k-NN searches in a non-ordered discrete data space (NDDS). Performing k-NN searches in an NDDS raises new challenges. The Hamming distance is usually used to measure the distance between two vectors (objects) in an NDDS. Due to the coarse granularity of the Hamming distance, a k-NN query in an NDDS may lead to a large set of candidate solutions, creating a high degree of nondeterminism for the query result. We propose a new distance measure, called Granularity-Enhanced Hamming (GEH) distance, that effectively reduces the number of candidate solutions for a query. We have also considered using multidimensional database indexing for implementing k-NN searches in NDDSs. Our experiments on synthetic and genomic data sets demonstrate that our index-based k-NN algorithm is effective and efficient in finding k-NNs in NDDSs.
2010
Similarity search in high-dimensional metric spaces is a key operation in many applications, such as multimedia databases, image retrieval, object recognition, and others. The high dimensionality of the data requires special index structures to facilitate the search. A problem in creating suitable index structures for high-dimensional data is the relationship between the geometry of the data and the organization of the index structure. In this paper, we study the performance of a new index structure, called the Divisive-Agglomerative Hierarchical Clustering tree (DAHC-tree), which reduces the effects imposed by this liability. DAHC-tree is constructed by dividing and grouping the data set into compact clusters. We perform a rigorous experimental design and analyze the trade-offs involved in building such an index structure. Additionally, we present extensive experiments comparing our method against the state of the art of exact and approximate solutions. The conducted analysis and the reported comparative test results demonstrate that our technique significantly improves the performance of similarity queries.
Lecture Notes in Computer Science, 2005
Proximity searching consists in retrieving from a database the objects that are close to a query. For this type of searching problem, the most general model is the metric space, where proximity is defined in terms of a distance function. A solution to this problem consists in building an offline index to quickly satisfy online queries. The ultimate goal is to use as few distance computations as possible to satisfy queries, since the distance is considered expensive to compute. Proximity searching is central to several applications, ranging from multimedia indexing and querying to data compression and clustering. In this paper we present a new approach to the proximity searching problem. Our solution is based on indexing the database with the k-nearest-neighbor graph (knng), which is a directed graph connecting each element to its k closest neighbors. We present two search algorithms, for both range and nearest-neighbor queries, which use navigational and metric features of the knng graph. We show that our approach is competitive against current ones. For instance, in the document metric space our nearest-neighbor search algorithms perform 30% more distance evaluations than AESA while using only 0.25% of its space requirement. In the same space, the pivot-based technique is completely useless.
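A hedged sketch of one navigational use of a knng index: greedy descent from random starting points, always moving to a closer graph neighbor. The paper's actual range and NN algorithms also exploit metric properties of the graph and are more elaborate than this.

```python
# Greedy nearest-neighbor search over a precomputed knng: hill-descend
# toward the query through graph edges, with random restarts to escape
# local minima. Objects must be hashable (e.g., tuples) to key the graph.
import random

def greedy_knng_search(query, graph, dist, restarts=5, seed=0):
    """graph: dict mapping each object to the list of its k closest neighbors."""
    rng = random.Random(seed)
    nodes = list(graph)
    best, best_d = None, float("inf")
    for _ in range(restarts):
        cur = rng.choice(nodes)
        cur_d = dist(query, cur)
        improved = True
        while improved:
            improved = False
            for nb in graph[cur]:
                d = dist(query, nb)
                if d < cur_d:               # move to a strictly closer neighbor
                    cur, cur_d, improved = nb, d, True
        if cur_d < best_d:
            best, best_d = cur, cur_d
    return best, best_d
```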