Among the similarity queries in metric spaces, there is one that obtains the k-nearest neighbors of all the elements in the database (All-k-NN). One way to solve it is the naive one: comparing each object in the database with all the other ones and returning the k elements nearest to it (k-NN). Another way is to preprocess the database to build an index, and then search this index for the k-NN of each element of the dataset. Solving the All-k-NN problem makes it possible to build the k-Nearest Neighbor Graph (kNNG). Given an object collection of a metric space, the Nearest Neighbor Graph (NNG) associates each node with its closest neighbor under the given metric. If we link each object to its k nearest neighbors, we obtain the k-Nearest Neighbor Graph (kNNG). The kNNG can be considered an index for a database, which is quite efficient and can allow improvements. In this work, we propose a new technique to solve the All-k-NN problem which does not use any index to obta...
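The naive solution mentioned above can be sketched in a few lines. This is only an illustrative baseline, not the technique the abstract proposes; the function name `all_knn_brute_force`, the toy one-dimensional points, and the absolute-difference metric are assumptions made for the example.

```python
import heapq

def all_knn_brute_force(objects, dist, k):
    """Brute-force All-k-NN: compare every object with all the others.

    Returns a dict mapping each object index to a list of
    (distance, neighbor_index) pairs, nearest first. This uses
    n * (n - 1) distance evaluations, i.e. quadratic cost.
    """
    graph = {}
    for i, u in enumerate(objects):
        # Distances from u to every other object (u itself is excluded).
        candidates = ((dist(u, v), j) for j, v in enumerate(objects) if j != i)
        graph[i] = heapq.nsmallest(k, candidates)
    return graph

# Hypothetical toy data: points on a line with |a - b| as the metric.
points = [0.0, 1.0, 3.0, 7.0]
knng = all_knn_brute_force(points, lambda a, b: abs(a - b), k=2)
```

The resulting dictionary is an adjacency-list form of the kNNG: entry `knng[i]` holds the k outgoing edges of node `i` together with their weights.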
Lecture Notes in Computer Science, 2005
Proximity searching consists in retrieving from a database the objects that are close to a query. For this type of searching problem, the most general model is the metric space, where proximity is defined in terms of a distance function. A solution for this problem consists in building an offline index to quickly satisfy online queries. The ultimate goal is to use as few distance computations as possible to satisfy queries, since the distance is considered expensive to compute. Proximity searching is central to several applications, ranging from multimedia indexing and querying to data compression and clustering. In this paper we present a new approach to solve the proximity searching problem. Our solution is based on indexing the database with the k-nearest neighbor graph (knng), which is a directed graph connecting each element to its k closest neighbors. We present two search algorithms for both range and nearest neighbor queries which use navigational and metrical features of the knng. We show that our approach is competitive against current ones. For instance, in the document metric space our nearest neighbor search algorithms perform 30% more distance evaluations than AESA using only 0.25% of its space requirement. In the same space, the pivot-based technique is completely useless.
Nearest neighbor queries can be satisfied, in principle, with a greedy algorithm under a proximity graph. Each object in the database is represented by a node, and proximal nodes in this graph share an edge. To find the nearest neighbor the idea is quite simple: we start at a random node and get iteratively closer to the nearest neighbor, following only adjacent edges in the proximity graph. Every node reachable from the current vertex is reviewed, and only the node closest to the query is expanded in the next round. The algorithm stops when none of the neighbors of the current node is closer to the query. The number of reviewed objects will be proportional to the diameter of the graph times the average degree of the nodes. Unfortunately the degree of a proximity graph is unbounded for a general metric space [1], and hence the number of inspected objects can be linear in the size of the database, which is the same as no indexing at all. In this paper we introduce a quasi-proximity graph induced by the all-k-nearest neighbor graph. The degree of this graph is bounded, but we will face local minima when running the above greedy algorithm, which boils down to having false positives in the queries. We show experimental results for high-dimensional spaces. We report a recall greater than 90% for most configurations, which is very good for many proximity searching applications, while reviewing just a tiny portion of the database. The space requirement for the index is linear in the database size, and the construction time is quadratic in the worst case. Relaxations of our method are sketched to obtain practical subquadratic implementations.
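The greedy walk described above can be sketched as follows. This is a minimal illustration over a plain directed 2-NN graph, which is an assumption: the quasi-proximity graph of the paper may be built differently. The function name `greedy_nn_search`, the toy points, and the hand-written adjacency lists are all hypothetical.

```python
def greedy_nn_search(graph, objects, dist, query, start):
    """Greedy descent: move to the neighbor closest to the query until
    no neighbor improves on the current node.

    Returns (node_index, distance_to_query, distance_evaluations).
    The answer may be a local minimum rather than the true nearest neighbor.
    """
    current = start
    current_d = dist(objects[current], query)
    evaluated = 1
    while True:
        best, best_d = current, current_d
        # Review every node adjacent to the current one.
        for nb in graph[current]:
            d = dist(objects[nb], query)
            evaluated += 1
            if d < best_d:
                best, best_d = nb, d
        if best == current:
            # No neighbor is closer to the query: stop here.
            return current, current_d, evaluated
        current, current_d = best, best_d

# Hypothetical toy data: two clusters on a line, with their 2-NN graph.
points = [0, 1, 2, 3, 10, 11, 12]
graph = {0: [1, 2], 1: [0, 2], 2: [1, 3], 3: [2, 1],
         4: [5, 6], 5: [4, 6], 6: [5, 4]}
metric = lambda a, b: abs(a - b)

stuck = greedy_nn_search(graph, points, metric, 11.2, start=0)
found = greedy_nn_search(graph, points, metric, 11.2, start=4)
```

Starting in the left cluster, the walk stops at index 3 because no edge leads to the right cluster, which is the local-minimum behaviour the abstract warns about. Starting at index 4 reaches index 5, the true nearest neighbor of the query.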
The reverse k-nearest neighbor (RkNN) problem, i.e. finding all objects in a data set whose k-nearest neighbors include a specified query object, is a generalization of the reverse 1-nearest neighbor problem, which has received increasing attention recently. Many industrial and scientific applications call for solutions of the RkNN problem in arbitrary metric spaces where the data objects are not Euclidean and only a metric distance function is given for specifying object similarity. Usually, these applications need a solution for the generalized problem where the value of k is not known in advance and may change from query to query. However, existing approaches, except one, are designed for the specific R1NN problem. In addition, to the best of our knowledge, all previously proposed methods, especially the one for generalized RkNN search, are only applicable to Euclidean vector data but not to general metric objects. In this paper, we propose the first approach for efficient RkNN search in arbitrary metric spaces where the value of k is specified at query time. Our approach uses the advantages of existing metric index structures but proposes to use conservative and progressive distance approximations in order to filter out true drops and true hits. In particular, we approximate the k-nearest neighbor distance for each data object by upper and lower bounds using two functions of only two parameters each. Thus, our method does not generate any considerable storage overhead. We show in a broad experimental evaluation on real-world data the scalability and the usability of our novel approach.
XXII Congreso Argentino de Ciencias de la Computación (CACIC 2016), 2016
Given a collection of objects in a metric space, the Nearest Neighbor Graph (NNG) associates each node with its closest neighbor under the given metric. It can be obtained trivially by computing the nearest neighbor of every object. To avoid computing every distance pair, an index could be constructed. Unfortunately, due to the curse of dimensionality, the indexed and the brute-force methods are almost equally inefficient. This brings attention to algorithms computing approximate versions of the NNG. The DiSAT is a proximity searching tree. It is hierarchical. The root computes the distances to all objects, and each child node of the root computes the distance to all objects in its subtree, recursively. Top levels will have an accurate computation of the nearest neighbor, and as we descend the tree this information becomes less accurate. If we perform a few rebuilds of the index, taking deep nodes in each iteration and keeping score of the closest known neighbor, it is possible to compute an Approximate NNG (ANNG). Accordingly, in this work we propose to obtain the ANNG by this approach, without performing any search, and we tested this proposal on both synthetic and real-world databases with good results both in costs and response quality.
Proceedings of the 12th International Conference on Extending Database Technology Advances in Database Technology - EDBT '09, 2009
In this paper, we propose an original solution for the general reverse k-nearest neighbor (RkNN) search problem. Compared to the limitations of existing methods for the RkNN search, our approach works on top of any hierarchically organized tree-like index structure and, thus, is applicable to any type of data as long as a metric distance function is defined on the data objects. We will exemplarily show how our approach works on top of the most prevalent index structures for Euclidean and metric data, the R-Tree and the M-Tree, respectively. Our solution is applicable for arbitrary values of k and can also be applied in dynamic environments where updates of the database frequently occur. Although being the most general solution for the RkNN problem, our solution outperforms existing methods in terms of query execution times because it exploits different strategies for pruning false drops and identifying true hits as soon as possible.
2009 Second International Workshop on Similarity Search and Applications, 2009
Retrieving the k-nearest neighbors of a query object is a basic primitive in similarity searching. A related, far less explored primitive is to obtain the dataset elements which would have the query object within their own k-nearest neighbors, known as the reverse k-nearest neighbor query. We already have indices and algorithms to solve k-nearest neighbors queries in general metric spaces; yet, in many cases of practical interest they degenerate to sequential scanning. The naive algorithm for reverse k-nearest neighbor queries has quadratic complexity, because the k-nearest neighbors of all the dataset objects must be found; this is too expensive. Hence, when solving these primitives we can tolerate trading correctness in the solution for searching time. In this paper we propose an efficient approximate approach to solve these similarity queries with high retrieval rate. Then, we show how to use our approximate k-nearest neighbor queries to construct (an approximation of) the k-nearest neighbor graph when we have a fixed dataset. Finally, combining both primitives we show how to dynamically maintain the approximate k-nearest neighbor graph of the objects currently stored within the metric dataset, that is, considering both object insertions and deletions.
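The quadratic cost of the naive reverse k-nearest neighbor query mentioned above is easy to see in a sketch. This is only the exact baseline, not the approximate method of the paper; the function name `reverse_knn_naive`, the toy points, and the tie rule (a tie with the query counts in the query's favor) are assumptions made for the example.

```python
def reverse_knn_naive(objects, dist, query, k):
    """Naive RkNN: return indices of objects that would have `query`
    among their own k nearest neighbors.

    For each object u, count how many other database objects are strictly
    closer to u than the query is. If fewer than k are, the query belongs
    to the k-NN of u. This costs about n * n distance evaluations.
    """
    result = []
    for i, u in enumerate(objects):
        d_q = dist(u, query)
        closer = sum(
            1 for j, v in enumerate(objects)
            if j != i and dist(u, v) < d_q
        )
        if closer < k:
            result.append(i)
    return result

# Hypothetical toy data: points on a line with |a - b| as the metric.
points = [0.0, 1.0, 3.0, 7.0]
rknn = reverse_knn_naive(points, lambda a, b: abs(a - b), query=4.0, k=1)
```

With `k=1` and the query at 4.0, only the points 3.0 and 7.0 have the query as their closest element, so the call returns their indices.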
British National Conference on Databases, 1998
Building an index tree is a common approach to speed up the k nearest neighbour search in large databases of many-dimensional records. Many applications require varying distance metrics by putting a weight on different dimensions. The main problem with k nearest neighbour searches using weighted Euclidean metrics in a high-dimensional space is whether the searches can be done efficiently. We present a solution to this
Proceedings of the fourth annual ACM-SIAM …, 1993
We consider the computational problem of finding nearest neighbors in general metric spaces. Of particular interest are spaces that may not be conveniently embedded or approximated in Euclidean space, or where the dimensionality of a Euclidean representation is ...
Lecture Notes in Computer Science, 1998
Building an index tree is a common approach to speed up the k nearest neighbour search in large databases of many-dimensional records. Many applications require varying distance metrics by putting a weight on different dimensions. The main problem with k nearest neighbour searches using weighted Euclidean metrics in a high-dimensional space is whether the searches can be done efficiently. We present a solution to this problem which uses the bounding rectangle of the nearest-neighbour disk instead of using the disk directly. The algorithm is able to perform nearest-neighbour searches using distance metrics different from the metric used to build the search tree without having to rebuild the tree. It is efficient for weighted Euclidean distance and extensible to higher dimensions.
Numerous techniques have been proposed in the past for supporting efficient k-nearest neighbor (k-NN) queries in continuous data spaces. Limited work has been reported in the literature for k-NN queries in a non-ordered discrete data space (NDDS). Performing k-NN queries in an NDDS raises new challenges. The Hamming distance is usually used to measure the distance between two vectors (objects) in an NDDS. Due to the coarse granularity of the Hamming distance, a k-NN query in an NDDS may lead to a high degree of non-determinism for the query result. We propose a new distance measure, called Granularity-Enhanced Hamming (GEH) distance, that effectively reduces the number of candidate solutions for a query. We have also implemented k-NN queries using multidimensional database indexing in NDDSs. Further, we use the properties of our multidimensional NDDS index to derive the probability of encountering new neighbors within specific regions of the index. This probability is used to develop a new search ordering heuristic. Our experiments on synthetic and genomic data sets demonstrate that our index-based k-NN algorithm is efficient in finding k-NNs in both uniform and non-uniform data sets in NDDSs and that our heuristics are effective in improving the performance of such queries.
2007 IEEE 23rd International Conference on Data Engineering, 2007
A k-nearest neighbor (k-NN) query retrieves k objects from a database that are considered to be the closest to a given query point. Numerous techniques have been proposed in the past for supporting efficient k-NN searches in continuous data spaces. No such work has been reported in the literature for k-NN searches in a non-ordered discrete data space (NDDS). Performing k-NN searches in an NDDS raises new challenges. The Hamming distance is usually used to measure the distance between two vectors (objects) in an NDDS. Due to the coarse granularity of the Hamming distance, a k-NN query in an NDDS may lead to a large set of candidate solutions, creating a high degree of nondeterminism for the query result. We propose a new distance measure, called Granularity-Enhanced Hamming (GEH) distance, that effectively reduces the number of candidate solutions for a query. We have also considered using multidimensional database indexing for implementing k-NN searches in NDDSs. Our experiments on synthetic and genomic data sets demonstrate that our index-based k-NN algorithm is effective and efficient in finding k-NNs in NDDSs.
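The coarse granularity of the Hamming distance is easy to observe on a small example. The sketch below only illustrates the tie problem; it does not implement the GEH distance, whose definition is not given in the abstract. The function names and the toy four-letter strings are assumptions.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two equal-length vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))

def distance_histogram(database, query):
    """Count how many database vectors fall at each Hamming distance."""
    return Counter(hamming(v, query) for v in database)

# Hypothetical toy data over the alphabet {A, C, G, T}.
db = ["ACGT", "ACGA", "ACCT", "TCGT", "AGGT", "TTTT"]
hist = distance_histogram(db, "ACGG")
```

Here two vectors lie at distance 1 and three at distance 2, so a 3-NN query has three equally valid candidates for its last slot. This is the kind of non-determinism that a finer-grained distance is meant to reduce.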
Lecture Notes in Computer Science, 2006
Let U be a set of elements and d a distance function defined among them. Let NN_k(u) be the k elements in U − {u} having the smallest distance to u. The k-nearest neighbor graph (knng) is a weighted
Lecture Notes in Computer Science, 2007
Similarity search algorithms that directly rely on index structures and require a lot of distance computations are usually not applicable to databases containing complex objects and defining costly distance functions on spatial, temporal and multimedia data. Rather, the use of an adequate multi-step query processing strategy is crucial for the performance of a similarity search routine that deals with complex distance functions. Reducing the number of candidates returned from the filter step, which then have to be exactly evaluated in the refinement step, is fundamental for the efficiency of the query process. The state-of-the-art multi-step k-nearest neighbor (kNN) search algorithms are designed to use only a lower-bounding distance estimation for candidate pruning. However, in many applications, an upper-bounding distance approximation is also available that can additionally be used for reducing the number of candidates. In this paper, we generalize the traditional concept of R-optimality and introduce the notion of R_I-optimality depending on the distance information I available in the filter step. We propose a new multi-step kNN search algorithm that utilizes lower- and upper-bounding distance information (I_lu) in the filter step. Furthermore, we show that, in contrast to existing approaches, our proposed solution is R_{I_lu}-optimal. In an experimental evaluation, we demonstrate the significant performance gain over existing methods.
SIAM Journal on Computing, 2013
We study the Approximate Nearest Neighbor problem for metric spaces where the query points are constrained to lie on a subspace of low doubling dimension, while the data is high-dimensional. We show that this problem can be solved efficiently despite the high dimensionality of the data.
Lecture Notes in Computer Science, 2006
Proximity searching consists in retrieving from a database those elements that are similar to a query. As the distance is usually expensive to compute, the goal is to use as few distance computations as possible to satisfy queries. Indexes use precomputed distances among database elements to speed up queries. As such, a baseline is AESA, which stores all the distances among database objects, but has been unbeaten in query performance for 20 years. In this paper we show that it is possible to improve upon AESA by using a radically different method to select promising database elements to compare against the query. Our experiments show improvements of up to 75% in document databases. We also explore the usage of our method as a probabilistic algorithm that may lose relevant answers. On a database of faces where any exact algorithm must examine virtually all elements, our probabilistic version obtains 85% of the correct answers by scanning only 10% of the database.
Multimedia Tools and Applications, 2003
In order to speed up retrieval in large collections of data, index structures partition the data into subsets so that query requests can be evaluated without examining the entire collection. As the complexity of modern data types grows, metric spaces have become a popular paradigm for similarity retrieval. We propose a new index structure, called D-Index, that combines a novel clustering technique and the pivot-based distance searching strategy to speed up execution of similarity range and nearest neighbor queries for large files with objects stored in disk memories. We have qualitatively analyzed D-Index and verified its properties on an actual implementation. We have also compared D-Index with other index structures and demonstrated its superiority on several real-life data sets. Contrary to tree organizations, the D-Index structure is suitable for dynamic environments with a high rate of delete/insert operations.
We investigate the problem of approximate similarity (nearest neighbor) search in high-dimensional metric spaces, and describe how the distance distribution of the query object can be exploited so as to provide probabilistic guarantees on the quality of the result. This leads to a new paradigm for similarity search, called PAC-NN (probably approximately correct nearest neighbor) queries, aiming to break the "dimensionality curse". PAC-NN queries return, with probability at least 1 − δ, a (1 + ε)-approximate NN, that is, an object whose distance from the query q is less than (1 + ε) times the distance between q and its NN. Analytical and experimental results obtained for sequential and index-based algorithms show that PAC-NN queries can be efficiently processed even on very high-dimensional spaces and that control can be exerted in order to trade off the accuracy of the result and the cost.
Lecture Notes in Computer Science, 2012
We propose a novel approach for solving the approximate nearest neighbor search problem in arbitrary metric spaces. The distinctive feature of our approach is that we can incrementally build a non-hierarchical distributed structure for given metric space data, with logarithmic complexity scaling in the size of the structure and probabilistic nearest neighbor queries of adjustable accuracy. The structure is based on a small world graph with vertices corresponding to the stored elements, edges for links between them, and the greedy algorithm as the base algorithm for searching. Both search and addition algorithms require only local information from the structure. The simulation performed for data in Euclidean space shows that the structure built using the proposed algorithm has navigable small world properties with logarithmic search complexity at fixed accuracy and has weak (power law) scalability with the dimensionality of the stored data.
Lecture Notes in Computer Science, 2004
A number of problems in computer science can be solved efficiently with the so-called memory-based or kernel methods. Among these problems (relevant to the AI community) are multimedia indexing, clustering, unsupervised learning and recommendation systems. The common ground of these problems is satisfying proximity queries with an abstract metric database. In this paper we introduce a new technique for making practical indexes for metric range queries. This technique improves existing algorithms based on pivots and signatures, and introduces a new data structure, the Fixed Queries Trie, to speed up metric range queries. The result is an index with O(n) construction time and query complexity O(n^α), α ≤ 1. The indexing algorithm uses only a few bits of storage for each database element.
Given a set P of N points in a d-dimensional space, along with a query point q, it is often desirable to find k points of P that are with high probability close to q. This is the Approximate k-Nearest-Neighbors (AkNN) problem. We present two algorithms for AkNN. Both require O(N^2 d) preprocessing time. The first algorithm has a query time cost that is O(d + log N), while the second has a query time cost that is O(d). Both algorithms create an undirected graph on the points of P by adding edges to a linked list storing P in Hilbert order. To find approximate nearest neighbors of a query point, both algorithms perform best-first search on this graph. The first algorithm uses standard one-dimensional indexing structures to find starting points on the graph for this search, whereas the second algorithm uses random starting points. Despite the quadratic preprocessing time, our algorithms have the potential to be useful in machine learning applications where the number of query points that need to be processed is large compared to the number of points in P. The linear dependence on d of the preprocessing and query time costs of our algorithms allows them to remain effective even when dealing with high-dimensional data.