1996, Information Processing Letters
Fast retrieval from large dictionaries is a classic problem addressed by various data structures. This paper introduces a new storage-efficient structure called multiple bit hashing (mbh) that enhances retrieval speed by using multiple hash tables to represent keywords as bits, rather than storing complete keywords. The mbh structure's efficiency is demonstrated through its application and performance analysis, which reveals significant advantages over traditional techniques in terms of space utilization and retrieval time.
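The abstract does not spell out mbh's exact layout, so the following is only a Bloom-filter-style sketch of the stated idea (the table count, table size, and BLAKE2-based hashes are assumptions): each keyword is recorded as one bit in each of several hash tables, so membership tests may return rare false positives but never false negatives.

```python
import hashlib

class MultiBitTable:
    """Several hash tables whose entries are single bits rather than
    stored keywords; lookups may report false positives, never false
    negatives.  A sketch of the idea, not the paper's exact scheme."""

    def __init__(self, num_tables=4, bits_per_table=1 << 16):
        self.m = bits_per_table
        self.tables = [bytearray(self.m // 8) for _ in range(num_tables)]

    def _bit(self, t, key):
        # Independent hash per table, derived from a per-table salt.
        h = hashlib.blake2b(key.encode(), digest_size=8, salt=bytes([t]))
        return int.from_bytes(h.digest(), "big") % self.m

    def add(self, key):
        for t in range(len(self.tables)):
            b = self._bit(t, key)
            self.tables[t][b // 8] |= 1 << (b % 8)

    def __contains__(self, key):
        for t in range(len(self.tables)):
            b = self._bit(t, key)
            if not (self.tables[t][b // 8] >> (b % 8)) & 1:
                return False
        return True
```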
Intelligent Distributed Computing, Systems and …, 2008
Data dictionaries allow efficient transformation of repeating input values. The attention is focused on the analysis of voluminous lookup tables that store up to a few tens of millions of key-value pairs. Because of their compactness and search efficiency, hash tables turn out to provide the best solutions in such cases. This paper deals with performance issues of such structures, and its main contribution is to take into consideration the effect of the multi-level memory hierarchies present in all current computers. The paper enumerates and compares various choices and methods in order to give an indication of how to choose the structure and the parameters of hash tables for large-scale, in-memory data dictionaries.
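As a minimal illustration of one memory-hierarchy-friendly choice in this design space (the paper's own recommendations are not reproduced here), the sketch below uses open addressing with linear probing in a single flat array, so a probe walks adjacent slots and typically stays within a cache line; the capacity, power-of-two masking, and Python's built-in hash are assumptions.

```python
class LinearProbingMap:
    """Open addressing with linear probing in one flat array: a probe
    sequence touches consecutive slots, the kind of memory-hierarchy
    effect the paper analyzes."""

    def __init__(self, capacity=1 << 20):
        self.mask = capacity - 1           # capacity must be a power of two
        self.slots = [None] * capacity     # (key, value) pairs or None
        self.size = 0

    def put(self, key, value):
        assert self.size < self.mask       # keep at least one empty slot
        i = hash(key) & self.mask
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) & self.mask        # step to the adjacent slot
        if self.slots[i] is None:
            self.size += 1
        self.slots[i] = (key, value)

    def get(self, key, default=None):
        i = hash(key) & self.mask
        while self.slots[i] is not None:
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) & self.mask
        return default
```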
2017
The problem of text retrieval is continuously attracting more research attention; they still used for efficiently analyze text data. The unstructured text data take more importance in numerous fields such as business analysis, customer retention and extension, social media, information retrieval and legal applications, etc. This article considers the importance of exploratory dictionary construction for finding the concepts of interest, also it proposes a system for efficient dictionary construction, tuning. The re-use of these dictionaries across a large scale and different datasets still remain an unsolved problem. This paper employing different types of hash functions to conduct progressive multi-search stages, and reducing the time that required constructing the dictionary as much as possible while maintaining the accuracy of the information contained in it. Many text-mining tools, hashing functions, data structures concepts and numeration operations were utilized in the planned...
2009
We consider the problem of representing, in a compressed format, a bit-vector $S$ of $m$ bits with $n$ 1s, supporting the following operations, where $b \in \{0, 1\}$: $rank_b(S, i)$ returns the number of occurrences of bit $b$ in the prefix $S[1..i]$; $select_b(S, i)$ returns the position of the $i$th occurrence of bit $b$ in $S$. Such a data structure is called a fully indexable dictionary (FID) [Raman et al., 2007], and is at least as powerful as predecessor data structures.
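A toy version of the FID interface, assuming a plain 0/1 list with block-level prefix counts rather than the paper's succinct representation:

```python
class RankSelect:
    """rank/select over a bit-vector via precomputed per-block counts;
    select falls back to binary search on rank.  Interface sketch only,
    not the compressed structure from the paper."""

    BLOCK = 64

    def __init__(self, bits):
        self.bits = bits                   # list of 0/1
        self.prefix = [0]                  # number of 1s before each block
        for start in range(0, len(bits), self.BLOCK):
            self.prefix.append(self.prefix[-1] + sum(bits[start:start + self.BLOCK]))

    def rank1(self, i):
        """Number of 1s in the prefix of length i, i.e. S[1..i]."""
        block, off = divmod(i, self.BLOCK)
        base = block * self.BLOCK
        return self.prefix[block] + sum(self.bits[base:base + off])

    def rank0(self, i):
        return i - self.rank1(i)

    def select1(self, j):
        """0-based position of the j-th 1 (1 <= j <= total number of 1s)."""
        lo, hi = 0, len(self.bits)
        while lo < hi:
            mid = (lo + hi) // 2
            if self.rank1(mid + 1) < j:
                lo = mid + 1
            else:
                hi = mid
        return lo
```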
Proceedings of the VLDB Endowment, 2015
Hashing is a solved problem. It allows us to get constant time access for lookups. Hashing is also simple. It is safe to use an arbitrary method as a black box and expect good performance, and optimizations to hashing can only improve it by a negligible delta. Why are all of the previous statements plain wrong? That is what this paper is about. In this paper we thoroughly study hashing for integer keys and carefully analyze the most common hashing methods in a five-dimensional requirements space: (1) data-distribution, (2) load factor, (3) dataset size, (4) read/write-ratio, and (5) un/successful-ratio. Each point in that design space may potentially suggest a different hashing scheme, and additionally also a different hash function. We show that picking the right or wrong combination of hashing scheme and hash function may lead to a significant difference in performance. To substantiate this claim, we carefully analyze two additional dimensions: (6) five representati...
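Two of the classic integer hash functions such studies compare, sketched with their commonly cited constants (multiply-shift with a Fibonacci-style multiplier, and the Murmur3 64-bit finalizer); which one wins depends on exactly the dimensions listed above.

```python
def multiply_shift(key, bits):
    """Multiply-shift into a 2**bits range: fast, weaker scrambling."""
    return (key * 0x9E3779B97F4A7C15 & 0xFFFFFFFFFFFFFFFF) >> (64 - bits)

def murmur_fmix64(key):
    """Murmur3 64-bit finalizer: slower, much better bit mixing.
    Use the low bits of the result to index the table."""
    key &= 0xFFFFFFFFFFFFFFFF
    key ^= key >> 33
    key = key * 0xFF51AFD7ED558CCD & 0xFFFFFFFFFFFFFFFF
    key ^= key >> 33
    key = key * 0xC4CEB9FE1A85EC53 & 0xFFFFFFFFFFFFFFFF
    key ^= key >> 33
    return key
```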
Information Processing Letters, 2001
Eighth International Conference on Document Analysis and Recognition (ICDAR'05), 2005
A lexicon is needed in many applications. In the past, different structures such as tries, hash tables and their variants have been investigated for lexicon organization and lexical access. In this paper, we propose a new data structure that combines a hash table with tries for storing a Chinese lexicon. The data structure facilitates efficient lexical access yet requires less memory than a trie lexicon. Experiments are conducted to evaluate its performance for in-vocabulary lexical access, out-of-vocabulary word rejection, and substring matching. The effectiveness of the proposed approach is confirmed.
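A hedged sketch of such a hybrid, assuming the simplest possible split (hash on the first character, a character trie for the rest); the paper's actual bucketing granularity for Chinese may differ.

```python
class HybridLexicon:
    """Hash table over first characters, each entry holding a small
    trie (nested dicts) for the remaining characters.  Words are
    assumed non-empty."""

    def __init__(self):
        self.buckets = {}                  # first char -> trie root

    def add(self, word):
        node = self.buckets.setdefault(word[0], {})
        for ch in word[1:]:
            node = node.setdefault(ch, {})
        node["$"] = True                   # end-of-word marker

    def __contains__(self, word):
        node = self.buckets.get(word[0]) if word else None
        for ch in word[1:]:
            if node is None:
                return False
            node = node.get(ch)
        return node is not None and "$" in node
```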
Many text-processing techniques are based on efficient data storage and retrieval. Careful selection of the data structures and retrieval techniques used plays a significant role in the efficiency of the whole data-processing system. Hashing is a very frequently used technique with O(1) run-time complexity for data storage and retrieval. This paper presents a new technique for hash function construction that avoids the division operation; the proposed technique is especially convenient for processing large textual data sets. The state of the art in hashing of textual data is surveyed (perfect hashing techniques are not included). The proposed hash function construction and hashing technique have been compared with other techniques on different languages and textual data (chemical data sets, etc.).
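A minimal sketch of division-free hashing in the spirit described: mixing with multiply/shift/XOR and reducing with a bit mask over a power-of-two table instead of the `%` operator; the mixing constants here are conventional choices, not the paper's construction.

```python
TABLE_BITS = 20
MASK = (1 << TABLE_BITS) - 1               # power-of-two table size

def hash_word(word):
    """String hash with no division: multiply-add mixing, then a mask."""
    h = 0
    for ch in word:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF   # multiply-add mixing
    h ^= h >> 16                              # fold high bits down
    return h & MASK                           # bit mask, not modulo
```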
1999
We use the signature file method to search for partially specified terms in large lexicons. To optimize efficiency, we use the concepts of the partially evaluated bit-sliced signature file method and memory resident data structures. Our system employs signature partitioning, compression, and term blocking. We derive equations to obtain system design parameters, and measure indexing efficiency in terms of time and space. The resulting approach provides good response time and is storage-efficient.
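A toy bit-sliced signature file under assumed small parameters: each term sets F bits in a W-bit record signature, signatures are stored column-wise (one slice per bit position), and a query ANDs only the slices for its query bits; surviving candidates are "false drops" until verified.

```python
import hashlib

W, F = 64, 3                       # signature width, bits set per term

def term_bits(term):
    h = hashlib.blake2b(term.encode(), digest_size=16).digest()
    return {int.from_bytes(h[2 * i:2 * i + 2], "big") % W for i in range(F)}

def build_slices(records):
    """records: list of term lists.  Slice b gets bit r set iff record
    r's signature has bit b set (column-wise storage)."""
    slices = [0] * W
    for r, terms in enumerate(records):
        for t in terms:
            for b in term_bits(t):
                slices[b] |= 1 << r
    return slices

def query(slices, num_records, term):
    """Candidate record ids; may contain false drops to verify."""
    result = (1 << num_records) - 1
    for b in term_bits(term):      # AND only the slices we need
        result &= slices[b]
    return [r for r in range(num_records) if result >> r & 1]
```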
ACM Transactions on …, 1991
2009
Fast elimination of duplicate data is needed in many areas, especially in the context of textual data. A solution to this problem was recently found for geometrical data, using a hash function to speed up the process. Using a hash function is extremely efficient when incremental elimination is required, especially for processing large data sets. In this paper a new construction of the hash function is presented, producing short clusters with only a few collisions. The proposed hash function is not a perfect hash function; nevertheless, its properties are similar. The hash function takes advantage of the relatively large amount of memory available on modern computers, and works well with large data sets. Experiments have shown that different approaches should be used for different types of languages, because the structures of Slavonic and Anglo-Saxon languages differ. Therefore, tests were made with a Czech dictionary having 2.5 million words and an Englis...
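A sketch of the usage pattern described, incremental duplicate elimination with a chained hash table kept to short clusters; the paper's own hash-function construction is not reproduced, and Python's built-in hash stands in for it.

```python
def dedup(words, table_bits=20):
    """Return the input words with duplicates removed, preserving
    first-occurrence order.  Each bucket is a short chain, so the
    membership scan stays cheap when clusters are short."""
    table = [[] for _ in range(1 << table_bits)]
    mask = (1 << table_bits) - 1
    unique = []
    for w in words:
        bucket = table[hash(w) & mask]
        if w not in bucket:
            bucket.append(w)
            unique.append(w)
    return unique
```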
ACM SIGCSE Bulletin, 2002
Hashing is a singularly important technique for building efficient data structures. Unfortunately, the topic has been poorly understood historically, and recent developments in the practice of hashing have not yet found their way into textbooks.
Proceedings DCC 2002. Data Compression Conference, 2002
ACM Transactions on Database Systems, 1988
A new dynamic hashing scheme is presented. Its most outstanding feature is that any record can be retrieved in exactly one disk access. This is achieved by using a small amount of supplemental internal storage that stores enough information to uniquely determine the current location of any record. The amount of internal storage required is small: typically one byte for each page of the file. The necessary address computation, insertion, and expansion algorithms are presented and the performance is studied by means of simulation. The new method is the first practical method offering one-access retrieval for large dynamic files.
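A simplified in-memory reconstruction of the separator idea such one-access schemes use (page count, page capacity, and hashes are invented for illustration): a small internal table holds a one-byte separator per page; a record lives in the first probe of its sequence whose signature falls below that page's separator, and an overflow lowers the separator, pushing the largest signatures to later probes. Since separators only ever decrease, a lookup reads exactly one page.

```python
import hashlib

NUM_PAGES, PAGE_CAP, SIG_BITS = 64, 4, 8    # one-byte separators

pages = [[] for _ in range(NUM_PAGES)]       # simulated disk pages
separators = [1 << SIG_BITS] * NUM_PAGES     # small in-core table

def probe(key, i):
    """Page number and signature for the i-th probe of `key`."""
    d = hashlib.blake2b(f"{key}/{i}".encode(), digest_size=5).digest()
    return int.from_bytes(d[:4], "big") % NUM_PAGES, d[4]

def lookup(key):
    """One (simulated) page read: the first probe whose signature is
    below that page's separator is the only possible home."""
    for i in range(NUM_PAGES):
        p, s = probe(key, i)
        if s < separators[p]:
            return any(k == key for k, _, _ in pages[p])
    return False

def insert(key, start=0):
    for i in range(start, NUM_PAGES):
        p, s = probe(key, i)
        if s >= separators[p]:
            continue                         # lookup would skip it too
        pages[p].append((key, s, i))
        if len(pages[p]) <= PAGE_CAP:
            return
        # Overflow: keep the PAGE_CAP smallest signatures, lower the
        # separator, and push the evicted records further along.
        pages[p].sort(key=lambda r: r[1])
        new_sep = pages[p][PAGE_CAP][1]
        evicted = [r for r in pages[p] if r[1] >= new_sep]
        pages[p][:] = [r for r in pages[p] if r[1] < new_sep]
        separators[p] = new_sep
        for k, _, j in evicted:
            insert(k, j + 1)
        return
    raise RuntimeError("table too full; a real file would expand")
```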
BRICS Report Series, 1997
We consider dictionaries of size n over the finite universe U = {0, 1}^w and introduce a new technique for their implementation: error correcting codes. The use of such codes makes it possible to replace the use of strong forms of hashing, such as universal hashing, with much weaker forms, such as clustering. We use our approach to construct, for any epsilon > 0, a deterministic solution to the dynamic dictionary problem using linear space, with worst case time O(n^epsilon) for insertions and deletions, and worst case time O(1) for lookups. This is the first deterministic solution to the dynamic dictionary problem with linear space, constant query time, and non-trivial update time. In particular, we get a solution to the static dictionary problem with O(n) space, worst case query time O(1), and deterministic initialization time O(n^(1+epsilon)). The best previous deterministic initialization time for such dictionaries, due to Andersson, is O(n^(2+epsilon)). The model of computa...
Software - Practice and Experience, 2007
A method for finding all matches in a pre-processed dictionary for a query string q and with at most k differences is presented. A very fast constant-time estimate using hashes is presented. A tree structure is used to minimize the number of estimates made. Practical tests are performed, showing that the estimate can filter out 99% of the full comparisons for 40% error rates and dictionaries of up to four million words. The tree is found to be efficient up to a 50% error rate. Copyright © 2006 John Wiley & Sons, Ltd.
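The paper's estimate and tree organisation are not reproduced here; the sketch below shows the general filter-then-verify shape with an assumed letter-set signature. One edit changes the letter set by at most two bits, so more than 2k differing bits safely rules a word out before the expensive edit-distance check.

```python
def signature(word):
    """26-bit set of letters present: a crude constant-size sketch."""
    s = 0
    for ch in word.lower():
        if 'a' <= ch <= 'z':
            s |= 1 << (ord(ch) - ord('a'))
    return s

def edit_distance(a, b):
    """Plain dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[-1] + 1,                 # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def within_k(query, word, k):
    # Cheap constant-time filter first, exact verification second.
    if bin(signature(query) ^ signature(word)).count("1") > 2 * k:
        return False
    return edit_distance(query, word) <= k
```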
Algorithms, 2011
The problem of compressed pattern matching, which has recently been treated in many papers dealing with free text, is extended to structured files, specifically to dictionaries, which appear in any full-text retrieval system. The prefix-omission method is combined with Huffman coding and a new variant based on Fibonacci codes is presented. Experimental results suggest that the new methods are often preferable to earlier ones, in particular for small files which are typical for dictionaries, since these are usually kept in small chunks.
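A minimal sketch of the prefix-omission (front coding) step, which stores each word of a sorted list as a shared-prefix length plus the differing suffix; in the paper the suffixes are then compressed with Huffman or Fibonacci codes, which is omitted here.

```python
def front_encode(sorted_words):
    """Prefix-omission: (shared-prefix length, suffix) per word."""
    prev, out = "", []
    for w in sorted_words:
        lcp = 0
        while lcp < min(len(prev), len(w)) and prev[lcp] == w[lcp]:
            lcp += 1
        out.append((lcp, w[lcp:]))   # suffix would then be entropy-coded
        prev = w
    return out

def front_decode(encoded):
    prev, out = "", []
    for lcp, suffix in encoded:
        prev = prev[:lcp] + suffix
        out.append(prev)
    return out

words = ["carpet", "carpool", "cart", "dog"]
assert front_decode(front_encode(words)) == words
```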
ACM Transactions on Database Systems, 1979
Extendible hashing is a new access technique, in which the user is guaranteed no more than two page faults to locate the data associated with a given unique identifier, or key. Unlike conventional hashing, extendible hashing has a dynamic structure that grows and shrinks gracefully as the database grows and shrinks. This approach simultaneously solves the problem of making hash tables that are extendible and of making radix search trees that are balanced. We study, by analysis and simulation, the performance of extendible hashing. The results indicate that extendible hashing provides an attractive alternative to other access methods, such as balanced trees.
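A compact in-memory sketch of extendible hashing under assumed parameters (bucket capacity 4, Python's built-in hash): the directory holds 2^gd bucket pointers, a full bucket splits on its next bit, and the directory doubles only when the splitting bucket's local depth already equals the global depth, which is what lets the structure grow gracefully.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth                 # local depth
        self.items = {}

class ExtendibleHash:
    CAP = 4                                # records per "page"

    def __init__(self):
        self.gd = 0                        # global depth
        self.dir = [Bucket(0)]

    def _index(self, key):
        return hash(key) & ((1 << self.gd) - 1)

    def get(self, key, default=None):
        # At most two "page faults": the directory, then one bucket.
        return self.dir[self._index(key)].items.get(key, default)

    def put(self, key, value):
        bucket = self.dir[self._index(key)]
        bucket.items[key] = value
        while len(bucket.items) > self.CAP:
            if bucket.depth == self.gd:    # no spare directory bit:
                self.dir = self.dir * 2    # double the directory
                self.gd += 1
            # Split the bucket on its next bit and repoint the directory.
            mask = 1 << bucket.depth
            lo, hi = Bucket(bucket.depth + 1), Bucket(bucket.depth + 1)
            for k, v in bucket.items.items():
                (hi if hash(k) & mask else lo).items[k] = v
            for i in range(len(self.dir)):
                if self.dir[i] is bucket:
                    self.dir[i] = hi if i & mask else lo
            bucket = self.dir[self._index(key)]   # re-split if still full
```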
2006
We propose a novel external memory based algorithm for constructing minimal perfect hash functions h for huge sets of keys. For a set of n keys, our algorithm outputs h in time O(n). The algorithm needs a small vector of one byte entries in main memory to construct h. The evaluation of h(x) requires three memory accesses for any key x. The description of h takes a constant number of bits per key, up to 9 bits for each key, which is asymptotically optimal and close to the theoretical lower bound of around 2 bits per key. In our experiments, we used a collection of 1 billion URLs collected from the web, each URL 64 characters long on average. For this collection, our algorithm (i) finds a minimal perfect hash function in approximately 3 hours using a commodity PC, (ii) needs just 5.45 megabytes of internal memory to generate h and (iii) takes 8.1 bits per key for the description of h.
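The external-memory construction itself is beyond a short sketch; to make the shape of h concrete, here is the classic in-memory CHM construction instead, a different and simpler algorithm: keys become edges of a random graph, and if the graph is acyclic, vertex labels g are chosen so that h(k) = (g[h1(k)] + g[h2(k)]) mod n is minimal and perfect. Python's run-seeded hash stands in for the hash family, so the function is only valid within one process.

```python
import random

def build_mphf(keys, gamma=2.1, max_tries=50):
    """CHM-style minimal perfect hash over `keys` (all distinct)."""
    n = len(keys)
    m = int(gamma * n) + 1                 # vertices; > 2n helps acyclicity
    for _ in range(max_tries):
        s1, s2 = random.getrandbits(64), random.getrandbits(64)
        adj = [[] for _ in range(m)]       # vertex -> [(neighbour, key idx)]
        for i, k in enumerate(keys):
            u, v = hash((s1, k)) % m, hash((s2, k)) % m
            if u == v:
                break                      # self-loop: retry with new seeds
            adj[u].append((v, i))
            adj[v].append((u, i))
        else:
            g, seen, ok = [0] * m, [False] * m, True
            for root in range(m):          # peel each tree of the forest
                if seen[root]:
                    continue
                seen[root] = True
                stack = [(root, -1)]
                while stack and ok:
                    u, in_edge = stack.pop()
                    for v, i in adj[u]:
                        if i == in_edge:
                            continue       # the edge we arrived on
                        if seen[v]:
                            ok = False     # cycle: retry with new seeds
                            break
                        seen[v] = True
                        g[v] = (i - g[u]) % n  # force g[u]+g[v] = i (mod n)
                        stack.append((v, i))
                if not ok:
                    break
            if ok:
                return lambda k: (g[hash((s1, k)) % m] + g[hash((s2, k)) % m]) % n
    raise RuntimeError("no acyclic graph found; raise gamma or max_tries")

words = ["apple", "pear", "plum"]
mph = build_mphf(words)
assert sorted(mph(w) for w in words) == [0, 1, 2]   # distinct slots in [0, n)
```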
Pesquisa Operacional, 2015
Hash tables are among the most important data structures known to mankind. Through hashing, the address of each stored object is calculated as a function of the object's contents. Because they do not require exorbitant space and, in practice, allow for constant-time dictionary operations (insertion, lookup, deletion), hash tables are often employed in the indexation of large amounts of data. Nevertheless, there are numerous problems of somewhat different nature that can be solved in elegant fashion using hashing, with significant economy of time and space. The purpose of this paper is to reinforce the applicability of this technique in the area of Operations Research and to stimulate further reading, for which adequate references are given. To our knowledge, the proposed solutions to the problems presented herein have never appeared in the literature, and some of the problems are novel themselves.