2009, Proceeding of the 18th ACM conference on Information and knowledge management - CIKM '09
…
10 pages
AS-Index is a new index structure for exact string search in disk-resident databases. It uses hashing, unlike known alternatives, whether based on trees or tries. It typically indexes every n-gram in the database, though non-dense indexing is also possible. The hash function uses the algebraic signatures of n-grams. Use of hashing provides constant index access time for arbitrarily long patterns, unlike other structures, whose search cost is at best logarithmic. The storage overhead of AS-Index is about 500-600%, similar to or smaller than that of the alternatives. We show the index structure, our use of algebraic signatures, and the search algorithm. We present the theoretical and experimental performance analysis. We compare AS-Index to the main alternatives. We conclude that AS-Index is an attractive structure, and we indicate directions for future work.
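The idea of hashing n-grams by an algebraic signature can be illustrated with a small in-memory sketch. This is not the AS-Index algorithm or its disk layout, which the abstract does not spell out. It assumes a one-symbol signature over GF(2^8) with generator polynomial 0x11D and primitive element 2, and it looks up only the first n-gram of the pattern before verifying candidates against the text. The names `gf_mul`, `signature`, `build_index`, and `search` are hypothetical.

```python
from collections import defaultdict

GF_POLY = 0x11D  # assumed generator polynomial for GF(2^8); 2 is primitive under it


def gf_mul(a, b):
    """Multiply two elements of GF(2^8): carry-less product reduced by GF_POLY."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= GF_POLY
        b >>= 1
    return result


def signature(data, alpha=2):
    """One-symbol algebraic signature: XOR-sum of data[i] * alpha**i in GF(2^8)."""
    sig, power = 0, 1
    for byte in data:
        sig ^= gf_mul(byte, power)
        power = gf_mul(power, alpha)
    return sig


def build_index(text, n):
    """Dense index: map the signature of every n-gram to its start positions."""
    index = defaultdict(list)
    for pos in range(len(text) - n + 1):
        index[signature(text[pos:pos + n])].append(pos)
    return index


def search(text, index, pattern, n):
    """Hash the pattern's first n-gram, then verify each candidate in the text."""
    if len(pattern) < n:
        raise ValueError("pattern must be at least n symbols long")
    candidates = index.get(signature(pattern[:n]), [])
    return [p for p in candidates if text[p:p + len(pattern)] == pattern]


text = b"abracadabra"
index = build_index(text, 3)
print(search(text, index, b"abra", 3))
```

An 8-bit signature collides often, so the verification step against the text is what keeps the result exact. A practical index would use wider signatures and a bucket layout suited to disk pages.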
The tremendous expanse of search engines, dictionary and thesaurus storage, and other text mining applications, combined with the popularity of readily available scanning devices and optical character recognition tools, has necessitated efficient storage, retrieval and management of massive text databases for various modern applications. For such applications, we propose a novel data structure, INSTRUCT, for efficient storage and management of sequence databases. Our structure uses bit vectors for reusing the storage space for common triplets, and hence, has a very low memory requirement. INSTRUCT efficiently handles prefix and suffix search queries in addition to the exact string search operation by iteratively checking the presence of triplets. The paper also proposes an extension of the structure to handle substring search efficiently, albeit with an increase in the space requirements. This extension is important in the context of trie-based solutions since they are unable to handle such queries efficiently. We perform several experiments portraying that INSTRUCT outperforms the existing structures by nearly a factor of two in terms of space requirements, while the query times are better than the competing structures. The ability to handle insertion and deletion of strings in addition to supporting all kinds of queries including exact search, prefix/suffix search and substring search makes INSTRUCT a complete data structure.
ACM SIGMOD Record, 1992
The objective of this paper is to develop and analyze high performance hash based search methods for main memory databases. We define optimal search in main memory databases as the search that requires at most one key comparison to locate a record. Existing hashing techniques become impractical when they are adapted to yield optimal search in main memory databases because of their large directory size. Multi-directory hashing techniques can provide significantly improved directory utilization over single-directory hashing techniques. A multi-directory hashing scheme, called fast search multi-directory hashing, and its generalization, called controlled search multi-directory hashing, are presented. Both methods achieve linearly increasing expected directory size with the number of records. Their performance is compared to existing alternatives.
Information Processing Letters, 1996
Suffix tree and suffix array are data structures that allow fast search in a large static text. By using the suffix tree data structure we can find all k occurrences of a pattern w in a text of length n in time O(|w| + k). The same problem can be solved by using the suffix array data structure in time O(|w| + log(n) + k). Thus suffix trees perform better than suffix arrays with respect to the search time. On the other hand, suffix trees require as much as four times more memory space than suffix arrays. We propose a new data structure, the augmented suffix array, that allows searching in O(|w| + log log(n) + k) time and requires about the same memory space as the suffix array. Moreover, in case of very large texts, most of the new data structure and the text itself can be stored in secondary memory without compromising search operation efficiency. This is not the case for both suffix trees and suffix arrays.
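For orientation, pattern search over a plain suffix array can be sketched as two binary searches. This is a minimal illustration, not the augmented suffix array proposed in the abstract. It builds the array naively by sorting suffixes, and each comparison costs up to |w| symbol checks, so the search runs in O(|w| log n + k) rather than the O(|w| + log(n) + k) bound quoted above, which needs additional longest-common-prefix information. The function names are hypothetical.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return all start positions of pattern, found by binary search over sa."""
    m = len(pattern)

    # Lower bound: first suffix whose m-symbol prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-symbol prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


text = "banana"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "ana"))
```

The two bounds delimit one contiguous block of the suffix array, which is why the k occurrences can be reported in time proportional to k once the block is located.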
IEEE Transactions on Knowledge and Data Engineering, 2000
The suffix array is an efficient data structure for in-memory pattern search. Suffix arrays can also be used for external-memory pattern search, via two-level structures that use an internal index to identify the correct block of suffix pointers. In this paper we describe a new two-level suffix array-based index structure that requires significantly less disk space than previous approaches. Key to the saving is the use of disk blocks that are based on prefixes rather than the more usual uniform-sampling approach, allowing reductions between blocks and subparts of other blocks. We also describe a new in-memory structure based on a condensed BWT string, and show that it allows common patterns to be resolved without access to the text. Experiments using 64 GB of English web text and a laptop computer with just 4 GB of main memory demonstrate the speed and versatility of the new approach. For this data the index is around one-third the size of previous two-level mechanisms; and the memory footprint of as little as 1% of the text size means that queries can be processed more quickly than is possible with a compact FM-INDEX.
A wide range of applications require that large quantities of data be maintained in sort order on disk. The B-tree, and its variants, are an efficient general-purpose disk-based data structure that is almost universally used for this task. The B-trie has the potential to be a competitive alternative for the storage of data where strings are used as keys, but has not previously been thoroughly described or tested. We propose new algorithms for the insertion, deletion, and equality search of variable-length strings in a disk-resident B-trie, as well as novel splitting strategies which are a critical element of a practical implementation. We experimentally compare the B-trie against variants of B-tree on several large sets of strings with a range of characteristics. Our results demonstrate that, although the B-trie uses more memory, it is faster, more scalable, and requires less disk space.
2008
The suffix tree (or equivalently, the enhanced suffix array) provides efficient solutions to many problems involving pattern matching and pattern discovery in large strings, such as those arising in computational biology. Here we address the problem of arranging a suffix array on disk so that querying is fast in practice. We show that the combination of a small trie and a suffix array-like blocked data structure allows queries to be answered as much as three times faster than the best alternative disk-based suffix array arrangement. Construction of our data structure requires only modest processing time on top of that required to build the suffix tree, and requires negligible extra memory.
The proliferation of online text, such as found on the World Wide Web and in online databases, motivates the need for space-efficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text T consisting of n symbols drawn from a fixed alphabet Σ. The text T can be represented in n lg |Σ| bits by encoding each symbol with lg |Σ| bits. The goal is to support fast online queries for searching any string pattern P of m symbols, with T being fully scanned only once, namely, when the index is created at preprocessing time. The text indexing schemes published in the literature are greedy in terms of space usage: they require Ω(n lg n) additional bits of space in the worst case. For example, in the standard unit cost RAM, suffix trees and suffix arrays need Ω(n) memory words, each of Ω(lg n) bits. These indexes are larger than the text itself by a multiplicative factor of Ω(log_|Σ| n), which is significant when Σ is of constant size, such as in ASCII or Unicode. On the other hand, these indexes support fast searching, either in O(m lg |Σ|) time or in O(m + lg n) time, plus an output-sensitive cost O(occ) for listing the occ pattern occurrences. We present a new text index that is based upon compressed representations of suffix arrays and suffix trees. It achieves a fast O(m / log_|Σ| n + log_|Σ|^ε n) search time in the worst case, for any constant 0 < ε ≤ 1, using at most (ε^-1 + O(1)) n lg |Σ| bits of storage. Our result thus presents for the first time an efficient index whose size is provably linear in the size of the text in the worst case, and for many scenarios, the space is actually sublinear in practice. As a concrete example, the compressed suffix array for a typical 100 MB ASCII file can require 30–40 MB or less, while the raw suffix array requires 500 MB. Our theoretical bounds improve both time and space of previous indexing schemes.
Listing the pattern occurrences introduces a sublogarithmic slowdown factor in the output-sensitive cost, giving O(occ log_|Σ|^ε n) time as a result. When the patterns are sufficiently long, we can use auxiliary data structures in O(n lg |Σ|) bits to obtain a total search bound of O(m / log_|Σ| n + occ) time, which is optimal.
2005
We present the interpolation search tree (ISB-tree), a new cache-aware indexing scheme that supports update operations (insertions and deletions) in O(1) worst-case (w.c.) block transfers and search operations in O(log_B log n) expected block transfers, where B represents the disk block size and n denotes the number of stored elements. The expected search bound holds with high probability for a large class of (unknown) input distributions. The w.c. search bound of our indexing scheme is O(log_B n) block transfers. Our update and expected search bounds constitute a considerable improvement over the O(log_B n) w.c. block transfer bounds for search and update operations achieved by the B-tree and its numerous variants. This is also suggested by a set of preliminary experiments we have carried out. Our indexing scheme is based on an externalization of a main memory data structure based on interpolation search.
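The main-memory technique that the abstract builds on, interpolation search, can be sketched as follows. This is the classic array version over sorted integer keys, not the ISB-tree or its block-based layout. The function name and the sample keys are hypothetical.

```python
def interpolation_search(keys, target):
    """Return the index of target in the sorted list keys, or -1 if absent."""
    lo, hi = 0, len(keys) - 1
    while lo <= hi and keys[lo] <= target <= keys[hi]:
        if keys[hi] == keys[lo]:
            # All remaining keys are equal, so avoid dividing by zero.
            return lo if keys[lo] == target else -1
        # Estimate the position by assuming keys are spread evenly in [lo, hi].
        pos = lo + (target - keys[lo]) * (hi - lo) // (keys[hi] - keys[lo])
        if keys[pos] == target:
            return pos
        if keys[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1


keys = [2, 4, 7, 11, 16, 22, 29, 37]
print(interpolation_search(keys, 16))
```

Instead of always probing the middle of the range, the search probes where the target would sit if keys were evenly distributed. That is the source of the doubly logarithmic expected cost on well-behaved distributions, while skewed inputs can degrade it.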
2010
The field of compressed data structures seeks to achieve fast search time, but using a compressed representation, ideally requiring less space than that occupied by the original input data. The challenge is to construct a compressed representation that provides the same functionality and speed as traditional data structures. In this invited presentation, we discuss some breakthroughs in compressed data structures over the course of the last decade that have significantly reduced the space requirements for fast text and document indexing. One interesting consequence is that, for the first time, we can construct data structures for text indexing that are competitive in time and space with the well-known technique of inverted indexes, but that provide more general search capabilities. Several challenges remain, and we focus in this presentation on two in particular: building I/O-efficient search structures when the input data are so massive that external memory must be used, and incorporating notions of relevance in the reporting of query answers.