2008
Current data structures for searching large string collections are limited in that they either fail to achieve minimum space or they cause too many cache misses. In this paper, we discuss some edge linearizations of the classic trie data structure that are simultaneously cache-friendly and storable in compressed space. The widely-known frontcoding scheme [26] is one example of linearization; it is at the core of Prefix B-trees and many other disk-conscious compressed indexes for string collections. However, it is largely thought of as a space-effective heuristic without efficient search support.
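As a rough illustration of the front-coding idea mentioned above, the following minimal Python sketch stores each string of a sorted collection as the length of the prefix it shares with its predecessor plus the remaining suffix. The function names (`common_prefix_len`, `front_encode`, `front_decode`) and the sample words are assumptions made for this example; they are not taken from the paper, and the sketch does not include the search support the paper is concerned with.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    limit = min(len(a), len(b))
    i = 0
    while i < limit and a[i] == b[i]:
        i += 1
    return i


def front_encode(sorted_strings):
    """Encode a sorted list as (shared_prefix_length, remaining_suffix) pairs."""
    encoded = []
    prev = ""
    for s in sorted_strings:
        k = common_prefix_len(prev, s)
        encoded.append((k, s[k:]))
        prev = s
    return encoded


def front_decode(encoded):
    """Rebuild the original sorted list from the (length, suffix) pairs."""
    decoded = []
    prev = ""
    for k, suffix in encoded:
        s = prev[:k] + suffix
        decoded.append(s)
        prev = s
    return decoded


# Hypothetical sample collection, sorted before encoding.
words = sorted(["trie", "tried", "tries", "trip", "tree"])
codes = front_encode(words)
```

Decoding is inherently sequential here, since each string depends on its predecessor; that is one way to see why plain front coding compresses well but does not by itself support efficient search.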
Pattern matching on text data has been a fundamental field of Computer Science for nearly 40 years. Databases supporting full-text indexing functionality on text data are now widely used by biologists. In the theoretical literature, the most popular internal-memory index structures are suffix trees and suffix arrays, and the most popular external-memory index structure is the string B-tree. However, the practical applicability of these indexes has been limited, mainly because of their space consumption and I/O issues. These structures use far more space (roughly 20 to 50 times more) than the original text data and are often disk-resident. Ferragina and Manzini (2005) and Grossi and Vitter (2005) gave the first compressed text indexes with efficient query times in the internal-memory model. Recently, Chien et al. (2008) presented a compact text index in external memory based on the concept of the Geometric Burrows-Wheeler Transform. They also presented lower bounds which suggested that it may be hard to obtain a good index structure in external memory. In this paper, we investigate this issue from a practical point of view. On the positive side, we show an external-memory text indexing structure (based on R-trees and KD-trees) that saves about an order of magnitude in space compared to the standard String B-tree. While saving space, these structures also maintain I/O efficiency comparable to that of the String B-tree. We also show various space vs. I/O efficiency trade-offs for our structures.
ACM Transactions on Algorithms, 2007
Let T be a string with n characters over an alphabet of constant size. The recent breakthrough on compressed indexing allows us to build an index for T in optimal space (i.e., O(n) bits), while supporting very efficient pattern matching. Yet the compressed nature of such indexes also makes them difficult to update dynamically.
Proceedings of the International Workshop on …
We introduce a practical disk-based compressed text index that, when the text is compressible, takes little more than the plain text size (and replaces it). It provides very good I/O times for searching, which in particular improve when the text is compressible. In this aspect our index is unique, as compressed indexes have been slower than their classical counterparts on secondary memory. We analyze our index and show experimentally that it is extremely competitive on compressible texts.
The proliferation of online text, such as found on the World Wide Web and in online databases, motivates the need for space-efficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text $T$ consisting of $n$ symbols drawn from a fixed alphabet $\Sigma$. The text $T$ can be represented in $n \lg |\Sigma|$ bits by encoding each symbol with $\lg |\Sigma|$ bits. The goal is to support fast online queries for searching any string pattern $P$ of $m$ symbols, with $T$ being fully scanned only once, namely, when the index is created at preprocessing time. The text indexing schemes published in the literature are greedy in terms of space usage: they require $\Omega(n \lg n)$ additional bits of space in the worst case. For example, in the standard unit cost RAM, suffix trees and suffix arrays need $\Omega(n)$ memory words, each of $\Omega(\lg n)$ bits. These indexes are larger than the text itself by a multiplicative factor of $\Omega(\lg_{|\Sigma|} n)$, which is significant when $\Sigma$ is of constant size, such as in ASCII or Unicode. On the other hand, these indexes support fast searching, either in $O(m \lg |\Sigma|)$ time or in $O(m + \lg n)$ time, plus an output-sensitive cost $O(occ)$ for listing the $occ$ pattern occurrences. We present a new text index that is based upon compressed representations of suffix arrays and suffix trees. It achieves a fast $O(m / \lg_{|\Sigma|} n + \lg^{\varepsilon}_{|\Sigma|} n)$ search time in the worst case, for any constant $0 < \varepsilon \le 1$, using at most $(\varepsilon^{-1} + O(1))\, n \lg |\Sigma|$ bits of storage. Our result thus presents for the first time an efficient index whose size is provably linear in the size of the text in the worst case, and for many scenarios, the space is actually sublinear in practice. As a concrete example, the compressed suffix array for a typical 100 MB ASCII file can require 30–40 MB or less, while the raw suffix array requires 500 MB. Our theoretical bounds improve both time and space of previous indexing schemes.
Listing the pattern occurrences introduces a sublogarithmic slowdown factor in the output-sensitive cost, giving $O(occ \lg^{\varepsilon}_{|\Sigma|} n)$ time as a result. When the patterns are sufficiently long, we can use auxiliary data structures in $O(n \lg |\Sigma|)$ bits to obtain a total search bound of $O(m / \lg_{|\Sigma|} n + occ)$ time, which is optimal.
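For orientation, the uncompressed baseline that this abstract compares against can be sketched in a few lines of Python: a plain suffix array plus two binary searches that delimit the suffixes starting with the pattern. This is only an illustration under simplifying assumptions. The construction is a naive sort, each comparison costs O(m), so the search takes O(m lg n) time rather than the O(m + lg n) bound quoted above, and nothing here is compressed. The names `build_suffix_array` and `find_occurrences` are hypothetical.

```python
def build_suffix_array(text):
    """Naive suffix array: start positions of suffixes in lexicographic order.

    Sorting materialised suffixes is quadratic in space; fine for a sketch,
    not for large texts.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of pattern in text.

    Two binary searches find the contiguous suffix-array range whose
    suffixes begin with pattern.
    """
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[first:lo])


# Hypothetical sample text.
text = "banana"
sa = build_suffix_array(text)
hits = find_occurrences(text, sa, "ana")
```

Storing one machine word per text position is what makes this structure several times larger than the text, which is the overhead the compressed representations aim to remove.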
2010
The field of compressed data structures seeks to achieve fast search time, but using a compressed representation, ideally requiring less space than that occupied by the original input data. The challenge is to construct a compressed representation that provides the same functionality and speed as traditional data structures. In this invited presentation, we discuss some breakthroughs in compressed data structures over the course of the last decade that have significantly reduced the space requirements for fast text and document indexing. One interesting consequence is that, for the first time, we can construct data structures for text indexing that are competitive in time and space with the well-known technique of inverted indexes, but that provide more general search capabilities. Several challenges remain, and we focus in this presentation on two in particular: building I/O-efficient search structures when the input data are so massive that external memory must be used, and incorporating notions of relevance in the reporting of query answers.
2007
Compressed text (self-)indexes have matured up to a point where they can replace a text by a data structure that requires less space and, in addition to giving access to arbitrary text passages, support indexed text searches. At this point those indexes are competitive with traditional text indexes (which are very large) for counting the number of occurrences of a pattern in the text. Yet, they are still hundreds to thousands of times slower when it comes to locating those occurrences in the text. In this paper we introduce a new compression scheme for suffix arrays which permits locating the occurrences extremely fast, while still being much smaller than classical indexes. In addition, our index permits a very efficient secondary memory implementation, where compression permits reducing the amount of I/O needed to answer queries.
Computing Research Repository, 2007
A compressed full-text self-index represents a text in a compressed form and still answers queries efficiently. This technology represents a breakthrough over the text indexing techniques of the previous decade, whose indexes required several times the size of the text. Although it is relatively new, this technology has matured up to a point where theoretical research is giving way to practical developments. Nonetheless this requires significant programming skills, a deep engineering effort, and a strong algorithmic background to dig into the research results. To date only isolated implementations and focused comparisons of compressed indexes have been reported, and they missed a common API, which prevented their re-use or deployment within other applications.
A new trend in the field of pattern matching is to design indexing data structures which take space very close to that required by the indexed text (in entropy-compressed form) and also simultaneously achieve good query performance. Two popular indexes, namely the FM-index [Ferragina and Manzini, 2005] and the CSA [Grossi and Vitter, 2005], achieve this goal by exploiting the Burrows-Wheeler transform (BWT) [Burrows and Wheeler, 1994]. However, due to the intricate permutation structure of the BWT, no locality of reference can be guaranteed when we perform pattern matching with these indexes. Chien et al. [2008] gave an alternative text index which is based on sparsifying the traditional suffix tree and maintaining an auxiliary 2-D range query structure. Given a text $T$ of length $n$ drawn from an alphabet of size $\sigma$, they achieved an $O(n \log \sigma)$-bit index for $T$ and showed that this index can preserve locality in pattern matching and hence is amenable to use in external-memory settings. We improve upon this index and show how to apply entropy compression to reduce index space. Our index takes $O(n(H_k + 1)) + o(n \log \sigma)$ bits of space, where $H_k$ is the $k$th-order empirical entropy of the text. This is achieved by creating variable-length blocks of text using arithmetic coding.
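The locality issue described above can be seen in a small, uncompressed sketch of BWT-based counting (backward search, the core query of FM-index-style structures). The code below is an illustrative assumption, not the implementation of any cited index: the BWT is built by sorting all rotations, `rank` is a linear scan via `str.count`, and the text is assumed to end with a unique sentinel `$` that is smaller than every other symbol. The names `bwt` and `count_occurrences` are hypothetical.

```python
def bwt(text):
    """Burrows-Wheeler transform by sorting all rotations (naive).

    Assumes text ends with a unique sentinel '$' smaller than all
    other symbols.
    """
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_str, pattern):
    """Count occurrences of pattern using backward search over the BWT."""
    # C[c] = number of symbols in the text that are smaller than c.
    counts = {}
    for ch in bwt_str:
        counts[ch] = counts.get(ch, 0) + 1
    C = {}
    total = 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]

    def rank(ch, i):
        # Occurrences of ch in bwt_str[:i]; a real index answers this in
        # (near-)constant time from a compressed structure.
        return bwt_str.count(ch, 0, i)

    # [lo, hi) is the range of sorted rotations prefixed by the current
    # pattern suffix; it starts as the full range.
    lo, hi = 0, len(bwt_str)
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + rank(ch, lo)
        hi = C[ch] + rank(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


# Hypothetical sample text with sentinel.
transformed = bwt("banana$")
matches = count_occurrences(transformed, "ana")
```

Each pattern symbol triggers `rank` queries at two positions that can lie far apart in the BWT, which gives an informal picture of why such indexes offer no guaranteed locality of reference.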
Theoretical Computer Science, 2019
The rise of repetitive datasets has lately generated a lot of interest in compressed self-indexes based on dictionary compression, a rich and heterogeneous family of techniques that exploits text repetitions in different ways. For each such compression scheme, several different indexing solutions have been proposed in the last two decades. To date, the fastest indexes for repetitive texts are based on the run-length compressed Burrows-Wheeler transform (BWT) and on the Compact Directed Acyclic Word Graph (CDAWG). The most space-efficient indexes, on the other hand, are based on the Lempel-Ziv parsing and on grammar compression. Indexes for more universal schemes such as collage systems and macro schemes have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed that all dictionary compressors can be interpreted as approximation algorithms for the smallest string attractor, that is, a set of text positions capturing all distinct substrings. Starting from this observation, in this paper we develop the first universal compressed self-index, that is, the first indexing data structure based on string attractors, which can therefore be built on top of any dictionary-compressed text representation. Let $\gamma$ be the size of a string attractor for a text of length $n$. From known reductions, $\gamma$ can be chosen to be asymptotically equal to any repetitiveness measure: number of runs in the BWT, size of the CDAWG, number of Lempel-Ziv phrases, number of rules in a grammar or collage system, size of a macro scheme. Our index takes $O(\gamma \lg(n/\gamma))$ words of space and supports locating the $occ$ occurrences of any pattern of length $m$ in $O(m \lg n + occ \lg^{\varepsilon} n)$ time, for any constant $\varepsilon > 0$. This is, in particular, the first index for general macro schemes and collage systems.
Our result shows that the relation between indexing and compression is much deeper than what was previously thought: the simple property standing at the core of all dictionary compressors is sufficient to support fast indexed queries.
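To make the notion of a string attractor concrete, the brute-force Python check below tests whether a given set of text positions "captures" every distinct substring, meaning each substring has at least one occurrence that covers some position in the set. This is a sketch for small inputs only, assuming 0-based positions; the function name `is_string_attractor` is hypothetical, and the code is a direct reading of the definition, not the index construction described in the paper.

```python
def is_string_attractor(text, positions):
    """Return True if every distinct substring of text has an occurrence
    that covers at least one position in `positions` (0-based).

    Brute force over all substring lengths and start positions; intended
    only for short example strings.
    """
    n = len(text)
    gamma = set(positions)
    for length in range(1, n + 1):
        seen = set()     # all distinct substrings of this length
        covered = set()  # those with an occurrence crossing gamma
        for start in range(n - length + 1):
            sub = text[start:start + length]
            seen.add(sub)
            if any(start <= p < start + length for p in gamma):
                covered.add(sub)
        if seen != covered:
            return False
    return True


# Hypothetical examples on short, highly repetitive strings.
ok = is_string_attractor("abab", {0, 1})
missing_b = is_string_attractor("abab", {0})
```

For a highly repetitive string such as `"aaaa"`, a single position already suffices, which hints at why small attractors correspond to compressible texts.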
Journal of Algorithms, 2003
New text indexing functionalities of the compressed suffix arrays are proposed. The compressed suffix array proposed by Grossi and Vitter is a space-efficient data structure for text indexing. It occupies only $O(n)$ bits for a text of length $n$; however, it also uses the text itself, which occupies $n \log_2 |A|$ bits for the alphabet $A$. In this paper we modify the data structure so that pattern matching can be done without any access to the text. In addition to the original functions of the compressed suffix array, we add new operations search, decompress and inverse to the compressed suffix arrays. We show that the new index can find $occ$ occurrences of any substring $P$ of the text in $O(|P| \log n + occ \log^{\varepsilon} n)$ time for any fixed $1 \ge \varepsilon > 0$ without access to the text. The index can also decompress a part of the text of length $m$ in $O(m + \log^{\varepsilon} n)$ time. For a text of length $n$ on an alphabet $A$ such that $|A| = \mathrm{polylog}(n)$, our new index occupies only $O(nH_0 + n \log\log |A|)$ bits, where $H_0 \le \log |A|$ is the order-0 entropy of the text. In particular, for $\varepsilon = 1$ the size is $nH_0 + O(n \log\log |A|)$ bits. Therefore the index will be smaller than the text, which means we can perform fast queries from compressed texts.
Lecture Notes in Computer Science, 2010
We study parallel and distributed compressed indexes. Compressed indexes are a new and functional way to index text strings. They exploit the compressibility of the text, so that their size is a function of the compressed text size. Moreover, they support a considerable amount of functions, more than many classical indexes. We make use of this extended functionality to obtain, in a shared-memory parallel machine, near-optimal speedups for solving several stringology problems. We also show how to distribute compressed indexes across several machines.
Data Compression Conference, 2004. Proceedings. DCC 2004
This paper investigates how to index a text which is subject to updates. The best solution in the literature [6] is based on a suffix tree using $O(n \log n)$ bits of storage, where $n$ is the length of the text. It supports finding all occurrences of a pattern $P$ in $O(|P| + occ)$ time, where $occ$ is the number of occurrences. Each text update consists of inserting or deleting a substring of length $y$ and can be supported in $O(y + \sqrt{n})$ time. In this paper, we initiate the study of a compressed index using only $O(n \log |\Sigma|)$ bits of space, where $\Sigma$ denotes the alphabet. Our solution supports finding all occurrences of a pattern $P$ in $O(|P| \log^2 n (\log^{\varepsilon} n + \log |\Sigma|) + occ \log^{1+\varepsilon} n)$ time, while insertion or deletion of a substring of length $y$ can be done in $O((y + \sqrt{n}) \log^{2+\varepsilon} n)$ amortized time, where $0 < \varepsilon \le 1$. The core part of our data structure is based on the recent work on Compressed Suffix Trees (CST) and Compressed Suffix Arrays (CSA).
Journal of the ACM, 2005
We design two compressed data structures for the full-text indexing problem that support efficient substring searches using roughly the space required for storing the text in compressed form.
2006
We report on a new experimental analysis of high-order entropy-compressed suffix arrays, which retains the theoretical performance of previous work and represents an improvement in practice. Our experiments indicate that the resulting text index offers state-of-the-art compression. In particular, we require roughly 20% of the original text size, without requiring a separate instance of the text.
Information Processing Letters, 2006
Lecture Notes in Computer Science, 2004
Let T be a string with n characters over an alphabet of bounded size. The recent breakthrough on compressed indexing allows us to build an index for T in optimal space (i.e., O(n) bits), while supporting very efficient pattern matching [2, 4]. This paper extends the work on optimal-space indexing to a dynamic collection of texts. Precisely, we give a compressed index using $O(n)$ bits, where $n$ is the total length of the texts, such that searching for a pattern $P$ takes $O(|P| \log n + occ \log^2 n)$ time, where $occ$ is the number of occurrences, and inserting or deleting a text $T$ takes $O(|T| \log n)$ time.
2010
Let s be a string whose symbols are solely available through access(i), a read-only operation that probes s and returns the symbol at position i in s. Many compressed data structures for strings, trees, and graphs require two kinds of queries on s: select(c, j), returning the position in s containing the jth occurrence of c, and rank(c, p), counting how many occurrences of c are found in the first p positions of s. We give matching upper and lower bounds for this problem.
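The semantics of the two queries can be pinned down with a naive Python sketch that touches the string only through an `access` function, as the problem statement requires. It uses no auxiliary space and linear time per query, so it is merely a baseline that fixes the definitions, not one of the bounds from the paper. The 0-based positions, the explicit length parameter `n`, the `-1` return value for a missing occurrence, and the helper `make_access` are all assumptions of this example.

```python
def make_access(s):
    """Wrap a string so that it can only be probed one position at a time."""
    def access(i):
        return s[i]
    return access


def rank(access, c, p):
    """Number of occurrences of c among the first p positions (0 .. p-1)."""
    return sum(1 for i in range(p) if access(i) == c)


def select(access, n, c, j):
    """0-based position of the j-th occurrence of c (j >= 1), or -1 if the
    string of length n contains fewer than j occurrences."""
    seen = 0
    for i in range(n):
        if access(i) == c:
            seen += 1
            if seen == j:
                return i
    return -1


# Hypothetical sample string.
s = "abracadabra"
access = make_access(s)
r = rank(access, "a", 4)
pos = select(access, len(s), "a", 3)
```

The two operations are inverse in the sense that `rank(access, c, select(access, n, c, j) + 1) == j` whenever the j-th occurrence exists.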
Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, 2017
We show that the compressed suffix array and the compressed suffix tree of a string $T$ can be built in $O(n)$ deterministic time using $O(n\log\sigma)$ bits of space, where $n$ is the string length and $\sigma$ is the alphabet size. Previously described deterministic algorithms either run in time that depends on the alphabet size or need $\omega(n\log \sigma)$ bits of working space. Our result has immediate applications to other problems, such as yielding the first linear-time LZ77 and LZ78 parsing algorithms that use $O(n \log\sigma)$ bits.
2000
A compressed text database based on the compressed suffix array is proposed. The compressed suffix array of Grossi and Vitter occupies only $O(n)$ bits for a text of length $n$; however, it also uses the text itself, which occupies $O(n \log |\Sigma|)$ bits for the alphabet $\Sigma$. On the other hand, our data structure does not use the text itself, and supports important operations for text databases: inverse, search and decompress. Our algorithms can find $occ$ occurrences of any substring $P$ of the text in $O(|P| \log n + occ \log^{\varepsilon} n)$ time and decompress a part of the text of length $l$ in $O(l + \log^{\varepsilon} n)$ time for any given $1 \ge \varepsilon > 0$. Our data structure occupies only $n\left(\frac{2}{\varepsilon}\left(\frac{3}{2} + H_0 + 2\log H_0\right) + 2 + \frac{4\log^{\varepsilon} n}{\log^{\varepsilon} n - 1}\right) + o(n) + O(|\Sigma| \log |\Sigma|)$ bits, where $H_0 \le \log |\Sigma|$ is the order-0 entropy of the text. We also show the relationship with the opportunistic data structure of Ferragina and Manzini.
In this paper, we develop a simple and practical storage scheme for compressed suffix arrays (CSA). Our CSA can be constructed in linear time and needs $2nH_k + n + o(n)$ bits of space simultaneously for any $k \le c \log_{\sigma} n - 1$ and any constant $c < 1$, where $H_k$ denotes the $k$-th order entropy. We compare the performance of our method with two established compressed indexing methods, viz. the FM-index and the Sad-CSA. Experiments on the Canterbury Corpus and the Pizza&Chili Corpus show significant advantages of our algorithm over the two other indexes in terms of compression and query time. Our storage scheme achieves better performance on all types of data present in these two corpora, except for evenly distributed data, such as DNA. The source code for our CSA is available online.