2009, Journal of Experimental Algorithmics
A compressed full-text self-index represents a text in a compressed form and still answers queries efficiently. This represents a significant advancement over the (full-)text indexing techniques of the previous decade, whose indexes required several times the size of the text. Although it is relatively new, this algorithmic technology has matured up to a point where theoretical research is giving way to practical developments. Nonetheless this requires significant programming skills, a deep engineering effort, and a strong algorithmic background to dig into the research results. To date only isolated implementations and focused comparisons of compressed indexes have been reported, and they missed a common API, which prevented their re-use or deployment within other applications. The goal of this article is to fill this gap. First, we present the existing implementations of compressed indexes from a practitioner's point of view. Second, we introduce the Pizza&Chili site, which of...
Computing Research Repository, 2007
A compressed full-text self-index represents a text in a compressed form and still answers queries efficiently. This technology represents a breakthrough over the text indexing techniques of the previous decade, whose indexes required several times the size of the text. Although it is relatively new, this technology has matured up to a point where theoretical research is giving way to practical developments. Nonetheless this requires significant programming skills, a deep engineering effort, and a strong algorithmic background to dig into the research results. To date only isolated implementations and focused comparisons of compressed indexes have been reported, and they missed a common API, which prevented their re-use or deployment within other applications.
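The core query behind the self-indexes described above can be illustrated with backward search on the Burrows-Wheeler transform, the mechanism underlying FM-index-style structures. The following is a minimal, uncompressed sketch (naive BWT and rank; function names are illustrative, not any particular implementation's API):

```python
def bwt(text):
    """Burrows-Wheeler transform of text (must end with a unique sentinel '$')."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)

def count_occurrences(text, pattern):
    """Count occurrences of pattern in text via backward search on the BWT."""
    L = bwt(text)
    F = sorted(L)
    # C[c] = number of characters in the text strictly smaller than c
    C = {c: F.index(c) for c in set(L)}

    def rank(c, i):  # occurrences of c in L[:i]; a real index stores this compressed
        return L[:i].count(c)

    lo, hi = 0, len(L)  # current suffix-array interval
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + rank(c, lo)
        hi = C[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

print(count_occurrences("mississippi$", "ssi"))  # 2
```

A real compressed self-index replaces the explicit `L[:i].count(c)` scan with a wavelet-tree or similar rank structure, which is where the compression and the query efficiency come from.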
1997
Keyword-based search engines are the basic building block of text retrieval systems. Higher-level systems like content-sensitive search engines and knowledge-based systems still rely on keyword search as the underlying text retrieval mechanism. With the explosive growth in content, Internet and Intranet information repositories require efficient mechanisms to store as well as index data. In this paper we discuss the implementation of the Shrink and Search Engine (SASE) framework, which unites text compression and indexing to maximize keyword search performance while reducing storage cost. SASE features the novel capability of being able to directly search through compressed text without explicit decompression. The implementation includes a search server architecture, which can be accessed from a Java front-end to perform keyword search on the Internet. The performance results show that the compression efficiency of SASE is within 7-17% of GZIP, one of the best lossless compression sch...
The self-index is a kind of highly compressed, self-contained full-text index. It is designed for indexing plain texts in order to reduce their permanent storage as well as to enhance searching performance. Apart from being a sequence of characters, a text usually has a specific internal structure. The data record, as a basic model of structured data, is therefore employed to represent and organize this widespread form of data. In this paper, we design and implement an approach to building the self-index for data records via a text medium. Our approach indexes the data records through an intermediate text which accommodates aligned record fields by stuffing delimiters among them. By theoretical analysis, we give upper bounds on the permanent space of our approach in the worst case. In addition, we report a series of experimental results to validate the correctness and efficiency of the proposed approach.
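The intermediate-text idea above can be sketched as follows: records are flattened into one text with stuffed delimiters, and a self-index built over that text can then answer field-level lookups. The delimiter bytes and function names here are hypothetical choices for illustration, not the paper's actual encoding:

```python
# Hypothetical delimiters; a real implementation must choose (or escape)
# characters guaranteed absent from the field values.
FIELD_SEP = "\x01"
REC_SEP = "\x02"

def records_to_text(records):
    """Serialize records into one delimiter-stuffed text; a self-index
    built over this text stands in for the record collection."""
    return "".join(FIELD_SEP.join(fields) + REC_SEP for fields in records)

def find_record(text, value):
    """Index of the first record containing `value` as a full field, else -1.
    (A self-index would answer this with pattern queries instead of a scan.)"""
    for i, rec in enumerate(text.split(REC_SEP)[:-1]):
        if value in rec.split(FIELD_SEP):
            return i
    return -1

t = records_to_text([["alice", "30"], ["bob", "25"]])
print(find_record(t, "bob"))  # 1
```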
2007
Compressed text (self-)indexes have matured up to a point where they can replace a text by a data structure that requires less space and, in addition to giving access to arbitrary text passages, support indexed text searches. At this point those indexes are competitive with traditional text indexes (which are very large) for counting the number of occurrences of a pattern in the text. Yet, they are still hundreds to thousands of times slower when it comes to locating those occurrences in the text. In this paper we introduce a new compression scheme for suffix arrays which permits locating the occurrences extremely fast, while still being much smaller than classical indexes. In addition, our index permits a very efficient secondary memory implementation, where compression permits reducing the amount of I/O needed to answer queries.
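The locate operation that the paragraph above is concerned with can be sketched on a plain (uncompressed) suffix array: two binary searches delimit the interval of suffixes prefixed by the pattern, and the interval's entries are the occurrence positions. This is only the classical baseline the paper improves on, with naive construction for brevity:

```python
def suffix_array(text):
    """Naive suffix array: start positions of suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def locate(text, sa, pattern):
    """Report all occurrence positions of pattern via binary search on sa."""
    m = len(pattern)
    # leftmost suffix >= pattern
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    # leftmost suffix > pattern (as an m-character prefix)
    lo, hi = left, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[left:lo])

text = "mississippi"
print(locate(text, suffix_array(text), "ssi"))  # [2, 5]
```

Compressed suffix arrays keep the same interval logic but store the array implicitly, which is why locating (recovering the actual positions `sa[left:lo]`) is the expensive step they must speed up.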
Theoretical Computer Science, 2019
The rise of repetitive datasets has lately generated a lot of interest in compressed self-indexes based on dictionary compression, a rich and heterogeneous family of techniques that exploits text repetitions in different ways. For each such compression scheme, several different indexing solutions have been proposed in the last two decades. To date, the fastest indexes for repetitive texts are based on the run-length compressed Burrows-Wheeler transform (BWT) and on the Compact Directed Acyclic Word Graph (CDAWG). The most space-efficient indexes, on the other hand, are based on the Lempel-Ziv parsing and on grammar compression. Indexes for more universal schemes such as collage systems and macro schemes have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed that all dictionary compressors can be interpreted as approximation algorithms for the smallest string attractor, that is, a set of text positions capturing all distinct substrings. Starting from this observation, in this paper we develop the first universal compressed self-index, that is, the first indexing data structure based on string attractors, which can therefore be built on top of any dictionary-compressed text representation. Let γ be the size of a string attractor for a text of length n. From known reductions, γ can be chosen to be asymptotically equal to any repetitiveness measure: number of runs in the BWT, size of the CDAWG, number of Lempel-Ziv phrases, number of rules in a grammar or collage system, size of a macro scheme. Our index takes O(γ lg(n/γ)) words of space and supports locating the occ occurrences of any pattern of length m in O(m lg n + occ lg^ε n) time, for any constant ε > 0. This is, in particular, the first index for general macro schemes and collage systems.
Our result shows that the relation between indexing and compression is much deeper than what was previously thought: the simple property standing at the core of all dictionary compressors is sufficient to support fast indexed queries.
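The defining property of a string attractor stated in the abstract above — a set of positions such that every distinct substring has an occurrence crossing one of them — is simple enough to check directly by brute force. This sketch (exponential in nothing, but O(n^4)-ish, so only for tiny strings) verifies the property from the definition:

```python
def is_attractor(text, positions):
    """True iff every distinct substring of text has at least one
    occurrence crossing a position in `positions` (the string-attractor
    property). Brute force; for illustration on short strings only."""
    n = len(text)
    pos = set(positions)
    seen = set()
    for i in range(n):
        for j in range(i + 1, n + 1):
            sub = text[i:j]
            if sub in seen:
                continue
            seen.add(sub)
            # an occurrence at k covers text positions k .. k+len(sub)-1
            if not any(pos & set(range(k, k + len(sub)))
                       for k in range(n - len(sub) + 1)
                       if text[k:k + len(sub)] == sub):
                return False
    return True

print(is_attractor("abab", {0, 1}))  # True
print(is_attractor("abab", {3}))     # False: 'a' never crosses position 3
```

The set of all positions is trivially an attractor; the interesting (NP-hard) problem, and what dictionary compressors approximate, is finding a small one.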
2010
The field of compressed data structures seeks to achieve fast search time, but using a compressed representation, ideally requiring less space than that occupied by the original input data. The challenge is to construct a compressed representation that provides the same functionality and speed as traditional data structures. In this invited presentation, we discuss some breakthroughs in compressed data structures over the course of the last decade that have significantly reduced the space requirements for fast text and document indexing. One interesting consequence is that, for the first time, we can construct data structures for text indexing that are competitive in time and space with the well-known technique of inverted indexes, but that provide more general search capabilities. Several challenges remain, and we focus in this presentation on two in particular: building I/O-efficient search structures when the input data are so massive that external memory must be used, and incorporating notions of relevance in the reporting of query answers.
Pattern matching on text data has been a fundamental field of Computer Science for nearly 40 years. Databases supporting full-text indexing functionality on text data are now widely used by biologists. In the theoretical literature, the most popular internal-memory index structures are the suffix trees and the suffix arrays, and the most popular external-memory index structure is the string B-tree. However, the practical applicability of these indexes has been limited mainly because of their space consumption and I/O issues. These structures use a lot more space (almost 20 to 50 times more) than the original text data and are often disk-resident. Ferragina and Manzini (2005) and Grossi and Vitter (2005) gave the first compressed text indexes with efficient query times in the internal-memory model. Recently, Chien et al. (2008) presented a compact text index in the external memory based on the concept of Geometric Burrows-Wheeler Transform. They also presented lower bounds which suggested that it may be hard to obtain a good index structure in the external memory. In this paper, we investigate this issue from a practical point of view. On the positive side we show an external-memory text indexing structure (based on R-trees and KD-trees) that saves space by about an order of magnitude as compared to the standard string B-tree. While saving space, these structures also maintain a comparable I/O efficiency to that of the string B-tree. We also show various space vs I/O efficiency trade-offs for our structures.
1997
We present a technique to build an index based on suffix arrays for compressed texts. We also propose a compression scheme for textual databases based on words that generates a compression code that preserves the lexicographical ordering of the text words. As a consequence it permits the sorting of the compressed strings to generate the suffix array without decompressing. As the compressed text is under 30% of the size of the original text we are able to build the suffix array twice as fast on the compressed text. The compressed text plus index is 55-60% of the size of the original text plus index and search times are reduced to approximately half the time. We also present analytical and experimental results for different variations of the word-oriented compression paradigm.
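The key invariant above — codes that compare in the same order as the words they encode — can be sketched as follows. This is a minimal illustration of the idea (fixed-width codes assigned in vocabulary order), not the paper's actual variable-length scheme:

```python
def order_preserving_codes(vocabulary):
    """Assign fixed-width big-endian byte codes to words so that comparing
    two codes as byte strings is equivalent to comparing the words
    lexicographically. Fixed width keeps the sketch simple; the paper's
    scheme is word-oriented and more space-efficient."""
    words = sorted(set(vocabulary))
    width = max(1, (len(words).bit_length() + 7) // 8)
    return {w: i.to_bytes(width, "big") for i, w in enumerate(words)}

def compress(text_words, codes):
    """Concatenate the codes of the words; suffixes of this byte string
    sort in the same order as the corresponding word suffixes."""
    return b"".join(codes[w] for w in text_words)

codes = order_preserving_codes(["banana", "apple", "cherry", "apple"])
print(codes["apple"] < codes["banana"] < codes["cherry"])  # True
```

Because code order matches word order, a suffix array can be built by sorting the compressed suffixes directly, which is what makes construction on the compressed text possible without decompression.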
Computer, 2000
In this article we discuss recent methods for compressing the text and the index of text retrieval systems. By compressing both the complete text and the index, the total amount of space is less than half the size of the original text alone. Most surprisingly, the time required to build the index and also to answer a query is much less than if the index and text had not been compressed. This is one of the few cases where there is no space-time trade-off. Moreover, the text can be kept compressed all the time, allowing updates when changes occur in the compressed text.