Compressed text indexes

2009, Journal of Experimental Algorithmics

Abstract

A compressed full-text self-index represents a text in compressed form and still answers queries efficiently. This is a significant advance over the (full-)text indexing techniques of the previous decade, whose indexes required several times the size of the text. Although relatively new, this algorithmic technology has matured to the point where theoretical research is giving way to practical developments. Nonetheless, putting it into practice requires significant programming skills, a deep engineering effort, and a strong algorithmic background to dig into the research results. To date, only isolated implementations and focused comparisons of compressed indexes have been reported, and they lacked a common API, which prevented their reuse or deployment within other applications. The goal of this article is to fill this gap. First, we present the existing implementations of compressed indexes from a practitioner's point of view. Second, we introduce the Pizza&Chili site, which of...

Key takeaways

  • The use of compressed indexes is not limited to plain-text searching.
  • Several compressed indexes achieve O(nH_k(T^r)) bits of space, instead of O(nH_k(T)), as they work on the contexts following (rather than preceding) the symbol to be encoded.
  • The compressed suffix array (CSA) was not originally a self-index, and required O(n log σ) bits of space [18].
  • We restricted our experiments to a few indexes: Succinct Suffix Array (version SSA v2 in Pizza&Chili), Alphabet-Friendly FM-index (version AF-index v2 in Pizza&Chili), Compressed Suffix Array (CSA in Pizza&Chili), and LZ-index (version LZ-index4 in Pizza&Chili), because they are the best representatives of the three classes of compressed indexes we discussed in Section 3.
  • Table 8 shows the locate time required by an implementation of the classical suffix array: it is between 100 and 1000 times faster than any compressed index, but always 5 times larger than the indexed text.
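
To make the baseline in the last takeaway concrete, here is a minimal sketch of a classical (uncompressed) suffix array with count and locate queries. It is an illustration only, not the implementation measured in the article: the function names (build_suffix_array, suffix_range, count, locate) are hypothetical, and the naive sort-based construction is an assumption chosen for brevity rather than a practical builder.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of text in lexicographic order."""
    # Naive construction by sorting suffixes directly; illustration only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def suffix_range(text: str, sa: list[int], pattern: str) -> tuple[int, int]:
    """Return the half-open interval [lo, hi) of sa whose suffixes start with pattern."""
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def count(text: str, sa: list[int], pattern: str) -> int:
    """Number of occurrences of pattern in text."""
    lo, hi = suffix_range(text, sa, pattern)
    return hi - lo


def locate(text: str, sa: list[int], pattern: str) -> list[int]:
    """Sorted text positions at which pattern occurs."""
    lo, hi = suffix_range(text, sa, pattern)
    return sorted(sa[lo:hi])


if __name__ == "__main__":
    text = "abracadabra"
    sa = build_suffix_array(text)
    print(count(text, sa, "a"))
    print(locate(text, sa, "abra"))
```

Once the binary search has found the interval, each occurrence is read directly from the array, which is why locate is so fast on this structure. The price is one stored integer per text position in addition to the text itself; compressed indexes trade that direct access for space.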