Data Compression Conference, 2014
List update is a key step in Burrows-Wheeler transform (BWT) compression. Previous work has shown that careful study of the list update step leads to better BWT compression. Surprisingly, the theoretical study of list update algorithms for compression has lagged behind their use in practice. More precisely, the standard model of Sleator and Tarjan for list update assumes a linear cost of access, while compression incurs a logarithmic cost of access: accessing item i in the list has cost Θ(i) in the standard model but Θ(log i) in compression applications. These models have been shown, in general, not to be equivalent. This paper has two contributions:
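As a rough illustration of the gap between the two cost models (a hypothetical sketch, not code from the paper), the following Python snippet runs move-to-front on an access sequence and totals the cost of each access under the linear Θ(i) model and a logarithmic Θ(log i) proxy:

```python
import math

def mtf_costs(items, accesses):
    """Run move-to-front over an access sequence and total the access costs
    under the linear model (cost i for position i) and a logarithmic proxy
    (floor(log2(i)) + 1, roughly the bits needed to encode position i)."""
    lst = list(items)
    linear_cost = 0
    log_cost = 0
    for a in accesses:
        i = lst.index(a) + 1                       # 1-based position of the item
        linear_cost += i                           # Sleator-Tarjan cost model
        log_cost += math.floor(math.log2(i)) + 1   # compression-style cost model
        lst.insert(0, lst.pop(i - 1))              # move the accessed item to the front
    return linear_cost, log_cost

# With locality of reference both totals stay small, but the two models
# diverge when accesses hit items deep in the list.
print(mtf_costs("abcd", "aaabbbacd"))
```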
Information Processing Letters, 1996
1999
Abstract: One crucial concern of digital information processing consists precisely of developing increasingly efficient methods for the representation of information itself. Data compression studies algorithms, protocols, schemes, and codes capable of producing "concise" representations of information.
Information Processing & Management, 2018
Text search engines are a fundamental tool nowadays. Their efficiency relies on a popular and simple data structure: inverted indexes. They store an inverted list per term of the vocabulary. The inverted list of a given term stores, among other things, the document identifiers (docIDs) of the documents that contain the term. Currently, inverted indexes can be stored efficiently using integer compression schemes. Previous research also studied how an optimized document ordering can be used to assign docIDs to the document database. This yields important improvements in index compression and query processing time. In this paper we show that using a hybrid compression approach on the inverted lists is more effective in this scenario, with two main contributions:
• First, we introduce a document reordering approach that aims at generating runs of consecutive docIDs in a properly selected subset of the inverted lists of the index.
• Second, we introduce hybrid compression approaches that combine gap and run-length encodings within inverted lists, in order to take advantage not only of small gaps, but also of long runs of consecutive docIDs generated by our document reordering approach.
Our experimental results indicate a reduction of about 10%-30% in the space usage of the whole index (considering only docIDs), compared with the most efficient state-of-the-art results. Also, decompression speed is up to 1.22 times faster if the runs of consecutive docIDs must be explicitly decompressed, and up to 4.58 times faster if implicit decompression of these runs is allowed (e.g., representing the runs as intervals in the output). Finally, we also improve the query processing time of AND queries (by up to 12%), WAND queries (by up to 23%), and full (non-ranked) OR queries (by up to 86%), outperforming the best existing approaches.
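To make the hybrid idea concrete, here is a minimal sketch (hypothetical function names, illustrative run-length threshold) of combining gap and run-length encoding over a sorted docID list; it is not the paper's actual codec:

```python
def hybrid_encode(docids):
    """Encode a strictly increasing docID list as a mix of gaps and runs.
    A maximal run of consecutive docIDs of length >= 3 is emitted as
    ('run', gap_to_run_start, run_length); other docIDs as ('gap', gap).
    The threshold and symbol layout are illustrative only."""
    out, prev, i, n = [], 0, 0, len(docids)
    while i < n:
        j = i
        while j + 1 < n and docids[j + 1] == docids[j] + 1:
            j += 1                                  # extend the run of consecutive ids
        run_len = j - i + 1
        if run_len >= 3:
            out.append(('run', docids[i] - prev, run_len))
            prev, i = docids[j], j + 1
        else:
            out.append(('gap', docids[i] - prev))
            prev, i = docids[i], i + 1
    return out

def hybrid_decode(symbols):
    docids, prev = [], 0
    for sym in symbols:
        if sym[0] == 'run':
            start = prev + sym[1]
            docids.extend(range(start, start + sym[2]))  # runs can stay implicit as intervals
            prev = docids[-1]
        else:
            prev += sym[1]
            docids.append(prev)
    return docids

lst = [3, 4, 5, 6, 9, 14, 15, 16, 40]
assert hybrid_decode(hybrid_encode(lst)) == lst
```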
1992
Data may be compressed using textual substitution. Textual substitution identifies repeated substrings and replaces some or all substrings by pointers to another copy. We construct an incremental algorithm for a specific textual substitution method: coding a text with respect to a dictionary. With this incremental algorithm it is possible to combine two coded texts in constant time.
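A minimal sketch of coding a text with respect to a dictionary, using greedy longest-match substitution; this illustrates the general method only, not the constant-time incremental combination described in the abstract (all names here are illustrative):

```python
def dict_encode(text, dictionary):
    """Code `text` with respect to `dictionary`: greedily replace the longest
    matching dictionary phrase at each position with a pointer (its index);
    unmatched characters are emitted as literals."""
    phrases = sorted(dictionary, key=len, reverse=True)   # try longest phrases first
    out, i = [], 0
    while i < len(text):
        for p in phrases:
            if text.startswith(p, i):
                out.append(('ptr', dictionary.index(p)))
                i += len(p)
                break
        else:
            out.append(('lit', text[i]))
            i += 1
    return out

def dict_decode(tokens, dictionary):
    return ''.join(dictionary[t[1]] if t[0] == 'ptr' else t[1] for t in tokens)

D = ["compress", "ion", "data "]
s = "data compression, data compressor"
assert dict_decode(dict_encode(s, D), D) == s
```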
2013
Abstract: The size of data related to a wide range of applications is growing rapidly. Typically a character requires one byte of storage. Compression of the data is very important for data management. Data compression is the process by which the physical size of data is reduced to save memory or improve traffic speeds on a website. The purpose of data compression is to make a file smaller by minimizing the amount of data present. When a file is compressed, it can be reduced to as little as 25% of its original size, which makes it easier to send to others over the internet. Compressed data may use file extensions such as zip, rar, ace, and bz2. Compression is normally done using special compression software. It serves both to save storage space and to save transmission time. The aim of compression is to produce a new file, as short as possible, containing the compressed version of the same text. Grand challenges such as the Human Genome Project involve very large distributed data...
2000
Abstract: Greedy off-line textual substitution refers to the following approach to compression or structural inference. Given a long text string X, a substring W is identified such that replacing all instances of W in X except one by a suitable pair of pointers yields the highest possible contraction of X; the process is then repeated on the contracted text string until substrings capable of producing contractions can no longer be found.
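The greedy loop can be sketched as follows. This is a toy variant under a fixed-pointer-cost model: all occurrences are replaced and the definition is recorded separately, rather than keeping one in-text copy as the paper does, and occurrence counts may include overlaps, so the estimated gain is approximate:

```python
from collections import Counter

def best_substring(text, min_len=2, max_len=20, ptr_cost=2):
    """Return the substring whose replacement yields the largest estimated
    contraction, assuming each pointer costs `ptr_cost` symbols."""
    counts = Counter()
    for L in range(min_len, min(max_len, len(text)) + 1):
        for i in range(len(text) - L + 1):
            counts[text[i:i + L]] += 1          # counts may include overlapping hits
    best, best_gain = None, 0
    for w, c in counts.items():
        gain = (c - 1) * (len(w) - ptr_cost)    # keep one copy, point to it from the rest
        if gain > best_gain:
            best, best_gain = w, gain
    return best, best_gain

def greedy_contract(text):
    """Repeatedly apply the best single substitution until no contraction helps."""
    rules = []
    while True:
        w, gain = best_substring(text)
        if not w or gain <= 0:
            return text, rules
        sym = chr(0xE000 + len(rules))          # private-use char standing in for a pointer
        rules.append((sym, w))
        text = text.replace(w, sym)

print(greedy_contract("abcabcabcXabcabc"))
```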
2018
The rise of repetitive datasets has lately generated a lot of interest in compressed self-indexes based on dictionary compression, a rich and heterogeneous family of techniques that exploits text repetitions in different ways. For each such compression scheme, several different indexing solutions have been proposed in the last two decades. To date, the fastest indexes for repetitive texts are based on the run-length compressed Burrows–Wheeler transform (BWT) and on the Compact Directed Acyclic Word Graph (CDAWG). The most space-efficient indexes, on the other hand, are based on the Lempel–Ziv parsing and on grammar compression. Indexes for more universal schemes such as collage systems and macro schemes have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed that all dictionary compressors can be interpreted as approximation algorithms for the smallest string attractor, that is, a set of text positions capturing all distinct substrings. Starting from this obse...
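The string-attractor definition quoted above can be checked directly by brute force; the sketch below is illustrative only and far from the efficient machinery of the cited work:

```python
def is_string_attractor(text, positions):
    """Brute-force check of the definition: every distinct substring of `text`
    must have at least one occurrence covering a position in `positions`
    (0-based).  Intended only for tiny examples."""
    pos = set(positions)
    n = len(text)
    for length in range(1, n + 1):
        checked = set()
        for i in range(n - length + 1):
            sub = text[i:i + length]
            if sub in checked:
                continue
            checked.add(sub)
            occurrences = (range(j, j + length)
                           for j in range(n - length + 1)
                           if text[j:j + length] == sub)
            if not any(pos.intersection(occ) for occ in occurrences):
                return False
    return True

t = "abracadabra"
assert is_string_attractor(t, range(len(t)))   # the set of all positions always works
assert not is_string_attractor(t, [0])         # e.g. 'b' never crosses position 0
```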
arXiv (Cornell University), 2013
The advent of massive datasets and the consequent design of high-performing distributed storage systems, such as BigTable by Google [7], Cassandra by Facebook [5], and Hadoop by Apache, have reignited the interest of the scientific and engineering community in the design of lossless data compressors which achieve an effective compression ratio and very efficient decompression speed. Lempel-Ziv's LZ77 algorithm is the de facto choice in this scenario because its decompression is significantly faster than other approaches, and its algorithmic structure is flexible enough to trade decompression speed for compressed-space efficiency. This algorithm has been implemented in many variants, the most famous being the classic gzip, LZ4, and Google's Snappy. Each of these implementations offers a trade-off between space occupancy and decompression speed, so software engineers have to content themselves with picking the one which comes closest to the requirements of the application at hand. Starting from these premises, and for the first time in the literature, we address in this paper the problem of trading these two resources optimally, and in a principled way, by introducing and solving what we call the Bicriteria LZ77-Parsing problem. The goal is to determine an LZ77 parsing which minimizes the space occupancy in bits of the compressed file, provided that the decompression time is bounded by a fixed amount. Symmetrically, we can exchange the roles of the two resources and ask for minimizing the decompression time provided that the compressed space is bounded by a fixed amount. This way, the software engineer can set their space (or time) requirements and then derive the LZ77 parsing which optimizes the decompression speed (or the space occupancy, respectively), thus obtaining the best possible LZ77 compression under those constraints. We solve this problem in four stages: we turn it into a weight-constrained shortest path problem (WCSPP) over a weighted graph derived from the possible LZ77-parsings of the input file; we argue that known solutions for WCSPP are inefficient and thus unusable in practice; we prove some interesting structural properties of that graph; and we then design an O(n log² n)-time algorithm which computes a small additive approximation of the optimal LZ77 parsing. This additive approximation is logarithmic in the input size and thus totally negligible in practice. Finally, we support these arguments with experiments which show that our algorithm combines the best properties of known compressors: its decompression time is close to that of the fastest, Snappy and LZ4, and its compression ratio is close to that of the most succinct, bzip2 and LZMA. In many cases our compressor actually improves on the best known engineered solutions mentioned above, so we can safely state that with our result software engineers have an algorithmic knob with which to trade, automatically and in a principled way, the time/space requirements of their applications.
Summarizing, the three main contributions of the paper are: (i) we introduce the novel Bicriteria LZ77-Parsing problem, which formalizes in a principled way what data compressors have traditionally approached by means of heuristics; (ii) we solve this problem efficiently, in O(n log² n) time and optimal linear space, by proving and deploying some specific structural properties of the weighted graph derived from the possible LZ77-parsings of the input file; (iii) we carry out a preliminary set of experiments which show that our novel proposal dominates all the highly engineered competitors, hence offering a win-win situation in theory and practice.
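To illustrate the graph formulation (not the paper's algorithm, which handles two edge weights and runs in O(n log² n) time), the sketch below computes a bit-optimal LZ77 parsing as a shortest path over a DAG of phrase edges, under a toy Elias-gamma cost model; all function names and costs are illustrative:

```python
import math

def gamma_len(x):
    """Length in bits of the Elias-gamma code of a positive integer."""
    return 2 * int(math.log2(x)) + 1

def bit_optimal_lz77(text):
    """Graph view of LZ77 parsing: node i is the prefix of length i, and an
    edge i -> j is a phrase text[i:j] copied from an earlier occurrence
    (or a single literal when j = i + 1).  Edge weight is a toy estimate of
    the phrase's encoded size; a shortest path from 0 to n gives a
    bit-optimal parsing.  The bicriteria version adds a second,
    decompression-time weight and a budget constraint on it."""
    n = len(text)
    dist = [math.inf] * (n + 1)
    parent = [None] * (n + 1)
    dist[0] = 0
    for i in range(n):                          # relax edges in topological order
        if dist[i] == math.inf:
            continue
        lit_cost = dist[i] + 9                  # flag bit + 8-bit literal (toy model)
        if lit_cost < dist[i + 1]:
            dist[i + 1], parent[i + 1] = lit_cost, (i, None)
        for src in range(i):                    # copy edges from every earlier match
            L = 0
            while i + L < n and text[src + L] == text[i + L]:
                L += 1
            for length in range(2, L + 1):
                cost = dist[i] + 1 + gamma_len(i - src) + gamma_len(length)
                if cost < dist[i + length]:
                    dist[i + length], parent[i + length] = cost, (i, (i - src, length))
    phrases, j = [], n                          # walk parents back from node n
    while j > 0:
        i, ref = parent[j]
        phrases.append(ref if ref else text[i])
        j = i
    return dist[n], phrases[::-1]

print(bit_optimal_lz77("abababababX"))
```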
Witasi, 2002
The growth of data interchange within computer networks, including the Internet, causes bandwidth requirements to increase. Instead of widening the bandwidth, the same effect may be obtained by limiting the amount of interchanged data through the application of data compression. There are several general compression methods, varying in speed and efficiency. In this paper a new method of data compression is described that combines elements of known methods to achieve better compression efficiency in less time. The tests performed show that this method achieves good results, especially for text data, so it could be used in network-related applications in order to limit the amount of interchanged data.
Information processing & management, 2005
Block-sorting is an innovative compression mechanism introduced in 1994 by Burrows and Wheeler. It involves three steps: permuting the input one block at a time through the use of the Burrows–Wheeler transform (BWT); applying a move-to-front (MTF) transform to each of ...
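The first two steps of the block-sorting pipeline can be sketched naively as follows (quadratic-space BWT and linear-scan MTF, for illustration only; real implementations use suffix sorting and a final entropy coder):

```python
def bwt(s, sentinel='\x00'):
    """Naive Burrows-Wheeler transform: sort all rotations of s + sentinel
    and take the last column of the sorted rotation matrix."""
    s += sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return ''.join(r[-1] for r in rotations)

def mtf(s, alphabet=None):
    """Move-to-front transform: emit the current list position of each symbol,
    then move that symbol to the front.  Runs of equal symbols produced by
    the BWT become runs of zeros, which the final entropy coder exploits."""
    table = list(alphabet or sorted(set(s)))
    out = []
    for c in s:
        i = table.index(c)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

b = bwt("banana")
print(repr(b))     # 'annb\x00aa' for this rotation order
print(mtf(b))
```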
Journal of the ACM, 2005
We provide a general boosting technique for textual data compression. Qualitatively, it takes a good compression algorithm and turns it into an algorithm with a better compression performance guarantee. It displays the following remarkable properties: (a) it can turn any memoryless compressor into a compression algorithm that uses the "best possible" contexts; (b) it is very simple and optimal in terms of time; and (c) it admits a decompression algorithm that is again optimal in time. To the best of our knowledge, this is the first boosting technique displaying these properties.
Proceedings of the 20th international conference on World wide web - WWW '11, 2011
Modern search engines are expected to make documents searchable shortly after they appear on the ever-changing Web. To satisfy this requirement, the Web is frequently crawled. Due to the sheer size of their indexes, search engines distribute the crawled documents among thousands of servers in a scheme called local index-partitioning, such that each server indexes only several million pages. To ensure documents from the same host (e.g., www.nytimes.com) are distributed uniformly over the servers, for load-balancing purposes, documents are commonly routed to servers at random. To shorten the time it takes for crawled documents to become searchable, documents may simply be appended to the existing index partitions. However, indexing by merely appending documents results in larger index sizes, since document reordering for index compactness is no longer performed. This, in turn, degrades search query processing performance, which depends heavily on index sizes.
Computer, 2000
In this article we discuss recent methods for compressing the text and the index of text retrieval systems. By compressing both the complete text and the index, the total amount of space is less than half the size of the original text alone. Most surprisingly, the time required to build the index and also to answer a query is much less than if the index and text had not been compressed. This is one of the few cases where there is no space-time trade-off. Moreover, the text can be kept compressed all the time, allowing updates when changes occur in the compressed text.
IEEE Transactions on Computers, 2005
Suffix sorting requires ordering all the suffixes of an input sequence and has applications in running queries on large texts and in universal lossless data compression based on the Burrows–Wheeler transform (BWT). We propose a new suffix lists data structure that leads to three fast, antisequential, and memory-efficient algorithms for suffix sorting. For a length-N input over a size-|X| alphabet, the worst-case complexities of these algorithms are Θ(N²), O(|X| N log(N/|X|)), and O(N √(|X| log(N/|X|))), respectively.
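For reference, a naive suffix sort and its connection to the BWT can be written in a few lines (O(N² log N) with plain string comparisons, nowhere near the bounds above; for illustration only):

```python
def suffix_sort(s):
    """Naive suffix sorting: return the starting positions of all suffixes
    of s in lexicographic order."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def bwt_from_suffix_array(s, sentinel='\x00'):
    """The BWT is the character preceding each sorted suffix (cyclically),
    which is what ties suffix sorting to BWT-based compression."""
    t = s + sentinel
    sa = suffix_sort(t)
    return ''.join(t[i - 1] for i in sa)   # t[-1] is the sentinel, as required

print(suffix_sort("banana"))               # [5, 3, 1, 0, 4, 2]
print(repr(bwt_from_suffix_array("banana")))  # 'annb\x00aa'
```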
Computing Research Repository, 2007
A compressed full-text self-index represents a text in compressed form and still answers queries efficiently. This technology represents a breakthrough over the text indexing techniques of the previous decade, whose indexes required several times the size of the text. Although it is relatively new, this technology has matured to the point where theoretical research is giving way to practical developments. Nonetheless, this requires significant programming skills, a deep engineering effort, and a strong algorithmic background to dig into the research results. To date, only isolated implementations and focused comparisons of compressed indexes have been reported, and they lacked a common API, which prevented their reuse or deployment within other applications.
Communications of the ACM, 1986
A data compression scheme that exploits locality of reference, such as occurs when words are used frequently over short intervals and then fall into long periods of disuse, is described. The scheme is based on a simple heuristic for self-organizing sequential search and on variable-length encodings of integers. We prove that it never performs much worse than Huffman coding and can perform substantially better; experiments on real files show that its performance is usually quite close to that of Huffman coding. Our scheme has many implementation advantages: it is simple, allows fast encoding and decoding, and requires only one pass over the data to be compressed (static Huffman coding takes two passes).
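A minimal one-pass sketch in the spirit of this scheme; the escape convention and the use of Elias-gamma codes for list positions are illustrative choices, not necessarily the paper's exact encoding:

```python
def gamma_encode(x):
    """Elias gamma code of a positive integer: (len-1) zeros, then x in binary."""
    b = bin(x)[2:]
    return '0' * (len(b) - 1) + b

def compress_words(words):
    """One-pass, locality-exploiting sketch: keep a self-organizing word list;
    a word already in the list is encoded as the gamma code of its position
    (shifted by one so that code 1 can serve as the escape) and moved to the
    front; a new word is encoded as the escape followed by a literal word."""
    table, out = [], []
    for w in words:
        if w in table:
            i = table.index(w)
            out.append(gamma_encode(i + 2))     # 1-based position p encoded as gamma(p + 1)
            table.insert(0, table.pop(i))       # move-to-front: recent words get short codes
        else:
            out.append(gamma_encode(1) + w)     # '1' escape followed by the new word
            table.insert(0, w)
    return out

print(compress_words("to be or not to be".split()))
```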
Software - Practice and Experience, 2005
One attractive way to increase text compression is to replace words with references to a text dictionary given in advance. Although a few works exist in this area, they do not fully exploit the compression possibilities or consider alternative preprocessing variants for various compressors in the later phase. In this paper, we discuss several aspects of dictionary-based compression, including compact dictionary representation, and present a PPM/BWCA-oriented scheme, the word replacing transformation, achieving compression ratios 2-6% higher than the state-of-the-art StarNT (2003) text preprocessor, while working at greater speed. We also present an alternative scheme designed for LZ77 compressors, whose advantage over StarNT reaches up to 14% in combination with gzip.
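A toy version of dictionary-based word replacement as a preprocessing step for a general-purpose compressor, using zlib as a stand-in for gzip; the tiny dictionary and the codeword layout are illustrative, not StarNT's or WRT's (real schemes use tens of thousands of frequent words and careful codeword design, and gains only show on large texts):

```python
import zlib

DICT = ["the", "and", "compression", "dictionary", "text"]          # dictionary given in advance
CODE = {w: chr(0xE000 + i) for i, w in enumerate(DICT)}             # private-use codepoints

def word_transform(text):
    """Replace dictionary words with one-character references before handing
    the text to a general-purpose compressor."""
    return ' '.join(CODE.get(w, w) for w in text.split(' '))

def inverse_transform(text):
    rev = {v: k for k, v in CODE.items()}
    return ' '.join(rev.get(w, w) for w in text.split(' '))

s = "the text compression dictionary maps the text onto short codes"
t = word_transform(s)
assert inverse_transform(t) == s
# Compare compressed sizes of the raw and the preprocessed text.
print(len(zlib.compress(s.encode('utf-8'))), len(zlib.compress(t.encode('utf-8'))))
```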
2015
Data compression is a necessity these days, as it makes the communication of data faster. There are many data compression techniques. This paper proposes a lossless data compression method that is very simple, fast, and efficient. It uses a fixed-length encoding of the data and is very fast to decompress in comparison with existing lossless compression techniques.
ACM Transactions on Algorithms, 2006
We report on a new experimental analysis of high-order entropy-compressed suffix arrays, which retains the theoretical performance of previous work and represents an improvement in practice.