2000, IEEE Transactions on Information Theory
A universal lossless data compression code called the multilevel pattern matching code (MPM code) is introduced. In processing a finite-alphabet data string of length n, the MPM code operates at O(log log n) levels sequentially. At each level, the MPM code detects matching patterns in the input data string (substrings of the data appearing in two or more nonoverlapping positions). The matching patterns detected at each level are of a fixed length which decreases by a constant factor from level to level, until this fixed length becomes one at the final level. The MPM code represents information about the matching patterns at each level as a string of tokens, with each token string encoded by an arithmetic encoder. From the concatenated encoded token strings, the decoder can reconstruct the data string via several rounds of parallel substitutions. An O(1/log n) maximal redundancy/sample upper bound is established for the MPM code with respect to any class of finite state sources of uniformly bounded complexity. We also show that the MPM code is of linear complexity in terms of time and space requirements. The results of some MPM code compression experiments are reported.
2015
The performance of the multilevel pattern matching (MPM) code for lossless image compression is first analyzed. It is shown that the worst-case redundancy of the MPM code against all finite 2D context arithmetic codes is O(1/√(log n)), where n is the number of pixels in the image to be compressed. This result is in contrast to the redundancy of O(1/log n) in the case of 1D data and is caused by the so-called 2D boundary effect. To alleviate the 2D boundary effect, we then extend the MPM code to the case of context modeling, yielding a context-dependent MPM code. Our experimental results show that the context-dependent MPM code, with a simple context model, significantly outperforms the original MPM code on a wide range of bi-level images. Comparing with JBIG, the context-dependent MPM code outperforms JBIG in progressive coding mode, and is comparable with JBIG in sequential coding mode.
IEEE Transactions on Information Theory, 2001
MICAI 2004: Advances in Artificial Intelligence, 2004
Most modern lossless data compression techniques in use today are based on dictionaries. If a string of data being compressed matches a portion previously seen, the string is included in the dictionary and a reference to it is emitted every time it appears. A possible generalization of this scheme is to consider not only strings made of consecutive symbols, but more general patterns with gaps between their symbols. The main problems with this approach are the complexity of pattern discovery algorithms and the complexity of selecting a good subset of patterns. In this paper we address the latter of these problems. We demonstrate that this problem is NP-complete and provide some preliminary results on heuristics that point toward its solution.
Communications in Information and Systems, 2002
A grammar-based code losslessly compresses each finite-alphabet data string x by compressing a context-free grammar G_x which represents x in the sense that the language of G_x is {x}. In an earlier paper, we showed that if the grammar G_x is a type of grammar called irreducible grammar for every data string x, then the resulting grammar-based code has maximal redundancy/sample O(log log n / log n) for n data samples. To further reduce the maximal redundancy/sample, in the present paper, we first decompose a context-free grammar into its structure and its data content, then encode the data content conditional on the structure, and finally replace the irreducible grammar condition with a mild condition on the structures of all grammars used to represent distinct data strings of a fixed length. The resulting grammar-based codes are called structured grammar-based codes. We prove a coding theorem which shows that a structured grammar-based code has maximal redundancy/sample O(1/log n) provided that a weak regular structure condition is satisfied.
Conference Proceedings. IEEE Canadian Conference on Electrical and Computer Engineering (Cat. No.98TH8341), 1998
This paper presents a class of new lossless data compression algorithms. Each algorithm in this class first transforms the original data to be compressed into an irreducible table representation and then uses an arithmetic code to compress the irreducible table representation. From the irreducible table representation, one can fully reconstruct the original data by performing multistage parallel substitution. A set of rules is described on how to perform hierarchical transformations from the original data to irreducible table representations. Theoretically, it is proved that all these algorithms outperform any finite state sequential compression algorithm and hence achieve the ultimate compression rate for any stationary and ergodic source. Furthermore, experiments on several standard images show that even a simple algorithm in this class, the so-called multi-level pattern matching algorithm, outperforms the Lempel-Ziv algorithms and arithmetic codes.
IEEE Transactions on Information Theory, 1998
Lossless and lossy data compression algorithms based on string matching are considered. In the lossless case, a result of Wyner and Ziv is extended. In the lossy case, a data compression algorithm based on approximate string matching is analyzed in the following two frameworks: 1) the database and the source together form a Markov chain of finite order; 2) the database and the source are independent with the database coming from a Markov model and the source from a general stationary, ergodic model. In either framework, it is shown that the resulting compression rate converges with probability one to a quantity computable as the infimum of an information-theoretic functional over a set of auxiliary random variables; the quantity is strictly greater than the rate distortion function of the source except in some symmetric cases. In particular, this result implies that the lossy algorithm proposed by Steinberg and Gutman is not optimal, even for memoryless or Markov sources.
2021
Data compression is a challenging and increasingly important problem. As the amount of data generated daily continues to increase, efficient transmission and storage have never been more critical. In this study, a novel encoding algorithm is proposed, motivated by the compression of DNA data and associated characteristics. The proposed algorithm follows a divide-and-conquer approach by scanning the whole genome, classifying subsequences based on similarity patterns, and binning similar subsequences together. The data are then compressed in each bin independently. This approach differs from the currently known approaches: entropy, dictionary, predictive, or transform based methods. Proof-of-concept performance was evaluated using a benchmark dataset with seventeen genomes ranging in size from kilobytes to gigabytes. The results showed considerable improvement in the compression of each genome, preserving several megabytes compared with state-of-the-art tools. Moreover, the algorithm ...
International Journal of Engineering Research and, 2015
The main goal of data compression is to reduce redundancy in stored or communicated data, thereby increasing effective data density. It is a common requirement for most applications. Data compression is especially relevant to file storage and distributed systems, because in a distributed system data must be sent to and from every system. There are two kinds of data compression, "lossy" and "lossless"; this paper focuses only on lossless data compression techniques. In lossless data compression, the integrity of the data is preserved. Data compression decreases the data size by removing excess information, and many techniques exist for reducing redundancy. The methods discussed are Run Length Encoding, Shannon-Fano, Huffman, Arithmetic, adaptive Huffman, LZ77, LZ78 and LZW, along with their performance.
1992
Data may be compressed using textual substitution. Textual substitution identifies repeated substrings and replaces some or all substrings by pointers to another copy. We construct an incremental algorithm for a specific textual substitution method: coding a text with respect to a dictionary. With this incremental algorithm it is possible to combine two coded texts in constant time.
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2000
In this paper, we propose a new two-stage hardware architecture that combines the features of both the parallel dictionary LZW (PDLZW) algorithm and an approximated adaptive Huffman (AH) algorithm. In this architecture, an ordered list instead of the tree-based structure is used in the AH algorithm to speed up the compression data rate. The resulting architecture not only outperforms the AH algorithm at the cost of only one-fourth of the hardware resources but is also competitive with the performance of the LZW algorithm (compress). In addition, both the compression and decompression rates of the proposed architecture are greater than those of the AH algorithm, even when the latter is realized in software.
IEEE Transactions on Information Theory, 2003
A grammar transform is a transformation that converts any data sequence to be compressed into a grammar from which the original data sequence can be fully reconstructed. In a grammar-based code, a data sequence is first converted into a grammar by a grammar transform and then losslessly encoded. In this paper, a greedy grammar transform is first presented; this grammar transform constructs sequentially a sequence of irreducible grammars from which the original data sequence can be recovered incrementally. Based on this grammar transform, three universal lossless data compression algorithms, a sequential algorithm, an improved sequential algorithm, and a hierarchical algorithm, are then developed. These algorithms combine the power of arithmetic coding with that of string matching. It is shown that these algorithms are all universal in the sense that they can achieve asymptotically the entropy rate of any stationary, ergodic source. Moreover, it is proved that their worst case redundancies among all individual sequences of length n are upper-bounded by c log log n / log n, where c is a constant.
2009
Given a sequence S of n symbols over some alphabet Σ, we develop a new compression method that is (i) very simple to implement; (ii) provides O(1) time random access to any symbol of the original sequence; (iii) allows efficient pattern matching over the compressed sequence. Our simplest solution uses at most 2h + o(h) bits of space, where h = n(H0(S) + 1), and H0(S) is the zeroth-order empirical entropy of S. We discuss a number of improvements and trade-offs over the basic method. The new method is applied to text compression. We also propose average case optimal string matching algorithms.
2007 IEEE International Symposium on Information Theory, 2007
In this paper we investigate universal data compression with side information at the decoder by leveraging traditional universal data compression algorithms. Specifically, consider a source network with feedback in which a finite alphabet source X = {X_i}_{i=0}^∞ is to be encoded and transmitted, and another finite alphabet source Y = {Y_i}_{i=0}^∞ is available only to the decoder as the side information correlated with X. Assuming that the encoder and decoder share a uniform i.i.d. (independent and identically distributed) random database that is independent of (X, Y), we propose a string matching-based (variable-rate) block coding algorithm with a simple progressive encoder for the feedback source network. Instead of using standard joint typicality decoding, this algorithm derives its decoding rule from the codeword length function of a traditional universal lossless coding algorithm. As a result, neither the encoder nor the decoder assumes any prior knowledge of the joint distribution of (X, Y) or even the achievable rates. It is proven that for any (X, Y) in the class of all stationary, ergodic source-side information pairs with finite alphabet, the average number of bits per letter transmitted from the encoder to the decoder (compression rate) goes arbitrarily close to the conditional entropy rate H(X|Y) of X given Y asymptotically, and the average number of bits per letter transmitted from the decoder to the encoder (feedback rate) goes to 0 asymptotically.
Most modern lossless data compression techniques in use today are based on dictionaries. If a string of data being compressed matches a portion previously seen, the string is included in the dictionary and a reference to it is emitted every time it occurs. A possible generalization of this scheme is to consider not only strings made of consecutive symbols, but more general patterns with gaps between their symbols. In this paper we introduce an off-line method based on this generalization. We address the main problems involved in such an approach and provide a good approximation to its solution. Categories and Subject Descriptors: E.4 (Coding and Information Theory): data compaction and compression; I.2.8 (Artificial Intelligence): Problem Solving, Control Methods, and Search: heuristic methods. General Terms: Approximation algorithms, heuristics
2016
A new family of promising variable-length self-synchronizable binary codes with multiple pattern delimiters is introduced. Each delimiter consists of a run of consecutive ones surrounded by zero brackets. These codes are complete and universal. A simple bijective correspondence between natural numbers and any multi-delimiter code set is established. A fast byte-aligned decoding algorithm is constructed. Comparisons of text compression rate and decoding speed for different multi-delimiter codes, the Fibonacci code Fib3 and (s, c)-dense codes are also presented.
International Series in Operations Research & Management Science, 2000
Hierarchical lossless data compression is a compression technique that has been shown to effectively compress data in the face of uncertainty concerning a proper probabilistic model for the data. In this technique, one represents a data sequence x using one of three kinds of structures: (1) a tree called a pointer tree, which generates x via a procedure called "subtree copying"; (2) a data flow graph which generates x via a flow of data sequences along its edges; or (3) a context free grammar which generates x via parallel substitutions accomplished with the production rules of the grammar. The data sequence is then compressed indirectly via compression of the structure which represents it. This article is a survey of recent advances in the rapidly growing field of hierarchical lossless data compression. In the article, we illustrate how the three distinct structures for representing a data sequence are equivalent, outline a simple method for designing compact structures for representing a data sequence, and indicate the level of compression performance that can be obtained by compression of the structure representing a data sequence.
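As a rough illustration of structure (3) above, the following Python sketch regenerates a data sequence from a small context-free grammar by rounds of parallel substitution. The grammar, the function name `expand`, and the convention that every dictionary key is a nonterminal are assumptions made for this example; they are not taken from the article.

```python
def expand(grammar, start="S"):
    """Regenerate the string represented by a grammar whose language is {x}.

    `grammar` maps each nonterminal to the list of symbols on the right-hand
    side of its single production rule. Any symbol that is not a key of
    `grammar` is treated as a terminal (an assumption of this sketch).
    Returns the generated string and the number of substitution rounds.
    """
    sequence = list(grammar[start])
    rounds = 0
    # Each round replaces every nonterminal in the sequence at once.
    while any(symbol in grammar for symbol in sequence):
        sequence = [out
                    for symbol in sequence
                    for out in grammar.get(symbol, [symbol])]
        rounds += 1
    return "".join(sequence), rounds


# Hypothetical grammar: S -> A B A, A -> a b, B -> A c
example_grammar = {
    "S": ["A", "B", "A"],
    "A": ["a", "b"],
    "B": ["A", "c"],
}

if __name__ == "__main__":
    print(expand(example_grammar))
```

The loop terminates only for grammars without cyclic rules, which holds for any grammar that represents a single finite string.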
Communications of the ACM, 1987
TURKISH JOURNAL OF ELECTRICAL ENGINEERING & COMPUTER SCIENCES
In this study a new semistatic data compression model that has a fast coding process and that allows compressed pattern matching (CPM) is introduced. The proposed model is named the tagged word-based compression algorithm (TWBCA) since it has a word-based coding and word-based compressed matching algorithm. The model has two phases. In the first phase a dictionary is constructed by adding a phrase, paying attention to word boundaries, and in the second phase compression is done by using codewords of phrases in this dictionary. The first byte of the codeword determines whether the word is compressed or not. By paying attention to this rule, the CPM process can be conducted as word based. In addition, the proposed method makes it possible to also search for the group of consecutively compressed words. Any of the previous pattern matching algorithms can be used in compressed pattern matching as a black box. The duration of the CPM process is always less than the duration of the same process on the texts coded by the Gzip tool. While matching longer patterns, compressed pattern matching takes more time than on the texts coded by compress and end-tagged dense code (ETDC). However, searching shorter patterns takes less time on texts coded by our approach than on the texts compressed with compress. Besides this, the compression ratio of our algorithm performs better than ETDC only on a file that has been written in Turkish. The compression performance of TWBCA is stable and does not vary by more than 6% on different text files.
A data compression scheme that exploits locality of reference, such as occurs when words are used frequently over short intervals and then fall into long periods of disuse, is described. The scheme is based on a simple heuristic for self-organizing sequential search and on variable-length encodings of integers. We prove that it never performs much worse than Huffman coding and can perform substantially better; experiments on real files show that its performance is usually quite close to that of Huffman coding. Our scheme has many implementation advantages: it is simple, allows fast encoding and decoding, and requires only one pass over the data to be compressed (static Huffman coding takes two passes).
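The abstract does not name the heuristic or the integer code. The sketch below assumes move-to-front as the self-organizing list heuristic and the Elias gamma code as the variable-length integer encoding, and it assumes the word list is known to both encoder and decoder in advance. The function names are hypothetical and chosen only for illustration.

```python
def elias_gamma(n):
    """Return the Elias gamma code of a positive integer as a bit string."""
    if n < 1:
        raise ValueError("the gamma code is defined for positive integers")
    binary = bin(n)[2:]
    # len(binary) - 1 zeros announce the length, then the binary digits follow.
    return "0" * (len(binary) - 1) + binary


def mtf_encode(words, vocabulary):
    """Replace each word by its 1-based position in a move-to-front list."""
    table = list(vocabulary)
    positions = []
    for word in words:
        i = table.index(word)
        positions.append(i + 1)
        table.insert(0, table.pop(i))  # recently used words move to the front
    return positions


def mtf_decode(positions, vocabulary):
    """Invert mtf_encode by replaying the same list updates."""
    table = list(vocabulary)
    words = []
    for p in positions:
        word = table.pop(p - 1)
        words.append(word)
        table.insert(0, word)
    return words


if __name__ == "__main__":
    vocab = ["cat", "dog", "the"]
    text = ["the", "cat", "the", "the", "dog", "cat"]
    positions = mtf_encode(text, vocab)
    print(positions, "".join(elias_gamma(p) for p in positions))
```

Words that recur within a short interval sit near the front of the list, so they receive small positions and therefore short codewords, which is how locality of reference turns into compression.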
IEEE Transactions on Information Theory, 2004
It has long been known that the compression redundancy of independent and identically distributed (i.i.d.) strings increases to infinity as the alphabet size grows. It is also apparent that any string can be described by separately conveying its symbols, and its pattern, the order in which the symbols appear. Concentrating on the latter, we show that the patterns of i.i.d. strings over all, including infinite and even unknown, alphabets, can be compressed with diminishing redundancy, both in block and sequentially, and that the compression can be performed in linear time. To establish these results, we show that the number of patterns is the Bell number, that the number of patterns with a given number of symbols is the Stirling number of the second kind, and that the redundancy of patterns can be bounded using results of Hardy and Ramanujan on the number of integer partitions. The results also imply an asymptotically optimal solution for the Good-Turing probability-estimation problem.
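A small Python sketch can make the notion of a pattern concrete. It assumes the pattern replaces each symbol by the index of its first appearance, and it checks the Bell-number count by brute force for short lengths. The function names are hypothetical, and the brute-force check is only an illustration, not the argument used in the paper.

```python
from itertools import product


def pattern(string):
    """Replace each symbol by the order (1, 2, 3, ...) of its first appearance."""
    first_seen = {}
    result = []
    for symbol in string:
        if symbol not in first_seen:
            first_seen[symbol] = len(first_seen) + 1
        result.append(first_seen[symbol])
    return result


def bell(n):
    """Return the n-th Bell number, computed with the Bell triangle."""
    if n == 0:
        return 1
    row = [1]
    for _ in range(n - 1):
        new_row = [row[-1]]  # each row starts with the end of the previous one
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]


def count_patterns(n):
    """Count distinct patterns of length-n strings by exhaustive enumeration."""
    # An alphabet of n symbols is enough to realize every pattern of length n.
    return len({tuple(pattern(s)) for s in product(range(n), repeat=n)})


if __name__ == "__main__":
    print(pattern("abracadabra"))
    print([(n, count_patterns(n), bell(n)) for n in range(1, 6)])
```

The enumeration grows as n to the power n, so it is practical only for very small n.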