1994, IEEE Transactions on Information Theory
We consider the problem of source coding and investigate the cases of known and unknown statistics. The efficiency of compression codes can be estimated by three characteristics: 1) the redundancy (r), defined as the maximal difference between the average codeword length and the Shannon entropy when the letters are generated by a Bernoulli source;
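The redundancy mentioned above can be illustrated concretely. A minimal sketch, assuming symbol-by-symbol coding of a Bernoulli(p) source: any prefix code must spend at least 1 bit per symbol, so the per-symbol redundancy is 1 − h(p), where h is the binary entropy.

```python
import math

def binary_entropy(p):
    """Shannon entropy h(p) of a Bernoulli(p) source, in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Coding a Bernoulli source one symbol at a time forces 1 bit per symbol,
# so the redundancy (average length minus entropy) is 1 - h(p).
p = 0.1
redundancy = 1.0 - binary_entropy(p)
```

For p = 0.1 the redundancy is about 0.53 bits per symbol, which is why block coding (or run-length coding) is needed for skewed sources.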
2009 International Symposium on Signals, Circuits and Systems, 2009
We analyze the lossless compression of a large class of discrete complete and memoryless sources performed by a generalized Huffman code with an alphabet consisting of M letters. Given the number of source messages, N, the alphabet size, M, and the number of code words, p, on each level in the graph, except the last two, we determine the unknown encoding parameters, namely: the number n of levels in the encoding graph, the number q of code words on level n-1, the number k of groups of M nodes, and the number m of remaining nodes on the last level. The average code word length is also computed. Two extreme cases, p=0 and p=M-1, are analyzed.
In coding theory it is widely known that the optimal encoded length for a given alphabet of symbol codes is the Shannon entropy times the number of symbols to be encoded. However, depending on the structure of the message to be encoded, it is possible to beat this optimum by including only frequently occurring aggregates of symbols from the base alphabet. We prove that the change in compressed message length from the introduction of a new aggregate symbol can be expressed as the difference of two entropies, dependent only on the probabilities of the characters within the aggregate, plus a correction term that involves only the probability and length of the introduced symbol. The expression is independent of the probabilities of all other symbols in the alphabet. This measure of information gain for a new symbol can be applied in data compression methods.
Problems of Information Transmission, 2012
The compression-complexity trade-off of lossy compression algorithms that are based on a random codebook or a random database is examined. Motivated, in part, by recent results of Gupta-Verdú-Weissman (GVW) and their underlying connections with the pattern-matching scheme of Kontoyiannis' lossy Lempel-Ziv algorithm, we introduce a non-universal version of the lossy Lempel-Ziv method (termed LLZ). The optimality of LLZ for memoryless sources is established, and its performance is compared to that of the GVW divide-and-conquer approach. Experimental results indicate that the GVW approach often yields better compression than LLZ, but at the price of much higher memory requirements. To combine the advantages of both, we introduce a hybrid algorithm (HYB) that utilizes both the divide-and-conquer idea of GVW and the single-database structure of LLZ. It is proved that HYB shares with GVW the exact same rate-distortion performance and implementation complexity, while, like LLZ, requiring less memory, by a factor which may become unbounded, depending on the choice of the relevant design parameters. Experimental results are also presented, illustrating the performance of all three methods on data generated by simple discrete memoryless sources. In particular, the HYB algorithm is shown to outperform existing schemes for the compression of some simple discrete sources with respect to the Hamming distortion criterion.
2014
Huffman encoding is often improved by using block codes, for example a 3-block would be an alphabet consisting of each possible combination of three characters. We take the approach of starting with a base alphabet and expanding it to include frequently occurring aggregates of symbols. We prove that the change in compressed message length by the introduction of a new aggregate symbol can be expressed as the difference of two entropies, dependent only on the probabilities and length of the introduced symbol. The expression is independent of the probability of all other symbols in the alphabet. This measure of information gain, for a new symbol, can be applied in data compression methods. We also demonstrate that aggregate symbol alphabets, as opposed to mutually exclusive alphabets have the potential to provide good levels of compression, with a simple experiment. Finally, compression gain as defined in this paper may also be useful for feature selection.
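The information-gain idea above can be sketched empirically: compare the ideal compressed size (empirical entropy times stream length) before and after an aggregate symbol is introduced. The greedy left-to-right tokenization below is an illustrative assumption, not the paper's construction.

```python
import math
from collections import Counter

def empirical_cost_bits(tokens):
    """Empirical entropy of the token stream times its length:
    the ideal compressed size in bits under a memoryless model."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum(c * math.log2(c / n) for c in counts.values())

def gain_from_aggregate(text, aggregate):
    """Bits saved by treating `aggregate` as a single symbol
    (greedy left-to-right tokenization; an illustrative choice)."""
    before = empirical_cost_bits(list(text))
    tokens, i = [], 0
    while i < len(text):
        if text.startswith(aggregate, i):
            tokens.append(aggregate)
            i += len(aggregate)
        else:
            tokens.append(text[i])
            i += 1
    after = empirical_cost_bits(tokens)
    return before - after
```

For example, on the string "ababab" the aggregate "ab" reduces the ideal cost from 6 bits to 0, a gain of 6 bits.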
IEEE Transactions on Information Theory, 1999
The problem of coding low-entropy information sources is considered. Since the run-length code was offered about 50 years ago by Shannon, it is known that for such sources there exist coding methods much simpler than for sources of a general type. However, known coding methods of low-entropy sources do not reach the given redundancy. In this correspondence, a new method of coding low-entropy sources is offered. It permits a given redundancy r with almost the same encoder and decoder memory size as that obtained by Ryabko for general methods, while encoding and decoding much faster.
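The run-length idea attributed to Shannon above can be sketched in a few lines: for a low-entropy binary source, encode the lengths of the runs of the common symbol between occurrences of the rare one. This is a toy illustration, not the paper's method.

```python
def run_lengths(bits, rare=1):
    """Lengths of runs of the common symbol between occurrences of the
    rare symbol -- the quantities a run-length code would encode."""
    runs, current = [], 0
    for b in bits:
        if b == rare:
            runs.append(current)
            current = 0
        else:
            current += 1
    runs.append(current)  # trailing run of the common symbol
    return runs

def restore(runs, rare=1, common=0):
    """Invert run_lengths: rebuild the original bit sequence."""
    out = []
    for r in runs[:-1]:
        out.extend([common] * r + [rare])
    out.extend([common] * runs[-1])
    return out
```

When the rare symbol has probability q near zero, the stream of run lengths is much shorter than the original, which is the source of the speed-up for low-entropy sources.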
Communications of the ACM, 1987
IEEE Transactions on Information Theory, 2000
We introduce a universal quantization scheme based on random coding, and we analyze its performance. This scheme consists of a source-independent random codebook (typically mismatched to the source distribution), followed by optimal entropy-coding that is matched to the quantized codeword distribution. A single-letter formula is derived for the rate achieved by this scheme at a given distortion, in the limit of large codebook dimension. The rate reduction due to entropy coding is quantified, and it is shown that it can be arbitrarily large. In the special case of "almost uniform" codebooks (e.g., an i.i.d. Gaussian codebook with large variance) and difference distortion measures, a novel connection is drawn between the compression achieved by the present scheme and the performance of "universal" entropy-coded dithered lattice quantizers. This connection generalizes the "half-a-bit" bound on the redundancy of dithered lattice quantizers. Moreover, it demonstrates a strong notion of universality where a single "almost uniform" codebook is near-optimal for any source and any difference distortion measure. The proofs are based on the fact that the limiting empirical distribution of the first matching codeword in a random codebook can be precisely identified. This is done using elaborate large deviations techniques, that allow the derivation of a new "almost sure" version of the conditional limit theorem.
2001
Kolmogorov complexity, K(x), is the optimal compression bound of a given string x. This incomputable yet fundamental property of information has vast implications and applications in the areas of network and system optimization, security, bioinformatics, and emergence (see [1], [2] for an introduction to Kolmogorov complexity and [2], [3], and [7] for some applications). An ideal approach for compression of information with a known distribution is Huffman coding, which approaches the entropy of the source distribution.
Communications of the ACM, 1989
By dynamically recoding data on the basis of current intercharacter probabilities, the entropy of encoded messages can be significantly reduced.
Mathematics, 2023
Source coding maps elements from an information source to a sequence of alphabetic symbols. Then, the source symbols can be recovered exactly from the binary units. In this paper, we derive an approach that includes information variation in the source coding. The approach is more realistic than its standard version. We employ the Shannon entropy for coding the sequences of a source. Our approach is also helpful for short sequences when the central limit theorem does not apply. We rely on a quantifier of the information variation as a source. This quantifier corresponds to the second central moment of a random variable that measures the information content of a source symbol; that is, considering the standard deviation. An interpretation of typical sequences is also provided through this approach. We show how to use a binary memoryless source as an example. In addition, Monte Carlo simulation studies are conducted to evaluate the performance of our approach. We apply this approach to two real datasets related to purity and wheat prices in Brazil.
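The "information variation" quantifier described above is the second central moment of the information content −log2 p(X) (often called varentropy). A sketch for the binary memoryless source used as the paper's example, under that reading:

```python
import math

def entropy_and_varentropy(p):
    """Mean and variance of the information content -log2 p(X)
    for a Bernoulli(p) source; the variance is the 'information
    variation' (varentropy)."""
    probs = [p, 1 - p]
    info = [-math.log2(q) for q in probs]
    h = sum(q * i for q, i in zip(probs, info))             # entropy
    v = sum(q * (i - h) ** 2 for q, i in zip(probs, info))  # varentropy
    return h, v
```

Note that at p = 1/2 both outcomes carry exactly 1 bit, so the varentropy vanishes; skewed sources have positive varentropy, which is what matters for short sequences where the central limit theorem does not yet apply.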
IEEE Transactions on Information Theory, 2004
It has long been known that the compression redundancy of independent and identically distributed (i.i.d.) strings increases to infinity as the alphabet size grows. It is also apparent that any string can be described by separately conveying its symbols, and its pattern-the order in which the symbols appear. Concentrating on the latter, we show that the patterns of i.i.d. strings over all, including infinite and even unknown, alphabets, can be compressed with diminishing redundancy, both in block and sequentially, and that the compression can be performed in linear time. To establish these results, we show that the number of patterns is the Bell number, that the number of patterns with a given number of symbols is the Stirling number of the second kind, and that the redundancy of patterns can be bounded using results of Hardy and Ramanujan on the number of integer partitions. The results also imply an asymptotically optimal solution for the Good-Turing probability-estimation problem.
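The pattern of a string, and the Bell-number count of patterns mentioned above, are easy to compute directly; this sketch uses the standard Bell-triangle recurrence.

```python
def pattern(s):
    """Pattern of a string: each symbol replaced by the index of its
    first appearance (1 for the first distinct symbol, 2 for the next...)."""
    index, out = {}, []
    for ch in s:
        if ch not in index:
            index[ch] = len(index) + 1
        out.append(index[ch])
    return out

def bell(n):
    """Bell number B(n): the number of distinct patterns of length n,
    computed via the Bell triangle."""
    row = [1]
    for _ in range(n - 1):
        new = [row[-1]]
        for x in row:
            new.append(new[-1] + x)
        row = new
    return row[-1]
```

For instance, pattern("abracadabra") = [1,2,3,1,4,1,5,1,2,3,1], and there are B(3) = 5 patterns of length 3 (111, 112, 121, 122, 123) regardless of alphabet size.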
Encyclopedia of GIS, 2008
This paper surveys a variety of data compression methods spanning almost 40 years of research, from the work of Shannon, Fano, and Huffman in the late 1940s to a technique developed in 1986. The aim of data compression is to reduce redundancy in stored or communicated data, thus increasing effective data density. Data compression has important applications in the areas of file storage and distributed systems. Concepts from information theory as they relate to the goals and evaluation of data compression methods are discussed briefly. A framework for evaluation and comparison of methods is constructed and applied to the algorithms presented. Comparisons of both theoretical and empirical natures are reported, and possibilities for future research are suggested.
2007
The order of letters is not always relevant in a communication task. This paper discusses the implications of order irrelevance on source coding, presenting results in several major branches of source coding theory: lossless coding, universal lossless coding, rate-distortion, high-rate quantization, and universal lossy coding. The main conclusions demonstrate that there is a significant rate savings when order is irrelevant. In particular, lossless coding of n letters from a finite alphabet requires Θ(log n) bits and universal lossless coding requires n + o(n) bits for many countable alphabet sources. However, there are no universal schemes that can drive a strong redundancy measure to zero. Results for lossy coding include distribution-free expressions for the rate savings from order irrelevance in various high-rate quantization schemes. Rate-distortion bounds are given, and it is shown that the analogue of the Shannon lower bound is loose at all finite rates.
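The Θ(log n) figure above has a simple counting explanation: when order is irrelevant only the multiset of letters matters, and the number of multisets of size n over a k-letter alphabet is C(n+k-1, k-1). A sketch of the resulting code length:

```python
import math

def multiset_code_bits(n, k):
    """Bits needed to index a multiset (unordered sample) of size n over
    a k-letter alphabet: log2 of C(n + k - 1, k - 1)."""
    return math.log2(math.comb(n + k - 1, k - 1))

# For fixed k this grows like (k - 1) * log2(n), i.e. Theta(log n),
# versus n * log2(k) bits when the order of the letters must be kept.
```

For a binary alphabet, indexing the multiset of n letters takes log2(n + 1) bits, against n bits for the ordered sequence.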
IEEE Transactions on Information Theory, 2019
According to Kolmogorov complexity, every finite binary string is compressible to a shortest code -- its information content -- from which it is effectively recoverable. We investigate the extent to which this holds for infinite binary sequences (streams). We devise a new coding method which uniformly codes every stream X into an algorithmically random stream Y, in such a way that the first n bits of X are recoverable from the first I(X↾n) bits of Y, where I is any partial computable information content measure which is defined on all prefixes of X, and where X↾n is the initial segment of X of length n. As a consequence, if g is any computable upper bound on the initial segment prefix-free complexity of X, then X is computable from an algorithmically random Y with oracle-use at most g. Alternatively (making no use of such a computable bound g) one can achieve an oracle-use bounded above by K(X↾n) + log n. This provides a strong analogue of Shannon's source coding theorem for algorithmic information theory.
2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2010
Journal of Computer and Communications
In this paper, we analyze the complexity and entropy of different data compression algorithms: LZW, Huffman, fixed-length code (FLC), and Huffman after using fixed-length code (HFLC). We test these algorithms on files of different sizes and conclude that LZW is the best one on all compression scales that we tested, especially on large files, followed by Huffman, HFLC, and FLC, respectively. Data compression remains an important research topic with many applications. Therefore, we suggest continuing research in this field, trying to combine two techniques to reach a better one, or using another source mapping (such as Hamming, embedding a linear array into a hypercube) together with good techniques like Huffman, to reach better results.
2009 International Symposium on Signals, Circuits and Systems, 2009
In this paper an information analysis for lossless compression of a large class of discrete sources is performed. The lossless compression is performed by means of a Huffman code with an alphabet A of size M. Matrix characterization of the encoding as a source with memory is realized. The information quantities H(S,A), H(S), H(A), H(A|S), H(S|A), I(S,A) as well as the minimum average code word length are derived. Three extreme cases, p=M-1, p=0 and M=2, p=1 have been analyzed.
IEEE Transactions on Information Theory, 1979
Abstract--A combinatorial approach is proposed for proving the classical source coding theorems for a finite memoryless stationary source (giving achievable rates and the error probability exponent). This approach provides a sound heuristic justification for the widespread appearance of entropy and divergence (Kullback's discrimination) in source coding. The results are based on the notion of composition class -- a set made up of all the distinct source sequences of a given length which are permutations of one another. The asymptotic growth rate of any composition class is precisely an entropy. For a finite memoryless constant source all members of a composition class have equal probability; the probability of any given class therefore is equal to the number of sequences in the class times the probability of an individual sequence in the class. The number of different composition classes is algebraic in block length, whereas the probability of a composition class is exponential, and the probability exponent is a divergence. Thus if a codeword is assigned to all sequences whose composition classes have rate less than some rate R, the probability of error is asymptotically the probability of the most probable composition class of rate greater than R. This is expressed in terms of a divergence. No use is made either of the law of large numbers or of Chebyshev's inequality.
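The two facts the argument above rests on -- the size of a composition class is a multinomial coefficient, and the class probability is that size times the common sequence probability -- can be checked numerically. A small sketch:

```python
import math
from math import factorial

def type_class_size(counts):
    """Number of sequences sharing the composition (type) given by
    counts: the multinomial coefficient n! / (n1! * n2! * ...)."""
    n = sum(counts)
    size = factorial(n)
    for c in counts:
        size //= factorial(c)
    return size

def type_class_log_prob(counts, probs):
    """log2-probability of the whole composition class under an i.i.d.
    source: log of (class size times the common sequence probability)."""
    seq_log_p = sum(c * math.log2(p) for c, p in zip(counts, probs))
    return math.log2(type_class_size(counts)) + seq_log_p
```

For example, the class of length-3 binary sequences with two zeros has 3!/2! = 3 members, and under a fair coin its total probability is 3/8; (1/n) log2 of the class size tends to the entropy of the empirical distribution as n grows.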
2002
In lossless data compression, given a sequence of observations (X_n)_{n≥1} and a family of probability distributions {Q_θ}_{θ∈Θ}, the estimators (θ̃_n)_{n≥1} obtained by minimizing the ideal Shannon code-lengths over the family {Q_θ}_{θ∈Θ},