1992
Data may be compressed using textual substitution. Textual substitution identifies repeated substrings and replaces some or all of them by pointers to another copy. We construct an incremental algorithm for a specific textual substitution method: coding a text with respect to a dictionary. With this incremental algorithm it is possible to combine two coded texts in constant time.
1999
Abstract: One crucial concern of digital information processing consists precisely of developing increasingly efficient methods for the representation of information itself. Data compression studies algorithms, protocols, schemes, and codes capable of producing "concise" representations of information.
Software - Practice and Experience, 2005
One of the attractive ways to increase text compression is to replace words with references to a text dictionary given in advance. Although there exist a few works in this area, they do not fully exploit the compression possibilities or consider alternative preprocessing variants for various compressors in the latter phase. In this paper, we discuss several aspects of dictionary-based compression, including compact dictionary representation, and present a PPM/BWCA-oriented scheme, the word replacing transformation, achieving compression ratios 2-6% higher than the state-of-the-art StarNT (2003) text preprocessor while working at greater speed. We also present an alternative scheme designed for LZ77 compressors, with the advantage over StarNT reaching up to 14% in combination with gzip.
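To make the idea concrete, here is a minimal Python sketch of word-replacing preprocessing ahead of a general-purpose compressor. The toy dictionary, the one-byte reference format (0x80 | index), and the use of zlib as the back end are illustrative assumptions, not the WRT or StarNT formats.

    # Sketch only: replace known words with short references, then compress.
    import zlib

    DICTIONARY = ["the", "of", "and", "to", "in", "is", "that", "for"]  # toy static dictionary
    WORD_TO_CODE = {w: i for i, w in enumerate(DICTIONARY)}

    def transform(text: str) -> bytes:
        out = bytearray()
        for token in text.split(" "):
            if token in WORD_TO_CODE:
                # A known word becomes a one-byte reference (assumed format).
                out.append(0x80 | WORD_TO_CODE[token])
            else:
                out.extend(token.encode("ascii"))
            out.append(ord(" "))
        return bytes(out)

    text = "the size of the data is large and the dictionary is small"
    plain = zlib.compress(text.encode("ascii"), 9)
    preprocessed = zlib.compress(transform(text), 9)
    print(len(plain), len(preprocessed))   # compare back-end output sizes

The point of the transform is that frequent words shrink to single high-bit bytes, giving the back-end compressor a shorter and more regular input.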
2013
Abstract: The size of data related to a wide range of applications is growing rapidly. Typically a character requires 1 byte of storage. Compression of the data is very important for data management. Data compression is the process by which the physical size of data is reduced to save memory or improve traffic speeds on a website. The purpose of data compression is to make a file smaller by minimizing the amount of data present. When a file is compressed, it can be reduced to as little as 25% of its original size, which makes it easier to send to others over the internet. Compressed files may take extensions such as zip, rar, ace, and BZ2. Compression is normally done using special compression software. It serves both to save storage space and to save transmission time. The aim of compression is to produce a new file, as short as possible, containing the compressed version of the same text. Grand challenges such as the human genome project involve very large distributed data...
International Journal of Computer Applications, 2011
Compression is used just about everywhere. Both good compression ratios and fast retrieval of data from large collections are important in today's era. We propose a pre-compression technique that can be applied to text files. The output of our technique can be further processed by standard compression techniques such as arithmetic coding and BZIP2, which yields a better compression ratio. The algorithm suggested here uses a dynamic dictionary created at run time and is also suitable for searching for phrases in the compressed file.
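As a rough illustration of the run-time dictionary idea (the paper's exact phrase selection and code format are not reproduced here), the following Python sketch builds the dictionary dynamically and hands the reduced stream to a standard back end:

    # Sketch: dynamic word dictionary built in one pass, then bz2 as back end.
    import bz2

    def precompress(text: str):
        dictionary = {}          # word -> index, built dynamically at run time
        indices = []
        for word in text.split():
            if word not in dictionary:
                dictionary[word] = len(dictionary)   # new phrase enters the dictionary
            indices.append(dictionary[word])
        return dictionary, indices

    dictionary, indices = precompress("a rose is a rose is a rose")
    payload = bytes(indices)                          # toy 1-byte codes; real codes vary
    print(indices, len(bz2.compress(payload)))

Because each distinct word maps to one index, the index stream can also be scanned for a phrase without decompressing the original text.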
Witasi, 2002
The growth of data interchange within computer networks, including the Internet, causes bandwidth requirements to increase. Instead of widening the bandwidth, the same effect may be obtained by limiting the amount of interchanged data through an application of data compression. There are several general compression methods, varying in speed and efficiency. In this paper a new method of data compression is described that combines elements of known methods to achieve better compression efficiency in less time. Performed tests show that this method achieves good results, especially for text data, so it could be used in network-related applications in order to limit the amount of interchanged data.
Information Processing Letters, 1996
2010
In this paper we use the ternary representation of numbers for compressing text data. We use a binary map for ternary digits and introduce a way to use the binary 11-pair, which has never been used for coding data before, and we further use a 4-digit ternary representation of the alphabet, with lowercase and uppercase letters and some extra symbols that are most commonly used in day-to-day life. We find a way to minimize the length of the bit string, which is only possible in the ternary representation, thus drastically reducing the length of the code. We also find some connections between this technique of coding data and the Fibonacci numbers.
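A hedged Python sketch of how such a scheme might look; the 4-digit ternary codewords, the 00/01/10 digit map, and the use of the 11-pair as a terminator are illustrative assumptions, since the paper's exact tables are not reproduced here:

    # Sketch: ternary digits mapped to bit pairs, with 11 used as a terminator.
    TERNARY_BITS = {0: "00", 1: "01", 2: "10"}
    TERMINATOR = "11"

    def to_ternary4(n: int) -> list:
        digits = []
        for _ in range(4):                 # 4 ternary digits cover 3**4 = 81 symbols
            digits.append(n % 3)
            n //= 3
        return digits[::-1]

    def encode(text: str) -> str:
        alphabet = sorted(set(text))       # toy alphabet; the paper fixes one in advance
        bits = ""
        for ch in text:
            code = alphabet.index(ch)
            bits += "".join(TERNARY_BITS[d] for d in to_ternary4(code))
        return bits + TERMINATOR

    print(encode("abba"))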
BIT, 1985
A new technique for compression of character strings is presented. The technique is based on the use of a dictionary forest which is built simultaneously with the encoding and decoding. Codes representing substrings are addresses in the dictionary forest. Experimental results show that the length of the text can be reduced by more than 50% with no a priori knowledge of the nature of the text.
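The abstract does not spell out the forest structure, but the closely related LZ78 trie conveys the flavour: the dictionary grows during encoding, and each emitted code is an address into it. A toy Python sketch for comparison, not the paper's algorithm:

    # Sketch of the LZ78 idea: a trie grown during encoding; codes are addresses.
    def lz78_encode(text: str):
        trie = {}                 # (node_id, symbol) -> node_id; node 0 is the empty root
        next_id = 1
        node = 0
        out = []
        for ch in text:
            if (node, ch) in trie:
                node = trie[(node, ch)]          # extend the current match
            else:
                trie[(node, ch)] = next_id       # grow the dictionary by one node
                out.append((node, ch))           # emit (address, new symbol)
                next_id += 1
                node = 0
        if node:
            out.append((node, ""))               # flush a pending match
        return out

    print(lz78_encode("abababab"))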
Computing Research Repository, 2010
For storing a word or a whole text segment, we need a huge storage space. Typically a character requires 1 byte of storage in memory. Compression of the memory is very important for data management. For compressing the memory requirement of text data, lossless compression is needed. We suggest a lossless memory-requirement compression method for text data. The proposed method compresses a text segment or text file using a two-level approach: first reduction, then compression. Reduction is done using a word lookup table rather than a traditional indexing system; the word lookup table is a part of the operating system, and the reduction is performed by the operating system. Under this method each word is replaced by an address value, which can quite effectively reduce the size of the persistent memory required for text data. Compression is then done using currently available compression methods. At the end of the first level, a binary file containing the addresses is generated. Since the proposed method does not use any compression algorithm in the first level, this file can be compressed using popular compression algorithms, finally providing a great deal of data compression on purely English text data.
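A minimal Python sketch of the two-level approach; the toy lookup table and the 2-byte address width are assumptions made for illustration (the paper places the table inside the operating system):

    # Sketch: level 1 replaces words with fixed-width addresses from a lookup
    # table; level 2 compresses the resulting address stream.
    import struct, zlib

    LOOKUP = {"data": 0, "compression": 1, "of": 2, "the": 3, "memory": 4}

    def reduce_text(words):
        # Level 1: reduction -- every word becomes a 2-byte address.
        return b"".join(struct.pack(">H", LOOKUP[w]) for w in words)

    addresses = reduce_text("compression of the data of the memory".split())
    compressed = zlib.compress(addresses, 9)     # level 2: any standard compressor
    print(len(addresses), len(compressed))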
International Journal of Engineering Research and, 2015
Dictionary Based Compression is a useful technique through which we can encode variable-length strings of symbols as single tokens. There are number of algorithms available for Dictionary Based Compression. It uses less computing resources so it is very effective compression technique. The purpose of this paper is to present and analyze a variety of dictionary based algorithms.
ACM Transactions on Information Systems, 2010
We address the problem of adaptive compression of natural language text, considering the case where the receiver is much less powerful than the sender, as in mobile applications. Our techniques achieve compression ratios around 32% and require very little effort from the receiver. Furthermore, the receiver is not only lighter, but it can also search the compressed text with less work than that necessary to decompress it. This is a novelty in two senses: it breaks the usual compressor/decompressor symmetry typical of adaptive schemes, and it contradicts the long-standing assumption that only semistatic codes could be searched more efficiently than the uncompressed text. Our novel compression methods are preferable in several aspects over the existing adaptive and semistatic compressors for natural language texts.
2000
Abstract: Greedy off-line textual substitution refers to the following approach to compression or structural inference. Given a long text string X, a substring W is identified such that replacing all instances of W in X except one by a suitable pair of pointers yields the highest possible contraction of X; the process is then repeated on the contracted text string until substrings capable of producing contractions can no longer be found.
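A toy Python rendering of the greedy rounds may help; the gain formula and the single-character pointer token are simplifying assumptions, not the paper's exact objective:

    # Sketch: each round finds the substring whose replacement contracts the
    # text most, keeps one copy, and replaces the others by a toy pointer "@".
    def compress_round(text: str, pointer_cost: int = 2):
        best_gain, best_w = 0, None
        for length in range(pointer_cost + 1, len(text) // 2 + 1):
            for i in range(len(text) - length + 1):
                w = text[i:i + length]
                count = text.count(w)                       # non-overlapping copies
                gain = (count - 1) * (length - pointer_cost)
                if gain > best_gain:
                    best_gain, best_w = gain, w
        if best_w is None:
            return text, False
        first = text.index(best_w) + len(best_w)            # keep the first copy
        return text[:first] + text[first:].replace(best_w, "@"), True

    text, changed = "abcabcabcabc", True
    while changed:
        text, changed = compress_round(text)
    print(text)   # contracted string with pointer tokens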
Abstract. ASCII characters have different code representations, each character having a distinct numeric value. The characters typically used in text-message communication have numeric codes that are close to one another, differing only slightly. These differences can be used as substitutes for the characters themselves, generating a new message of smaller size. This paper discusses exploiting the differences between the ASCII values of the characters in a message to obtain a much simpler substitution, using a dynamically sized window: the differences between the ASCII values within the window form the basis for determining the substituted bits in the compressed file.
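A rough sketch of the difference idea, assuming fixed-size windows and substitution by the offset from the window's first character (the paper uses a dynamically sized window, which is not reproduced here):

    # Sketch: replace characters inside each window by small differences from
    # the window's base value; small deltas need fewer bits than 8-bit ASCII.
    def delta_windows(text: str, window: int = 8):
        out = []
        for start in range(0, len(text), window):
            chunk = text[start:start + window]
            base = ord(chunk[0])
            out.append((base, [ord(c) - base for c in chunk]))
        return out

    for base, deltas in delta_windows("heliport"):
        print(base, deltas)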
2000
Most modern lossless data compression techniques used today are based on dictionaries. If some string of data being compressed matches a portion previously seen, then such a string is included in the dictionary and a reference to it is included every time it occurs. A possible generalization of this scheme is to consider not only strings made of consecutive symbols, but …
Communications of the ACM, 1986
A data compression scheme that exploits locality of reference, such as occurs when words are used frequently over short intervals and then fall into long periods of disuse, is described. The scheme is based on a simple heuristic for self-organizing sequential search and on variable-length encodings of integers. We prove that it never performs much worse than Huffman coding and can perform substantially better; experiments on real files show that its performance is usually quite close to that of Huffman coding. Our scheme has many implementation advantages: it is simple, allows fast encoding and decoding, and requires only one pass over the data to be compressed (static Huffman coding takes two passes).
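The scheme's two ingredients, a self-organizing move-to-front list and a variable-length integer code, are easy to sketch. The following Python toy uses Elias gamma codes and a new-word escape (index one past the list end, followed by the word); the paper's actual codes differ in detail:

    # Sketch: move-to-front over words plus a variable-length integer code,
    # so recently used words get short codes.
    def elias_gamma(n: int) -> str:          # n >= 1
        b = bin(n)[2:]
        return "0" * (len(b) - 1) + b

    def mtf_encode(words):
        recent = []                           # self-organizing list, front = most recent
        bits = ""
        for w in words:
            if w in recent:
                i = recent.index(w)
                bits += elias_gamma(i + 1)    # small index -> short code
                recent.pop(i)
            else:
                # New word: emit the first unused index, then the word itself
                # (a toy literal escape, mixing text into the bit string).
                bits += elias_gamma(len(recent) + 1) + w + " "
            recent.insert(0, w)               # move to front
        return bits

    print(mtf_encode("tom eats an apple an apple eats tom".split()))

Locality of reference is exactly what makes this work: a word reused soon after its last occurrence sits near the front of the list and costs only a few bits.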
2014 Data Compression Conference, 2014
List update is a key step during Burrows-Wheeler transform (BWT) compression. Previous work has shown that careful study of the list update step leads to better BWT compression. Surprisingly, the theoretical study of list update algorithms for compression has lagged behind their use in practice. To be more precise, the standard model by Sleator and Tarjan for list update considers a linear cost-of-access model, while compression incurs a logarithmic cost of access; i.e., accessing item i in the list has cost Θ(i) in the standard model but Θ(log i) in compression applications. These models have been shown, in general, not to be equivalent. This paper has two contributions:
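A small numeric illustration makes the gap between the two models concrete; treating the cost of encoding position i as exactly 1 + floor(log2 i) bits is a simplifying stand-in for the Θ(log i) cost:

    # Sketch: the same access sequence priced under both cost models.
    import math

    def costs(positions):
        # Standard Sleator-Tarjan model: accessing position i costs i.
        linear = sum(positions)
        # Compression model: encoding position i takes about 1 + floor(log2 i) bits.
        logarithmic = sum(1 + int(math.log2(i)) for i in positions)
        return linear, logarithmic

    print(costs([1, 1, 2, 8, 64]))   # (76, 15): deep accesses dominate one model,
                                     # barely register in the other

Because a single deep access is ruinous in the linear model but cheap in the logarithmic one, a list update strategy that is optimal for one model need not be optimal for the other.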
2004
A new notion, that of semi-lossless text compression, is introduced, and its applicability in various settings is investigated. First results suggest that it might be hard to exploit the additional redundancy of English texts, but the new methods could be useful in applications where the correct spelling is not important, such as in short emails, and the new notion raises some interesting research problems in several different areas of Computer Science.
Journal of Discrete Algorithms, 2013
Dictionary-based compression schemes are the most commonly used data compression schemes since they appeared in the foundational paper of Ziv and Lempel in 1977, and they are generally referred to as LZ77. Their work is the basis of Zip, gzip, 7-Zip and many other compression software utilities. Some of these compression schemes use variants of the greedy approach to parse the text into dictionary phrases; others have left the greedy approach to improve the compression ratio. Recently, two bit-optimal parsing algorithms have been presented, filling the gap between theory and best practice. We present a survey on the parsing problem for dictionary-based text compression, identifying notable results of both a theoretical and practical nature which have appeared in the last three decades. We follow the historical steps of the Zip scheme, showing how the original optimal parsing problem of finding a parse formed by the minimum number of phrases has been replaced by the bit-optimal parsing problem, where the goal is to minimise the length in bits of the encoded text.
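For reference, here is a naive Python sketch of greedy LZ77-style parsing; window management, overlapping copies, and actual code lengths are all simplified away, and a bit-optimal parser would instead choose phrases to minimise the encoded length in bits:

    # Sketch: at each position take the longest match into the seen text and
    # emit it as a (distance, length) phrase; otherwise emit a literal.
    def greedy_parse(text: str, min_len: int = 2):
        i, phrases = 0, []
        while i < len(text):
            best_len, best_dist = 0, 0
            for j in range(i):                         # naive O(n^2) match search
                k = 0
                while i + k < len(text) and text[j + k] == text[i + k] and j + k < i:
                    k += 1
                if k > best_len:
                    best_len, best_dist = k, i - j
            if best_len >= min_len:
                phrases.append(("copy", best_dist, best_len))
                i += best_len
            else:
                phrases.append(("literal", text[i]))
                i += 1
        return phrases

    print(greedy_parse("abcabcabcd"))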
Information Sciences, 1996
We present a methodology for on-line variable-length binary encoding of a dynamically growing set of integers. Our encoding maintains the prefix property that enables unique decoding of a string of integers from the set. In order to develop the formalism of this on-line binary encoding, we define a unique binary tree data structure called the "phase in binary tree." To show the utility of this on-line variable-length binary encoding, we apply this methodology to encode the pointers generated by the LZW algorithm. The experimental results obtained illustrate the superior performance of our algorithm compared to the most widely used algorithms. This on-line variable-length binary encoding can be applied in other dictionary-based text compression schemes as well to effectively encode the output pointers and enhance the compression ratio.

… addresses in the dictionary. Most of the adaptive dictionary-based text compression algorithms belong to a family of algorithms originated by Ziv and Lempel, popularly known as LZ coding. The basic concept of all the LZ coding algorithms is to replace substrings (called phrases or words) with a pointer to where they have occurred earlier in the text. Different variations of these algorithms, which differ in the way the dictionary is referenced and how it is mapped onto a set of codes, have been described in the literature. This mapping is a code function representing a pointer, used to generate the compressed code. The maximum allowable number of bits in the pointer usually determines the size of the dictionary. When fixed-size pointers are used, the same number of bits is transmitted for each pointer, irrespective of the number of phrases in the dictionary at any stage. This affects the compression performance at the beginning, when more than half of the dictionary is empty. The most popular variation in this family is the LZW algorithm. Several variations and enhancements of LZW have also been proposed in the literature. A popular variant of LZW is the implementation of the compress command available in UNIX, which is known as LZC. In LZC, the pointers (string numbers) are output in binary. The number of bits used to represent the pointer at any step varies according to the number of phrases (say M) currently contained in the dictionary. For example, when the value of M is in the range 256 to 511, each string number can be represented using a 9-bit binary number, and when M is in the range 512 to 1023, each string number will be represented as a 10-bit binary number, and so on. This technique is suboptimal because a fractional number of bits is still wasted unless M is a power of 2. For example, when M is 513, it is possible to encode the first 512 string numbers in the dictionary (0 through 511) using 10 bits, while 512 and 513 can still be represented as 2-bit binary numbers "10" and "11," respectively, without violating the prefix property, because the first 512 binary codes will start with the bit 0 and the last two codes start with the bit 1. In this paper, we will describe a general methodology for on-line variable-length binary encoding of a set of integers which is growing dynamically. We will define a unique binary tree, called the "phase in binary tree," in order to formulate this on-line variable-length binary encoding method. This binary encoding can be applied to encode the pointers generated in any dictionary-based text compression scheme.
A particular case of this encoding is then applied to the LZW codes to show the effectiveness of the proposed scheme; we call this variation the "LZWAJ" scheme. The proposed variable-length binary encoding can be applied to other LZ encoding schemes as well.
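The worked example above (M = 513) suggests a simple recursive construction; the following Python sketch is an assumption that merely reproduces that example, not the paper's phase-in-binary-tree formalism:

    # Sketch: a prefix-free variable-length code for {0, ..., n-1}. A full
    # power-of-two block gets fixed-width codes under a leading 0, and the
    # leftover integers recurse under a leading 1.
    def code(value: int, n: int) -> str:
        if n == 1:
            return ""
        k = 1 << (n.bit_length() - 1)          # largest power of two <= n
        if k == n:                              # full block: plain fixed-width binary
            return format(value, f"0{k.bit_length() - 1}b")
        if value < k:
            return "0" + format(value, f"0{k.bit_length() - 1}b")
        return "1" + code(value - k, n - k)

    # The text's example: string numbers 0..513 (514 entries). Numbers 0..511
    # get 10-bit codes starting with 0; 512 and 513 become "10" and "11".
    print(len(code(511, 514)), code(512, 514), code(513, 514))   # 10 '10' '11'

No fractional bits are wasted in the sense the text describes: the two codes beyond the power-of-two boundary are as short as the prefix property allows.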