2020
In this dissertation, our goal is to explore lossless DNA compression algorithms, experiment with predictive probabilistic models for DNA sequences, and perform a comprehensive analysis and comparison among different schemes. We omit lossy compressors as well as reference-based and pre-trained models to ensure that the algorithms are general and can also compress non-DNA files. The methods in this dissertation are based on techniques that cleanly separate the compression task into probabilistic models (predictors) and arithmetic coders (compressors). Existing methods based on arithmetic coding have become state-of-the-art in lossless compression in many domains. As a novel contribution, we implement probabilistic models ranging from traditional machine learning models, such as Random Forests, to deep neural network models which are not conventionally used with arithmetic coding. For each model, we supply a concise and robust mathematical formalisation. As background research, we ran...
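The predictor/coder separation described in this abstract can be illustrated with a minimal sketch (an order-0 illustration of the idea, not the dissertation's actual models): an adaptive frequency model supplies a probability for each symbol, and an arithmetic coder's output length approaches the sum of -log2 p over the sequence, so tallying that sum estimates the compressed size.

```python
import math
from collections import Counter

def adaptive_code_length(seq: str) -> float:
    """Total ideal code length (bits) when an adaptive order-0
    frequency model feeds an arithmetic coder: the coder's output
    approaches sum(-log2 p), so we tally that directly."""
    counts = Counter()          # symbol -> count seen so far
    alphabet = set(seq)
    total_bits = 0.0
    for sym in seq:
        # Laplace-smoothed predictive probability before seeing `sym`
        p = (counts[sym] + 1) / (sum(counts.values()) + len(alphabet))
        total_bits += -math.log2(p)
        counts[sym] += 1        # update the model after coding
    return total_bits
```

On a uniform four-symbol sequence this converges to about 2 bits per base, while highly repetitive input costs far less, which is exactly the gap a good predictor exploits.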
Entropy, 2019
The development of efficient data compressors for DNA sequences is crucial not only for reducing the storage and the bandwidth for transmission, but also for analysis purposes. In particular, the development of improved compression models directly influences the outcome of anthropological and biomedical compression-based methods. In this paper, we describe a new lossless compressor with improved compression capabilities for DNA sequences representing different domains and kingdoms. The reference-free method uses a competitive prediction model to estimate, for each symbol, the best class of models to be used before applying arithmetic encoding. There are two classes of models: weighted context models (including substitutional tolerant context models) and weighted stochastic repeat models. Both classes of models use specific sub-programs to handle inverted repeats efficiently. The results show that the proposed method attains a higher compression ratio than state-of-the-art approaches, on a balanced and diverse benchmark, using a competitive level of computational resources. An efficient implementation of the method is publicly available, under the GPLv3 license.
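The weighted-model competition described above can be sketched generically (a hypothetical two-model mixing step with a simple performance-based weight update; this is not GeCo2's actual update rule):

```python
def mix_step(preds, weights, observed, gamma=0.95):
    """One step of weighted model mixing: blend the models' predictive
    distributions, then re-weight each model by how well it predicted
    the observed symbol.  `gamma` acts as a forgetting factor."""
    mixed = {s: sum(w * p[s] for w, p in zip(weights, preds))
             for s in preds[0]}
    new_w = [w * p[observed] ** gamma for w, p in zip(weights, preds)]
    z = sum(new_w)              # renormalize so the weights stay a distribution
    return mixed, [w / z for w in new_w]
```

After each symbol the model that assigned it higher probability gains weight, so the mixture tracks whichever model class currently fits the sequence region best.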
GigaScience, 2020
Background The increasing production of genomic data has led to an intensified need for models that can cope efficiently with the lossless compression of DNA sequences. Important applications include long-term storage and compression-based data analysis. In the literature, only a few recent articles propose the use of neural networks for DNA sequence compression. However, they fall short when compared with specific DNA compression tools, such as GeCo2. This limitation is due to the absence of models specifically designed for DNA sequences. In this work, we combine the power of neural networks with specific DNA models. For this purpose, we created GeCo3, a new genomic sequence compressor that uses neural networks for mixing multiple context and substitution-tolerant context models. Findings We benchmark GeCo3 as a reference-free DNA compressor in 5 datasets, including a balanced and comprehensive dataset of DNA sequences, the Y-chromosome and human mitogenome, 2 compilations of archa...
2015
This thesis makes several contributions to the field of data compression. Lossless data compression algorithms shorten the description of input objects, such as sequences of text, in a way that allows perfect recovery of the original object. Such algorithms exploit the fact that input objects are not uniformly distributed: by allocating shorter descriptions to more probable objects and longer descriptions to less probable objects, the expected length of the compressed output can be made shorter than the object’s original description. Compression algorithms can be designed to match almost any given probability distribution over input objects. This thesis employs probabilistic modelling, Bayesian inference, and arithmetic coding to derive compression algorithms for a variety of applications, making the underlying probability distributions explicit throughout. A general compression toolbox is described, consisting of practical algorithms for compressing data distributed by various fund...
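The core principle stated above, shorter descriptions for more probable objects, can be made concrete with a small comparison of a fixed-length code against the entropy bound:

```python
import math

def expected_code_lengths(p: dict) -> tuple:
    """Compare a fixed-length code with an entropy-matched code.
    Assigning about -log2 p(x) bits to symbol x makes the expected
    length equal the entropy, which a fixed-length code can only
    match when p is uniform."""
    fixed = math.ceil(math.log2(len(p)))                  # bits/symbol, fixed code
    entropy = -sum(q * math.log2(q) for q in p.values())  # optimal expected bits
    return fixed, entropy
```

For a skewed four-symbol distribution such as (0.7, 0.1, 0.1, 0.1), the entropy is about 1.36 bits per symbol versus 2 bits for the fixed code, which is the saving an arithmetic coder realizes.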
DNA Lossless Compression Algorithms: Review, 2013
Modern DNA sequencing instruments are able to generate huge amounts of genomic data. These volumes of data require effective storage, fast transmission, quick access to any record, and superior functionality. Data storage accounts for an appreciable proportion of the total cost of creating and analysing DNA sequences. Although the tremendous increase in disk storage capacity makes the growth of DNA sequence data somewhat manageable, standard compression techniques fail to compress these sequences effectively. Recently, new algorithms have been introduced specifically for this purpose. In this paper, we comparatively survey the main ideas and results of lossless compression algorithms that have been developed for DNA sequences.
Proceedings of the 22nd European Signal Processing Conference (EUSIPCO), 2014
The pressure to find efficient genomic compression algorithms is being felt worldwide, as proved by several prizes and competitions. In this paper, we propose a compression algorithm that relies on a pre-analysis of the data before compression, with the aim of identifying regions of low complexity. This strategy enables us to use deeper context models, supported by hash-tables, without requiring huge amounts of memory. As an example, context depths as large as 32 are attainable for alphabets of four symbols, as is the case of genomic sequences. These deeper context models show very high compression capabilities in very repetitive genomic sequences, yielding improvements over previous algorithms. Furthermore, this method is universal, in the sense that it can be used in any type of textual data (such as quality-scores).
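The hash-table trick for deep contexts over a four-symbol alphabet can be sketched as follows (2 bits per base, so a depth-32 context packs into a 64-bit key; illustrative code, not the authors' implementation):

```python
from collections import defaultdict

BASE2BITS = {"A": 0, "C": 1, "G": 2, "T": 3}

def context_keys(seq: str, depth: int):
    """Pack each length-`depth` context into an integer key (2 bits per
    base), so a depth-32 context fits in 64 bits and can index a hash
    table instead of a dense 4**32-entry array."""
    mask = (1 << (2 * depth)) - 1
    key = 0
    table = defaultdict(lambda: [0, 0, 0, 0])  # context -> next-base counts
    for i, base in enumerate(seq):
        if i >= depth:
            table[key][BASE2BITS[base]] += 1   # count next symbol in this context
        key = ((key << 2) | BASE2BITS[base]) & mask  # slide the context window
    return table
```

In a repetitive sequence, only a tiny fraction of the 4^depth possible contexts ever occurs, so the hash table stays small even at depth 32.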
Signal Processing …, 2007
Based on the normalized maximum likelihood model.
Data Compression …, 2003
We discuss how to use the normalized maximum likelihood (NML) model for encoding sequences known to have regularities in the form of approximate repetitions. We present a particular version of the NML model for discrete regression, which is shown to provide a very powerful yet simple model for encoding the approximate repeats in DNA sequences. Combining the model of repeats with a simple first order Markov model we obtain a fast lossless compression method, which compares favorably with the existing DNA compression programs. It is remarkable that a simple model, which recursively updates a small number of parameters, is able to reach the state of the art compression ratio for DNA sequences obtained with much more complex models. Being a minimum description length (MDL) model, the NML model may later prove to be useful in studying global and local features of DNA or possibly of other biological sequences.
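The first-order Markov component can be sketched on its own (Laplace smoothing is an assumption here, and the NML repeat model is omitted):

```python
import math
from collections import defaultdict

def markov1_bits(seq: str) -> float:
    """Ideal code length of `seq` under an adaptive first-order Markov
    model with Laplace smoothing over the A/C/G/T alphabet, updating a
    small number of counts per step as in the abstract above."""
    counts = defaultdict(lambda: {b: 1 for b in "ACGT"})  # Laplace priors
    bits = 0.0
    for prev, cur in zip(seq, seq[1:]):
        row = counts[prev]
        bits += -math.log2(row[cur] / sum(row.values()))
        row[cur] += 1                                     # adapt after coding
    return bits
```

A run of identical bases is coded in well under one bit per symbol once the counts adapt, whereas transitions the model has never seen cost the full 2 bits.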
International Journal of Computer Applications, 2018
One of the most difficult challenges in lossless data compression is finding the right model for Deoxyribonucleic Acid (DNA) compression. DNA sequences comprise four chemical bases: adenine (A), guanine (G), cytosine (C), and thymine (T), and the information in DNA is stored as a code made up of these four bases. These sequences are not random; if they were, storing each base in two bits would be the most efficient encoding. This paper proposes an algorithm called A2 for DNA data compression. The proposed algorithm consists of four stages that build a substitutional model. The first stage uses a modified version of run-length coding; the second and third stages map and format the data for the final stage, which feeds it into the Burrows-Wheeler Transform, a permutation technique that groups related symbols together to improve dictionary coding with Lempel-Ziv (LZ77); the output file is stored with an (.a2) extension. The A2 algorithm was implemented and tested on data from GenBank and shows acceptable file-size and processing-time ratios.
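The Burrows-Wheeler stage can be illustrated with a naive transform (sorting all rotations; production implementations use suffix arrays, and this sketch is not the A2 code):

```python
def bwt(s: str) -> str:
    """Naive Burrows-Wheeler transform: append a unique end marker,
    sort all rotations, and take the last column.  The result groups
    symbols with similar right-context, which helps downstream
    run-length and LZ77 stages."""
    s = s + "$"                                   # unique, lexicographically smallest marker
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)
```

The classic example: the transform of "banana" is "annb$aa", where the repeated 'a's and 'n's have been pulled together, making the string far friendlier to run-length and dictionary coding.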
arXiv (Cornell University), 2023
Lossy compressors are increasingly adopted in scientific research, tackling volumes of data from experiments or parallel numerical simulations and facilitating data storage and movement. In contrast with the notion of entropy in lossless compression, no theoretical or data-based quantification of lossy compressibility exists for scientific data, and users rely on trial and error to assess lossy compression performance. As a strong data-driven effort toward quantifying the lossy compressibility of scientific datasets, we provide a statistical framework to predict the compression ratios of lossy compressors. Our method is a two-step framework in which (i) compressor-agnostic predictors are computed and (ii) statistical prediction models relying on these predictors are trained on observed compression ratios. The proposed predictors exploit spatial correlations and notions of entropy and lossiness via the quantized entropy. We study 8+ compressors on 6 scientific datasets and achieve a median percentage prediction error below 12%, which is substantially smaller than that of other methods, while achieving at least an 8.8× speedup when searching for a specific compression ratio and a 7.8× speedup when determining the best compressor from a collection.
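The quantized-entropy idea can be approximated in a few lines (a simplified stand-in for the paper's predictors; the `step` parameter here is a hypothetical analogue of the compressor's error bound):

```python
import math
from collections import Counter

def quantized_entropy(values, step: float) -> float:
    """Shannon entropy (bits/value) of the data after uniform
    quantization with bin width `step`: a compressor-agnostic proxy
    for lossy compressibility, since lower post-quantization entropy
    suggests higher achievable compression ratios."""
    bins = Counter(round(v / step) for v in values)   # assign each value to a bin
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in bins.values())
```

Coarser quantization (a larger error bound) collapses more values into the same bin, driving the entropy, and hence the predicted compressed size, toward zero.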