Lossless DNA Compression

2020

Abstract

In this dissertation, our goal is to explore lossless DNA compression algorithms, experiment with predictive probabilistic models for DNA sequences, and perform a comprehensive analysis and comparison among different schemes. We omit lossy compressors as well as reference-based and pre-trained models to ensure that the algorithms are general and can also compress non-DNA files. The methods in this dissertation are based on techniques that cleanly separate the compression task into probabilistic models (predictors) and arithmetic coders (compressors). Existing methods based on arithmetic coding have become state-of-the-art in lossless compression in many domains. As a novel contribution, we implement probabilistic models ranging from traditional machine learning models, such as Random Forests, to deep neural network models which are not conventionally used with arithmetic coding. For each model, we supply a concise and robust mathematical formalisation. As background research, we ran...
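
To make the separation between predictor and coder concrete, the following is a minimal sketch and is not taken from the dissertation. It assumes a four-letter alphabet and a hypothetical adaptive order-k context model with add-one smoothing, and it computes the ideal code length, the sum of -log2 p(symbol | context), that an arithmetic coder would approach to within a few bits when driven by that model. The function name `ideal_code_length_bits` and the `order` parameter are illustrative assumptions.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: input contains only these four symbols


def ideal_code_length_bits(seq: str, order: int = 2) -> float:
    """Return the ideal code length of `seq` in bits under an adaptive
    order-`order` context model with add-one (Laplace) smoothing.

    The model plays the role of the predictor; an arithmetic coder fed
    these probabilities would emit roughly this many bits.
    """
    # One count table per context; every symbol starts at 1 so that no
    # symbol is ever assigned probability zero.
    counts = defaultdict(lambda: {s: 1 for s in ALPHABET})
    total_bits = 0.0
    for i, sym in enumerate(seq):
        ctx = seq[max(0, i - order):i]          # up to `order` preceding symbols
        ctx_counts = counts[ctx]
        p = ctx_counts[sym] / sum(ctx_counts.values())
        total_bits += -math.log2(p)             # cost of coding `sym`
        ctx_counts[sym] += 1                    # adapt only after coding
    return total_bits


if __name__ == "__main__":
    seq = "ACGT" * 50
    bits = ideal_code_length_bits(seq, order=2)
    print(f"{bits / len(seq):.3f} bits per base (2.000 for a uniform model)")
```

Because the counts are updated only after each symbol is coded, a decoder can rebuild the same probabilities from the symbols it has already decoded, which is why no model needs to be transmitted. Swapping in a different predictor changes only how `p` is computed; the coding step is unaffected.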