Academia.edu

Minimum description length

Minimum description length (MDL) is a principle in information theory and statistics that seeks to find the best model for a given dataset by minimizing the total length of the description of the model and the data given the model. It balances model complexity and goodness of fit.
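As a rough illustration of the definition above, the following minimal sketch compares two candidate models for a binary sequence by their total code length, L(model) + L(data | model). The function names, the choice of candidate models, and the 0.5 * log2(n) parameter cost are assumptions made for this example (a common crude two-part approximation), not a method taken from any of the papers listed here. The input is assumed to be a non-empty list of 0s and 1s.

```python
import math


def fair_coin_code_length(bits):
    """Code length in bits under a fixed fair-coin model (no parameters to encode)."""
    return float(len(bits))


def bernoulli_code_length(bits):
    """Two-part code length in bits for a Bernoulli model with a fitted parameter."""
    n = len(bits)
    ones = sum(bits)
    # L(model): about 0.5 * log2(n) bits to state one parameter (assumed approximation).
    model_bits = 0.5 * math.log2(n)
    # L(data | model): negative log2-likelihood at the maximum-likelihood parameter.
    data_bits = 0.0
    for count in (ones, n - ones):
        if count > 0:
            data_bits -= count * math.log2(count / n)
    return model_bits + data_bits


def select_model(bits):
    """Return the name of the model with the shortest total description, plus all lengths."""
    lengths = {
        "fair_coin": fair_coin_code_length(bits),
        "bernoulli": bernoulli_code_length(bits),
    }
    return min(lengths, key=lengths.get), lengths


skewed = [1] * 90 + [0] * 10    # strongly biased data
balanced = [0, 1] * 50          # data that looks like a fair coin

print(select_model(skewed)[0])
print(select_model(balanced)[0])
```

On the skewed sequence the extra parameter pays for itself because the data compresses well given the fitted model. On the balanced sequence the simpler fixed model wins, which is the balance between complexity and fit that the definition describes.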
We propose an unprecedented approach to post-hoc interpretable machine learning. Facing a complex phenomenon, rather than fully capturing its mechanisms through a universal learner, albeit structured in modular building blocks, we train a... more
In this paper, we propose a novel information-criteria-based approach to selecting the dimensionality of the word2vec Skip-gram (SG). From the perspective of probability theory, SG is considered as an implicit probability distribution... more
Curve evolution implementations [3][17][18] of the Mumford-Shah functional are of broad interest in image segmentation. These implementations, however, have initialization problems. A mathematical analysis of the initialization problem... more
Continuation processes in chemical and/or biotechnical plants always generate a large amount of time series data. However, since conventional process models are described as a set of control models, it is difficult to explain the... more
Junctions are important features for image analysis and form a critical aspect of image understanding tasks such as object recognition. We present a unified approach to detecting (location of the center of the junction), classifying (by... more
No statistical model is right or wrong, true or false in a strict sense. We only evaluate and compare their contributions. Based on this theme, Jorma Rissanen has written a short but beautiful book titled "Information and Complexity in... more
The contribution of this work is the creation and development of a novel procedure for signal estimation, the LDM-G procedure. The procedure will be developed in detail and will consist of the implementation of algorithms for signal estimation... more
We present a method for detecting repeated structures, which is applied on facade images for describing the regularity of their windows. Our approach finds and explicitly represents repetitive structures and thus gives initial... more
Some model-selection criteria for choosing among a set of alternative models are reviewed. Particular model-selection problems considered here include the choice of a regression equation for prediction, choice of the number of bins for a... more
Because many databases contain or can be embellished with structural information, a method for identifying interesting and repetitive substructures is an essential component to discovering knowledge in such databases. This paper describes... more
The ability to identify interesting and repetitive substructures is an essential component to discovering knowledge in structural data. We describe a new version of our SUBDUE substructure discovery system based on the minimum description... more
Scene geometry can be inferred from point correspondences between two images. The inference process includes the selection of a model. Four models are considered: background (or null), collineation, affine fundamental matrix and... more
This work proposes a method for estimating damage in structures by nonlinear signal separation, based on the wavelet transform. The results obtained are superior in damage-estimation capability to those reported in the literature. By system... more
Accelerograms are random time-series representations of the strong ground motion produced by earthquakes. Accelerograms are processed before they are used in any seismic or engineering application. This article focuses on the impact of... more
Combining (i) a statistical interpretation of the minimum of a Weighted Least Squares cost function and (ii) the principle of parsimony, a model selection strategy is proposed. First, it is compared via simulation to model selection... more
An approach is presented to automatically segment and label a continuous observation sequence of hand gestures for a complete unsupervised model acquisition. The method is based on the assumption that gestures can be viewed as repetitive... more
By a "covering" we mean a Gaussian mixture model fit to observed data. Approximations of the Bayes factor can be availed of to judge model fit to the data within a given Gaussian mixture model. Between families of Gaussian mixture models,... more
We study the problem of estimating the overall mutual information in M independent parallel discrete-time memory-less Gaussian channels from N independent data sample pairs per channel (inputs and outputs). We focus on the case where the... more
In this work, we propose a Compression Rate Distance, a new distance measure for time series data. The main idea behind this distance is based on the Minimum Description Length (MDL) principle. The higher compression rate between two time... more
3D building modeling: kinetic registration with variable topology of polyhedral roofs and automatic reconstruction of roof superstructures
Wavelet transforms enable us to represent signals with a high degree of sparsity. This is the principle behind a non-linear wavelet based signal estimation technique known as wavelet denoising. In this report we explore wavelet denoising... more
The design of an architecture for a family of products is a challenging task. Traditionally, only a single product case is considered when optimization techniques are used to define the optimum clustering of the components, or in other... more
We present an approach to automatically segment and label a continuous observation sequence of hand gestures for a complete unsupervised model acquisition. The method is based on the assumption that gestures can be viewed as repetitive... more
We consider the perturbed harmonic oscillator $T_D \psi = -\psi'' + x^2 \psi + q(x)\psi$, $\psi(0) = 0$ in $L^2(\mathbb{R}_+)$, where $q \in H_+ = \{q', xq \in L^2(\mathbb{R}_+)\}$ is a real-valued potential. We prove that the mapping $q \mapsto$ spectral data $= \{\text{eigenvalues of } T_D\} \oplus$... more
We consider a new class of information sources called word-valued sources in order to investigate coding algorithms based upon string parsing. A word-valued source is defined as a pair of an independent and identically distributed (i.i.d.)... more
by H. Te
Universal coding for the Slepian-Wolf data compression system is considered. We shall demonstrate based on a simple observation that the error exponent given by Csiszár and Körner for the universal coding system can strictly be sharpened... more
There exists a substantial problem in obtaining good generalisation performance in the application of artificial neural network technology where training data is limited. Generalisation ability is analysed for a number of computational... more
The problem of fitting a model composed of a number of superimposed signals to noisy data using the maximum likelihood criterion is considered. It is shown, using the Cramér-Rao bound for the estimation accuracy, that in many instances... more
We develop a code length principle which is invariant to the choice of parameterization on the model distributions. An invariant approximation formula for easy computation of the marginal distribution is provided for Gaussian likelihood... more
Standard system identification methods often provide biased estimates with closed-loop data. With the prediction error method (PEM), the bias issue is solved by using a noise model that is flexible enough to capture the noise spectrum.... more
Over the years, ensemble methods have become a staple of machine learning. Similarly, generalized linear models (GLMs) have become very popular for a wide variety of statistical inference tasks. The former have been shown to enhance... more
Pattern mining based on data compression has been successfully applied in many data mining tasks. For itemset data, the Krimp algorithm based on the minimum description length (MDL) principle was shown to be very effective in solving the... more
We propose a streaming algorithm, based on the minimal description length (MDL) principle, for extracting non-redundant sequential patterns. For static databases, the MDL-based approach that selects patterns based on their capacity to... more
A deep-learning-based approach to estimating the number of coherent sources in radar is presented. A proper estimate of the number of sources in a signal enables improved angle-of-arrival (AoA) estimation common in applications such as... more
Nonlinear time series modeling with a multilayer perceptron network is presented. An important aspect of this modeling is the model selection, i.e., the problem of determining the size as well as the complexity of the model. To overcome... more
This paper presents a model-selection strategy based on minimum description length (MDL) that keeps the kernel least-mean-square (KLMS) model tuned to the complexity of the input data. The proposed KLMS-MDL filter adapts its model order... more
Reference assisted assembly requires the use of a reference sequence, as a model, to assist in the assembly of the novel genome. The standard method for identifying the best reference sequence for the assembly of a novel genome aims at... more
Bioinformatics skills required for genome sequencing often represent a significant hurdle for many researchers working in computational biology. This dissertation highlights the significance of genome assembly as a research area, focuses... more
A novel genetic algorithm (GA) using minimal representation size cluster (MRSC) analysis is designed and implemented for solving multimodal function optimization problems. The problem of multimodal function optimization is framed within a... more
XML is rapidly emerging as the new standard for data representation and exchange on the Web. An XML document can be accompanied by a Document Type Descriptor (DTD) which plays the role of a schema for an XML data collection. DTDs contain... more