Papers by Robert Vandermeulen

arXiv: Machine Learning, Feb 23, 2015
Finite mixture models are statistical models which appear in many problems in statistics and machine learning. In such models it is assumed that data are drawn from random probability measures, called mixture components, which are themselves drawn from a probability measure P over probability measures. When estimating mixture models, it is common to make assumptions on the mixture components, such as parametric assumptions. In this paper, we make no assumption on the mixture components, and instead assume that observations from the mixture model are grouped, such that observations in the same group are known to be drawn from the same component. We show that any mixture of m probability measures can be uniquely identified provided there are 2m − 1 observations per group. Moreover we show that, for any m, there exists a mixture of m probability measures that cannot be uniquely identified when groups have 2m − 2 observations. Our results hold for any sample space with more than one element.
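
The m = 2 case of these bounds can be checked numerically. Below is a minimal sketch (not from the paper; all constants are arbitrary) that constructs two distinct two-component Bernoulli mixtures whose groups of size 2m − 2 = 2 are identically distributed, while groups of size 2m − 1 = 3 distinguish them:

```python
import numpy as np
from scipy.optimize import brentq

def moments(pi, p1, p2, k):
    # E[X_1 * ... * X_j] for j = 1..k; for exchangeable Bernoulli groups these
    # moments determine the distribution of a group of size k.
    return [pi * p1**j + (1 - pi) * p2**j for j in range(1, k + 1)]

m1, m2, m3 = moments(0.5, 0.3, 0.7, 3)           # reference mixture

# Search for a *different* mixture matching only the first two moments.
q1 = 0.2                                          # fix an alternative p1
f = lambda pi: pi * q1**2 + (m1 - pi * q1)**2 / (1 - pi) - m2
pi_alt = brentq(f, 0.01, 0.49)
q2 = (m1 - pi_alt * q1) / (1 - pi_alt)

print(moments(0.5, 0.3, 0.7, 3))      # [0.5, 0.29, 0.185]
print(moments(pi_alt, q1, q2, 3))     # same first two moments, different third
```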

The construction and theoretical analysis of the most popular universally consistent nonparametric density estimators hinge on one functional property: smoothness. In this paper we investigate the theoretical implications of incorporating a multi-view latent variable model, a type of low-rank model, into nonparametric density estimation. To do this we perform extensive analysis on histogram-style estimators that integrate a multi-view model. Our analysis culminates in showing that there exists a universally consistent histogram-style estimator that converges to any multi-view model with a finite number of Lipschitz continuous components at a rate of Õ(1/n^(1/3)) in L1 error, compared to the standard histogram estimator, which can converge at a rate slower than 1/n^(1/d) on the same class of densities. We also introduce a new nonparametric latent variable model based on the Tucker decomposition. A rudimentary implementation of the ideas in our paper experimentally demonstrates considerable...
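
As a toy illustration of the low-rank idea (a sketch, not the estimator analyzed in the paper): fit an ordinary 2-D histogram, then replace the bin-probability matrix with a rank-r nonnegative factorization. The data, bin count, and rank below are arbitrary placeholders.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
data = rng.normal(size=(5000, 2))                 # placeholder 2-D sample

b, r = 32, 4                                      # bins per axis, latent rank
H, _, _ = np.histogram2d(data[:, 0], data[:, 1], bins=b)
H /= H.sum()                                      # bin-probability matrix

# Rank-r nonnegative factorization of the histogram; in d dimensions the
# analogous step uses a nonnegative tensor decomposition.
nmf = NMF(n_components=r, init="nndsvda", max_iter=500)
W = nmf.fit_transform(H)
low_rank = W @ nmf.components_
low_rank /= low_rank.sum()                        # renormalized density estimate
```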

When estimating finite mixture models, it is common to make assumptions on the mixture components, such as parametric assumptions. In this work, we make no distributional assumptions on the mixture components and instead assume that observations from the mixture model are grouped, such that observations in the same group are known to be drawn from the same mixture component. We precisely characterize the number of observations n per group needed for the mixture model to be identifiable, as a function of the number m of mixture components. In addition to our assumption-free analysis, we also study the settings where the mixture components are either linearly independent or jointly irreducible. Furthermore, our analysis considers two kinds of identifiability: where the mixture model is the simplest one explaining the data, and where it is the only one. As an application of these results, we precisely characterize identifiability of multinomial mixture models. Our analysis relies on ...
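
To make the grouped-observation setting concrete in the multinomial case (an illustration with arbitrary values and the group size fixed to n = 3): the distribution over a group is the symmetric tensor mixing the n-fold products of the component distributions, which is why tensor-decomposition tools enter this kind of analysis.

```python
import numpy as np

pi = np.array([0.4, 0.6])            # mixing weights (m = 2 components)
P = np.array([[0.1, 0.6, 0.3],       # component 1 over k = 3 outcomes
              [0.5, 0.2, 0.3]])      # component 2

# Distribution over groups of size n = 3: sum_i pi_i * (p_i outer p_i outer p_i)
T = sum(w * np.einsum("i,j,k->ijk", p, p, p) for w, p in zip(pi, P))
print(T.sum())                       # 1.0: a probability tensor over triples
```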

Deep one-class classification variants for anomaly detection learn a mapping that concentrates nominal samples in feature space, causing anomalies to be mapped away. Because this transformation is highly non-linear, finding interpretations poses a significant challenge. In this paper we present an explainable deep one-class classification method, Fully Convolutional Data Description (FCDD), where the mapped samples are themselves also an explanation heatmap. FCDD yields competitive detection performance and provides reasonable explanations on common anomaly detection benchmarks with CIFAR-10 and ImageNet. On MVTec-AD, a recent manufacturing dataset offering ground-truth anomaly maps, FCDD sets a new state of the art in the unsupervised setting. Our method can incorporate ground-truth anomaly maps during training, and using even a few of these (~5) improves performance significantly. Finally, using FCDD's explanations we demonstrate the vulnerability of deep one-class classification models to spurious image features such as image watermarks.
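
A minimal sketch of the FCDD objective, under stated assumptions: `net` below is an illustrative stand-in for a fully convolutional network (not the paper's architecture), and the loss follows the pseudo-Huber form the method is built on, with the spatial output itself serving as the explanation heatmap.

```python
import torch
import torch.nn as nn

net = nn.Sequential(                              # stand-in FCN
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 1),                          # 1-channel spatial output
)

def fcdd_loss(x, y):
    # y = 0 for nominal samples, 1 for anomalous ones.
    a = torch.sqrt(net(x) ** 2 + 1) - 1           # pseudo-Huber heatmap
    score = a.flatten(1).mean(dim=1)              # per-sample anomaly score
    norm = score                                  # nominal: pull scores to 0
    anom = -torch.log1p(-torch.exp(-score) + 1e-9)  # anomalous: push away
    return torch.where(y.bool(), anom, norm).mean(), a

x = torch.randn(8, 3, 64, 64)
y = torch.randint(0, 2, (8,))
loss, heatmap = fcdd_loss(x, y)   # `heatmap` doubles as the explanation
loss.backward()
```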

Deep learning approaches to anomaly detection have recently improved the state of the art in detection performance on complex datasets such as large collections of images or text. These results have sparked a renewed interest in the anomaly detection problem and led to the introduction of a great variety of new methods. With the emergence of numerous such methods, including approaches based on generative models, one-class classification, and reconstruction, there is a growing need to bring methods of this field into a systematic and unified perspective. In this review we aim to identify the common underlying principles as well as the assumptions that are often made implicitly by various methods. In particular, we draw connections between classic 'shallow' and novel deep approaches and show how this relation might cross-fertilize or extend both directions. We further provide an empirical assessment of major existing methods that is enriched by the use of recent explainability...
ArXiv, 2020
Regularizing the input gradient has been shown to be effective in promoting the robustness of neural networks. The regularization of the input's Hessian is therefore a natural next step. A key challenge here is the computational complexity: computing the full Hessian of the inputs is computationally infeasible. In this paper we propose an efficient algorithm to train deep neural networks with Hessian operator-norm regularization. We analyze the approach theoretically and prove that the Hessian operator norm relates to the ability of a neural network to withstand an adversarial attack. We give a preliminary experimental evaluation on the MNIST and FMNIST datasets, which demonstrates that the new regularizer can, indeed, be feasible and, furthermore, that it increases the robustness of neural networks over input gradient regularization.
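
One tractable route, sketched here under assumptions (image-shaped inputs, hypothetical `model` and `loss_fn`; the paper's exact algorithm may differ), is to never form the Hessian and instead estimate its operator norm by power iteration on Hessian-vector products obtained via double backpropagation:

```python
import torch

def input_hessian_opnorm(model, loss_fn, x, y, n_iter=5):
    # Estimate the spectral norm of the loss Hessian w.r.t. the input;
    # each Hessian-vector product costs one extra backward pass.
    x = x.detach().requires_grad_(True)           # assumes x of shape (B, C, H, W)
    loss = loss_fn(model(x), y)
    (g,) = torch.autograd.grad(loss, x, create_graph=True)
    v = torch.randn_like(x)
    for _ in range(n_iter):
        v = v / (v.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-12)
        (v,) = torch.autograd.grad(g, x, grad_outputs=v,
                                   retain_graph=True, create_graph=True)
    return v.flatten(1).norm(dim=1).mean()        # differentiable regularizer

# Usage sketch: total = task_loss + lam * input_hessian_opnorm(model, ce, x, y)
```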

ArXiv, 2021
Deep one-class classification variants for anomaly detection learn a mapping that concentrates nominal samples in feature space, causing anomalies to be mapped away. Because this transformation is highly non-linear, finding interpretations poses a significant challenge. In this paper we present an explainable deep one-class classification method, Fully Convolutional Data Description (FCDD), where the mapped samples are themselves also an explanation heatmap. FCDD yields competitive detection performance and provides reasonable explanations on common anomaly detection benchmarks with CIFAR-10 and ImageNet. On MVTec-AD, a recent manufacturing dataset offering ground-truth anomaly maps, FCDD meets the state of the art in an unsupervised setting, and outperforms its competitors in a semi-supervised setting. Finally, using FCDD's explanations we demonstrate the vulnerability of deep one-class classification models to spurious image features such as image watermarks.

ArXiv, 2020
While nonparametric density estimators often perform well on low-dimensional data, their performance can suffer when applied to higher-dimensional data, owing presumably to the curse of dimensionality. One technique for avoiding this is to assume no dependence between features and that the data are sampled from a separable density. This allows one to estimate each marginal distribution independently, thereby avoiding the slow rates associated with estimating the full joint density. This is a strategy employed in naive Bayes models and is analogous to estimating a rank-one tensor. In this paper we investigate whether these improvements can be extended to other simplified dependence assumptions, which we model via nonnegative tensor decompositions. In our central theoretical results we prove that restricting estimation to low-rank nonnegative PARAFAC or Tucker decompositions removes the dimensionality exponent on bin width rates for multidimensional histograms. These results are validated...
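
A sketch of how such an estimator can be assembled from standard tools (assuming the tensorly library, with the cp_to_tensor API of recent versions; the data, bin count, and rank are placeholders, and this is not the paper's implementation):

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import non_negative_parafac

rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, 3))              # placeholder d = 3 data

hist, _ = np.histogramdd(data, bins=16)          # multidimensional histogram
hist /= hist.sum()

# Project the bin-probability tensor onto a rank-4 nonnegative PARAFAC model.
cp = non_negative_parafac(tl.tensor(hist), rank=4, n_iter_max=200)
est = tl.cp_to_tensor(cp)
est /= est.sum()                                 # renormalized estimate
```
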
The kernel density estimator (KDE) based on a radial positive-semidefinite kernel may be viewed as a sample mean in a reproducing kernel Hilbert space. This mean can be viewed as the solution of a least squares problem in that space. Replacing the squared loss with a robust loss yields a robust kernel density estimator (RKDE). Previous work has shown that RKDEs are weighted kernel density estimators which have desirable robustness properties. In this paper we establish asymptotic L1 consistency of the RKDE for a class of losses and show that the RKDE converges with the same rate on bandwidth required for the traditional KDE. We also present a novel proof of the consistency of the traditional KDE.
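
The weighted-KDE structure can be sketched with iteratively re-weighted least squares in the RKHS. This is a hedged sketch: it uses a Gaussian kernel and a Huber-style psi with a median-heuristic threshold, whereas the paper treats a general class of losses.

```python
import numpy as np

def rkde_weights(X, sigma=1.0, n_iter=30):
    # X: (n, d) sample. Returns weights for a robust weighted KDE.
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * sigma**2))              # Gaussian Gram matrix
    w = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        # RKHS distance of each Phi(x_i) to the current weighted mean
        dist = np.sqrt(np.clip(np.diag(K) - 2 * K @ w + w @ K @ w, 0, None))
        c = np.median(dist)                       # Huber threshold (heuristic)
        w = np.minimum(dist, c) / (dist + 1e-12)  # psi(d) / d, downweights outliers
        w /= w.sum()
    return w   # robust KDE: f(x) = sum_i w_i * k_sigma(x, x_i)
```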

ArXiv, 2021
We propose a novel training methodology, Concept Group Learning (CGL), that encourages training of interpretable CNN filters by partitioning the filters in each layer into concept groups, each of which is trained to learn a single visual concept. We achieve this through a novel regularization strategy that forces filters in the same group to be active in similar image regions for a given layer. We additionally use a regularizer to encourage a sparse weighting of the concept groups in each layer, so that a few concept groups can have greater importance than others. We quantitatively evaluate CGL's model interpretability using standard interpretability evaluation techniques and find that our method increases interpretability scores in most cases. Qualitatively, we compare the image regions which are most active under filters learned with CGL versus filters learned without CGL, and find that CGL activation regions more strongly concentrate around semantically relevant features.
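
The spatial-agreement idea can be sketched as a simple penalty (illustrative only: equal-size groups and a squared disagreement with the group-mean activation map; the paper's regularizers differ in detail, and the sparse group weighting would be a separate L1 term):

```python
import torch

def concept_group_penalty(feat, n_groups):
    # feat: (B, C, H, W) activations of one layer, C divisible by n_groups.
    B, C, H, W = feat.shape
    g = feat.abs().reshape(B, n_groups, C // n_groups, H, W)
    g = g / (g.flatten(3).sum(-1)[..., None, None] + 1e-9)   # spatial maps
    # Penalize each filter's map for deviating from its group's mean map.
    return ((g - g.mean(dim=2, keepdim=True)) ** 2).mean()
```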

ArXiv, 2020
Deep approaches to anomaly detection have recently shown promising results over shallow methods on large and complex datasets. Typically, anomaly detection is treated as an unsupervised learning problem. In practice, however, one may have, in addition to a large set of unlabeled samples, access to a small pool of labeled samples, e.g., a subset verified by some domain expert as being normal or anomalous. Semi-supervised approaches to anomaly detection aim to utilize such labeled samples, but most proposed methods are limited to merely including labeled normal samples. Only a few methods take advantage of labeled anomalies, with existing deep approaches being domain-specific. In this work we present Deep SAD, an end-to-end deep methodology for general semi-supervised anomaly detection. We further introduce an information-theoretic framework for deep anomaly detection based on the idea that the entropy of the latent distribution for normal data should be lower than the entropy of the anomalous distribution...
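
The core objective admits a compact sketch (labels: 0 for unlabeled, +1 for labeled normal, −1 for labeled anomalies; `c` is a fixed center in output space and `eta` weights the labeled term; the epsilon is an added stabilizer, not from the paper):

```python
import torch

def deep_sad_loss(z, y, c, eta=1.0, eps=1e-6):
    # z: network embeddings phi(x); assumes the batch contains unlabeled samples.
    d2 = ((z - c) ** 2).sum(dim=1) + eps
    loss = d2[y == 0].mean()
    if (y != 0).any():
        # labeled normals contribute d2, labeled anomalies 1/d2 (pushed away)
        loss = loss + eta * (d2[y != 0] ** y[y != 0].float()).mean()
    return loss
```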

Despite the great advances made by deep learning in many machine learning problems, there is a relative dearth of deep learning approaches for anomaly detection. Those approaches which do exist involve networks trained to perform a task other than anomaly detection, namely generative models or compression, which are in turn adapted for use in anomaly detection; they are not trained on an anomaly-detection-based objective. In this paper we introduce a new anomaly detection method, Deep Support Vector Data Description, which is trained on an anomaly-detection-based objective. The adaptation to the deep regime necessitates that our neural network and training procedure satisfy certain properties, which we demonstrate theoretically. We show the effectiveness of our method on the MNIST and CIFAR-10 image benchmark datasets as well as on the detection of adversarial examples of GTSRB stop signs.
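
A minimal sketch of the one-class objective on placeholder data (the stand-in `phi` is illustrative; the paper uses convolutional networks). Two of the required properties show up directly: the network is bias-free, which blocks a trivial constant-map collapse, and the center is fixed from an initial forward pass rather than learned.

```python
import torch
import torch.nn as nn

phi = nn.Sequential(nn.Linear(784, 128, bias=False), nn.ReLU(),
                    nn.Linear(128, 32, bias=False))
x = torch.randn(256, 784)                     # placeholder nominal data
with torch.no_grad():
    c = phi(x).mean(dim=0)                    # fixed center in output space

opt = torch.optim.Adam(phi.parameters(), lr=1e-3, weight_decay=1e-5)
for _ in range(10):                           # training-loop sketch
    loss = ((phi(x) - c) ** 2).sum(dim=1).mean()
    opt.zero_grad(); loss.backward(); opt.step()

score = ((phi(x) - c) ** 2).sum(dim=1)        # anomaly score at test time
```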

Functional Analytic Perspectives on Nonparametric Density Estimation, by Robert A. Vandermeulen. Chair: Clayton Scott. Nonparametric density estimation is a classic problem in statistics. In the standard estimation setting, when one has access to iid samples from an unknown distribution, there exist several established and well-studied nonparametric density estimators. Yet there remain interesting alternative settings which are less well studied. This work considers two such settings. First we consider the case where the data contain some contamination, i.e. a portion of the data is not distributed according to the density we would like to estimate. In this setting one would like an estimator which is robust to the contaminating data. An approach to this was suggested in Kim and Scott (2012). The estimator in that paper was analytically and experimentally shown to be robust, but no consistency result was presented. In Chapter II it is demonstrated that this estimator is indeed consistent...
ArXiv, 2020
Recent research has established sufficient conditions for finite mixture models to be identifiable from grouped observations. These conditions allow the mixture components to be nonparametric and have substantial (or even total) overlap. This work proposes an algorithm that consistently estimates any identifiable mixture model from grouped observations. Our analysis leverages an oracle inequality for weighted kernel density estimators of the distribution on groups, together with a general result showing that consistent estimation of the distribution on groups implies consistent estimation of mixture components. A practical implementation is provided for paired observations, and the approach is shown to outperform existing methods, especially when mixture components overlap significantly.
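
For paired observations over a finite sample space the pipeline can be illustrated end to end: the pair distribution is the matrix M = sum_i pi_i p_i p_i^T, and a rank-m nonnegative factorization approximately recovers the components up to permutation and scale. This is a simplified discrete stand-in for the paper's kernel-based estimator; all constants are illustrative.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(1)
pi = np.array([0.3, 0.7])                              # mixing weights
P = np.array([[0.7, 0.2, 0.1],                         # component 1
              [0.1, 0.3, 0.6]])                        # component 2

n = 200_000
comp = rng.choice(2, size=n, p=pi)                     # latent component per pair
cum = P.cumsum(axis=1)
x1 = (rng.random(n)[:, None] > cum[comp]).sum(axis=1)  # first observation
x2 = (rng.random(n)[:, None] > cum[comp]).sum(axis=1)  # second observation

M = np.zeros((3, 3))
np.add.at(M, (x1, x2), 1.0)
M /= M.sum()                                           # empirical pair distribution

W = NMF(n_components=2, max_iter=2000).fit_transform(M)
print(W / W.sum(axis=0))                               # columns ~ the components
```
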
Though anomaly detection (AD) can be viewed as a classification problem (nominal vs. anomalous), it is usually treated in an unsupervised manner, since one typically does not have access to, or it is infeasible to utilize, a dataset that sufficiently characterizes what it means to be "anomalous." In this paper we present results demonstrating that this intuition surprisingly does not extend to deep AD on images. For a recent AD benchmark on ImageNet, classifiers trained to discern between normal samples and just a few (64) random natural images are able to outperform the current state of the art in deep AD. We find that this approach is also very effective at other common image AD benchmarks. Experimentally we discover that the multiscale structure of image data makes example anomalies exceptionally informative.
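
The recipe is short enough to sketch directly (stand-in tensors replace the real nominal data and the pool of 64 outlier-exposure images, and the tiny classifier is illustrative):

```python
import torch
import torch.nn as nn

clf = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1))
opt = torch.optim.Adam(clf.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

nominal = torch.randn(128, 3, 64, 64)    # stand-in for the normal class
oe_pool = torch.randn(64, 3, 64, 64)     # stand-in for 64 random natural images

for _ in range(10):                       # training-loop sketch
    idx = torch.randint(0, 64, (128,))
    x = torch.cat([nominal, oe_pool[idx]])
    y = torch.cat([torch.zeros(128), torch.ones(128)])
    loss = bce(clf(x).squeeze(1), y)
    opt.zero_grad(); loss.backward(); opt.step()

anomaly_score = clf(nominal).squeeze(1)   # higher = more anomalous
```
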
Deep anomaly detection is a difficult task since, in high dimensions, it is hard to completely characterize a notion of "differentness" when given only examples of normality. In this paper we propose a novel approach to deep anomaly detection based on augmenting large pretrained networks with residual corrections that adjust them to the task of anomaly detection. Our method gives rise to a highly parameter-efficient learning mechanism, enhances disentanglement of representations in the pretrained model, and outperforms all existing anomaly detection methods, including other baselines utilizing pretrained networks. On the CIFAR-10 one-versus-rest benchmark, for example, our technique raises the state of the art from 96.1 to 99.0 mean AUC.
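
A hedged sketch of the residual-correction idea, assuming torchvision's resnet18 as the pretrained backbone and a hypothetical two-layer adapter (the paper's parameterization differs): freeze the backbone and learn a small additive correction to its features, which can then be trained with any feature-space anomaly detection objective.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

backbone = resnet18(weights="DEFAULT")   # downloads pretrained weights
backbone.fc = nn.Identity()              # expose 512-d features
for p in backbone.parameters():
    p.requires_grad_(False)              # backbone stays frozen

# Small trainable residual correction on top of the frozen features.
adapter = nn.Sequential(nn.Linear(512, 64), nn.ReLU(), nn.Linear(64, 512))

def features(x):
    h = backbone(x)
    return h + adapter(h)                # parameter-efficient adjustment
```
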
Detecting semantic anomalies is challenging due to the countless ways in which they may appear in real-world data. While enhancing the robustness of networks may be sufficient for modeling simplistic anomalies, there is no known way of preparing models for all the unseen anomalies that can potentially occur, such as the appearance of new object classes. In this paper, we show that a previously overlooked strategy for anomaly detection (AD) is to introduce an explicit inductive bias toward representations transferred over from some large and varied semantic task. We rigorously verify our hypothesis in controlled trials that utilize intervention, and show that it gives rise to surprisingly effective auxiliary objectives that outperform previous AD paradigms.
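
A minimal version of the transfer strategy (illustrative: `embed` stands in for any frozen network pretrained on a large, varied semantic task, and distance-to-center is just one possible score):

```python
import torch

@torch.no_grad()
def fit_center(embed, normal_batches):
    # Summarize the normal data in the transferred feature space.
    feats = torch.cat([embed(x) for x in normal_batches])
    return feats.mean(dim=0)

@torch.no_grad()
def anomaly_score(embed, x, center):
    return ((embed(x) - center) ** 2).sum(dim=1)   # larger = more anomalous
```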
