Bob Durrant

Followers

Following

Public Views

Brandon T Willard

University of Chicago

University of Twente

Interests

Uploads

Papers by Bob Durrant

• Implementation • Experiments & results

Learning with Lq<1 vs L1-norm regularisation with exponentially many

Download

Maximum Gradient Dimensionality Reduction

2018 24th International Conference on Pattern Recognition (ICPR), 2018

We propose a novel dimensionality reduction approach based on the gradient of the regression func... more We propose a novel dimensionality reduction approach based on the gradient of the regression function. Our approach is conceptually similar to Principal Component Analysis, however instead of seeking a low dimensional representation of the predictors that preserve the sample variance, we project onto a basis that preserves those predictors which induce the greatest change in the response. Our approach has the benefits of being simple and easy to implement and interpret, while still remaining very competitive with sophisticated state-of-the-art approaches.

Download

Linear dimensionality reduction in linear time: Johnson-Lindenstrauss-type guarantees for random subspace

arXiv: Machine Learning, 2017

We consider the problem of efficient randomized dimensionality reduction with norm-preservation g... more We consider the problem of efficient randomized dimensionality reduction with norm-preservation guarantees. Specifically we prove data-dependent Johnson-Lindenstrauss-type geometry preservation guarantees for Ho's random subspace method: When data satisfy a mild regularity condition -- the extent of which can be estimated by sampling from the data -- then random subspace approximately preserves the Euclidean geometry of the data with high probability. Our guarantees are of the same order as those for random projection, namely the required dimension for projection is logarithmic in the number of data points, but have a larger constant term in the bound which depends upon this regularity. A challenging situation is when the original data have a sparse representation, since this implies a very large projection dimension is required: We show how this situation can be improved for sparse binary data by applying an efficient `densifying' preprocessing, which neither changes the Eu...

Download

Structure-aware error bounds for linear classification with the zero-one loss

We prove risk bounds for binary classification in high-dimensional settings when the sample size ... more We prove risk bounds for binary classification in high-dimensional settings when the sample size is allowed to be smaller than the dimensionality of the training set observations. In particular, we prove upper bounds for both 'compressive learning' by empirical risk minimization (ERM) (that is when the ERM classifier is learned from data that have been projected from high-dimensions onto a randomly selected low-dimensional subspace) as well as uniform upper bounds in the full high-dimensional space. A novel tool we employ in both settings is the 'flipping probability' of Durrant and Kaban (ICML 2013) which we use to capture benign geometric structures that make a classification problem 'easy' in the sense of demanding a relatively low sample size for guarantees of good generalization. Furthermore our bounds also enable us to explain or draw connections between several existing successful classification algorithms. Finally we show empirically that our bounds a...

Download

Maximum Margin Principal Components

ArXiv, 2017

Principal Component Analysis (PCA) is a very successful dimensionality reduction technique, widel... more Principal Component Analysis (PCA) is a very successful dimensionality reduction technique, widely used in predictive modeling. A key factor in its widespread use in this domain is the fact that the projection of a dataset onto its first $K$ principal components minimizes the sum of squared errors between the original data and the projected data over all possible rank $K$ projections. Thus, PCA provides optimal low-rank representations of data for least-squares linear regression under standard modeling assumptions. On the other hand, when the loss function for a prediction problem is not the least-squares error, PCA is typically a heuristic choice of dimensionality reduction -- in particular for classification problems under the zero-one loss. In this paper we target classification problems by proposing a straightforward alternative to PCA that aims to minimize the difference in margin distribution between the original and the projected data. Extensive experiments show that our simp...

Download

Random Projections for Machine Learning and Data Mining: Theory and Applications

LP-type Problems Definition [MS03] An abstract LP-type problem is a pair (H,w) where: H a finite ... more LP-type Problems Definition [MS03] An abstract LP-type problem is a pair (H,w) where: H a finite set of constraints. w : H2 → R ∪ (−∞,∞) an objective function to be minimized which satisfies, for any h ∈ H and any F ⊆ G ⊆ H: Monotonicity: w(F ) 6 w(G) 6 w(H). Locality: If w(F ) = w(G) = w(F ∪ h) then w(F ) = w(G ∪ h). Interpretation: w(G) is the minimum value of a solution satisfying all constraints on G. R.J.Durrant & A.Kaban (U.Birmingham) RP for Machine Learning & Data Mining ECML-PKDD 2012 67 / 123 Basis and Combinatorial Dimension Definitions: Basis, Combinatorial Dimension L = (H,w) abstract LP-type problem then: A basis for F ⊆ H is a minimal set of constraints B ⊆ F such that w(B) = w(F ). The combinatorial dimension of L is the size of the largest basis. Combinatorial dimension examples for problems in Rd : Smallest enclosing ball, d + 1 Linear program, d + 1 Distance between hyperplanes, d + 2 R.J.Durrant & A.Kaban (U.Birmingham) RP for Machine Learning & Data Mining ECML-...

Download

A Diversity-aware Model for Majority Vote Ensemble Accuracy

Ensemble classifiers are a successful and popular approach for classification, and are frequently... more Ensemble classifiers are a successful and popular approach for classification, and are frequently found to have better generalization performance than single models in practice. Although it is widely recognized that ‘diversity’ between ensemble members is important in achieving these performance gains, for classification ensembles it is not widely understood which diversity measures are most predictive of ensemble performance, nor how large an ensemble should be for a particular application. In this paper, we explore the predictive power of several common diversity measures and show – with extensive experiments – that contrary to earlier work that finds no clear link between these diversity measures (in isolation) and ensemble accuracy instead by using the ρ diversity measure of Sneath and Sokal as an estimator for the dispersion parameter of a Polya-Eggenberger distribution we can predict, independently of the choice of base classifier family, the accuracy of a majority vote classi...

Download

Structure from Randomness in Halfspace Learning with the Zero-One Loss

Journal of Artificial Intelligence Research, 2020

We prove risk bounds for halfspace learning when the data dimensionality is allowed to be larger ... more We prove risk bounds for halfspace learning when the data dimensionality is allowed to be larger than the sample size, using a notion of compressibility by random projection. In particular, we give upper bounds for the empirical risk minimizer learned efficiently from randomly projected data, as well as uniform upper bounds in the full high-dimensional space. Our main findings are the following: i) In both settings, the obtained bounds are able to discover and take advantage of benign geometric structure, which turns out to depend on the cosine similarities between the classifier and points of the input space, and provide a new interpretation of margin distribution type arguments. ii) Furthermore our bounds allow us to draw new connections between several existing successful classification algorithms, and we also demonstrate that our theory is predictive of empirically observed performance in numerical simulations and experiments. iii) Taken together, these results suggest that the ...

Download

How effective is Cauchy-EDA in high dimensions?

2016 IEEE Congress on Evolutionary Computation (CEC), 2016

We consider the problem of high dimensional blackbox optimisation via Estimation of Distribution ... more We consider the problem of high dimensional blackbox optimisation via Estimation of Distribution Algorithms (EDA) and the use of heavy-tailed search distributions in this setting. Some authors have suggested that employing a heavy tailed search distribution, such as a Cauchy, may make EDA better explore a high dimensional search space. However, other authors have found Cauchy search distributions are less effective than Gaussian search distributions in high dimensional problems. In this paper, we set out to resolve this controversy. To achieve this we run extensive experiments on a battery of high-dimensional test functions, and develop some theory which shows that small search steps are always more likely to move the search distribution towards the global optimum than large ones and, in particular, large search steps in high-dimensional spaces nearly always do badly in this respect. We hypothesise that, since exploration by large steps is mostly counterproductive in high dimensions, and since the fraction of good directions decays exponentially fast with increasing dimension, instead one should focus mainly on finding the right direction in which to move the search distribution. We propose a minor change to standard Gaussian EDA which implicitly achieves this aim, and our experiments on a sequence of test functions confirm the good performance of our new approach.

Download

Random Projections as Regularizers–Supplementary Material

Download

Sharp generalization error bounds for randomly-projected classifiers

We derive sharp bounds on the generalization error of a generic linear classifier trained by empi... more We derive sharp bounds on the generalization error of a generic linear classifier trained by empirical risk minimization on randomlyprojected data. We make no restrictive assumptions (such as sparsity or separability) on the data: Instead we use the fact that, in a classification setting, the question of interest is really 'what is the effect of random projection on the predicted class labels?' and we therefore derive the exact probability of 'label flipping' under Gaussian random projection in order to quantify this effect precisely in our bounds.

Download

Random projections as regularizers: learning a linear discriminant from fewer observations than dimensions

Machine Learning, 2014

We prove theoretical guarantees for an averaging-ensemble of randomly projected Fisher linear dis... more We prove theoretical guarantees for an averaging-ensemble of randomly projected Fisher linear discriminant classifiers, focusing on the case when there are fewer training observations than data dimensions. The specific form and simplicity of this ensemble permits a direct and much more detailed analysis than existing generic tools in previous works. In particular, we are able to derive the exact form of the generalization error of our ensemble, conditional on the training set, and based on this we give theoretical guarantees which directly link the performance of the ensemble to that of the corresponding linear discriminant learned in the full data space. To the best of our knowledge these are the first theoretical results to prove such an explicit link for any classifier and classifier ensemble pair. Furthermore we show that the randomly projected ensemble is equivalent to implementing a sophisticated regularization scheme to the linear discriminant learned in the original data space and this prevents overfitting in conditions of small sample size where pseudo-inverse FLD learned in the data space is provably poor. Our ensemble is learned from a set of randomly projected representations of the original high dimensional data and therefore for this approach data can be collected, stored and processed in such a compressed form. We confirm our theoretical findings with experiments, and demonstrate the utility of our approach on several datasets from the bioinformatics domain and one very high dimensional dataset from the drug discovery domain, both settings in which fewer observations than dimensions are the norm.

Download

Learning with L q<1 vs L 1-norm regularisation with exponentially Many Irrelevant Features

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2008

We study the use of fractional norms for regularisation in supervised learning from high dimensio... more We study the use of fractional norms for regularisation in supervised learning from high dimensional data, in conditions of a large number of irrelevant features, focusing on logistic regression. We develop a variational method for parameter estimation, and show an equivalence between two approximations recently proposed in the statistics literature. Building on previous work by A.Ng, we show the fractional norm regularised logistic regression enjoys a sample complexity that grows logarithmically with the data dimensions and polynomially with the number of relevant dimensions. In addition, extensive empirical testing indicates that fractional-norm regularisation is more suitable than L1 in cases when the number of relevant features is very small, and works very well despite a large number of irrelevant features. 1 L q<1-Regularised Logistic Regression Consider a training set of pairs z = {(x j , y j)} n j=1 drawn i.i.d. from some unknown distribution P. x j ∈ R m are m-dimensional input points and y j ∈ {−1, 1} are the associated target labels for these points. Given z, the aim in supervised learning is to learn a mapping from inputs to targets that is then able to predict the target values for previously unseen points that follow the same distribution as the training data. We are interested in problems with large number m of input features, of which only a few r << m are relevant to the target. In particular, we focus on a form of regularised logistic regression for this purpose: max w n j=1 log p(y j |x j , w) (1) subject to||w|| q ≤ A (2) or, in the Lagrangian formulation: max w n j=1 log p(y j |x j , w) − α||w|| q q (3)

Download

Compressed Fisher Linear Discriminant analysis: Classification of randomly projected data

Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010

We consider random projections in conjunction with classification, specifically the analysis of F... more We consider random projections in conjunction with classification, specifically the analysis of Fisher's Linear Discriminant (FLD) classifier in randomly projected data spaces. Unlike previous analyses of other classifiers in this setting, we avoid the unnatural effects that arise when one insists that all pairwise distances are approximately preserved under projection. We impose no sparsity or underlying lowdimensional structure constraints on the data; we instead take advantage of the class structure inherent in the problem. We obtain a reasonably tight upper bound on the estimated misclassification error on average over the random choice of the projection, which, in contrast to early distance preserving approaches, tightens in a natural way as the number of training examples increases. It follows that, for good generalisation of FLD, the required projection dimension grows logarithmically with the number of classes. We also show that the error contribution of a covariance misspecification is always no worse in the low-dimensional space than in the initial high-dimensional space. We contrast our findings to previous related work, and discuss our insights.

Download

A bound on the performance of LDA in randomly projected data spaces

Proceedings - International Conference on Pattern Recognition, 2010

We consider the problem of classification in nonadaptive dimensionality reduction. Specifically, ... more We consider the problem of classification in nonadaptive dimensionality reduction. Specifically, we bound the increase in classification error of Fisher's Linear Discriminant classifier resulting from randomly projecting the high dimensional data into a lower dimensional space and both learning the classifier and performing the classification in the projected space. Our bound is reasonably tight, and unlike existing bounds on learning from randomly projected data, it becomes tighter as the quantity of training data increases without requiring any sparsity structure from the data.

Download

Towards large scale continuous EDA: A random matrix theory perspective

GECCO 2013 - Proceedings of the 2013 Genetic and Evolutionary Computation Conference, 2013

Estimation of distribution algorithms (EDA) are a major branch of evolutionary algorithms (EA) wi... more Estimation of distribution algorithms (EDA) are a major branch of evolutionary algorithms (EA) with some unique advantages in principle. They are able to take advantage of correlation structure to drive the search more efficiently, and they are able to provide insights about the structure of the search space. However, model building in high dimensions is extremely challenging and as a result existing EDAs lose their strengths in large scale problems. Large scale continuous global optimisation is key to many real-world problems of modern days. Scaling up EAs to large scale problems has become one of the biggest challenges of the field. This paper pins down some fundamental roots of the problem and makes a start at developing a new and generic framework to yield effective EDA-type algorithms for large scale continuous global optimisation problems. Our concept is to introduce an ensemble of random projections of the set of fittest search points to low dimensions as a basis for developing a new and generic divide-and-conquer methodology. This is rooted in the theory of random projections developed in theoretical computer science, and will exploit recent advances of non-asymptotic random matrix theory.

Download

Dimension-Adaptive Bounds on Compressive FLD Classification

Efficient dimensionality reduction by random projections (RP) gains popularity, hence the learnin... more Efficient dimensionality reduction by random projections (RP) gains popularity, hence the learning guarantees achievable in RP spaces are of great interest. In finite dimensional setting, it has been shown for the compressive Fisher Linear Discriminant (FLD) classifier that for good generalisation the required target dimension grows only as the log of the number of classes and is not adversely affected by the number of projected data points. However these bounds depend on the dimensionality d of the original data space. In this paper we give further guarantees that remove d from the bounds under certain conditions of regularity on the data density structure. In particular, if the data density does not fill the ambient space then the error of compressive FLD is independent of the ambient dimension and depends only on a notion of 'intrinsic dimension'.

Download

A tight bound on the performance of Fisher’s linear discriminant in randomly projected data spaces

Pattern Recognition Letters, 2012

We consider the problem of classification in non-adaptive dimensionality reduction. Specifically,... more We consider the problem of classification in non-adaptive dimensionality reduction. Specifically, we give an average-case bound on the classification error of Fisher's Linear Discriminant classifier when the classifier only has access to randomly projected versions of a given training set. By considering the system of random projection and classifier together as a whole, we are able to take advantage of the simple class structure inherent in the problem, and so derive a non-trivial performance bound without imposing any sparsity or underlying low-dimensional structure restrictions on the data. Our analysis also reveals and quantifies the effect of class 'flipping'-a potential issue when randomly projecting a finite sample. Our bound is reasonably tight, and unlike existing bounds on learning from randomly projected data, it becomes tighter as the quantity of training data increases. A preliminary version of this work received an IBM Best Student Paper Award at the 20th International Conference on Pattern Recognition.

Download