Neural Computation, 2013
This review examines kernel methods for online learning, in particular, multiclass classification. We examine margin-based approaches, stemming from Rosenblatt's original perceptron algorithm, as well as nonparametric probabilistic approaches that are based on the popular Gaussian process framework. We also examine approaches to online learning that use combinations of kernels (online multiple kernel learning). We present empirical validation of a wide range of methods on a protein fold recognition data set, where different biological feature types are available, and two object recognition data sets, Caltech101 and Caltech256, where multiple feature spaces are available in terms of different image feature extraction methods. Neural Computation 25, 567-625 (2013). © 2013 Massachusetts Institute of Technology.
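As a concrete illustration of the margin-based family the review starts from, here is a minimal sketch of a multiclass kernel perceptron with an ultraconservative-style update. This is illustrative code, not code from the paper; the RBF kernel, the mistake-driven +1/-1 update, and the class names are assumptions chosen for simplicity.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel; any positive-definite kernel could be substituted."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

class MulticlassKernelPerceptron:
    """Minimal online multiclass kernel perceptron (illustrative sketch)."""

    def __init__(self, n_classes, kernel=rbf_kernel):
        self.n_classes = n_classes
        self.kernel = kernel
        self.support = []   # stored examples x_i
        self.alpha = []     # per-example weight vectors of shape (n_classes,)

    def scores(self, x):
        x = np.asarray(x, dtype=float)
        s = np.zeros(self.n_classes)
        for xi, ai in zip(self.support, self.alpha):
            s += ai * self.kernel(xi, x)
        return s

    def partial_fit(self, x, y):
        """One online round: predict, then update only on a mistake."""
        x = np.asarray(x, dtype=float)
        y_hat = int(np.argmax(self.scores(x)))
        if y_hat != y:
            a = np.zeros(self.n_classes)
            a[y] += 1.0       # push the true class up
            a[y_hat] -= 1.0   # push the wrongly predicted class down
            self.support.append(x)
            self.alpha.append(a)
        return y_hat
```

Margin-based variants of the kind surveyed in the review differ mainly in when they update (on a mistake versus on a small margin) and how aggressively the coefficients are set.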
2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), 2016
We study the relationship between online Gaussian process (GP) regression and kernel least mean squares (KLMS) algorithms. While the latter cannot store the entire posterior distribution during online learning, we discover that their operation corresponds to the assumption of a fixed posterior covariance that follows a simple parametric model. Interestingly, several well-known KLMS algorithms correspond to specific cases of this model. The probabilistic perspective allows us to understand how each of them handles uncertainty, which could explain some of their performance differences.
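For orientation, a basic KLMS update looks like the sketch below (illustrative only; the step size `eta`, the RBF kernel, and the absence of any sparsification are assumptions). The paper's point is that updates of this kind can be read as online GP regression in which the posterior covariance is frozen to a simple parametric form rather than tracked exactly.

```python
import numpy as np

def rbf(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

class KLMS:
    """Kernel least mean squares: LMS / stochastic gradient descent lifted into an RKHS."""

    def __init__(self, eta=0.5, gamma=1.0):
        self.eta = eta
        self.gamma = gamma
        self.centers = []   # stored inputs
        self.coeffs = []    # one expansion coefficient per stored input

    def predict(self, x):
        x = np.asarray(x, dtype=float)
        return sum(c * rbf(xc, x, self.gamma)
                   for xc, c in zip(self.centers, self.coeffs))

    def update(self, x, y):
        x = np.asarray(x, dtype=float)
        err = y - self.predict(x)            # instantaneous prediction error
        self.centers.append(x)
        self.coeffs.append(self.eta * err)   # LMS step becomes a new kernel unit
        return err
```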
1999
I describe a framework for interpreting Support Vector Machines (SVMs) as maximum a posteriori (MAP) solutions to inference problems with Gaussian Process priors. This can provide intuitive guidelines for choosing a 'good' SVM kernel. It can also assign (by evidence maximization) optimal values to parameters, such as the noise level C, which cannot be determined unambiguously from properties of the MAP solution alone (such as cross-validation error). I illustrate this using a simple approximate expression for the SVM evidence. Once C has been determined, error bars on SVM predictions can also be obtained.
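The correspondence at the heart of this framework can be written compactly. In the notation assumed here (hinge loss, kernel K doubling as the GP covariance), the SVM solution is

```latex
\hat{f} \;=\; \arg\min_{f \in \mathcal{H}_K} \; C \sum_{i=1}^{n} \max\bigl(0,\, 1 - y_i f(x_i)\bigr) \;+\; \tfrac{1}{2}\,\|f\|_{\mathcal{H}_K}^{2}
\;=\; \arg\max_{f} \; P(f \mid \mathcal{D}),
\qquad
P(f \mid \mathcal{D}) \;\propto\; \exp\!\Bigl(-C \sum_{i} \max\bigl(0,\, 1 - y_i f(x_i)\bigr)\Bigr)\, P(f),
```

where P(f) is a zero-mean Gaussian process prior whose covariance function is the SVM kernel K: the regularizer is the negative log prior, and the scaled hinge loss plays the role of a negative log likelihood. This is the sense in which the SVM output is a MAP solution.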
Neural Computation, 2002
The Bayesian evidence framework has been successfully applied to the design of multilayer perceptrons (MLPs) in the work of MacKay. Nevertheless, the training of MLPs suffers from drawbacks such as the non-convex optimization problem and the choice of the number of hidden units. In Support Vector Machines (SVMs) for classification, as introduced by Vapnik, a nonlinear decision boundary is obtained by first mapping the input vector nonlinearly into a high-dimensional kernel-induced feature space, in which a linear large-margin classifier is constructed. Practical expressions are formulated in the dual space in terms of the related kernel function, and the solution follows from a (convex) quadratic programming (QP) problem. In Least Squares SVMs (LS-SVMs), the SVM problem formulation is modified by introducing a least squares cost function and equality instead of inequality constraints, and the solution follows from a linear system in the dual space. Implicitly, the least squares formulation corresponds to a regression formulation and is also related to kernel Fisher Discriminant Analysis. The least squares regression formulation has advantages for deriving analytic expressions in a Bayesian evidence framework, in contrast with the classification formulations used, e.g., in Gaussian Processes (GPs). The LS-SVM formulation has clear primal-dual interpretations, and without the bias term one explicitly constructs a model that yields the same expressions as have been obtained with GPs for regression. In this paper, the Bayesian evidence framework is combined with the LS-SVM classifier formulation. Starting from the feature space formulation, analytic expressions are obtained in the dual space on the different levels of Bayesian inference, while posterior class probabilities are obtained by marginalizing over the model parameters. Empirical results obtained on ten public domain datasets show that the LS-SVM classifier designed within the Bayesian evidence framework consistently yields good generalization performance.
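The step "the solution follows from a linear system in the dual space" can be sketched in a few lines of NumPy, under the standard LS-SVM classifier formulation. This is a minimal illustration, not the paper's code; the RBF kernel and the regularization constant `gam` are placeholders.

```python
import numpy as np

def rbf_kernel_matrix(X, Z, gamma_k=1.0):
    """Pairwise RBF kernel matrix between rows of X (n, d) and Z (m, d)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma_k * d2)

def lssvm_fit(X, y, gam=10.0, gamma_k=1.0):
    """Solve the LS-SVM classifier dual linear system; y has labels in {-1, +1}."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    Omega = np.outer(y, y) * rbf_kernel_matrix(X, X, gamma_k)  # Omega_ij = y_i y_j K(x_i, x_j)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(n) / gam        # equality constraints give a ridge-like term
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(A, rhs)              # one linear solve, no QP required
    return sol[1:], sol[0]                     # dual coefficients alpha, bias b

def lssvm_predict(X_train, y_train, alpha, b, X_test, gamma_k=1.0):
    K = rbf_kernel_matrix(np.asarray(X_test, float), np.asarray(X_train, float), gamma_k)
    return np.sign(K @ (alpha * np.asarray(y_train, float)) + b)
```

Replacing the QP by this linear solve is exactly what makes the analytic evidence-level expressions of the paper tractable.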
Statistical Analysis and Data Mining, 2014
Recent advances in data mining have integrated kernel functions with Bayesian probabilistic analysis of Gaussian distributions. These machine learning approaches can incorporate prior information with new data to calculate probabilistic rather than deterministic values for unknown parameters. This paper extensively analyzes a specific Bayesian kernel model that uses a kernel function to calculate a posterior beta distribution that is conjugate to the prior beta distribution. Numerical testing of the beta kernel model on several benchmark data sets reveals that this model's accuracy is comparable with those of the support vector machine, relevance vector machine, naive Bayes, and logistic regression, and the model runs more quickly than the other algorithms. When one class occurs much more frequently than the other, the beta kernel model often outperforms other strategies for handling imbalanced data sets, including undersampling, oversampling, and the Synthetic Minority Over-sampling Technique. If data arrive sequentially over time, the beta kernel model easily and quickly updates the probability distribution, and it is more accurate than an incremental support vector machine algorithm for online learning.
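As a rough illustration of the kind of conjugate update described above, the sketch below treats kernel similarities as fractional pseudo-counts for a Beta posterior. The prior parameters, the kernel, and the pseudo-count construction itself are assumptions made here for illustration and are not necessarily the exact model analyzed in the paper.

```python
import numpy as np

def rbf(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((np.asarray(x, float) - np.asarray(z, float)) ** 2))

def beta_kernel_posterior(x_query, X, y, a0=1.0, b0=1.0, gamma=1.0):
    """Illustrative kernel-weighted conjugate Beta update (an assumption, see lead-in).

    Each training point contributes a fractional pseudo-count proportional to its
    kernel similarity to the query. Because the Beta prior stays conjugate, the
    posterior is again Beta and can be updated one observation at a time.
    """
    y = np.asarray(y)
    w = np.array([rbf(x_query, xi, gamma) for xi in X])
    a = a0 + np.sum(w[y == 1])     # weighted "successes"
    b = b0 + np.sum(w[y == 0])     # weighted "failures"
    return a, b, a / (a + b)       # posterior parameters and posterior mean of P(y=1 | x_query)
```

The sequential-update property highlighted in the abstract follows directly from conjugacy: a new labeled point only increments one of the two pseudo-count sums.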
Machine Learning, 2002
I describe a framework for interpreting Support Vector Machines (SVMs) as maximum a posteriori (MAP) solutions to inference problems with Gaussian Process priors. This probabilistic interpretation can provide intuitive guidelines for choosing a 'good' SVM kernel. Beyond this, it allows Bayesian methods to be used for tackling two of the outstanding challenges in SVM classification: how to tune hyperparameters (the misclassification penalty C and any parameters specifying the kernel), and how to obtain predictive class probabilities rather than the conventional deterministic class label predictions. Hyperparameters can be set by maximizing the evidence; I explain how the latter can be defined and properly normalized. Both analytical approximations and numerical methods (Monte Carlo chaining) for estimating the evidence are discussed. I also compare different methods of estimating class probabilities, ranging from simple evaluation at the MAP or at the posterior average to full averaging over the posterior. A simple toy application illustrates the various concepts and techniques.
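"Maximizing the evidence" here means integrating out the latent function and optimizing over the hyperparameters; writing the hyperparameters collectively as theta (C together with the kernel parameters), the schematic relations are

```latex
P(\theta \mid \mathcal{D}) \;\propto\; P(\mathcal{D} \mid \theta)\, P(\theta),
\qquad
P(\mathcal{D} \mid \theta) \;=\; \int P(\mathcal{D} \mid f)\, P(f \mid \theta)\, \mathrm{d}f .
```

With a flat hyperprior, the most probable hyperparameters are those with the largest evidence; the analytical approximations and Monte Carlo chaining mentioned in the abstract are different ways of estimating this integral.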
The Annals of Statistics, 2008
We review machine learning methods employing positive definite kernels. These methods formulate learning and estimation problems in a reproducing kernel Hilbert space (RKHS) of functions defined on the data domain, expanded in terms of a kernel. Working in linear spaces of functions has the benefit of facilitating the construction and analysis of learning algorithms while at the same time allowing large classes of functions. The latter include nonlinear functions as well as functions defined on nonvectorial data. We cover a wide range of methods, ranging from binary classifiers to sophisticated methods for estimation with structured data.
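The kernel expansion mentioned above is the representer theorem: for regularized empirical risk minimization over an RKHS, the minimizer lies in the span of the kernel functions at the training points,

```latex
\min_{f \in \mathcal{H}_k} \; \frac{1}{n} \sum_{i=1}^{n} L\bigl(y_i, f(x_i)\bigr) \;+\; \lambda\, \|f\|_{\mathcal{H}_k}^{2}
\quad\Longrightarrow\quad
f^{\star}(\cdot) \;=\; \sum_{i=1}^{n} \alpha_i\, k(x_i, \cdot),
```

which reduces an optimization over an infinite-dimensional function space to one over the n coefficients and requires only kernel evaluations between data points; this is what makes nonlinear functions and nonvectorial data tractable.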
2004
Sparse approximations to Bayesian inference for nonparametric Gaussian Process models scale linearly in the number of training points, allowing for the application of these powerful kernel-based models to large datasets. We show how to generalize the binary classification informative vector machine (IVM) [6] to multiple classes. In contrast to earlier efficient approaches to kernel-based non-binary classification, our method is a principled approximation to Bayesian inference which yields valid uncertainty estimates and allows for hyperparameter estimation via marginal likelihood maximization. While most earlier proposals suggest fitting independent binary discriminants to heuristically chosen partitions of the data and combining these in a heuristic manner, our method operates jointly on the data for all classes. Crucially, we still achieve a linear scaling in both the number of classes and the number of training points.
[Footnote 1] Traditional methods such as cross-validation are not useful, because different covariance functions (kernels) should be used for every class, leading to at least O(C) hyperparameters.
[Footnote 2] In the light of a confused anonymous referee, it seems necessary to clarify our scaling statements. Under the assumption that C < d < n, the dominant contribution to the scaling is O(nCd^2). As with any other method in this domain, there are additional O(d^3), O(dC^3), and other contributions which are subdominant under these assumptions. Especially, our method cannot be used in a large-C domain without further modifications not discussed here. Unless otherwise said, claims about the scaling behaviour concentrate on the dominant term (under these assumptions), which is linear in the training set size n.
Studies in Applied Mathematics, 2010
Gaussians are important tools for learning from data of large dimensions. The variance of a Gaussian kernel is a measure of the frequency range of function components or features retrieved by learning algorithms induced by the Gaussian. The learning ability and approximation power increase when the variance of the Gaussian decreases. Thus, it is natural to use Gaussians with decreasing variances for online algorithms when samples arrive one by one. In this paper, we consider fully online classification algorithms associated with a general loss function and varying Gaussians, which are closely related to regularization schemes in reproducing kernel Hilbert spaces. Learning rates are derived in terms of the smoothness of a target function associated with the probability measure controlling sampling and the loss function. A critical estimate is given for the norm of the difference of regularized target functions as the variance of the Gaussian changes. Concrete learning rates are presented for the online learning algorithm with the least squares loss function.
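A schematic form of such a fully online regularized update is given below; the notation (step sizes eta_t, regularization parameters lambda_t, a margin-based loss phi, and a Gaussian kernel whose variance sigma_t^2 may shrink with t) is assumed here for illustration rather than copied from the paper:

```latex
f_{t+1} \;=\; f_t \;-\; \eta_t \Bigl( \varphi'\bigl(y_t f_t(x_t)\bigr)\, y_t\, K_{\sigma_t}(x_t, \cdot) \;+\; \lambda_t f_t \Bigr),
\qquad f_1 = 0 .
```

Each round thus takes a stochastic gradient step on the regularized loss in the current RKHS; the learning rates in the paper quantify how the schedules of sigma_t, eta_t, and lambda_t govern convergence to the target function.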
2005
Very high dimensional learning systems become theoretically possible when training examples are abundant. The computing cost then becomes the limiting factor. Any efficient learning algorithm should at least take a brief look at each example. But should all examples be given equal attention? This contribution proposes an empirical answer. We first present an online SVM algorithm, LASVM, based on this premise. LASVM yields competitive misclassification rates after a single pass over the training examples, while running faster than state-of-the-art SVM solvers. We then show how active example selection can yield faster training, higher accuracies, and simpler models, using only a fraction of the training example labels.
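The active example selection idea can be illustrated in a few lines. The sketch below is deliberately simplified and is not the LASVM algorithm itself: it merely queries the candidate that is most ambiguous under the current model (smallest absolute decision value), sampled from a small random batch to keep selection cheap on large pools.

```python
import numpy as np

def select_most_ambiguous(model, X_pool, batch_size=50, rng=None):
    """Pick the pool example whose current decision value is closest to zero.

    `model` is any object exposing decision_function(X), e.g. a kernel SVM;
    restricting the search to a random batch keeps the cost per query low.
    """
    rng = rng or np.random.default_rng(0)
    idx = rng.choice(len(X_pool), size=min(batch_size, len(X_pool)), replace=False)
    margins = np.abs(model.decision_function(X_pool[idx]))
    return idx[int(np.argmin(margins))]   # index into X_pool of the example to label next
```

Selecting low-margin examples concentrates labeling effort near the decision boundary, which is the intuition behind the faster training and simpler models reported in the abstract.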