2014, Neural Networks
…
Kernel learning methods, whether Bayesian or frequentist, typically involve multiple levels of inference, with the coefficients of the kernel expansion being determined at the first level and the kernel and regularisation parameters carefully tuned at the second level, a process known as model selection. Model selection for kernel machines is commonly performed via optimisation of a suitable model selection criterion, often based on cross-validation or theoretical performance bounds. However, if there are a large number of kernel parameters, as for instance in the case of automatic relevance determination (ARD), there is a substantial risk of over-fitting the model selection criterion, resulting in poor generalisation performance. In this paper we investigate the possibility of learning the kernel, for the Least-Squares Support Vector Machine (LS-SVM) classifier, at the first level of inference, i.e. parameter optimisation. The kernel parameters and the coefficients of the kernel expansion are jointly optimised at the first level of inference, minimising a training criterion with an additional regularisation term acting on the kernel parameters. The key advantage of this approach is that the values of only two regularisation parameters need be determined in model selection, substantially alleviating the problem of over-fitting the model selection criterion. The benefits of this approach are demonstrated using a suite of synthetic and real-world binary classification benchmark problems, where kernel learning at the first level of inference is shown to be statistically superior to the conventional approach, improves on our previous work (Cawley and Talbot, 2007) and is competitive with Multiple Kernel Learning approaches, but with reduced computational expense.
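The core idea, jointly minimising a penalised training criterion over both the expansion coefficients and ARD-style kernel parameters at the first level of inference, can be sketched as follows. This is a minimal illustration under assumed choices (a squared-error criterion, a log-width parametrisation, regularisation parameters mu and nu, and a generic quasi-Newton solver); it is not the authors' implementation.

```python
# Minimal sketch of first-level kernel learning (assumptions noted above).
import numpy as np
from scipy.optimize import minimize

def ard_rbf(X1, X2, log_widths):
    # Anisotropic RBF kernel with one length-scale per input dimension (ARD).
    scales = np.exp(log_widths)                      # positive widths via log parametrisation
    d = (X1[:, None, :] - X2[None, :, :]) / scales
    return np.exp(-0.5 * np.sum(d ** 2, axis=2))

def objective(params, X, y, mu, nu):
    n, p = X.shape
    alpha, b, log_widths = params[:n], params[n], params[n + 1:]
    K = ard_rbf(X, X, log_widths)
    f = K @ alpha + b
    loss = np.sum((y - f) ** 2)                      # least-squares training criterion
    reg_model = mu * alpha @ K @ alpha               # usual RKHS penalty on the classifier
    reg_kernel = nu * np.sum(log_widths ** 2)        # additional penalty on the kernel parameters
    return loss + reg_model + reg_kernel

def fit(X, y, mu=1e-2, nu=1e-1):
    # Only two regularisation parameters (mu, nu) remain for model selection.
    n, p = X.shape
    x0 = np.zeros(n + 1 + p)                         # alpha, b, log widths
    res = minimize(objective, x0, args=(X, y, mu, nu), method="L-BFGS-B")
    return res.x[:n], res.x[n], res.x[n + 1:]

# Toy usage: two informative features, two pure-noise features.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = np.sign(X[:, 0] + X[:, 1])
alpha, b, log_widths = fit(X, y)
print(np.exp(log_widths))                            # larger learned widths indicate lower relevance
```

With only mu and nu left to tune, the outer model selection search is over a two-dimensional space regardless of the number of kernel parameters.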
The problem of variable selection in binary kernel classification is addressed in this thesis. Kernel methods are fairly recent additions to the statistical toolbox, having originated approximately two decades ago in machine learning and artificial intelligence. These methods are growing in popularity and are already frequently applied in regression and classification problems. A special thank you also to my dad, Klopper Oosthuizen, for many investments in me, and for his love and support, and to my family and friends.
CONTENTS
CHAPTER 1: INTRODUCTION
1.1 NOTATION
1.2 OVERVIEW OF THE THESIS
CHAPTER 2: VARIABLE SELECTION FOR KERNEL METHODS
2.1 INTRODUCTION
2.2 AN OVERVIEW OF KERNEL METHODS
2.2.1 BASIC CONCEPTS
2.2.2 KERNEL FUNCTIONS AND THE KERNEL TRICK
2.2.3 CONSTRUCTING A KERNEL CLASSIFIER
2.2.4 A REGULARISATION PERSPECTIVE
2.3 VARIABLE SELECTION IN BINARY CLASSIFICATION: IMPORTANT ASPECTS
2.3.1 THE RELEVANCE OF VARIABLES
2.3.2 SELECTION STRATEGIES AND CRITERIA
2.4 VARIABLE SELECTION FOR KERNEL METHODS
2.4.1 THE NEED FOR VARIABLE SELECTION
2.4.2 COMPLICATING FACTORS AND POSSIBLE APPROACHES
2.5 SUMMARY
CHAPTER 3: KERNEL VARIABLE SELECTION IN INPUT SPACE
3.4 MONTE CARLO SIMULATION STUDY
3.4.1 EXPERIMENTAL DESIGN
3.4.2 STEPS IN EACH SIMULATION REPETITION
3.4.3 GENERATING THE TRAINING AND TEST DATA
3.4.4 HYPERPARAMETER SPECIFICATION
3.4.5 THE VARIABLE SELECTION PROCEDURES
3.4.6 RESULTS AND CONCLUSIONS
3.5 SUMMARY
CHAPTER 4: ALGORITHM-INDEPENDENT AND ALGORITHM-DEPENDENT SELECTION IN FEATURE SPACE
4.1 INTRODUCTION
4.2 SUPPORT VECTOR MACHINES
4.2.1 THE TRAINING DATA ARE LINEARLY SEPARABLE IN INPUT SPACE
4.2.2 THE TRAINING DATA ARE LINEARLY SEPARABLE IN FEATURE SPACE
4.2.3 HANDLING NOISY DATA
4.3 KERNEL FISHER DISCRIMINANT ANALYSIS
4.3.1 LINEAR DISCRIMINANT ANALYSIS
4.3.2 THE KERNEL FISHER DISCRIMINANT FUNCTION
IRJET, 2022
The support vector machine (SVM) is capable of outperforming most other learning algorithms in terms of accuracy and other performance metrics, thanks to the high-dimensional projection of the data it uses for classification. Nevertheless, the performance of the SVM is greatly affected by the choice of the kernel function that performs this projection. This paper discusses the working of the SVM and its dependence on the kernel function, along with an explanation of the main types of kernels. The focus is on choosing the optimal kernel for three datasets that differ in the number of features and classes, in order to determine the best kernel choice for each of the three datasets. For performance measures, we used metrics such as accuracy, kappa, specificity and sensitivity. This study statistically examines and compares each type of kernel against the mentioned metrics.
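As a rough illustration of this kind of comparison, the sketch below evaluates the standard SVM kernels on a single dataset using cross-validated accuracy and Cohen's kappa (specificity and sensitivity could be read off the confusion matrix in the same way). The dataset and the fixed value of C are placeholders, not the paper's experimental setup.

```python
# Illustrative kernel comparison with scikit-learn (placeholder dataset and C).
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0))
    pred = cross_val_predict(clf, X, y, cv=5)        # 5-fold cross-validated predictions
    print(f"{kernel:8s} accuracy={accuracy_score(y, pred):.3f} "
          f"kappa={cohen_kappa_score(y, pred):.3f}")
```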
Journal of Machine Learning Research, 2010
The problem of automatic feature selection/weighting in kernel methods is examined. We work on a formulation that optimizes both the weights of features and the parameters of the kernel model simultaneously, using L1 regularization for feature selection. Under quite general choices of kernels, we prove that there exists a unique regularization path for this problem, that runs from 0 to a stationary point of the non-regularized problem. We propose an ODE-based homotopy method to follow this trajectory. By following the path, our algorithm is able to automatically discard irrelevant features and to automatically go back and forth to avoid local optima. Experiments on synthetic and real datasets show that the method achieves low prediction error and is efficient in separating relevant from irrelevant features.
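The underlying objective, non-negative feature weights inside the kernel penalised by their L1 norm and optimised jointly with the model coefficients, can be sketched as below. The simple bound-constrained solver stands in for the ODE-based homotopy path-following of the paper, and all names and values are illustrative.

```python
# Joint feature weighting and fitting with an L1 penalty on the weights (sketch only).
import numpy as np
from scipy.optimize import minimize

def weighted_rbf(X1, X2, beta):
    # k(x, x') = exp(-sum_j beta_j (x_j - x'_j)^2), with feature weights beta_j >= 0.
    d2 = (X1[:, None, :] - X2[None, :, :]) ** 2
    return np.exp(-d2 @ beta)

def objective(params, X, y, lam, mu):
    n, p = X.shape
    alpha, beta = params[:n], params[n:]
    K = weighted_rbf(X, X, beta)
    resid = y - K @ alpha
    # For beta >= 0 the L1 norm is simply the sum of the weights.
    return resid @ resid + mu * alpha @ K @ alpha + lam * np.sum(beta)

def fit(X, y, lam=1.0, mu=1e-2):
    n, p = X.shape
    x0 = np.concatenate([np.zeros(n), np.full(p, 0.5)])
    bounds = [(None, None)] * n + [(0.0, None)] * p  # keep the feature weights non-negative
    res = minimize(objective, x0, args=(X, y, lam, mu), method="L-BFGS-B", bounds=bounds)
    return res.x[:n], res.x[n:]

# Toy usage: only the first two of five features are relevant.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
y = np.sign(X[:, 0] - X[:, 1])
alpha, beta = fit(X, y)
print(np.round(beta, 3))                             # irrelevant features are driven towards zero weight
```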
2010
This paper presents a novel feature selection approach (KP-SVR) that determines a non-linear regression function with minimal error while simultaneously minimizing the number of features by penalizing their use in the dual formulation of SVR. The approach optimizes the widths of an anisotropic RBF kernel using an iterative algorithm based on gradient descent, eliminating features that have low relevance for the regression model. Our approach provides an explicit stopping criterion, indicating clearly when eliminating further features begins to negatively affect the model's performance. Experiments on two real-world benchmark problems demonstrate that our approach achieves the best performance compared with well-known feature selection methods, while consistently using a small number of features.
Machine Learning, 2006
This paper presents a convex optimization perspective on the task of tuning the regularization trade-off with validation and cross-validation criteria in the context of kernel machines. We focus on the problem of tuning the regularization trade-off in the context of Least Squares Support Vector Machines (LS-SVMs) for function approximation and classification. By adopting an additive regularization trade-off scheme, the task of tuning the regularization trade-off with respect to a validation or cross-validation criterion can be written as a convex optimization problem. The solution of this problem then contains both the optimal regularization constants with respect to the model selection criterion at hand and the corresponding training solution. We refer to such formulations as the fusion of training with model selection. The major tool used to accomplish this task is the primal-dual derivations occurring in convex optimization theory. The paper advances the discussion by relating the additive regularization trade-off scheme to the classical Tikhonov scheme. Motivations are given for the usefulness of the former scheme. Furthermore, it is illustrated how to restrict the additive trade-off scheme to the solution path corresponding to a Tikhonov scheme while retaining convexity of the overall problem of fusing model selection and training. We relate such a scheme to an ensemble learning problem and to the stability of learning machines. The approach is illustrated on a number of artificial and benchmark datasets, relating the proposed method to the classical practice of tuning the Tikhonov scheme with a cross-validation measure.
Journal of Machine Learning Research, 2007
While the model parameters of a kernel machine are typically given by the solution of a convex optimisation problem, with a single global optimum, the selection of good values for the regularisation and kernel parameters is much less straightforward. Fortunately, the leave-one-out cross-validation procedure can be performed, or at least approximated, very efficiently in closed form for a wide variety of kernel learning methods, providing a convenient means for model selection. Leave-one-out cross-validation based estimates of performance, however, generally exhibit a relatively high variance and are therefore prone to over-fitting. In this paper, we investigate the novel use of Bayesian regularisation at the second level of inference, adding a regularisation term to the model selection criterion corresponding to a prior over the hyper-parameter values, where the additional regularisation parameters are integrated out analytically. Results obtained on a suite of thirteen real-world and synthetic benchmark data sets clearly demonstrate the benefit of this approach.
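For kernel ridge regression (closely related to the LS-SVM), the leave-one-out residuals have the well-known closed form r_i = (y_i - f_i) / (1 - H_ii), where H = K(K + lambda I)^{-1} is the hat matrix. The sketch below uses this to compute a PRESS-style criterion for a few candidate regularisation parameters; it illustrates only the efficient LOO estimate such methods build on, not the Bayesian regularisation of the selection criterion proposed in the paper.

```python
# Closed-form leave-one-out residuals for kernel ridge regression (PRESS criterion).
import numpy as np

def loo_residuals(K, y, lam):
    # Hat matrix H = K (K + lam*I)^(-1); exact LOO residual_i = (y_i - f_i) / (1 - H_ii).
    n = K.shape[0]
    H = K @ np.linalg.solve(K + lam * np.eye(n), np.eye(n))
    f = H @ y
    return (y - f) / (1.0 - np.diag(H))

# Toy usage: choose lambda by minimising the PRESS statistic.
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)
K = np.exp(-0.5 * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2))
for lam in (1e-3, 1e-1, 1e1):
    press = np.sum(loo_residuals(K, y, lam) ** 2)
    print(f"lambda={lam:g}  PRESS={press:.3f}")
```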
Pattern Recognition Letters, 2013
In several supervised learning applications, reconstruction methods have to be applied repeatedly before the final solution can be reached. In these situations, the availability of learning algorithms able to provide effective predictors in a very short time may lead to remarkable reductions in the overall computational requirements. In this paper we consider the kernel ridge regression problem and we look for solutions given by a linear combination of kernel functions plus a constant term. In particular, we show that the unknown coefficients of the linear combination and the constant term can be obtained very quickly by applying specific regularization algorithms directly to the linear system arising from the Empirical Risk Minimization problem. From the numerical experiments carried out on benchmark datasets, we observed that in some cases results that previously required hours of computation can be obtained in a few seconds, showing that these strategies are very well suited to time-consuming applications.
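A minimal sketch of the idea of applying an iterative regularisation method directly to the kernel linear system: here an early-stopped conjugate gradient solve, with the constant term handled crudely by centring the targets. This is just one simple choice of algorithm under assumed settings, not the specific strategies evaluated in the paper.

```python
# Early-stopped conjugate gradient on the kernel system as a fast, implicitly
# regularised solver (sketch; constant term handled by centring the targets).
import numpy as np
from scipy.sparse.linalg import cg

def rbf_kernel(X1, X2, gamma=0.5):
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=2)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sinc(X[:, 0]) + 0.05 * rng.normal(size=200)

K = rbf_kernel(X, X)
b = y.mean()                                              # constant term via centring
alpha, _ = cg(K + 1e-8 * np.eye(200), y - b, maxiter=30)  # early stopping acts as regularisation
pred = K @ alpha + b
print(f"training RMSE after 30 CG iterations: {np.sqrt(np.mean((pred - y) ** 2)):.4f}")
```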
International Joint Conference on Neural Networks (IJCNN 2007), 2007
The generalised linear model (GLM) is the standard approach in classical statistics for regression tasks where it is appropriate to measure the data misfit using a likelihood drawn from the exponential family of distributions. In this paper, we apply the kernel trick to give a non-linear variant of the GLM, the generalised kernel machine (GKM), in which a regularised GLM is constructed in a fixed feature space implicitly defined by a Mercer kernel. The MATLAB symbolic maths toolbox is used to automatically create a suite of generalised kernel machines, including methods for automated model selection based on approximate leave-one-out cross-validation. In doing so, we provide a common framework encompassing a wide range of existing and novel kernel learning methods, and highlight their connections with earlier techniques from classical statistics. Examples including kernel ridge regression, kernel logistic regression and kernel Poisson regression are given to demonstrate the flexibility and utility of the generalised kernel machine.
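As one concrete instance of a generalised kernel machine, the sketch below fits a kernel logistic regression by iteratively reweighted least squares (IRLS). The bias term is omitted for brevity and the code is a generic illustration, not the MATLAB toolbox described in the paper.

```python
# Kernel logistic regression via IRLS (a GLM with a Mercer kernel; bias omitted).
import numpy as np

def kernel_logistic_irls(K, y, lam=1e-1, iters=30):
    # Model: p_i = sigmoid((K alpha)_i), penalty (lam/2) alpha' K alpha.
    # Each IRLS/Newton step solves (W K + lam I) alpha = W z.
    n = K.shape[0]
    alpha = np.zeros(n)
    for _ in range(iters):
        eta = K @ alpha
        p = 1.0 / (1.0 + np.exp(-np.clip(eta, -30, 30)))
        W = p * (1.0 - p) + 1e-12                     # IRLS weights
        z = eta + (y - p) / W                         # working response
        alpha = np.linalg.solve(W[:, None] * K + lam * np.eye(n), W * z)
    return alpha

# Toy usage with an RBF kernel and labels in {0, 1}.
rng = np.random.default_rng(4)
X = rng.normal(size=(80, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
K = np.exp(-0.5 * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2))
alpha = kernel_logistic_irls(K, y, lam=1.0)
print(f"training accuracy: {np.mean(((K @ alpha) > 0) == (y > 0.5)):.2f}")
```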
Neurocomputing, 2003
We address the problem of model selection for Support Vector Machine (SVM) classification. For a fixed functional form of the kernel, model selection amounts to tuning the kernel parameters and the slack penalty coefficient C. We begin by reviewing a recently developed probabilistic framework for SVM classification. An extension to the case of SVMs with quadratic slack penalties is given and a simple
2009
A Bayesian learning algorithm is presented that is based on a sparse Bayesian linear model (the Relevance Vector Machine (RVM)) and learns the parameters of the kernels during model training. The novel characteristic of the method is that it enables the introduction of parameters called 'scaling factors' that measure the significance of each feature. Using the Bayesian framework, a sparsity-promoting prior is then imposed on the scaling factors in order to eliminate irrelevant features.