Learning with Lq<1 vs L1-norm regularisation with exponentially many irrelevant features
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2008
We study the use of fractional norms for regularisation in supervised learning from high-dimensional data, in conditions of a large number of irrelevant features, focusing on logistic regression. We develop a variational method for parameter estimation, and show an equivalence between two approximations recently proposed in the statistics literature. Building on previous work by A. Ng, we show that fractional-norm regularised logistic regression enjoys a sample complexity that grows logarithmically with the data dimension and polynomially with the number of relevant dimensions. In addition, extensive empirical testing indicates that fractional-norm regularisation is more suitable than L1 in cases when the number of relevant features is very small, and works very well despite a large number of irrelevant features.

1 Lq<1-Regularised Logistic Regression

Consider a training set of pairs $z = \{(x_j, y_j)\}_{j=1}^{n}$ drawn i.i.d. from some unknown distribution $P$, where $x_j \in \mathbb{R}^m$ are $m$-dimensional input points and $y_j \in \{-1, 1\}$ are the associated target labels for these points. Given $z$, the aim in supervised learning is to learn a mapping from inputs to targets that is then able to predict the target values for previously unseen points that follow the same distribution as the training data. We are interested in problems with a large number $m$ of input features, of which only a few, $r \ll m$, are relevant to the target. In particular, we focus on a form of regularised logistic regression for this purpose:

$$\max_{w} \; \sum_{j=1}^{n} \log p(y_j \mid x_j, w) \qquad (1)$$
$$\text{subject to } \|w\|_q \le A \qquad (2)$$

or, in the Lagrangian formulation:

$$\max_{w} \; \sum_{j=1}^{n} \log p(y_j \mid x_j, w) \;-\; \alpha \|w\|_q^q \qquad (3)$$
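As a concrete illustration of the penalised objective (3), here is a minimal NumPy sketch that optimises the negative log-likelihood plus a smoothed Lq penalty by plain gradient descent. This is not the variational method developed in the paper; the smoothing constant `eps`, the step size, the iteration count and the toy data are illustrative assumptions.

```python
import numpy as np

def lq_logistic_regression(X, y, q=0.5, alpha=0.1, lr=0.01, iters=2000, eps=1e-8):
    """X: (n, m) inputs, y: (n,) labels in {-1, +1}.
    Minimises sum_j log(1 + exp(-y_j w^T x_j)) + alpha * ||w||_q^q,
    with |w_i|^q smoothed as (w_i^2 + eps)^(q/2) to allow gradient descent."""
    n, m = X.shape
    w = np.zeros(m)
    for _ in range(iters):
        margins = y * (X @ w)
        # gradient of the negative log-likelihood
        grad_nll = -(X.T @ (y / (1.0 + np.exp(margins))))
        # gradient of the smoothed penalty alpha * sum_i (w_i^2 + eps)^(q/2)
        grad_pen = alpha * q * w * (w ** 2 + eps) ** (q / 2 - 1)
        w -= lr * (grad_nll + grad_pen)
    return w

# toy usage: 200 points, 50 features, only 3 of them relevant
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
w_true = np.zeros(50); w_true[:3] = [2.0, -1.5, 1.0]
y = np.sign(X @ w_true + 0.1 * rng.standard_normal(200))
w_hat = lq_logistic_regression(X, y)
```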
With the advent of high-throughput technologies, ℓ1-regularized learning algorithms have attracted much attention recently. Dozens of algorithms have been proposed for fast implementation, using various advanced optimization techniques. In this paper, we demonstrate that ℓ1-regularized learning problems can be easily solved by using gradient-descent techniques. The basic idea is to transform a convex optimization problem with a non-differentiable objective function into an unconstrained non-convex problem, upon which, via gradient descent, reaching a globally optimal solution is guaranteed. We present a detailed implementation of the algorithm using ℓ1-regularized logistic regression as a particular application. We conduct large-scale experiments to compare the new approach with other state-of-the-art algorithms on eight medium and large-scale problems. We demonstrate that our algorithm, though simple, performs similarly or even better than other advanced algorithms in terms of…
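One well-known transformation of this kind is the squared-variable reparameterisation; whether it is the exact construction used in the paper is an assumption, so the sketch below is only an illustration of the general idea (smooth objective, plain gradient descent). Step size, iteration count and initialisation are arbitrary choices.

```python
import numpy as np

def l1_logreg_smooth(X, y, lam=0.1, lr=0.05, iters=3000, seed=0):
    """Gradient descent on a smooth reparameterisation of l1-regularised
    logistic regression: w = u*u - v*v, penalty lam * sum(u*u + v*v).
    At a minimiser one of u_i, v_i is zero, so the penalty equals lam*||w||_1."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    u = 0.01 * rng.standard_normal(d)   # small nonzero start (u = v = 0 is stationary)
    v = 0.01 * rng.standard_normal(d)
    for _ in range(iters):
        w = u * u - v * v
        p = 1.0 / (1.0 + np.exp(y * (X @ w)))   # sigma(-y * Xw)
        g = -(X.T @ (y * p)) / n                # gradient of the mean logistic loss in w
        u -= lr * (2 * u * (g + lam))           # chain rule through w = u^2 - v^2
        v -= lr * (2 * v * (-g + lam))
    return u * u - v * v
```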
2009
There is a growing body of learning problems for which it is natural to organize the parameters into a matrix, so as to appropriately regularize the parameters under some matrix norm (in order to impose more sophisticated prior knowledge). This work describes and analyzes a systematic method for constructing such matrix-based regularization methods. In particular, we focus on how the underlying statistical properties of a given problem can help us decide which regularization function is appropriate. Our methodology is based on a known duality fact: a function is strongly convex with respect to some norm if and only if its conjugate function is strongly smooth with respect to the dual norm. This result has already been found to be a key component in deriving and analyzing several learning algorithms. We demonstrate the potential of this framework by deriving novel generalization and regret bounds for multi-task learning, multi-class learning, and kernel learning.
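For reference, the duality fact alluded to can be stated as follows (standard formulation; the notation here is ours, not the paper's):

```latex
% f is beta-strongly convex w.r.t. a norm  <=>  its conjugate f* is (1/beta)-strongly
% smooth w.r.t. the dual norm, where f*(theta) = sup_w <theta, w> - f(w).
\[
  f \text{ is } \beta\text{-strongly convex w.r.t. } \|\cdot\|
  \iff
  f^{*} \text{ is } \tfrac{1}{\beta}\text{-strongly smooth w.r.t. } \|\cdot\|_{*},
\]
\[
  \text{where } f\big(\alpha u + (1-\alpha)w\big) \le \alpha f(u) + (1-\alpha) f(w)
  - \tfrac{\beta}{2}\,\alpha(1-\alpha)\,\|u - w\|^{2},
\]
\[
  \text{and } f^{*}(u) \le f^{*}(w) + \langle \nabla f^{*}(w),\, u - w\rangle
  + \tfrac{1}{2\beta}\,\|u - w\|_{*}^{2}.
\]
```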
2006
We consider learning algorithms induced by regularization methods in the regression setting. We show that previously obtained error bounds for these algorithms, derived using a-priori choices of the regularization parameter, can be attained using a suitable a-posteriori choice based on validation. In particular, these results prove adaptation of the rate of convergence of the estimators to the minimax rate induced by the "effective dimension" of the problem. We also show universal consistency for this class of methods.
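A minimal sketch of such an a-posteriori (hold-out) choice of the regularization parameter, using plain ridge regression as the regularisation method; the estimator, grid and split are illustrative assumptions, not the setting analysed in the paper.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression: minimise (1/n)||Xw - y||^2 + lam * ||w||^2."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)

def holdout_choice(X, y, lambdas, val_frac=0.3, seed=0):
    """Pick the regularization parameter minimising hold-out squared error."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_val = int(val_frac * len(y))
    val, tr = idx[:n_val], idx[n_val:]
    best = None
    for lam in lambdas:
        w = ridge_fit(X[tr], y[tr], lam)
        err = np.mean((X[val] @ w - y[val]) ** 2)
        if best is None or err < best[0]:
            best = (err, lam, w)
    return best[1], best[2]   # chosen lambda and the corresponding estimator
```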
The last equality then leads to (I). Next, we discuss the implementation of evaluating $L_j(0)$, as it is the main operation at each inner iteration. As mentioned in Section 6.2, GLMNET explicitly normalizes $x_i$ by (6.2). Here we use $L_j(0; \bar{X})$ to denote $L_j(0)$ on the scaled data. If we define
$$\bar{y}_i = \begin{cases} 1 & \text{if } y_i = 1, \\ 0 & \text{if } y_i = -1, \end{cases}$$
then $L_j(0; \bar{X})$ can be computed by
We consider the problem of supervised learning with convex loss functions and propose a new form of iterative regularization based on the subgradient method. Unlike other regularization approaches, in iterative regularization no constraint or penalization is considered, and generalization is achieved by (early) stopping an empirical iteration. We consider a nonparametric setting, in the framework of reproducing kernel Hilbert spaces, and prove finite sample bounds on the excess risk under general regularity conditions. Our study provides a new class of efficient regularized learning algorithms and gives insights on the interplay between statistics and optimization in machine learning.
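A minimal sketch of the idea: iterate an unpenalised empirical (sub)gradient method and regularise only by stopping early. The hinge loss, plain NumPy vectors (rather than the paper's reproducing kernel Hilbert space setting), and the validation-based stopping rule are illustrative assumptions.

```python
import numpy as np

def early_stopped_subgradient(X, y, Xval, yval, lr=0.01, max_iter=500):
    """Subgradient descent on the (unregularised) mean hinge loss; the number
    of iterations plays the role of the regularisation parameter."""
    w = np.zeros(X.shape[1])
    best_w, best_err = w.copy(), np.inf
    for _ in range(max_iter):
        margins = y * (X @ w)
        g = -(X.T @ (y * (margins < 1))) / len(y)   # subgradient of the mean hinge loss
        w -= lr * g
        err = np.mean(yval * (Xval @ w) < 0)        # validation 0-1 error
        if err < best_err:                          # keep the best early-stopped iterate
            best_err, best_w = err, w.copy()
    return best_w
```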
Electronics, 2022
Over the last decade, learning theory has made significant progress in the development of sophisticated algorithms and their theoretical foundations. The theory builds on concepts that exploit ideas and methodologies from mathematical areas such as optimization theory. Regularization is probably the key to addressing the challenging problem of overfitting, which usually occurs in high-dimensional learning. Its primary goal is to make the machine learning algorithm "learn", rather than "memorize", by penalizing the algorithm so as to reduce its generalization error and avoid the risk of overfitting. As a result, the variance of the model is significantly reduced, without a substantial increase in its bias and without losing any important properties of the data.
2009
In recent years the ℓ1,∞ norm has been proposed for joint regularization. In essence, this type of regularization aims at extending the ℓ1 framework for learning sparse models to a setting where the goal is to learn a set of jointly sparse models. In this paper we derive a simple and effective projected gradient method for optimization of ℓ1,∞ regularized problems. The main challenge in developing such a method resides in being able to compute efficient projections onto the ℓ1,∞ ball. We present an algorithm that works in O(n log n) time and O(n) memory, where n is the number of parameters. We test our algorithm on a multi-task image annotation problem. Our results show that ℓ1,∞ leads to better performance than both ℓ2 and ℓ1 regularization and that it is effective in discovering jointly sparse solutions.
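To make the projection step concrete, here is a simple nested-bisection routine that computes the Euclidean projection onto the ℓ1,∞ ball (rows indexing tasks, columns indexing features). It is not the O(n log n) algorithm of the paper, only an illustration of the projection problem that algorithm solves; function names, tolerances and iteration counts are assumptions.

```python
import numpy as np

def project_l1_inf(A, C):
    """Project matrix A onto {W : sum_j max_i |W_ij| <= C} (the l1,inf ball)."""
    absA = np.abs(A)
    if absA.max(axis=0).sum() <= C:      # already feasible
        return A.copy()

    def mu_for_theta(col, theta):
        # per-column radius mu solving sum_i max(col_i - mu, 0) = theta (by bisection)
        if col.sum() <= theta:
            return 0.0
        lo, hi = 0.0, col.max()
        for _ in range(60):
            mid = 0.5 * (lo + hi)
            if np.maximum(col - mid, 0.0).sum() > theta:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    # outer bisection on the multiplier theta so that sum_j mu_j(theta) = C
    lo, hi = 0.0, absA.sum(axis=0).max()
    for _ in range(60):
        theta = 0.5 * (lo + hi)
        mu = np.array([mu_for_theta(absA[:, j], theta) for j in range(A.shape[1])])
        if mu.sum() > C:
            lo = theta
        else:
            hi = theta
    mu = np.array([mu_for_theta(absA[:, j], 0.5 * (lo + hi)) for j in range(A.shape[1])])
    # clip each column to its radius mu_j, keeping signs
    return np.sign(A) * np.minimum(absA, mu)
```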
Computational Optimization and Applications, 2016
We study the performance of first- and second-order optimization methods for ℓ1-regularized sparse least-squares problems as the conditioning of the problem changes and the dimensions of the problem increase up to one trillion. A rigorously defined generator is presented which allows control of the dimensions, the conditioning and the sparsity of the problem. The generator has very low memory requirements and scales well with the dimensions of the problem. Keywords: ℓ1-regularised least-squares • First-order methods • Second-order methods • Sparse least-squares instance generator • Ill-conditioned problems
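A toy version of such a generator is sketched below to illustrate the three knobs (dimensions, sparsity, conditioning). It builds a dense matrix with a prescribed geometric singular-value decay, so it is only an illustration and not the rigorously defined, low-memory generator of the paper; names and default values are assumptions.

```python
import numpy as np

def toy_lasso_instance(m=200, n=500, k=20, cond=1e4, noise=1e-3, seed=0):
    """Toy sparse least-squares instance (A, b, x_true) with approximate control
    of cond(A) via a geometric singular-value decay from 1 down to 1/cond."""
    rng = np.random.default_rng(seed)
    r = min(m, n)
    U, _ = np.linalg.qr(rng.standard_normal((m, r)))
    V, _ = np.linalg.qr(rng.standard_normal((n, r)))
    s = cond ** (-np.arange(r) / (r - 1))       # prescribed singular values
    A = (U * s) @ V.T
    x_true = np.zeros(n)
    idx = rng.choice(n, size=k, replace=False)  # k-sparse ground truth
    x_true[idx] = rng.standard_normal(k)
    b = A @ x_true + noise * rng.standard_normal(m)
    return A, b, x_true
```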
Journal of Machine Learning Research, 2010
ℓ1-regularized logistic regression, also known as sparse logistic regression, is widely used in machine learning, computer vision, data mining, bioinformatics and neural signal processing. The use of ℓ1 regularization gives the classifier attractive properties, such as feature selection, robustness to noise, and, as a result, classifier generality in the context of supervised learning. When a sparse logistic regression problem has large-scale data in high dimensions, it is computationally expensive to minimize the non-differentiable ℓ1-norm in the objective function. Motivated by recent work, we propose a novel hybrid algorithm based on combining two types of optimization iterations: one being very fast and memory friendly while the other is slower but more accurate. Called hybrid iterative shrinkage (HIS), the resulting algorithm is comprised of a fixed point continuation phase and an interior point phase. The first phase is based completely on memory efficient operations such as matrix-vector multiplications, while the second phase is based on a truncated Newton method. Furthermore, we show that various optimization techniques, including line search and continuation, can significantly accelerate convergence. The algorithm has global convergence at a geometric rate (a Q-linear rate in optimization terminology). We present a numerical comparison with several existing algorithms, including an analysis using benchmark data from the UCI machine learning repository, and show our algorithm is the most computationally efficient without loss of accuracy.
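The matrix-vector-product-plus-shrinkage structure of the first phase can be pictured with a generic proximal-gradient (iterative soft-thresholding) sketch for ℓ1-regularised logistic regression. This is not the HIS algorithm itself (no continuation, line search or interior-point phase); the step-size rule and iteration count are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_logreg(X, y, lam=0.1, step=None, iters=1000):
    """Iterative shrinkage for min_w (1/n) sum_j log(1+exp(-y_j x_j^T w)) + lam*||w||_1.
    Each iteration uses only matrix-vector products and one soft-threshold."""
    n, d = X.shape
    if step is None:
        # 1/L with L >= ||X||_2^2 / (4n), an upper bound on the gradient's Lipschitz constant
        step = 4.0 * n / (np.linalg.norm(X, 2) ** 2)
    w = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(y * (X @ w)))   # sigma(-y * Xw)
        grad = -(X.T @ (y * p)) / n             # gradient of the mean logistic loss
        w = soft_threshold(w - step * grad, step * lam)
    return w
```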