Efficient L1 Regularized Logistic Regression

Su-In Lee, Honglak Lee, Pieter Abbeel and Andrew Y. Ng


Computer Science Department
Stanford University
Stanford, CA 94305

Abstract

L1 regularized logistic regression is now a workhorse of machine learning: it is widely used for many classification problems, particularly ones with many features. L1 regularized logistic regression requires solving a convex optimization problem. However, standard algorithms for solving convex optimization problems do not scale well enough to handle the large datasets encountered in many practical settings. In this paper, we propose an efficient algorithm for L1 regularized logistic regression. Our algorithm iteratively approximates the objective function by a quadratic approximation at the current point, while maintaining the L1 constraint. In each iteration, it uses the efficient LARS (Least Angle Regression) algorithm to solve the resulting L1 constrained quadratic optimization problem. Our theoretical results show that our algorithm is guaranteed to converge to the global optimum. Our experiments show that our algorithm significantly outperforms standard algorithms for solving convex optimization problems. Moreover, our algorithm outperforms four previously published algorithms that were specifically designed to solve the L1 regularized logistic regression problem.

Introduction

Logistic regression is widely used in machine learning for classification problems. It is well known that regularization is required to avoid over-fitting, especially when there is only a small number of training examples, or when there is a large number of parameters to be learned. In particular, L1 regularized logistic regression is often used for feature selection, and has been shown to have good generalization performance in the presence of many irrelevant features. (Ng 2004; Goodman 2004)

Unregularized logistic regression is an unconstrained convex optimization problem with a continuously differentiable objective function. As a consequence, it can be solved fairly efficiently with standard convex optimization methods, such as Newton's method or conjugate gradient. However, adding the L1 regularization makes the optimization problem computationally more expensive to solve. If the L1 regularization is enforced by an L1 norm constraint on the parameter vector, then the optimization problem becomes a constrained optimization problem with the number of constraints equal to twice the number of parameters to be learned. The running time of standard algorithms for solving convex optimization problems increases with the number of constraints. If, alternatively, the L1 regularization is implemented by adding (a constant times) the L1 norm of the parameter vector to the objective, then the objective is no longer continuously differentiable. This precludes the use of, for example, Newton's method to solve the optimization problem efficiently. Thus, both formulations of L1 regularized logistic regression lead to a more complicated optimization problem than unregularized logistic regression. However, the L1 constraint (or penalty term) has significant structure. For example, if one knew a priori the signs of each of the parameters, then the L1 constraint would become a single linear constraint. Or, in the case of regularization by adding the L1 term to the objective, the L1 term would become linear, resulting in a continuously differentiable objective. These (and similar) observations have led to various algorithms that try to exploit the structure in the optimization problem and solve it more efficiently.

Goodman (2004) proposed an algorithm for learning conditional maximum entropy models (of which logistic regression is a special case) with an exponential prior (also referred to as a one-sided Laplacian prior). His algorithm is based on generalized iterative scaling (GIS). The idea is to iteratively (efficiently) maximize a differentiable lower bound of the objective function. The proposed learning algorithm can easily be extended to the case of logistic regression with a Laplacian prior by duplicating all the features with the opposite sign. Logistic regression with a Laplacian prior is equivalent to L1 regularized logistic regression.

Perkins et al. (2003) proposed a method called grafting. The key idea in grafting is to incrementally build a subset of the parameters that are allowed to differ from zero. Grafting uses a local derivative test in each iteration of the conjugate gradient method to choose an additional feature that is allowed to differ from zero. This way, the conjugate gradient method can be applied without explicitly dealing with the discontinuous first derivatives.

Roth (2004) proposed an algorithm called generalized LASSO that extends a LASSO algorithm proposed by Osborne et al. (2000). (The LASSO refers to an L1 regularized least squares problem; see Tibshirani (1996) for details.)
In this paper, we propose a new, efficient algorithm for solving L1 regularized logistic regression. Our algorithm is based on the iteratively reweighted least squares (IRLS) formulation of logistic regression. More specifically, in each iteration, our algorithm finds a step direction by optimizing the quadratic approximation of the objective function at the current point subject to the L1 norm constraint. The IRLS formulation of logistic regression allows us to (iteratively) reformulate the quadratic approximation as a least squares objective. Thus our algorithm ends up solving an L1 constrained least squares problem in every iteration. The L1 constrained least squares problem can be solved very efficiently using least angle regression (LARS), proposed by Efron et al. (2004).

Our theoretical results show that our algorithm is guaranteed to converge to a global optimum. (In fact, our theoretical results could be generalized to a larger family of optimization problems, as discussed later.)

Our experiments show that our algorithm significantly outperforms the previously proposed algorithms for L1 regularized logistic regression. Our algorithm also significantly outperforms standard gradient-based algorithms, such as conjugate gradient and Newton's method.[1]

[1] The standard gradient-based algorithms are not directly applicable, because the objective function of L1 regularized logistic regression has discontinuous first derivatives. In our experiments, we used a smooth approximation of the L1 loss function.

We note that Lokhorst (1999) also proposed an algorithm that uses the IRLS formulation of logistic regression. However, he used a different LASSO algorithm (Osborne, Presnell, & Turlach 2000) in the inner loop and did not provide any formal convergence guarantees.

L1 Regularized Logistic Regression

We consider a supervised learning task where we are given M training instances {(x^(i), y^(i)), i = 1, ..., M}. Here, each x^(i) ∈ R^N is an N-dimensional feature vector, and y^(i) ∈ {0, 1} is a class label. Logistic regression models the probability distribution of the class label y given a feature vector x as follows:

    p(y = 1 \mid x; \theta) = \sigma(\theta^\top x) = \frac{1}{1 + \exp(-\theta^\top x)}.    (1)

Here θ ∈ R^N are the parameters of the logistic regression model, and σ(·) is the sigmoid function, defined by the second equality.

Under the Laplacian prior p(θ) = (β/2)^N exp(−β‖θ‖₁) (with β > 0), the maximum a posteriori (MAP) estimate of the parameters θ is given by:

    \min_\theta \; -\sum_{i=1}^{M} \log p(y^{(i)} \mid x^{(i)}; \theta) + \beta \|\theta\|_1.    (2)

This optimization problem is referred to as L1 regularized logistic regression. Often, it will be convenient to consider the following alternative parameterization of the L1 regularized logistic regression problem:

    \min_\theta \; -\sum_{i=1}^{M} \log p(y^{(i)} \mid x^{(i)}; \theta),    (3)
    \text{subject to } \|\theta\|_1 \le C.

The optimization problems (2) and (3) are equivalent, in the following sense: for any choice of β, there is a choice of C such that both optimization problems have the same minimizing argument. This follows from the fact that (up to a constant that does not depend on θ) the optimization problem (2) is the Lagrangian of the constrained optimization problem (3), where β is the Lagrange multiplier. (Rockafellar 1970) In practice, C (and/or β) may be either chosen manually or via cross-validation; for the sake of simplicity, in this paper we will consider solving the problem (3) for a given, fixed, value of C.

Our Learning Algorithm

In this section, we describe our learning algorithm for L1 regularized logistic regression. We also formally prove that our learning algorithm converges to the global optimum of the optimization problem (3).

Preliminaries

IRLS for unregularized logistic regression.  Our learning algorithm is based on iteratively reweighted least squares (IRLS). (Green 1984; Minka 2003) IRLS reformulates the problem of finding the step direction for Newton's method as a weighted ordinary least squares problem.

Consider the problem of finding the maximum likelihood estimate (MLE) of the parameters θ for the unregularized logistic regression model. The optimization problem can be written as:

    \min_\theta \; -\sum_{i=1}^{M} \log p(y^{(i)} \mid x^{(i)}; \theta).    (4)

In every iteration, Newton's method first finds a step direction by approximating the objective by the second order Taylor expansion at the current point, and optimizing the quadratic approximation in closed form. More specifically, let θ^(k) denote the current point in the k'th iteration. Then Newton's method finds a step direction by computing the optimum γ^(k) of the quadratic approximation at θ^(k) as follows:

    \gamma^{(k)} = \theta^{(k)} - H^{-1}(\theta^{(k)})\, g(\theta^{(k)}).    (5)

Here H(θ) and g(θ) represent the Hessian and gradient of the objective function in (4) evaluated at θ. Once the step direction γ^(k) is computed, Newton's method computes the next iterate

    \theta^{(k+1)} = (1 - t)\,\theta^{(k)} + t\,\gamma^{(k)}    (6)

by a line search over the step size parameter t.
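As a concrete illustration of Equations (5) and (6), the following minimal NumPy sketch performs one damped Newton iteration for the unregularized objective (4). It is our own illustration rather than the implementation used in the paper (which was written in MATLAB); it works directly with the negative log-likelihood, so the gradient and Hessian below carry the corresponding signs, and the simple halving backtracking search over t is an assumption.

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    def neg_log_likelihood(theta, X, y):
        # Objective (4): negative log-likelihood; X is N x M, one column per training instance.
        p = sigmoid(X.T @ theta)
        eps = 1e-12  # guard against log(0)
        return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

    def newton_step_direction(theta, X, y):
        # Step direction gamma^(k) of Equation (5), written for the negative log-likelihood.
        mu = sigmoid(X.T @ theta)       # predicted probabilities
        lam = mu * (1.0 - mu)           # diagonal of Lambda
        grad = -X @ (y - mu)            # gradient of (4)
        hess = (X * lam) @ X.T          # Hessian of (4): X Lambda X^T
        return theta - np.linalg.solve(hess, grad)

    def damped_newton_iteration(theta, X, y):
        # Equation (6): theta^(k+1) = (1 - t) theta^(k) + t gamma^(k),
        # with t chosen here by a simple halving backtracking search (an assumption).
        gamma = newton_step_direction(theta, X, y)
        f0 = neg_log_likelihood(theta, X, y)
        t = 1.0
        while t > 1e-8:
            cand = (1.0 - t) * theta + t * gamma
            if neg_log_likelihood(cand, X, y) < f0:
                return cand
            t *= 0.5
        return theta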
We now show how the Newton step direction can be computed by solving a weighted ordinary least squares problem rather than using Equation (5). (See Green 1984, or Minka 2003, for details of this derivation.) Let X ∈ R^{N×M} denote the design matrix:

    X = [x^{(1)} \; x^{(2)} \; \ldots \; x^{(M)}].

Let the diagonal matrix Λ and the vector z be defined as follows: for all i = 1, 2, ..., M:

    \Lambda_{ii} = \sigma(\theta^{(k)\top} x^{(i)}) \, [1 - \sigma(\theta^{(k)\top} x^{(i)})],    (7)

    z_i = x^{(i)\top} \theta^{(k)} + \frac{[1 - \sigma(y^{(i)} \theta^{(k)\top} x^{(i)})] \, y^{(i)}}{\Lambda_{ii}}.    (8)

Then we have that H(θ^(k)) = −XΛX^⊤ and g(θ^(k)) = XΛ(z − X^⊤θ^(k)), and thus Equation (5) can be rewritten as:

    \gamma^{(k)} = (X \Lambda X^\top)^{-1} X \Lambda z.    (9)

Thus γ^(k) is the solution to the following weighted least squares problem:

    \gamma^{(k)} = \arg\min_\gamma \; \| \Lambda^{1/2} X^\top \gamma - \Lambda^{1/2} z \|_2^2.    (10)

Therefore, the Newton step direction can be computed by solving the least squares problem (10), and thus the logistic regression optimization problem can be solved using iteratively reweighted least squares (IRLS). This least squares formulation will play an important role in our algorithm. The term IRLS refers to the fact that in every iteration the least squares problem of Equation (10) has a different diagonal weighting matrix Λ and "right-hand side" z, but the same design matrix X. (Note that, although we did not make the dependence explicit, the diagonal matrix Λ and the vector z both depend on θ^(k) and therefore change in every iteration.)

IRLS for L1 regularized logistic regression.  For the case of L1 regularized logistic regression, as formulated in Equation (3), the objective is equal to the unregularized logistic regression objective. By augmenting the IRLS formulation of the unregularized logistic regression with the L1 constraint, we get our IRLS formulation for L1 regularized logistic regression (leaving out the dependencies on k for clarity):

    \min_\gamma \; \| \Lambda^{1/2} X^\top \gamma - \Lambda^{1/2} z \|_2^2,    (11)
    \text{subject to } \|\gamma\|_1 \le C.

Our algorithm (presented in more detail below) iteratively solves the L1 regularized logistic regression problem by (in every iteration) first solving the L1 constrained least squares problem (11) to find a step direction and then doing a line search to determine the step size. Thus our algorithm combines a local quadratic approximation with the global constraints.

LARS for solving L1 constrained least squares.  The L1 constrained least squares problem of Equation (11) is also known as a LASSO problem. (Tibshirani 1996) Several algorithms have been developed to solve L1 constrained least squares problems. In our algorithm, we use LARS with the LASSO modification. (Efron et al. 2004) LARS with the LASSO modification uses a stagewise procedure, based on a simple incremental update, to efficiently solve the L1 regularized least squares problem.

Our algorithm

Our algorithm iteratively solves the L1 constrained logistic regression problem of Equation (3). In the k'th iteration, it finds a step direction γ^(k) by solving the constrained least squares problem of Equation (11). Then it performs a backtracking line search (see Boyd & Vandenberghe, 2004, for details) over the step size parameter t ∈ [0, 1] to find the next iterate θ^(k+1) = (1 − t)θ^(k) + tγ^(k). To summarize, we propose the following algorithm, which we refer to as IRLS-LARS. (We discuss the stopping criterion used in more detail when discussing our experimental setup and results.)

Algorithm 1 IRLS-LARS
 1  Set θ^(0) = 0.
 2  for (k = 0 to MaxIterations)
 3    Compute Λ and z using Equation (7) and Equation (8) (for θ = θ^(k)).
 4    Use LARS to solve the L1 constrained least squares problem of Equation (11), and let γ^(k) be the solution.
 5    Set θ^(k+1) = (1 − t)θ^(k) + tγ^(k), where t is found using a backtracking line-search that minimizes the objective function given in Equation (3).
 6    Evaluate the objective function, given in Equation (3), at θ^(k+1).
 7    if (the stopping criterion is satisfied)
 8      Break;
 9    end
10  end

We note that our algorithm can be applied not only to L1 constrained logistic regression, but to any L1 constrained optimization problem, so long as the unconstrained problem can be solved using IRLS. In particular, our algorithm can be used for parameter learning for L1 constrained generalized linear models. (Nelder & Wedderburn 1972; McCullagh & Nelder 1989)

Convergence and optimality guarantees

The following theorem shows that the IRLS-LARS algorithm presented in this paper converges to a global optimum.[2]

[2] In fact, although the theorem statement is specific to our algorithm, the proof easily generalizes to any algorithm that iteratively minimizes the local quadratic approximation subject to the constraints (followed by a line search), so long as the objective function has a continuous 3rd order derivative and the feasible set is convex, closed and bounded.

Theorem 1. The IRLS-LARS algorithm presented in the paper converges to a global optimum of the L1 constrained regularized logistic regression problem of Equation (3).
Proof. Our convergence proof uses the following fact:

Fact 1. Let any point θ0 that is not a global optimum be given. Then there is an open subset S_{θ0} that includes θ0, and there is a constant K_{θ0} > 0, such that the following holds: for all feasible φ0 ∈ S_{θ0} we have that one iteration of our algorithm, starting from φ0, will give us a (feasible) point φ1 that satisfies f(φ1) ≤ f(φ0) − K_{θ0}.

Proof of Fact 1. We have a convex objective function f with convex constraints. Since θ0 is not a global optimum, any open set that includes θ0 includes a feasible point with strictly lower value of the objective function f. (In fact, all points on a line segment connecting θ0 with a global optimum are feasible and have a strictly lower value of the objective function f.) The objective f is convex and has bounded second derivative over the feasible set; thus, for any sufficiently small open set that includes θ0, we have that this set includes a feasible point with strictly lower value of the local quadratic approximation at θ0, which we denote by f̃_{θ0}. Thus, when our algorithm globally optimizes the quadratic approximation over the feasible set, the resulting point θ1 must satisfy f̃_{θ0}(θ1) < f̃_{θ0}(θ0). Now define δ > 0 as follows:

    \delta = \tilde{f}_{\theta_0}(\theta_0) - \tilde{f}_{\theta_0}(\theta_1).    (12)

Since f is three times continuously differentiable and since the feasible set is bounded, there exists ε > 0 such that for all φ0 ∈ S_{θ0} = {θ : ‖θ − θ0‖2 ≤ ε} the following hold:

    |f(\theta_0) - f(\phi_0)| \le \frac{\delta}{8},    (13)

and for all θ in the feasible set:

    |\tilde{f}_{\theta_0}(\theta) - \tilde{f}_{\phi_0}(\theta)| \le \frac{\delta}{8}.    (14)

From Equation (14) we have for all φ0 ∈ S_{θ0} that

    |\min_\theta \tilde{f}_{\theta_0}(\theta) - \min_\theta \tilde{f}_{\phi_0}(\theta)| \le \frac{\delta}{4}.    (15)

Therefore, for all φ0 ∈ S_{θ0} the following holds. Let φ1 be the global optimum (within the feasible set) of the quadratic approximation at φ0; then we have that:

    \tilde{f}_{\phi_0}(\phi_1) \le \tilde{f}_{\theta_0}(\theta_1) + \frac{\delta}{4}
                               \le \tilde{f}_{\theta_0}(\theta_0) - \delta + \frac{\delta}{4}
                               = f(\theta_0) - \frac{3\delta}{4}
                               = f(\phi_0) + f(\theta_0) - f(\phi_0) - \frac{3\delta}{4}
                               \le f(\phi_0) + \frac{\delta}{8} - \frac{3\delta}{4}
                               \le f(\phi_0) - \frac{\delta}{2}.    (16)

Here we used, in order: Equation (15); Equation (12); simplification (f̃_{θ0}(θ0) = f(θ0)); adding and subtracting the same term; Equation (13); simplification.

Let t be the step size resulting from the line search. Then we have that:

    \tilde{f}_{\phi_0}(\phi_0 + t(\phi_1 - \phi_0)) \le (1 - t)\,\tilde{f}_{\phi_0}(\phi_0) + t\,\tilde{f}_{\phi_0}(\phi_1)
                                                    \le f(\phi_0) - t\,\frac{\delta}{2}.    (17)

Here we used, in order: f̃_{φ0} is convex; f̃_{φ0}(φ0) = f(φ0) and Equation (16).

Since f is three times continuously differentiable and the feasible set is bounded, there exists C > 0 such that for all φ0 ∈ S_{θ0} the following holds for all θ in the feasible set:

    f(\theta) \le \tilde{f}_{\phi_0}(\theta) + C \|\theta - \phi_0\|_2^3.    (18)

Now we choose θ = φ0 + t(φ1 − φ0) in Equation (18), and we use Equation (17) to obtain:

    f(\phi_0 + t(\phi_1 - \phi_0)) \le f(\phi_0) - t\,\frac{\delta}{2} + C \|t(\phi_1 - \phi_0)\|_2^3.    (19)

Let R be the radius of a sphere that includes the (bounded) feasible set; then we get that

    f(\phi_0 + t(\phi_1 - \phi_0)) \le f(\phi_0) - t\,\frac{\delta}{2} + C t^3 (2R)^3.    (20)

Choosing t = \sqrt{\delta / (48 C R^3)} gives us that for all φ0 ∈ S_{θ0}, our algorithm finds a φ* such that:

    f(\phi^*) \le f(\phi_0) - \sqrt{\frac{\delta^3}{192 C R^3}}.

Choosing K_{θ0} = \sqrt{\delta^3 / (192 C R^3)} proves Fact 1.

Now that we have established Fact 1, we are ready to prove the theorem. Let any δ > 0 be fixed. Let P_δ = {θ : θ is feasible, and ‖θ − θ*‖ ≥ δ for all global optima θ*}. We will show convergence to a global optimum by showing that the algorithm can only spend a finite number of iterations in the set P_δ.

From Fact 1 we have that for all θ ∈ P_δ, there exists an open set S_θ and a positive constant K_θ > 0 such that the algorithm improves the objective by at least K_θ in one iteration, when started from a point in the set S_θ. Since such a set S_θ exists for all θ ∈ P_δ, we have that

    P_\delta \subseteq \bigcup_{\theta \in P_\delta} S_\theta.

Since the set P_δ is a closed and bounded subset of R^N, we have (from the Heine-Borel theorem) that P_δ is compact. By definition of compactness, for every open cover of a compact set there exists a finite sub-cover. (See, e.g., Royden, 1988, for more details.) Thus, there is a finite set Q_δ such that

    P_\delta \subseteq \bigcup_{\theta \in Q_\delta} S_\theta.    (21)

Since Q_δ is finite, we have that there is some C_δ > 0 such that

    \inf_{\theta \in Q_\delta} K_\theta = C_\delta.

Since Q_δ covers P_δ, we have that for all points θ ∈ P_δ our algorithm improves the objective by at least C_δ in every iteration. The objective is bounded, so this means the algorithm has to leave the set P_δ after a finite number of iterations. Since this holds true for arbitrary δ > 0, we have shown that the algorithm converges to a global optimum.
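To make Algorithm 1 concrete before turning to the experiments, here is a minimal Python sketch of the IRLS-LARS loop. This is our own sketch, not the authors' MATLAB implementation: it assumes scikit-learn's lars_path is available and uses it (LARS with the LASSO modification) for the inner problem of Equation (11), picking the point on the LASSO path whose L1 norm equals the budget C by interpolating between path breakpoints; the step size comes from a simple halving backtracking search, and a relative-improvement test stands in for the stopping criterion discussed with the experimental setup. Helper names such as l1_constrained_lstsq are ours.

    import numpy as np
    from sklearn.linear_model import lars_path

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    def nll(theta, X, y):
        # Objective of (3): negative log-likelihood; X is N x M, one column per instance.
        p = sigmoid(X.T @ theta)
        eps = 1e-12  # guard against log(0)
        return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

    def l1_constrained_lstsq(A, b, C):
        # Solve min_gamma ||A gamma - b||_2^2 subject to ||gamma||_1 <= C (Equation (11))
        # by walking the LARS/LASSO path and interpolating to the L1 budget C.
        if C <= 0:
            return np.zeros(A.shape[1])
        _, _, coefs = lars_path(A, b, method='lasso')   # coefs: (num_features, num_breakpoints)
        norms = np.abs(coefs).sum(axis=0)
        if norms[-1] <= C:
            return coefs[:, -1]                          # constraint is inactive
        k = int(np.searchsorted(norms, C))               # first breakpoint past the budget
        w = (C - norms[k - 1]) / (norms[k] - norms[k - 1])
        return (1 - w) * coefs[:, k - 1] + w * coefs[:, k]

    def irls_lars(X, y, C, max_iter=50, tol=1e-6):
        N, M = X.shape
        theta = np.zeros(N)                              # step 1: theta^(0) = 0
        prev = nll(theta, X, y)
        for _ in range(max_iter):                        # step 2
            mu = sigmoid(X.T @ theta)
            lam = np.clip(mu * (1.0 - mu), 1e-8, None)   # Equation (7), clipped for stability
            z = X.T @ theta + (y - mu) / lam             # working response, cf. Equation (8),
                                                         # written in the (y - mu)/Lambda form for {0,1} labels
            A = (np.sqrt(lam) * X).T                     # Lambda^{1/2} X^T  (M x N)
            b = np.sqrt(lam) * z                         # Lambda^{1/2} z
            gamma = l1_constrained_lstsq(A, b, C)        # step 4: LARS for Equation (11)
            # step 5: simple backtracking line search over t in [0, 1]
            t, best, best_val = 1.0, theta, prev
            while t > 1e-8:
                trial = (1.0 - t) * theta + t * gamma
                val = nll(trial, X, y)
                if val < prev:
                    best, best_val = trial, val
                    break
                t *= 0.5
            theta = best                                 # steps 6-8: accept and test progress
            if prev - best_val <= tol * max(1.0, abs(prev)):
                break
            prev = best_val
        return theta

Note that each iterate stays feasible: θ^(0) = 0 and γ^(k) both lie in the L1 ball of radius C, so any convex combination (1 − t)θ^(k) + tγ^(k) does as well.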
Experimental Results

We compared the computational efficiency of our algorithm with the eight other algorithms listed in Table 1. We tested each algorithm's performance on 12 different datasets, consisting of 9 UCI datasets (Newman et al. 1998), one artificial dataset called Madelon from the NIPS 2003 workshop on feature extraction,[3] and two gene expression datasets (Microarray 1 and 2).[4] Table 2 gives details on the number of training examples and the number of features for each dataset.

[3] Madelon is available at: http://www.clopinet.com/isabelle/Projects/NIPS2003/.

[4] Microarray 1 is a colon cancer microarray dataset available at: http://microarray.princeton.edu/oncology/. Microarray 2 is a leukemia microarray dataset available at: http://www.broad.mit.edu/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=43.

    Name        Description
    IRLS-LARS   Our algorithm
    GenLASSO    Generalized LASSO (Roth 2004)
    SCGIS       Sequential Conditional GIS (Goodman 2004)
    Gl1ce       Gl1ce (Lokhorst 1999)
    Grafting    Grafting (Perkins & Theiler 2003)
    CG-L1       Conjugate gradient with exact L1
    NM-epsL1    Newton's method with epsL1
    CG-epsL1    Conjugate gradient with epsL1
    CG-Huber    Conjugate gradient with Huber

Table 1: The nine algorithms that were used in our analysis. (See text for details.)

    Name                      Number of features   Number of samples
    Arrhythmia                279                  420
    Hepatitis                 19                   80
    Ionosphere                33                   351
    Promoters                 57                   106
    Spambase                  57                   4601
    Spect                     22                   267
    Spectf                    44                   349
    Splice                    60                   1527
    Wisconsin breast cancer   30                   569
    Madelon                   500                  2600
    Microarray1               2000                 62
    Microarray2               3571                 72

Table 2: The 12 datasets we used in our experiments. (See text for details.)

Experimental details on the other algorithms

We compared our algorithm (IRLS-LARS) to four previously published algorithms: Grafting (Perkins & Theiler 2003), Generalized LASSO (Roth 2004), SCGIS (Goodman 2004), and Gl1ce (Lokhorst 1999).

Three of these algorithms (Gl1ce, SCGIS and Grafting) and our algorithm (IRLS-LARS) were implemented in MATLAB. In our implementations, we went through great efforts to carefully implement all the optimization techniques suggested by the corresponding papers. We used Volker Roth's C-implementation for GenLASSO.[5]

[5] The code for GenLASSO is available at: http://www.informatik.uni-bonn.de/~roth/GenLASSO/index.html. His implementation contained some unnecessary steps (for the purposes of our evaluation), such as automatically performing the cross-validation for determining the regularization parameter C. We went carefully through his code and removed these additional computations.

We also compared IRLS-LARS to standard optimization algorithms, such as conjugate gradient method and Newton's method. These algorithms are not directly applicable to L1 logistic regression because the objective function has discontinuous first derivatives. Thus, we used a smooth approximation of the L1 penalty term. More specifically, we have that the L1 term is given by

    \|\theta\|_1 = \sum_{j=1}^{N} |\theta_j|.

We used the "epsL1" function, defined as

    \mathrm{epsL1}(\theta_j, \epsilon) = \sqrt{\theta_j^2 + \epsilon},

to obtain a smooth approximation for each of the terms |θ_j|. We similarly used the Huber loss function (see, e.g., Boyd & Vandenberghe 2004):

    \mathrm{Huber}(\theta_j, \epsilon) =
    \begin{cases}
      \theta_j^2 / (2\epsilon)   & \text{if } |\theta_j| < \epsilon \\
      |\theta_j| - \epsilon / 2  & \text{otherwise.}
    \end{cases}

We also tested conjugate gradient with the exact (non-differentiable) L1 norm penalty (CG-L1). In this case, we treated the first derivative of the L1 terms as zero when evaluated at zero. More specifically, we treated ∂|θ_j|/∂θ_j as being zero for θ_j = 0.

We went through great efforts to ensure a fair comparison by optimizing the performance of all other algorithms. We carefully profiled all algorithms and made optimizations accordingly. We also spent significant amounts of time tuning each algorithm's parameters (when applicable). For example, for the algorithms that are based on conjugate gradient and Newton's method, we extensively tuned the parameters of the line-search algorithm; for the algorithms using conjugate gradient (Grafting, CG-epsL1, CG-Huber and CG-L1), we tested both our own conjugate gradient implementation as well as the MATLAB optimization toolbox's conjugate gradient; for the algorithms that use the approximate L1 penalty term, we tried many choices for ε in the range 10⁻¹⁵ < ε < 0.01 and chose the one with the shortest running time; etc. In all cases, we are reporting the running times for those parameter settings (and other choices) which resulted in the shortest running times.
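For reference, the epsL1 and Huber approximations described above, together with the gradients used by the smoothed baselines, can be written in a few lines of NumPy; this is our own illustrative code, not the MATLAB code used in the experiments.

    import numpy as np

    def eps_l1(theta, eps):
        # epsL1 approximation: sqrt(theta_j^2 + eps) per component, summed to approximate ||theta||_1.
        return np.sum(np.sqrt(theta ** 2 + eps))

    def huber_l1(theta, eps):
        # Huber approximation of |theta_j|: quadratic near zero, linear elsewhere.
        a = np.abs(theta)
        return np.sum(np.where(a < eps, theta ** 2 / (2 * eps), a - eps / 2))

    def eps_l1_grad(theta, eps):
        # Gradient of the epsL1 penalty (smooth everywhere), as used by NM-epsL1 / CG-epsL1.
        return theta / np.sqrt(theta ** 2 + eps)

    def huber_l1_grad(theta, eps):
        # Gradient of the Huber penalty, as used by CG-Huber.
        return np.where(np.abs(theta) < eps, theta / eps, np.sign(theta))

Each smoothed baseline then minimizes the objective of Equation (2) with ‖θ‖₁ replaced by the sum of the chosen per-coordinate approximation.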
Choice of the regularization parameters

For a fixed dataset, the choice of the regularization parameter C (or β) greatly affects the running time of the algorithms. Thus, we ran 10-fold cross validation (for each dataset) and chose the regularization parameter that achieves the highest (hold-out) accuracy. Doing so ensures that we evaluate the algorithms in the regime we care about for classification.

In particular, we ran our algorithm (IRLS-LARS) to determine the regularization parameter C. This value of C was then used for all learning algorithms that use the constrained version of L1 regularized logistic regression, namely IRLS-LARS, GenLASSO, and Gl1ce.

The other algorithms solve the formulation with the L1 penalty term (or a smooth version thereof), as given in Equation (2). As discussed in more detail in our description of L1 regularized logistic regression, the optimization problems of Equation (2) and Equation (3) are equivalent, in the following sense: for any choice of C, there is a choice of β such that the solution to both problems is the same. We now show how to find β explicitly for a given value of C.

From the Karush-Kuhn-Tucker (KKT) conditions, we have that the optimal primal solution θ* and the optimal dual solution β* (of the constrained optimization problem) satisfy the following condition:

    \frac{\partial \sum_{i=1}^{M} \log p(y^{(i)} \mid x^{(i)}; \theta)}{\partial \theta_j} \bigg|_{\theta^*} = \beta^* \, \mathrm{sign}(\theta_j^*)    (22)

for all θ_j* ≠ 0. From Lagrange duality, we also have that the optimal primal solution θ* is given by

    \theta^* = \arg\min_\theta \; -\sum_{i=1}^{M} \log p(y^{(i)} \mid x^{(i)}; \theta) + \beta^* \|\theta\|_1.    (23)

Thus, by choosing

    \beta = \beta^* = \mathrm{sign}(\theta_j^*) \, \frac{\partial \sum_{i=1}^{M} \log p(y^{(i)} \mid x^{(i)}; \theta)}{\partial \theta_j} \bigg|_{\theta^*}    (24)

(for any j such that θ_j* ≠ 0), we ensure that the optimal solution of the formulation with the L1 penalty term is the same as the optimal solution of the formulation with the L1 constraint (‖θ‖₁ ≤ C).[6]

[6] Effectively, the KKT condition as in (22) is not exactly satisfied for the approximate L1 functions. However, the typical ε value turned out to be very small (≈ 10⁻¹⁰), such that the resulting approximate L1 functions were very close to the true L1 functions.

Running time

The efficiency of an algorithm was measured based on the running time, i.e., the CPU time measured in seconds until the algorithm meets the stopping criterion. Each algorithm was run on a Linux machine with a 2GHz AMD Opteron and 4.6GB RAM.[7]

[7] For the runs that took longer than 500 seconds, we truncated the running time to 500 seconds.

Stopping criterion

For each algorithm, the stopping criterion is based on how close the current objective function value (denoted by obj) is to the optimal objective function value (denoted by obj*). For each dataset, we first compute obj* as follows. First, we run our algorithm (IRLS-LARS) for an extensive amount of time to obtain the solution θ*. Then, we use θ* to compute β* from Equation (24). Finally, we verify whether the KKT optimality conditions (see Boyd & Vandenberghe, 2004) are satisfied for the values of θ* and β* we found. In all cases, we found that the KKT conditions were satisfied up to a very small error.[8]

[8] In all cases, the primal constraint (‖θ‖₁ ≤ C) was satisfied with an error less than 10⁻¹³, and Equation (24) was satisfied with an error less than 5 × 10⁻⁸. All the other KKT conditions were satisfied exactly. We performed the same procedure using GenLASSO and obtained similar results. In all cases, the absolute difference between the obj* obtained from running IRLS-LARS and the obj* obtained from running GenLASSO was smaller than 10⁻⁸, and the relative difference was always smaller than 10⁻¹⁰.

Using the optimal objective obj*, we defined the stopping criterion as:

    \frac{|\mathrm{obj}^* - \mathrm{obj}|}{|\mathrm{obj}^*|} < \tau.    (25)

We used τ = 10⁻⁶ for our algorithm, Grafting, SCGIS, and GenLASSO. For the algorithms with the approximate L1 functions, we used a less stringent bound (τ = 10⁻⁵), since the objective function with the approximate L1 penalty term is slightly different from the original one formulated in Equation (2).

Discussion of experimental results

Figure 1 shows the results for the five algorithms specifically designed for L1 regularized logistic regression (IRLS-LARS, Grafting, SCGIS, GenLASSO and Gl1ce) on 12 datasets. Since the running times for the standard optimization methods (CG-L1, NM-epsL1, CG-epsL1 and CG-Huber) tend to be significantly higher, we do not report them in the same figure. (Doing so allows us to visualize more clearly the differences in performance for the first set of algorithms.) We report the performance of the standard optimization algorithms (and our algorithm) in Table 3. In all cases we report the average running time (in seconds) over 100 trials of the same experiment (dataset and algorithm combination).

The results reported in Figure 1 show that our algorithm (IRLS-LARS) was more efficient than all other algorithms in 11 out of 12 datasets. On one dataset (Microarray2) GenLASSO and Gl1ce were slightly faster. All other algorithms were slower than our algorithm on every dataset.

IRLS-LARS significantly outperformed Grafting and SCGIS. More specifically, in 8 (out of 12) datasets our method was more than 8 times faster than Grafting. (In the other 4, it was also faster, although not by a factor of 8.) Our method was at least 10 times faster than SCGIS in 10 out of 12 datasets.

GenLASSO and Gl1ce are the strongest competitors, and even then our algorithm is more than twice as fast as Gl1ce for 5 out of 12 datasets and more than twice as fast as GenLASSO for 8 out of 12 datasets.
We note that (as opposed to ours and all the other methods) GenLASSO is highly optimized C code (linked with highly optimized numerical libraries). It is widely accepted that C code is faster than MATLAB code, and thus we are giving GenLASSO a somewhat unfair advantage in our comparison.[9]

[9] We have also implemented GenLASSO in MATLAB. However, the extensive optimization tricks used in the C code and numerical libraries were hard to duplicate in MATLAB, making it hard to evaluate GenLASSO correctly with our MATLAB code. Since C code is widely accepted to be at least as fast as MATLAB code, we decided to report the C code results (possibly favoring GenLASSO somewhat in our comparisons).

    Data          IRLS-LARS   NM-epsL1   CG-epsL1   CG-Huber   CG-L1
    Arrhythmia    0.0420      2.8450     14.384     13.660     14.065
    Hepatitis     0.0046      0.0258     0.0819     0.6580     0.6347
    Ionosphere    0.0516      0.1534     0.1746     0.9717     0.9237
    Promoters     0.0182      0.0632     0.2047     0.8418     0.7328
    Spambase      4.0804      21.172     10.151     13.410     12.016
    Spect         0.0345      0.0704     0.1206     0.7651     0.7293
    Spectf        0.0646      0.2330     0.5493     1.1761     1.1235
    Splice        0.8376      2.9828     0.9424     4.3407     4.8007
    WBC           0.1222      0.7225     4.9592     1.4854     1.3777
    Madelon       0.8602      55.060     113.22     95.223     104.50
    Microarray1   0.0330      242.88     500.00     500.00     500.00
    Microarray2   0.4069      500.00     500.00     500.00     500.00

    (WBC: Wisconsin breast cancer)

Table 3: The running time (seconds) of IRLS-LARS, CG-L1 and three algorithms with the approximate L1 functions (NM-epsL1, CG-epsL1 and CG-Huber). (See text for details.)

As reported in Table 3, conjugate gradient and Newton's method are an order of magnitude slower than our IRLS-LARS algorithm on most of the datasets (despite the less stringent stopping criterion). Among these methods, NM-epsL1 seems to be the second best algorithm on the datasets with a small number of features. However, this method quickly becomes inefficient as the number of features increases. The other methods similarly suffered from the increasing number of features. We observe that CG-L1 is very competitive with CG-epsL1 and CG-Huber.

Conclusions

In this paper, we proposed a new algorithm for learning L1 regularized logistic regression. Our theoretical results show that our algorithm is guaranteed to converge to a global optimum. Our experiments showed that our algorithm is computationally very efficient. In particular, it outperforms standard algorithms for solving convex optimization problems, as well as other algorithms previously developed specifically to solve the L1 regularized logistic regression problem.

Acknowledgments

We give warm thanks to John Lafferty, Alexis Battle, Ben Packer, and Vivek Farias for helpful comments. H. Lee was supported by a Stanford Graduate Fellowship (SGF). This work was supported in part by the Department of the Interior/DARPA under contract number NBCHD030010.

References

Boyd, S., and Vandenberghe, L. 2004. Convex Optimization. Cambridge University Press.

Efron, B.; Hastie, T.; Johnstone, I.; and Tibshirani, R. 2004. Least angle regression. Annals of Statistics 32(2):407–499.

Goodman, J. 2004. Exponential priors for maximum entropy models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Green, P. J. 1984. Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. Journal of the Royal Statistical Society, Series B, Methodological 46:149–192.

Lokhorst, J. 1999. The LASSO and Generalised Linear Models. Honors Project, Department of Statistics, The University of Adelaide, South Australia, Australia.

McCullagh, P., and Nelder, J. A. 1989. Generalized Linear Models. Chapman & Hall.

Minka, T. P. 2003. A comparison of numerical optimizers for logistic regression. http://research.microsoft.com/~minka/papers/logreg/.

Nelder, J. A., and Wedderburn, R. W. M. 1972. Generalized linear models. Journal of the Royal Statistical Society, Series A, General 135(3):370–384.

Newman, D. J.; Hettich, S.; Blake, C. L.; and Merz, C. 1998. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html.

Ng, A. Y. 2004. Feature selection, L1 vs. L2 regularization, and rotational invariance. In International Conference on Machine Learning.

Osborne, M.; Presnell, B.; and Turlach, B. 2000. On the lasso and its dual. Journal of Computational and Graphical Statistics 9:319–337.

Perkins, S., and Theiler, J. 2003. Online feature selection using grafting. In International Conference on Machine Learning.

Rockafellar, R. T. 1970. Convex Analysis. Princeton University Press.

Roth, V. 2004. The generalized LASSO. IEEE Transactions on Neural Networks 15(1):16–28.

Royden, H. 1988. Real Analysis. Prentice Hall.

Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58(1):267–288.
Figure 1: Comparison of the running time (seconds) among five algorithms (IRLS-LARS, Grafting, SCGIS, GenLASSO and Gl1ce) over 12 datasets. Each plot corresponds to one dataset. The name of the dataset is indicated at the top of the plot, and the number of features (N) and the number of training instances (M) are shown as "Name of the data: N × M". For each experiment, we report the running time in seconds, and the running time relative to IRLS-LARS in parentheses. (See text for discussion.)
