Efficient L1 Regularized Logistic Regression

Su-In Lee, Honglak Lee, Pieter Abbeel and Andrew Y. Ng


Computer Science Department
Stanford University
Stanford, CA 94305

Abstract

L1 regularized logistic regression is now a workhorse of machine learning: it is widely used for many classification problems, particularly ones with many features. L1 regularized logistic regression requires solving a convex optimization problem. However, standard algorithms for solving convex optimization problems do not scale well enough to handle the large datasets encountered in many practical settings. In this paper, we propose an efficient algorithm for L1 regularized logistic regression. Our algorithm iteratively approximates the objective function by a quadratic approximation at the current point, while maintaining the L1 constraint. In each iteration, it uses the efficient LARS (Least Angle Regression) algorithm to solve the resulting L1 constrained quadratic optimization problem. Our theoretical results show that our algorithm is guaranteed to converge to the global optimum. Our experiments show that our algorithm significantly outperforms standard algorithms for solving convex optimization problems. Moreover, our algorithm outperforms four previously published algorithms that were specifically designed to solve the L1 regularized logistic regression problem.

Introduction

Logistic regression is widely used in machine learning for classification problems. It is well known that regularization is required to avoid over-fitting, especially when there is only a small number of training examples, or when there is a large number of parameters to be learned. In particular, L1 regularized logistic regression is often used for feature selection, and has been shown to have good generalization performance in the presence of many irrelevant features. (Ng 2004; Goodman 2004)

Unregularized logistic regression is an unconstrained convex optimization problem with a continuously differentiable objective function. As a consequence, it can be solved fairly efficiently with standard convex optimization methods, such as Newton's method or conjugate gradient. However, adding the L1 regularization makes the optimization problem computationally more expensive to solve. If the L1 regularization is enforced by an L1 norm constraint on the parameter vector, then the optimization problem becomes a constrained optimization problem with the number of constraints equal to twice the number of parameters to be learned. The running time of standard algorithms for solving convex optimization problems increases with the number of constraints. If, alternatively, the L1 regularization is implemented by adding (a constant times) the L1 norm of the parameter vector to the objective, then the objective is no longer continuously differentiable. This precludes the use of, for example, Newton's method to solve the optimization problem efficiently. Thus, both formulations of L1 regularized logistic regression lead to a more complicated optimization problem than unregularized logistic regression. However, the L1 constraint (or penalty term) has significant structure. For example, if one knew a priori the signs of each of the parameters, then the L1 constraint would become a single linear constraint. Or, in the case of regularization by adding the L1 term to the objective, the L1 term would become linear, resulting in a continuously differentiable objective. These (and similar) observations have led to various algorithms that try to exploit the structure in the optimization problem and solve it more efficiently.

Goodman (2004) proposed an algorithm for learning conditional maximum entropy models (of which logistic regression is a special case) with an exponential prior (also referred to as a one-sided Laplacian prior). His algorithm is based on generalized iterative scaling (GIS). The idea is to iteratively (efficiently) maximize a differentiable lower bound of the objective function. The proposed learning algorithm can easily be extended to the case of logistic regression with a Laplacian prior by duplicating all the features with the opposite sign. Logistic regression with a Laplacian prior is equivalent to L1 regularized logistic regression.

Perkins et al. (2003) proposed a method called grafting. The key idea in grafting is to incrementally build a subset of the parameters that are allowed to differ from zero. Grafting uses a local derivative test in each iteration of the conjugate gradient method to choose an additional feature that is allowed to differ from zero. This way, the conjugate gradient method can be applied without explicitly dealing with the discontinuous first derivatives.

Roth (2004) proposed an algorithm called generalized LASSO that extends a LASSO algorithm proposed by Osborne et al. (2000). (The LASSO refers to an L1 regularized least squares problem; see Tibshirani (1996) for details.)
In this paper, we propose a new, efficient algorithm for solving L1 regularized logistic regression. Our algorithm is based on the iteratively reweighted least squares (IRLS) formulation of logistic regression. More specifically, in each iteration, our algorithm finds a step direction by optimizing the quadratic approximation of the objective function at the current point subject to the L1 norm constraint. The IRLS formulation of logistic regression allows us to (iteratively) reformulate the quadratic approximation as a least squares objective. Thus our algorithm ends up solving an L1 constrained least squares problem in every iteration. The L1 constrained least squares problem can be solved very efficiently using least angle regression (LARS), proposed by Efron et al. (2004).

Our theoretical results show that our algorithm is guaranteed to converge to a global optimum. (In fact, our theoretical results could be generalized to a larger family of optimization problems, as discussed later.)

Our experiments show that our algorithm significantly outperforms the previously proposed algorithms for L1 regularized logistic regression. Our algorithm also significantly outperforms standard gradient-based algorithms, such as conjugate gradient and Newton's method.[1]

[1] The standard gradient-based algorithms are not directly applicable, because the objective function of L1 regularized logistic regression has discontinuous first derivatives. In our experiments, we used a smooth approximation of the L1 loss function.

We note that Lokhorst (1999) also proposed an algorithm that uses the IRLS formulation of logistic regression. However, he used a different LASSO algorithm (Osborne, Presnell, & Turlach 2000) in the inner loop and did not provide any formal convergence guarantees.

L1 Regularized Logistic Regression

We consider a supervised learning task where we are given M training instances {(x^(i), y^(i)), i = 1, ..., M}. Here, each x^(i) ∈ R^N is an N-dimensional feature vector, and y^(i) ∈ {0, 1} is a class label. Logistic regression models the probability distribution of the class label y given a feature vector x as follows:

    p(y = 1 \mid x; \theta) = \sigma(\theta^\top x) = \frac{1}{1 + \exp(-\theta^\top x)}.    (1)

Here θ ∈ R^N are the parameters of the logistic regression model, and σ(·) is the sigmoid function, defined by the second equality.

Under the Laplacian prior p(θ) = (β/2)^N exp(−β‖θ‖₁) (with β > 0), the maximum a posteriori (MAP) estimate of the parameters θ is given by:

    \min_\theta \; -\sum_{i=1}^{M} \log p(y^{(i)} \mid x^{(i)}; \theta) + \beta \|\theta\|_1.    (2)

This optimization problem is referred to as L1 regularized logistic regression. Often, it will be convenient to consider the following alternative parameterization of the L1 regularized logistic regression problem:

    \min_\theta \; -\sum_{i=1}^{M} \log p(y^{(i)} \mid x^{(i)}; \theta),    (3)
    \text{subject to } \|\theta\|_1 \le C.

The optimization problems (2) and (3) are equivalent, in the following sense: for any choice of β, there is a choice of C such that both optimization problems have the same minimizing argument. This follows from the fact that (up to a constant that does not depend on θ) the optimization problem (2) is the Lagrangian of the constrained optimization problem (3), where β is the Lagrange multiplier. (Rockafellar 1970) In practice, C (and/or β) may be either chosen manually or via cross-validation; for the sake of simplicity, in this paper we will consider solving the problem (3) for a given, fixed, value of C.

Our Learning Algorithm

In this section, we describe our learning algorithm for L1 regularized logistic regression. We also formally prove that our learning algorithm converges to the global optimum of the optimization problem (3).

Preliminaries

IRLS for unregularized logistic regression.  Our learning algorithm is based on iteratively reweighted least squares (IRLS). (Green 1984; Minka 2003) IRLS reformulates the problem of finding the step direction for Newton's method as a weighted ordinary least squares problem.

Consider the problem of finding the maximum likelihood estimate (MLE) of the parameters θ for the unregularized logistic regression model. The optimization problem can be written as:

    \min_\theta \; -\sum_{i=1}^{M} \log p(y^{(i)} \mid x^{(i)}; \theta).    (4)

In every iteration, Newton's method first finds a step direction by approximating the objective by the second order Taylor expansion at the current point, and optimizing the quadratic approximation in closed form. More specifically, let θ^(k) denote the current point in the k'th iteration. Then Newton's method finds a step direction by computing the optimum γ^(k) of the quadratic approximation at θ^(k) as follows:

    \gamma^{(k)} = \theta^{(k)} - H^{-1}(\theta^{(k)})\, g(\theta^{(k)}).    (5)

Here H(θ) and g(θ) represent the Hessian and gradient of the objective function in (4) evaluated at θ. Once the step direction γ^(k) is computed, Newton's method computes the next iterate

    \theta^{(k+1)} = (1 - t)\,\theta^{(k)} + t\,\gamma^{(k)}    (6)

by a line search over the step size parameter t.
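As a concrete illustration of Equations (5) and (6), the following minimal NumPy sketch performs one damped Newton iteration for the unregularized objective (4). It is our own illustration rather than the implementation used in the paper (which was written in MATLAB); it works directly with the negative log-likelihood, so the gradient and Hessian below carry the corresponding signs, and the simple halving backtracking search over t is an assumption.

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    def neg_log_likelihood(theta, X, y):
        # Objective (4): negative log-likelihood; X is N x M, one column per training instance.
        p = sigmoid(X.T @ theta)
        eps = 1e-12  # guard against log(0)
        return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

    def newton_step_direction(theta, X, y):
        # Step direction gamma^(k) of Equation (5), written for the negative log-likelihood.
        mu = sigmoid(X.T @ theta)       # predicted probabilities
        lam = mu * (1.0 - mu)           # diagonal of Lambda
        grad = -X @ (y - mu)            # gradient of (4)
        hess = (X * lam) @ X.T          # Hessian of (4): X Lambda X^T
        return theta - np.linalg.solve(hess, grad)

    def damped_newton_iteration(theta, X, y):
        # Equation (6): theta^(k+1) = (1 - t) theta^(k) + t gamma^(k),
        # with t chosen here by a simple halving backtracking search (an assumption).
        gamma = newton_step_direction(theta, X, y)
        f0 = neg_log_likelihood(theta, X, y)
        t = 1.0
        while t > 1e-8:
            cand = (1.0 - t) * theta + t * gamma
            if neg_log_likelihood(cand, X, y) < f0:
                return cand
            t *= 0.5
        return theta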
We now show how the Newton step direction can be computed by solving a weighted ordinary least squares problem rather than using Equation (5). (See Green 1984, or Minka 2003, for details of this derivation.) Let X ∈ R^{N×M} denote the design matrix:

    X = [x^{(1)} \; x^{(2)} \; \ldots \; x^{(M)}].

Let the diagonal matrix Λ and the vector z be defined as follows: for all i = 1, 2, ..., M:

    \Lambda_{ii} = \sigma(\theta^{(k)\top} x^{(i)}) \, [1 - \sigma(\theta^{(k)\top} x^{(i)})],    (7)

    z_i = x^{(i)\top} \theta^{(k)} + \frac{[1 - \sigma(y^{(i)} \theta^{(k)\top} x^{(i)})] \, y^{(i)}}{\Lambda_{ii}}.    (8)

Then we have that H(θ^(k)) = −XΛX^⊤ and g(θ^(k)) = XΛ(z − X^⊤θ^(k)), and thus Equation (5) can be rewritten as:

    \gamma^{(k)} = (X \Lambda X^\top)^{-1} X \Lambda z.    (9)

Thus γ^(k) is the solution to the following weighted least squares problem:

    \gamma^{(k)} = \arg\min_\gamma \; \| \Lambda^{1/2} X^\top \gamma - \Lambda^{1/2} z \|_2^2.    (10)

Therefore, the Newton step direction can be computed by solving the least squares problem (10), and thus the logistic regression optimization problem can be solved using iteratively reweighted least squares (IRLS). This least squares formulation will play an important role in our algorithm. The term IRLS refers to the fact that in every iteration the least squares problem of Equation (10) has a different diagonal weighting matrix Λ and "right-hand side" z, but the same design matrix X. (Note that, although we did not make the dependence explicit, the diagonal matrix Λ and the vector z both depend on θ^(k) and therefore change in every iteration.)

IRLS for L1 regularized logistic regression.  For the case of L1 regularized logistic regression, as formulated in Equation (3), the objective is equal to the unregularized logistic regression objective. By augmenting the IRLS formulation of the unregularized logistic regression with the L1 constraint, we get our IRLS formulation for L1 regularized logistic regression (leaving out the dependencies on k for clarity):

    \min_\gamma \; \| \Lambda^{1/2} X^\top \gamma - \Lambda^{1/2} z \|_2^2,    (11)
    \text{subject to } \|\gamma\|_1 \le C.

Our algorithm (presented in more detail below) iteratively solves the L1 regularized logistic regression problem by (in every iteration) first solving the L1 constrained least squares problem (11) to find a step direction and then doing a line search to determine the step size. Thus our algorithm combines a local quadratic approximation with the global constraints.

LARS for solving L1 constrained least squares.  The L1 constrained least squares problem of Equation (11) is also known as a LASSO problem. (Tibshirani 1996) Several algorithms have been developed to solve L1 constrained least squares problems. In our algorithm, we use LARS with the LASSO modification. (Efron et al. 2004) LARS with the LASSO modification uses a stagewise procedure, based on a simple incremental update, to efficiently solve the L1 regularized least squares problem.

Our algorithm

Our algorithm iteratively solves the L1 constrained logistic regression problem of Equation (3). In the k'th iteration, it finds a step direction γ^(k) by solving the constrained least squares problem of Equation (11). Then it performs a backtracking line search (see Boyd & Vandenberghe, 2004, for details) over the step size parameter t ∈ [0, 1] to find the next iterate θ^(k+1) = (1 − t)θ^(k) + tγ^(k). To summarize, we propose the following algorithm, which we refer to as IRLS-LARS. (We discuss the stopping criterion used in more detail when discussing our experimental setup and results.)

Algorithm 1 IRLS-LARS
 1  Set θ^(0) = 0.
 2  for (k = 0 to MaxIterations)
 3    Compute Λ and z using Equation (7) and Equation (8) (for θ = θ^(k)).
 4    Use LARS to solve the L1 constrained least squares problem of Equation (11), and let γ^(k) be the solution.
 5    Set θ^(k+1) = (1 − t)θ^(k) + tγ^(k), where t is found using a backtracking line-search that minimizes the objective function given in Equation (3).
 6    Evaluate the objective function, given in Equation (3), at θ^(k+1).
 7    if (the stopping criterion is satisfied)
 8      Break;
 9    end
10  end

We note that our algorithm can be applied not only to L1 constrained logistic regression, but to any L1 constrained optimization problem, so long as the unconstrained problem can be solved using IRLS. In particular, our algorithm can be used for parameter learning for L1 constrained generalized linear models. (Nelder & Wedderburn 1972; McCullagh & Nelder 1989)

Convergence and optimality guarantees

The following theorem shows that the IRLS-LARS algorithm presented in this paper converges to a global optimum.[2]

[2] In fact, although the theorem statement is specific to our algorithm, the proof easily generalizes to any algorithm that iteratively minimizes the local quadratic approximation subject to the constraints (followed by a line search), so long as the objective function has a continuous 3rd order derivative and the feasible set is convex, closed and bounded.

Theorem 1. The IRLS-LARS algorithm presented in the paper converges to a global optimum of the L1 constrained regularized logistic regression problem of Equation (3).
Proof. Our convergence proof uses the following fact:

Fact 1. Let any point θ0 that is not a global optimum be given. Then there is an open subset S_{θ0} that includes θ0, and there is a constant K_{θ0} > 0, such that the following holds: for all feasible φ0 ∈ S_{θ0} we have that one iteration of our algorithm, starting from φ0, will give us a (feasible) point φ1 that satisfies f(φ1) ≤ f(φ0) − K_{θ0}.

Proof of Fact 1. We have a convex objective function f with convex constraints. Since θ0 is not a global optimum, any open set that includes θ0 includes a feasible point with strictly lower value of the objective function f. (In fact, all points on a line segment connecting θ0 with a global optimum are feasible and have a strictly lower value of the objective function f.) The objective f is convex and has bounded second derivative over the feasible set; thus, for any sufficiently small open set that includes θ0, we have that this set includes a feasible point with strictly lower value of the local quadratic approximation at θ0, which we denote by f̃_{θ0}. Thus, when our algorithm globally optimizes the quadratic approximation over the feasible set, the resulting point θ1 must satisfy f̃_{θ0}(θ1) < f̃_{θ0}(θ0). Now define δ > 0 as follows:

    \delta = \tilde{f}_{\theta_0}(\theta_0) - \tilde{f}_{\theta_0}(\theta_1).    (12)

Since f is three times continuously differentiable and since the feasible set is bounded, there exists ε > 0 such that for all φ0 ∈ S_{θ0} = {θ : ‖θ − θ0‖2 ≤ ε} the following hold:

    |f(\theta_0) - f(\phi_0)| \le \frac{\delta}{8},    (13)

and for all θ in the feasible set:

    |\tilde{f}_{\theta_0}(\theta) - \tilde{f}_{\phi_0}(\theta)| \le \frac{\delta}{8}.    (14)

From Equation (14) we have for all φ0 ∈ S_{θ0} that

    |\min_\theta \tilde{f}_{\theta_0}(\theta) - \min_\theta \tilde{f}_{\phi_0}(\theta)| \le \frac{\delta}{4}.    (15)

Therefore, for all φ0 ∈ S_{θ0} the following holds. Let φ1 be the global optimum (within the feasible set) of the quadratic approximation at φ0; then we have that:

    \tilde{f}_{\phi_0}(\phi_1) \le \tilde{f}_{\theta_0}(\theta_1) + \frac{\delta}{4}
                               \le \tilde{f}_{\theta_0}(\theta_0) - \delta + \frac{\delta}{4}
                               = f(\theta_0) - \frac{3\delta}{4}
                               = f(\phi_0) + f(\theta_0) - f(\phi_0) - \frac{3\delta}{4}
                               \le f(\phi_0) + \frac{\delta}{8} - \frac{3\delta}{4}
                               \le f(\phi_0) - \frac{\delta}{2}.    (16)

Here we used, in order: Equation (15); Equation (12); simplification (f̃_{θ0}(θ0) = f(θ0)); adding and subtracting the same term; Equation (13); simplification.

Let t be the step size resulting from the line search. Then we have that:

    \tilde{f}_{\phi_0}(\phi_0 + t(\phi_1 - \phi_0)) \le (1 - t)\,\tilde{f}_{\phi_0}(\phi_0) + t\,\tilde{f}_{\phi_0}(\phi_1)
                                                    \le f(\phi_0) - t\,\frac{\delta}{2}.    (17)

Here we used, in order: f̃_{φ0} is convex; f̃_{φ0}(φ0) = f(φ0) and Equation (16).

Since f is three times continuously differentiable and the feasible set is bounded, there exists C > 0 such that for all φ0 ∈ S_{θ0} the following holds for all θ in the feasible set:

    f(\theta) \le \tilde{f}_{\phi_0}(\theta) + C \|\theta - \phi_0\|_2^3.    (18)

Now we choose θ = φ0 + t(φ1 − φ0) in Equation (18), and we use Equation (17) to obtain:

    f(\phi_0 + t(\phi_1 - \phi_0)) \le f(\phi_0) - t\,\frac{\delta}{2} + C \|t(\phi_1 - \phi_0)\|_2^3.    (19)

Let R be the radius of a sphere that includes the (bounded) feasible set; then we get that

    f(\phi_0 + t(\phi_1 - \phi_0)) \le f(\phi_0) - t\,\frac{\delta}{2} + C t^3 (2R)^3.    (20)

Choosing t = \sqrt{\delta / (48 C R^3)} gives us that for all φ0 ∈ S_{θ0}, our algorithm finds a φ* such that:

    f(\phi^*) \le f(\phi_0) - \sqrt{\frac{\delta^3}{192 C R^3}}.

Choosing K_{θ0} = \sqrt{\delta^3 / (192 C R^3)} proves Fact 1.

Now that we have established Fact 1, we are ready to prove the theorem. Let any δ > 0 be fixed. Let P_δ = {θ : θ is feasible, and ‖θ − θ*‖ ≥ δ for all global optima θ*}. We will show convergence to a global optimum by showing that the algorithm can only spend a finite number of iterations in the set P_δ.

From Fact 1 we have that for all θ ∈ P_δ, there exists an open set S_θ and a positive constant K_θ > 0 such that the algorithm improves the objective by at least K_θ in one iteration, when started from a point in the set S_θ. Since such a set S_θ exists for all θ ∈ P_δ, we have that

    P_\delta \subseteq \bigcup_{\theta \in P_\delta} S_\theta.

Since the set P_δ is a closed and bounded subset of R^N, we have (from the Heine-Borel theorem) that P_δ is compact. By definition of compactness, for every open cover of a compact set there exists a finite sub-cover. (See, e.g., Royden, 1988, for more details.) Thus, there is a finite set Q_δ such that

    P_\delta \subseteq \bigcup_{\theta \in Q_\delta} S_\theta.    (21)

Since Q_δ is finite, we have that there is some C_δ > 0 such that

    \inf_{\theta \in Q_\delta} K_\theta = C_\delta.

Since Q_δ covers P_δ, we have that for all points θ ∈ P_δ our algorithm improves the objective by at least C_δ in every iteration. The objective is bounded, so this means the algorithm has to leave the set P_δ after a finite number of iterations. Since this holds true for arbitrary δ > 0, we have shown that the algorithm converges to a global optimum.
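To make Algorithm 1 concrete before turning to the experiments, here is a minimal Python sketch of the IRLS-LARS loop. This is our own sketch, not the authors' MATLAB implementation: it assumes scikit-learn's lars_path is available and uses it (LARS with the LASSO modification) for the inner problem of Equation (11), picking the point on the LASSO path whose L1 norm equals the budget C by interpolating between path breakpoints; the step size comes from a simple halving backtracking search, and a relative-improvement test stands in for the stopping criterion discussed with the experimental setup. Helper names such as l1_constrained_lstsq are ours.

    import numpy as np
    from sklearn.linear_model import lars_path

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    def nll(theta, X, y):
        # Objective of (3): negative log-likelihood; X is N x M, one column per instance.
        p = sigmoid(X.T @ theta)
        eps = 1e-12  # guard against log(0)
        return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

    def l1_constrained_lstsq(A, b, C):
        # Solve min_gamma ||A gamma - b||_2^2 subject to ||gamma||_1 <= C (Equation (11))
        # by walking the LARS/LASSO path and interpolating to the L1 budget C.
        if C <= 0:
            return np.zeros(A.shape[1])
        _, _, coefs = lars_path(A, b, method='lasso')   # coefs: (num_features, num_breakpoints)
        norms = np.abs(coefs).sum(axis=0)
        if norms[-1] <= C:
            return coefs[:, -1]                          # constraint is inactive
        k = int(np.searchsorted(norms, C))               # first breakpoint past the budget
        w = (C - norms[k - 1]) / (norms[k] - norms[k - 1])
        return (1 - w) * coefs[:, k - 1] + w * coefs[:, k]

    def irls_lars(X, y, C, max_iter=50, tol=1e-6):
        N, M = X.shape
        theta = np.zeros(N)                              # step 1: theta^(0) = 0
        prev = nll(theta, X, y)
        for _ in range(max_iter):                        # step 2
            mu = sigmoid(X.T @ theta)
            lam = np.clip(mu * (1.0 - mu), 1e-8, None)   # Equation (7), clipped for stability
            z = X.T @ theta + (y - mu) / lam             # working response, cf. Equation (8),
                                                         # written in the (y - mu)/Lambda form for {0,1} labels
            A = (np.sqrt(lam) * X).T                     # Lambda^{1/2} X^T  (M x N)
            b = np.sqrt(lam) * z                         # Lambda^{1/2} z
            gamma = l1_constrained_lstsq(A, b, C)        # step 4: LARS for Equation (11)
            # step 5: simple backtracking line search over t in [0, 1]
            t, best, best_val = 1.0, theta, prev
            while t > 1e-8:
                trial = (1.0 - t) * theta + t * gamma
                val = nll(trial, X, y)
                if val < prev:
                    best, best_val = trial, val
                    break
                t *= 0.5
            theta = best                                 # steps 6-8: accept and test progress
            if prev - best_val <= tol * max(1.0, abs(prev)):
                break
            prev = best_val
        return theta

Note that each iterate stays feasible: θ^(0) = 0 and γ^(k) both lie in the L1 ball of radius C, so any convex combination (1 − t)θ^(k) + tγ^(k) does as well.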
Experimental Results

We compared the computational efficiency of our algorithm with the eight other algorithms listed in Table 1. We tested each algorithm's performance on 12 different datasets, consisting of 9 UCI datasets (Newman et al. 1998), one artificial dataset called Madelon from the NIPS 2003 workshop on feature extraction,[3] and two gene expression datasets (Microarray 1 and 2).[4] Table 2 gives details on the number of training examples and the number of features for each dataset.

[3] Madelon is available at: http://www.clopinet.com/isabelle/Projects/NIPS2003/.

[4] Microarray 1 is a colon cancer microarray dataset available at: http://microarray.princeton.edu/oncology/. Microarray 2 is a leukemia microarray dataset available at: http://www.broad.mit.edu/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=43.

    Name        Description
    IRLS-LARS   Our algorithm
    GenLASSO    Generalized LASSO (Roth 2004)
    SCGIS       Sequential Conditional GIS (Goodman 2004)
    Gl1ce       Gl1ce (Lokhorst 1999)
    Grafting    Grafting (Perkins & Theiler 2003)
    CG-L1       Conjugate gradient with exact L1
    NM-epsL1    Newton's method with epsL1
    CG-epsL1    Conjugate gradient with epsL1
    CG-Huber    Conjugate gradient with Huber

Table 1: The nine algorithms that were used in our analysis. (See text for details.)

    Name                      Number of features   Number of samples
    Arrhythmia                279                  420
    Hepatitis                 19                   80
    Ionosphere                33                   351
    Promoters                 57                   106
    Spambase                  57                   4601
    Spect                     22                   267
    Spectf                    44                   349
    Splice                    60                   1527
    Wisconsin breast cancer   30                   569
    Madelon                   500                  2600
    Microarray1               2000                 62
    Microarray2               3571                 72

Table 2: The 12 datasets we used in our experiments. (See text for details.)

Experimental details on the other algorithms

We compared our algorithm (IRLS-LARS) to four previously published algorithms: Grafting (Perkins & Theiler 2003), Generalized LASSO (Roth 2004), SCGIS (Goodman 2004), and Gl1ce (Lokhorst 1999).

Three of these algorithms (Gl1ce, SCGIS and Grafting) and our algorithm (IRLS-LARS) were implemented in MATLAB. In our implementations, we went through great efforts to carefully implement all the optimization techniques suggested by the corresponding papers. We used Volker Roth's C-implementation for GenLASSO.[5]

[5] The code for GenLASSO is available at: http://www.informatik.uni-bonn.de/~roth/GenLASSO/index.html. His implementation contained some unnecessary steps (for the purposes of our evaluation), such as automatically performing the cross-validation for determining the regularization parameter C. We went carefully through his code and removed these additional computations.

We also compared IRLS-LARS to standard optimization algorithms, such as conjugate gradient method and Newton's method. These algorithms are not directly applicable to L1 logistic regression because the objective function has discontinuous first derivatives. Thus, we used a smooth approximation of the L1 penalty term. More specifically, we have that the L1 term is given by

    \|\theta\|_1 = \sum_{j=1}^{N} |\theta_j|.

We used the "epsL1" function, defined as

    \mathrm{epsL1}(\theta_j, \epsilon) = \sqrt{\theta_j^2 + \epsilon},

to obtain a smooth approximation for each of the terms |θ_j|. We similarly used the Huber loss function (see, e.g., Boyd & Vandenberghe 2004):

    \mathrm{Huber}(\theta_j, \epsilon) =
    \begin{cases}
      \theta_j^2 / (2\epsilon)   & \text{if } |\theta_j| < \epsilon \\
      |\theta_j| - \epsilon / 2  & \text{otherwise.}
    \end{cases}

We also tested conjugate gradient with the exact (non-differentiable) L1 norm penalty (CG-L1). In this case, we treated the first derivative of the L1 terms as zero when evaluated at zero. More specifically, we treated ∂|θ_j|/∂θ_j as being zero for θ_j = 0.

We went through great efforts to ensure a fair comparison by optimizing the performance of all other algorithms. We carefully profiled all algorithms and made optimizations accordingly. We also spent significant amounts of time tuning each algorithm's parameters (when applicable). For example, for the algorithms that are based on conjugate gradient and Newton's method, we extensively tuned the parameters of the line-search algorithm; for the algorithms using conjugate gradient (Grafting, CG-epsL1, CG-Huber and CG-L1), we tested both our own conjugate gradient implementation as well as the MATLAB optimization toolbox's conjugate gradient; for the algorithms that use the approximate L1 penalty term, we tried many choices for ε in the range 10⁻¹⁵ < ε < 0.01 and chose the one with the shortest running time; etc. In all cases, we are reporting the running times for those parameter settings (and other choices) which resulted in the shortest running times.
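For reference, the epsL1 and Huber approximations described above, together with the gradients used by the smoothed baselines, can be written in a few lines of NumPy; this is our own illustrative code, not the MATLAB code used in the experiments.

    import numpy as np

    def eps_l1(theta, eps):
        # epsL1 approximation: sqrt(theta_j^2 + eps) per component, summed to approximate ||theta||_1.
        return np.sum(np.sqrt(theta ** 2 + eps))

    def huber_l1(theta, eps):
        # Huber approximation of |theta_j|: quadratic near zero, linear elsewhere.
        a = np.abs(theta)
        return np.sum(np.where(a < eps, theta ** 2 / (2 * eps), a - eps / 2))

    def eps_l1_grad(theta, eps):
        # Gradient of the epsL1 penalty (smooth everywhere), as used by NM-epsL1 / CG-epsL1.
        return theta / np.sqrt(theta ** 2 + eps)

    def huber_l1_grad(theta, eps):
        # Gradient of the Huber penalty, as used by CG-Huber.
        return np.where(np.abs(theta) < eps, theta / eps, np.sign(theta))

Each smoothed baseline then minimizes the objective of Equation (2) with ‖θ‖₁ replaced by the sum of the chosen per-coordinate approximation.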
Choice of the regularization parameters

For a fixed dataset, the choice of the regularization parameter C (or β) greatly affects the running time of the algorithms. Thus, we ran 10-fold cross validation (for each dataset) and chose the regularization parameter that achieves the highest (hold-out) accuracy. Doing so ensures that we evaluate the algorithms in the regime we care about for classification.

In particular, we ran our algorithm (IRLS-LARS) to determine the regularization parameter C. This value of C was then used for all learning algorithms that use the constrained version of L1 regularized logistic regression, namely IRLS-LARS, GenLASSO, and Gl1ce.

The other algorithms solve the formulation with the L1 penalty term (or a smooth version thereof), as given in Equation (2). As discussed in more detail in our description of L1 regularized logistic regression, the optimization problems of Equation (2) and Equation (3) are equivalent, in the following sense: for any choice of C, there is a choice of β such that the solution to both problems is the same. We now show how to find β explicitly for a given value of C.

From the Karush-Kuhn-Tucker (KKT) conditions, we have that the optimal primal solution θ* and the optimal dual solution β* (of the constrained optimization problem) satisfy the following condition:

    \frac{\partial \sum_{i=1}^{M} \log p(y^{(i)} \mid x^{(i)}; \theta)}{\partial \theta_j} \bigg|_{\theta^*} = \beta^* \, \mathrm{sign}(\theta_j^*)    (22)

for all θ_j* ≠ 0. From Lagrange duality, we also have that the optimal primal solution θ* is given by

    \theta^* = \arg\min_\theta \; -\sum_{i=1}^{M} \log p(y^{(i)} \mid x^{(i)}; \theta) + \beta^* \|\theta\|_1.    (23)

Thus, by choosing

    \beta = \beta^* = \mathrm{sign}(\theta_j^*) \, \frac{\partial \sum_{i=1}^{M} \log p(y^{(i)} \mid x^{(i)}; \theta)}{\partial \theta_j} \bigg|_{\theta^*}    (24)

(for any j such that θ_j* ≠ 0), we ensure that the optimal solution of the formulation with the L1 penalty term is the same as the optimal solution of the formulation with the L1 constraint (‖θ‖₁ ≤ C).[6]

[6] Effectively, the KKT condition as in (22) is not exactly satisfied for the approximate L1 functions. However, the typical ε value turned out to be very small (≈ 10⁻¹⁰), such that the resulting approximate L1 functions were very close to the true L1 functions.

Running time

The efficiency of an algorithm was measured based on the running time, i.e., the CPU time measured in seconds until the algorithm meets the stopping criterion. Each algorithm was run on a Linux machine with a 2GHz AMD Opteron and 4.6GB RAM.[7]

[7] For the runs that took longer than 500 seconds, we truncated the running time to 500 seconds.

Stopping criterion

For each algorithm, the stopping criterion is based on how close the current objective function value (denoted by obj) is to the optimal objective function value (denoted by obj*). For each dataset, we first compute obj* as follows. First, we run our algorithm (IRLS-LARS) for an extensive amount of time to obtain the solution θ*. Then, we use θ* to compute β* from Equation (24). Finally, we verify whether the KKT optimality conditions (see Boyd & Vandenberghe, 2004) are satisfied for the values of θ* and β* we found. In all cases, we found that the KKT conditions were satisfied up to a very small error.[8]

[8] In all cases, the primal constraint (‖θ‖₁ ≤ C) was satisfied with an error less than 10⁻¹³, and Equation (24) was satisfied with an error less than 5 × 10⁻⁸. All the other KKT conditions were satisfied exactly. We performed the same procedure using GenLASSO and obtained similar results. In all cases, the absolute difference between the obj* obtained from running IRLS-LARS and the obj* obtained from running GenLASSO was smaller than 10⁻⁸, and the relative difference was always smaller than 10⁻¹⁰.

Using the optimal objective obj*, we defined the stopping criterion as:

    \frac{|\mathrm{obj}^* - \mathrm{obj}|}{|\mathrm{obj}^*|} < \tau.    (25)

We used τ = 10⁻⁶ for our algorithm, Grafting, SCGIS, and GenLASSO. For the algorithms with the approximate L1 functions, we used a less stringent bound (τ = 10⁻⁵), since the objective function with the approximate L1 penalty term is slightly different from the original one formulated in Equation (2).

Discussion of experimental results

Figure 1 shows the results for the five algorithms specifically designed for L1 regularized logistic regression (IRLS-LARS, Grafting, SCGIS, GenLASSO and Gl1ce) on 12 datasets. Since the running times for the standard optimization methods (CG-L1, NM-epsL1, CG-epsL1 and CG-Huber) tend to be significantly higher, we do not report them in the same figure. (Doing so allows us to visualize more clearly the differences in performance for the first set of algorithms.) We report the performance of the standard optimization algorithms (and our algorithm) in Table 3. In all cases we report the average running time (in seconds) over 100 trials of the same experiment (dataset and algorithm combination).

The results reported in Figure 1 show that our algorithm (IRLS-LARS) was more efficient than all other algorithms in 11 out of 12 datasets. On one dataset (Microarray2) GenLASSO and Gl1ce were slightly faster. All other algorithms were slower than our algorithm on every dataset.

IRLS-LARS significantly outperformed Grafting and SCGIS. More specifically, in 8 (out of 12) datasets our method was more than 8 times faster than Grafting. (In the other 4, it was also faster, although not by a factor of 8.) Our method was at least 10 times faster than SCGIS in 10 out of 12 datasets.

GenLASSO and Gl1ce are the strongest competitors, and even then our algorithm is more than twice as fast as Gl1ce for 5 out of 12 datasets and more than twice as fast as GenLASSO for 8 out of 12 datasets.
We note that (as opposed to ours and all the other methods) GenLASSO is highly optimized C code (linked with highly optimized numerical libraries). It is widely accepted that C code is faster than MATLAB code, and thus we are giving GenLASSO a somewhat unfair advantage in our comparison.[9]

[9] We have also implemented GenLASSO in MATLAB. However, the extensive optimization tricks used in the C code and numerical libraries were hard to duplicate in MATLAB, making it hard to evaluate GenLASSO correctly with our MATLAB code. Since C code is widely accepted to be at least as fast as MATLAB code, we decided to report the C code results (possibly favoring GenLASSO somewhat in our comparisons).

    Data          IRLS-LARS   NM-epsL1   CG-epsL1   CG-Huber   CG-L1
    Arrhythmia    0.0420      2.8450     14.384     13.660     14.065
    Hepatitis     0.0046      0.0258     0.0819     0.6580     0.6347
    Ionosphere    0.0516      0.1534     0.1746     0.9717     0.9237
    Promoters     0.0182      0.0632     0.2047     0.8418     0.7328
    Spambase      4.0804      21.172     10.151     13.410     12.016
    Spect         0.0345      0.0704     0.1206     0.7651     0.7293
    Spectf        0.0646      0.2330     0.5493     1.1761     1.1235
    Splice        0.8376      2.9828     0.9424     4.3407     4.8007
    WBC           0.1222      0.7225     4.9592     1.4854     1.3777
    Madelon       0.8602      55.060     113.22     95.223     104.50
    Microarray1   0.0330      242.88     500.00     500.00     500.00
    Microarray2   0.4069      500.00     500.00     500.00     500.00

    (WBC: Wisconsin breast cancer)

Table 3: The running time (seconds) of IRLS-LARS, CG-L1 and three algorithms with the approximate L1 functions (NM-epsL1, CG-epsL1 and CG-Huber). (See text for details.)

As reported in Table 3, conjugate gradient and Newton's method are an order of magnitude slower than our IRLS-LARS algorithm on most of the datasets (despite the less stringent stopping criterion). Among these methods, NM-epsL1 seems to be the second best algorithm on the datasets with a small number of features. However, this method quickly becomes inefficient as the number of features increases. The other methods similarly suffered from the increasing number of features. We observe that CG-L1 is very competitive with CG-epsL1 and CG-Huber.

Conclusions

In this paper, we proposed a new algorithm for learning L1 regularized logistic regression. Our theoretical results show that our algorithm is guaranteed to converge to a global optimum. Our experiments showed that our algorithm is computationally very efficient. In particular, it outperforms standard algorithms for solving convex optimization problems, as well as other algorithms previously developed specifically to solve the L1 regularized logistic regression problem.

Acknowledgments

We give warm thanks to John Lafferty, Alexis Battle, Ben Packer, and Vivek Farias for helpful comments. H. Lee was supported by a Stanford Graduate Fellowship (SGF). This work was supported in part by the Department of the Interior/DARPA under contract number NBCHD030010.

References

Boyd, S., and Vandenberghe, L. 2004. Convex Optimization. Cambridge University Press.

Efron, B.; Hastie, T.; Johnstone, I.; and Tibshirani, R. 2004. Least angle regression. Annals of Statistics 32(2):407–499.

Goodman, J. 2004. Exponential priors for maximum entropy models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Green, P. J. 1984. Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. Journal of the Royal Statistical Society, Series B, Methodological 46:149–192.

Lokhorst, J. 1999. The LASSO and Generalised Linear Models. Honors Project, Department of Statistics, The University of Adelaide, South Australia, Australia.

McCullagh, P., and Nelder, J. A. 1989. Generalized Linear Models. Chapman & Hall.

Minka, T. P. 2003. A comparison of numerical optimizers for logistic regression. http://research.microsoft.com/~minka/papers/logreg/.

Nelder, J. A., and Wedderburn, R. W. M. 1972. Generalized linear models. Journal of the Royal Statistical Society, Series A, General 135(3):370–384.

Newman, D. J.; Hettich, S.; Blake, C. L.; and Merz, C. 1998. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html.

Ng, A. Y. 2004. Feature selection, L1 vs. L2 regularization, and rotational invariance. In International Conference on Machine Learning.

Osborne, M.; Presnell, B.; and Turlach, B. 2000. On the lasso and its dual. Journal of Computational and Graphical Statistics 9:319–337.

Perkins, S., and Theiler, J. 2003. Online feature selection using grafting. In International Conference on Machine Learning.

Rockafellar, R. T. 1970. Convex Analysis. Princeton University Press.

Roth, V. 2004. The generalized LASSO. IEEE Transactions on Neural Networks 15(1):16–28.

Royden, H. 1988. Real Analysis. Prentice Hall.

Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58(1):267–288.
Figure 1: Comparison of the running time (seconds) among five algorithms (IRLS-LARS, Grafting, SCGIS, GenLASSO and Gl1ce) over 12 datasets. Each plot corresponds to one dataset. The name of the dataset is indicated at the top of the plot, and the number of features (N) and the number of training instances (M) are shown as "Name of the data: N × M". For each experiment, we report the running time in seconds, and the running time relative to IRLS-LARS in parentheses. (See text for discussion.)
