Scalable Algorithms for Structured Functions
Hao Yan
February 29, 2024
Outline
1 Overview
  Regularized Likelihood
2 Subgradient
  Algorithm for Subgradient Descent
3 Coordinate Descent
  Coordinate Descent
  Block Coordinate Descent
4 Proximal Gradient: Handling Constraints and Regularization
  Proximal Gradient
Section 1
Overview
Blessing of Dimensionality
Overfitting Problem
Overfitting: the model has low training error but high prediction error.
Using too many features can lead to overfitting.
Training data and testing data are generated from the same distribution: (x_i, y_i) ∼ D.
Why Regularization?
From a computational perspective, regularization can lend stability to the optimization problem and can lead to algorithmic speed-ups.
From a geometric point of view, regularization helps the solution stay close to a low-dimensional manifold.
From a statistical point of view, regularization avoids over-fitting and leads to estimators with interesting guarantees on their error.
From an engineering or domain-system point of view, regularization leads to solutions with better desired properties. It is a way to put human knowledge into the learning system.
Ridge Regression:

\min_\theta \|y - X\theta\|_2^2 + \lambda \|\theta\|_2^2
[Figure: constraint regions ‖β‖_q ≤ t in the (β1, β2) plane for q = 0, 1/2, 3/4 (top row) and q = 1, 3/2, 2 (bottom row).]
Summary

             Convex?   Corners?
‖·‖_0        No        Yes
‖·‖_{1/2}    No        Yes
‖·‖_{3/4}    No        Yes
‖·‖_{3/2}    Yes       No
‖·‖_2        Yes       No
Best of both worlds: ‖·‖_1 (convex and has corners)
[Figure: the ℓ1 ball ‖β‖_1 ≤ t in the (β1, β2) plane.]
Lasso Regression
Equivalently,

\min_\theta \sum_{i=1}^n \bigl\{ y_i - (\theta_1 x_{1i} + \cdots + \theta_p x_{pi}) \bigr\}^2 + \lambda \sum_{j=1}^p |\theta_j|

In matrix form,

\min_\theta \|y - X\theta\|^2 + \lambda \|\theta\|_1

Challenge: non-differentiable at \theta_j = 0.
\min_x \sum_{i=1}^n \underbrace{\max(0,\, 1 - y_i x^T \xi_i)}_{\text{hinge loss}} + \underbrace{\tfrac{1}{2}\|x\|^2}_{\text{L2 regularization}}

The hinge loss function is a convex relaxation of the 0-1 loss function.
Non-differentiable when y_i x^T \xi_i = 1.
Types of Regularization

Property                 Regularizer R           Form
Sparse solution          L1 norm                 \sum_{j=1}^p |x_j|
Group sparsity           Group L1 norm           \sum_{G \in \mathcal{G}} \|x_G\|_2
Continuity with jumps    L1 norm of gradient     \sum_j |x_j - x_{j+1}|
Smoothness               Smoothness penalty      \sum_j (x_j - 2x_{j+1} + x_{j+2})^2
Low-rank matrices        Nuclear norm            \sum_j \sigma_j(X)
Small magnitude          Ridge penalty           \|x\|^2
Section 2
Subgradient
Wolfe's example

f(x_1, x_2) = \begin{cases} 5(9x_1^2 + 16x_2^2)^{1/2} & x_1 > |x_2| \\ 9x_1 + 16|x_2| & x_1 \le |x_2| \end{cases}
Subgradient
Convexity ⇒ continuity (on the interior of the domain of the function)
Convexity ⇏ differentiability (e.g., ‖x‖_1)
Subgradients generalize gradients to general convex functions:
v is a subgradient of f at x if f(x′) ≥ f(x) + v^T(x′ − x) for all x′.
Subdifferential: ∂f(x) = {all subgradients of f at x}
If f is differentiable, ∂f(x) = {∇f(x)}.
Subgradients correspond to supporting hyperplanes of f at x.
Basic Rules
‖x‖_p = max_{‖z‖_q ≤ 1} z^T x   (dual norm representation, with 1/p + 1/q = 1)
Subgradient Properties
Definition
A vector g ∈ R^n is a subgradient of f : R^n → R at x if, for all z,
f(z) ≥ f(x) + g^T(z − x).

Corollary
If 0 ∈ ∂f(x), then x is a global minimizer of f.
Subgradient Method
Subgradient method: x^(k+1) = x^(k) − α_k g̃^(k), where
x^(k) is the kth iterate,
g̃^(k) is any (unbiased estimate of a) subgradient of the convex function f at x^(k).
Typically, g̃^(k) is computed on a mini-batch of samples ξ_i, i = 1, ..., B.
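A minimal NumPy sketch of the iteration above; the example objective ‖x‖_1, its sign-based subgradient, and the 1/√k step schedule are illustrative assumptions rather than choices from the slides.

```python
import numpy as np

def subgradient_method(subgrad, x0, steps, n_iter=1000):
    """Minimal subgradient method: x_{k+1} = x_k - alpha_k * g_k.

    subgrad(x) returns any subgradient of f at x; steps(k) returns alpha_k.
    The method is not a descent method, so the whole trajectory is returned
    and the best iterate is usually tracked separately.
    """
    x = np.asarray(x0, dtype=float)
    trajectory = [x.copy()]
    for k in range(1, n_iter + 1):
        g = subgrad(x)
        x = x - steps(k) * g
        trajectory.append(x.copy())
    return x, trajectory

# Example: minimize f(x) = ||x||_1; np.sign gives a valid subgradient (0 at 0).
if __name__ == "__main__":
    f_subgrad = lambda x: np.sign(x)
    x_last, traj = subgradient_method(f_subgrad,
                                      x0=np.array([3.0, -2.0]),
                                      steps=lambda k: 1.0 / np.sqrt(k))
    best = min(traj, key=lambda x: np.abs(x).sum())
    print("best f value:", np.abs(best).sum())
```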
The negative gradient −∇f(x) is always a descent direction (except where ∇f(x) = 0), but a negative subgradient −g, g ∈ ∂f(x), is not necessarily a descent direction for f.
Property
If f is convex, f(z) < f(x), and g ∈ ∂f(x), then for small enough t > 0,
‖x − t g − z‖_2 < ‖x − z‖_2;
thus −g is a descent direction for ‖x − z‖_2, for any z with f(z) < f(x) (e.g., x*).
Convergence Rate

Theorem
Suppose f_best^(k) is the best function value obtained up to the kth iteration. Then

f_{\mathrm{best}}^{(k)} - f^* \le \frac{R^2 + G^2 \sum_{i=1}^k \alpha_i^2}{2 \sum_{i=1}^k \alpha_i},

where R = ‖x^(1) − x*‖_2 and we assume ‖g‖_2 ≤ G.

Apply the basic inequality recursively:

0 \le \|x^{(k+1)} - x^*\|_2^2 \le \|x^{(1)} - x^*\|_2^2 - 2\sum_{i=1}^k \alpha_i \bigl(f(x^{(i)}) - f^*\bigr) + \sum_{i=1}^k \alpha_i^2 \|g^{(i)}\|_2^2
\le R^2 - 2\sum_{i=1}^k \alpha_i \bigl(f(x^{(i)}) - f^*\bigr) + G^2 \sum_{i=1}^k \alpha_i^2
\le R^2 - 2\bigl(f_{\mathrm{best}}^{(k)} - f^*\bigr)\sum_{i=1}^k \alpha_i + G^2 \sum_{i=1}^k \alpha_i^2.

Rearranging, we have

f_{\mathrm{best}}^{(k)} - f^* \le \frac{R^2 + G^2 \sum_{i=1}^k \alpha_i^2}{2 \sum_{i=1}^k \alpha_i}.
Step-size Choice
Important that step sizes go to zero, but not too fast; e.g., α_k = 1/k works (it satisfies Σ_k α_k = ∞ and Σ_k α_k^2 < ∞).
Major difference from gradient descent: step sizes are typically pre-specified, not adaptively computed (no line search!).
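As a worked consequence of the theorem above (a sketch that takes the bound at face value, not a statement from the slides): with a constant step size α_i = α,

f_{\mathrm{best}}^{(k)} - f^* \le \frac{R^2 + G^2 k \alpha^2}{2 k \alpha},

and minimizing the right-hand side over α gives α = R/(G\sqrt{k}), so that f_{\mathrm{best}}^{(k)} - f^* \le RG/\sqrt{k}; reaching accuracy ε therefore takes on the order of (RG/ε)^2 iterations.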
Example 1: Lasso
Lasso:
\min_x \; \frac{1}{2}\|Ax - y\|_2^2 + \tau \|x\|_1
If A = I, this gives y − x = τ v with v ∈ ∂‖x‖_1.
This can be solved for each x_i individually:

y_i - x_i = \tau v_i \in \begin{cases} \{1\} & x_i > 0 \\ \{-1\} & x_i < 0 \\ [-1, 1] & x_i = 0 \end{cases}
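The per-coordinate condition above is exactly soft-thresholding. A minimal NumPy sketch (the function name and the A = I example values are assumptions for illustration):

```python
import numpy as np

def soft_threshold(y, tau):
    """Solve min_x 0.5*(y - x)**2 + tau*|x| elementwise: S_tau(y) = sign(y)*max(|y|-tau, 0)."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

# Example: with A = I the lasso solution is just soft-thresholding of y.
y = np.array([3.0, 0.4, -1.2, -0.1])
x_hat = soft_threshold(y, tau=0.5)
print(x_hat)   # [ 2.5  0.  -0.7  0. ] -- small entries are set exactly to zero
```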
Result
Ridge (left): uses gradients; Lasso (right): uses subgradients. The example here has n = 1000, p = 20.
Example 2: SVM

\min_x \sum_{i=1}^n \underbrace{\max(0,\, 1 - y_i x^T \xi_i)}_{\text{hinge loss}} + \underbrace{\tfrac{1}{2}\|x\|^2}_{\text{L2 regularization}}

Subgradient: \partial \tfrac{1}{2}\|x\|^2 = \{x\}, and

v_i \in \partial_x \max(0,\, 1 - y_i x^T \xi_i) = \begin{cases} \{0\} & y_i x^T \xi_i > 1 \\ \{-y_i \xi_i\} & y_i x^T \xi_i < 1 \\ \{-t\, y_i \xi_i : t \in [0, 1]\} & y_i x^T \xi_i = 1 \end{cases}

A full subgradient is g = x + \sum_i v_i.
Subgradient descent: x^(k+1) = x^(k) − τ g^(k).
Can be combined with stochastic optimization: choose an index i at random and use the stochastic subgradient g = x + v_i.
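A minimal sketch of batch subgradient descent for this SVM objective, using the subgradient above; the synthetic two-blob data, the diminishing step schedule, and the function name are assumptions for illustration.

```python
import numpy as np

def svm_subgradient_descent(Xi, y, n_iter=500, tau0=0.1):
    """Subgradient descent on sum_i max(0, 1 - y_i x^T xi_i) + 0.5*||x||^2."""
    n, d = Xi.shape
    x = np.zeros(d)
    for k in range(1, n_iter + 1):
        margins = y * (Xi @ x)                      # y_i x^T xi_i
        active = margins < 1                        # hinge active: subgradient -y_i xi_i
        g = x - Xi[active].T @ y[active]            # x + sum_i v_i
        x = x - (tau0 / np.sqrt(k)) * g             # diminishing step size
    return x

# Tiny synthetic example (assumed data): two Gaussian blobs with labels +/-1.
rng = np.random.default_rng(0)
Xi = np.vstack([rng.normal(+1, 1, (50, 2)), rng.normal(-1, 1, (50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])
x_hat = svm_subgradient_descent(Xi, y)
print("training accuracy:", np.mean(np.sign(Xi @ x_hat) == y))
```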
Section 3
Coordinate Descent
Coordinate Descent
Some of the core methods in optimization, and the main objects of study in this field, include
First-order methods
Newton's method
Dual methods
Today, we'll see an extremely simple technique that is surprisingly efficient and scalable.
It doesn't apply as broadly as the methods above, but it is interesting and useful when it does apply.
The idea is coordinate-wise minimization.
Coordinate-wise Minimization
Convergence
If f(x) is convex and differentiable, and f is minimized at x* along each coordinate axis, i.e., f(x* + d·e_i) ≥ f(x*) for all d and all i, then x* is a global minimizer.

Proof.
If f(x* + d·e_i) ≥ f(x*) for all d, then ∇_i f(x*) = 0 for each i. Therefore

\nabla f(x^*) = \Bigl( \frac{\partial f}{\partial x_1}(x^*), \cdots, \frac{\partial f}{\partial x_n}(x^*) \Bigr) = 0,

so x* is a global minimizer by convexity.
Convergence
[Figure: contour plot of a convex f over the (x1, x2) plane, illustrating coordinate descent.]
Convergence
Q: Same question, but for f(x) = g(x) + \sum_{i=1}^n h_i(x_i), with g convex and differentiable and each h_i convex (possibly non-differentiable)?
A: Yes! For any y,

f(y) - f(x) \ge \nabla g(x)^T (y - x) + \sum_{i=1}^n \bigl( h_i(y_i) - h_i(x_i) \bigr)
            = \sum_{i=1}^n \bigl[ \nabla_i g(x)(y_i - x_i) + h_i(y_i) - h_i(x_i) \bigr] \ge 0,

since each bracketed term is nonnegative when x minimizes f along coordinate i.
Choice of Ordering
Cyclic Order: run all coordinates in cyclic order
Random Sampling: Randomly select some coordinate to update
Gauss-Southwell: at each iteration, pick coordinate i = arg max_j |∇_j f(x^(t))|
Random Permutation: run cyclic order on a permuted index
Example: linear regression, f(x) = \tfrac{1}{2}\|Ax - y\|^2. Along coordinate i, optimality requires

0 = \nabla_i f(x) = A_i^T (Ax - y) = A_i^T (A_i x_i + A_{-i} x_{-i} - y),

solved by

x_i = \frac{A_i^T (y - A_{-i} x_{-i})}{\|A_i\|^2}.

Coordinate descent repeats this update for i = 1, ..., p, 1, 2, ..., p, ...
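A minimal sketch of this cyclic coordinate descent for least squares; the random test data and the comparison against np.linalg.lstsq are assumptions for illustration, and a faster variant using the residual trick appears below.

```python
import numpy as np

def cd_least_squares(A, y, n_sweeps=50):
    """Cyclic coordinate descent on f(x) = 0.5*||Ax - y||^2.

    Each coordinate update is the exact minimizer along that coordinate:
        x_i <- A_i^T (y - A_{-i} x_{-i}) / ||A_i||^2.
    This naive version recomputes A_{-i} x_{-i} from scratch, so one full
    sweep costs O(n p^2).
    """
    n, p = A.shape
    x = np.zeros(p)
    for _ in range(n_sweeps):
        for i in range(p):
            r_minus_i = y - A @ x + A[:, i] * x[i]      # y - A_{-i} x_{-i}
            x[i] = A[:, i] @ r_minus_i / (A[:, i] @ A[:, i])
    return x

# Quick check against the least-squares solution (assumed random data).
rng = np.random.default_rng(0)
A, y = rng.normal(size=(100, 20)), rng.normal(size=100)
x_cd = cd_least_squares(A, y)
x_ls = np.linalg.lstsq(A, y, rcond=None)[0]
print(np.max(np.abs(x_cd - x_ls)))   # should be tiny
```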
Complexity

x_i = \frac{A_i^T (y - A_{-i} x_{-i})}{\|A_i\|^2}

Naively, forming A_{-i} x_{-i} costs O(np) per coordinate update, i.e., O(np^2) for a full cycle over i = 1, ..., p.
A Clever Trick
Define r = y − Ax. Then

A_i^T (y - A_{-i} x_{-i}) = A_i^T (y - Ax + A_i x_i) = A_i^T r + A_i^T A_i x_i = A_i^T r + \|A_i\|^2 x_i,

so the update rule is given by

x_i = \frac{A_i^T (y - A_{-i} x_{-i})}{\|A_i\|^2} = \frac{A_i^T r}{\|A_i\|^2} + x_i.

Complexity: O(n) for each coordinate update, O(np) for a full cyclic update over i = 1, ..., p.
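A sketch of the same sweep with the residual trick; the key point is that r = y − Ax is updated in O(n) after each coordinate change, which is what makes the full cycle O(np). The random test data are an assumption for illustration.

```python
import numpy as np

def cd_least_squares_fast(A, y, n_sweeps=50):
    """Cyclic coordinate descent with the residual trick.

    Maintain r = y - Ax; then A_i^T(y - A_{-i}x_{-i}) = A_i^T r + ||A_i||^2 x_i,
    so each coordinate update costs O(n) instead of O(np).
    """
    n, p = A.shape
    x = np.zeros(p)
    r = y.copy()                                   # r = y - Ax with x = 0
    col_sq = np.einsum("ij,ij->j", A, A)           # precompute ||A_i||^2
    for _ in range(n_sweeps):
        for i in range(p):
            x_new = A[:, i] @ r / col_sq[i] + x[i]
            r -= A[:, i] * (x_new - x[i])          # keep r = y - Ax up to date
            x[i] = x_new
    return x

# Same assumed random data as before; result matches the naive version.
rng = np.random.default_rng(0)
A, y = rng.normal(size=(100, 20)), rng.normal(size=100)
print(np.max(np.abs(cd_least_squares_fast(A, y) - np.linalg.lstsq(A, y, rcond=None)[0])))
```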
Is it fast?
[Figure: f(k) − f* on a log scale (1e−10 to 1e+02) vs. iteration k (0 to 40) for coordinate descent (CD) and gradient descent (GD) on linear regression, 100 instances with n = 100, p = 20.]
For the lasso, fix all coordinates except x_i:

\frac{1}{2}\|y - Ax\|^2 + \lambda\|x\|_1 = \frac{1}{2}\|y - A_{-i}x_{-i} - A_i x_i\|^2 + \lambda\|x_{-i}\|_1 + \lambda|x_i|

The solution in x_i is given by soft-thresholding:

x_i = S_{\lambda/\|A_i\|^2}\!\left( \frac{A_i^T (y - A_{-i} x_{-i})}{\|A_i\|^2} \right)

If A_i is normalized, A_i^T 1 = 0 and A_i^T A_i = 1, then

x_i = S_\lambda\bigl( A_i^T (y - A_{-i} x_{-i}) \bigr)
Complexity Analysis
If A_i is normalized, A_i^T 1 = 0 and A_i^T A_i = 1, then

x_i = S_\lambda\bigl( A_i^T (y - A_{-i} x_{-i}) \bigr)

Update rule:

x_i = S_\lambda\bigl( A_i^T (y - Ax + A_i x_i) \bigr) = S_\lambda\bigl( x_i + A_i^T (y - Ax) \bigr)
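A minimal sketch of coordinate descent for the lasso, combining the soft-thresholding update with the residual trick; this version does not assume normalized columns, so it uses the S_{λ/‖A_i‖²} form from the previous slide. The synthetic sparse-recovery data are assumptions for illustration.

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def cd_lasso(A, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for 0.5*||y - Ax||^2 + lam*||x||_1.

    Uses the residual trick: with r = y - Ax,
        x_i <- S_{lam/||A_i||^2}( A_i^T r / ||A_i||^2 + x_i ).
    """
    n, p = A.shape
    x = np.zeros(p)
    r = y.copy()
    col_sq = np.einsum("ij,ij->j", A, A)
    for _ in range(n_sweeps):
        for i in range(p):
            z = A[:, i] @ r / col_sq[i] + x[i]
            x_new = soft_threshold(z, lam / col_sq[i])
            r -= A[:, i] * (x_new - x[i])
            x[i] = x_new
    return x

# Assumed sparse ground truth; lam controls how many coefficients are zeroed.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 50))
x_true = np.zeros(50); x_true[:5] = rng.normal(size=5) * 3
y = A @ x_true + 0.1 * rng.normal(size=200)
x_hat = cd_lasso(A, y, lam=5.0)
print("number of nonzero coefficients:", np.count_nonzero(x_hat))
```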
Note: even when the solution is desired at only one λ, the pathwise strategy is typically much more efficient than directly performing coordinate descent at that λ.
The active set strategy takes advantage of sparsity; e.g., for large problems, coordinate descent for lasso is much faster than it is for ridge regression.
With these strategies in place (and a few more clever tricks), coordinate descent can be competitive with the fastest algorithms for ℓ1-penalized minimization problems.
Fortran implementation glmnet, linked to R, MATLAB, etc.
\min_x \; \frac{1}{2}\sum_{i=1}^p \sum_{i'=1}^p (y_{ii'} - x_{ii'})^2 + \lambda_1 \sum_{i=1}^p \sum_{i'=1}^p |x_{ii'}| + \lambda_2 \sum_{i=1}^p \sum_{i'=2}^p |x_{i,i'} - x_{i,i'-1}| + \lambda_3 \sum_{i=2}^p \sum_{i'=1}^p |x_{i,i'} - x_{i-1,i'}|
\min_x \; \frac{1}{2}\sum_i (y_i - x_i)^2 + \lambda_1 \sum_{j=1}^p |x_j| + \lambda_2 \sum_{j=2}^p |x_j - x_{j-1}|

Solution:
Move not only individual parameters but also fuse them together in pairs and vary their common value.
Fix λ1, start with λ2 = 0, and solve the problem for incremental values of λ2.
A theorem guarantees that any two non-zero adjacent values that are fused at λ2 are also fused at any λ2′ > λ2.
Block Coordinate Descent
Minimize f(x), x ∈ R^p.
Divide x = (x_1, x_2, ..., x_n); here x_1, ..., x_n can be vectors/matrices.
Update each block of variables x_i individually:

x_1^{(k)} = \arg\min_{x_1} f(x_1, x_2^{(k-1)}, \dots, x_n^{(k-1)})
x_2^{(k)} = \arg\min_{x_2} f(x_1^{(k)}, x_2, \dots, x_n^{(k-1)})
\vdots
x_n^{(k)} = \arg\min_{x_n} f(x_1^{(k)}, x_2^{(k)}, \dots, x_n)

Typically used if the problem has some natural division of the variables.
Recommender Systems
Matrix Factorization Approach: A ≈ WH^T
\min_{W \in \mathbb{R}^{m \times k},\, H \in \mathbb{R}^{n \times k}} \; \sum_{(i,j) \in \Omega} (A_{ij} - w_i^T h_j)^2 + \lambda\bigl(\|W\|_F^2 + \|H\|_F^2\bigr)
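Block coordinate descent on this objective alternates between W and H (alternating least squares): with H fixed, each row w_i solves a small ridge regression over the observed entries of Ω in row i, and symmetrically for H. A minimal sketch; the boolean observation mask, the rank, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def als(A, mask, k=5, lam=0.1, n_iters=20, seed=0):
    """Block coordinate descent (alternating least squares) for
    min_{W,H} sum_{(i,j) in Omega} (A_ij - w_i^T h_j)^2 + lam*(||W||_F^2 + ||H||_F^2).

    `mask` is a boolean matrix marking the observed entries Omega.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.normal(scale=0.1, size=(m, k))
    H = rng.normal(scale=0.1, size=(n, k))
    for _ in range(n_iters):
        for i in range(m):                         # update block W, one row at a time
            obs = mask[i]
            Hi = H[obs]
            W[i] = np.linalg.solve(Hi.T @ Hi + lam * np.eye(k), Hi.T @ A[i, obs])
        for j in range(n):                         # update block H
            obs = mask[:, j]
            Wj = W[obs]
            H[j] = np.linalg.solve(Wj.T @ Wj + lam * np.eye(k), Wj.T @ A[obs, j])
    return W, H

# Tiny synthetic example (assumed data): low-rank matrix with 50% observed entries.
rng = np.random.default_rng(1)
A_true = rng.normal(size=(30, 3)) @ rng.normal(size=(3, 40))
mask = rng.random(A_true.shape) < 0.5
W, H = als(A_true, mask)
err = (W @ H.T - A_true)[~mask]
print("RMSE on unobserved entries:", np.sqrt(np.mean(err**2)))
```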
Example: EM algorithm
Maximization step:
Section 4
Proximal Gradient: Handling Constraints and Regularization
Euler’s Method
d
Gradient Flow: dt x(t) = −∇f (x(t))
Equilibrium point: x(∞) = {∇f (x(t)) = 0}
Discrete solutions x (k) = x(kc)
Forward Euler method:
x^(k+1) minimizes

f(x^{(k)}) + \bigl\langle \nabla f(x^{(k)}),\, x - x^{(k)} \bigr\rangle + \frac{1}{2c}\|x - x^{(k)}\|_2^2

Compared with forward Euler: setting the gradient of this quadratic to zero gives x^(k+1) = x^(k) − c ∇f(x^(k)), the same step.
Minimize the distance between the dashed gradient line and the red quadratic function.
At optimality, f(x^{(k)}) + \langle \nabla f(x^{(k)}), x - x^{(k)} \rangle and -\frac{1}{2c}\|x - x^{(k)}\|_2^2 have the same slope.
where \hat{\varphi}(\cdot, x^k) is a convex upper bound to \varphi that is tight at x^k.
For an upper bound of f, consider the function \hat{f}_\lambda given by
h(\beta) = \begin{cases} 0 & \beta \in C \\ \infty & \text{otherwise} \end{cases}

prox_h(b) = \arg\min_{\beta \in C} \|\beta - b\|^2 is the Euclidean projection onto C.
Example: L1 norm
prox_{\gamma\|\cdot\|_1}(x) = S_\gamma(x), the soft-thresholding operator.
Projected gradient:

x^{(k+1)} = \arg\min_x \Bigl\{ g(x^{(k)}) + \bigl\langle \nabla g(x^{(k)}),\, x - x^{(k)} \bigr\rangle + \frac{1}{2t}\|x - x^{(k)}\|^2 + \mathbb{1}_C(x) \Bigr\}
          = \arg\min_x \Bigl\{ \frac{1}{2}\|x - (x^{(k)} - t\nabla g(x^{(k)}))\|^2 + t\,\mathbb{1}_C(x) \Bigr\}
          = P_C\bigl( x^{(k)} - t\nabla g(x^{(k)}) \bigr)
A special case: h(x) = \mathbb{1}_C(x) = \begin{cases} 0 & x \in C \\ \infty & \text{otherwise} \end{cases}, where C is a set.
The algorithm becomes x^(k+1) = P_C(x^(k) − t∇g(x^(k))), where P_C is the projection onto the set C.
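A minimal sketch of projected gradient for one concrete choice of C, the nonnegative orthant (so P_C is an elementwise clip at zero); the least-squares g, the 1/L step size, and the random data are assumptions for illustration.

```python
import numpy as np

def projected_gradient_nnls(A, y, t=None, n_iter=500):
    """Projected gradient for min_{x >= 0} 0.5*||Ax - y||^2.

    x^{k+1} = P_C(x^k - t * grad g(x^k)); for C = {x : x >= 0} the projection
    is an elementwise max with zero.
    """
    n, p = A.shape
    if t is None:
        t = 1.0 / np.linalg.norm(A, 2) ** 2        # step 1/L with L = ||A||_2^2
    x = np.zeros(p)
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)
        x = np.maximum(x - t * grad, 0.0)          # P_C = clip to the orthant
    return x

# Assumed random instance; the result stays feasible (all entries >= 0).
rng = np.random.default_rng(0)
A, y = rng.normal(size=(80, 30)), rng.normal(size=80)
print("min entry of solution:", projected_gradient_nnls(A, y).min())
```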
Decomposable Functions
Assumptions on g
The proximal operator:

\mathrm{prox}_{cf}(x^{(k)}) = \arg\min_x \underbrace{\frac{1}{2c}\|x - x^{(k)}\|_2^2}_{\text{approximate } x^{(k)} \text{ with } x} + \underbrace{f(x)}_{\text{penalty}}
Convergence Rate
\min_{x \in \mathbb{R}^N} \|y - Ax\|_2^2

When A has full column rank, we know that the solution is given by

\hat{x}_{\mathrm{ls}} = (A^T A)^{-1} A^T y.

However, we also know that when A^T A is not well-conditioned, this inverse can be unstable to compute, and iterative descent methods (gradient descent and conjugate gradients) can take many iterations to converge.
Example 1: LASSO
For the lasso:

g(x) = \frac{1}{2}\|Ax - y\|^2, \qquad h(x) = \gamma\|x\|_1

We know that \nabla g(x) = A^T(Ax - y) and \mathrm{prox}_{\gamma\|\cdot\|_1}(x) = S_\gamma(x), where
S_\gamma(x) = \mathrm{sign}(x)(|x| - \gamma)_+ is the soft-thresholding operator.
In each iteration,

x^{(k+1)} = S_{c\gamma}\bigl( x^{(k)} - c\, A^T(Ax^{(k)} - y) \bigr)
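A minimal sketch of this proximal gradient (ISTA) iteration; the fixed step size c = 1/‖A‖₂² and the synthetic sparse-recovery data are assumptions for illustration.

```python
import numpy as np

def ista_lasso(A, y, gamma, n_iter=500):
    """Proximal gradient (ISTA) for 0.5*||Ax - y||^2 + gamma*||x||_1.

    Each iteration is a gradient step on g followed by the prox of c*h,
    i.e. soft-thresholding at level c*gamma.
    """
    c = 1.0 / np.linalg.norm(A, 2) ** 2            # step size 1/L, L = ||A^T A||_2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)
        z = x - c * grad
        x = np.sign(z) * np.maximum(np.abs(z) - c * gamma, 0.0)   # S_{c*gamma}(z)
    return x

# Assumed sparse recovery example.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 200))
x_true = np.zeros(200); x_true[rng.choice(200, 8, replace=False)] = rng.normal(size=8) * 2
y = A @ x_true + 0.05 * rng.normal(size=100)
x_hat = ista_lasso(A, y, gamma=0.5)
print("number of nonzero coefficients:", np.count_nonzero(np.abs(x_hat) > 1e-6))
```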
Convergence
Example 3: L∞ Regression
For L∞ regression:

g(x) = \frac{1}{2}\|Ax - y\|^2, \qquad h(x) = \gamma\|x\|_\infty

We know that \nabla g(x) = A^T(Ax - y), and by Moreau decomposition
\mathrm{prox}_{\gamma\|\cdot\|_\infty}(x) = x - \Pi_{\{\|\cdot\|_1 \le \gamma\}}(x),
where \Pi_{\{\|\cdot\|_1 \le \gamma\}} denotes Euclidean projection onto the ℓ1 ball of radius γ.
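A minimal sketch of this prox via the Moreau decomposition; the sorting-based ℓ1-ball projection (Duchi et al. style) and the small test vector are assumptions for illustration.

```python
import numpy as np

def project_l1_ball(v, z):
    """Euclidean projection of v onto the L1 ball {w : ||w||_1 <= z} (sorting-based)."""
    if np.abs(v).sum() <= z:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]                  # sorted magnitudes, descending
    css = np.cumsum(u)
    idx = np.arange(1, len(u) + 1)
    rho = idx[u - (css - z) / idx > 0][-1]        # last index with positive gap
    theta = (css[rho - 1] - z) / rho
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def prox_linf(x, gamma):
    """prox of gamma*||.||_inf: x minus its projection onto the gamma-radius L1 ball."""
    return x - project_l1_ball(x, gamma)

# Quick sanity check on a small vector.
x = np.array([3.0, -1.0, 0.5])
print(prox_linf(x, gamma=2.0))   # [1. -1.  0.5]
```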