Scalable Algorithms for Structured Functions

Hao Yan

February 29, 2024

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 1 / 96
Outline

1 Overview
Regularized Likelihood

2 Subgradient
Algorithm for Subgradient Descent

3 Coordinate Descent
Coordinate Descent
Block Coordinate Descent

4 Proximal Gradient: Handling Constraints and Regularization


Proximal Gradient

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 2 / 96
Overview

Section 1

Overview

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 3 / 96
Overview Regularized Likelihood

Challenges for High-Dimensional Problems


It is extremely hard, if not impossible, to model the joint distribution of the HD data.
The curse of dimensionality:
  Data become sparse
  Point-to-point distances lose meaning
  Data requirements for statistical significance and combinatorial search complexity grow exponentially

Figure: The curse of dimensionality. Source: Parsons, Haque, and Liu (2004)
Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 4 / 96
Overview Regularized Likelihood

Blessing of Dimensionality

The blessing of dimensionality: real data are often highly concentrated on low-dimensional, sparse, or degenerate structures in high-dimensional space.

Challenge: encode such structural information into likelihood-based inference.
Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 5 / 96
Overview Regularized Likelihood

Overfitting Problem

Overfitting: the model has low training error but high prediction error.
Using too many features can lead to overfitting.

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 6 / 96
Overview Regularized Likelihood

Understand Model Complexity

Training data and testing data generated from the same distribution

(xi , yi ) ∼ D

L(f ) = E(x,y )∼D [(f (x) − y )2 ]


Ideal model: minimizing prediction error f ∗ = arg minf ∈F L(f )
Empirical error: error on the training data
L̂(f) = (1/n) Σ_{i=1}^n (f(x_i) − y_i)²

Our estimator: fˆ = arg minf ∈F L̂(f )

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 7 / 96
Overview Regularized Likelihood

Unified Framework of High-Dimensional Analysis

Traditional statistical framework


High-dimensional statistical inference:
Number of parameters p is larger than sample size n
Usually have low-dimension structure, including sparse representation,
sparse and structured matrices, low-rank matrices
Solved by regularized M-estimators under high-dimensional scaling,
where ξ_1^n = {ξ_1, · · · , ξ_n} denotes n identically distributed observations
Solve convex optimization problem

x̂ ∈ arg min_x L(x; ξ_1^n) + λ_n R(x)

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 8 / 96
Overview Regularized Likelihood

Why Regularization

Why regularization?
From a computational perspective, regularization can lend stability to
the optimization problem and can lead to algorithmic speed-ups.
From a geometric point of view, regularization helps the solution be
close to the low-dimensional manifold.
From a statistical point of view, regularization avoids over-fitting and
leads to estimators with interesting guarantees on their error.
From an engineering or domain-system point of view, regularization leads to solutions with better desired properties. It
is a way to put human knowledge into the learning system.

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 9 / 96
Overview Regularized Likelihood

All Subset Regression v.s. Ridge Regression

Ridge Regression:

min_θ ∥y − Xθ∥₂² subject to ∥θ∥₂² ≤ t

Best subset selection:

min_θ ∥y − Xθ∥₂² subject to ∥θ∥₀ ≤ t

∥θ∥0 is the number of non-zero elements in θ


                          Subset selection   Ridge regression
Computationally feasible  No                 Yes
Variable selection        Yes                No

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 10 / 96
Overview Regularized Likelihood

Vector Norm - Convexity


Geometry of regularization in R²: convexity

Figure: norm balls ∥β∥₀ ≤ t, ∥β∥_{1/2} ≤ t, ∥β∥_{3/4} ≤ t, ∥β∥₁ ≤ t, ∥β∥_{3/2} ≤ t, ∥β∥₂ ≤ t in the (β₁, β₂) plane.
Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 11 / 96
Overview Regularized Likelihood

Vector Norm - Variable Selection


Geometry of regularization in R²: model selection

Figure: the same norm balls ∥β∥₀ ≤ t, ∥β∥_{1/2} ≤ t, ∥β∥_{3/4} ≤ t, ∥β∥₁ ≤ t, ∥β∥_{3/2} ≤ t, ∥β∥₂ ≤ t in the (β₁, β₂) plane; corners on the axes promote variable selection.

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 12 / 96
Overview Regularized Likelihood

Vector Norm - Variable Selection

Norm        Convex?   Corners?
∥·∥₀        No        Yes
∥·∥_{1/2}   No        Yes
∥·∥_{3/4}   No        Yes
∥·∥₁        Yes       Yes ✓
∥·∥_{3/2}   Yes       No
∥·∥₂        Yes       No
Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 13 / 96
Overview Regularized Likelihood

Vector Norm - Variable Selection

Best of both worlds: ∥·∥₁ (convex, with corners)

Figure: the ℓ₁ ball in the (β₁, β₂) plane.
Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 14 / 96
Overview Regularized Likelihood

Lasso Regression

Assume no intercept by first standardizing the data (θ₀ = 0).


Lasso Regression

    min Σ_{i=1}^n {y_i − (θ₁x_{1i} + · · · + θ_p x_{pi})}²   s.t.   Σ_{j=1}^p |θ_j| ≤ M

Equivalently

    min Σ_{i=1}^n {y_i − (θ₁x_{1i} + · · · + θ_p x_{pi})}² + λ Σ_{j=1}^p |θ_j|

In matrix form

    min_θ ∥y − Xθ∥² + λ∥θ∥₁

Challenge: non-differentiable at θ_j = 0
Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 15 / 96
Overview Regularized Likelihood

Example 2: Support Vector Machines

    min_x Σ_{i=1}^n max(0, 1 − y_i x^T ξ_i)  +  (1/2)∥x∥²
          (hinge loss function)               (L2 regularization)
Hinge Loss function is a convex-relaxation of 0 − 1 loss function

Non-differentiable when y_i x^T ξ_i = 1
Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 16 / 96
Overview Regularized Likelihood

Type of Regularization
Property               Regularizer R              Form
Sparse solution        L1 norm                    Σ_j |x_j|
Group sparsity         Group L1 norm              Σ_{G∈G} ∥x_G∥₂
Continuity with jumps  L1 norm of gradient        Σ_j |x_j − x_{j+1}|
Smoothness             Smoothness penalty         Σ_j (x_j − 2x_{j+1} + x_{j+2})²
Low-rank matrices      Nuclear norm               Σ_j σ_j(X)
Small magnitude        Ridge penalty              ∥x∥²

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 17 / 96
Subgradient

Section 2

Subgradient

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 18 / 96
Subgradient Algorithm for Subgradient Descent

Problem of Gradient Descent

Finding the Gradient descent direction (or even finding a descent


direction) may involve expensive computation
Stepsize rules are tricky to choose: for certain popular stepsize rules
(like exact line search), Gradient descent might converge to
non-optimal points.
Even though it never hits non-differentiable points, Gradient descent
with exact line search gets stuck around a non-optimal point (i.e. (0,
0))
Problem: Gradient descent directions may undergo large /
discontinuous changes when close to convergence limits

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 19 / 96
Subgradient Algorithm for Subgradient Descent

Wolfe’s example

f(x₁, x₂) = 5(9x₁² + 16x₂²)^{1/2}   if x₁ > |x₂|
            9x₁ + 16|x₂|            if x₁ ≤ |x₂|

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 20 / 96
Subgradient Algorithm for Subgradient Descent

Subgradient
Convexity ⇒ continuity (on the interior of the domain of the function)
Convexity ̸⇒ differentiability (e.g. ∥x∥₁)
Subgradients generalize gradients for general convex functions
v is a subgradient of f at x if f(x′) ≥ f(x) + v^T(x′ − x) for all x′
Subdifferential: ∂f(x) = {all subgradients of f at x}
If f is differentiable, ∂f(x) = {∇f(x)}

Geometrically, subgradients correspond to supporting hyperplanes of the epigraph of f

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 21 / 96
Subgradient Algorithm for Subgradient Descent

Basic Rules

Scaling: ∂(αf) = α ∂f for α > 0

Summation: ∂(f₁ + f₂) = ∂f₁ + ∂f₂
Affine transformation: if h(x) = f(Ax + b), then ∂h(x) = A^T ∂f(Ax + b)
Chain rule: ∂(g(f(x))) = g′(f(x)) ∂f(x), supposing f is convex and g is
differentiable, nondecreasing, and convex
Maximum: if f(x) = max_i f_i(x), then ∂f(x) = Conv{∪{∂f_i(x) | f_i(x) = f(x)}},
which is the convex hull of the subdifferentials of all active functions

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 22 / 96
Subgradient Algorithm for Subgradient Descent

Example: Piecewise-linear function

f(x) = max_{i=1,··· ,m} (a_i^T x + b_i)
∂f(x) = Conv{a_i | i ∈ I(x)}, where I(x) = {i | a_i^T x + b_i = f(x)}
I(x) is the index set of the active functions
Conv is the convex hull: the subdifferential is the convex hull of the coefficients a_i of all active functions

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 23 / 96
Subgradient Algorithm for Subgradient Descent

Dual Norm Examples

f (x) = ∥x∥p . Define q such that 1/p + 1/q = 1, then

∥x∥_p = max_{∥z∥_q ≤ 1} z^T x

∂∥x∥_p = Conv{z | z ∈ arg max_{∥z∥_q ≤ 1} z^T x}

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 24 / 96
Subgradient Algorithm for Subgradient Descent

Dual Norm Examples


Example p = 1:

∥x∥₁ = max_{z ∈ {−1,1}^n} z^T x

∂∥x∥₁ = ∏_k J_k,   J_k = {1} if x_k > 0,  {−1} if x_k < 0,  [−1, 1] if x_k = 0
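As an illustration (a minimal NumPy sketch, not from the slides), one componentwise element of ∂∥x∥₁ is sign(x), and the subgradient inequality can be checked numerically:

import numpy as np

def l1_subgradient(x):
    # sign(x) picks 1 for x_k > 0, -1 for x_k < 0, and a value in [-1, 1] (here 0) for x_k = 0
    return np.sign(x)

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
v = l1_subgradient(x)
for _ in range(100):
    x_new = rng.standard_normal(5)
    # subgradient inequality: f(x') >= f(x) + v^T (x' - x) for f = ||.||_1
    assert np.sum(np.abs(x_new)) >= np.sum(np.abs(x)) + v @ (x_new - x) - 1e-12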

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 25 / 96
Subgradient Algorithm for Subgradient Descent

Subgradient Properties

f is convex and differentiable =⇒ ∂f (x) = {∇f (x)}.


At any point x, there can be 0, 1, or infinitely many subgradients.
∂f (x) = ∅ =⇒ f is not convex.

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 26 / 96
Subgradient Algorithm for Subgradient Descent

Global Optimality Condition

Definition
A vector g ∈ Rn is a subgradient of f : Rn → R at x if for all z,

f (z) ⩾ f (x) + g T (z − x).

Corollary
If 0 ∈ ∂f (x), then x is a global minimizer of f .

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 27 / 96
Subgradient Algorithm for Subgradient Descent

Example: Absolute Value Function

For the function f(x) = |x|, the subdifferential is given by:

    ∂f(x) = {1} if x > 0,  {−1} if x < 0,  [−1, 1] if x = 0

This indicates that for any non-zero x, the subgradient is simply the
sign of x, while at x = 0 the subdifferential includes all values between −1
and 1.

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 28 / 96
Subgradient Algorithm for Subgradient Descent

Subdifferential of Absolute Value

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 29 / 96
Subgradient Algorithm for Subgradient Descent

Subgradient Method

minx f (x) with f convex, dom(f ) = R n


Subgradient method: choose x (0) ∈ R n and repeat:

x (k) = x (k−1) − αk g (k−1)

where g (k−1) ∈ ∂f (x (k−1) )


Generalization of Gradient Descent for non-smooth function
Problem: not a descent algorithm, so we keep track of the best point found so far:

    f_best^(k) = f(x_best^(k)) = min_{i=1,··· ,k} f(x^(i))
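A minimal sketch of the subgradient method with best-iterate tracking (f, the subgradient oracle, and the step-size rule are supplied by the caller; the names are illustrative, not from the slides):

import numpy as np

def subgradient_method(f, subgrad, x0, step, n_iter=1000):
    # Not a descent method, so track the best iterate seen so far.
    x = x0.copy()
    x_best, f_best = x.copy(), f(x)
    for k in range(1, n_iter + 1):
        g = subgrad(x)
        x = x - step(k) * g          # x^(k) = x^(k-1) - alpha_k g^(k-1)
        if f(x) < f_best:
            x_best, f_best = x.copy(), f(x)
    return x_best, f_best

# Example: f(x) = |x_1| + 2|x_2| with diminishing step sizes alpha_k = 1/k
f = lambda x: abs(x[0]) + 2 * abs(x[1])
subgrad = lambda x: np.array([np.sign(x[0]), 2 * np.sign(x[1])])
x_best, f_best = subgradient_method(f, subgrad, np.array([1.0, -2.0]), step=lambda k: 1.0 / k)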

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 30 / 96
Subgradient Algorithm for Subgradient Descent

Stochastic Subgradient Method

Stochastic subgradient method is the subgradient method, using noisy


unbiased subgradient estimation

x (k+1) = x (k) − αk g̃ (k)

x^(k) is the k-th iterate
g̃^(k) is any unbiased estimate of a subgradient of (convex) f at x^(k)
Typically, g̃^(k) is computed on a mini-batch of samples ξ_i, i = 1, · · · , B

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 31 / 96
Subgradient Algorithm for Subgradient Descent

Not descent algorithms

For differentiable f, −∇f(x) is always a descent direction (except at a minimizer); for non-differentiable (convex) functions, δx = −g with g ∈ ∂f(x) need not be a descent direction.

Example 1: f(x) = |x₁| + 2|x₂|

Example 2: f(x) = |x| at x = 0

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 32 / 96
Subgradient Algorithm for Subgradient Descent

Descent in terms of Distance

Property
if f is convex, f (z) < f (x), g ∈ ∂f (x), then for small t > 0,

∥x − tg − z∥2 < ∥x − z∥2

thus −g is descent direction for ∥x − z∥2 , for any z with f (z) < f (x) (e.g.,
x⋆ )

Negative subgradient is descent direction for distance to optimal point


Proof:
    ∥x − tg − z∥₂² = ∥x − z∥₂² − 2t g^T(x − z) + t²∥g∥₂²
                  ≤ ∥x − z∥₂² − 2t(f(x) − f(z)) + t²∥g∥₂²
                  < ∥x − z∥₂² for small t > 0

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 33 / 96
Subgradient Algorithm for Subgradient Descent

Convergence Rate
Theorem
Suppose f_best^(k) is the best function value obtained up to the k-th iteration. Then

    f_best^(k) − f* ≤ (R² + G² Σ_{i=1}^k α_i²) / (2 Σ_{i=1}^k α_i)

Here R = ∥x^(1) − x*∥₂ and we assume ∥g∥₂ ≤ G.

Constant step size α_k = α:

    f_best^(k) − f* ≤ (R² + G² k α²) / (2kα) → G²α/2

Not guaranteed to converge!

Square-summable but not summable step sizes (Σ_{i=1}^∞ α_i² < ∞, Σ_{i=1}^k α_i → ∞):

    f_best^(k) − f* ≤ (R² + G² Σ_{i=1}^k α_i²) / (2 Σ_{i=1}^k α_i) → 0,

so f_best^(k) → f*.
The optimal fixed-horizon step size α_i = R/(G√k) gives f_best^(k) − f* ≤ RG/√k.
Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 34 / 96
Subgradient Algorithm for Subgradient Descent

Convergence Rate (Not Required)


Define R = ∥x^(1) − x*∥₂ and assume ∥g∥₂ ≤ G
Let x* be any minimizer of f

    ∥x^(k+1) − x*∥₂² = ∥x^(k) − α_k g^(k) − x*∥₂²
                     = ∥x^(k) − x*∥₂² − 2α_k g^(k)T (x^(k) − x*) + α_k² ∥g^(k)∥₂²
                     ≤ ∥x^(k) − x*∥₂² − 2α_k (f(x^(k)) − f(x*)) + α_k² ∥g^(k)∥₂²

Apply it recursively:

    0 ≤ ∥x^(k+1) − x*∥₂² ≤ ∥x^(1) − x*∥₂² − 2 Σ_{i=1}^k α_i (f(x^(i)) − f*) + Σ_{i=1}^k α_i² ∥g^(i)∥₂²
                         ≤ R² − 2 Σ_{i=1}^k α_i (f(x^(i)) − f*) + G² Σ_{i=1}^k α_i²
                         ≤ R² − 2 (f_best^(k) − f*) (Σ_{i=1}^k α_i) + G² Σ_{i=1}^k α_i²

We have

    f_best^(k) − f* ≤ (R² + G² Σ_{i=1}^k α_i²) / (2 Σ_{i=1}^k α_i)

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 35 / 96
Subgradient Algorithm for Subgradient Descent

Step-size Choice

Fixed step sizes αk = α for all k.


Diminishing step sizes: choose them to satisfy

    Σ_{k=1}^∞ α_k² < +∞,   Σ_{k=1}^∞ α_k = ∞

Important that step sizes go to zero, but not too fast, e.g. α_k = 1/k
Major difference from gradient descent:
step sizes are typically pre-specified, not adaptively computed (no line search!)

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 36 / 96
Subgradient Algorithm for Subgradient Descent

Example 1: Lasso

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 37 / 96
Subgradient Algorithm for Subgradient Descent

Example 1: Lasso

Lasso:
    min_x (1/2)∥Ax − y∥₂² + τ∥x∥₁

Optimality: 0 ∈ ∂((1/2)∥Ax − y∥₂² + τ∥x∥₁)

    A^T(y − Ax) = τv, for v ∈ ∂∥x∥₁,  v_i ∈ {1} if x_i > 0,  {−1} if x_i < 0,  [−1, 1] if x_i = 0

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 38 / 96
Subgradient Algorithm for Subgradient Descent

Optimality Criterion for Lasso

If A = I, this gives y − x = τv
This can be solved for each x_i individually:

    y_i − x_i = τv_i,  v_i ∈ {1} if x_i > 0,  {−1} if x_i < 0,  [−1, 1] if x_i = 0

If x_i > 0: v_i = 1, so x_i = y_i − τ, which holds when y_i > τ
If x_i < 0: v_i = −1, so x_i = y_i + τ, which holds when y_i < −τ
If x_i = 0: y_i − x_i = y_i ∈ [−τ, τ], which holds when y_i ∈ [−τ, τ]
Solution: x_i = sgn(y_i) · (|y_i| − τ)₊, the soft-thresholding operator
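A one-line NumPy version of the soft-thresholding operator (a minimal sketch; the vectorized form matches the componentwise formula above):

import numpy as np

def soft_threshold(y, tau):
    # S_tau(y) = sign(y) * max(|y| - tau, 0), applied componentwise
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)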

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 39 / 96
Subgradient Algorithm for Subgradient Descent

Subgradient Descent for Lasso

If A is orthogonal, a similar derivation gives the closed-form solution

    x = sgn(A^T y) · (|A^T y| − τ)₊

If A is not orthogonal, we cannot obtain a closed-form solution!
Use subgradient descent:
    x^(k+1) = x^(k) − α_k g^(k), where g^(k) = A^T(Ax^(k) − y) + τv^(k), v^(k) ∈ ∂∥x^(k)∥₁, is a subgradient
Computational complexity: O(np) per iteration
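A minimal sketch of subgradient descent for the lasso (A, y, tau are assumed given; the step-size rule alpha_k = a/sqrt(k) is one common choice, not prescribed by the slides):

import numpy as np

def lasso_subgradient_descent(A, y, tau, n_iter=5000, a=1e-2):
    n, p = A.shape
    x = np.zeros(p)
    obj = lambda z: 0.5 * np.sum((A @ z - y) ** 2) + tau * np.sum(np.abs(z))
    x_best, f_best = x.copy(), obj(x)
    for k in range(1, n_iter + 1):
        g = A.T @ (A @ x - y) + tau * np.sign(x)   # subgradient of the lasso objective
        x = x - (a / np.sqrt(k)) * g               # O(np) per iteration
        if obj(x) < f_best:
            x_best, f_best = x.copy(), obj(x)
    return x_best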

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 40 / 96
Subgradient Algorithm for Subgradient Descent

Result
Ridge (Left): use gradients; Lasso (Right): use subgradients. Example
here has n = 1000, p = 20:

Step sizes hand-tuned to be favorable for each method (of course the comparison is imperfect, but it reveals the convergence behaviors)

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 41 / 96
Subgradient Algorithm for Subgradient Descent

Example 2: SVM

    min_x Σ_{i=1}^n max(0, 1 − y_i x^T ξ_i)  +  (1/2)∥x∥²
          (hinge loss function)               (L2 regularization)

Subgradient: ∂((1/2)∥x∥²) = {x},

    v_i ∈ ∂_x max(0, 1 − y_i x^T ξ_i) = {0} if y_i x^T ξ_i > 1,  {−y_i ξ_i} if y_i x^T ξ_i < 1,  Conv{0, −y_i ξ_i} if y_i x^T ξ_i = 1

A subgradient of the objective is g = x + Σ_i v_i
Subgradient descent: x^(k+1) = x^(k) − τ g^(k)
Can be combined with stochastic optimization by choosing a random index i
and using g = x + n·v_i as an unbiased estimate (see the sketch below).
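A minimal sketch of the stochastic subgradient step for this SVM objective (samples ξ_i stored as rows of Xi; the scaling by n that keeps the single-sample subgradient unbiased is an assumption, not spelled out on the slide):

import numpy as np

def svm_stochastic_subgradient(Xi, y, n_iter=10000, tau=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    n, d = Xi.shape
    x = np.zeros(d)
    for _ in range(n_iter):
        i = rng.integers(n)
        margin = y[i] * (x @ Xi[i])
        v_i = -y[i] * Xi[i] if margin < 1 else np.zeros(d)  # subgradient of the hinge term
        g = x + n * v_i                                     # unbiased estimate of x + sum_i v_i
        x = x - tau * g
    return x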
Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 42 / 96
Coordinate Descent

Section 3

Coordinate Descent

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 43 / 96
Coordinate Descent Coordinate Descent

Coordinate Descent

We have seen some of the core methods in optimization, the main objects of study in this field, such as
First-order methods
Newton's method
Dual methods
Today, we'll see an extremely simple technique that is surprisingly
efficient and scalable.
It doesn't apply as broadly as the above methods, but is interesting and
useful when it does apply.
The idea is coordinate-wise minimization

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 44 / 96
Coordinate Descent Coordinate Descent

Coordinate-wise Minimization

Cyclic coordinate descent:

Update one coordinate at a time, cycling through {1, · · · , n}:

    x_1^(k) = arg min_{x_1} f(x_1, x_2^(k−1), · · · , x_n^(k−1))
    x_2^(k) = arg min_{x_2} f(x_1^(k), x_2, · · · , x_n^(k−1))
    ...
    x_n^(k) = arg min_{x_n} f(x_1^(k), x_2^(k), · · · , x_n)

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 45 / 96
Coordinate Descent Coordinate Descent

Convergence
If f is convex and differentiable and f is minimized at x* along each
coordinate axis, i.e. f(x* + d·e_i) ≥ f(x*) for all d and all i, then x* is a global minimizer.
Proof.
If f(x* + d·e_i) ≥ f(x*) for all d, then ∂f/∂x_i(x*) = 0. Therefore,

    ∇f(x*) = (∂f/∂x_1(x*), · · · , ∂f/∂x_n(x*)) = 0

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 46 / 96
Coordinate Descent Coordinate Descent

Convergence

Q: Same question, but for convex, non-differentiable f?

A: No! See the counterexample in the figure: coordinate-wise minimization can get stuck at a point that is not a global minimizer.


Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 47 / 96
Coordinate Descent Coordinate Descent

Coordinate Descent

Consider first min f (x) with f : R n → R


Iteration j of basic coordinate descent:
Choose index ij ∈ {1, · · · , n}
Fix all i ≠ i_j, change x_{i_j} to reduce f
Variants:
Take a gradient step: −∇_{i_j} f(x)
Do a more rigorous search in the subspace of x_{i_j}
Actually minimize f over x_{i_j}: arg min_{x_{i_j}} f(x)

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 48 / 96
Coordinate Descent Coordinate Descent

Convergence
P
Q: Same question, but for f(x) = g(x) + Σ_{i=1}^n h_i(x_i), with g convex and differentiable and each h_i convex (possibly non-differentiable)?
A: Yes! For any y,

    f(y) − f(x) ≥ ∇g(x)^T(y − x) + Σ_{i=1}^n (h_i(y_i) − h_i(x_i))
               = Σ_{i=1}^n [∇_i g(x)(y_i − x_i) + h_i(y_i) − h_i(x_i)] ≥ 0,

where each bracketed term is nonnegative, since coordinate-wise optimality of x,
    g(x) + h_i(x_i) ≤ g(x + (y_i − x_i)e_i) + h_i(y_i),
implies −∇_i g(x) ∈ ∂h_i(x_i) and hence h_i(y_i) − h_i(x_i) ≥ −∇_i g(x)(y_i − x_i).

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 49 / 96
Coordinate Descent Coordinate Descent

Selecting Coordinate Order

Choice of Ordering
Cyclic Order: run all coordinates in cyclic order
Random Sampling: Randomly select some coordinate to update
Gauss−Southwell : At each iteration, pick coordinate i so that
i = arg maxj |∇j f (x (t) )|
Random Permutation: run cyclic order on a permuted index

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 50 / 96
Coordinate Descent Coordinate Descent

Example 1: Linear Regression

min_x f(x) = (1/2)∥y − Ax∥²

Consider minimizing over x_i with all x_j, j ≠ i fixed
Optimality condition:

    0 = ∇_i f(x) = A_i^T(Ax − y) = A_i^T(A_i x_i + A_{−i} x_{−i} − y)

Solved by

    x_i = A_i^T(y − A_{−i} x_{−i}) / ∥A_i∥²
Coordinate descent repeats the update for i = 1, · · · , p, 1, 2, · · · , p, · · ·

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 51 / 96
Coordinate Descent Coordinate Descent

Complexity

Complexity for each iteration i

    x_i = A_i^T(y − A_{−i} x_{−i}) / ∥A_i∥²

A_i ∈ R^{n×1}, y ∈ R^{n×1}, A_{−i} ∈ R^{n×(p−1)}, x_{−i} ∈ R^{(p−1)×1}

Computing A_{−i}x_{−i} requires O(np) operations; the remaining inner product A_i^T(·) is O(n)
So each coordinate update costs O(np), and each full cycle over i = 1, · · · , p takes O(np²)
Is it optimal?

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 52 / 96
Coordinate Descent Coordinate Descent

A Clever Trick

Define r = y − Ax

    A_i^T(y − A_{−i}x_{−i}) = A_i^T(y − Ax + A_i x_i) = A_i^T r + A_i^T A_i x_i = A_i^T r + ∥A_i∥² x_i

The update rule is given by

    x_i = A_i^T(y − A_{−i}x_{−i}) / ∥A_i∥² = A_i^T r / ∥A_i∥² + x_i

Complexity: O(n) for each coordinate update, O(np) for full cyclic
update over i = 1, · · · , p
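A minimal sketch of cyclic coordinate descent for least squares using this residual trick (assumes A has no zero columns; the variable names are illustrative):

import numpy as np

def coordinate_descent_ls(A, y, n_cycles=50):
    n, p = A.shape
    x = np.zeros(p)
    r = y - A @ x                      # residual r = y - Ax, maintained throughout
    col_sq = np.sum(A ** 2, axis=0)    # ||A_i||^2, precomputed once
    for _ in range(n_cycles):          # each full cycle costs O(np)
        for i in range(p):
            xi_new = x[i] + A[:, i] @ r / col_sq[i]   # x_i <- A_i^T r / ||A_i||^2 + x_i
            r -= A[:, i] * (xi_new - x[i])            # O(n) residual update
            x[i] = xi_new
    return x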

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 53 / 96
Coordinate Descent Coordinate Descent

Is it fast?

Figure: coordinate descent vs. gradient descent for linear regression, f(k) − f* over iterations, 100 instances (n = 100, p = 20).

Is it fair to compare 1 full cycle of coordinate descent to 1 iteration of gradient descent?
Yes, since both require only O(np) operations if we're clever:

    x_i = A_i^T(y − A_{−i}x_{−i}) / ∥A_i∥² = A_i^T r / ∥A_i∥² + x_i^old

Note that CD is not a first-order method: it uses much more information about the problem.
Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 54 / 96
Coordinate Descent Coordinate Descent

Example 2: LASSO Regression

f(x) = (1/2)∥y − Ax∥² + λ∥x∥₁

Note that ∥x∥₁ = Σ_{i=1}^p |x_i|, so the nonsmooth part is separable and convergence
is guaranteed.
Minimizing over x_i with x_j, j ≠ i fixed, and s_i ∈ ∂|x_i|:

    (1/2)∥y − Ax∥² + λ∥x∥₁ = (1/2)∥y − A_{−i}x_{−i} − A_i x_i∥² + λ∥x_{−i}∥₁ + λ|x_i|

Solution is given by soft-thresholding:

    x_i = S_{λ/∥A_i∥²}( A_i^T(y − A_{−i}x_{−i}) / ∥A_i∥² )

If A_i is normalized, A_i^T 1 = 0 and A_i^T A_i = 1, then
    x_i = S_λ(A_i^T(y − A_{−i}x_{−i}))

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 55 / 96
Coordinate Descent Coordinate Descent

Complexity Analysis

If A_i is normalized, A_i^T 1 = 0 and A_i^T A_i = 1, then
    x_i = S_λ(A_i^T(y − A_{−i}x_{−i}))
Update rule:

    x_i = S_λ(A_i^T(y − Ax + A_i x_i))
        = S_λ(x_i + A_i^T(y − Ax))

Computational complexity is still O(n) per coordinate update (maintaining and updating the residual y − Ax), as shown in the sketch below.
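A minimal sketch of coordinate descent for the lasso with the same residual trick (columns assumed normalized so A_i^T A_i = 1, as on the slide; soft_threshold is the operator defined earlier):

import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def coordinate_descent_lasso(A, y, lam, n_cycles=100):
    n, p = A.shape
    x = np.zeros(p)
    r = y - A @ x                                            # residual, updated in O(n) per coordinate
    for _ in range(n_cycles):
        for i in range(p):
            xi_new = soft_threshold(x[i] + A[:, i] @ r, lam)   # x_i = S_lambda(x_i + A_i^T(y - Ax))
            r -= A[:, i] * (xi_new - x[i])
            x[i] = xi_new
    return x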

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 56 / 96
Coordinate Descent Coordinate Descent

Pathwise Coordinate Descent

Proposed in Friedman et al. (2007, 2009)


Outer loop (Pathwise strategy)
Compute solution over a sequence λ1 > λ2 > · · · > λr
For each λk , initialize coordinate descent at computed solution for λk+1
Inner loop (Active set strategy)
Perform one coordinate cycle (or small number of cycles), and record
active set A of coefficients that are nonzero
Cycle over only the coefficients in A until convergence
Check optimality conditions over all coefficients; if not all satisfied, add
offending coefficients to A, go back one step
Implemented into glmnet algorithm in R, MATLAB, Python

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 57 / 96
Coordinate Descent Coordinate Descent

Pathwise Coordinate Descent

Note: even when the solution is desired at only one λ, the pathwise
strategy is typically much more efficient than directly performing
coordinate descent at that λ
The active set strategy takes advantage of sparsity; e.g., for large
problems, coordinate descent for lasso is much faster than it is for
ridge regression
With these strategies in place (and a few more clever tricks),
coordinate descent can be competitive with the fastest algorithms for ℓ₁
penalized minimization problems
Fortran implementation glmnet, linked to R, MATLAB, etc.

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 58 / 96
Coordinate Descent Coordinate Descent

Coordinate Gradient Descent

For a smooth function, the iterations


    x_i^(k) = x_i^(k−1) − t_{ik} ∇_i f(x_1^(k), · · · , x_{i−1}^(k), x_i^(k−1), x_{i+1}^(k−1), · · · , x_n^(k−1))

for k = 1, 2, 3, · · · are called the coordinate gradient descent.

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 59 / 96
Coordinate Descent Coordinate Descent

Example 3: Fused Lasso

Diagonal fused LASSO


    min_x (1/2) Σ_i (y_i − x_i)² + λ₁ Σ_{j=1}^p |x_j| + λ₂ Σ_{j=2}^p |x_j − x_{j−1}|

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 60 / 96
Coordinate Descent Coordinate Descent

2-D Fused Lasso and Denoising

    min_x (1/2) Σ_{i,i′} (y_{i,i′} − x_{i,i′})² + λ₁ Σ_{i,i′} |x_{i,i′}| + λ₂ Σ_{i=1}^p Σ_{i′=2}^p |x_{i,i′} − x_{i,i′−1}| + λ₃ Σ_{i=2}^p Σ_{i′=1}^p |x_{i,i′} − x_{i−1,i′}|

we consider fusions of pixels one unit away in horizontal or vertical


directions.
After a number of fusions, fused regions can become very irregular in
shape.

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 61 / 96
Coordinate Descent Coordinate Descent

Algorithm of Fused Lasso

Question: Can we use Coordinate Descent?

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 62 / 96
Coordinate Descent Coordinate Descent

Algorithm of Fused Lasso

    min_x (1/2) Σ_i (y_i − x_i)² + λ₁ Σ_{j=1}^p |x_j| + λ₂ Σ_{j=2}^p |x_j − x_{j−1}|

Solution:
Move not only individual parameters but also fuse them together in
pairs and vary their common value
Fix λ₁, start at λ₂ = 0, and solve the problem for incremental values
of λ₂
A theorem guarantees that any two non-zero adjacent values that are fused
at λ₂ are also fused at λ₂′ > λ₂

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 63 / 96
Coordinate Descent Block Coordinate Descent

Block Coordinate Descent

Minimize f(x), x ∈ R^p
Divide x = (x₁, x₂, · · · , x_n); here x₁, · · · , x_n can be vectors/matrices
Update each block of variables x_i individually:

    x_1^(k) = arg min_{x_1} f(x_1, x_2^(k−1), · · · , x_n^(k−1))
    x_2^(k) = arg min_{x_2} f(x_1^(k), x_2, · · · , x_n^(k−1))
    ...
    x_n^(k) = arg min_{x_n} f(x_1^(k), x_2^(k), · · · , x_n)

Typically used if the problem has some natural division of the variables

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 64 / 96
Coordinate Descent Block Coordinate Descent

Recommender Systems

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 65 / 96
Coordinate Descent Block Coordinate Descent

Matrix Factorization Approach: A ≈ WH^T

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 66 / 96
Coordinate Descent Block Coordinate Descent

Matrix Factorization Approach

    min_{W∈R^{m×k}, H∈R^{n×k}} Σ_{(i,j)∈Ω} (A_{ij} − w_i^T h_j)² + λ(∥W∥_F² + ∥H∥_F²)

Ω = {(i, j) | A_{ij} is observed}

Regularization terms to avoid over-fitting
The i-th user ⇒ i-th row of W, w_i^T
The j-th item ⇒ j-th row of H, h_j^T

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 67 / 96
Coordinate Descent Block Coordinate Descent

Example: Alternating Least Squares

Alternately minimize over H and over W until convergence

Subproblem (W fixed):

    min_H Σ_{(i,j)∈Ω} (A_{ij} − (WH^T)_{ij})² + λ(∥W∥_F² + ∥H∥_F²)
        = Σ_{j=1}^n ( Σ_{i∈Ω_j} (A_{ij} − w_i^T h_j)² + λ∥h_j∥² ) + const

Each subproblem is a ridge regression problem

Easy to parallelize: n independent ridge regression subproblems
Block Coordinate Gradient Descent can also be used (Implemented in
the homework)
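A minimal sketch of alternating least squares for this objective (Ω given as a boolean mask M of observed entries, an assumption about the data format; each column update is an independent ridge solve, an easy target for parallelization):

import numpy as np

def als(A, M, k=10, lam=0.1, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.standard_normal((m, k)) * 0.1
    H = rng.standard_normal((n, k)) * 0.1
    for _ in range(n_iter):
        for j in range(n):                     # update H with W fixed: n ridge regressions
            idx = np.where(M[:, j])[0]
            Wj = W[idx]
            H[j] = np.linalg.solve(Wj.T @ Wj + lam * np.eye(k), Wj.T @ A[idx, j])
        for i in range(m):                     # update W with H fixed: m ridge regressions
            idx = np.where(M[i])[0]
            Hi = H[idx]
            W[i] = np.linalg.solve(Hi.T @ Hi + lam * np.eye(k), Hi.T @ A[i, idx])
    return W, H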

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 68 / 96
Coordinate Descent Block Coordinate Descent

Example: EM algorithm

log p(x; θ) = ELBO(q; θ) + KL(q(z)∥p(z|x; θ))

    ELBO(q; θ) = E_{q(z)}[log(P(x, z; θ)/q(z))],   KL(q∥p) = −E_{q(z)}[log(p(z|x; θ)/q(z))]

Expectation step:

    q^(t) = arg max_q ELBO(q; θ^(t))

Maximization step:

    θ^(t+1) = arg max_θ ELBO(q^(t); θ)

Will be covered later in much more detail!

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 69 / 96
Proximal Gradient: Handling Constraints and Regularization

Section 4

Proximal Gradient: Handling Constraints and


Regularization

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 70 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Euler’s Method

Gradient flow: (d/dt) x(t) = −∇f(x(t))
Equilibrium point: x(∞) satisfies ∇f(x(∞)) = 0
Discrete solutions: x^(k) ≈ x(kc)
Forward Euler method:

    x^(k+1) = x^(k) − c∇f(x^(k))

Backward Euler method: solve for x in

    x = x^(k) − c∇f(x),   then set x^(k+1) = x

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 71 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Gradient Descent/Forward Euler

Assume function f is convex, differentiable


Consider
min f (x)
Gradient descent iteration (with step size c):

x (k+1) = x (k) − c∇f (x (k) )

x^(k+1) minimizes

    f(x^(k)) + ⟨∇f(x^(k)), x − x^(k)⟩ + (1/2c)∥x − x^(k)∥₂²
Compared with forward Euler

x (k+1) = x (k) − c∇f (x (k) )

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 72 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Gradient Descent - A Proximal View

x^(k+1) minimizes f(x^(k)) + ⟨∇f(x^(k)), x − x^(k)⟩ + (1/2c)∥x − x^(k)∥₂²

Minimize the distance between the dashed gradient line and the red
quadratic function
At optimality,
f(x^(k)) + ⟨∇f(x^(k)), x − x^(k)⟩ and −(1/2c)∥x − x^(k)∥₂² have the same slope

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 73 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Proximal Gradient as Majorization Minimization

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 74 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Proximal Gradient as Majorization Minimization


A majorization-minimization algorithm for minimizing a function
φ : R^n → R consists of the iteration

    x^{k+1} := arg min_x φ̂(x, x^k),

where φ̂(·, x^k) is a convex upper bound to φ that is tight at x^k.
For an upper bound of f, consider the function f̂_λ given by

    f̂_λ(x, y) = f(y) + ∇f(y)^T(x − y) + (1/2λ)∥x − y∥₂²,

with λ > 0. For fixed y, this function is convex, satisfies
f̂_λ(x, x) = f(x), and is an upper bound on f when λ ∈ (0, 1/L], where
L is a Lipschitz constant of ∇f. The algorithm is

    x^{k+1} := arg min_x f̂_λ(x, x^k)
Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 75 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Proximal Operator as Backward Euler

This is exactly what the proximal operator does. If f is differentiable, then

    x^{k+1} = arg min_{x∈R^N} f(x) + (1/2α_k)∥x − x^k∥₂²

    0 = ∇f(x^{k+1}) + (1/α_k)(x^{k+1} − x^k)

The proximal operator can be interpreted as backward Euler for the gradient flow.

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 76 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Proximity Operators as Regularization

Proximity operators: given a convex function f, map x^(k) to

    prox_{cf}(x^(k)) = arg min_x (1/2c)∥x − x^(k)∥₂² + f(x)
                       (approximate x^(k) with x)    (penalty)

Although prox_{cf}(x^(k)) can sometimes be difficult to compute, it
greatly simplifies computation for many non-differentiable functions f

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 77 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Example: Characteristic Functions

    h(β) = 0 if β ∈ C,  ∞ otherwise

prox_h(b) = arg min_{β∈C} ∥β − b∥² is the Euclidean projection onto C

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 78 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Example: L1 norm

h(x) = ∥x∥₁, then

Prox_{λh}(x) = S(x; λ) is the soft-thresholding operator

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 79 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Projected Gradient Descent

Projected gradient:

    x^(k+1) = arg min_x { g(x^(k)) + ⟨∇g(x^(k)), x − x^(k)⟩ + (1/2c)∥x − x^(k)∥² + 1_C(x) }
            = arg min_x { (1/2)∥x − (x^(k) − c∇g(x^(k)))∥² + c·1_C(x) }
            = P_C(x^(k) − c∇g(x^(k)))

This is exactly projected gradient descent

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 80 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Example: Gradient Projection Method

A special case: h(x) = 1_C(x) = 0 if x ∈ C, ∞ otherwise, where C is a set
The algorithm becomes x^(k+1) = P_C(x^(k) − t∇g(x^(k))), as in the sketch below
P_C is the projection onto the set C
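A minimal sketch of projected gradient descent for a box constraint C = {x : lo ≤ x ≤ hi}, where the projection is just clipping (an illustrative choice of C, not from the slides):

import numpy as np

def projected_gradient_descent(grad_g, project, x0, t=0.1, n_iter=500):
    x = x0.copy()
    for _ in range(n_iter):
        x = project(x - t * grad_g(x))     # x^(k+1) = P_C(x^(k) - t grad g(x^(k)))
    return x

# Example: minimize (1/2)||x - c||^2 over the box [0, 1]^3
c = np.array([1.5, -0.3, 0.4])
grad_g = lambda x: x - c
project = lambda x: np.clip(x, 0.0, 1.0)
x_star = projected_gradient_descent(grad_g, project, np.zeros(3))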

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 81 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Important Proximity Operators

h(x) = 0: prox_h(x) = x

h(x) = indicator function of a closed convex set C: prox_h is the projection onto C,

    prox_h(x) = arg min_{u∈C} ∥u − x∥₂² = P_C(x)

L1 norm: h(x) = τ∥x∥₁, prox_h(y) = S_τ(y) := sign(y) max(|y| − τ, 0)

Squared Euclidean norm: h(x) = (τ/2)∥x∥₂², prox_h(y) = y/(1 + τ)
Nuclear norm: h(X) = τ∥X∥_*, with Y = UΛV^T the SVD,

    prox_h(Y) = U S_τ(Λ) V^T
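Minimal NumPy sketches of these proximity operators (the nuclear-norm prox assumes a full SVD of Y is affordable):

import numpy as np

def prox_l1(y, tau):                 # prox of tau*||x||_1: soft thresholding
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

def prox_sq_l2(y, tau):              # prox of (tau/2)*||x||_2^2: shrinkage y/(1+tau)
    return y / (1.0 + tau)

def prox_nuclear(Y, tau):            # prox of tau*||X||_*: soft-threshold the singular values
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt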

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 82 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Decomposable Functions

Unconstrained optimization with the objective split into two components

    minimize_x f(x) = g(x) + h(x)

g is convex and differentiable, dom(g) = R^n
h is convex, not necessarily differentiable, with an inexpensive
prox-operator
We could use the subgradient method to solve it
Are there more scalable ways for large-scale optimization?

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 83 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Assumptions on g

Lower bound (convexity of g): there exists m ≥ 0 such that

    g(x) ≥ g(u) + ∇g(u)^T(x − u) + (m/2)∥x − u∥₂²

m = 0: convexity
m > 0: strong convexity
Upper bound (Lipschitz continuity of the gradient):

    g(x) ≤ g(u) + ∇g(u)^T(x − u) + (L/2)∥x − u∥₂²

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 84 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Assumptions on g

    prox_{cf}(x^(k)) = arg min_x (1/2c)∥x − x^(k)∥₂² + f(x)
                       (approximate x^(k) with x)    (penalty)

Instead of minimizing g(x) + h(x) directly, minimize its quadratic upper bound:

    x^(k+1) = arg min_x g(x^(k)) + ∇g(x^(k))^T(x − x^(k)) + (L/2)∥x − x^(k)∥₂² + h(x)
            = arg min_x (L/2)∥x − (x^(k) − (1/L)∇g(x^(k)))∥₂² + h(x)
            = prox_{ch}(x^(k) − c∇g(x^(k)))

where c = 1/L is the step size


Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 85 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Basic Proximal Gradient Algorithm

Proximal gradient algorithm to solve

    min_x f(x) = min_x (g(x) + h(x))

    x^(k+1) = prox_{ch}(x^(k) − c∇g(x^(k)))

Convergence rate: O(1/k) on function value
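A minimal sketch of this iteration in generic form (grad_g and prox_h supplied by the caller; shown on the lasso, where the prox of c·γ∥x∥₁ is soft thresholding at level c·γ):

import numpy as np

def proximal_gradient(grad_g, prox_h, x0, c, n_iter=500):
    x = x0.copy()
    for _ in range(n_iter):
        x = prox_h(x - c * grad_g(x), c)   # x^(k+1) = prox_{ch}(x^(k) - c grad g(x^(k)))
    return x

# Lasso example: g(x) = (1/2)||Ax - y||^2, h(x) = gamma*||x||_1, c = 1/L with L = ||A||_2^2
def run_lasso(A, y, gamma):
    L = np.linalg.norm(A, 2) ** 2
    grad_g = lambda x: A.T @ (A @ x - y)
    prox_h = lambda z, c: np.sign(z) * np.maximum(np.abs(z) - c * gamma, 0.0)
    return proximal_gradient(grad_g, prox_h, np.zeros(A.shape[1]), c=1.0 / L)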

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 86 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Accelerated Proximal Gradient

Initialize: choose α ≤ 1/L, set y₁ = x₀, t₁ = 1

Iterate:
    x_k ← prox_{αh}(y_k − α∇g(y_k))
    t_{k+1} ← (1 + √(1 + 4t_k²)) / 2
    y_{k+1} ← x_k + ((t_k − 1)/t_{k+1}) (x_k − x_{k−1})
Faster O(1/k²) convergence rate (see the sketch below)
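A minimal sketch of this accelerated (FISTA-style) iteration, written with the g + h notation of this section (assumes ∇g is L-Lipschitz and alpha ≤ 1/L):

import numpy as np

def accelerated_proximal_gradient(grad_g, prox_h, x0, alpha, n_iter=500):
    x_prev = x0.copy()
    y = x0.copy()
    t = 1.0
    for _ in range(n_iter):
        x = prox_h(y - alpha * grad_g(y), alpha)        # proximal gradient step at y_k
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x + ((t - 1.0) / t_next) * (x - x_prev)     # momentum/extrapolation step
        x_prev, t = x, t_next
    return x_prev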

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 87 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Convergence Rate

                      Convergence Rate    Iteration Complexity
PG: Convex            O(1/k)              O(1/ε)
PG: Strongly Convex   O((1 − m/L)^k)      O((L/m) log(1/ε))
APG: Convex           O(1/k²)             O(1/√ε)

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 88 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Example: Least Squares

Suppose we want to solve the standard least-squares problem

    minimize_{x∈R^N} ∥y − Ax∥₂².

When A has full column rank, we know that the solution is given by
    x̂_ls = (A^T A)^{−1} A^T y.
However, we also know that when AT A is not well-conditioned, this
inverse can be unstable to compute, and iterative descent methods
(gradient descent and conjugate gradients) can take many iterations
to converge.

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 89 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Proximal Operator for Least Squares

Consider the proximal point iteration (with fixed α_k = α) for solving this problem:

    x^{k+1} = arg min_{x∈R^N} (1/2)∥y − Ax∥₂² + (1/2α)∥x − x^k∥₂².

Here we have the closed-form solution

    x^{k+1} = (A^T A + δI)^{−1} (A^T y + δx^k),   δ = 1/α
            = x^k + (A^T A + δI)^{−1} A^T (y − Ax^k).

This is a well-known technique in numerical linear algebra called "iterative refinement".
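A minimal sketch of this proximal point (iterative refinement) update for least squares (assumes δ > 0 is chosen by the user; the regularized Gram matrix is formed once and reused):

import numpy as np

def proximal_point_ls(A, y, delta=1.0, n_iter=50):
    n, p = A.shape
    x = np.zeros(p)
    G = A.T @ A + delta * np.eye(p)                     # A^T A + delta*I, formed once
    for _ in range(n_iter):
        x = x + np.linalg.solve(G, A.T @ (y - A @ x))   # x^{k+1} = x^k + (A^T A + delta I)^{-1} A^T (y - A x^k)
    return x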

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 90 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Example 1: LASSO

For lasso:
    g(x) = (1/2)∥Ax − y∥²,   h(x) = γ∥x∥₁

We know that ∇g(x) = A^T(Ax − y) and prox_{γ∥·∥₁}(x) = S_γ(x),
where S_γ(x) = sign(x)(|x| − γ)₊ is the soft-thresholding operator

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 91 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Example 1: LASSO

In each iteration

    x^(k+1) = S_{tγ}(x^(k) − t A^T(Ax^(k) − y))

Often called the iterative soft-thresholding algorithm (ISTA). A very
simple algorithm to compute a lasso solution
Complexity in general: O(np) per iteration

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 92 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Convergence

Note: APG is not a descent method

Acceleration can really help!

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 93 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Example 2: Stochastic PCA

Goal: find the optimal PC direction w such that the variance is maximized

    min_{∥w∥=1} − Σ_{i=1}^n w^T x_i x_i^T w

Initially pick a random unit norm vector w0


In the t-th iteration:
SGD: perform wt = wt−1 + ηxt xtT wt−1
Projection: wt = wt /∥wt ∥

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 94 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Example 3: L∞ Regression

For L∞ regression:
    g(x) = (1/2)∥Ax − y∥²,   h(x) = γ∥x∥∞

We know that ∇g(x) = A^T(Ax − y), and by Moreau decomposition
prox_{γ∥·∥∞}(x) = x − Π_{∥·∥₁ ≤ γ}(x), where Π_{∥·∥₁ ≤ γ} is the projection onto the ℓ₁ ball of radius γ

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 95 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Example: Proximal Policy Optimization

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 96 / 96
