Scalable Algorithms for Structured Functions

Hao Yan

February 29, 2024

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 1 / 96
Outline

1 Overview
Regularized Likelihood

2 Subgradient
Algorithm for Subgradient Descent

3 Coordinate Descent
Coordinate Descent
Block Coordinate Descent

4 Proximal Gradient: Handling Constraints and Regularization


Proximal Gradient

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 2 / 96
Overview

Section 1

Overview

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 3 / 96
Overview Regularized Likelihood

Challenges for High-Dimensional Problems


It is extremely hard, if not impossible, to model the joint distribution of the HD data.
The curse of dimensionality:
  Data become sparse
  Point-to-point distances lose meaning
  Data requirements for statistical significance and combinatorial search complexity grow exponentially

Figure: The curse of dimensionality. Source: Parsons, Haque, and Liu (2004)
Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 4 / 96
Overview Regularized Likelihood

Blessing of Dimensionality

The blessing of dimensionality: real data are often highly concentrated on low-dimensional, sparse, or degenerate structures in high-dimensional space.

Challenge: encode such structural information into likelihood-based inference.
Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 5 / 96
Overview Regularized Likelihood

Overfitting Problem

Overfitting: the model has low training error but high prediction error.
Using too many features can lead to overfitting.

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 6 / 96
Overview Regularized Likelihood

Understand Model Complexity

Training data and testing data generated from the same distribution

(xi , yi ) ∼ D

L(f ) = E(x,y )∼D [(f (x) − y )2 ]


Ideal model: minimizing prediction error f ∗ = arg minf ∈F L(f )
Empirical error: error on the training data
L̂(f) = (1/n) Σ_{i=1}^n (f(x_i) − y_i)²

Our estimator: fˆ = arg minf ∈F L̂(f )

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 7 / 96
Overview Regularized Likelihood

Unified Framework of High-Dimensional Analysis

Traditional statistical framework


High-dimensional statistical inference:
Number of parameters p is larger than sample size n
Usually have low-dimension structure, including sparse representation,
sparse and structured matrices, low-rank matrices
Solved by regularized M-estimators under high-dimensional scaling,
where ξ_1^n = {ξ_1, · · · , ξ_n} denotes n identically distributed observations
Solve convex optimization problem

x̂ ∈ arg min_x L(x; ξ_1^n) + λ_n R(x)

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 8 / 96
Overview Regularized Likelihood

Why Regularization

Why regularization?
From a computational perspective, regularization can lend stability to
the optimization problem and can lead to algorithmic speed-ups.
From a geometric point of view, regularization helps the solution be
close to the low-dimensional manifold.
From a statistical point of view, regularization avoids over-fitting and
leads to estimators with interesting guarantees on their error.
From an engineering or domain-system point of view, regularization leads to solutions with better desired properties. It
is a way to put human knowledge into the learning system.

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 9 / 96
Overview Regularized Likelihood

All Subset Regression v.s. Ridge Regression

Ridge Regression:

min_θ ∥y − Xθ∥₂² subject to ∥θ∥₂² ≤ t

Best subset selection:

min_θ ∥y − Xθ∥₂² subject to ∥θ∥₀ ≤ t

∥θ∥0 is the number of non-zero elements in θ


                          Subset selection   Ridge regression
Computationally feasible  No                 Yes
Variable selection        Yes                No

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 10 / 96
Overview Regularized Likelihood

Vector Norm - Convexity


Geometry of regularization in R²: convexity

Figure: norm balls ∥β∥₀ ≤ t, ∥β∥_{1/2} ≤ t, ∥β∥_{3/4} ≤ t, ∥β∥₁ ≤ t, ∥β∥_{3/2} ≤ t, ∥β∥₂ ≤ t in the (β₁, β₂) plane.
Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 11 / 96
Overview Regularized Likelihood

Vector Norm - Variable Selection


Geometry of regularization in R²: model selection

Figure: the same norm balls ∥β∥₀ ≤ t, ∥β∥_{1/2} ≤ t, ∥β∥_{3/4} ≤ t, ∥β∥₁ ≤ t, ∥β∥_{3/2} ≤ t, ∥β∥₂ ≤ t in the (β₁, β₂) plane; corners on the axes promote variable selection.

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 12 / 96
Overview Regularized Likelihood

Vector Norm - Variable Selection

Norm        Convex?   Corners?
∥·∥₀        No        Yes
∥·∥_{1/2}   No        Yes
∥·∥_{3/4}   No        Yes
∥·∥₁        Yes       Yes ✓
∥·∥_{3/2}   Yes       No
∥·∥₂        Yes       No
Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 13 / 96
Overview Regularized Likelihood

Vector Norm - Variable Selection

Best of both worlds: ∥·∥₁ (convex, with corners)

Figure: the ℓ₁ ball in the (β₁, β₂) plane.
Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 14 / 96
Overview Regularized Likelihood

Lasso Regression

Assume no intercept by first standardizing the data (θ₀ = 0).


Lasso Regression

    min Σ_{i=1}^n {y_i − (θ₁x_{1i} + · · · + θ_p x_{pi})}²   s.t.   Σ_{j=1}^p |θ_j| ≤ M

Equivalently

    min Σ_{i=1}^n {y_i − (θ₁x_{1i} + · · · + θ_p x_{pi})}² + λ Σ_{j=1}^p |θ_j|

In matrix form

    min_θ ∥y − Xθ∥² + λ∥θ∥₁

Challenge: non-differentiable at θ_j = 0
Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 15 / 96
Overview Regularized Likelihood

Example 2: Support Vector Machines

    min_x Σ_{i=1}^n max(0, 1 − y_i x^T ξ_i)  +  (1/2)∥x∥²
          (hinge loss function)               (L2 regularization)
Hinge Loss function is a convex-relaxation of 0 − 1 loss function

Non-differentiable when y_i x^T ξ_i = 1
Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 16 / 96
Overview Regularized Likelihood

Type of Regularization
Property               Regularizer R              Form
Sparse solution        L1 norm                    Σ_j |x_j|
Group sparsity         Group L1 norm              Σ_{G∈G} ∥x_G∥₂
Continuity with jumps  L1 norm of gradient        Σ_j |x_j − x_{j+1}|
Smoothness             Smoothness penalty         Σ_j (x_j − 2x_{j+1} + x_{j+2})²
Low-rank matrices      Nuclear norm               Σ_j σ_j(X)
Small magnitude        Ridge penalty              ∥x∥²

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 17 / 96
Subgradient

Section 2

Subgradient

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 18 / 96
Subgradient Algorithm for Subgradient Descent

Problem of Gradient Descent

Finding the Gradient descent direction (or even finding a descent


direction) may involve expensive computation
Stepsize rules are tricky to choose: for certain popular stepsize rules
(like exact line search), Gradient descent might converge to
non-optimal points.
Even though it never hits non-differentiable points, Gradient descent
with exact line search gets stuck around a non-optimal point (i.e. (0,
0))
Problem: Gradient descent directions may undergo large /
discontinuous changes when close to convergence limits

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 19 / 96
Subgradient Algorithm for Subgradient Descent

Wolfe’s example

f(x₁, x₂) = 5(9x₁² + 16x₂²)^{1/2}   if x₁ > |x₂|
            9x₁ + 16|x₂|            if x₁ ≤ |x₂|

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 20 / 96
Subgradient Algorithm for Subgradient Descent

Subgradient
Convexity ⇒ continuity (on the interior of the domain of the function)
Convexity ̸⇒ differentiability (e.g. ∥x∥₁)
Subgradients generalize gradients for general convex functions
v is a subgradient of f at x if f(x′) ≥ f(x) + v^T(x′ − x) for all x′
Subdifferential: ∂f(x) = {all subgradients of f at x}
If f is differentiable, ∂f(x) = {∇f(x)}

Geometrically, subgradients correspond to supporting hyperplanes of the epigraph of f

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 21 / 96
Subgradient Algorithm for Subgradient Descent

Basic Rules

Scaling: ∂(αf) = α ∂f for α > 0

Summation: ∂(f₁ + f₂) = ∂f₁ + ∂f₂
Affine transformation: if h(x) = f(Ax + b), then ∂h(x) = A^T ∂f(Ax + b)
Chain rule: ∂(g(f(x))) = g′(f(x)) ∂f(x), supposing f is convex and g is
differentiable, nondecreasing, and convex
Maximum: if f(x) = max_i f_i(x), then ∂f(x) = Conv{∪{∂f_i(x) | f_i(x) = f(x)}},
which is the convex hull of the subdifferentials of all active functions

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 22 / 96
Subgradient Algorithm for Subgradient Descent

Example: Piecewise-linear function

f(x) = max_{i=1,··· ,m} (a_i^T x + b_i)
∂f(x) = Conv{a_i | i ∈ I(x)}, where I(x) = {i | a_i^T x + b_i = f(x)}
I(x) is the index set of the active functions
Conv is the convex hull: the subdifferential is the convex hull of the coefficients a_i of all active functions

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 23 / 96
Subgradient Algorithm for Subgradient Descent

Dual Norm Examples

f (x) = ∥x∥p . Define q such that 1/p + 1/q = 1, then

∥x∥_p = max_{∥z∥_q ≤ 1} z^T x

∂∥x∥_p = Conv{z | z ∈ arg max_{∥z∥_q ≤ 1} z^T x}

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 24 / 96
Subgradient Algorithm for Subgradient Descent

Dual Norm Examples


Example p = 1:

∥x∥₁ = max_{z ∈ {−1,1}^n} z^T x

∂∥x∥₁ = ∏_k J_k,   J_k = {1} if x_k > 0,  {−1} if x_k < 0,  [−1, 1] if x_k = 0
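As an illustration (a minimal NumPy sketch, not from the slides), one componentwise element of ∂∥x∥₁ is sign(x), and the subgradient inequality can be checked numerically:

import numpy as np

def l1_subgradient(x):
    # sign(x) picks 1 for x_k > 0, -1 for x_k < 0, and a value in [-1, 1] (here 0) for x_k = 0
    return np.sign(x)

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
v = l1_subgradient(x)
for _ in range(100):
    x_new = rng.standard_normal(5)
    # subgradient inequality: f(x') >= f(x) + v^T (x' - x) for f = ||.||_1
    assert np.sum(np.abs(x_new)) >= np.sum(np.abs(x)) + v @ (x_new - x) - 1e-12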

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 25 / 96
Subgradient Algorithm for Subgradient Descent

Subgradient Properties

f is convex and differentiable =⇒ ∂f (x) = {∇f (x)}.


At any point x, there can be 0, 1, or infinitely many subgradients.
∂f (x) = ∅ =⇒ f is not convex.

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 26 / 96
Subgradient Algorithm for Subgradient Descent

Global Optimality Condition

Definition
A vector g ∈ Rn is a subgradient of f : Rn → R at x if for all z,

f (z) ⩾ f (x) + g T (z − x).

Corollary
If 0 ∈ ∂f (x), then x is a global minimizer of f .

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 27 / 96
Subgradient Algorithm for Subgradient Descent

Example: Absolute Value Function

For the function f(x) = |x|, the subdifferential is given by:

    ∂f(x) = {1} if x > 0,  {−1} if x < 0,  [−1, 1] if x = 0

This indicates that for any non-zero x, the subgradient is simply the
sign of x, while at x = 0 the subdifferential includes all values between −1
and 1.

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 28 / 96
Subgradient Algorithm for Subgradient Descent

Subdifferential of Absolute Value

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 29 / 96
Subgradient Algorithm for Subgradient Descent

Subgradient Method

minx f (x) with f convex, dom(f ) = R n


Subgradient method: choose x (0) ∈ R n and repeat:

x (k) = x (k−1) − αk g (k−1)

where g (k−1) ∈ ∂f (x (k−1) )


Generalization of Gradient Descent for non-smooth function
Problem: not a descent algorithm, so we keep track of the best point found so far:

    f_best^(k) = f(x_best^(k)) = min_{i=1,··· ,k} f(x^(i))
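A minimal sketch of the subgradient method with best-iterate tracking (f, the subgradient oracle, and the step-size rule are supplied by the caller; the names are illustrative, not from the slides):

import numpy as np

def subgradient_method(f, subgrad, x0, step, n_iter=1000):
    # Not a descent method, so track the best iterate seen so far.
    x = x0.copy()
    x_best, f_best = x.copy(), f(x)
    for k in range(1, n_iter + 1):
        g = subgrad(x)
        x = x - step(k) * g          # x^(k) = x^(k-1) - alpha_k g^(k-1)
        if f(x) < f_best:
            x_best, f_best = x.copy(), f(x)
    return x_best, f_best

# Example: f(x) = |x_1| + 2|x_2| with diminishing step sizes alpha_k = 1/k
f = lambda x: abs(x[0]) + 2 * abs(x[1])
subgrad = lambda x: np.array([np.sign(x[0]), 2 * np.sign(x[1])])
x_best, f_best = subgradient_method(f, subgrad, np.array([1.0, -2.0]), step=lambda k: 1.0 / k)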

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 30 / 96
Subgradient Algorithm for Subgradient Descent

Stochastic Subgradient Method

Stochastic subgradient method is the subgradient method, using noisy


unbiased subgradient estimation

x (k+1) = x (k) − αk g̃ (k)

x^(k) is the k-th iterate
g̃^(k) is any unbiased estimate of a subgradient of (convex) f at x^(k)
Typically, g̃^(k) is computed on a mini-batch of samples ξ_i, i = 1, · · · , B

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 31 / 96
Subgradient Algorithm for Subgradient Descent

Not descent algorithms

For differentiable f, −∇f(x) is always a descent direction (except at a minimizer); for non-differentiable (convex) functions, δx = −g with g ∈ ∂f(x) need not be a descent direction.

Example 1: f(x) = |x₁| + 2|x₂|

Example 2: f(x) = |x| at x = 0

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 32 / 96
Subgradient Algorithm for Subgradient Descent

Descent in terms of Distance

Property
if f is convex, f (z) < f (x), g ∈ ∂f (x), then for small t > 0,

∥x − tg − z∥2 < ∥x − z∥2

thus −g is descent direction for ∥x − z∥2 , for any z with f (z) < f (x) (e.g.,
x⋆ )

Negative subgradient is descent direction for distance to optimal point


Proof:
    ∥x − tg − z∥₂² = ∥x − z∥₂² − 2t g^T(x − z) + t²∥g∥₂²
                  ≤ ∥x − z∥₂² − 2t(f(x) − f(z)) + t²∥g∥₂²
                  < ∥x − z∥₂² for small t > 0

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 33 / 96
Subgradient Algorithm for Subgradient Descent

Convergence Rate
Theorem
Suppose f_best^(k) is the best function value obtained up to the k-th iteration. Then

    f_best^(k) − f* ≤ (R² + G² Σ_{i=1}^k α_i²) / (2 Σ_{i=1}^k α_i)

Here R = ∥x^(1) − x*∥₂ and we assume ∥g∥₂ ≤ G.

Constant step size α_k = α:

    f_best^(k) − f* ≤ (R² + G² k α²) / (2kα) → G²α/2

Not guaranteed to converge!

Square-summable but not summable step sizes (Σ_{i=1}^∞ α_i² < ∞, Σ_{i=1}^k α_i → ∞):

    f_best^(k) − f* ≤ (R² + G² Σ_{i=1}^k α_i²) / (2 Σ_{i=1}^k α_i) → 0,

so f_best^(k) → f*.
The optimal fixed-horizon step size α_i = R/(G√k) gives f_best^(k) − f* ≤ RG/√k.
Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 34 / 96
Subgradient Algorithm for Subgradient Descent

Convergence Rate (Not Required)


Define R = ∥x^(1) − x*∥₂ and assume ∥g∥₂ ≤ G
Let x* be any minimizer of f

    ∥x^(k+1) − x*∥₂² = ∥x^(k) − α_k g^(k) − x*∥₂²
                     = ∥x^(k) − x*∥₂² − 2α_k g^(k)T (x^(k) − x*) + α_k² ∥g^(k)∥₂²
                     ≤ ∥x^(k) − x*∥₂² − 2α_k (f(x^(k)) − f(x*)) + α_k² ∥g^(k)∥₂²

Apply it recursively:

    0 ≤ ∥x^(k+1) − x*∥₂² ≤ ∥x^(1) − x*∥₂² − 2 Σ_{i=1}^k α_i (f(x^(i)) − f*) + Σ_{i=1}^k α_i² ∥g^(i)∥₂²
                         ≤ R² − 2 Σ_{i=1}^k α_i (f(x^(i)) − f*) + G² Σ_{i=1}^k α_i²
                         ≤ R² − 2 (f_best^(k) − f*) (Σ_{i=1}^k α_i) + G² Σ_{i=1}^k α_i²

We have

    f_best^(k) − f* ≤ (R² + G² Σ_{i=1}^k α_i²) / (2 Σ_{i=1}^k α_i)

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 35 / 96
Subgradient Algorithm for Subgradient Descent

Step-size Choice

Fixed step sizes αk = α for all k.


Diminishing step sizes: choose them to satisfy

    Σ_{k=1}^∞ α_k² < +∞,   Σ_{k=1}^∞ α_k = ∞

Important that step sizes go to zero, but not too fast, e.g. α_k = 1/k
Major difference from gradient descent:
step sizes are typically pre-specified, not adaptively computed (no line search!)

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 36 / 96
Subgradient Algorithm for Subgradient Descent

Example 1: Lasso

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 37 / 96
Subgradient Algorithm for Subgradient Descent

Example 1: Lasso

Lasso:
    min_x (1/2)∥Ax − y∥₂² + τ∥x∥₁

Optimality: 0 ∈ ∂((1/2)∥Ax − y∥₂² + τ∥x∥₁)

    A^T(y − Ax) = τv, for v ∈ ∂∥x∥₁,  v_i ∈ {1} if x_i > 0,  {−1} if x_i < 0,  [−1, 1] if x_i = 0

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 38 / 96
Subgradient Algorithm for Subgradient Descent

Optimality Criterion for Lasso

If A = I, this gives y − x = τv
This can be solved for each x_i individually:

    y_i − x_i = τv_i,  v_i ∈ {1} if x_i > 0,  {−1} if x_i < 0,  [−1, 1] if x_i = 0

If x_i > 0: v_i = 1, so x_i = y_i − τ, which holds when y_i > τ
If x_i < 0: v_i = −1, so x_i = y_i + τ, which holds when y_i < −τ
If x_i = 0: y_i − x_i = y_i ∈ [−τ, τ], which holds when y_i ∈ [−τ, τ]
Solution: x_i = sgn(y_i) · (|y_i| − τ)₊, the soft-thresholding operator
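A one-line NumPy version of the soft-thresholding operator (a minimal sketch; the vectorized form matches the componentwise formula above):

import numpy as np

def soft_threshold(y, tau):
    # S_tau(y) = sign(y) * max(|y| - tau, 0), applied componentwise
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)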

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 39 / 96
Subgradient Algorithm for Subgradient Descent

Subgradient Descent for Lasso

If A is orthogonal, a similar derivation gives the closed-form solution

    x = sgn(A^T y) · (|A^T y| − τ)₊

If A is not orthogonal, we cannot obtain a closed-form solution!
Use subgradient descent:
    x^(k+1) = x^(k) − α_k g^(k), where g^(k) = A^T(Ax^(k) − y) + τv^(k), v^(k) ∈ ∂∥x^(k)∥₁, is a subgradient
Computational complexity: O(np) per iteration
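A minimal sketch of subgradient descent for the lasso (A, y, tau are assumed given; the step-size rule alpha_k = a/sqrt(k) is one common choice, not prescribed by the slides):

import numpy as np

def lasso_subgradient_descent(A, y, tau, n_iter=5000, a=1e-2):
    n, p = A.shape
    x = np.zeros(p)
    obj = lambda z: 0.5 * np.sum((A @ z - y) ** 2) + tau * np.sum(np.abs(z))
    x_best, f_best = x.copy(), obj(x)
    for k in range(1, n_iter + 1):
        g = A.T @ (A @ x - y) + tau * np.sign(x)   # subgradient of the lasso objective
        x = x - (a / np.sqrt(k)) * g               # O(np) per iteration
        if obj(x) < f_best:
            x_best, f_best = x.copy(), obj(x)
    return x_best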

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 40 / 96
Subgradient Algorithm for Subgradient Descent

Result
Ridge (Left): use gradients; Lasso (Right): use subgradients. Example
here has n = 1000, p = 20:

Step sizes hand-tuned to be favorable for each method (of course the comparison is imperfect, but it reveals the convergence behaviors)

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 41 / 96
Subgradient Algorithm for Subgradient Descent

Example 2: SVM

    min_x Σ_{i=1}^n max(0, 1 − y_i x^T ξ_i)  +  (1/2)∥x∥²
          (hinge loss function)               (L2 regularization)

Subgradient: ∂((1/2)∥x∥²) = {x},

    v_i ∈ ∂_x max(0, 1 − y_i x^T ξ_i) = {0} if y_i x^T ξ_i > 1,  {−y_i ξ_i} if y_i x^T ξ_i < 1,  Conv{0, −y_i ξ_i} if y_i x^T ξ_i = 1

A subgradient of the objective is g = x + Σ_i v_i
Subgradient descent: x^(k+1) = x^(k) − τ g^(k)
Can be combined with stochastic optimization by choosing a random index i
and using g = x + n·v_i as an unbiased estimate (see the sketch below).
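A minimal sketch of the stochastic subgradient step for this SVM objective (samples ξ_i stored as rows of Xi; the scaling by n that keeps the single-sample subgradient unbiased is an assumption, not spelled out on the slide):

import numpy as np

def svm_stochastic_subgradient(Xi, y, n_iter=10000, tau=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    n, d = Xi.shape
    x = np.zeros(d)
    for _ in range(n_iter):
        i = rng.integers(n)
        margin = y[i] * (x @ Xi[i])
        v_i = -y[i] * Xi[i] if margin < 1 else np.zeros(d)  # subgradient of the hinge term
        g = x + n * v_i                                     # unbiased estimate of x + sum_i v_i
        x = x - tau * g
    return x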
Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 42 / 96
Coordinate Descent

Section 3

Coordinate Descent

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 43 / 96
Coordinate Descent Coordinate Descent

Coordinate Descent

We have seen some of the core methods in optimization, the main objects of study in this field, such as
First-order methods
Newton's method
Dual methods
Today, we'll see an extremely simple technique that is surprisingly
efficient and scalable.
It doesn't apply as broadly as the above methods, but is interesting and
useful when it does apply.
The idea is coordinate-wise minimization

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 44 / 96
Coordinate Descent Coordinate Descent

Coordinate-wise Minimization

Cyclic coordinate descent:

Update one coordinate at a time, cycling through {1, · · · , n}:

    x_1^(k) = arg min_{x_1} f(x_1, x_2^(k−1), · · · , x_n^(k−1))
    x_2^(k) = arg min_{x_2} f(x_1^(k), x_2, · · · , x_n^(k−1))
    ...
    x_n^(k) = arg min_{x_n} f(x_1^(k), x_2^(k), · · · , x_n)

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 45 / 96
Coordinate Descent Coordinate Descent

Convergence
If f is convex and differentiable and f is minimized at x* along each
coordinate axis, i.e. f(x* + d·e_i) ≥ f(x*) for all d and all i, then x* is a global minimizer.
Proof.
If f(x* + d·e_i) ≥ f(x*) for all d, then ∂f/∂x_i(x*) = 0. Therefore,

    ∇f(x*) = (∂f/∂x_1(x*), · · · , ∂f/∂x_n(x*)) = 0

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 46 / 96
Coordinate Descent Coordinate Descent

Convergence

Q: Same question, but for convex, non-differentiable f?

A: No! See the counterexample in the figure: coordinate-wise minimization can get stuck at a point that is not a global minimizer.


Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 47 / 96
Coordinate Descent Coordinate Descent

Coordinate Descent

Consider first min f (x) with f : R n → R


Iteration j of basic coordinate descent:
Choose index ij ∈ {1, · · · , n}
Fix all i ≠ i_j, change x_{i_j} to reduce f
Variants:
Take a gradient step: −∇_{i_j} f(x)
Do a more rigorous search in the subspace of x_{i_j}
Actually minimize f over x_{i_j}: arg min_{x_{i_j}} f(x)

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 48 / 96
Coordinate Descent Coordinate Descent

Convergence
P
Q: Same question, but for f(x) = g(x) + Σ_{i=1}^n h_i(x_i), with g convex and differentiable and each h_i convex (possibly non-differentiable)?
A: Yes! For any y,

    f(y) − f(x) ≥ ∇g(x)^T(y − x) + Σ_{i=1}^n (h_i(y_i) − h_i(x_i))
               = Σ_{i=1}^n [∇_i g(x)(y_i − x_i) + h_i(y_i) − h_i(x_i)] ≥ 0,

where each bracketed term is nonnegative, since coordinate-wise optimality of x,
    g(x) + h_i(x_i) ≤ g(x + (y_i − x_i)e_i) + h_i(y_i),
implies −∇_i g(x) ∈ ∂h_i(x_i) and hence h_i(y_i) − h_i(x_i) ≥ −∇_i g(x)(y_i − x_i).

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 49 / 96
Coordinate Descent Coordinate Descent

Selecting Coordinate Order

Choice of Ordering
Cyclic Order: run all coordinates in cyclic order
Random Sampling: Randomly select some coordinate to update
Gauss−Southwell : At each iteration, pick coordinate i so that
i = arg maxj |∇j f (x (t) )|
Random Permutation: run cyclic order on a permuted index

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 50 / 96
Coordinate Descent Coordinate Descent

Example 1: Linear Regression

min_x f(x) = (1/2)∥y − Ax∥²

Consider minimizing over x_i with all x_j, j ≠ i fixed
Optimality condition:

    0 = ∇_i f(x) = A_i^T(Ax − y) = A_i^T(A_i x_i + A_{−i} x_{−i} − y)

Solved by

    x_i = A_i^T(y − A_{−i} x_{−i}) / ∥A_i∥²
Coordinate descent repeats the update for i = 1, · · · , p, 1, 2, · · · , p, · · ·

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 51 / 96
Coordinate Descent Coordinate Descent

Complexity

Complexity for each iteration i

    x_i = A_i^T(y − A_{−i} x_{−i}) / ∥A_i∥²

A_i ∈ R^{n×1}, y ∈ R^{n×1}, A_{−i} ∈ R^{n×(p−1)}, x_{−i} ∈ R^{(p−1)×1}

Computing A_{−i}x_{−i} requires O(np) operations; the remaining inner product A_i^T(·) is O(n)
So each coordinate update costs O(np), and each full cycle over i = 1, · · · , p takes O(np²)
Is it optimal?

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 52 / 96
Coordinate Descent Coordinate Descent

A Clever Trick

Define r = y − Ax

    A_i^T(y − A_{−i}x_{−i}) = A_i^T(y − Ax + A_i x_i) = A_i^T r + A_i^T A_i x_i = A_i^T r + ∥A_i∥² x_i

The update rule is given by

    x_i = A_i^T(y − A_{−i}x_{−i}) / ∥A_i∥² = A_i^T r / ∥A_i∥² + x_i

Complexity: O(n) for each coordinate update, O(np) for full cyclic
update over i = 1, · · · , p
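A minimal sketch of cyclic coordinate descent for least squares using this residual trick (assumes A has no zero columns; the variable names are illustrative):

import numpy as np

def coordinate_descent_ls(A, y, n_cycles=50):
    n, p = A.shape
    x = np.zeros(p)
    r = y - A @ x                      # residual r = y - Ax, maintained throughout
    col_sq = np.sum(A ** 2, axis=0)    # ||A_i||^2, precomputed once
    for _ in range(n_cycles):          # each full cycle costs O(np)
        for i in range(p):
            xi_new = x[i] + A[:, i] @ r / col_sq[i]   # x_i <- A_i^T r / ||A_i||^2 + x_i
            r -= A[:, i] * (xi_new - x[i])            # O(n) residual update
            x[i] = xi_new
    return x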

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 53 / 96
Coordinate Descent Coordinate Descent

Is it fast?

Figure: coordinate descent vs. gradient descent for linear regression, f(k) − f* over iterations, 100 instances (n = 100, p = 20).

Is it fair to compare 1 full cycle of coordinate descent to 1 iteration of gradient descent?
Yes, since both require only O(np) operations if we're clever:

    x_i = A_i^T(y − A_{−i}x_{−i}) / ∥A_i∥² = A_i^T r / ∥A_i∥² + x_i^old

Note that CD is not a first-order method: it uses much more information about the problem.
Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 54 / 96
Coordinate Descent Coordinate Descent

Example 2: LASSO Regression

f(x) = (1/2)∥y − Ax∥² + λ∥x∥₁

Note that ∥x∥₁ = Σ_{i=1}^p |x_i|, so the nonsmooth part is separable and convergence
is guaranteed.
Minimizing over x_i with x_j, j ≠ i fixed, and s_i ∈ ∂|x_i|:

    (1/2)∥y − Ax∥² + λ∥x∥₁ = (1/2)∥y − A_{−i}x_{−i} − A_i x_i∥² + λ∥x_{−i}∥₁ + λ|x_i|

Solution is given by soft-thresholding:

    x_i = S_{λ/∥A_i∥²}( A_i^T(y − A_{−i}x_{−i}) / ∥A_i∥² )

If A_i is normalized, A_i^T 1 = 0 and A_i^T A_i = 1, then
    x_i = S_λ(A_i^T(y − A_{−i}x_{−i}))

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 55 / 96
Coordinate Descent Coordinate Descent

Complexity Analysis

If A_i is normalized, A_i^T 1 = 0 and A_i^T A_i = 1, then
    x_i = S_λ(A_i^T(y − A_{−i}x_{−i}))
Update rule:

    x_i = S_λ(A_i^T(y − Ax + A_i x_i))
        = S_λ(x_i + A_i^T(y − Ax))

Computational complexity is still O(n) per coordinate update (maintaining and updating the residual y − Ax), as shown in the sketch below.
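A minimal sketch of coordinate descent for the lasso with the same residual trick (columns assumed normalized so A_i^T A_i = 1, as on the slide; soft_threshold is the operator defined earlier):

import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def coordinate_descent_lasso(A, y, lam, n_cycles=100):
    n, p = A.shape
    x = np.zeros(p)
    r = y - A @ x                                            # residual, updated in O(n) per coordinate
    for _ in range(n_cycles):
        for i in range(p):
            xi_new = soft_threshold(x[i] + A[:, i] @ r, lam)   # x_i = S_lambda(x_i + A_i^T(y - Ax))
            r -= A[:, i] * (xi_new - x[i])
            x[i] = xi_new
    return x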

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 56 / 96
Coordinate Descent Coordinate Descent

Pathwise Coordinate Descent

Proposed in Friedman et al. (2007, 2009)


Outer loop (Pathwise strategy)
Compute solution over a sequence λ1 > λ2 > · · · > λr
For each λk , initialize coordinate descent at computed solution for λk+1
Inner loop (Active set strategy)
Perform one coordinate cycle (or small number of cycles), and record
active set A of coefficients that are nonzero
Cycle over only the coefficients in A until convergence
Check optimality conditions over all coefficients; if not all satisfied, add
offending coefficients to A, go back one step
Implemented into glmnet algorithm in R, MATLAB, Python

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 57 / 96
Coordinate Descent Coordinate Descent

Pathwise Coordinate Descent

Note: even when the solution is desired at only one λ, the pathwise
strategy is typically much more efficient than directly performing
coordinate descent at that λ
The active set strategy takes advantage of sparsity; e.g., for large
problems, coordinate descent for lasso is much faster than it is for
ridge regression
With these strategies in place (and a few more clever tricks),
coordinate descent can be competitive with the fastest algorithms for ℓ₁
penalized minimization problems
Fortran implementation glmnet, linked to R, MATLAB, etc.

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 58 / 96
Coordinate Descent Coordinate Descent

Coordinate Gradient Descent

For a smooth function, the iterations


    x_i^(k) = x_i^(k−1) − t_{ik} ∇_i f(x_1^(k), · · · , x_{i−1}^(k), x_i^(k−1), x_{i+1}^(k−1), · · · , x_n^(k−1))

for k = 1, 2, 3, · · · are called the coordinate gradient descent.

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 59 / 96
Coordinate Descent Coordinate Descent

Example 3: Fused Lasso

Diagonal fused LASSO


    min_x (1/2) Σ_i (y_i − x_i)² + λ₁ Σ_{j=1}^p |x_j| + λ₂ Σ_{j=2}^p |x_j − x_{j−1}|

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 60 / 96
Coordinate Descent Coordinate Descent

2-D Fused Lasso and Denoising

    min_x (1/2) Σ_{i,i′} (y_{i,i′} − x_{i,i′})² + λ₁ Σ_{i,i′} |x_{i,i′}| + λ₂ Σ_{i=1}^p Σ_{i′=2}^p |x_{i,i′} − x_{i,i′−1}| + λ₃ Σ_{i=2}^p Σ_{i′=1}^p |x_{i,i′} − x_{i−1,i′}|

we consider fusions of pixels one unit away in horizontal or vertical


directions.
After a number of fusions, fused regions can become very irregular in
shape.

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 61 / 96
Coordinate Descent Coordinate Descent

Algorithm of Fused Lasso

Question: Can we use Coordinate Descent?

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 62 / 96
Coordinate Descent Coordinate Descent

Algorithm of Fused Lasso

    min_x (1/2) Σ_i (y_i − x_i)² + λ₁ Σ_{j=1}^p |x_j| + λ₂ Σ_{j=2}^p |x_j − x_{j−1}|

Solution:
Move not only individual parameters but also fuse them together in
pairs and vary their common value
Fix λ₁, start at λ₂ = 0, and solve the problem for incremental values
of λ₂
A theorem guarantees that any two non-zero adjacent values that are fused
at λ₂ are also fused at λ₂′ > λ₂

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 63 / 96
Coordinate Descent Block Coordinate Descent

Block Coordinate Descent

Minimize f(x), x ∈ R^p
Divide x = (x₁, x₂, · · · , x_n); here x₁, · · · , x_n can be vectors/matrices
Update each block of variables x_i individually:

    x_1^(k) = arg min_{x_1} f(x_1, x_2^(k−1), · · · , x_n^(k−1))
    x_2^(k) = arg min_{x_2} f(x_1^(k), x_2, · · · , x_n^(k−1))
    ...
    x_n^(k) = arg min_{x_n} f(x_1^(k), x_2^(k), · · · , x_n)

Typically used if the problem has some natural division of the variables

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 64 / 96
Coordinate Descent Block Coordinate Descent

Recommender Systems

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 65 / 96
Coordinate Descent Block Coordinate Descent

Matrix Factorization Approach: A ≈ WH^T

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 66 / 96
Coordinate Descent Block Coordinate Descent

Matrix Factorization Approach

    min_{W∈R^{m×k}, H∈R^{n×k}} Σ_{(i,j)∈Ω} (A_{ij} − w_i^T h_j)² + λ(∥W∥_F² + ∥H∥_F²)

Ω = {(i, j) | A_{ij} is observed}

Regularization terms to avoid over-fitting
The i-th user ⇒ i-th row of W, w_i^T
The j-th item ⇒ j-th row of H, h_j^T

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 67 / 96
Coordinate Descent Block Coordinate Descent

Example: Alternating Least Squares

Alternately minimize over H and over W until convergence

Subproblem (W fixed):

    min_H Σ_{(i,j)∈Ω} (A_{ij} − (WH^T)_{ij})² + λ(∥W∥_F² + ∥H∥_F²)
        = Σ_{j=1}^n ( Σ_{i∈Ω_j} (A_{ij} − w_i^T h_j)² + λ∥h_j∥² ) + const

Each subproblem is a ridge regression problem

Easy to parallelize: n independent ridge regression subproblems
Block Coordinate Gradient Descent can also be used (Implemented in
the homework)
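A minimal sketch of alternating least squares for this objective (Ω given as a boolean mask M of observed entries, an assumption about the data format; each column update is an independent ridge solve, an easy target for parallelization):

import numpy as np

def als(A, M, k=10, lam=0.1, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.standard_normal((m, k)) * 0.1
    H = rng.standard_normal((n, k)) * 0.1
    for _ in range(n_iter):
        for j in range(n):                     # update H with W fixed: n ridge regressions
            idx = np.where(M[:, j])[0]
            Wj = W[idx]
            H[j] = np.linalg.solve(Wj.T @ Wj + lam * np.eye(k), Wj.T @ A[idx, j])
        for i in range(m):                     # update W with H fixed: m ridge regressions
            idx = np.where(M[i])[0]
            Hi = H[idx]
            W[i] = np.linalg.solve(Hi.T @ Hi + lam * np.eye(k), Hi.T @ A[i, idx])
    return W, H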

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 68 / 96
Coordinate Descent Block Coordinate Descent

Example: EM algorithm

log p(x; θ) = ELBO(q; θ) + KL(q(z)∥p(z|x; θ))

    ELBO(q; θ) = E_{q(z)}[log(P(x, z; θ)/q(z))],   KL(q∥p) = −E_{q(z)}[log(p(z|x; θ)/q(z))]

Expectation step:

    q^(t) = arg max_q ELBO(q; θ^(t))

Maximization step:

    θ^(t+1) = arg max_θ ELBO(q^(t); θ)

Will be covered later in much more detail!

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 69 / 96
Proximal Gradient: Handling Constraints and Regularization

Section 4

Proximal Gradient: Handling Constraints and


Regularization

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 70 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Euler’s Method

Gradient flow: (d/dt) x(t) = −∇f(x(t))
Equilibrium point: x(∞) satisfies ∇f(x(∞)) = 0
Discrete solutions: x^(k) ≈ x(kc)
Forward Euler method:

    x^(k+1) = x^(k) − c∇f(x^(k))

Backward Euler method: solve for x in

    x = x^(k) − c∇f(x),   then set x^(k+1) = x

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 71 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Gradient Descent/Forward Euler

Assume function f is convex, differentiable


Consider
min f (x)
Gradient descent iteration (with step size c):

x (k+1) = x (k) − c∇f (x (k) )

x^(k+1) minimizes

    f(x^(k)) + ⟨∇f(x^(k)), x − x^(k)⟩ + (1/2c)∥x − x^(k)∥₂²
Compared with forward Euler

x (k+1) = x (k) − c∇f (x (k) )

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 72 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Gradient Descent - A Proximal View

x^(k+1) minimizes f(x^(k)) + ⟨∇f(x^(k)), x − x^(k)⟩ + (1/2c)∥x − x^(k)∥₂²

Minimize the distance between the dashed gradient line and the red
quadratic function
At optimality,
f(x^(k)) + ⟨∇f(x^(k)), x − x^(k)⟩ and −(1/2c)∥x − x^(k)∥₂² have the same slope

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 73 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Proximal Gradient as Majorization Minimization

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 74 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Proximal Gradient as Majorization Minimization


A majorization-minimization algorithm for minimizing a function
φ : R^n → R consists of the iteration

    x^{k+1} := arg min_x φ̂(x, x^k),

where φ̂(·, x^k) is a convex upper bound to φ that is tight at x^k.
For an upper bound of f, consider the function f̂_λ given by

    f̂_λ(x, y) = f(y) + ∇f(y)^T(x − y) + (1/2λ)∥x − y∥₂²,

with λ > 0. For fixed y, this function is convex, satisfies
f̂_λ(x, x) = f(x), and is an upper bound on f when λ ∈ (0, 1/L], where
L is a Lipschitz constant of ∇f. The algorithm is

    x^{k+1} := arg min_x f̂_λ(x, x^k)
Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 75 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Proximal Operator as Backward Euler

This is exactly what the proximal operator does. If f is differentiable, then

    x^{k+1} = arg min_{x∈R^N} f(x) + (1/2α_k)∥x − x^k∥₂²

    0 = ∇f(x^{k+1}) + (1/α_k)(x^{k+1} − x^k)

The proximal operator can be interpreted as backward Euler for the gradient flow.

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 76 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Proximity Operators as Regularization

Proximity operators: given a convex function f, map x^(k) to

    prox_{cf}(x^(k)) = arg min_x (1/2c)∥x − x^(k)∥₂² + f(x)
                       (approximate x^(k) with x)    (penalty)

Although prox_{cf}(x^(k)) can sometimes be difficult to compute, it
greatly simplifies computation for many non-differentiable functions f

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 77 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Example: Characteristic Functions

    h(β) = 0 if β ∈ C,  ∞ otherwise

prox_h(b) = arg min_{β∈C} ∥β − b∥² is the Euclidean projection onto C

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 78 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Example: L1 norm

h(x) = ∥x∥₁, then

Prox_{λh}(x) = S(x; λ) is the soft-thresholding operator

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 79 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Projected Gradient Descent

Projected gradient:

    x^(k+1) = arg min_x { g(x^(k)) + ⟨∇g(x^(k)), x − x^(k)⟩ + (1/2c)∥x − x^(k)∥² + 1_C(x) }
            = arg min_x { (1/2)∥x − (x^(k) − c∇g(x^(k)))∥² + c·1_C(x) }
            = P_C(x^(k) − c∇g(x^(k)))

This is exactly projected gradient descent

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 80 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Example: Gradient Projection Method

A special case: h(x) = 1_C(x) = 0 if x ∈ C, ∞ otherwise, where C is a set
The algorithm becomes x^(k+1) = P_C(x^(k) − t∇g(x^(k))), as in the sketch below
P_C is the projection onto the set C
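A minimal sketch of projected gradient descent for a box constraint C = {x : lo ≤ x ≤ hi}, where the projection is just clipping (an illustrative choice of C, not from the slides):

import numpy as np

def projected_gradient_descent(grad_g, project, x0, t=0.1, n_iter=500):
    x = x0.copy()
    for _ in range(n_iter):
        x = project(x - t * grad_g(x))     # x^(k+1) = P_C(x^(k) - t grad g(x^(k)))
    return x

# Example: minimize (1/2)||x - c||^2 over the box [0, 1]^3
c = np.array([1.5, -0.3, 0.4])
grad_g = lambda x: x - c
project = lambda x: np.clip(x, 0.0, 1.0)
x_star = projected_gradient_descent(grad_g, project, np.zeros(3))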

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 81 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Important Proximity Operators

h(x) = 0: prox_h(x) = x

h(x) = indicator function of a closed convex set C: prox_h is the projection onto C,

    prox_h(x) = arg min_{u∈C} ∥u − x∥₂² = P_C(x)

L1 norm: h(x) = τ∥x∥₁, prox_h(y) = S_τ(y) := sign(y) max(|y| − τ, 0)

Squared Euclidean norm: h(x) = (τ/2)∥x∥₂², prox_h(y) = y/(1 + τ)
Nuclear norm: h(X) = τ∥X∥_*, with Y = UΛV^T the SVD,

    prox_h(Y) = U S_τ(Λ) V^T
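Minimal NumPy sketches of these proximity operators (the nuclear-norm prox assumes a full SVD of Y is affordable):

import numpy as np

def prox_l1(y, tau):                 # prox of tau*||x||_1: soft thresholding
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

def prox_sq_l2(y, tau):              # prox of (tau/2)*||x||_2^2: shrinkage y/(1+tau)
    return y / (1.0 + tau)

def prox_nuclear(Y, tau):            # prox of tau*||X||_*: soft-threshold the singular values
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt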

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 82 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Decomposable Functions

Unconstrained optimization with the objective split into two components

    minimize_x f(x) = g(x) + h(x)

g is convex and differentiable, dom(g) = R^n
h is convex, not necessarily differentiable, with an inexpensive
prox-operator
We could use the subgradient method to solve it
Are there more scalable ways for large-scale optimization?

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 83 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Assumptions on g

Lower bound (convexity of g): there exists m ≥ 0 such that

    g(x) ≥ g(u) + ∇g(u)^T(x − u) + (m/2)∥x − u∥₂²

m = 0: convexity
m > 0: strong convexity
Upper bound (Lipschitz continuity of the gradient):

    g(x) ≤ g(u) + ∇g(u)^T(x − u) + (L/2)∥x − u∥₂²

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 84 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Assumptions on g

    prox_{cf}(x^(k)) = arg min_x (1/2c)∥x − x^(k)∥₂² + f(x)
                       (approximate x^(k) with x)    (penalty)

Instead of minimizing g(x) + h(x) directly, minimize its quadratic upper bound:

    x^(k+1) = arg min_x g(x^(k)) + ∇g(x^(k))^T(x − x^(k)) + (L/2)∥x − x^(k)∥₂² + h(x)
            = arg min_x (L/2)∥x − (x^(k) − (1/L)∇g(x^(k)))∥₂² + h(x)
            = prox_{ch}(x^(k) − c∇g(x^(k)))

where c = 1/L is the step size


Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 85 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Basic Proximal Gradient Algorithm

Proximal gradient algorithm to solve

    min_x f(x) = min_x (g(x) + h(x))

    x^(k+1) = prox_{ch}(x^(k) − c∇g(x^(k)))

Convergence rate: O(1/k) on function value
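A minimal sketch of this iteration in generic form (grad_g and prox_h supplied by the caller; shown on the lasso, where the prox of c·γ∥x∥₁ is soft thresholding at level c·γ):

import numpy as np

def proximal_gradient(grad_g, prox_h, x0, c, n_iter=500):
    x = x0.copy()
    for _ in range(n_iter):
        x = prox_h(x - c * grad_g(x), c)   # x^(k+1) = prox_{ch}(x^(k) - c grad g(x^(k)))
    return x

# Lasso example: g(x) = (1/2)||Ax - y||^2, h(x) = gamma*||x||_1, c = 1/L with L = ||A||_2^2
def run_lasso(A, y, gamma):
    L = np.linalg.norm(A, 2) ** 2
    grad_g = lambda x: A.T @ (A @ x - y)
    prox_h = lambda z, c: np.sign(z) * np.maximum(np.abs(z) - c * gamma, 0.0)
    return proximal_gradient(grad_g, prox_h, np.zeros(A.shape[1]), c=1.0 / L)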

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 86 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Accelerated Proximal Gradient

Initialize: choose α ≤ 1/L, set y₁ = x₀, t₁ = 1

Iterate:
    x_k ← prox_{αh}(y_k − α∇g(y_k))
    t_{k+1} ← (1 + √(1 + 4t_k²)) / 2
    y_{k+1} ← x_k + ((t_k − 1)/t_{k+1}) (x_k − x_{k−1})
Faster O(1/k²) convergence rate (see the sketch below)
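A minimal sketch of this accelerated (FISTA-style) iteration, written with the g + h notation of this section (assumes ∇g is L-Lipschitz and alpha ≤ 1/L):

import numpy as np

def accelerated_proximal_gradient(grad_g, prox_h, x0, alpha, n_iter=500):
    x_prev = x0.copy()
    y = x0.copy()
    t = 1.0
    for _ in range(n_iter):
        x = prox_h(y - alpha * grad_g(y), alpha)        # proximal gradient step at y_k
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x + ((t - 1.0) / t_next) * (x - x_prev)     # momentum/extrapolation step
        x_prev, t = x, t_next
    return x_prev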

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 87 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Convergence Rate

                      Convergence Rate    Iteration Complexity
PG: Convex            O(1/k)              O(1/ε)
PG: Strongly Convex   O((1 − m/L)^k)      O((L/m) log(1/ε))
APG: Convex           O(1/k²)             O(1/√ε)

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 88 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Example: Least Squares

Suppose we want to solve the standard least-squares problem

    minimize_{x∈R^N} ∥y − Ax∥₂².

When A has full column rank, we know that the solution is given by
    x̂_ls = (A^T A)^{−1} A^T y.
However, we also know that when AT A is not well-conditioned, this
inverse can be unstable to compute, and iterative descent methods
(gradient descent and conjugate gradients) can take many iterations
to converge.

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 89 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Proximal Operator for Least Squares

Consider the proximal point iteration (with fixed α_k = α) for solving this problem:

    x^{k+1} = arg min_{x∈R^N} (1/2)∥y − Ax∥₂² + (1/2α)∥x − x^k∥₂².

Here we have the closed-form solution

    x^{k+1} = (A^T A + δI)^{−1} (A^T y + δx^k),   δ = 1/α
            = x^k + (A^T A + δI)^{−1} A^T (y − Ax^k).

This is a well-known technique in numerical linear algebra called "iterative refinement".
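A minimal sketch of this proximal point (iterative refinement) update for least squares (assumes δ > 0 is chosen by the user; the regularized Gram matrix is formed once and reused):

import numpy as np

def proximal_point_ls(A, y, delta=1.0, n_iter=50):
    n, p = A.shape
    x = np.zeros(p)
    G = A.T @ A + delta * np.eye(p)                     # A^T A + delta*I, formed once
    for _ in range(n_iter):
        x = x + np.linalg.solve(G, A.T @ (y - A @ x))   # x^{k+1} = x^k + (A^T A + delta I)^{-1} A^T (y - A x^k)
    return x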

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 90 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Example 1: LASSO

For lasso:
    g(x) = (1/2)∥Ax − y∥²,   h(x) = γ∥x∥₁

We know that ∇g(x) = A^T(Ax − y) and prox_{γ∥·∥₁}(x) = S_γ(x),
where S_γ(x) = sign(x)(|x| − γ)₊ is the soft-thresholding operator

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 91 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Example 1: LASSO

In each iteration

    x^(k+1) = S_{tγ}(x^(k) − t A^T(Ax^(k) − y))

Often called the iterative soft-thresholding algorithm (ISTA). A very
simple algorithm to compute a lasso solution
Complexity in general: O(np) per iteration

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 92 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Convergence

Note: APG is not a descent method

Acceleration can really help!

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 93 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Example 2: Stochastic PCA

Goal: find the optimal PC direction w such that the variance is maximized

    min_{∥w∥=1} − Σ_{i=1}^n w^T x_i x_i^T w

Initially pick a random unit norm vector w0


In the t-th iteration:
SGD: perform wt = wt−1 + ηxt xtT wt−1
Projection: wt = wt /∥wt ∥

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 94 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Example 3: L∞ Regression

For L∞ regression:
    g(x) = (1/2)∥Ax − y∥²,   h(x) = γ∥x∥∞

We know that ∇g(x) = A^T(Ax − y), and by Moreau decomposition
prox_{γ∥·∥∞}(x) = x − Π_{∥·∥₁ ≤ γ}(x), where Π_{∥·∥₁ ≤ γ} is the projection onto the ℓ₁ ball of radius γ

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 95 / 96
Proximal Gradient: Handling Constraints and Regularization Proximal Gradient

Example: Proximal Policy Optimization

Hao Yan Scalable Algorithms for Structured Functions February 29, 2024 96 / 96
