Optimization for Machine Learning
Lecture Notes CS-439, Spring 2025
Bernd Gärtner, ETH
Martin Jaggi, EPFL
February 17, 2025
Contents
1 Theory of Convex Functions
2 Gradient Descent
3 Projected and Proximal Gradient Descent
4 Subgradient Descent
5 Stochastic Gradient Descent
6 Nonconvex Functions
7 Newton’s Method
8 Quasi-Newton Methods
9 Coordinate Descent
10 The Frank-Wolfe Algorithm
Chapter 1
Theory of Convex Functions
Contents
1.1 Mathematical Background
    1.1.1 Notation
    1.1.2 The Cauchy-Schwarz inequality
    1.1.3 The spectral norm
    1.1.4 The mean value theorem
    1.1.5 The fundamental theorem of calculus
    1.1.6 Differentiability
1.2 Convex sets
    1.2.1 The mean value inequality
1.3 Convex functions
    1.3.1 First-order characterization of convexity
    1.3.2 Second-order characterization of convexity
    1.3.3 Operations that preserve convexity
1.4 Minimizing convex functions
    1.4.1 Strictly convex functions
    1.4.2 Example: Least squares
    1.4.3 Constrained minimization
1.5 Existence of a minimizer
    1.5.1 Sublevel sets and the Weierstrass Theorem
1.6 Examples
    1.6.1 Handwritten digit recognition
    1.6.2 Master’s Admission