What are Adam and AdamW?

Adam (short for adaptive moment estimation) was proposed by Kingma & Ba (2014) (Reference 1) as a first-order gradient-based optimization method that combines the advantages of AdaGrad (Duchi et al. 2011, Reference 2) (ability to deal with sparse gradients) and RMSProp (Tieleman & Hinton 2012, Reference 3) (ability to deal with non-stationary objectives). While considered an old method by now, it’s still widely used as a baseline algorithm to compare more cutting-edge algorithms against.

According to the authors,

“Some of Adam’s advantages are that the magnitudes of parameter updates are invariant to rescaling of the gradient, its stepsizes are approximately bounded by the stepsize hyperparameter, it does not require a stationary objective, it works with sparse gradients, and it naturally performs a form of step size annealing.”

Notation

  • f : \mathbb{R}^d \mapsto \mathbb{R} is a noisy objective function that is differentiable w.r.t. its parameter \theta. We want to minimize \mathbb{E}[f(\theta)].
  • f_1(\theta), \dots, f_T(\theta) are stochastic realizations of f at timesteps 1, \dots, T.
  • Let g_t = \nabla_\theta f_t (\theta) denote the gradient of f_t w.r.t. \theta.

Adam

At each timestep t, Adam keeps track of \hat{m}_t, an estimate of the first moment of the gradient g_t, and \hat{v}_t, an estimate of the second raw moment (uncentered variance) of the gradient g_t^2 = g_t \odot g_t (\odot represents element-wise multiplication). The Adam update is

\theta_t \leftarrow \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon),

where \alpha and \epsilon are hyperparameters. The full description of Adam is in the Algorithm below:

Adam algorithm (Algorithm 1 of Kingma & Ba).
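To make the update concrete, here is a minimal NumPy sketch of Algorithm 1 (the function name adam_step and the toy objective are my own; the hyperparameter defaults are the paper's suggested values):

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (Algorithm 1 of Kingma & Ba); returns updated (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * g          # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * g * g      # biased second raw-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# sanity check on f(theta) = ||theta||^2 / 2, whose gradient is theta itself
theta = np.array([1.0, -2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, theta, m, v, t, alpha=0.01)
print(np.linalg.norm(theta))
```

The bias-correction terms 1 - \beta_1^t and 1 - \beta_2^t matter most in early iterations, when the moment estimates are still close to their zero initialization.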

Comparison with other methods

In comparison with Adam, the vanilla SGD update is

\theta_t \leftarrow \theta_{t-1} - \alpha \cdot g_t ,

and the classical momentum-based SGD update is

\theta_t \leftarrow \theta_{t-1} - \alpha \cdot \hat{m}_t.

The basic version of AdaGrad has update rule

\begin{aligned} \theta_t \leftarrow \theta_{t-1} - \alpha \cdot g_t / \sqrt{\sum_{i=1}^t g_i^2}. \end{aligned}

AdaGrad corresponds to Adam with \beta_1 = 0, infinitesimal (1-\beta_2), and replacing \alpha with \alpha_t = \alpha \cdot t^{-1/2} (see Section 5 of Reference 1 for a few more details).

AdamW

AdamW, proposed by Loshchilov & Hutter 2019 (Reference 4), is a simple modification of Adam which “improves regularization in Adam by decoupling the weight decay from the gradient-based update”.

The original weight decay update was of the form

\begin{aligned} \theta_{t+1} = (1-\lambda) \theta_t - \alpha g_t, \end{aligned}

where \lambda defines the rate of weight decay per step. Loshchilov & Hutter (2019) note that for standard SGD, this weight decay is equivalent to L2 regularization (Proposition 1 of Reference 4), but the two are not equivalent for the Adam update. They present Adam with L2 regularization and Adam with decoupled weight decay (AdamW) together in Algorithm 2 of the paper:

Overall, the authors found that AdamW performed better than Adam with L2 regularization.
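As a sketch of what the decoupling looks like, here is a hypothetical adamw_step in NumPy (following the parameterization where the decay is multiplied by the step size, as PyTorch does; all names are my own):

```python
import numpy as np

def adamw_step(theta, g, m, v, t, alpha=0.1, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.5):
    """One AdamW update: the weight-decay term acts on theta directly,
    rather than being folded into the gradient (the 'decoupling')."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v

# with a zero gradient, the update is pure decay: theta shrinks by alpha * weight_decay
theta = np.ones(3)
theta, m, v = adamw_step(theta, np.zeros(3), np.zeros(3), np.zeros(3), t=1)
print(theta)   # each coordinate is 1 - 0.1 * 0.5 = 0.95
```

The zero-gradient case highlights the difference: Adam with L2 regularization would pass the decay term through the adaptive rescaling, while AdamW applies it to the weights directly.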

PyTorch’s implementation of AdamW can be found here.

References:

  1. Kingma, D. P., and Ba, J. (2014). Adam: A Method for Stochastic Optimization.
  2. Duchi, J., et al. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.
  3. Tieleman, T., and Hinton, G. (2012). Lecture 6.5 – RMSProp, Coursera: Neural Networks for Machine Learning. (If someone has a link/PDF for the slides, I’d be grateful!)
  4. Loshchilov, I., and Hutter, F. (2019). Decoupled Weight Decay Regularization.

What is local SGD?

Introduction

In many supervised learning problems, we have data \mathcal{D} = \{ (x_i, y_i) \}_{i=1}^n, and we want to find the parameter w \in \mathbb{R}^d that minimizes the empirical loss:

\begin{aligned} \min_{w \in \mathbb{R}^d} \: f(w), \quad \text{where} \quad f(w) = \frac{1}{n}\sum_{i=1}^n \ell(x_i, y_i; w), \end{aligned}

where \ell is a loss function (e.g. cross-entropy loss). A common way to perform the minimization is with stochastic gradient descent (SGD). At each iteration, we sample a minibatch \mathcal{B} of data of size b \ll n and take a step in the direction of the negative gradient of the minibatch loss:

\begin{aligned} w \leftarrow w - \eta \cdot \dfrac{1}{b}\sum_{(x_i, y_i) \in \mathcal{B}} \nabla  \ell(x_i, y_i; w). \end{aligned}

SGD for federated learning

Assume that we are in the federated learning setting where we are trying to train a model collaboratively across several machines without sharing data across devices. Our data is partitioned across K clients, with \mathcal{D}_k being the partition on client k with n_k data points. How might one do SGD in this setting?

A naive implementation would be the following: in each round of the algorithm, randomly select one machine (“worker”). Send the current model parameters from the central machine (“driver”) to the worker, perform one SGD step, then return the updated parameters to the driver. This is wasteful because all the other clients are idle.

Large-batch synchronous SGD makes use of the other clients in the following way: let C \in (0, 1] be some fraction controlling the global batch size. In each iteration,

  1. Randomly select C-fraction of the workers and send them the current model parameters.
  2. Each worker k computes the gradient of the loss over all its data (call the gradient g_k) and sends it to the driver.
  3. The driver updates the global parameter: w \leftarrow w - \eta \sum_{k=1}^K \frac{n_k}{n}g_k, where n_k is the number of data points for worker k and g_k = 0 if worker k wasn’t selected.

In the above, the implied batch size for SGD is the total number of data points held in the CK machines chosen for that iteration. We could also compute the gradient of the loss over just B data points in each machine, giving an effective batch size of CKB.

The main issue with large-batch synchronous SGD is the communication cost, which dominates computational cost in the federated learning setting. Gradients/parameters are sent to and from the driver every iteration, as in the figure below:

Synchronous distributed SGD. Thick red lines represent larger communication cost than the thin black lines.

Local SGD

The idea behind local SGD is simple: Instead of each worker taking just one step in parameter space before returning to the driver, let’s have each worker take E steps in parameter space before returning. Here is what local SGD looks like as a diagram:

Synchronous local SGD where we do 4 local updates before syncing with the driver. Thick red lines represent larger communication cost than the thin black lines.

What are we trading off for reduced communication cost? The local updates in each worker are based on parameter values that are not quite the right ones. For example, consider the second parameter update that worker 1 does.

  • In synchronous distributed SGD, this update is based on w_1, the parameter after taking into account the SGD updates from all workers from iteration 1.
  • In synchronous local SGD, this update is based on w_1^{(1)}, the parameter after taking into account the SGD update from just worker 1 from inner iteration 1.

Local SGD makes a bet that even though our subsequent local steps are based on a different parameter value from synchronous distributed SGD, these parameter values are a good approximation to the parameter value we would have obtained from synchronous distributed SGD.

Here is the local SGD algorithm as presented in McMahan et al. (2017):

Local SGD algorithm as in McMahan et al. (2017). E=1 corresponds to synchronous distributed SGD.

We still consider this algorithm synchronous because in the outer loop, the driver still waits for all workers to return their updates before aggregating and computing the global update. There are asynchronous versions of local SGD; see, for example, Liu et al. (2024) (Reference 2).
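The synchronous local SGD loop can be sketched in a few lines of NumPy. This is a toy simulation on a least-squares problem (the sequential worker loop, the equal-sized partitions, and all names are my own simplifying assumptions; with unequal n_k the driver would use the weighted average from the algorithm above):

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic least-squares data, partitioned equally across K workers
K, n_k, d = 4, 50, 3
w_true = rng.normal(size=d)
X = [rng.normal(size=(n_k, d)) for _ in range(K)]
y = [Xk @ w_true for Xk in X]

def local_grad(w, Xk, yk):
    # gradient of the local least-squares loss on worker k
    return Xk.T @ (Xk @ w - yk) / len(yk)

w = np.zeros(d)          # global parameter held by the driver
E, rounds, eta = 4, 100, 0.1
for _ in range(rounds):
    local_params = []
    for k in range(K):               # in practice each worker runs in parallel
        wk = w.copy()
        for _ in range(E):           # E local steps before syncing
            wk -= eta * local_grad(wk, X[k], y[k])
        local_params.append(wk)
    w = np.mean(local_params, axis=0)   # driver averages the returned parameters

print(np.linalg.norm(w - w_true))
```

In this noiseless example every worker's local minimizer equals w_true, so the averaged iterate converges to it; with heterogeneous data the local minimizers differ, which is exactly where local SGD's bet comes into play.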

References:

  1. McMahan, H. B., et al. (2017). Communication-Efficient Learning of Deep Networks from Decentralized Data.
  2. Liu, B., et al. (2024). Asynchronous Local-SGD Training for Language Modeling.

Example demonstrating why subgradient descent is not a descent method

Assume we have a function f: \mathbb{R}^n \mapsto \mathbb{R} which we wish to minimize. One possible approach is subgradient descent. Recall that a vector g \in \mathbb{R}^n is a subgradient of f at x if

\begin{aligned} f(y) \geq f(x) + g^\top (y - x) \quad \text{for all } y \in \mathbb{R}^n. \end{aligned}

Subgradient descent works as follows: start at some initial point x^{(0)}. At each iteration k, find a subgradient g^{(k)} of f at x^{(k)}, then take a step in the negative subgradient direction:

\begin{aligned} x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)}, \end{aligned}

where \alpha_k is some step size.

In many references, you will find a statement such as the following: “Subgradient descent is not a descent method”. What this means is the following: there exist situations where we always have f(x^{(k+1)}) > f(x^{(k)}), no matter what the step size \alpha_k > 0 is. In contrast, gradient descent is a descent method: we can always find a (small enough) step size such that f(x^{(k+1)}) < f(x^{(k)}) (unless, of course, f(x^{(k)}) is already the minimum).

This is the reason why we can’t use line searches with subgradient descent: in some situations the line search will tell us that we should take a step of zero length.

Here is an example of when this might happen (taken from Reference 1). Consider the function

f(x_1, x_2) = |x_1| + 2|x_2|.

At the point (x_1, x_2) = (1, 0), g = (1, 2) is a subgradient: for any (y_1, y_2),

\begin{aligned} f(x) + g^\top (y - x) &= 1 + 1 (y_1 - 1) + 2 (y_2 - 0) \\  &= y_1 + 2y_2 \\  &\leq f(y). \end{aligned}

However, for any \alpha > 0,

\begin{aligned} f\left( (x_1, x_2) - \alpha g \right) &= f(1 - \alpha, -2\alpha) \\  &= |1 - \alpha| + 2|-2\alpha| \\  &= |1 - \alpha| + 4\alpha \\  &= \begin{cases} 1 + 3\alpha &\text{if } 0 < \alpha \leq 1, \\ 5\alpha - 1 &\text{if } \alpha > 1 \end{cases} \\  &> 1 = f(x_1, x_2). \end{aligned}

What’s happening here? The picture below is a contour plot of the function f(x_1, x_2) = |x_1| + 2|x_2|, where darker blue represents smaller values and green/yellow represents larger values. The dotted black line is the level curve on which the point (1, 0) lies. All points inside the dotted black line have function value less than that at (1, 0), while all points outside have greater function value.

The subgradient (1, 2)^\top is depicted by the blue arrow. The hyperplane corresponding to this subgradient is f(1, 0) + (1,2)^\top (x_1 - 1, x_2 - 0) = x_1 + 2x_2, which is exactly equal to f in the first quadrant and strictly less than f in the other quadrants.

You can see that if we move in the direction of the negative subgradient, we will always end up outside the dotted black line, i.e. the function value would have increased. This phenomenon occurs because (1, 0) happens to be a point of non-differentiability. If we had considered a point at which f was differentiable (e.g. (1, 1)), the negative gradient would have been a descent direction.

While subgradient descent will not descend in the direction of -(1, 2), if we are not at the minimum already, there will be subgradient directions that we can take to descend, e.g. -(1, 0). The hard part is finding such a subgradient direction in general.
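The example is easy to verify numerically; here is a quick NumPy check (nothing more than the computations above):

```python
import numpy as np

f = lambda x: abs(x[0]) + 2 * abs(x[1])
x = np.array([1.0, 0.0])
g = np.array([1.0, 2.0])          # a subgradient of f at (1, 0)

# f increases along -g for every positive step size tried...
alphas = [1e-4, 1e-2, 0.1, 0.5, 1.0, 2.0]
assert all(f(x - a * g) > f(x) for a in alphas)

# ...but -(1, 0), the negative of another subgradient at (1, 0), does descend
d = np.array([1.0, 0.0])
assert f(x - 0.5 * d) < f(x)
print("non-descent along -(1, 2) verified")
```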

References:

  1. Balakrishnan, S. (2023). 10-725: Convex Optimization. Lecture 5: January 31.

Finite-difference stochastic approximation (FDSA) and simultaneous perturbation stochastic approximation (SPSA)

Set-up

Consider the optimization problem

\begin{aligned} \text{minimize}_{\theta} \quad L(\theta), \end{aligned}

where \theta \in \mathbb{R}^p and L: \mathbb{R}^p \mapsto \mathbb{R} is some differentiable function. If we denote the gradient of L by g(\theta) = \partial L / \partial \theta, a natural algorithm for this problem is gradient descent: at each iteration k, set

\begin{aligned} \theta^{(k+1)} = \theta^{(k)} - a_k g(\theta^{(k)}), \end{aligned}

where \{ a_k \} is some sequence of step sizes.

Gradient descent updates involve computing the exact gradient at the iterate \theta^{(k)}. In some cases, this can be expensive in terms of computation or storage: for example, in large neural networks we need to keep some “state” at each node of the network to compute the gradient during the backward pass. In other situations (e.g. black box algorithms), we might not even know how to compute the gradient explicitly. What can we do in these cases?

Intuition

Instead of making updates with the exact gradient g(\cdot), we could update with an approximate gradient \hat{g}(\cdot) instead:

\begin{aligned} \theta^{(k+1)} = \theta^{(k)} - a_k \hat{g}(\theta^{(k)}). \end{aligned}

What might be a good candidate for \hat{g}(\cdot)? Assuming that partial derivatives exist, recall that the ith component of the gradient is defined as

\begin{aligned} [g (\theta)]_i &= \lim_{\Delta \rightarrow 0} \dfrac{L (\theta + \Delta e_i) - L(\theta)}{\Delta} = \lim_{\Delta \rightarrow 0} \dfrac{L (\theta + \Delta e_i) - L(\theta - \Delta e_i)}{2 \Delta}, \end{aligned}

where e_i is the vector which is all zeros except for a 1 as the ith component. Hence, for small values of \Delta we can approximate the gradient as the finite difference

\begin{aligned} [g(\theta)]_i &\approx \dfrac{L (\theta + \Delta e_i) - L(\theta - \Delta e_i)}{2 \Delta}. \end{aligned}

Finite-difference stochastic approximation (FDSA)

Finite-difference stochastic approximation (FDSA) does exactly that. Let \{ c_k \} be some sequence of positive numbers. At iteration k, update \theta as follows:

\begin{aligned} \left[ \hat{g}(\theta^{(k)}) \right]_i &= \dfrac{L (\theta^{(k)} + c_k e_i) - L(\theta^{(k)} - c_k e_i)}{2 c_k} \quad \text{for } i = 1, \dots, p, \\  \theta^{(k+1)} &= \theta^{(k)} - a_k \hat{g}(\theta^{(k)}). \end{aligned}

Instead of evaluating one gradient, we evaluate the original function L 2p times, twice for each coordinate of \theta.
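Here is what the FDSA gradient estimate looks like in NumPy (the function name fdsa_gradient and the toy quadratic are my own; for a quadratic the central difference is exact up to floating-point error, so this is a convenient sanity check):

```python
import numpy as np

def fdsa_gradient(L, theta, c):
    """Central-difference gradient estimate: 2p evaluations of L."""
    p = len(theta)
    g_hat = np.zeros(p)
    for i in range(p):
        e = np.zeros(p)
        e[i] = 1.0          # perturb one coordinate at a time
        g_hat[i] = (L(theta + c * e) - L(theta - c * e)) / (2 * c)
    return g_hat

L = lambda t: np.sum(t ** 2)          # toy objective with known gradient 2 * theta
theta = np.array([1.0, -2.0, 3.0])
print(fdsa_gradient(L, theta, 1e-5))  # approximately [2, -4, 6]
```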

Simultaneous perturbation stochastic approximation (SPSA)

One potential problem with FDSA is that the required number of evaluations of L scales linearly with the number of coordinates of \theta. As the name suggests, simultaneous perturbation stochastic approximation (SPSA) avoids this by perturbing all coordinates simultaneously. Hence, each SPSA update requires just two evaluations of L! (See Spall 1998, Reference 1, for a simple overview of FDSA and SPSA.)

Concretely, at each iteration k, generate a random perturbation vector \Delta^{(k)} \in \mathbb{R}^p, then approximate the gradient by

\begin{aligned} \hat{g}(\theta^{(k)}) = \dfrac{L (\theta^{(k)} + c_k \Delta^{(k)}) - L(\theta^{(k)} - c_k \Delta^{(k)})}{2 c_k} \begin{pmatrix} 1 / \left( \Delta^{(k)} \right)_1 \\ 1 / \left( \Delta^{(k)} \right)_2 \\ \vdots \\ 1 / \left( \Delta^{(k)} \right)_p \end{pmatrix}. \end{aligned}

\Delta^{(k)} \in \mathbb{R}^p can’t come from just any random distribution. For the theory to hold, we need the components of \Delta^{(k)} to be i.i.d., symmetrically distributed about 0, almost surely bounded, and have finite inverse first moment, i.e. \mathbb{E} \left[ \left| 1 / \left( \Delta^{(k)} \right)_i \right| \right] < \infty for i = 1, \dots, p. (At least these are the conditions needed to prove a theorem.) Reference 1 notes that having each \left( \Delta^{(k)} \right)_i follow an independent symmetric Bernoulli \pm 1 distribution works (i.e. equal to 1 or -1 each with probability 0.5), and cautions that uniform and normal distributions do NOT work as they do not have finite inverse first moments.

Theoretical result for SPSA

Spall 1992 (Reference 2) provides a theorem that establishes almost sure convergence of \theta^{(k)} from SPSA to \theta^*, the true minimum of L, under some conditions. A number of the conditions are technical (see Reference 2’s Lemma 1 and Proposition 1); below I provide the ones that are most relevant to practitioners:

  1. The components of \Delta^{(k)} are i.i.d., symmetrically distributed about 0, almost surely bounded, and have finite inverse first moment, i.e. \mathbb{E} \left[ \left| 1 / \left( \Delta^{(k)} \right)_i \right| \right] < \infty for i = 1, \dots, p.
  2. \{ a_k \} and \{ c_k \} are strictly positive sequences such that \lim_{k \rightarrow \infty} a_k = 0, \lim_{k \rightarrow \infty} c_k = 0, \sum_k a_k = \infty, and \sum_k (a_k / c_k)^2 < \infty.

Proposition 2 of Reference 2 proves an asymptotic normality result for \theta^{(k)} under stronger conditions.

Implementation notes

Spall (1998) (Reference 3) provides some pointers for using SPSA in practice.

  • Spall notes that the choice of the sequences \{ a_k \} and \{ c_k \} is critical to SPSA’s performance. Spall suggests a_k = a / (A + k + 1)^\alpha and c_k = c/(k+1)^\gamma, where a, c, A, \alpha, \gamma are non-negative. He mentions \alpha = 0.602 and \gamma = 0.101 as practically effective values (lowest allowable satisfying the conditions for asymptotic normality), and provides some guidance for determining the other parameters in Section IIIB of the paper.
  • For the perturbation sequence, Spall suggests having each \left( \Delta^{(k)} \right)_i follow an independent symmetric Bernoulli \pm 1 distribution.
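Putting the pieces together, here is a hedged NumPy sketch of SPSA with gain sequences of Spall's suggested form (the constants a_0, c_0, A and the toy quadratic are my own arbitrary choices for illustration, not tuned recommendations):

```python
import numpy as np

rng = np.random.default_rng(0)

def spsa_gradient(L, theta, c):
    """SPSA gradient estimate: two evaluations of L, regardless of dimension."""
    delta = rng.choice([-1.0, 1.0], size=len(theta))   # symmetric Bernoulli +/- 1
    diff = L(theta + c * delta) - L(theta - c * delta)
    return diff / (2 * c) / delta                      # componentwise 1 / delta_i

L = lambda t: np.sum((t - 1.0) ** 2)    # toy objective, minimized at the all-ones vector
theta = np.zeros(10)
a0, c0, A, alpha, gamma = 0.5, 0.1, 10.0, 0.602, 0.101
for k in range(2000):
    a_k = a0 / (A + k + 1) ** alpha     # Spall's suggested gain sequences
    c_k = c0 / (k + 1) ** gamma
    theta -= a_k * spsa_gradient(L, theta, c_k)
print(np.linalg.norm(theta - 1.0))      # should be small
```

Each iteration costs two function evaluations no matter how large p is, whereas the FDSA sketch above would need 2p = 20 evaluations per iteration here.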

References:

  1. Spall, J. C. (1998). An Overview of the Simultaneous Perturbation Method for Efficient Optimization.
  2. Spall, J. C. (1992). Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient Approximation.
  3. Spall, J. C. (1998). Implementation of the Simultaneous Perturbation Algorithm for Stochastic Optimization.

Danskin’s theorem as applied to the dual function

I introduced Danskin’s theorem in an earlier post. Danskin’s theorem can be applied in a fairly general context in optimization, which I outline here. (This post is essentially Danskin’s theorem for a particular function, with notation that is different from the earlier post.)

Consider the optimization problem

\begin{aligned} \text{maximize}_x \quad & f_0(x) \\  \text{subject to} \quad & f_j (x) \leq 0 \text{ for } j = 1, \dots, m, \\  & x \in \mathcal{X}, \end{aligned}

where \mathcal{X} \subseteq \mathbb{R}^n is a compact set. The Lagrangian L: \mathcal{X} \times \mathbb{R}^m \mapsto \mathbb{R} is defined as

\begin{aligned} L(x, \lambda) &= f_0(x) - \sum_{j=1}^m \lambda_j f_j(x), \end{aligned}

and the dual function g: \mathbb{R}^m \mapsto \mathbb{R} is defined as

\begin{aligned} g(\lambda) = \max_{x \in \mathcal{X}} \; L(x, \lambda) = \max_{x \in \mathcal{X}} \; \left\{ f_0(x) - \sum_{j=1}^m \lambda_j f_j(x) \right\}. \end{aligned}

The dual problem requires us to minimize the dual function subject to \lambda \geq 0. One common way to do that is subgradient descent: compute a subgradient of the dual from \partial g(\lambda) and take a step in the opposite direction. Danskin’s theorem gives us a way to compute \partial g(\lambda).

Assume that f_0, f_1, \dots, f_m are continuous functions. It follows that L is continuous in (x, \lambda). Note also that L is linear, and hence convex, in \lambda. With these assumptions, the first two parts of Danskin’s theorem say:

Part 1: The function g is convex.

Part 2: The function g has a directional semi-differential. The semi-differential of g at the point \lambda in the direction y, denoted \partial_y g(\lambda), is given by

\begin{aligned} \partial_y g(\lambda) = \max \left\{ L' (x, \lambda; y): x \in \text{argmax}_{x \in \mathcal{X}} L(x, \lambda) \right\} ,\end{aligned}

where L' (x, \lambda; y) is the directional derivative of the function L(x, \cdot) at \lambda in the direction y.

If we make more assumptions on the set-up, we can compute the subgradient \partial g(\lambda) we need for dual optimization. (Note: The assumptions made in Part 3 are not carried over as assumptions in Part 4.)

Part 3: Assume that for some \lambda, \hat{x} = \text{argmax}_{x \in \mathcal{X}} L(x, \lambda) is unique. It is clear that \lambda \mapsto L(\hat{x}, \lambda) is differentiable at \lambda. Thus, g is differentiable at \lambda, and

\begin{aligned} \frac{\partial g}{\partial \lambda} = \frac{\partial L(\hat{x}, \lambda)}{\partial \lambda} = -\begin{pmatrix} f_1 (\hat{x}) \\ \vdots \\ f_m (\hat{x}) \end{pmatrix} .\end{aligned}

You will recognize that vector (without the negative sign out front) as the LHS of the constraints evaluated at \hat{x}.

Part 4: Note that \lambda \mapsto L(x, \lambda) is differentiable for all x \in \mathcal{X}, and that the derivative \partial L / \partial \lambda = -\begin{pmatrix} f_1 (x) & \dots & f_m(x) \end{pmatrix}^\top is continuous w.r.t. \lambda for all \lambda. Hence, the subgradient of g is given by

\begin{aligned} \partial g(\lambda) &= \text{conv} \left\{ \frac{\partial L(\hat{x}, \lambda)}{\partial \lambda}: \hat{x} \in \text{argmax}_{x \in \mathcal{X}} L(x, \lambda) \right\} \\  &= \text{conv} \left\{ - \begin{pmatrix} f_1 (\hat{x}) \\ \vdots \\ f_m(\hat{x}) \end{pmatrix} : \hat{x} \in \text{argmax}_{x \in \mathcal{X}} L(x, \lambda) \right\}, \end{aligned}

where \text{conv} is the convex hull operation.
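As a sanity check, we can verify Part 3 numerically on a toy problem where \mathcal{X} is a finite grid (a finite set is compact, so the theorem applies); the problem and all names below are my own:

```python
import numpy as np

# toy problem: maximize f0(x) subject to f1(x) <= 0 over a compact grid X
X = np.linspace(-3.0, 3.0, 601)
f0 = lambda x: -(x - 2.0) ** 2
f1 = lambda x: x - 1.0

def dual(lam):
    # g(lambda) = max_{x in X} L(x, lambda), returned together with the maximizer
    vals = f0(X) - lam * f1(X)
    i = np.argmax(vals)
    return vals[i], X[i]

lam = 0.5
g_lam, x_hat = dual(lam)

# Part 3: when the maximizer is unique, dg/dlambda = -f1(x_hat)
h = 1e-6
num_grad = (dual(lam + h)[0] - dual(lam - h)[0]) / (2 * h)
print(num_grad, -f1(x_hat))   # the two should (approximately) agree
```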

More details on sharpness-aware minimization (SAM)

Introduction

In this previous post, we discussed the intuition behind sharpness-aware minimization (SAM). In this post, we describe how SAM is implemented.

Consider the general supervised learning set-up, and assume that we have a class of models parameterized by w \in \mathcal{W} \subseteq \mathbb{R}^d. If we have a dataset (x_1, y_1), \dots, (x_n, y_n), a common approach is to select w by minimizing the empirical loss

\begin{aligned} L_S(w) = \frac{1}{n}\sum_{i=1}^n l(w, x_i, y_i), \end{aligned}

where l is the loss incurred for a single data point. Instead of minimizing the empirical loss, sharpness-aware minimization (SAM) (Foret et al. 2020, Reference 1) solves the problem

\begin{aligned} \hat{w} &= \text{argmin}_{w \in \mathcal{W}} \left\{ L_S^{SAM}(w) + \lambda \| w \|_2^2 \right\} \\ &= \text{argmin}_{w \in \mathcal{W}} \left\{ \max_{\| \epsilon \|_p \leq \rho} L_S(w + \epsilon) + \lambda \| w \|_2^2 \right\}, \end{aligned}

where \rho, \lambda \geq 0 are hyperparameters and \| \cdot \|_p is the Lp-norm, with p typically chosen to be 2.

We could try to solve this problem using first-order methods, e.g. gradient descent:

\begin{aligned} w^{(k+1)} \leftarrow w^{(k)} - \eta_k \left[ \nabla_w L_S^{SAM}(w^{(k)}) + 2 \lambda w^{(k)} \right] \end{aligned}

for some step size \eta_k. However, we need some tricks to approximate \nabla_w L_S^{SAM}(w) efficiently.

Simplification 1: Taylor expansion

The main difficulty in computing \nabla_w L_S^{SAM}(w) is that we need to solve the maximization problem before taking a gradient w.r.t. w. Let’s approximate the objective function in the maximization problem with a first-order Taylor expansion. If we define \epsilon^*(w) = \text{argmax}_{\| \epsilon \|_p \leq \rho} L_S(w + \epsilon), then

\begin{aligned} \epsilon^*(w) \approx \underset{\| \epsilon \|_p \leq \rho}{\text{argmax}} \left\{ L_S(w) + \epsilon^\top \nabla_w L_S(w) \right\} = \underset{\| \epsilon \|_p \leq \rho}{\text{argmax}} \; \epsilon^\top \nabla_w L_S(w). \end{aligned}

Defining the last quantity as \hat{\epsilon}(w), we recognize it as the solution to the classical dual norm problem:

\begin{aligned} \hat{\epsilon}(w) = \rho \cdot \text{sign} (\nabla_w L_S(w)) |\nabla_w L_S(w)|^{q-1} / \left( \| \nabla_w L_S(w) \|_q^q \right)^{1/p}, \end{aligned}

where 1/p + 1/q = 1. With this expression,

\begin{aligned} L_S^{SAM}(w) &\approx L_S(w + \hat{\epsilon}(w)), \\  \nabla_w L_S^{SAM}(w) &\approx \nabla_w L_S(w + \hat{\epsilon}(w)) \\  &= \nabla_w L_S(w) |_{w + \hat\epsilon(w)} \cdot \dfrac{d(w + \hat\epsilon(w))}{dw} \\  &= \nabla_w L_S(w) |_{w + \hat\epsilon(w)} + \nabla_w L_S(w) |_{w + \hat\epsilon(w)} \cdot \dfrac{d\hat\epsilon(w)}{dw}. \end{aligned}

Simplification 2: Drop second-order terms

Since \hat{\epsilon}(w) is a function of \nabla_w L_S(w), computing \dfrac{d\hat\epsilon(w)}{dw} implicitly depends on the Hessian of L_S(w). While Reference 1 notes that such a computation can be done, we can just drop this term to accelerate computation. This gives us our final gradient approximation which we use in practice:

\begin{aligned} \nabla_w L_S^{SAM}(w) \approx \nabla_w L_S(w) |_{w + \hat\epsilon(w)}. \end{aligned}

In other words, the gradient of the SAM loss at w is approximately the gradient of the empirical loss at w + \hat\epsilon(w), a point that is distance \rho away from w in a specially chosen direction.
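Here is a minimal NumPy sketch of the resulting update for p = 2, where \hat{\epsilon}(w) = \rho \nabla_w L_S(w) / \| \nabla_w L_S(w) \|_2 (the function name sam_step, the toy loss, and the hyperparameter values are my own):

```python
import numpy as np

def sam_step(w, grad_fn, eta=0.1, rho=0.05):
    """One SAM update with p = 2, so eps_hat = rho * g / ||g||_2."""
    g = grad_fn(w)
    eps_hat = rho * g / (np.linalg.norm(g) + 1e-12)  # ascend to the worst-case point
    g_sam = grad_fn(w + eps_hat)                     # empirical-loss gradient there
    return w - eta * g_sam

# toy loss L_S(w) = ||w||^2 / 2, whose gradient is w itself
grad_fn = lambda w: w
w = np.array([3.0, -4.0])
for _ in range(200):
    w = sam_step(w, grad_fn)
print(np.linalg.norm(w))
```

Note that each SAM step costs two gradient evaluations: one at w to find \hat{\epsilon}(w), and one at w + \hat{\epsilon}(w).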

References:

  1. Foret, P., et al. (2020). Sharpness-Aware Minimization for Efficiently Improving Generalization.

Solving the dual norm problem

Let’s say we are in \mathbb{R}^d. Consider the maximization problem

\begin{aligned} \max_{\| x \|_p \leq \rho} c^\top x, \end{aligned}

where c \in \mathbb{R}^d is given and \| \cdot \|_p is the Lp-norm. (We may assume that c \neq 0.) The problem can be solved by applying Hölder’s inequality directly:

\begin{aligned} c^\top x &\leq \| c \|_q \| x \|_p \leq \| c \|_q \rho, \end{aligned}

where 1/p + 1/q = 1.

When does equality hold? For the first inequality, equality holds if the vectors \text{sign}(c) |c|^q and \text{sign}(x) |x|^p are pointing in the same direction, i.e.

\begin{aligned} x = k \cdot \text{sign}(c) |c|^{q/p} = k \cdot \text{sign}(c) |c|^{q-1}\end{aligned}

for some constant k. For the second inequality, equality holds if

\begin{aligned} \|x\|_p &= \rho, \\  \left\| k \cdot \text{sign}(c) |c|^{q/p} \right\|_p &= \rho, \\  k \left( \| c\|_q^q \right)^{1/p} &= \rho, \\  k &= \rho / \left( \| c\|_q^q \right)^{1/p}. \end{aligned}

Hence, the original maximization problem is maximized at x = \rho \cdot \text{sign}(c) |c|^{q-1} / \left( \| c\|_q^q \right)^{1/p}.
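A quick NumPy check that this x satisfies the constraint with equality and attains the Hölder bound \rho \|c\|_q (the choice p = 3 and the random c are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
c = rng.normal(size=5)
p, rho = 3.0, 2.0
q = p / (p - 1.0)                   # conjugate exponent: 1/p + 1/q = 1

# the claimed maximizer
x_star = (rho * np.sign(c) * np.abs(c) ** (q - 1)
          / np.sum(np.abs(c) ** q) ** (1 / p))

norm_p = np.sum(np.abs(x_star) ** p) ** (1 / p)   # should equal rho
value = c @ x_star                                 # should equal rho * ||c||_q
print(norm_p, value, rho * np.sum(np.abs(c) ** q) ** (1 / q))
```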

PDHG for solving the saddle point formulation of linear programs (LPs)

Saddle point formulation of linear programs (LPs)

Consider the linear program of the form

\begin{aligned} \text{minimize}_x \quad & c^\top x \\  \text{subject to} \quad& Ax \leq b, \end{aligned}

where c, x \in \mathbb{R}^n, A \in \mathbb{R}^{m \times n}, and b \in \mathbb{R}^m. The Lagrangian for this problem is

L(x,y) = c^\top x + y^\top (Ax - b) = y^\top Ax + c^\top x - b^\top y,

where the domain of L is (x, y) \in \mathbb{R}^n \times \mathbb{R}_+^m (elements of y must be non-negative).

In this previous post, we introduced the saddle point theorem for convex optimization. Here is that same saddle point theorem, but for LPs:

Theorem. If (x^*, y^*) \in \mathbb{R}^n \times \mathbb{R}_+^m is a saddle point for the Lagrangian L, then x^* solves the primal problem. Conversely, if x^* is a finite solution to the primal problem, then there is a y^* \in \mathbb{R}_+^m such that (x^*, y^*) is a saddle point for L.

Finding the saddle point of an LP Lagrangian can be written as either of the two (equivalent) formulations:

\begin{aligned} &\text{Primal Formulation:} && \min_{x \in \mathbb{R}^n} \max_{y \in \mathbb{R}_+^m} \left[ y^\top Ax + c^\top x - b^\top y \right], \\  &\text{Dual Formulation:} && \max_{y \in \mathbb{R}_+^m} \min_{x \in \mathbb{R}^n} \left[ y^\top Ax + c^\top x - b^\top y \right].  \end{aligned}

Primal-Dual Hybrid Gradient (PDHG) method

The saddle point formulation of the LP allows us to use primal-dual methods to solve the problem. One such method is called the Primal-Dual Hybrid Gradient (PDHG) method introduced by Chambolle & Pock (2011) (Reference 1). PDHG solves the more general problem

\begin{aligned} \min_{x \in X} \max_{y \in Y} \left\{ \left\langle Kx, y \right\rangle + G(x) - F^*(y) \right\}, \end{aligned}

where X, Y are finite-dimensional real vector spaces, K : X \mapsto Y is a continuous linear operator with induced norm \| K \| = \max \{ \|Kx \|: x \in X \text{ and } \|x\| \leq 1 \}, G: X \mapsto [0, +\infty] and F^*: Y \mapsto [0, +\infty] are proper, convex, lower-semicontinuous functions.

The PDHG algorithm simply loops over the following 3 steps until convergence:

  1. (Dual step) y^{n+1} = \text{prox}_{\sigma F^*}\left( y^n + \sigma K \bar{x}^n \right).
  2. (Primal step) x^{n+1} = \text{prox}_{\tau G} \left( x^n - \tau K^* y^{n+1}\right).
  3. (Over-relaxation) \bar{x}^{n+1} = x^{n+1} + \theta (x^{n+1} - x^n).

In the above, \tau, \sigma > 0 and \theta \in [0, 1] are hyperparameters to be chosen by the user. (Note: Reference 1 doesn’t have proximal operators but has resolvent operators instead. See this post for why we can rewrite the algorithm with prox operators.) It is also possible to switch the order of the dual and primal steps, which gives rise to the following 2 steps until convergence (see Reference 2):

  1. x^{n+1} = \text{prox}_{\tau G} \left( x^n - \tau K^* y^n \right).
  2. y^{n+1} = \text{prox}_{\sigma F^*}\left( y^n + \sigma K \left[ x^{n+1} + \theta (x^{n+1} - x^n)\right] \right).

PDHG for LPs

Recall the LP problem is equivalent to

\begin{aligned} \min_{x \in \mathbb{R}^n} \max_{y \in \mathbb{R}_+^m} \left[ y^\top Ax + c^\top x - b^\top y \right]. \end{aligned}

To map this problem onto the PDHG problem, we need X \leftarrow \mathbb{R}^n, Y \leftarrow \mathbb{R}_+^m, K(x) = Ax, G(x) = c^\top x, F^*(y) = b^\top y. With this, the PDHG algorithm becomes

  1. x^{n+1} = \text{prox}_{\tau G} \left( x^n - \tau A^\top y^n \right).
  2. y^{n+1} = \text{prox}_{\sigma F^*}\left( y^n + \sigma A \left[ x^{n+1} + \theta (x^{n+1} - x^n)\right] \right).

Note that for h: X \mapsto \mathbb{R} such that h(x) = c^\top x, we have \text{prox}_h (x) = \text{proj}_X (x - c). (See this post for the proof.) Hence, we can replace the prox operators above with Euclidean projection operators:

  1. x^{n+1} = \text{proj}_{\mathbb{R}^n} \left( x^n - \tau A^\top y^n - \tau c \right) = x^n - \tau \left( A^\top y^n + c \right).
  2. y^{n+1} = \text{proj}_{\mathbb{R}_+^m}\left( y^n + \sigma A \left[ x^{n+1} + \theta (x^{n+1} - x^n) \right] - \sigma b \right).
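These two steps are short enough to run directly. Here is a toy NumPy version on a small LP (the specific LP and all constants are my own; the step sizes are chosen so that \tau \sigma \|A\|^2 < 1, the usual PDHG condition):

```python
import numpy as np

# LP: minimize x1 + x2  s.t.  x >= 0 and x1 + x2 >= 1, written as Ax <= b
A = np.array([[-1.0, 0.0], [0.0, -1.0], [-1.0, -1.0]])
b = np.array([0.0, 0.0, -1.0])
c = np.array([1.0, 1.0])

tau = sigma = 0.3        # need tau * sigma * ||A||^2 < 1; here ||A||^2 = 3
theta = 1.0
x, y = np.zeros(2), np.zeros(3)
for _ in range(5000):
    x_old = x
    x = x - tau * (A.T @ y + c)                        # primal step
    x_bar = x + theta * (x - x_old)                    # over-relaxation
    y = np.maximum(0.0, y + sigma * (A @ x_bar - b))   # dual step with projection
print(c @ x)             # should approach the optimal value 1
```

Note that the only nontrivial operations per iteration are two matrix-vector products and a componentwise clipping, which is what makes PDHG attractive for very large LPs.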

References:

  1. Chambolle, A., and Pock, T. (2011). A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imaging.
  2. Pock, T., and Chambolle, A. (2011). Diagonal preconditioning for first order primal-dual algorithms in convex optimization.

Proximal operator for inner product in R^n

Let h: \mathcal{X} \mapsto \mathbb{R} be a lower semi-continuous convex function with \mathcal{X} \subseteq \mathbb{R}^n being a convex set. The proximal operator for h is defined by

\begin{aligned} \text{prox}_h (x) = \text{argmin}_{\beta \in \mathcal{X}} \left\{ \frac{1}{2} \| x - \beta \|_2^2 + h(\beta) \right\}. \end{aligned}

In this post, we derive the value of the proximal operator for the function h: \mathcal{X} \mapsto \mathbb{R} where h(x) = c^\top x for some c \in \mathbb{R}^n.

\begin{aligned} \text{prox}_h (x) &= \text{argmin}_{\beta \in \mathcal{X}} \left\{ \frac{1}{2} \| x - \beta \|_2^2 + c^\top \beta \right\} \\  &= \text{argmin}_{\beta \in \mathcal{X}} \left\{ \frac{1}{2} \left( \beta^\top \beta - 2x^\top \beta \right)  + c^\top \beta \right\} \\  &= \text{argmin}_{\beta \in \mathcal{X}} \left\{ \frac{1}{2} \left(\beta^\top \beta - 2(x - c)^\top \beta \right) \right\} \\  &= \text{argmin}_{\beta \in \mathcal{X}} \left\{ \frac{1}{2} \left\| \beta - (x - c) \right\|_2^2 \right\} \\  &= \text{proj}_{\mathcal{X}} (x - c), \end{aligned}

where \text{proj}_{\mathcal{X}}(x) is the Euclidean projection of x onto \mathcal{X}. That is, the proximal operator is a translation followed by a Euclidean projection.
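We can sanity-check this formula numerically, e.g. with \mathcal{X} = \mathbb{R}_+^2, where the projection is coordinatewise clipping at zero (the brute-force grid search below is my own check, not part of the derivation):

```python
import numpy as np

# prox of h(beta) = c^T beta over X = R^2_+ should be proj_X(x - c) = max(x - c, 0)
c = np.array([0.7, -0.3])
x = np.array([0.2, 0.5])
closed_form = np.maximum(x - c, 0.0)

# brute-force check: minimize (1/2)||x - beta||^2 + c^T beta over a nonnegative grid
grid = np.linspace(0.0, 2.0, 401)
B1, B2 = np.meshgrid(grid, grid)
obj = 0.5 * ((x[0] - B1) ** 2 + (x[1] - B2) ** 2) + c[0] * B1 + c[1] * B2
i, j = np.unravel_index(np.argmin(obj), obj.shape)
brute = np.array([B1[i, j], B2[i, j]])
print(closed_form, brute)   # the two should agree up to grid resolution
```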

Strong duality for linear programs (LPs)

I keep forgetting the possible scenarios for strong duality of linear programs (LPs), so I’m putting it here as a reference.

Theorem (Strong duality of LPs). Consider a primal LP and its dual. There are 4 possibilities:

  1. Both primal and dual have no feasible solutions.
  2. The primal is infeasible and the dual is unbounded.
  3. The dual is infeasible and the primal is unbounded.
  4. Both primal and dual have feasible solutions and their values are equal.

Examples of each possibility can be found in Table 4.2 of Reference 1. A proof of this theorem can be found in multiple places (e.g. Reference 2).

References:

  1. Duality in Linear Programming.
  2. Williamson, D. P., and Kircher, K. (2014). Lecture 8: Strong duality.