Lecture 02 With Notes

This document covers the concepts of backpropagation and stochastic gradient descent in the context of machine learning for image analysis. It discusses multilayer perceptrons, supervised learning, and various activation functions, highlighting the importance of gradient descent methods for optimizing neural networks. Key challenges such as saturation and vanishing gradients in activation functions like sigmoid and ReLU are also addressed, along with practical strategies for improving convergence during training.

Machine Learning for Image Analysis
Lecture 2: Backpropagation, Stochastic Gradient Descent

Dagmar Kainmueller, 15.04.2025


Recap: Multi-Layer Perceptrons
[Figure: a multi-layer perceptron with an input layer (x1, x2, x3, x4), hidden layers, and an output layer (ŷ1, ŷ2). Each node is a perceptron / neuron; the connections carry weights θ^(l)_{i,j} for layers l = 1, 2, 3, with θ^(l)_{i,0} denoting the bias weights.]
Recap: Supervised Learning

◆ Given n training data instances


◆ Randomly initialized weights
[Figure: the supervised training loop. Inputs are fed through the model function to produce predictions; the loss function compares the predictions with the ground truth; gradient descent then computes the weight update, which feeds back into the model function.]
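To make the loop in the figure concrete, here is a minimal sketch in Python/NumPy. The model, loss, and learning rate are illustrative assumptions (plain least-squares linear regression), not part of the lecture material; only the structure of the loop mirrors the figure.

import numpy as np

def predict(theta, X):
    return X @ theta                                     # model function

def loss_fn(theta, X, y):
    return 0.5 * np.mean((predict(theta, X) - y) ** 2)   # loss function

def grad_fn(theta, X, y):
    return X.T @ (predict(theta, X) - y) / len(y)        # gradient of the loss

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))              # n training instances (inputs)
y = X @ np.array([1.0, -2.0, 0.5])         # ground truth
theta = rng.normal(size=3)                 # randomly initialized weights

lr = 0.1
for step in range(200):                    # gradient descent: weight update loop
    theta -= lr * grad_fn(theta, X, y)

print(theta, loss_fn(theta, X, y))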
Recap: Gradient Descent

Credits to Andrew Ng (Stanford Machine Learning, 2017, cs229)


Cross-Entropy Loss for Logistic Regression

Recap: Gradient Descent

Problems
- Explicit formula of gradient often too complex -> Use Backpropagation to calculate gradient
- For large training data the weight update is slow -> Use Batch Gradient Descent

Credits to Andrew Ng (Stanford Machine Learning, 2017, cs229)


Backpropagation

Compute graph: turn the (loss) function f into a compute graph.

f(θ, x) = 1 / (1 + e^(−(θ0·x0 + θ1·x1 + θ2)))

[Compute graph: θ0·x0 and θ1·x1 are computed by * nodes, summed together with θ2 in a + node, then passed through *(−1), exp, +1, and finally 1/x.]

Forward pass: calculating the function value by traversing the graph from the inputs to the output is called a forward pass.

Backpropagation: starting from the output node, we iteratively calculate gradients (backpropagation).

Numerical example in the Exercise Session.
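As a hedged illustration (not the exercise-session solution), the forward and backward passes through this compute graph can be traced node by node in plain Python; the input values below are arbitrary.

import math

# Forward and backward pass through the compute graph of
# f(theta, x) = 1 / (1 + exp(-(theta0*x0 + theta1*x1 + theta2)))
theta0, theta1, theta2 = 2.0, -3.0, -3.0   # arbitrary example values
x0, x1 = -1.0, -2.0

# Forward pass: evaluate each node of the graph.
a = theta0 * x0            # * node
b = theta1 * x1            # * node
s = a + b + theta2         # + node
m = -s                     # *(-1) node
e = math.exp(m)            # exp node
p = 1.0 + e                # +1 node
f = 1.0 / p                # 1/x node (output)

# Backward pass: multiply local gradients with the incoming global gradient.
df_df = 1.0                       # gradient at the output node
df_dp = -1.0 / p**2 * df_df       # local grad of 1/x is -1/x^2
df_de = 1.0 * df_dp               # +1 node passes the gradient through
df_dm = math.exp(m) * df_de       # local grad of exp is exp
df_ds = -1.0 * df_dm              # *(-1) node flips the sign
df_dtheta0 = x0 * df_ds           # * node: grad w.r.t. theta0 is x0
df_dx0 = theta0 * df_ds
df_dtheta1 = x1 * df_ds
df_dx1 = theta1 * df_ds
df_dtheta2 = 1.0 * df_ds          # + node distributes the gradient

print(f, df_dtheta0, df_dtheta1, df_dtheta2)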
Local vs Global Gradients

Chain rule: New global = Old global * Local

[Figure: a node computing z from inputs x and y. The local gradients ∂z/∂x and ∂z/∂y are taken w.r.t. the node. The incoming global gradient ∂L/∂z is the gradient w.r.t. the loss; the outgoing global gradients w.r.t. the loss are ∂L/∂x = (∂z/∂x)·(∂L/∂z) and ∂L/∂y = (∂z/∂y)·(∂L/∂z).]

Gradients add at branches (multivariable chain rule): if a node's output feeds into several downstream nodes, the global gradients coming back along those branches are summed.
Credits to FeiFei Li & Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
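A tiny sketch of the multivariable chain rule in Python: the function L = x*y + x^2 is a made-up example in which x feeds two branches, so the global gradients from both branches are summed.

# Gradients add at branches: L = x*y + x**2, so x feeds two nodes.
x, y = 3.0, 4.0

# Forward pass
a = x * y        # branch 1
b = x ** 2       # branch 2
L = a + b

# Backward pass
dL_da = 1.0
dL_db = 1.0
dL_dx_branch1 = y * dL_da               # through the * node
dL_dx_branch2 = 2 * x * dL_db           # through the square node
dL_dx = dL_dx_branch1 + dL_dx_branch2   # gradients add at the branch

print(dL_dx)   # 4 + 6 = 10, which matches d/dx (x*y + x^2) = y + 2x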
Example

[Figure: compute graph with inputs x = 3, y = −4, z = 2, w = −1; x and y are multiplied, max(z, w) is added, and the result is scaled by 2, i.e. f = 2·(x·y + max(z, w)).]

Patterns in backward flow

[Figure: the same graph annotated with forward values and gradients: the output −20.00 has gradient 1.00, the sum −10.00 receives gradient 2.00; x = 3.00 gets gradient −8.00, y = −4.00 gets gradient 6.00; z = 2.00 gets gradient 2.00, w = −1.00 gets gradient 0.00.]

- add gate: gradient distributor (passes the incoming gradient unchanged to all inputs)
- max gate: gradient router (routes the incoming gradient to the input that was the maximum; the others get 0)
- mult gate: input switcher (each input's gradient is the incoming gradient scaled by the other input's value)

Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
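The numbers on the slide can be reproduced with a short manual backward pass, assuming the graph computes f = 2*(x*y + max(z, w)) with x = 3, y = −4, z = 2, w = −1 (as read off the figure):

# Patterns in backward flow, assuming the graph f = 2 * (x*y + max(z, w)).
x, y, z, w = 3.0, -4.0, 2.0, -1.0

# Forward pass
m = x * y            # mult gate: -12
mx = max(z, w)       # max gate:   2
s = m + mx           # add gate: -10
f = 2.0 * s          # *2 gate:  -20

# Backward pass
df = 1.0
ds = 2.0 * df                  # *2 gate scales the gradient
dm = ds                        # add gate: gradient distributor
dmx = ds
dx = y * dm                    # mult gate: input switcher -> -8
dy = x * dm                    #                           ->  6
dz = dmx if z >= w else 0.0    # max gate: gradient router ->  2
dw = dmx if w > z else 0.0     #                           ->  0

print(f, dx, dy, dz, dw)   # -20.0 -8.0 6.0 2.0 0.0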
Cross-Entropy Loss for Logistic Regression

Recap: Gradient Descent

Problems
- Explicit formula of gradient often too complex -> Use Backpropagation to calculate gradient
- For large training data the weight update is slow -> Use Batch Gradient Descent

Credits to Andrew Ng (Stanford Machine Learning, 2017, cs229)


Batch Gradient Descent

Full-Batch Gradient Descent
Update rule: one weight update uses the gradient over the full training set.

„Vanilla" GD:
Repeat:
  update the weights once using the full training set

Stochastic Gradient Descent
Update rule: one weight update uses the gradient of a single training instance.

Stochastic GD:
1. Shuffle training data
2. Repeat:
   for each instance i:
     update the weights using instance i only

Mini-Batch Gradient Descent
Update rule: one weight update uses the gradient over a batch of size b.

Mini-batch GD:
1. Shuffle training data
2. Partition into batches of size b
3. Repeat:
   for each batch B:
     update the weights using batch B

1 Epoch = number of iterations until the full training set has passed through the algorithm
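A compact sketch of the three schemes in Python/NumPy. The model (linear least squares), learning rate, batch size, and epoch counts are illustrative assumptions; only the update structure follows the slide.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5)

def grad(theta, Xb, yb):
    # Gradient of the mean squared error on the (mini-)batch (Xb, yb).
    return Xb.T @ (Xb @ theta - yb) / len(yb)

lr, b = 0.1, 32

# "Vanilla" (full-batch) GD: one update per pass over the full training set.
theta = np.zeros(5)
for _ in range(100):
    theta -= lr * grad(theta, X, y)

# Stochastic GD: shuffle, then one update per training instance.
theta = np.zeros(5)
for epoch in range(5):
    perm = rng.permutation(len(X))
    for i in perm:
        theta -= lr * grad(theta, X[i:i+1], y[i:i+1])

# Mini-batch GD: shuffle, partition into batches of size b, one update per batch.
theta = np.zeros(5)
for epoch in range(5):                      # 1 epoch = full pass over the data
    perm = rng.permutation(len(X))
    for start in range(0, len(X), b):
        batch = perm[start:start + b]
        theta -= lr * grad(theta, X[batch], y[batch])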
Batch GD convergence

◆ GD has a smooth trajectory


◆ SGD has a jittery trajectory
◆ In practice, one weight update of SGD is much faster than one weight update of GD
MNIST
Batch GD in practice
- Shallow neural network with 800 hidden units
- ReLU hidden activations
- Trained on MNIST without regularization
- Constant learning rate of 1.0

[Smith et al. 2018] A Bayesian Perspective on Generalization and Stochastic Gradient Descent
Batch GD in practice
Rule of thumb
• Start with small batch sizes for better generalization
• Increase batch size gradually during training if needed
• Especially effective with adaptive optimizers like Adam or AdaGrad (see future lecture)
Activation Functions

Recap: a neuron computes a weighted sum of its inputs x1, …, xn with weights θ1, …, θn (incl. the bias θ0) and applies an activation function to the result. Activation functions need to be non-linear!

Examples of activation functions:
• Step function
• Sigmoid
• ReLU
Activation Functions

Credits to FeiFei Li & Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Activation Functions: Sigmoid

σ(x) = 1 / (1 + e^(−x))

• Squashes numbers to the range [0,1]
• Output can be interpreted as a probability
• Historically popular since they have a nice interpretation as a saturating "firing rate" of a neuron

[Figure: plot of the sigmoid function.]
Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Problem: Saturation

Sigmoid gate: x ↦ σ(x) = 1/(1 + e^(−x)), with local gradient ∂σ/∂x.

Backprop through the gate: ∂L/∂x = (∂σ/∂x) · (∂L/∂σ)

What happens when x = −10?
What happens when x = 0?
What happens when x = 10?
Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Problem: Vanishing Gradients

Sigmoid gate: x ↦ σ(x) = 1/(1 + e^(−x)), with local gradient ∂σ/∂x.

Backprop through the gate: ∂L/∂x = (∂σ/∂x) · (∂L/∂σ)

What happens when x = −10? When x = 0? When x = 10?

Backprop: the global gradient is multiplied with the local gradient in each layer.
The local gradient of the sigmoid is ∈ (0, 0.25].
E.g. best case for 10 layers: 0.25^10 == 0.000001
Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
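A quick numerical check of the statements above (a sketch, not from the slides): the sigmoid's local gradient σ′(x) = σ(x)(1 − σ(x)) peaks at 0.25 at x = 0, is essentially zero at x = ±10, and even the best case shrinks to about 1e-6 after 10 layers.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)        # local gradient, always in (0, 0.25]

for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x), sigmoid_grad(x))
# x = -10 and x = 10: gradient ~ 4.5e-05 -> the neuron is saturated
# x = 0: gradient = 0.25 (the maximum)

print(0.25 ** 10)               # ~9.5e-07: best case after 10 sigmoid layers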
Problem: Not zero-centered

Not zero-centered = the activation output is only positive (or negative).

[Figure: a neuron computing a linear function f of inputs x with weights θ, where the inputs x are outputs of sigmoid activations and hence all positive.]

What can we say about the local gradient of the linear function f w.r.t. θ?

The global gradient is always all positive or all negative! The allowed gradient update directions are therefore restricted, so moving towards a hypothetical optimal θ vector takes a zig-zag path (= inefficient optimization).
Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Activation Functions: Sigmoid

σ(x) = 1 / (1 + e^(−x))

• Squashes numbers to the range [0,1]
• Output can be interpreted as a probability
• Historically popular since they have a nice interpretation as a saturating "firing rate" of a neuron

Problems:
1. Saturated neurons "kill" the gradients
2. Vanishing gradients due to max gradient 1/4
3. Sigmoid outputs are not zero-centered
4. exp() is a bit compute expensive
Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Activation Functions: tanh

tanh(x)

• Squashes numbers to the range (−1, 1)
• Zero-centered
• Still kills gradients when saturated

[LeCun et al., 1991]
Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Activation Functions

Credits to FeiFei Li & Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Activation Functions: ReLU (Rectified Linear Unit)

• Computes f(x) = max(0, x)
• Does not saturate (in + region)
• Very computationally efficient
• Converges much faster than sigmoid/tanh in practice (e.g. 6x)

[Krizhevsky et al., 2012]
Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Activation Functions: ReLU

ReLU node: x ↦ f(x) = max(0, x), with local gradient ∂f/∂x.

Backprop through the node: ∂L/∂x = (∂f/∂x) · (∂L/∂f)

What happens when x = −10?
What happens when x = 0?
What happens when x = 10?
Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Activation Functions: ReLU (Rectified Linear Unit)

• Computes f(x) = max(0, x)
• Does not saturate (in + region)
• Very computationally efficient
• Converges much faster than sigmoid/tanh in practice (e.g. 6x)
• (more "biologically plausible" than sigmoid)

Problems:
• Not zero-centered
• Gradient = 0 when x < 0
Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
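A small sketch of ReLU's forward and backward behaviour (illustrative, not from the slides): the gradient passes through unchanged for x > 0 and is exactly 0 for x < 0.

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_backward(x, upstream_grad):
    # Local gradient is 1 for x > 0 and 0 for x < 0 (0 at x = 0 by convention).
    return upstream_grad * (x > 0)

x = np.array([-10.0, 0.0, 10.0])
print(relu(x))                         # [ 0.  0. 10.]
print(relu_backward(x, np.ones(3)))    # [0. 0. 1.]  -> gradient is killed for x <= 0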
Activation Functions

Credits to FeiFei Li & Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Activation Functions: Leaky ReLU

• Does not saturate
• Computationally efficient
• Closer to zero-mean outputs
• Converges much faster than sigmoid/tanh in practice (e.g. 6x)
• Will not "kill" the gradient
• Used in GANs, ResNets

Leaky ReLU: f(x) = max(0.1x, x) [Maas et al., 2013]
Parametric Rectifier (PReLU): f(x) = max(αx, x), backprop into α (parameter) [He et al., 2015]

Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Activation Functions

Credits to FeiFei Li & Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Activation Functions: ELU (Exponential Linear Units)

ELU(x) = x for x > 0, α·(e^x − 1) for x ≤ 0

• All benefits of ReLU
• Closer to zero-mean outputs
• Negative saturation regime (compared with Leaky ReLU) adds some robustness to noise
• Computation requires exp()

[Clevert et al., 2015]
Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
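A hedged sketch of Leaky ReLU and ELU side by side; the slope 0.1 matches the Leaky ReLU slide, while α = 1.0 for ELU is an assumed default.

import numpy as np

def leaky_relu(x, slope=0.1):
    # f(x) = max(slope * x, x); slope = 0.1 as on the Leaky ReLU slide.
    return np.maximum(slope * x, x)

def elu(x, alpha=1.0):
    # ELU(x) = x for x > 0, alpha * (exp(x) - 1) otherwise (alpha = 1.0 assumed).
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(leaky_relu(x))   # negative inputs keep a small, non-zero gradient path
print(elu(x))          # negative inputs saturate smoothly towards -alpha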
Activation Functions

SiLU
Activation Functions: SiLU / GELU

Sigmoid Linear Units (SiLU): SiLU(x) = x · σ(x)
Gaussian Error Linear Units (GELU): GELU(x) = x · Φ(x), where Φ is the standard Gaussian CDF

• Similar benefits as ELU
• Used in GPT-3, BERT and Transformers
• Computation requires exp() or an approximation

[Hendrycks et al., 2016]

Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
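A sketch of SiLU and GELU as defined above; the tanh-based GELU approximation is the one commonly attributed to [Hendrycks et al., 2016] and is used here as an assumption (an exact erf-based form also exists).

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def silu(x):
    # SiLU(x) = x * sigmoid(x)
    return x * sigmoid(x)

def gelu(x):
    # GELU(x) = x * Phi(x), with Phi approximated by the tanh-based formula.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

x = np.linspace(-3, 3, 7)
print(silu(x))
print(gelu(x))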
Activation Functions: Maxout

max(θ1ᵀx + b1, θ2ᵀx + b2)

• Does not have the basic form of dot product ➡ nonlinearity
• Generalizes ReLU and Leaky ReLU
• Linear regime! Does not saturate! Does not die!
• Problem: doubles the number of parameters per neuron

Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
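A minimal sketch of a single maxout unit; the weight values are arbitrary and only illustrate the two linear pieces.

import numpy as np

def maxout(x, theta1, b1, theta2, b2):
    # max(theta1^T x + b1, theta2^T x + b2): two linear pieces per neuron,
    # hence twice the parameters of a plain linear neuron.
    return np.maximum(theta1 @ x + b1, theta2 @ x + b2)

x = np.array([1.0, -2.0, 0.5])
theta1, b1 = np.array([0.2, -0.1, 0.4]), 0.0
theta2, b2 = np.array([-0.3, 0.5, 0.1]), 0.1
print(maxout(x, theta1, b1, theta2, b2))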
Activation Functions
TLDR: In practice:

• Use ReLU for starters


• Try out Leaky ReLU / ELU / SiLU / GELU
• Don’t expect much from historic tanh or sigmoid

Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Classification

Binary Classification

[Figure: training data with instances 1, …, n of Class 1 and Class 0; a new input is fed through the network, whose sigmoid output is interpreted as a probability. Example: a sigmoid output of 0.76 means 76% Class 1, 24% Class 0.]
CIFAR-10

Collection of images:
• 10 classes
• 50000 training images
• 10000 test images
• 3-channel (RGB) images
• Each image is 32x32x3
Dagmar Kainmueller, Lecture 4 Credits to FeiFei Li & Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Classification

[Figure: training data with instances 1, …, n; a new input image is fed through the network, which outputs one score per class, e.g. 0.6, 0.0, −0.2, −0.3, 0.0, 2.5, 0.7, 2.0, −0.3, −0.2 for the 10 classes.]
Types of Classification Tasks

• Multi-class classification:
Each image belongs to exactly one of C classes. This is a single classification problem. The ground truth (labels) can be one-hot encoded.

• Multi-label classification:
Each image can belong to more than one class. This problem can be treated as C different binary classification problems.
Credit to Raul Gomez


Multi-label classification

Let zc denote the score obtained from the network for class c.

Sigmoid (aka logistic function): σ(zc) = 1 / (1 + e^(−zc)) =: pc

Squishes each score into the (0,1) range —> can be interpreted as a class probability.
Only depends on a single score —> appropriate for multi-label classification.
Credit to Raul Gomez


Multi-label classification

Let zc denote the score obtained from the network for class c.

Sigmoid (aka logistic function): σ(zc) = 1 / (1 + e^(−zc)) =: pc

Binary cross-entropy loss per class c ∈ C:
Jc = −tc·ln(pc) − (1 − tc)·ln(1 − pc)

aka "sigmoid cross-entropy loss"
Credit to Raul Gomez (https://youtu.be/635cmrp4z40)
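A sketch of the per-class sigmoid plus binary cross-entropy loss defined above, with illustrative scores and multi-label targets:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(z, t):
    # J_c = -t_c * ln(p_c) - (1 - t_c) * ln(1 - p_c), with p_c = sigmoid(z_c),
    # computed independently per class (multi-label setting).
    p = sigmoid(z)
    return -t * np.log(p) - (1.0 - t) * np.log(1.0 - p)

z = np.array([2.0, -1.0, 0.5])   # network scores for C = 3 classes
t = np.array([1.0, 0.0, 1.0])    # multi-label targets (not mutually exclusive)
print(binary_cross_entropy(z, t))        # per-class losses J_c
print(binary_cross_entropy(z, t).sum())  # total loss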
Multi-label classification

Let zc denote the score obtained from the network for class c.

Sigmoid (aka logistic function): σ(zc) = 1 / (1 + e^(−zc)) =: pc

Problem with sigmoid for multi-class classification problems: mutual exclusivity of the labels is not taken into account.
Credit to Raul Gomez (https://youtu.be/635cmrp4z40)


Multi-class classification

Let zc denote the score obtained from the network for class c.

Softmax: softmax(z)c = e^(zc) / Σc′=1..C e^(zc′) =: pc

Squishes each score into the (0,1) range, and all scores sum up to 1 —> can be interpreted as a probability distribution over the classes.
Each output depends on all scores —> appropriate for multi-class classification.

Credit to Raul Gomez (https://youtu.be/635cmrp4z40)
Multi-class classification

Let zc denote the score obtained from the network for class c.

Softmax: softmax(z)c = e^(zc) / Σc′=1..C e^(zc′) =: pc

For two classes (i.e. binary classification), softmax is equivalent to sigmoid:
softmax(z)1 = e^(z1) / (e^(z1) + e^(z2)) = 1 / (1 + e^(z2−z1)) = σ(z1 − z2)
Credit to Raul Gomez (https://youtu.be/635cmrp4z40)
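A quick numeric check of the two-class equivalence, with arbitrary scores:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([1.3, -0.4])            # two arbitrary class scores
print(softmax(z)[0])                 # softmax(z)_1
print(sigmoid(z[0] - z[1]))          # sigma(z_1 - z_2): identical value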


Multi-class classification

Let zc denote the score obtained from the network for class c.

Softmax: softmax(z)c = e^(zc) / Σc′=1..C e^(zc′) =: pc

Cross-entropy loss: J = −Σc tc·ln(pc)

aka "softmax cross-entropy loss" or "softmax loss"

(The binary cross-entropy loss is just the cross-entropy loss for two classes.)
Credit to Raul Gomez (https://youtu.be/635cmrp4z40)
Multi-class classification

Let zc denote the score obtained from the network for class c.

Softmax: softmax(z)c = e^(zc) / Σc′=1..C e^(zc′) =: pc

The cross-entropy loss has a simple gradient w.r.t. z (see future assignment).
Credit to Raul Gomez
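A sketch of the softmax cross-entropy loss together with its gradient w.r.t. the scores z. The closed form ∂J/∂z = p − t is a standard result stated here as an assumption (the derivation is left to the future assignment) and checked numerically below.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())              # stabilized softmax
    return e / e.sum()

def cross_entropy(z, t):
    # J = -sum_c t_c * ln(p_c) with p = softmax(z) and t one-hot.
    return -np.sum(t * np.log(softmax(z)))

z = np.array([0.6, 2.5, -0.3])           # scores for C = 3 classes
t = np.array([0.0, 1.0, 0.0])            # one-hot ground truth

# Analytic gradient (standard result, assumed here): dJ/dz = p - t
analytic = softmax(z) - t

# Numerical gradient for comparison
eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * np.eye(3)[c], t)
     - cross_entropy(z - eps * np.eye(3)[c], t)) / (2 * eps)
    for c in range(3)
])
print(analytic)
print(numeric)    # matches the analytic gradient up to numerical error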
Next Lecture: April 29

Convolutional Neural Networks for image data


Receptive field of CNNs
