Lecture 02 With Notes
Analysis
Lecture 2: Backpropagation, Stochastic Gradient Descent
[Figure: a small feed-forward network; inputs x1 … x4 are combined layer by layer with weights θ(1), θ(2), θ(3) to produce the outputs ŷ1 and ŷ2]
Recap: Supervised Learning

[Diagram: training loop, model function -> loss function -> gradient descent -> weight update -> back to the model function]
Recap: Gradient Descent

Problems
- Explicit formula of the gradient is often too complex -> use backpropagation to calculate the gradient
- For large training data the weight update is slow -> use stochastic (mini-batch) gradient descent
Forward pass

f(θ, x) = 1 / (1 + e^−(θ0·x0 + θ1·x1 + θ2))

[Computational graph: θ0·x0 and θ1·x1 are multiplied, summed together with θ2, then passed through ·(−1), exp, +1 and 1/x to give the sigmoid output]

Calculation of the function value is called a forward pass.
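A minimal sketch of this forward pass in Python; the concrete values of θ and x below are made up for illustration:

```python
import math

# Made-up example values (not from the lecture)
theta0, theta1, theta2 = 2.0, -3.0, -3.0   # weights and bias
x0, x1 = -1.0, -2.0                        # inputs

# Forward pass, node by node, as in the computational graph
a = theta0 * x0                  # 2.0 * -1.0 = -2.0
b = theta1 * x1                  # -3.0 * -2.0 = 6.0
s = a + b + theta2               # -2.0 + 6.0 - 3.0 = 1.0
f = 1.0 / (1.0 + math.exp(-s))   # sigmoid(1.0)

print(f)                         # ~0.731
```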
Backpropagation

f(θ, x) = 1 / (1 + e^−(θ0·x0 + θ1·x1 + θ2))

[Same computational graph as above, now traversed backwards from the output]

Starting from the output node we iteratively calculate gradients (Backpropagation).
Numerical example in the Exercise Session.
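A numerical example is worked through in the exercise session; as an independent illustration (not the exercise's example), the snippet below backpropagates by hand through the same graph, continuing the made-up values from the forward-pass sketch above:

```python
import math

# Same made-up values as in the forward-pass sketch above
theta0, theta1, theta2 = 2.0, -3.0, -3.0
x0, x1 = -1.0, -2.0

# Forward pass
s = theta0 * x0 + theta1 * x1 + theta2      # 1.0
f = 1.0 / (1.0 + math.exp(-s))              # sigmoid(s), ~0.731

# Backward pass: multiply the incoming global gradient by each local gradient
df_ds = f * (1.0 - f)          # local gradient of the sigmoid, ~0.197
ds_dtheta0 = x0                # local gradient of the multiply node
ds_dtheta1 = x1
ds_dtheta2 = 1.0               # the bias enters the sum directly

grad_theta0 = df_ds * ds_dtheta0   # ~ -0.197
grad_theta1 = df_ds * ds_dtheta1   # ~ -0.393
grad_theta2 = df_ds * ds_dtheta2   # ~  0.197

print(grad_theta0, grad_theta1, grad_theta2)
```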
Local vs Global Gradients

Chain rule: new global gradient = old (incoming) global gradient · local gradient

[Diagram: a node with inputs x, y and output z, feeding further into the network towards the loss L]
• Local gradients ∂z/∂x, ∂z/∂y: gradients of the node's output w.r.t. its inputs
• Incoming global gradient ∂L/∂z: gradient of the loss w.r.t. the node's output
• Outgoing global gradient ∂L/∂x = ∂z/∂x · ∂L/∂z: gradient of the loss w.r.t. the node's input

Gradients add at branches (multivariable chain rule)
Credits to FeiFei Li & Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
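A tiny sketch of this rule for a single multiply node; the values, including the incoming global gradient, are made up:

```python
# One multiply node z = x * y sitting somewhere inside a larger network
x, y = 3.0, -4.0
z = x * y                      # forward value: -12.0

# Suppose backprop has already delivered the incoming global gradient dL/dz
dL_dz = 2.0                    # made-up upstream value

# Local gradients of the node
dz_dx = y                      # -4.0
dz_dy = x                      #  3.0

# Outgoing global gradients: new global = old global * local
dL_dx = dL_dz * dz_dx          # 2.0 * -4.0 = -8.0
dL_dy = dL_dz * dz_dy          # 2.0 *  3.0 =  6.0
```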
Example

[Computational graph with multiply (*), add (+), max and ×2 nodes; inputs include −4, 2 and −1; worked through on the next slide]
Patterns in backward flow

Another example: f(x, y, z, w) = 2 · (x·y + max(z, w)) with x = 3, y = −4, z = 2, w = −1

[Computational graph with forward values and gradients: x·y feeds the add node, max(z, w) = 2.00 feeds the add node, the sum is −10.00, the ×2 output is −20.00; gradients: ∂f/∂x = −8.00, ∂f/∂y = 6.00, ∂f/∂z = 2.00, ∂f/∂w = 0.00, gradient at the output = 1.00]

• add gate: gradient distributor (passes the incoming gradient unchanged to both inputs)
• max gate: gradient router (passes the incoming gradient to the larger input, 0 to the other)
• mult gate: input switcher (the local gradient w.r.t. one input is the value of the other input)

Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
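A compact sketch of this example in code. Reading the figure as f(x, y, z, w) = 2·(x·y + max(z, w)), it reproduces the forward value −20.00 and the gradients −8, 6, 2 and 0 shown above, and makes the three gate patterns explicit:

```python
x, y, z, w = 3.0, -4.0, 2.0, -1.0

# Forward pass
p = x * y              # -12.0   (mult gate)
m = max(z, w)          #   2.0   (max gate)
s = p + m              # -10.0   (add gate)
f = 2.0 * s            # -20.0   (scale by constant 2)

# Backward pass (gradient of the output w.r.t. itself is 1.0)
df = 1.0
ds = 2.0 * df          # scale gate: multiply by the constant 2

# add gate: distributes the same gradient to both inputs
dp = ds                # 2.0
dm = ds                # 2.0

# max gate: routes the gradient to the larger input, 0 to the other
dz = dm if z > w else 0.0   # 2.0
dw = dm if w > z else 0.0   # 0.0

# mult gate: switches the inputs (local gradient of x is y, and vice versa)
dx = y * dp            # -8.0
dy = x * dp            #  6.0

print(f, dx, dy, dz, dw)
```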
Cross-Entropy Loss for Logistic Regression: L(θ) = −[y · log ŷ + (1 − y) · log(1 − ŷ)], with ŷ = σ(θᵀx)
Problems
- Explicit formula of the gradient is often too complex -> use backpropagation to calculate the gradient
- For large training data the weight update is slow -> use stochastic (mini-batch) gradient descent
„Vanilla“ GD
Repeat: θ ← θ − α · ∇θ L(θ)   (gradient computed over the full training set)

Stochastic Gradient Descent
Update rule: θ ← θ − α · ∇θ L(θ; x⁽ⁱ⁾, y⁽ⁱ⁾)   (gradient computed on a single randomly drawn training example, or a small mini-batch)

[Figure: comparison of „Vanilla“ GD and Stochastic GD]
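A minimal sketch contrasting the two update loops on a toy one-dimensional least-squares problem; the data, learning rate and iteration counts are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)                          # toy inputs
y = 3.0 * X + rng.normal(scale=0.1, size=100)     # targets with true slope 3
lr = 0.1

def grad(theta, xb, yb):
    # gradient of the mean squared error 0.5 * mean((theta*x - y)^2)
    return np.mean((theta * xb - yb) * xb)

# "Vanilla" (batch) gradient descent: one update per pass over ALL data
theta_gd = 0.0
for _ in range(100):
    theta_gd -= lr * grad(theta_gd, X, y)

# Stochastic gradient descent: one cheap update per single training example
theta_sgd = 0.0
for _ in range(100):
    i = rng.integers(len(X))
    theta_sgd -= lr * grad(theta_sgd, X[i:i+1], y[i:i+1])

print(theta_gd, theta_sgd)   # both approach the true slope of about 3
```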
Recap: Neuron

[Diagram: inputs x1 … xn with weights θ1 … θn and a bias θ0 are summed and passed through an activation function]

Activation functions (need to be non-linear!):
• Step function
• Sigmoid
• ReLU
• …

Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Activation Functions: Sigmoid

σ(x) = 1 / (1 + e^−x)

• Squashes numbers to range [0, 1]
• Output can be interpreted as a probability
• Historically popular since they have a nice interpretation as a saturating "firing rate" of a neuron
Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Problem: Saturation -> Vanishing Gradients

[Sigmoid gate: input x, output σ(x) = 1/(1 + e^−x); local gradient ∂σ/∂x, incoming global gradient ∂L/∂σ]

∂L/∂x = ∂σ/∂x · ∂L/∂σ

Where the sigmoid saturates (|x| large), the local gradient ∂σ/∂x is almost zero, so the gradient flowing back through the gate vanishes.

Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
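A small numerical illustration of why saturation leads to vanishing gradients; the probe points are arbitrary:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # local gradient, at most 0.25 (at x = 0)

# In the saturated regions the local gradient is essentially zero
for x in [-10.0, -5.0, 0.0, 5.0, 10.0]:
    print(x, sigmoid_grad(x))     # ~4.5e-05, ~6.6e-03, 0.25, ~6.6e-03, ~4.5e-05

# Chaining many sigmoid gates multiplies these small local gradients together,
# so the global gradient reaching early layers shrinks toward zero.
print(0.25 ** 10)                 # ~9.5e-07 even in the best case
```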
Problem: Not zero-centered

Not zero-centered = the activation output is only positive (or negative)

[Diagram: the inputs to a neuron f are sigmoid outputs of the previous layer, hence all positive]

What can we say about the local gradients of f w.r.t. its weights θ? They are just the (all-positive) inputs, so the gradients of the loss w.r.t. all weights of the neuron share the same sign. The allowed gradient update directions are restricted to two quadrants, which forces inefficient zig-zag updates.

Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
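A small sketch of why all-positive inputs restrict the update directions; the input and weight values are made up:

```python
import numpy as np

# Inputs to the neuron are sigmoid outputs, hence all positive (made-up values)
h = np.array([0.2, 0.9, 0.6, 0.7])
theta = np.array([0.5, -1.0, 2.0, 0.3])

# Pre-activation of the neuron f
s = theta @ h

# Local gradients of s w.r.t. the weights are just the (all-positive) inputs
ds_dtheta = h

# Whatever the upstream gradient dL/ds is, all weight gradients get its sign
for dL_ds in [+1.3, -0.7]:
    dL_dtheta = dL_ds * ds_dtheta
    print(np.sign(dL_dtheta))     # all +1 or all -1 -> zig-zag updates
```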
Activation Functions: tanh

tanh(x): squashes numbers to range [−1, 1] and is zero-centered [LeCun et al., 1991]
Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Activation Functions: ReLU (Rectified Linear Unit) [Krizhevsky et al., 2012]

• Computes f(x) = max(0, x)
• Does not saturate (in the + region)
• Very computationally efficient
• Converges much faster than sigmoid/tanh in practice (e.g. 6×)
Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Activation Functions: ReLU

[ReLU gate: input x, output f(x) = max(0, x); local gradient ∂f/∂x, incoming global gradient ∂L/∂f]

∂L/∂x = ∂f/∂x · ∂L/∂f

The local gradient is 1 for x > 0 (the gradient passes through unchanged) and 0 for x < 0 (the gradient is killed, so the neuron can "die").
Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
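A minimal sketch of the ReLU gate's forward and backward pass; the inputs and upstream gradients are made up:

```python
import numpy as np

def relu_forward(x):
    return np.maximum(0.0, x)

def relu_backward(x, dL_df):
    # local gradient is 1 where x > 0 and 0 where x <= 0, so the incoming
    # global gradient is either passed through or killed
    return dL_df * (x > 0)

x = np.array([-2.0, -0.5, 0.5, 3.0])
dL_df = np.array([1.0, 1.0, 1.0, 1.0])     # made-up upstream gradients
print(relu_forward(x))            # [0.  0.  0.5 3. ]
print(relu_backward(x, dL_df))    # [0.  0.  1.  1. ]
```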
Activation Functions: Leaky ReLU

• Does not saturate
• Computationally efficient
• Closer to zero-mean outputs
• Converges much faster than sigmoid/tanh in practice (e.g. 6×)
• Will not "kill" the gradient
• Used in GANs, ResNets

Leaky ReLU: f(x) = max(0.1x, x) [Maas et al., 2013]
Parametric Rectifier (PReLU): f(x) = max(αx, x), backprop into α (learned parameter) [He et al., 2015]

Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Activation Functions: ELU

ELU(x) = x for x > 0, α(e^x − 1) for x ≤ 0   [Clevert et al., 2015]

• All benefits of ReLU
• Closer to zero-mean outputs
• Negative saturation regime (compared with Leaky ReLU) adds some robustness to noise
Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Activation Functions: SiLU / GELU

SiLU(x) = x · σ(x)    GELU(x) = x · Φ(x)   (Φ: CDF of the standard normal distribution)   [Hendrycks et al., 2016]

• Similar benefits as ELU
• Used in GPT-3, BERT and Transformers
Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
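A minimal sketch of the two definitions above, using the exact normal CDF for GELU; the probe points are arbitrary:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def silu(x):
    # SiLU: x times the logistic sigmoid of x
    return x * sigmoid(x)

def gelu(x):
    # GELU: x times the standard normal CDF of x
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    print(x, silu(x), gelu(x))
# Both behave like ReLU for large |x| but are smooth around 0
```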
Activation Functions: Maxout

max(θ1ᵀx + b1, θ2ᵀx + b2)

• Does not have the basic form of dot product -> nonlinearity
• Generalizes ReLU and Leaky ReLU
• Linear regime! Does not saturate! Does not die!
• Problem: doubles the number of parameters per neuron

Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
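A minimal sketch of a single maxout unit; the input dimension and the initialisation are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# One maxout neuron: two full affine maps, take the maximum of their outputs
x = rng.normal(size=5)                    # input (made-up)
theta1, b1 = rng.normal(size=5), 0.1      # first affine map
theta2, b2 = rng.normal(size=5), -0.2     # second affine map

out = max(theta1 @ x + b1, theta2 @ x + b2)

# Compared with ReLU (= max(0, theta @ x + b)) this never has a zero-gradient
# region, but it needs twice as many parameters per neuron.
print(out)
```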
Activation Functions: TL;DR, in practice

[Summary slide with practical recommendations]
Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Classification

Binary Classification
[Diagram: training data x1 … xn with labels Class 1 / Class 0; a new input passed through the network with a sigmoid output gives e.g. 0.76]
Interpretation: 76% Class 1, 24% Class 0

CIFAR-10
• Collection of images, 10 classes
• 50,000 training images, 10,000 test images
• 3-channel (RGB) images

Credits: Dagmar Kainmueller, Lecture 4; Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Classification (multi-class)

[Diagram: training data x1 … xn; a new input is passed through the network, which outputs one score per class, e.g. (0.6, 0.0, −0.2, −0.3, 0.0, 2.5, 0.7, 2.0, −0.3, −0.2) for the 10 classes]
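The values above are raw, per-class scores; one common way to turn them into class probabilities (not stated explicitly on this slide) is a softmax. A minimal sketch:

```python
import numpy as np

# The example scores from the slide, one per CIFAR-10 class
scores = np.array([0.6, 0.0, -0.2, -0.3, 0.0, 2.5, 0.7, 2.0, -0.3, -0.2])

# Softmax: subtract the max for numerical stability, exponentiate, normalise
probs = np.exp(scores - scores.max())
probs /= probs.sum()

print(probs.sum())        # sums to 1
print(probs.argmax())     # 5 -> the class with score 2.5 gets the highest probability
```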
Types of Classification Tasks

• Multi-class classification:
Each image belongs to exactly one of C classes. This is a single classification problem. The ground truth (labels) can be one-hot encoded.

• Multi-label classification:
Each image can belong to more than one class. This problem can be treated as C different binary classification problems.

Cross-entropy loss: L = −Σ_c y_c · log(ŷ_c)
(The binary cross-entropy loss is just the cross-entropy loss for two classes)
Credit to Raul Gomez (https://youtu.be/635cmrp4z40)
Multi-class classification

Cross-entropy loss: L = −Σ_{c=1..C} y_c · log(ŷ_c), where ŷ = softmax(scores) and y is the one-hot ground-truth label
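A minimal sketch of the cross-entropy loss for both cases; the predicted probabilities and labels are made up:

```python
import numpy as np

def cross_entropy(probs, one_hot):
    # Multi-class cross-entropy: -sum_c y_c * log(p_c)
    return -np.sum(one_hot * np.log(probs))

# Made-up predicted probabilities and a one-hot ground-truth label (class 2)
probs = np.array([0.1, 0.2, 0.6, 0.1])
one_hot = np.array([0.0, 0.0, 1.0, 0.0])
print(cross_entropy(probs, one_hot))          # -log(0.6) ~ 0.511

# Binary cross-entropy is the same formula for two classes (p, 1 - p)
p, y = 0.76, 1.0
bce = -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
print(bce)                                    # -log(0.76) ~ 0.274
```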