Lecture 02 With Notes

This document covers the concepts of backpropagation and stochastic gradient descent in the context of machine learning for image analysis. It discusses multilayer perceptrons, supervised learning, and various activation functions, highlighting the importance of gradient descent methods for optimizing neural networks. Key challenges such as saturation and vanishing gradients in activation functions like sigmoid and ReLU are also addressed, along with practical strategies for improving convergence during training.

Machine Learning for Image Analysis
Lecture 2: Backpropagation, Stochastic Gradient Descent

Dagmar Kainmueller, 15.04.2025


Recap: Multi-Layer Perceptrons
[Figure: a multi-layer perceptron with an input layer (x1, x2, x3, x4), hidden layers, and an output layer (ŷ1, ŷ2). Each node is a perceptron / neuron; the connections carry weights θ^(l)_{i,j} for layers l = 1, 2, 3, with θ^(l)_{i,0} denoting the bias weights.]
Recap: Supervised Learning

◆ Given n training data instances


◆ Randomly initialized weights
[Figure: the supervised training loop. Inputs are fed through the model function to produce predictions; the loss function compares the predictions with the ground truth; gradient descent then computes the weight update, which feeds back into the model function.]
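To make the loop in the figure concrete, here is a minimal sketch in Python/NumPy. The model, loss, and learning rate are illustrative assumptions (plain least-squares linear regression), not part of the lecture material; only the structure of the loop mirrors the figure.

import numpy as np

def predict(theta, X):
    return X @ theta                                     # model function

def loss_fn(theta, X, y):
    return 0.5 * np.mean((predict(theta, X) - y) ** 2)   # loss function

def grad_fn(theta, X, y):
    return X.T @ (predict(theta, X) - y) / len(y)        # gradient of the loss

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))              # n training instances (inputs)
y = X @ np.array([1.0, -2.0, 0.5])         # ground truth
theta = rng.normal(size=3)                 # randomly initialized weights

lr = 0.1
for step in range(200):                    # gradient descent: weight update loop
    theta -= lr * grad_fn(theta, X, y)

print(theta, loss_fn(theta, X, y))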
Recap: Gradient Descent

Credits to Andrew Ng (Stanford Machine Learning, 2017, cs229)


Cross-Entropy Loss for Logistic Regression

Recap: Gradient Descent

Problems
- Explicit formula of gradient often too complex -> Use Backpropagation to calculate gradient
- For large training data the weight update is slow -> Use Batch Gradient Descent

Credits to Andrew Ng (Stanford Machine Learning, 2017, cs229)


Backpropagation

Compute graph: turn the (loss) function f into a compute graph.

f(θ, x) = 1 / (1 + e^(−(θ0·x0 + θ1·x1 + θ2)))

[Compute graph: θ0·x0 and θ1·x1 are computed by * nodes, summed together with θ2 in a + node, then passed through *(−1), exp, +1, and finally 1/x.]

Forward pass: calculating the function value by traversing the graph from the inputs to the output is called a forward pass.

Backpropagation: starting from the output node, we iteratively calculate gradients (backpropagation).

Numerical example in the Exercise Session.
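As a hedged illustration (not the exercise-session solution), the forward and backward passes through this compute graph can be traced node by node in plain Python; the input values below are arbitrary.

import math

# Forward and backward pass through the compute graph of
# f(theta, x) = 1 / (1 + exp(-(theta0*x0 + theta1*x1 + theta2)))
theta0, theta1, theta2 = 2.0, -3.0, -3.0   # arbitrary example values
x0, x1 = -1.0, -2.0

# Forward pass: evaluate each node of the graph.
a = theta0 * x0            # * node
b = theta1 * x1            # * node
s = a + b + theta2         # + node
m = -s                     # *(-1) node
e = math.exp(m)            # exp node
p = 1.0 + e                # +1 node
f = 1.0 / p                # 1/x node (output)

# Backward pass: multiply local gradients with the incoming global gradient.
df_df = 1.0                       # gradient at the output node
df_dp = -1.0 / p**2 * df_df       # local grad of 1/x is -1/x^2
df_de = 1.0 * df_dp               # +1 node passes the gradient through
df_dm = math.exp(m) * df_de       # local grad of exp is exp
df_ds = -1.0 * df_dm              # *(-1) node flips the sign
df_dtheta0 = x0 * df_ds           # * node: grad w.r.t. theta0 is x0
df_dx0 = theta0 * df_ds
df_dtheta1 = x1 * df_ds
df_dx1 = theta1 * df_ds
df_dtheta2 = 1.0 * df_ds          # + node distributes the gradient

print(f, df_dtheta0, df_dtheta1, df_dtheta2)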
Local vs Global Gradients

Chain rule: New global = Old global * Local

[Figure: a node computing z from inputs x and y. The local gradients ∂z/∂x and ∂z/∂y are taken w.r.t. the node. The incoming global gradient ∂L/∂z is the gradient w.r.t. the loss; the outgoing global gradients w.r.t. the loss are ∂L/∂x = (∂z/∂x)·(∂L/∂z) and ∂L/∂y = (∂z/∂y)·(∂L/∂z).]

Gradients add at branches (multivariable chain rule): if a node's output feeds into several downstream nodes, the global gradients coming back along those branches are summed.
Credits to FeiFei Li & Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
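A tiny sketch of the multivariable chain rule in Python: the function L = x*y + x^2 is a made-up example in which x feeds two branches, so the global gradients from both branches are summed.

# Gradients add at branches: L = x*y + x**2, so x feeds two nodes.
x, y = 3.0, 4.0

# Forward pass
a = x * y        # branch 1
b = x ** 2       # branch 2
L = a + b

# Backward pass
dL_da = 1.0
dL_db = 1.0
dL_dx_branch1 = y * dL_da               # through the * node
dL_dx_branch2 = 2 * x * dL_db           # through the square node
dL_dx = dL_dx_branch1 + dL_dx_branch2   # gradients add at the branch

print(dL_dx)   # 4 + 6 = 10, which matches d/dx (x*y + x^2) = y + 2x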
Example

[Figure: compute graph with inputs x = 3, y = −4, z = 2, w = −1; x and y are multiplied, max(z, w) is added, and the result is scaled by 2, i.e. f = 2·(x·y + max(z, w)).]

Patterns in backward flow

[Figure: the same graph annotated with forward values and gradients: the output −20.00 has gradient 1.00, the sum −10.00 receives gradient 2.00; x = 3.00 gets gradient −8.00, y = −4.00 gets gradient 6.00; z = 2.00 gets gradient 2.00, w = −1.00 gets gradient 0.00.]

- add gate: gradient distributor (passes the incoming gradient unchanged to all inputs)
- max gate: gradient router (routes the incoming gradient to the input that was the maximum; the others get 0)
- mult gate: input switcher (each input's gradient is the incoming gradient scaled by the other input's value)

Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
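The numbers on the slide can be reproduced with a short manual backward pass, assuming the graph computes f = 2*(x*y + max(z, w)) with x = 3, y = −4, z = 2, w = −1 (as read off the figure):

# Patterns in backward flow, assuming the graph f = 2 * (x*y + max(z, w)).
x, y, z, w = 3.0, -4.0, 2.0, -1.0

# Forward pass
m = x * y            # mult gate: -12
mx = max(z, w)       # max gate:   2
s = m + mx           # add gate: -10
f = 2.0 * s          # *2 gate:  -20

# Backward pass
df = 1.0
ds = 2.0 * df                  # *2 gate scales the gradient
dm = ds                        # add gate: gradient distributor
dmx = ds
dx = y * dm                    # mult gate: input switcher -> -8
dy = x * dm                    #                           ->  6
dz = dmx if z >= w else 0.0    # max gate: gradient router ->  2
dw = dmx if w > z else 0.0     #                           ->  0

print(f, dx, dy, dz, dw)   # -20.0 -8.0 6.0 2.0 0.0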
Cross-Entropy Loss for Logistic Regression

Recap: Gradient Descent

Problems
- Explicit formula of gradient often too complex -> Use Backpropagation to calculate gradient
- For large training data the weight update is slow -> Use Batch Gradient Descent

Credits to Andrew Ng (Stanford Machine Learning, 2017, cs229)


Batch Gradient Descent

Full-Batch Gradient Descent
Update rule: one weight update uses the gradient over the full training set.

„Vanilla" GD:
Repeat:
  update the weights once using the full training set

Stochastic Gradient Descent
Update rule: one weight update uses the gradient of a single training instance.

Stochastic GD:
1. Shuffle training data
2. Repeat:
   for each instance i:
     update the weights using instance i only

Mini-Batch Gradient Descent
Update rule: one weight update uses the gradient over a batch of size b.

Mini-batch GD:
1. Shuffle training data
2. Partition into batches of size b
3. Repeat:
   for each batch B:
     update the weights using batch B

1 Epoch = number of iterations until the full training set has passed through the algorithm
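A compact sketch of the three schemes in Python/NumPy. The model (linear least squares), learning rate, batch size, and epoch counts are illustrative assumptions; only the update structure follows the slide.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5)

def grad(theta, Xb, yb):
    # Gradient of the mean squared error on the (mini-)batch (Xb, yb).
    return Xb.T @ (Xb @ theta - yb) / len(yb)

lr, b = 0.1, 32

# "Vanilla" (full-batch) GD: one update per pass over the full training set.
theta = np.zeros(5)
for _ in range(100):
    theta -= lr * grad(theta, X, y)

# Stochastic GD: shuffle, then one update per training instance.
theta = np.zeros(5)
for epoch in range(5):
    perm = rng.permutation(len(X))
    for i in perm:
        theta -= lr * grad(theta, X[i:i+1], y[i:i+1])

# Mini-batch GD: shuffle, partition into batches of size b, one update per batch.
theta = np.zeros(5)
for epoch in range(5):                      # 1 epoch = full pass over the data
    perm = rng.permutation(len(X))
    for start in range(0, len(X), b):
        batch = perm[start:start + b]
        theta -= lr * grad(theta, X[batch], y[batch])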
Batch GD convergence

◆ GD has a smooth trajectory


◆ SGD has a jittery trajectory
◆ In practice, one weight update of SGD is much faster than one weight update of GD
MNIST
Batch GD in practice
- Shallow neural network with 800 hidden units
- ReLU hidden activations
- Trained on MNIST without regularization
- Constant learning rate of 1.0

[Smith et al. 2018] A Bayesian Perspective on Generalization and Stochastic Gradient Descent
Batch GD in practice
Rule of thumb
• Start with small batch sizes for better generalization
• Increase batch size gradually during training if needed
• Especially effective with adaptive optimizers like Adam or AdaGrad (see future lecture)
Activation Functions

Recap: a neuron computes a weighted sum of its inputs x1, …, xn with weights θ1, …, θn (incl. the bias θ0) and applies an activation function to the result. Activation functions need to be non-linear!

Examples of activation functions:
• Step function
• Sigmoid
• ReLU
Activation Functions

Credits to FeiFei Li & Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Activation Functions: Sigmoid

σ(x) = 1 / (1 + e^(−x))

• Squashes numbers to the range [0,1]
• Output can be interpreted as a probability
• Historically popular since they have a nice interpretation as a saturating "firing rate" of a neuron

[Figure: plot of the sigmoid function.]
Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Problem: Saturation

Sigmoid gate: x ↦ σ(x) = 1/(1 + e^(−x)), with local gradient ∂σ/∂x.

Backprop through the gate: ∂L/∂x = (∂σ/∂x) · (∂L/∂σ)

What happens when x = −10?
What happens when x = 0?
What happens when x = 10?
Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Problem: Vanishing Gradients

Sigmoid gate: x ↦ σ(x) = 1/(1 + e^(−x)), with local gradient ∂σ/∂x.

Backprop through the gate: ∂L/∂x = (∂σ/∂x) · (∂L/∂σ)

What happens when x = −10? When x = 0? When x = 10?

Backprop: the global gradient is multiplied with the local gradient in each layer.
The local gradient of the sigmoid is ∈ (0, 0.25].
E.g. best case for 10 layers: 0.25^10 == 0.000001
Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
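A quick numerical check of the statements above (a sketch, not from the slides): the sigmoid's local gradient σ′(x) = σ(x)(1 − σ(x)) peaks at 0.25 at x = 0, is essentially zero at x = ±10, and even the best case shrinks to about 1e-6 after 10 layers.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)        # local gradient, always in (0, 0.25]

for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x), sigmoid_grad(x))
# x = -10 and x = 10: gradient ~ 4.5e-05 -> the neuron is saturated
# x = 0: gradient = 0.25 (the maximum)

print(0.25 ** 10)               # ~9.5e-07: best case after 10 sigmoid layers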
Problem: Not zero-centered

Not zero-centered = the activation output is only positive (or negative).

[Figure: a neuron computing a linear function f of inputs x with weights θ, where the inputs x are outputs of sigmoid activations and hence all positive.]

What can we say about the local gradient of the linear function f w.r.t. θ?

The global gradient is always all positive or all negative! The allowed gradient update directions are therefore restricted, so moving towards a hypothetical optimal θ vector takes a zig-zag path (= inefficient optimization).
Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Activation Functions: Sigmoid

σ(x) = 1 / (1 + e^(−x))

• Squashes numbers to the range [0,1]
• Output can be interpreted as a probability
• Historically popular since they have a nice interpretation as a saturating "firing rate" of a neuron

Problems:
1. Saturated neurons "kill" the gradients
2. Vanishing gradients due to max gradient 1/4
3. Sigmoid outputs are not zero-centered
4. exp() is a bit compute expensive
Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Activation Functions: tanh

tanh(x)

• Squashes numbers to the range (−1, 1)
• Zero-centered
• Still kills gradients when saturated

[LeCun et al., 1991]
Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Activation Functions

Credits to FeiFei Li & Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Activation Functions: ReLU (Rectified Linear Unit)

• Computes f(x) = max(0, x)
• Does not saturate (in + region)
• Very computationally efficient
• Converges much faster than sigmoid/tanh in practice (e.g. 6x)

[Krizhevsky et al., 2012]
Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Activation Functions: ReLU

ReLU node: x ↦ f(x) = max(0, x), with local gradient ∂f/∂x.

Backprop through the node: ∂L/∂x = (∂f/∂x) · (∂L/∂f)

What happens when x = −10?
What happens when x = 0?
What happens when x = 10?
Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Activation Functions: ReLU (Rectified Linear Unit)

• Computes f(x) = max(0, x)
• Does not saturate (in + region)
• Very computationally efficient
• Converges much faster than sigmoid/tanh in practice (e.g. 6x)
• (more "biologically plausible" than sigmoid)

Problems:
• Not zero-centered
• Gradient = 0 when x < 0
Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
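A small sketch of ReLU's forward and backward behaviour (illustrative, not from the slides): the gradient passes through unchanged for x > 0 and is exactly 0 for x < 0.

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_backward(x, upstream_grad):
    # Local gradient is 1 for x > 0 and 0 for x < 0 (0 at x = 0 by convention).
    return upstream_grad * (x > 0)

x = np.array([-10.0, 0.0, 10.0])
print(relu(x))                         # [ 0.  0. 10.]
print(relu_backward(x, np.ones(3)))    # [0. 0. 1.]  -> gradient is killed for x <= 0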
Activation Functions

Credits to FeiFei Li & Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Activation Functions: Leaky ReLU

• Does not saturate
• Computationally efficient
• Closer to zero-mean outputs
• Converges much faster than sigmoid/tanh in practice (e.g. 6x)
• Will not "kill" the gradient
• Used in GANs, ResNets

Leaky ReLU: f(x) = max(0.1x, x) [Maas et al., 2013]
Parametric Rectifier (PReLU): f(x) = max(αx, x), backprop into α (parameter) [He et al., 2015]

Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Activation Functions

Credits to FeiFei Li & Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Activation Functions: ELU (Exponential Linear Units)

ELU(x) = x for x > 0, α·(e^x − 1) for x ≤ 0

• All benefits of ReLU
• Closer to zero-mean outputs
• Negative saturation regime (compared with Leaky ReLU) adds some robustness to noise
• Computation requires exp()

[Clevert et al., 2015]
Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
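A hedged sketch of Leaky ReLU and ELU side by side; the slope 0.1 matches the Leaky ReLU slide, while α = 1.0 for ELU is an assumed default.

import numpy as np

def leaky_relu(x, slope=0.1):
    # f(x) = max(slope * x, x); slope = 0.1 as on the Leaky ReLU slide.
    return np.maximum(slope * x, x)

def elu(x, alpha=1.0):
    # ELU(x) = x for x > 0, alpha * (exp(x) - 1) otherwise (alpha = 1.0 assumed).
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(leaky_relu(x))   # negative inputs keep a small, non-zero gradient path
print(elu(x))          # negative inputs saturate smoothly towards -alpha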
Activation Functions

SiLU
Activation Functions: SiLU / GELU

Sigmoid Linear Units (SiLU): SiLU(x) = x · σ(x)
Gaussian Error Linear Units (GELU): GELU(x) = x · Φ(x), where Φ is the standard Gaussian CDF

• Similar benefits as ELU
• Used in GPT-3, BERT and Transformers
• Computation requires exp() or an approximation

[Hendrycks et al., 2016]

Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
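A sketch of SiLU and GELU as defined above; the tanh-based GELU approximation is the one commonly attributed to [Hendrycks et al., 2016] and is used here as an assumption (an exact erf-based form also exists).

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def silu(x):
    # SiLU(x) = x * sigmoid(x)
    return x * sigmoid(x)

def gelu(x):
    # GELU(x) = x * Phi(x), with Phi approximated by the tanh-based formula.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

x = np.linspace(-3, 3, 7)
print(silu(x))
print(gelu(x))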
Activation Functions: Maxout

max(θ1ᵀx + b1, θ2ᵀx + b2)

• Does not have the basic form of dot product ➡ nonlinearity
• Generalizes ReLU and Leaky ReLU
• Linear regime! Does not saturate! Does not die!
• Problem: doubles the number of parameters per neuron

Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
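A minimal sketch of a single maxout unit; the weight values are arbitrary and only illustrate the two linear pieces.

import numpy as np

def maxout(x, theta1, b1, theta2, b2):
    # max(theta1^T x + b1, theta2^T x + b2): two linear pieces per neuron,
    # hence twice the parameters of a plain linear neuron.
    return np.maximum(theta1 @ x + b1, theta2 @ x + b2)

x = np.array([1.0, -2.0, 0.5])
theta1, b1 = np.array([0.2, -0.1, 0.4]), 0.0
theta2, b2 = np.array([-0.3, 0.5, 0.1]), 0.1
print(maxout(x, theta1, b1, theta2, b2))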
Activation Functions
TLDR: In practice:

• Use ReLU for starters


• Try out Leaky ReLU / ELU / SiLU / GELU
• Don’t expect much from historic tanh or sigmoid

Credits to Fei-Fei Li, Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Classification

Binary Classification

[Figure: training data with instances 1, …, n of Class 1 and Class 0; a new input is fed through the network, whose sigmoid output is interpreted as a probability. Example: a sigmoid output of 0.76 means 76% Class 1, 24% Class 0.]
CIFAR-10

Collection of images:
• 10 classes
• 50000 training images
• 10000 test images
• 3-channel (RGB) images
• Each image is 32x32x3
Dagmar Kainmueller, Lecture 4 Credits to FeiFei Li & Justin Johnson & Serena Yeung (Stanford CNNs for Visual Recognition, 2017, cs231)
Classification

[Figure: training data with instances 1, …, n; a new input image is fed through the network, which outputs one score per class, e.g. 0.6, 0.0, −0.2, −0.3, 0.0, 2.5, 0.7, 2.0, −0.3, −0.2 for the 10 classes.]
Types of Classification Tasks

• Multi-class classification:
Each image belongs to exactly one of C classes. This is a single classification problem. The ground truth (labels) can be one-hot encoded.

• Multi-label classification:
Each image can belong to more than one class. This problem can be treated as C different binary classification problems.
Credit to Raul Gomez


Multi-label classification

Let zc denote the score obtained from the network for class c.

Sigmoid (aka logistic function): σ(zc) = 1 / (1 + e^(−zc)) =: pc

Squishes each score into the (0,1) range —> can be interpreted as a class probability.
Only depends on a single score —> appropriate for multi-label classification.
Credit to Raul Gomez


Multi-label classification

Let zc denote the score obtained from the network for class c.

Sigmoid (aka logistic function): σ(zc) = 1 / (1 + e^(−zc)) =: pc

Binary cross-entropy loss per class c ∈ C:
Jc = −tc·ln(pc) − (1 − tc)·ln(1 − pc)

aka "sigmoid cross-entropy loss"
Credit to Raul Gomez (https://youtu.be/635cmrp4z40)
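A sketch of the per-class sigmoid plus binary cross-entropy loss defined above, with illustrative scores and multi-label targets:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(z, t):
    # J_c = -t_c * ln(p_c) - (1 - t_c) * ln(1 - p_c), with p_c = sigmoid(z_c),
    # computed independently per class (multi-label setting).
    p = sigmoid(z)
    return -t * np.log(p) - (1.0 - t) * np.log(1.0 - p)

z = np.array([2.0, -1.0, 0.5])   # network scores for C = 3 classes
t = np.array([1.0, 0.0, 1.0])    # multi-label targets (not mutually exclusive)
print(binary_cross_entropy(z, t))        # per-class losses J_c
print(binary_cross_entropy(z, t).sum())  # total loss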
Multi-label classification

Let zc denote the score obtained from the network for class c.

Sigmoid (aka logistic function): σ(zc) = 1 / (1 + e^(−zc)) =: pc

Problem with sigmoid for multi-class classification problems: mutual exclusivity of the labels is not taken into account.
Credit to Raul Gomez (https://youtu.be/635cmrp4z40)


Multi-class classification

Let zc denote the score obtained from the network for class c.

Softmax: softmax(z)c = e^(zc) / Σc′=1..C e^(zc′) =: pc

Squishes each score into the (0,1) range, and all scores sum up to 1 —> can be interpreted as a probability distribution over the classes.
Each output depends on all scores —> appropriate for multi-class classification.

Credit to Raul Gomez (https://youtu.be/635cmrp4z40)
Multi-class classification

Let zc denote the score obtained from the network for class c.

Softmax: softmax(z)c = e^(zc) / Σc′=1..C e^(zc′) =: pc

For two classes (i.e. binary classification), softmax is equivalent to sigmoid:
softmax(z)1 = e^(z1) / (e^(z1) + e^(z2)) = 1 / (1 + e^(z2−z1)) = σ(z1 − z2)
Credit to Raul Gomez (https://youtu.be/635cmrp4z40)
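A quick numeric check of the two-class equivalence, with arbitrary scores:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([1.3, -0.4])            # two arbitrary class scores
print(softmax(z)[0])                 # softmax(z)_1
print(sigmoid(z[0] - z[1]))          # sigma(z_1 - z_2): identical value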


Multi-class classification

Let zc denote the score obtained from the network for class c.

Softmax: softmax(z)c = e^(zc) / Σc′=1..C e^(zc′) =: pc

Cross-entropy loss: J = −Σc tc·ln(pc)

aka "softmax cross-entropy loss" or "softmax loss"

(The binary cross-entropy loss is just the cross-entropy loss for two classes.)
Credit to Raul Gomez (https://youtu.be/635cmrp4z40)
Multi-class classification

Let zc denote the score obtained from the network for class c.

Softmax: softmax(z)c = e^(zc) / Σc′=1..C e^(zc′) =: pc

The cross-entropy loss has a simple gradient w.r.t. z (see future assignment).
Credit to Raul Gomez
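A sketch of the softmax cross-entropy loss together with its gradient w.r.t. the scores z. The closed form ∂J/∂z = p − t is a standard result stated here as an assumption (the derivation is left to the future assignment) and checked numerically below.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())              # stabilized softmax
    return e / e.sum()

def cross_entropy(z, t):
    # J = -sum_c t_c * ln(p_c) with p = softmax(z) and t one-hot.
    return -np.sum(t * np.log(softmax(z)))

z = np.array([0.6, 2.5, -0.3])           # scores for C = 3 classes
t = np.array([0.0, 1.0, 0.0])            # one-hot ground truth

# Analytic gradient (standard result, assumed here): dJ/dz = p - t
analytic = softmax(z) - t

# Numerical gradient for comparison
eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * np.eye(3)[c], t)
     - cross_entropy(z - eps * np.eye(3)[c], t)) / (2 * eps)
    for c in range(3)
])
print(analytic)
print(numeric)    # matches the analytic gradient up to numerical error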
Next Lecture: April 29

Convolutional Neural Networks for image data


Receptive field of CNNs
