Module II
Dataset Preparation
Data Preprocessing
Weight Initialization
Activation Functions
Non-linearity Layer
Sigmoid
Activation Functions: ReLU
Dying ReLU problem: a large gradient flowing through a ReLU neuron can push its weights to a point where the neuron's pre-activation is negative for every input, so it never activates (and receives zero gradient) on any datapoint again.
Source: [Link]
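A minimal NumPy sketch of this failure mode (toy numbers, not from the slides): once the pre-activation is negative for every input, both the output and the gradient are zero, so no update can ever revive the neuron.

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)

# Toy 1-D neuron: y = relu(w * x + b)
x = np.array([0.5, 1.0, 2.0])      # all inputs are positive
w, b = -3.0, -1.0                  # weights after a bad (too large) update

z = w * x + b                      # pre-activations: all negative
y = relu(z)                        # outputs: all zero -> "dead" neuron
dy_dz = (z > 0).astype(float)      # ReLU gradient: all zero

print(y, dy_dz)                    # [0. 0. 0.] [0. 0. 0.]
# With zero gradient on every datapoint, w and b never change again.
```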
Activation Functions: Parametric ReLU
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. IEEE International Conference on Computer Vision (ICCV).
Source: [Link]
Activation Functions: Maxout
The Maxout neuron (introduced by Goodfellow et al.) generalizes the ReLU and its leaky version by computing max(w1^T x + b1, w2^T x + b2).
Both ReLU and Leaky ReLU are special cases of this form (for example, ReLU is recovered with w1 = 0, b1 = 0, w2 = identity, and b2 = 0).
Activation Functions: ELU
Clevert, Djork-Arné, Thomas Unterthiner, and Sepp Hochreiter. "Fast and accurate deep network learning by exponential linear units (ELUs)." International Conference on Learning Representations (ICLR), 2016.
Klambauer, Günter, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. "Self-normalizing neural networks." Advances in Neural Information Processing Systems (NIPS), 2017.
Activation Functions: Swish
(Figure: CIFAR-10 accuracy comparison)
abReLU (Average Biased ReLU) ensures that neurons producing values above the average of all values in that layer are never dead.
Dubey, S.R. and Chakraborty, S., "Average Biased ReLU Based CNN Descriptor for Improved Face Retrieval." arXiv preprint arXiv:1804.02051, 2018.
Activation Functions: In Practice
Source: cs231n
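For reference, a minimal NumPy sketch of the activations discussed above, using their standard textbook definitions (the alpha and beta defaults are the usual choices, not values from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x); reduces to x * sigmoid(x) for beta = 1
    return x * sigmoid(beta * x)

x = np.linspace(-3, 3, 7)
print(swish(x))
```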
Data Preprocessing
Source: cs231n
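The slide's figures are not reproduced here; a sketch of the standard cs231n-style preprocessing (zero-centering, with optional normalization) on hypothetical data:

```python
import numpy as np

X = np.random.randn(100, 3072)   # hypothetical data: N examples x D features

X -= np.mean(X, axis=0)          # zero-center each feature (common for images)
X /= (np.std(X, axis=0) + 1e-8)  # optional: normalize to unit variance
```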
Weight Initialization: Gaussian
First idea: small random numbers (Gaussian with zero mean and 1e-2 standard deviation).
Symmetry breaking: weights are different for different neurons.
Works ~okay for small networks, but causes problems in deeper networks: the activations of almost all neurons collapse to zero -> gradient diminishing problem.
Increasing the standard deviation to 1 does not help: almost all neurons become completely saturated at either -1 or +1 (with tanh units), so the gradients are all zero -> again a gradient diminishing problem.
Source: cs231n
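A sketch of the cs231n-style experiment behind these observations, assuming a 10-layer, 500-unit tanh network on random data; it prints each layer's activation statistics so the collapse (or, with std 1.0, the saturation) is visible:

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(1000, 500)             # hypothetical input batch
for i in range(10):
    fan_in = x.shape[1]
    W = np.random.randn(fan_in, 500) * 0.01  # small Gaussian init; try 1.0 to see saturation
    x = np.tanh(x.dot(W))
    print(f"layer {i}: mean {x.mean():+.4f}, std {x.std():.4f}")
# With std 0.01 the activation std shrinks toward 0 layer by layer;
# with std 1.0 activations saturate at +/-1. Either way, gradients vanish.
```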
Weight Initialization: Xavier [Glorot et al., 2010]
Calibrate the variances with 1/sqrt(fan_in).
A reasonable initialization (the mathematical derivation assumes linear activations).
Source: cs231n
Weight Initialization: XavierImproved
For ReLU units, which zero out half of their inputs, scaling by sqrt(2/fan_in) instead keeps the activation variances stable.
Glorot, X. and Bengio, Y., "Understanding the difficulty of training deep feedforward neural networks," AISTATS 2010.
Source: cs231n
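A minimal sketch of both initializations using the standard formulas (not code from the slides):

```python
import numpy as np

def xavier_init(fan_in, fan_out):
    # Glorot/Xavier: variance ~ 1/fan_in (derived assuming linear/tanh units)
    return np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

def he_init(fan_in, fan_out):
    # "XavierImproved" / He init: the extra factor of 2 compensates for ReLU
    # zeroing out half of its inputs on average
    return np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)

W1 = xavier_init(512, 256)
W2 = he_init(512, 256)
print(W1.std(), W2.std())  # ~1/sqrt(512) and ~sqrt(2/512)
```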
Things to remember
Training CNN
Activation functions: ReLU is common; Swish can be tried
Data preparation: train/val/test split
Data preprocessing: centering is common
Weight initialization: XavierImproved works well
Training
Loop:
1. Sample labeled data (batch)
2. Forward it through the network, get predictions
3. Back-propagate the errors
4. Update the network weights
Source: [Link]
Optimization
Optimization is the process of finding the set of parameters W that minimize the loss function.
Strategy #1: A first, very bad solution: random search. Simply try out many different random weights and keep track of what works best.
Strategy #2: Random local search. Start out with a random W, generate random perturbations of it, and if the loss at the perturbed W is lower, perform the update (a sketch follows below).
Strategy #3: Following the gradient. There is no need to search randomly for a good direction: the best direction to move is related to the gradient of the loss function.
Source: [Link]
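A minimal sketch of strategy #2; loss_fn here is a hypothetical stand-in for a real training loss:

```python
import numpy as np

def loss_fn(W):
    # Hypothetical stand-in for the real training loss
    return np.sum((W - 3.0) ** 2)

W = np.random.randn(10) * 0.001
best_loss = loss_fn(W)
for _ in range(1000):
    W_try = W + np.random.randn(10) * 0.001  # small random perturbation
    loss = loss_fn(W_try)
    if loss < best_loss:                     # keep only improving moves
        W, best_loss = W_try, loss
print(best_loss)
```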
Gradient Descent
The procedure of repeatedly evaluating the gradient of the loss function and then performing a parameter update.
Loop:
1. Sample a batch of data
2. Forward prop it through the graph
(network), get loss
3. Backprop to calculate the gradients
4. Update the parameters using the gradient
Source: cs231n
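In code, the loop looks roughly as below; linear regression is used as a toy problem so the gradient has a simple closed form (the setup is illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
weights = np.zeros(5)
step_size = 0.1

for step in range(100):
    # 1. Sample a batch of data
    idx = rng.integers(0, len(X), size=256)
    Xb, yb = X[idx], y[idx]
    # 2. Forward prop: predictions and loss
    preds = Xb @ weights
    loss = np.mean((preds - yb) ** 2)
    # 3. Backprop: gradient of the loss w.r.t. the weights
    weights_grad = 2 * Xb.T @ (preds - yb) / len(yb)
    # 4. Update the parameters using the gradient
    weights += -step_size * weights_grad
print(loss)  # final minibatch loss
```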
Stochastic Gradient Descent (SGD)
The same procedure, but the gradient of the loss is estimated on a minibatch of training examples rather than on the full dataset.
Source: cs231n
SGD: Problems
What if the loss changes quickly in one direction and slowly in another? SGD makes very slow progress along the shallow dimension and jitters along the steep one.
Source: cs231n
SGD: Problems
What if the loss function has a local minimum or saddle point? Zero gradient: gradient descent gets stuck.
Source: cs231n
SGD + Momentum
Build up "velocity" as a running mean of gradients, so updates follow the consistent gradient direction rather than each noisy step. Rho acts as friction; typical values are 0.9 or 0.99.
Source: cs231n
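A sketch of the momentum update; compute_gradient is a hypothetical stand-in for backprop, and rho and learning_rate are typical values:

```python
import numpy as np

def compute_gradient(x):
    return 2 * x  # hypothetical gradient of f(x) = x^2

x = np.array([5.0])
vx, rho, learning_rate = 0.0, 0.9, 0.1
for _ in range(50):
    dx = compute_gradient(x)
    vx = rho * vx + dx        # accumulate a running "velocity" of gradients
    x -= learning_rate * vx   # step along the velocity, not the raw gradient
print(x)  # converges toward the minimum at 0
```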
Nesterov Momentum
Look ahead: evaluate the gradient at the point the current velocity would carry you to, then combine it with the velocity update.
Sutskever, I., Martens, J., Dahl, G., and Hinton, G., "On the importance of initialization and momentum in deep learning," ICML 2013.
Source: cs231n
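A sketch of Nesterov momentum in the "look-ahead" form (same hypothetical compute_gradient helper):

```python
import numpy as np

def compute_gradient(x):
    return 2 * x  # hypothetical gradient of f(x) = x^2

x = np.array([5.0])
v, rho, learning_rate = 0.0, 0.9, 0.1
for _ in range(50):
    # Gradient is evaluated at the look-ahead point x + rho * v
    v = rho * v - learning_rate * compute_gradient(x + rho * v)
    x += v
print(x)
```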
AdaGrad
Scale each parameter's step by the inverse square root of the sum of its squared historical gradients.
RMSProp
The same idea with an exponentially decaying average of squared gradients, so the effective step size does not shrink to zero over time.
Source: cs231n
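A combined sketch of both updates; the decay_rate and the 1e-7 smoothing term are the usual defaults:

```python
import numpy as np

def compute_gradient(x):
    return 2 * x  # hypothetical gradient of f(x) = x^2

x = np.array([5.0])
learning_rate, decay_rate = 0.1, 0.99
grad_squared = np.zeros_like(x)
for _ in range(100):
    dx = compute_gradient(x)
    # AdaGrad accumulates:  grad_squared += dx * dx
    # RMSProp instead decays the history:
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
print(x)  # converges toward 0
```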
Adam
Kingma, D.P. and Ba, J., "Adam: A method for stochastic optimization," ICLR 2015.
Problem (without bias correction):
Initially, second_moment = 0 and beta2 = 0.999.
After the 1st iteration, second_moment is still close to zero,
so the update takes a very large step on x.
Source: cs231n
Adam (with bias correction)
Bias correction compensates for the moments being initialized at zero: first_unbias = first_moment / (1 - beta1**t) and second_unbias = second_moment / (1 - beta2**t).
Source: cs231n
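A sketch of the full Adam update with bias correction, following the standard Kingma and Ba formulation (same hypothetical gradient helper):

```python
import numpy as np

def compute_gradient(x):
    return 2 * x  # hypothetical gradient of f(x) = x^2

x = np.array([5.0])
learning_rate, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
first_moment = np.zeros_like(x)
second_moment = np.zeros_like(x)
for t in range(1, 201):
    dx = compute_gradient(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx         # momentum
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx  # RMSProp-style scaling
    first_unbias = first_moment / (1 - beta1 ** t)    # bias correction:
    second_unbias = second_moment / (1 - beta2 ** t)  # compensates for zero init
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + eps)
print(x)
```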
AMSGrad
Some minibatches (which occur rarely) provide large and informative gradients; exponential averaging diminishes their influence, which can lead to poor convergence.
Reddi, S.J., Kale, S. and Kumar, S., "On the convergence of Adam and beyond," ICLR 2018.
[Link]
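AMSGrad's fix, sketched minimally: keep a running maximum of the second moment so that large, informative gradients are never forgotten (names mirror the Adam sketch above):

```python
import numpy as np

def compute_gradient(x):
    return 2 * x  # hypothetical gradient

x = np.array([5.0])
learning_rate, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
m, v, v_max = np.zeros_like(x), np.zeros_like(x), np.zeros_like(x)
for t in range(1, 201):
    dx = compute_gradient(x)
    m = beta1 * m + (1 - beta1) * dx
    v = beta2 * v + (1 - beta2) * dx * dx
    v_max = np.maximum(v_max, v)  # never let the denominator shrink
    x -= learning_rate * m / (np.sqrt(v_max) + eps)
print(x)
```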
Learning Rate
SGD, SGD+Momentum, AdaGrad, RMSProp, and Adam all have a learning rate as a hyperparameter. In practice, start with a relatively high learning rate and decay it over time (e.g., step, exponential, or 1/t decay; a sketch follows below).
Source: cs231n
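A sketch of common decay schedules; lr0 and the decay constants are hypothetical choices:

```python
import numpy as np

lr0 = 0.1  # hypothetical initial learning rate

def step_decay(epoch, drop=0.5, every=10):
    return lr0 * drop ** (epoch // every)  # halve every 10 epochs

def exp_decay(epoch, k=0.05):
    return lr0 * np.exp(-k * epoch)        # smooth exponential decay

def inv_decay(epoch, k=0.05):
    return lr0 / (1 + k * epoch)           # 1/t decay

for e in (0, 10, 50):
    print(e, step_decay(e), exp_decay(e), inv_decay(e))
```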
Optimizer and Learning Rate
In practice: Adam is a good default choice; well-tuned SGD+Momentum with a learning rate schedule can often outperform it.
Source: cs231n
Regularization
Why regularize?
- Express preferences over weights
- Make the model simple so it works on test data
- Improve optimization by adding curvature
Which W to consider? Among weight vectors that achieve the same data loss, the regularizer decides which one we prefer.
L2 regularization likes to "spread out" the weights: it prefers smaller, more diffuse weight vectors over large, peaky ones.
Source: cs231n
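A minimal sketch of adding an L2 penalty to a loss; the regularization strength is a hypothetical value:

```python
import numpy as np

def l2_regularized_loss(W, data_loss, reg=1e-3):
    # Full loss = data loss + lambda * sum of squared weights.
    # The penalty pushes the optimizer toward smaller, more diffuse W.
    return data_loss + reg * np.sum(W * W)

W = np.random.randn(10, 5)
print(l2_regularized_loss(W, data_loss=1.25))
```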
Dropout
In each forward pass, randomly set some neurons to zero. The probability of dropping is a hyperparameter; 0.5 is common.
Srivastava et al., "Dropout: A simple way to prevent neural networks from overfitting," JMLR 2014.
Dropout
How can this possibly be a good idea? Dropout is training a large ensemble of models (that share parameters).
We drop and scale at train time and don't do anything at test time (inverted dropout, sketched below).
Source: cs231n
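A sketch of inverted dropout as described above; p and the tensor shape are hypothetical:

```python
import numpy as np

p = 0.5  # probability of keeping a unit

def dropout_forward(x, train=True):
    if not train:
        return x                               # test time: do nothing
    mask = (np.random.rand(*x.shape) < p) / p  # drop and scale at train time
    return x * mask

h = np.random.randn(4, 8)
print(dropout_forward(h).mean(), dropout_forward(h, train=False).mean())
```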
DropConnect
Drops individual connections (weights) rather than whole neurons.
Problem: do we necessarily want a zero-mean, unit-variance input?
Instance Normalization: Ulyanov et al., "Improved Texture Networks: Maximizing Quality and Diversity in Feed-forward Stylization and Texture Synthesis," CVPR 2017.
Group Normalization: Wu and He, arXiv 2018 (appeared 3/22/2018).
Decorrelated Batch Normalization: Huang et al., arXiv 2018 (appeared 4/23/2018).
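A rough sketch of how these variants differ only in which axes of an N x C x H x W activation tensor they normalize over (the group count G is a hypothetical choice; learnable scale and shift are omitted for brevity):

```python
import numpy as np

x = np.random.randn(8, 16, 32, 32)  # hypothetical activations: N, C, H, W
eps = 1e-5

def normalize(x, axes):
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

batch_norm = normalize(x, (0, 2, 3))      # per channel, across the batch
instance_norm = normalize(x, (2, 3))      # per channel, per example
N, C, H, W = x.shape
G = 4                                     # groups of channels
group_norm = normalize(x.reshape(N, G, C // G, H, W), (2, 3, 4)).reshape(x.shape)
```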
Data Augmentation
Data Augmentation (Jittering)
Horizontal flips
Random crops and scales
Source: cs231n
Data Augmentation (Jittering)
Create virtual training samples; get creative for your problem! Common jitters (two are sketched in code below):
Horizontal flip
Random crop
Color casting
Randomize contrast
Randomize brightness
Geometric distortion
Rotation
Photometric changes
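A minimal NumPy sketch of two of the jitters above; the image and crop sizes are hypothetical:

```python
import numpy as np

img = np.random.rand(64, 64, 3)  # hypothetical H x W x C image

def horizontal_flip(img, p=0.5):
    return img[:, ::-1, :] if np.random.rand() < p else img

def random_crop(img, size=56):
    h, w = img.shape[:2]
    top = np.random.randint(0, h - size + 1)
    left = np.random.randint(0, w - size + 1)
    return img[top:top + size, left:left + size, :]

aug = random_crop(horizontal_flip(img))
print(aug.shape)  # (56, 56, 3)
```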
• The best way to identify whether a model is overfitting is to compare the training and test error. Generally, training error is lower than test error, and the goal of any machine learning model is to minimize both (a) the training error and (b) the gap between training and test error. If the training error is low and the difference between training and test error is significant, the model may be in the overfitting zone.
• One can think of "overfitting" as a form of "memorization" of the training data, where the model becomes too specialized to the training data and loses its ability to generalize to new data.
Causes of Overfitting and how it can be mitigated
• There are several causes of overfitting.
• Using a complex model for the given dataset. For example, a decision tree with too many levels or a neural network with too many layers and hidden units may be more complex than is necessary to solve the problem at hand. Similarly, having too many features can also make a model complex.
• Using too few examples to train the model, as this makes it more difficult to find the underlying patterns in the data. In that case, getting more training data can help.
• Overfitting can also occur when the model is not regularized properly. L1 or L2 regularization adds a penalty term to the model's loss function that discourages it from fitting the training data too closely.
• Dropout, another regularization technique, can also be used: it randomly drops some of the neurons in a neural network during training, which helps prevent the network from memorizing the training data.
Hyperparameter
“Settings + Tuning”
Neural networks: Pros and cons
Pros
Flexible and general function approximation framework
Can build extremely powerful models by adding more layers
Cons
Hard to analyze theoretically (e.g., training is prone to local optima)
Huge amounts of training data and computing power may be required to get good performance
The space of implementation choices is huge (network architectures, parameters)