Practical Issues in
Neural Network
Training
Mr. Sivadasan E T
Associate Professor
Vidya Academy of Science and Technology, Thrissur
Overfitting
Overfitting happens when a model is trained too closely
to the specific patterns in the training data, including
noise or irrelevant details.
This makes the model highly accurate on the training data
but less effective at predicting outcomes for new, unseen
test data.
Even if the model perfectly predicts the training targets, it
does not guarantee good performance on test data.
Overfitting
In other words, there is always a gap between
the training and test data performance, which
is particularly large when the models are
complex and the data set is small.
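This gap can be illustrated with a small sketch (hypothetical synthetic data, using polynomial regression as a simple stand-in for models of increasing complexity): with only 10 training points, a high-degree fit drives training error toward zero while test error grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny training set: 10 noisy samples of sin(x); a larger held-out test set.
x_train = np.sort(rng.uniform(0, 3, 10))
y_train = np.sin(x_train) + rng.normal(0, 0.2, 10)
x_test = np.sort(rng.uniform(0, 3, 100))
y_test = np.sin(x_test) + rng.normal(0, 0.2, 100)

def mse(degree):
    # Fit a polynomial of the given degree; report train and test error.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

for d in (1, 3, 9):
    tr, te = mse(d)
    print(f"degree {d}: train MSE {tr:.4f}, test MSE {te:.4f}")
```

The degree-9 model has as many parameters as training points, so it nearly interpolates the training set, yet its train-test gap is far larger than that of the simpler fits.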
Overfitting
Increasing the number of training instances
improves the generalization power of the model,
whereas increasing the complexity of the model
reduces its generalization power.
Overfitting
A good rule of thumb is that the total number of
training data points should be at least 2 to 3 times
the number of parameters in the neural network.
The exact number of required data points varies
based on the specific model.
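Applying the rule of thumb requires counting the parameters of the network. A minimal sketch for a fully connected network (the layer sizes below are purely illustrative):

```python
def count_params(layer_sizes):
    # Each dense layer contributes (fan_in * fan_out) weights plus fan_out biases.
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical architecture: 784 inputs, two hidden layers, 10 outputs.
sizes = [784, 128, 64, 10]
p = count_params(sizes)
print(p)                    # total trainable parameters: 109386
print(2 * p, "to", 3 * p)   # rule-of-thumb range of training points
```

For this architecture the rule of thumb suggests roughly 219,000 to 328,000 training points, which shows how quickly the data requirement grows with model size.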
Overfitting
In general, models with a larger number of
parameters are said to have high capacity.
They require a larger amount of data in order to
gain generalization power to unseen test data.
Overfitting: Trade-off Between Bias and Variance
The notion of overfitting is often understood in
terms of the trade-off between bias and variance in
machine learning.
The key takeaway from the bias-variance
trade-off is that one does not always win with more
powerful (i.e., less biased) models when working
with limited training data, because of the higher
variance of these models.
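The trade-off can be made concrete by repeatedly fitting models of different complexity on freshly drawn small training sets and measuring the bias and variance of their predictions at one point. This is a hedged sketch with hypothetical synthetic data; polynomial degree stands in for model power.

```python
import numpy as np

rng = np.random.default_rng(1)

def true_fn(x):
    return np.sin(2 * x)

def fit_predict(degree, x0=1.5, n_trials=200, n_points=15):
    # Repeatedly draw small training sets, fit, and predict at a fixed point x0.
    preds = []
    for _ in range(n_trials):
        x = rng.uniform(0, 3, n_points)
        y = true_fn(x) + rng.normal(0, 0.3, n_points)
        coeffs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coeffs, x0))
    preds = np.array(preds)
    bias_sq = (preds.mean() - true_fn(x0)) ** 2
    variance = preds.var()
    return bias_sq, variance

for d in (1, 8):
    b, v = fit_predict(d)
    print(f"degree {d}: bias^2 {b:.4f}, variance {v:.4f}")
```

The less biased degree-8 model shows a much higher variance across training sets than the rigid linear model, which is exactly why it can lose on limited data.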
The Vanishing and Exploding Gradient Problems
Increasing depth often leads to practical issues of
its own.
Propagating gradients backward through the chain
rule can destabilize the updates in networks with a
large number of layers.
The Vanishing and Exploding Gradient Problems
In particular, the updates in earlier layers can either be
negligibly small (vanishing gradient) or they can be
increasingly large (exploding gradient) in certain types
of neural network architectures.
The vanishing and exploding gradient problems are
rather natural to deep networks, which makes their
training process unstable.
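The effect can be observed directly by backpropagating a Jacobian through a deep stack of sigmoid layers; the per-layer factor contains the sigmoid derivative (at most 0.25), so the gradient norm decays multiplicatively with depth. A minimal sketch, assuming random Gaussian weights with a hypothetical depth and width:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

depth, width = 30, 50
x = rng.normal(0, 1, width)
weights = [rng.normal(0, 1.0 / np.sqrt(width), (width, width))
           for _ in range(depth)]

a = x
grad = np.eye(width)   # Jacobian of the current activation w.r.t. the input
norms = []
for W in weights:
    z = W @ a
    a = sigmoid(z)
    # Chain rule: local Jacobian is diag(sigmoid'(z)) @ W.
    grad = (a * (1 - a))[:, None] * W @ grad
    norms.append(np.linalg.norm(grad))

print(f"layer 1 gradient norm:  {norms[0]:.3e}")
print(f"layer {depth} gradient norm: {norms[-1]:.3e}")
```

After 30 layers the gradient norm has collapsed by many orders of magnitude; with larger weight scales the same loop instead exhibits the exploding variant.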
Difficulties in Convergence
Achieving fast convergence in optimization is
challenging with very deep networks.
Greater depth increases resistance to smooth gradient
flow during training.
This issue is somewhat related to the vanishing
gradient problem but has distinct characteristics.
Local Optima
The loss function of a neural network is highly
nonlinear and typically has many local optima.
When the parameter space is large, and there are
many local optima, it makes sense to spend some
effort in picking good initialization points.
Local Optima
One such method for improving neural network
initialization is referred to as pretraining.
The basic idea is to use either supervised or
unsupervised training on shallow sub-networks of the
original network in order to create the initial weights.
Local Optima
Pretraining is done in a greedy, layer-wise fashion,
meaning one layer is trained at a time.
This process helps identify good initialization points
for each layer, avoiding irrelevant parts of the
parameter space.
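The greedy, layer-wise idea can be sketched as follows. This is a hedged stand-in, not the full autoencoder-style pretraining the slides describe: each layer's weights are initialized from the principal directions of that layer's input, one layer at a time, so the starting point already aligns with the data's dominant structure. All sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def greedy_layerwise_init(X, layer_sizes):
    # Train (here: analyze) one layer at a time: compute the principal
    # directions of the current representation, use them as that layer's
    # initial weights, then feed the activations to the next layer.
    weights = []
    h = X
    for n_out in layer_sizes:
        h_centered = h - h.mean(axis=0)
        _, _, vt = np.linalg.svd(h_centered, full_matrices=False)
        W = vt[:n_out].T          # shape: (n_in, n_out)
        weights.append(W)
        h = np.tanh(h @ W)        # activations for the next layer's turn
    return weights

X = rng.normal(size=(200, 32))    # hypothetical unlabeled data
weights = greedy_layerwise_init(X, [16, 8, 4])
print([W.shape for W in weights]) # [(32, 16), (16, 8), (8, 4)]
```

The resulting weight matrices would then serve as the initialization for subsequent end-to-end training of the full network.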
Spurious Optima
Some of the minima in the loss function are spurious
optima because they are exhibited only in the training
data and not in the test data.
Unsupervised pretraining often tends to avoid
problems associated with overfitting, because it
tends to move the initialization point closer to the
basin of "good" optima, i.e., optima that also hold
up on the test data.
Thank You!