AUTO ENCODERS
Dr. R. Shiva Shankar, Assistant Professor, Dept of CSE, SRKREC(A)
An autoencoder is a type of neural network where the output layer has the same dimensionality as
the input layer.
In simpler words, the number of output units in the output layer is equal to the number of
input units in the input layer.
An autoencoder replicates the data from the input to the output in an unsupervised manner and
is therefore sometimes referred to as a replicator neural network.
The autoencoders reconstruct each dimension of the input by passing it through the network.
It may seem trivial to use a neural network for the purpose of replicating the input, but during
the replication process, the input is reduced to a smaller representation.
The middle layers of the neural network have fewer units than the input or output layers.
Therefore, the middle layers hold the reduced representation of the input. The output is
reconstructed from this reduced representation of the input.
Architecture of Auto Encoders
An autoencoder consists of three components:
Encoder: An encoder is a feed forward, fully connected neural network that compresses the input
into a latent space representation, encoding the input (for example, an image) as a compressed
representation in a reduced dimension. The compression is lossy, so the compressed representation
is a distorted version of the original input.
Code: This part of the network contains the reduced representation of the input that is fed into the
decoder.
Decoder: The decoder is also a feed forward network like the encoder and has a similar structure to
the encoder. This network is responsible for reconstructing the input back to the original
dimensions from the code.
First, the input goes through the encoder where it is compressed and stored in the layer called
Code, then the decoder decompresses the original input from the code.
The main objective of the autoencoder is to get an output identical to the input.
Note that the decoder architecture is the mirror image of the encoder.
This is not a requirement but it's typically the case.
The only requirement is the dimensionality of the input and output must be the same.
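The encoder, code and decoder described above can be sketched in a few lines of NumPy. This is only an illustrative, untrained forward pass: the layer sizes (8 inputs, a 3-unit code), the sigmoid activation and all variable names are assumptions chosen for the example, not part of the original slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: 8 input units, a 3-unit code, 8 output units.
n_in, n_code = 8, 3
W_enc = rng.normal(scale=0.1, size=(n_in, n_code))  # encoder weights
b_enc = np.zeros(n_code)
W_dec = rng.normal(scale=0.1, size=(n_code, n_in))  # decoder weights (mirrored shape)
b_dec = np.zeros(n_in)

def encode(x):
    # Compress the input into the lower-dimensional code.
    return sigmoid(x @ W_enc + b_enc)

def decode(h):
    # Map the code back to the original input dimensionality.
    return h @ W_dec + b_dec

x = rng.normal(size=(5, n_in))   # a batch of 5 inputs
code = encode(x)                 # reduced representation
x_hat = decode(code)             # reconstruction
reconstruction_error = np.mean((x - x_hat) ** 2)
```

Note that the only structural constraint enforced here is the one stated above: `x_hat` has the same dimensionality as `x`.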
Undercomplete
The simplest architecture for constructing an autoencoder is to constrain the number of nodes
present in the hidden layer(s) of the network, limiting the amount of information that can flow
through the network.
By penalizing the network according to the reconstruction error, our model can learn the most
important attributes of the input data and how to best reconstruct the original input from an
"encoded" state. Ideally, this encoding will learn and describe latent attributes of the input data.
Because neural networks are capable of learning nonlinear relationships, this can be thought of as
a more powerful (nonlinear) generalization of PCA.
Whereas PCA attempts to discover a lower dimensional hyperplane which describes the original
data, autoencoders are capable of learning nonlinear manifolds (a manifold is defined in simple
terms as a continuous, non-intersecting surface).
An undercomplete autoencoder has no explicit regularization term - we simply train our model
according to the reconstruction loss.
Thus, our only way to ensure that the model isn't memorizing the input data is to ensure that
we've sufficiently restricted the number of nodes in the hidden layer(s).
For deep autoencoders, we must also be aware of the capacity of our encoder and decoder models.
Even if the "bottleneck layer" is only one hidden node, it's still possible for our model to memorize
the training data provided that the encoder and decoder models have sufficient capability to learn
some arbitrary function which can map the data to an index.
The objective of an undercomplete autoencoder is to capture the most important features
present in the data.
Undercomplete autoencoders have a smaller dimension for hidden layer compared to the input
layer.
This helps to obtain important features from the data. It minimizes the loss function by penalizing
the g(f(x)) for being different from the input x.
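As a rough illustration of penalizing g(f(x)) for differing from x, the sketch below trains a purely linear undercomplete autoencoder with plain gradient descent on the squared reconstruction error. The synthetic data, the sizes (6 inputs, 2-unit code), the learning rate and the names `W_f` and `W_g` are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 points lying on a 2-D plane inside a 6-D space.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing
X /= X.std()                                      # rescale so the step size is safe

n_in, n_code = 6, 2                               # code is smaller than the input
W_f = rng.normal(scale=0.1, size=(n_in, n_code))  # encoder f(x) = x W_f
W_g = rng.normal(scale=0.1, size=(n_code, n_in))  # decoder g(h) = h W_g

def loss(X, W_f, W_g):
    # Mean squared reconstruction error ||g(f(x)) - x||^2.
    return np.mean(np.sum((X @ W_f @ W_g - X) ** 2, axis=1))

lr = 0.01
initial_loss = loss(X, W_f, W_g)
for _ in range(2000):
    H = X @ W_f                       # codes
    R = H @ W_g - X                   # residual g(f(x)) - x
    grad_g = 2.0 * H.T @ R / len(X)
    grad_f = 2.0 * X.T @ (R @ W_g.T) / len(X)
    W_f -= lr * grad_f
    W_g -= lr * grad_g
final_loss = loss(X, W_f, W_g)
```

With linear layers, this setup learns a subspace closely related to the one PCA would find; nonlinear activations are what allow the generalization to curved manifolds.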
Advantages
Undercomplete autoencoders do not need an explicit regularization term, because the bottleneck
itself prevents the network from simply copying the input to the output.
Drawbacks
An overparameterized model trained on insufficient data can overfit.
Regularized
Undercomplete autoencoders, with code dimension less than the input dimension, can learn the
most salient features of the data distribution.
We have seen that these autoencoders fail to learn anything useful if the encoder and decoder are
given too much capacity.
A similar problem occurs if the hidden code is allowed to have dimension equal to the input, and in
the overcomplete case in which the hidden code has dimension greater than the input.
In these cases, even a linear encoder and a linear decoder can learn to copy the input to the
output without learning anything useful about the data distribution.
Ideally, one could train any architecture of autoencoder successfully, choosing the code dimension
and the capacity of the encoder and decoder based on the complexity of distribution to be modeled.
Regularized autoencoders provide the ability to do so.
Rather than limiting the model capacity by keeping the encoder and decoder shallow and the code
size small, regularized auto encoders use a loss function that encourages the model to have other
properties besides the ability to copy its input to its output.
These other properties include sparsity of the representation, smallness of the derivative of the
representation, and robustness to noise or to missing inputs.
A regularized autoencoder can be nonlinear and overcomplete but still learn something useful
about the data distribution, even if the model capacity is great enough to learn a trivial identity
function.
In addition to the methods described here, which are most naturally interpreted as regularized
autoencoders, nearly any generative model with latent variables and equipped with an inference
procedure (for computing latent representations given input) may be viewed as a particular form of
autoencoder.
Two generative modeling approaches that emphasize this connection with auto encoders are the
descendants of the Helmholtz machine, such as the variational auto encoder and the generative
stochastic networks.
These models naturally learn high-capacity, over complete encodings of the input and do not
require regularization for these encodings to be useful.
Their encodings are naturally useful because the models were trained to approximately maximize
the probability of the training data rather than to copy the input to the output.
Stochastic
Autoencoders are just feedforward networks.
The same loss functions and output unit types that can be used for traditional feedforward
networks are also used for autoencoders.
A general strategy for designing the output units and the loss function of a feedforward network is
to define an output distribution p(y | x) and minimize the negative log-likelihood −log p(y | x).
In that setting, y is a vector of targets, such as class labels.
In an autoencoder, x is now the target as well as the input.
Given a hidden code h, we may think of the decoder as providing a conditional distribution
pdecoder(x | h).
We may then train the autoencoder by minimizing −log pdecoder(x | h).
The exact form of this loss function will change depending on the form of pdecoder.
As with traditional feedforward networks, we usually use linear output units to parametrize the
mean of a Gaussian distribution if x is real valued.
In that case, the negative log-likelihood yields a mean squared error criterion.
Similarly, binary x values correspond to a Bernoulli distribution whose parameters are given by a
sigmoid output unit, discrete x values correspond to a softmax distribution, and so on.
Typically, the output variables are treated as being conditionally independent given h so that this
probability distribution is inexpensive to evaluate, but some techniques, such as mixture density
outputs, allow tractable modeling of outputs with correlations.
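The link between the choice of output distribution and the loss can be checked numerically. The sketch below is an illustration under assumptions: a unit-variance Gaussian with independent dimensions, and hypothetical function names. It shows that differences in Gaussian negative log-likelihood equal half the differences in squared error, and that the Bernoulli negative log-likelihood is the binary cross-entropy.

```python
import numpy as np

def gaussian_nll(x, mean, sigma=1.0):
    # -log N(x; mean, sigma^2 I), summed over independent dimensions.
    d = x.size
    return (0.5 * np.sum((x - mean) ** 2) / sigma**2
            + 0.5 * d * np.log(2.0 * np.pi * sigma**2))

def bernoulli_nll(x, p):
    # -log of a product of independent Bernoulli(p) terms (binary cross-entropy).
    return -np.sum(x * np.log(p) + (1.0 - x) * np.log(1.0 - p))

x = np.array([0.5, -1.0, 2.0])
mean_a = np.array([0.4, -0.8, 1.5])   # a good reconstruction
mean_b = np.zeros(3)                  # a poor reconstruction

# The constant term cancels, leaving half the difference in squared error.
diff_nll = gaussian_nll(x, mean_b) - gaussian_nll(x, mean_a)
diff_sse = 0.5 * (np.sum((x - mean_b) ** 2) - np.sum((x - mean_a) ** 2))
```

Because the additive constant does not depend on the decoder output, minimizing the Gaussian negative log-likelihood and minimizing the squared error select the same parameters.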
To make a more radical departure from the feedforward networks we have seen previously,
we can also generalize the notion of an encoding function f(x) to an encoding distribution
pencoder(h | x), as illustrated in the figure.
Figure: The structure of a stochastic autoencoder, in which both the encoder and the decoder are
not simple functions but instead involve some noise injection, meaning that their output can be
seen as sampled from a distribution, pencoder(h | x) for the encoder and pdecoder(x | h) for the
decoder.
Any latent variable model pmodel(h, x) defines a stochastic encoder
pencoder(h | x) = pmodel(h | x) and a stochastic decoder
pdecoder(x | h) = pmodel(x | h).
In general, the encoder and decoder distributions are not necessarily conditional
distributions compatible with a unique joint distribution pmodel (x, h).
Denoising
Denoising autoencoders create a corrupted copy of the input by introducing some noise.
This prevents the autoencoder from simply copying the input to the output without learning
features of the data.
These autoencoders take a partially corrupted input while training to recover the original
undistorted input.
The model learns a vector field for mapping the input data towards a lower dimensional manifold
which describes the natural data to cancel out the added noise.
Advantages
It was introduced to achieve good representation. Such a representation is one that can be
obtained robustly from a corrupted input and that will be useful for recovering the corresponding
clean input.
Corruption of the input can be done randomly by setting some of the inputs to zero. The remaining
nodes are copied unchanged into the noised input.
The loss function is minimized between the output and the original, uncorrupted input.
Setting up a single-thread denoising autoencoder is easy.
Drawbacks
To train an auto encoder to denoise data, it is necessary to perform preliminary stochastic
mapping in order to corrupt the data and use as input.
This model isn't able to develop a mapping which memorizes the training data because our input
and target output are no longer the same.
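The corruption step and the denoising loss described in this section can be sketched as follows. This is a minimal illustration assuming masking noise (inputs zeroed at random) and a squared-error loss; the function names, the drop probability and the stand-in `identity` "network" are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, drop_prob, rng):
    # Masking noise: each input is set to zero with probability drop_prob,
    # the remaining inputs are copied unchanged.
    mask = rng.random(x.shape) >= drop_prob
    return x * mask

def denoising_loss(x_clean, reconstruct, drop_prob, rng):
    x_noisy = corrupt(x_clean, drop_prob, rng)   # the network sees the noisy input
    x_hat = reconstruct(x_noisy)
    return np.mean((x_hat - x_clean) ** 2)       # the target is the clean input

x = rng.random((4, 10)) + 0.1     # strictly positive toy data

def identity(v):
    # Stand-in for a network that merely copies its input.
    return v

loss_identity = denoising_loss(x, identity, drop_prob=0.3, rng=rng)
```

A network that only copies its input cannot drive this loss to zero, because the zeroed entries are never restored; this is why the input and target being different discourages memorization.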
Contractive
The objective of a contractive autoencoder is to have a robust learned representation which is
less sensitive to small variations in the data.
Robustness of the representation is achieved by adding a penalty term to the loss function.
The contractive autoencoder is another regularization technique, just like sparse and denoising
autoencoders.
However, this regularizer corresponds to the Frobenius norm of the Jacobian matrix of the
encoder activations with respect to the input.
The Jacobian matrix of the hidden layer is calculated with respect to the input, and the penalty
is its squared Frobenius norm, which is the sum of the squares of all its elements.
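For a concrete picture of the contractive penalty, the sketch below computes the sum of squared entries of the encoder Jacobian for a single sigmoid layer and compares it against a finite-difference Jacobian. The single-layer sigmoid encoder h = sigmoid(Wx + b), the sizes and the function names are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encoder(x, W, b):
    return sigmoid(W @ x + b)

def contractive_penalty(x, W, b):
    # For a sigmoid layer, J[j, i] = h_j (1 - h_j) W[j, i], so the sum of
    # squared Jacobian entries factorizes per hidden unit.
    h = encoder(x, W, b)
    dh = h * (1.0 - h)
    return np.sum(dh**2 * np.sum(W**2, axis=1))

def numerical_jacobian(x, W, b, eps=1e-6):
    # Central finite differences, used only to check the closed form.
    J = np.zeros((W.shape[0], x.size))
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        J[:, i] = (encoder(x + step, W, b) - encoder(x - step, W, b)) / (2.0 * eps)
    return J

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))
b = rng.normal(size=3)
x = rng.normal(size=5)

penalty = contractive_penalty(x, W, b)
penalty_check = np.sum(numerical_jacobian(x, W, b) ** 2)
```

In training, this penalty would be scaled by a weighting coefficient and added to the reconstruction loss.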
Advantages
A contractive autoencoder can be a better choice than a denoising autoencoder for learning
useful features.
This model learns an encoding in which similar inputs have similar encodings.
Hence, we're forcing the model to learn how to contract a neighborhood of inputs into a
smaller neighborhood of outputs.
Optimization for Deep Learning
Optimizers are algorithms or methods used to change the attributes of your neural network such
as weights and learning rate in order to reduce the losses.
Optimizers help to get results faster.
How you should change the weights or learning rates of your neural network to reduce the
losses is defined by the optimizers you use.
Optimization algorithms or strategies are responsible for reducing the losses and providing the
most accurate results possible.
We'll learn about different types of optimizers and their advantages:
Gradient Descent
It is the most basic but most used optimization algorithm. It's used heavily in linear regression
and classification algorithms. Backpropagation in neural networks also uses a gradient descent
algorithm.
It is a first-order optimization algorithm, which depends on the first-order derivative of the loss
function.
It calculates which way the weights should be altered so that the function can reach a
minimum.
Through backpropagation, the loss is transferred from one layer to another, and the model's
parameters, also known as weights, are modified depending on the losses so that the
loss can be minimized.
θ = θ−α⋅∇J(θ)
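The update rule θ = θ − α⋅∇J(θ) can be tried on a toy problem. The sketch below assumes a simple quadratic loss J(θ) = ‖θ − target‖², a learning rate of 0.1 and 100 iterations; these choices and the names are illustrative only.

```python
import numpy as np

target = np.array([3.0, -2.0])   # hypothetical minimizer of the toy loss

def J(theta):
    # Toy loss: squared distance to the target.
    return np.sum((theta - target) ** 2)

def grad_J(theta):
    # First-order derivative of the toy loss.
    return 2.0 * (theta - target)

theta = np.zeros(2)   # initial parameters
alpha = 0.1           # learning rate
for _ in range(100):
    theta = theta - alpha * grad_J(theta)   # θ = θ − α⋅∇J(θ)
```

Each step here shrinks the distance to the minimum by a constant factor, so `theta` approaches `target` geometrically.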
Advantages
Easy computation.
Easy to implement.
Easy to understand.
Disadvantages
May get trapped in local minima.
Weights are changed after calculating the gradient on the whole dataset. So, if the dataset is too
large, then this may take a very long time to converge to the minimum.
Requires large memory to calculate gradient on the whole dataset.
Back Propagation
Backprop is an abbreviation for “backward propagation of errors”.
It is used in conjunction with gradient descent and is the practical implementation of the
gradient computation.
Backpropagation (BP) was an important finding in the history of neural networks.
This method calculates the gradient of the loss function with respect to the weights in the
network.
This gradient is fed to the optimisation method, which updates the existing weights to minimise
the loss function.
Backpropagation needs the loss function to be evaluated, and for that it requires a known
or desired output for each input value.
Backpropagation has found its applications in areas like classification problems, function
approximation, time-space approximation, time-series prediction, face recognition, ALVINN-
Enhancing training, etc.
Convergence Theorem
For a neural network model in which each neuron performs a threshold logic function, the model
always converges to a stable state when operating in serial mode and to a cycle of length two
when operating in fully parallel mode.
There are mainly two types of convergence results for gradient descent. If all the iterates are
bounded, then GD with a proper constant step size converges.
There are many convergence theorems, such as the perceptron convergence theorem, convergence
theorems for multi-layered neural networks, convergence theory for deep learning via
over-parameterisation, and convergence results for neural networks via electrodynamics.
Learning Rate
For neural network optimisation, there are two main goals: one is to converge faster, and the
other is to improve a particular metric of interest.
A faster method does not necessarily generalise better or improve the metric of interest, which is
different from the optimisation loss.
Because of this, one has to try an optimisation idea to improve the convergence speed and accept
that idea only if it passes a specific 'performance check'.
The learning rate is a tuning parameter in an optimisation algorithm that determines the step size
at each iteration while moving towards a minimum of the loss function.
It represents the speed at which a machine learning model learns, since it influences how much
old information is overridden by newly acquired information.
The learning rate is denoted by η or α.
A learning rate schedule changes the step size during learning, between iterations or epochs. It is
mainly controlled by two parameters: decay and momentum. Common schedule types are time-based,
step-based and exponential.
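The three schedule types named above are commonly written as follows. The exact formulas, parameter names and default-free signatures in this sketch are assumptions (one widely used convention each), not definitions taken from the slides.

```python
import math

def time_based(lr0, epoch, decay):
    # Learning rate shrinks as 1 / (1 + decay * epoch).
    return lr0 / (1.0 + decay * epoch)

def step_based(lr0, epoch, drop, epochs_per_drop):
    # Learning rate is multiplied by `drop` every `epochs_per_drop` epochs.
    return lr0 * drop ** math.floor(epoch / epochs_per_drop)

def exponential(lr0, epoch, decay):
    # Learning rate decays smoothly as exp(-decay * epoch).
    return lr0 * math.exp(-decay * epoch)

# Learning rates over the first 30 epochs for each schedule.
schedule_table = [
    (epoch,
     time_based(0.1, epoch, decay=0.1),
     step_based(0.1, epoch, drop=0.5, epochs_per_drop=10),
     exponential(0.1, epoch, decay=0.1))
    for epoch in range(30)
]
```

All three start at the initial rate `lr0` and differ only in how quickly and how smoothly they reduce it.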
Initialisation
Initialisation is one of the significant tricks for training deep neural networks.
A large portion of the whole parameter space consists of exploding-gradient or vanishing-gradient
regions, and an algorithm initialised in these regions will fail.
It is therefore important to pick the initial point in a good region to start with.
Types of initialisation
Naive initialisation: The suitable region to pick the initial point is unknown, so the first step is to
find a simple initial point.
One choice is the all-zero initial point; another is a sparse initial point, in which only a
small portion of the weights are non-zero; a third is to draw the weights from a certain random
distribution.
LeCun initialisation and Xavier initialisation, which are designed for sigmoid activation functions.
Kaiming initialisation for ReLU activation.
Layer-Sequential Unit-Variance (LSUV), which shows empirical benefits for some problems.
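The LeCun, Xavier and Kaiming schemes listed above differ mainly in the variance of the random weights. The sketch below uses the commonly cited normal-distribution variants (variance 1/fan_in, 2/(fan_in + fan_out) and 2/fan_in respectively); the function names and layer sizes are assumptions for illustration.

```python
import numpy as np

def lecun_normal(fan_in, fan_out, rng):
    # Variance 1 / fan_in.
    return rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, fan_out))

def xavier_normal(fan_in, fan_out, rng):
    # Variance 2 / (fan_in + fan_out).
    return rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_in, fan_out))

def kaiming_normal(fan_in, fan_out, rng):
    # Variance 2 / fan_in, intended for ReLU layers.
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

rng = np.random.default_rng(0)
fan_in, fan_out = 500, 400          # hypothetical layer sizes
W_lecun = lecun_normal(fan_in, fan_out, rng)
W_xavier = xavier_normal(fan_in, fan_out, rng)
W_kaiming = kaiming_normal(fan_in, fan_out, rng)
```

The scaling by layer width is what keeps activations and gradients from growing or shrinking as they pass through many layers.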
Normalisation
Normalisation can be viewed as the extension to initialisation, so instead of merely modifying the
initial point, this method changes the network for all the next iterates that follow.
Batch Normalisation is a standard technique today. It reduces covariate shift.
Covariate shift happens when an algorithm has learned an X to Y mapping and the distribution
of X then changes, so that the model has to be retrained.
BatchNorm also allows each layer of a network to learn more independently of the other layers.
A benefit of BatchNorm is that it allows a larger learning rate.
Networks without BatchNorm have larger isolated eigenvalues, while those with
BatchNorm have no such issues with isolated eigenvalues.
BatchNorm, however, does not work very well with mini-batches that do not have similar
statistics, because the mean/variance of each mini-batch is used as an approximation of the
mean/variance of all samples.
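The per-mini-batch mean and variance mentioned above enter the computation as in this sketch of the training-time forward pass. It assumes the usual formulation with a learnable scale `gamma` and shift `beta` and a small `eps` for numerical stability; running statistics for inference are omitted, and the names are illustrative.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Statistics are computed per feature over the mini-batch (axis 0).
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta               # learnable rescaling

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=5.0, size=(64, 4))   # mini-batch of 64 samples
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```

Because `mean` and `var` come from the current mini-batch only, two mini-batches with very different statistics are normalized differently, which is the weakness noted above.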
Stochastic Gradient Descent
It's a variant of Gradient Descent that updates the model's parameters more frequently.
Here, the model parameters are altered after computing the loss on each training example.
So, if the dataset contains 1000 rows, SGD will update the model parameters 1000 times in one
pass over the dataset instead of once as in Gradient Descent.
θ = θ − α⋅∇J(θ; x^(i), y^(i)), where {x^(i), y^(i)} are the training examples.
Because the model parameters are updated so frequently, they have high variance, and the loss
function fluctuates with varying intensity.
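The per-example update can be illustrated on a small linear regression problem with 1000 rows, matching the count used above. The synthetic data, the squared-error loss, the learning rate and the variable names are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: 1000 rows, 2 features.
X = rng.normal(size=(1000, 2))
true_theta = np.array([2.0, -1.0])
y = X @ true_theta + 0.01 * rng.normal(size=1000)

theta = np.zeros(2)
alpha = 0.01
updates = 0
for i in rng.permutation(len(X)):      # one pass over the shuffled dataset
    error = X[i] @ theta - y[i]
    grad = 2.0 * error * X[i]          # gradient of (x·θ − y)² for one example
    theta -= alpha * grad              # θ = θ − α⋅∇J(θ; x^(i), y^(i))
    updates += 1
```

A single pass performs 1000 parameter updates, whereas batch gradient descent would perform one update using all 1000 rows.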
Advantages
Frequent updates of model parameters; hence, it converges in less time.
Requires less memory, as there is no need to store the values of the loss function.
May find new minima.
Disadvantages
High variance in model parameters.
May overshoot even after reaching the global minimum.
To get the same convergence as gradient descent, the learning rate needs to be reduced slowly.
Momentum
Momentum was invented to reduce the high variance in SGD and smooth the convergence.
It accelerates convergence in the relevant direction and reduces fluctuation in the
irrelevant directions.
One more hyperparameter is used in this method, known as momentum and symbolized by ‘γ’.
V(t) = γ⋅V(t−1) + α⋅∇J(θ)
Now, the weights are updated by θ = θ − V(t).
The momentum term γ is usually set to 0.9 or a similar value.
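The momentum update can be tried on a small ill-conditioned quadratic, where plain gradient descent tends to oscillate along the steep direction. The toy loss J(θ) = 0.5⋅(θ₁² + 10⋅θ₂²), the learning rate 0.05 and the iteration count are assumptions chosen for illustration.

```python
import numpy as np

curvature = np.array([1.0, 10.0])   # one shallow and one steep direction

def grad_J(theta):
    # Gradient of the toy loss 0.5 * sum(curvature * theta**2).
    return curvature * theta

theta = np.array([5.0, 5.0])
V = np.zeros(2)                     # velocity, V(0) = 0
alpha, gamma = 0.05, 0.9
for _ in range(300):
    V = gamma * V + alpha * grad_J(theta)   # V(t) = γ⋅V(t−1) + α⋅∇J(θ)
    theta = theta - V                       # θ = θ − V(t)
```

The velocity accumulates gradient components that keep pointing the same way and partly cancels those that flip sign between steps.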
Advantages
Reduces the oscillations and high variance of the parameters.
Converges faster than gradient descent.
Disadvantages
One more hyper-parameter is added which needs to be selected manually and accurately.
Adam
Adam (Adaptive Moment Estimation) works with moments of first and second order.
The intuition behind Adam is that we don't want to roll so fast that we jump over
the minimum; we want to decrease the velocity a little bit for a careful search.
In addition to storing an exponentially decaying average of past squared gradients like AdaDelta,
Adam also keeps an exponentially decaying average of past gradients M(t).
M(t) and V(t) are estimates of the first moment (the mean) and the second moment (the
uncentered variance) of the gradients g(t), respectively.
First and second moment estimates:
M(t) = β1⋅M(t−1) + (1−β1)⋅g(t)
V(t) = β2⋅V(t−1) + (1−β2)⋅g(t)²
Here, M(t) and V(t) are bias-corrected so that E[M̂(t)] equals E[g(t)], where E[f(x)] is
the expected value of f(x):
M̂(t) = M(t) / (1 − β1^t),  V̂(t) = V(t) / (1 − β2^t)
To update the parameters:
θ(t+1) = θ(t) − α⋅M̂(t) / (√V̂(t) + ϵ)
Typical values are 0.9 for β1, 0.999 for β2, and 10^(−8) for ‘ϵ’.
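A single Adam step, in its commonly published form, can be sketched as below. The toy loss J(θ) = (θ − 3)², the learning rate 0.1 and the function name `adam_step` are assumptions for illustration; the β1, β2 and ϵ defaults follow the values quoted above.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * grad        # first moment estimate M(t)
    v = beta2 * v + (1.0 - beta2) * grad**2     # second moment estimate V(t)
    m_hat = m / (1.0 - beta1**t)                # bias correction
    v_hat = v / (1.0 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

def grad_J(theta):
    # Gradient of the toy loss (θ − 3)².
    return 2.0 * (theta - 3.0)

theta = np.array([0.0])
m = np.zeros(1)
v = np.zeros(1)
alpha = 0.1

# First step (t = 1): bias correction makes the step size close to alpha.
theta_after_first, m, v = adam_step(theta, grad_J(theta), m, v, t=1, alpha=alpha)
theta = theta_after_first
for t in range(2, 501):
    theta, m, v = adam_step(theta, grad_J(theta), m, v, t, alpha=alpha)
```

Because the update divides by the square root of the second moment, the first step has magnitude close to `alpha` regardless of how large the raw gradient is.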
Advantages
The method is fast and converges rapidly.
Rectifies vanishing learning rate, high variance.
Disadvantages
Computationally costly.