Unit 5 - Autoencoders and Generative Models
Autoencoders are often described as an unsupervised learning technique, since they don’t need
explicit labels to train on. To be more precise, they are self-supervised, because they generate
their own labels from the training data.
Autoencoder Architecture:
The network architecture for autoencoders can vary between a simple FeedForward network,
LSTM network or Convolutional Neural Network depending on the use case.
The layer between the encoder and decoder, i.e. the code, is also known as the bottleneck or
latent-space representation.
This is a well-designed approach to decide which aspects of observed data are relevant
information and what aspects can be discarded.
It does this by balancing two criteria:
o Compactness of the representation, measured as its compressibility.
o Retention of behaviourally relevant variables from the input.
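The encoder, bottleneck and decoder described above can be sketched as a single forward pass. This is only an illustration under stated assumptions: the layer sizes (784 and 32), the random untrained weights, and the helper names `encode` and `decode` are not from the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)

input_dim, code_dim = 784, 32  # assumed sizes: flattened 28x28 image, 32-d code

# Randomly initialised weights; a real autoencoder would learn these by training.
W_enc = rng.normal(0.0, 0.01, size=(input_dim, code_dim))
W_dec = rng.normal(0.0, 0.01, size=(code_dim, input_dim))

def encode(x):
    """Map inputs to the bottleneck code (ReLU activation)."""
    return np.maximum(0.0, x @ W_enc)

def decode(code):
    """Map codes back to input space (sigmoid keeps outputs in (0, 1))."""
    return 1.0 / (1.0 + np.exp(-(code @ W_dec)))

x = rng.random((4, input_dim))  # a batch of 4 fake "images" with values in [0, 1)
code = encode(x)                # compressed representation
x_hat = decode(code)            # reconstruction with the same shape as the input
print(code.shape, x_hat.shape)  # (4, 32) (4, 784)
```

The code has far fewer values than the input, so the network is forced to decide which aspects of the data to keep.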
Applications of Autoencoders
Image Coloring
Autoencoders are used for converting any black and white picture into a colored image.
Depending on what is in the picture, it is possible to tell what the color should be.
Feature variation
It extracts only the required features of an image and generates the output by removing any noise
or unnecessary interruption.
Dimensionality Reduction
The autoencoder reconstructs an image similar to the input from a lower-dimensional code. The
code therefore acts as a compact version of the image, with far fewer values than the original pixels.
Denoising Image
The input seen by the autoencoder is not the raw input but a stochastically corrupted version. A
denoising autoencoder is thus trained to reconstruct the original input from the noisy version.
Watermark Removal
It is also used for removing watermarks from images or to remove any object while filming a
video or a movie.
Implementation
Now let’s implement an autoencoder for the following architecture, 1 hidden layer in the encoder
and decoder.
We will use the extremely popular MNIST dataset as input. It contains black-and-white images
of handwritten digits.
Each image is of size 28x28, and we use it as a vector of 784 numbers in the range [0, 1].
We will now implement the autoencoder with Keras. The hyperparameters are: 128 nodes in
the hidden layer, code size is 32, and binary crossentropy is the loss function.
Code:
Let’s import the required libraries
import numpy as np
from keras.layers import Input, Dense
from keras.models import Model
from keras.datasets import mnist
import matplotlib.pyplot as plt
Declaration of Hidden Layers and Variables
# this is the size of our encoded representations
encoding_dim = 32 # 32 floats -> compression of factor 24.5, assuming the input is 784 floats
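# assumed minimal layer definitions (not shown in these notes): a single Dense layer each way
input_img = Input(shape=(784,))
encoded = Dense(encoding_dim, activation='relu')(input_img)
decoded = Dense(784, activation='sigmoid')(encoded)
autoencoder = Model(input_img, decoded)
encoder = Model(input_img, encoded)
encoded_input = Input(shape=(encoding_dim,))
decoder_layer = autoencoder.layers[-1]
# create the decoder model
decoder = Model(encoded_input, decoder_layer(encoded_input))
# configure our model to use a per-pixel binary crossentropy loss, and the Adadelta optimizer:
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')
# assumed preprocessing: load MNIST and flatten each image to 784 floats in [0, 1]
(x_train, _), (x_test, _) = mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.
x_test = x_test.reshape(-1, 784).astype('float32') / 255.
print(x_train.shape)
print(x_test.shape)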
Training Autoencoders for 50 epochs
autoencoder.fit(x_train, x_train,
                epochs=50,
                batch_size=256,
                shuffle=True,
                validation_data=(x_test, x_test))
# encode and decode some digits
# note that we take them from the *test* set
encoded_imgs = encoder.predict(x_test)
decoded_imgs = decoder.predict(encoded_imgs)
Visualizing the reconstructed inputs and the encoded representations using Matplotlib
n = 20 # how many digits we will display
plt.figure(figsize=(20, 4))
for i in range(n):
    # display original
    ax = plt.subplot(2, n, i + 1)
    plt.imshow(x_test[i].reshape(28, 28))
    plt.gray()
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
    # display reconstruction
    ax = plt.subplot(2, n, i + 1 + n)
    plt.imshow(decoded_imgs[i].reshape(28, 28))
    plt.gray()
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
plt.show()
====
2. Undercomplete autoencoders
An undercomplete autoencoder is an unsupervised neural network that you can use to generate
a compressed version of the input data.
It does this by taking in an image and trying to predict the same image as output, thus
reconstructing the image from its compressed bottleneck region.
The primary use for autoencoders like these is generating a latent space or bottleneck, which
forms a compressed substitute of the input data and can be easily decompressed back with
the help of the network when needed.
Undercomplete autoencoders learn features by minimizing the loss function
$$L(x, g(f(x)))$$
where f is the encoder, g is the decoder, and L is a loss function that penalizes g(f(x)) for
diverging from the original input x. L can be a mean squared error or even a mean absolute error.
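As a small worked example of the two choices of L mentioned above, the sketch below computes the mean squared error and the mean absolute error between an input and a reconstruction. The function names `mse` and `mae` and the sample vectors are illustrative assumptions.

```python
def mse(x, x_hat):
    """Mean squared error between an input x and its reconstruction x_hat."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def mae(x, x_hat):
    """Mean absolute error between an input x and its reconstruction x_hat."""
    return sum(abs(a - b) for a, b in zip(x, x_hat)) / len(x)

x = [0.0, 0.5, 1.0, 0.25]      # hypothetical input
x_hat = [0.1, 0.5, 0.8, 0.25]  # hypothetical reconstruction g(f(x))

# Roughly 0.0125 for the MSE and 0.075 for the MAE (up to floating-point rounding).
print(mse(x, x_hat), mae(x, x_hat))
```

The squared error punishes the larger deviation (0.2) more heavily than the smaller one (0.1), while the absolute error weighs them in proportion.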
The goal of the autoencoder is to capture the most important features present in the data.
Undercomplete autoencoders have a hidden layer of smaller dimension than the input layer. This
forces the network to keep only the important features of the data.
The objective is to minimize the loss function by penalizing g(f(x)) for being different from
the input x.
When the decoder is linear and the loss is the mean squared error, an undercomplete
autoencoder learns to span the same reduced feature space as PCA.
When the encoder function f and decoder function g are nonlinear, we get a powerful
nonlinear generalization of PCA.
Undercomplete autoencoders do not need an explicit regularization term, because they maximize
the probability of the data and the narrow code keeps them from simply copying the input to the output.
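The connection to PCA can be checked numerically. The sketch below does not train a network. As a simplifying assumption, it treats a linear autoencoder with tied, orthonormal weights as a projection and compares the PCA basis with a random basis of the same size. The toy data and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples in 5 dimensions, with most variance in 2 directions.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))
X = X - X.mean(axis=0)  # centre the data, as PCA does

def linear_autoencoder_mse(X, W):
    """Reconstruction MSE of a tied-weight linear autoencoder.

    Encoder: h = X @ W, decoder: X_hat = h @ W.T (W has orthonormal columns).
    """
    X_hat = (X @ W) @ W.T
    return float(np.mean((X - X_hat) ** 2))

k = 2  # code dimension
# PCA solution: the top-k right singular vectors of the centred data.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
W_pca = Vt[:k].T

# A random orthonormal basis of the same size, for comparison.
W_rand, _ = np.linalg.qr(rng.normal(size=(5, k)))

pca_err = linear_autoencoder_mse(X, W_pca)
rand_err = linear_autoencoder_mse(X, W_rand)
print(pca_err, rand_err)
```

No other orthonormal basis of this size can achieve a lower reconstruction error than the PCA basis, which is why a linear autoencoder trained with mean squared error ends up in the PCA subspace.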
Advantages
Undercomplete autoencoders, with code dimension less than the input dimension, can learn
the most salient features of the data distribution.
Disadvantages
We have seen that these autoencoders fail to learn anything useful if the encoder and decoder
are given too much capacity.
A similar problem occurs if the hidden code is allowed to have dimension equal to the input,
and in the overcomplete case in which the hidden code has dimension greater than the input.
In these cases, even a linear encoder and a linear decoder can learn to copy the input to the
output without learning anything useful about the data distribution.
Ideally, one could train any architecture of autoencoder successfully, choosing the code
dimension and the capacity of the encoder and decoder based on the complexity of
distribution to be modelled.
===
3. Regularized autoencoders
Regularized autoencoders provide the ability to do so. Rather than limiting the model capacity by
keeping the encoder and decoder shallow and the code size small, regularized autoencoders use a
loss function that encourages the model to have other properties besides the ability to copy its
input to its output. These other properties include sparsity of the representation, smallness of the
derivative of the representation, and robustness to noise or to missing inputs. A regularized
autoencoder can be nonlinear and overcomplete but still learn something useful about the data
distribution, even if the model capacity is great enough to learn a trivial identity function.
In addition to the methods described here, which are most naturally interpreted as regularized
autoencoders, nearly any generative model with latent variables and equipped with an inference
procedure (for computing latent representations given input) may be viewed as a particular form
of autoencoder.
In practice, we usually find two types of regularized autoencoder: the sparse autoencoder and
the denoising autoencoder.
(i) Sparse autoencoder : Sparse autoencoders are typically used to learn features for another task
such as classification. An autoencoder that has been regularized to be sparse must respond to
unique statistical features of the dataset it has been trained on, rather than simply acting as an
identity function. In this way, training to perform the copying task with a sparsity penalty can
yield a model that has learned useful features as a byproduct.
Another way to constrain the reconstruction of an autoencoder is to impose a constraint on its
loss. We could, for example, add a regularization term to the loss function. Doing this makes
the autoencoder learn a sparse representation of the data.
There are two different ways to construct the sparsity penalty: L1
regularization and KL-divergence.
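The two penalties can be sketched as plain functions of the hidden activations. The KL form below is the commonly used one that compares a target sparsity level with the mean activation of each hidden unit. The names `rho`, `beta` and `lam`, their default values, and the sample activations are assumptions for illustration, not values from these notes.

```python
import math

def l1_penalty(activations, lam=1e-4):
    """lam * sum of absolute activations over all samples and hidden units."""
    return lam * sum(abs(a) for row in activations for a in row)

def kl_sparsity_penalty(activations, rho=0.05, beta=1.0, eps=1e-8):
    """beta * sum_j KL(rho || rho_hat_j), rho_hat_j = mean activation of unit j.

    Assumes activations lie in (0, 1), e.g. sigmoid outputs.
    """
    n_samples = len(activations)
    n_units = len(activations[0])
    total = 0.0
    for j in range(n_units):
        rho_hat = sum(row[j] for row in activations) / n_samples
        rho_hat = min(max(rho_hat, eps), 1.0 - eps)  # avoid log(0)
        total += (rho * math.log(rho / rho_hat)
                  + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))
    return beta * total

# Rows are samples, columns are hidden units (hypothetical values).
sparse_acts = [[0.05, 0.04], [0.06, 0.05]]  # mean activations close to rho
dense_acts = [[0.9, 0.8], [0.7, 0.95]]      # mean activations far from rho
print(kl_sparsity_penalty(sparse_acts), kl_sparsity_penalty(dense_acts))
print(l1_penalty(sparse_acts), l1_penalty(dense_acts))
```

Both penalties are small when the hidden units are mostly inactive and grow as the units become more active, so adding either to the reconstruction loss pushes the code towards sparsity.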
Why L1 Regularization Leads to Sparsity
L1 regularization and L2 regularization are widely used in machine learning and deep learning.
L1 regularization adds the “absolute value of magnitude” of the coefficients as a penalty term, while
L2 regularization adds the “squared magnitude” of the coefficients as a penalty term.
Although L1 and L2 can both be used as regularization terms, the key difference between them is
that L1 regularization tends to shrink coefficients all the way to zero, while L2 regularization
moves coefficients towards zero without ever reaching it. Thus L1 regularization is often
used as a method of feature selection. But why does L1 regularization lead to sparsity?
Consider that we have two loss functions L1 and L2 which represent L1 regularization and L2
regularization respectively.
Gradient descent is the standard way of optimizing neural networks, so consider these two loss
functions and their derivatives.
(Figure: L2 regularization and its derivative)
For L1 regularization, the gradient is either 1 or -1 except when w=0, which means that L1
regularization always moves w towards zero with the same step size, regardless of the value
of w. When w=0, the gradient becomes zero and no further update is made. For L2
regularization things are different. L2 regularization also moves w towards zero, but the step
size becomes smaller and smaller as w shrinks, which means that w never actually reaches zero.
Loss Function
After the above analysis, we arrive at the idea of using L1 regularization in a sparse autoencoder.
In addition to the first two terms of the loss function, we add a third term that penalizes the
absolute value of the vector of activations a in layer h for sample i. A hyperparameter controls
its effect on the whole loss function. In this way, we build a sparse autoencoder.
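The contrast between the two gradients can be made concrete with a few lines of gradient descent on a single weight. This is a sketch under assumptions: the L1 loss is taken as |w| and the L2 loss as w**2, the learning rate and starting value are arbitrary, and the L1 step snaps to zero when it would otherwise overshoot, so that the weight does not oscillate around zero.

```python
def sign(w):
    """Return 1, -1 or 0 according to the sign of w."""
    return (w > 0) - (w < 0)

def step_l1(w, lr):
    """One gradient step on |w|; snaps to 0 when the step would cross zero."""
    if abs(w) <= lr:
        return 0.0
    return w - lr * sign(w)

def step_l2(w, lr):
    """One gradient step on w**2, whose gradient is 2*w."""
    return w - lr * 2.0 * w

w1 = w2 = 1.0  # same starting weight for both penalties
lr = 0.1
for _ in range(50):
    w1 = step_l1(w1, lr)
    w2 = step_l2(w2, lr)

print(w1)      # 0.0
print(w2 > 0)  # True
```

The L1 weight arrives at exactly zero after a fixed number of constant-size steps, whereas the L2 weight is multiplied by the same factor at every step and only approaches zero.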
Visualization
We built a deep autoencoder and trained it on the MNIST dataset both with and without L1
regularization. The structure of this autoencoder is as follows:
Code:
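from keras import regularizers
from keras.layers import Input, Dense
input_size = 784
hidden_size = 64
output_size = 784
x = Input(shape=(input_size,))
# Encoder: the L1 activity regularizer penalizes the absolute values of the hidden activations
h = Dense(hidden_size, activation='relu', activity_regularizer=regularizers.l1(10e-5))(x)
# Decoder
r = Dense(output_size, activation='sigmoid')(h)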
Notice that in the hidden layer we added an L1 activity regularizer, which applies a penalty to
the loss function during the optimization phase. As a result, the representation is sparser
than that of the vanilla autoencoder.
The model was trained for 100 epochs with a batch size of 128 and Adam as the optimizer.
(ii) Denoising autoencoder :
Decoder: This component maps the encoding back to the original data space.
During the training phase, the autoencoder is presented with a set of clean input examples along
with their corresponding noisy versions. The objective is to learn, using an encoder-decoder
architecture, a function that efficiently transforms noisy input into clean output.
Architecture of DAE
The denoising autoencoder (DAE) architecture is similar to a standard autoencoder. It consists of
two main components:
Encoder
The encoder is a neural network with one or more hidden layers.
Its purpose is to receive noisy input data and generate an encoding, which is a
low-dimensional representation of the data.
The encoder can be understood as a compression function, because the encoding has fewer
dimensions than the input data.
Decoder
The decoder acts as an expansion function, which is responsible for reconstructing the
original data from the compressed encoding.
It takes as input the encoding generated by the encoder and reconstructs the original data.
Like encoders, decoders are implemented as neural networks featuring one or more
hidden layers.
During the training phase, the denoising autoencoder (DAE) is presented with a collection of clean
input examples along with their respective noisy counterparts. The objective is to learn a
function that maps a noisy input to a relatively clean output using an encoder-decoder
architecture. To achieve this, a reconstruction loss function is typically employed to measure the
disparity between the clean input and the reconstructed output. A DAE is trained by minimizing
this loss through backpropagation, which updates the weights of both the encoder and decoder
components.
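The training pairs described above can be built by corrupting clean inputs. The sketch below uses additive Gaussian noise clipped back to [0, 1]. The noise type, the `noise_factor` value, and the random stand-in images are assumptions, since these notes do not specify how the corruption is done.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(x_clean, noise_factor=0.5):
    """Return a corrupted copy of x_clean, clipped back to [0, 1]."""
    noisy = x_clean + noise_factor * rng.normal(size=x_clean.shape)
    return np.clip(noisy, 0.0, 1.0)

def reconstruction_loss(x_clean, x_reconstructed):
    """Mean squared error between the clean target and the network output."""
    return float(np.mean((x_clean - x_reconstructed) ** 2))

x_clean = rng.random((8, 28, 28, 1))  # stand-in for a batch of MNIST images
x_noisy = add_gaussian_noise(x_clean)

# Training pairs are (input = x_noisy, target = x_clean).
# Passing the noisy input through unchanged already incurs this loss,
# which a trained denoising autoencoder should improve on.
baseline = reconstruction_loss(x_clean, x_noisy)
print(x_noisy.shape, baseline)
```

With a Keras model, such pairs would be passed as `fit(x_noisy, x_clean, ...)`, so the loss always compares the output with the clean image rather than with the corrupted input.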
Code:
from keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D
x = Input(shape=(28, 28, 1))
# Encoder: two convolution + max-pooling stages reduce 28x28 to 7x7
conv1_1 = Conv2D(32, (3, 3), activation='relu', padding='same')(x)
pool1 = MaxPooling2D((2, 2), padding='same')(conv1_1)
conv1_2 = Conv2D(32, (3, 3), activation='relu', padding='same')(pool1)
h = MaxPooling2D((2, 2), padding='same')(conv1_2)
# Decoder: two convolution + upsampling stages restore 7x7 to 28x28
conv2_1 = Conv2D(32, (3, 3), activation='relu', padding='same')(h)
up1 = UpSampling2D((2, 2))(conv2_1)
conv2_2 = Conv2D(32, (3, 3), activation='relu', padding='same')(up1)
up2 = UpSampling2D((2, 2))(conv2_2)
r = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(up2)
Data Imputation: By learning to reconstruct missing values from the available data, DAEs
can facilitate data imputation in datasets with incomplete information.
Data Compression: DAEs can compress data by obtaining a concise representation of
the data in the encoding space.
Anomaly Detection: DAEs can detect anomalies in a dataset by training a model to
reconstruct normal data and then flagging inputs that it reconstructs poorly as potentially
abnormal.
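The anomaly-detection idea can be sketched as a threshold on the reconstruction error. To keep the example self-contained, a fixed projection onto a known line stands in for a trained autoencoder. The stand-in, the toy data, and the 99th-percentile threshold are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Normal" data lies close to a 1-d line in 3-d space.
direction = np.array([1.0, 2.0, -1.0]) / np.sqrt(6.0)
normal = rng.normal(size=(500, 1)) * direction + 0.01 * rng.normal(size=(500, 3))

def reconstruct(x):
    """Stand-in for a trained autoencoder: project each row onto the line."""
    return (x @ direction)[:, None] * direction

def reconstruction_error(x):
    """Per-sample mean squared reconstruction error."""
    return np.mean((x - reconstruct(x)) ** 2, axis=1)

# Threshold chosen from the errors on normal data (the percentile is an assumption).
threshold = np.percentile(reconstruction_error(normal), 99)

new_points = np.array([
    [1.0, 2.0, -1.0],  # on the line: reconstructs well
    [3.0, -3.0, 3.0],  # far from the line: reconstructs badly
])
flags = reconstruction_error(new_points) > threshold
print(flags)
```

Only the second point is flagged, because the stand-in model cannot reproduce inputs that do not resemble the data it was built for.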
==
Generative Models
Loss function for Stochastic Decoder
===
One notable advantage of VAEs is their ability to generate new data samples resembling the
training data. Because the VAE’s latent space is continuous, the decoder can generate new data
points that seamlessly interpolate among the training data points. VAEs find applications in
various domains like density estimation and text generation.
A VAE comprises an encoder network that maps input data to a latent code and a decoder
network that performs the inverse operation, translating the latent code back into a
reconstruction of the data. Through training, the VAE learns a latent representation that
captures the fundamental characteristics of the data, enabling precise reconstruction.
It achieves this by doing something that seems rather surprising at first: instead of outputting a
single encoding vector of size n, its encoder outputs two vectors of size n: a vector of means, μ,
and a vector of standard deviations, σ.
Variational Autoencoder
They form the parameters of a vector of random variables of length n, with the i-th elements
of μ and σ being the mean and standard deviation of the i-th random variable, X_i, from which we
sample to obtain the sampled encoding that we pass onward to the decoder.
Stochastically generating encoding vectors
This stochastic generation means that, even for the same input, while the mean and standard
deviations remain the same, the actual encoding will vary somewhat on every single pass simply
due to sampling.
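The sampling step can be sketched as follows. It draws each element as mean plus standard deviation times standard normal noise, a form often called the reparameterization trick. The numeric values of `mean` and `std` and the helper name `sample_encoding` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_encoding(mean, std):
    """Draw z_i ~ N(mean_i, std_i^2) as z = mean + std * eps, eps ~ N(0, 1)."""
    eps = rng.normal(size=mean.shape)
    return mean + std * eps

# Hypothetical encoder outputs for one input, with n = 5 latent dimensions.
mean = np.array([0.1, 1.2, 0.2, 0.8, -0.5])
std = np.array([0.2, 0.5, 0.8, 1.3, 0.1])

z_first = sample_encoding(mean, std)   # encoding on one pass
z_second = sample_encoding(mean, std)  # same input, different encoding
print(z_first)
print(z_second)
```

The two encodings differ even though `mean` and `std` are identical, and dimensions with a larger standard deviation vary more between passes.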
Code:
# build your encoder up to here. It can simply be a series of dense layers, a convolutional network
# or even an LSTM. Once made, flatten out the final layer of the encoder, call it hidden.
latent_size = 5
mean = Dense(latent_size)(hidden)
=======
6. Learning with autoencoders; Deep Generative Models: Generative Adversarial
Networks
Generative Adversarial Networks (GANs) were introduced in 2014 by Ian J. Goodfellow and
co-authors. GANs perform unsupervised learning tasks in machine learning. A GAN consists of
two models that automatically discover and learn the patterns in input data.
The two models are known as Generator and Discriminator.
They compete with each other to scrutinize, capture, and replicate the variations within a dataset.
GANs can be used to generate new examples that plausibly could have been drawn from the
original dataset.
Consider an example of a GAN. There is a database of real 100 rupee notes. The
generator neural network generates fake 100 rupee notes. The discriminator network tries to
tell the real notes from the fake ones.
What is a Generator?
A Generator in GANs is a neural network that creates fake data on which the discriminator is
trained. It learns to generate plausible data. The generated examples become negative training
examples for the discriminator. It takes a fixed-length random noise vector as input and
generates a sample.
The main aim of the Generator is to make the discriminator classify its output as real. The part of
the GAN that trains the Generator includes:
o a noisy input vector
o the generator network, which transforms the random input into a data instance
o the discriminator network, which classifies the generated data
o the generator loss, which penalizes the Generator for failing to fool the discriminator
The backpropagation method is used to adjust each weight in the right direction by calculating
the weight's impact on the output. It is also used to obtain gradients and these gradients can help
change the generator weights.
What is a Discriminator?
The Discriminator is a neural network that distinguishes real data from the fake data created by
the Generator. The discriminator's training data comes from two different sources:
The real data instances, such as real pictures of birds, humans, currency notes, etc., are
used by the Discriminator as positive samples during training.
The fake data instances created by the Generator are used as negative examples during
the training process.
The discriminator connects to two loss functions. During discriminator training, it ignores the
generator loss and uses only the discriminator loss.
In the process of training the discriminator, the discriminator classifies both real data and fake
data from the generator. The discriminator loss penalizes the discriminator for misclassifying a
real data instance as fake or a fake data instance as real.
The discriminator updates its weights through backpropagation from the discriminator loss
through the discriminator network.
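The two losses can be sketched with binary cross-entropy on the discriminator's output probabilities. This is an illustration under assumptions: the generator loss is written in the common form that labels fake samples as real, and the probabilities passed in are hypothetical.

```python
import math

def bce(prediction, target, eps=1e-12):
    """Binary cross-entropy for one probability and a 0/1 target."""
    p = min(max(prediction, eps), 1.0 - eps)  # avoid log(0)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def discriminator_loss(d_real, d_fake):
    """Real samples should score 1, generated samples should score 0."""
    real_term = sum(bce(p, 1) for p in d_real) / len(d_real)
    fake_term = sum(bce(p, 0) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """The generator wants its samples to be scored as real (target 1)."""
    return sum(bce(p, 1) for p in d_fake) / len(d_fake)

# Hypothetical discriminator outputs D(x) on real data and D(G(z)) on fakes.
good_d = discriminator_loss(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])
bad_d = discriminator_loss(d_real=[0.4, 0.5], d_fake=[0.6, 0.5])
print(good_d < bad_d)  # True: correct classifications give a lower loss
print(generator_loss([0.1, 0.05]) > generator_loss([0.8, 0.9]))  # True
```

The same discriminator outputs on fake data give a low discriminator loss and a high generator loss, which is what makes the two networks adversaries.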
Consider again the example of a GAN trying to identify whether 100 rupee notes are real or fake.
First, a noise vector (the input vector) is fed to the Generator network. The generator creates
fake 100 rupee notes. The real images of 100 rupee notes stored in a database are passed to the
discriminator along with the fake notes. The Discriminator then classifies each note as real or
fake.
We train the model, calculate the loss function at the end of the discriminator network, and
backpropagate the loss into both the discriminator and generator models.
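Mathematical Equation
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim P(z)}[\log(1 - D(G(z)))]$$
where
x = sample from the real data distribution p_data(x)
z = sample from P(z)
D(x) = Discriminator network
G(z) = Generator network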
Code:
Building the Generative Adversarial Network
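import torch.nn as nn
import torch.optim as optim
# Loss function
adversarial_loss = nn.BCELoss()
# Optimizers (generator, discriminator, lr, beta1 and beta2 are assumed to be defined earlier)
optimizer_G = optim.Adam(generator.parameters(), lr=lr, betas=(beta1, beta2))
optimizer_D = optim.Adam(discriminator.parameters(), lr=lr, betas=(beta1, beta2))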
Check the results
Let’s plot the generated images at different epochs to see after how many epochs the
generator is able to extract some information.
No information has been extracted by the generator yet, and the discriminator is intelligent
enough to identify its output as fake.
Plot of the image generated after training for 1000 epochs
from skimage.io import imread
a = imread('gan_images/1000.png')
plt.imshow(a)
Now the Generator is slowly becoming able to extract some information, which can be observed
in the image.
Plot of the image generated after training for 10000 epochs
Now the Generator is able to build an image that closely resembles the MNIST dataset, and there
is a high chance of the Discriminator being fooled.