Deep Learning
Dr. Pratik Narang
Department of CSIS
BITS Pilani, Pilani Campus
Dimensionality reduction
• In machine learning, dimensionality reduction is the process of
reducing the number of features that describe some data.
• This reduction is done either by selection (only some of the existing features are kept) or by extraction (a smaller number of new features is created from the old features).
• Useful in many situations that require low dimensional data (data
visualisation, data storage, heavy computation…).
• Commonly used approaches: PCA, ICA
Dimensionality reduction
Let’s call the encoder the process that produces the “new features” representation from the “old features” representation (by selection or by extraction), and the decoder the reverse process.
Dimensionality reduction can then be interpreted as data compression, where the encoder compresses the data (from the initial space to the encoded space, also called the latent space) and the decoder decompresses them.
Source: [Link]
Principal components analysis (PCA)
The idea of PCA is to build n_e new, independent features that are linear combinations of the n_d old features, such that the projections of the data onto the subspace defined by these new features are as close as possible to the initial data (in terms of Euclidean distance).
In other words, PCA looks for the best linear subspace of the initial space (described by an orthogonal basis of new features) such that the error of approximating the data by their projections onto this subspace is as small as possible.
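As an illustration, here is a minimal sketch of this projection view of PCA using NumPy's SVD; the toy data, the choice n_e = 2, and the variable names are assumptions made for the example, not something given in the slides.

```python
import numpy as np

# Toy data: 100 samples described by n_d = 5 old features (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

n_e = 2                                 # number of new features to keep
X_centered = X - X.mean(axis=0)         # PCA works on centred data

# SVD yields an orthogonal basis of the initial space, ordered by explained variance
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
W = Vt[:n_e].T                          # (n_d, n_e): basis of the best linear subspace

Z = X_centered @ W                      # "encoder": project the data onto the new features
X_hat = Z @ W.T + X.mean(axis=0)        # "decoder": reconstruct from the projections

# This basis minimises the Euclidean reconstruction error over all
# n_e-dimensional linear subspaces
print(np.mean(np.sum((X - X_hat) ** 2, axis=1)))
```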
Latent variable models
What is a latent variable?
Myth of the Cave
Plato, Republic
Autoencoders
Typical DNN characteristics
So far, the deep learning models we have seen have the following in common:
• Input layer: a (possibly vectorized) quantitative representation of the data
• Hidden layer(s): apply transformations with nonlinearities
• Output layer: the result for classification, regression, translation, segmentation, etc.
• The models are used for supervised learning
Example
Source: [Link]
Changing the objective!
Now we will talk about unsupervised learning with Deep Neural Networks
Source: [Link]
Autoencoders: definition
Autoencoders are neural networks that are trained to copy their inputs to
their outputs.
• Usually constrained in particular ways to make this task more difficult.
• They compress the input into a lower-dimensional code and then
reconstruct the output from this representation. The code is a compact
“summary” or “compression” of the input, also called the latent-space
representation.
• The structure is almost always organized into an encoder network, f, and a decoder network, g: model = g(f(x))
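A minimal sketch of this encoder/decoder composition in PyTorch; the layer sizes, the 784-dimensional input (e.g., a flattened 28×28 image), and the class name are illustrative assumptions rather than anything prescribed by the slides.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        # Encoder f: input -> lower-dimensional code (the latent-space "summary")
        self.f = nn.Sequential(nn.Linear(input_dim, code_dim), nn.ReLU())
        # Decoder g: code -> reconstruction of the input
        self.g = nn.Sequential(nn.Linear(code_dim, input_dim), nn.Sigmoid())

    def forward(self, x):
        return self.g(self.f(x))        # model = g(f(x))

x = torch.rand(16, 784)                 # a batch of flattened 28x28 inputs
x_hat = Autoencoder()(x)                # reconstruction with the same shape as x
```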
Autoencoders
A special type of feedforward neural network which does the following:
• Encodes its input x_i into a hidden representation h
• Decodes the input again from this hidden representation
• The model is trained to minimize a loss function which ensures that x̂_i is close to x_i
Autoencoders for representation learning
Autoencoders
Autoencoders are mainly a dimensionality reduction (or compression)
algorithm with a couple of important properties:
• Data-specific: They are only able to meaningfully compress data similar to what they have been trained on. Since they learn features specific to the given training data, they are different from a standard data compression algorithm like gzip.
• Lossy: The output of the autoencoder will not be exactly the same as the input; it will be a close but degraded representation.
• Unsupervised: Autoencoders are considered an unsupervised
learning technique since they don’t need explicit labels to train on.
Reconstruction quality
Autoencoders
• Let us consider the case where dim(h) < dim(x_i)
• If we are still able to reconstruct x̂_i perfectly from h, then what does it say about h?
• h is a loss-free encoding of x_i. It captures all the important characteristics of x_i
• Do you see an analogy with PCA?
• An autoencoder where dim(h) < dim(x_i) is called an under-complete autoencoder
Autoencoders
• Let us consider the case when dim(h) ≥ dim(x_i)
• In such a case the autoencoder could learn a trivial encoding by simply copying x_i into h and then copying h into x̂_i
• Such an identity encoding is useless in practice as it does not really tell us anything about the important characteristics of the data
• An autoencoder where dim(h) ≥ dim(x_i) is called an over-complete autoencoder
• Where can such autoencoders be useful?
Autoencoders
Further ahead:
Choice of f(x_i) and g(x_i)
Choice of loss function
Choice of f(x_i) and g(x_i)
Autoencoders
• Suppose all our inputs are binary (each x_ij ∈ {0, 1})
• Which function would be most apt for the decoder?
  • x̂_i = tanh(W*h + c)
  • x̂_i = W*h + c
  • x̂_i = logistic(W*h + c)
• The decoder g is typically chosen as the logistic (sigmoid) function, since it naturally restricts the outputs to lie between 0 and 1
Autoencoders
• Suppose all our inputs are real (each x_ij ∈ R)
• Which function would be most apt for the decoder?
  • x̂_i = tanh(W*h + c)
  • x̂_i = W*h + c
  • x̂_i = logistic(W*h + c)
• What will the logistic and tanh functions do? They will restrict the reconstructed x̂_i to lie in [0, 1] or [-1, 1], whereas we want x̂_i ∈ R^n
• Hence the linear decoder x̂_i = W*h + c is the most apt choice here (the hidden-layer activation is again typically chosen as the sigmoid function)
Choice of loss function
Autoencoders
• Consider the case when the inputs are real valued
• The objective of the autoencoder is to reconstruct x̂_i to be as close to x_i as possible
• We can formalize this using the squared-error objective function:
  min_{W, W*, b, c} (1/m) Σ_{i=1..m} Σ_{j=1..n} (x̂_ij − x_ij)²
• We can train the autoencoder just like a regular feedforward network using backpropagation. All we need is a formula for ∂L(θ)/∂W* and ∂L(θ)/∂W
Autoencoders
• Consider the case when the inputs are binary
• We use a sigmoid decoder, which will produce outputs between 0 and 1 that can be interpreted as probabilities
• For a single n-dimensional i-th input we can use the cross-entropy loss function:
  min − Σ_{j=1..n} [ x_ij log x̂_ij + (1 − x_ij) log(1 − x̂_ij) ]
• What value of x̂_ij will minimize this function? The above function is minimized when x̂_ij = x_ij
• Again, all we need is a formula for ∂L(θ)/∂W* and ∂L(θ)/∂W to use backpropagation
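A hedged sketch of the two loss setups just described (linear decoder with squared error for real-valued inputs, sigmoid decoder with cross-entropy for binary inputs) in PyTorch; the tensor shapes and the random hidden representations are illustrative stand-ins.

```python
import torch
import torch.nn as nn

h = torch.randn(16, 32)                   # hidden representations h (illustrative)
W_star = nn.Linear(32, 100)               # decoder parameters W* and c

# Real-valued inputs: linear decoder + squared-error (MSE) loss
x_real = torch.randn(16, 100)
x_hat_real = W_star(h)                    # x̂ = W*h + c, unrestricted range
loss_real = nn.MSELoss()(x_hat_real, x_real)

# Binary inputs: logistic (sigmoid) decoder + cross-entropy loss
x_bin = torch.randint(0, 2, (16, 100)).float()
x_hat_bin = torch.sigmoid(W_star(h))      # x̂ = logistic(W*h + c), in (0, 1)
loss_bin = nn.BCELoss()(x_hat_bin, x_bin)
```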
Types of Autoencoders
Undercomplete Autoencoders
Undercomplete autoencoders are defined to have a hidden layer h with a smaller dimension than the input layer.
• The network must model x in a lower-dimensional space and map the latent space accurately back to the input space.
• Encoder network: a function that returns a useful, compressed representation of the input.
• If the network has only linear transformations, the encoder learns the same subspace as PCA. With typical nonlinearities, the network learns a generalized, more powerful version of PCA.
Source: [Link]
Architecture
Source: [Link]
Training
Four hyperparameters need to be set before training an autoencoder:
Code size: number of nodes in the middle layer. Smaller size results in
more compression.
Number of layers: the autoencoder can be as deep as we like
Number of nodes per layer: a stacked autoencoder is one where the layers are stacked one after another. Stacked autoencoders usually look like a “sandwich”: the number of nodes per layer decreases with each subsequent layer of the encoder and increases back in the decoder, and the decoder is symmetric to the encoder in terms of layer structure.
Loss function: we either use mean squared error (MSE) or binary
crossentropy. If the input values are in the range [0, 1] then we typically
use crossentropy, otherwise we use the mean squared error.
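Putting these choices together, here is a sketch of a small “sandwich” autoencoder trained by backpropagation; the layer sizes, the code size of 32, the Adam optimizer, and the random stand-in batch are all illustrative assumptions.

```python
import torch
import torch.nn as nn

# "Sandwich" architecture: node counts shrink in the encoder and grow back in the decoder
model = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 32),  nn.ReLU(),      # code size = 32 (the middle layer)
    nn.Linear(32, 128),  nn.ReLU(),
    nn.Linear(128, 784), nn.Sigmoid(),   # inputs assumed to lie in [0, 1]
)
loss_fn = nn.MSELoss()                   # or nn.BCELoss() for inputs in [0, 1]
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(64, 784)                  # stand-in for a batch of flattened images
for epoch in range(10):
    x_hat = model(x)                     # forward pass: reconstruct the input
    loss = loss_fn(x_hat, x)             # reconstruction error
    opt.zero_grad()
    loss.backward()                      # backpropagation
    opt.step()
```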
Training
• We can make the autoencoder very powerful by increasing the number
of layers, nodes per layer and most importantly the code size.
• Increasing these hyperparameters lets the autoencoder learn more complex codings.
• But we should be careful to not make it too powerful. Otherwise the
autoencoder will simply learn to copy its inputs to the output, without
learning any meaningful representation. It will just mimic the identity
function.
• This is why we prefer a “sandwich” architecture, and deliberately keep
the code size small.
• Since the coding layer has a lower dimensionality than the input data,
the autoencoder is said to be undercomplete. It won’t be able to
directly copy its inputs to the output, and will be forced to learn
intelligent features
Denoising autoencoders
Another way to force the autoencoder to learn useful
features is by adding random noise to its inputs and
making it recover the original noise-free data.
• This way the autoencoder can’t simply copy the input to
its output because the input also contains random noise.
• We are asking it to subtract the noise and produce the
underlying meaningful data.
• This is called a denoising autoencoder.
• Robustness: Extends well to real-world tasks such as
removing background noise from audio, or cracks/stains
in images.
Example
Source: [Link]
Denoising autoencoders
• We introduce a corruption process C(x̃ | x), which represents a conditional distribution over corrupted samples x̃ given a data sample x. The autoencoder then learns a reconstruction distribution p_reconstruct(x | x̃) estimated from training pairs (x, x̃) as follows:
  • Sample a training example x from the training data.
  • Sample a corrupted version x̃ from C(x̃ | x = x).
  • Use (x, x̃) as a training example for estimating the autoencoder reconstruction distribution p_reconstruct(x | x̃) = p_decoder(x | h), with h the output of the encoder f(x̃) and p_decoder typically defined by a decoder g(h).
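A sketch of this procedure with additive Gaussian noise as a simple corruption process C(x̃ | x); the noise level, the tiny model, and the random batch are illustrative assumptions.

```python
import torch
import torch.nn as nn

model = nn.Sequential(                     # any small autoencoder works for this sketch
    nn.Linear(784, 64), nn.ReLU(),
    nn.Linear(64, 784), nn.Sigmoid(),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(64, 784)                    # sample x from the training data
x_tilde = x + 0.3 * torch.randn_like(x)    # sample x̃ from C(x̃ | x): add Gaussian noise

x_hat = model(x_tilde)                     # reconstruct from the corrupted input...
loss = loss_fn(x_hat, x)                   # ...but compare against the clean x
opt.zero_grad(); loss.backward(); opt.step()
```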
Sparse autoencoders
Force the autoencoder to learn useful features using regularization
We can regularize the autoencoder by using a sparsity constraint such
that only a fraction of the nodes would have nonzero values, called
active nodes.
Add a penalty term to the loss function such that only a fraction of the nodes become active. This forces the autoencoder to represent each input as a combination of a small number of nodes, and pushes it to discover interesting structure in the data.
This method works even if the code size is large, since only a small
subset of the nodes will be active at any time.
Biological Inspiration: Only a small percentage of neurons fire
simultaneously in the brain.
Practical Effect: The model focuses on the most salient features in the
data, often improving interpretability.
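One common way to realize such a sparsity penalty is an L1 term on the code activations; a minimal sketch follows (the penalty weight lam, the sizes, and the data are illustrative assumptions, and other penalties such as a KL-divergence term are also used in practice).

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU())    # a large code size is fine here
decoder = nn.Sequential(nn.Linear(256, 784), nn.Sigmoid())
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)

lam = 1e-3                                  # weight of the sparsity penalty
x = torch.rand(64, 784)                     # stand-in batch

h = encoder(x)                              # code: only a few nodes should stay active
x_hat = decoder(h)
loss = nn.MSELoss()(x_hat, x) + lam * h.abs().mean()   # reconstruction + L1 sparsity
opt.zero_grad(); loss.backward(); opt.step()
```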
Convolutional autoencoders
• Standard autoencoders flatten images into vectors.
• Convolutional Autoencoders maintain the spatial layout, using
convolution/deconvolution (transposed convolution) layers.
• Better for image data, as the network can capture local patterns (edges,
textures).
• Often yields superior reconstruction quality vs. fully connected
autoencoders.
• Use Cases:
• Denoising of images (e.g., removing Gaussian noise, image inpainting).
• Feature Extraction: The encoder’s output can serve as a learned
representation for classification tasks.
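A minimal convolutional autoencoder sketch in PyTorch for 28×28 single-channel images; the channel counts, kernel sizes, and input shape are illustrative assumptions.

```python
import torch
import torch.nn as nn

conv_ae = nn.Sequential(
    # Encoder: convolutions keep the spatial layout while shrinking it
    nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),    # 28x28 -> 14x14
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),   # 14x14 -> 7x7
    # Decoder: transposed convolutions upsample back to the input size
    nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),     # 7x7 -> 14x14
    nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1), nn.Sigmoid(),   # 14x14 -> 28x28
)

x = torch.rand(8, 1, 28, 28)     # a batch of images, not flattened into vectors
x_hat = conv_ae(x)               # reconstruction with the same shape as x
```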
Case study: Masked Autoencoders (MAEs)
MAEs
• Instead of feeding the whole image into the autoencoder, a large
random subset of patches is masked (removed) from the input.
• The autoencoder only “sees” a fraction (e.g., 25%) of the original
image patches.
• Encoder: Processes the visible patches only (the unmasked
subset), encoding them into latent embeddings.
• Decoder: Takes the latent embeddings (plus positional information
for where each patch belongs) and reconstructs the entire image,
including the masked patches.
MAEs
We drop most of the input patches before encoding, then
train the network to restore the missing content.
This increases the difficulty of the reconstruction task and
encourages the model to learn rich, high-level features
of the data.
MAEs: architecture overview
• Patch Embedding: Split the image into patches (e.g., 16×16 pixels) and embed them (e.g., with a small linear projection or convolution). Example: 196 patches in a 224×224 image with 16×16 patches.
• Random Masking: Randomly select a large portion (e.g., 75%) of
these patches to remove. Remaining subset fed into the encoder.
• Encoder (Transformer or CNN): The encoder only processes the
visible (unmasked) tokens.
• Decoder: A smaller or simpler network that receives the latent
embeddings of the visible patches + learnable placeholders for the
masked patches. It reconstructs the full set of patches (both originally
visible and masked).
• Loss Function: Typically Mean Squared Error (MSE) between
reconstructed and original pixel patches (only for the masked patches,
to focus the learning on what was not seen).
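A simplified sketch of the random masking and the masked-patch loss; the linear “encoder” and “decoder” are tiny stand-ins for the transformer blocks used in the actual MAE, positional embeddings are omitted, and all shapes are illustrative.

```python
import torch
import torch.nn as nn

B, n_patches, patch_dim, dim = 8, 196, 16 * 16 * 3, 128   # 224x224 image, 16x16 patches
mask_ratio = 0.75
n_keep = int(n_patches * (1 - mask_ratio))                 # 49 visible patches

patches = torch.rand(B, n_patches, patch_dim)              # flattened patch pixels (stand-in)

# Random masking: shuffle patch indices per image; the first n_keep stay visible
perm = torch.rand(B, n_patches).argsort(dim=1)
keep_idx, mask_idx = perm[:, :n_keep], perm[:, n_keep:]

def gather(t, idx):
    return torch.gather(t, 1, idx.unsqueeze(-1).expand(-1, -1, t.shape[-1]))

visible = gather(patches, keep_idx)

encoder = nn.Linear(patch_dim, dim)                # stand-in for the ViT encoder
decoder = nn.Linear(dim, patch_dim)                # stand-in for the lightweight decoder
mask_token = nn.Parameter(torch.zeros(1, 1, dim))  # learnable placeholder for masked patches

latent = encoder(visible)                          # the encoder sees visible patches only
placeholders = mask_token.expand(B, n_patches - n_keep, dim)
full = torch.cat([latent, placeholders], dim=1)    # visible latents + placeholders
recon = decoder(full)                              # reconstruct pixels for every position

# Loss is computed only on the masked patches
loss = nn.MSELoss()(recon[:, n_keep:], gather(patches, mask_idx))
```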
Key contributions
• High Masking Ratio
• By masking most of the image, the autoencoder is forced to learn
global structures and semantic context, not just local textures.
• Leads to representations that generalize well to downstream tasks
(e.g., classification, detection) once you fine-tune the encoder.
• Efficiency
• The encoder only processes a fraction of the tokens, making the
training more computationally feasible than if it had to process the
entire image.
• Self-Supervised
• No labels needed. The task is purely “reconstruct missing patches,”
which can be done with any unlabeled image dataset.
Training MAEs
• Forward Pass
• Sample an image, mask out patches.
• Encode visible patches → produce latent vectors.
• Decode latent vectors + masked placeholders → reconstruct the missing
patches.
• Loss Computation
• Compare reconstructed masked patches vs. the original masked patches.
• Typical metric is L2 / MSE loss for pixel-wise difference.
• Backpropagation
• Update encoder + decoder weights to minimize reconstruction error.
• Representation Learning
• After training, we can discard the decoder and use the encoder’s latent
representation for downstream tasks (e.g., image classification). Often, we
just fine-tune the encoder on a small set of labeled images.
Visual results
Source: [Link]
Case study: U-net for image segmentation
U-net architecture
U-net architecture
• Encoder (Contraction Path)
• A series of convolution + pooling layers that capture what is in the image
(context).
• Gradually reduces the spatial dimensions while increasing feature depth.
• Decoder (Expansion Path)
• A series of transposed convolutions (or up-convolutions) that upscale the
feature maps.
• Recovers the spatial resolution needed for a segmentation map (pixel-
level predictions).
• Skip Connections
• Directly connect encoder feature maps to decoder layers at the same
scale.
• Helps retain detailed spatial information lost in pooling.
• Greatly improves segmentation accuracy, especially for medical imaging.
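A toy sketch of this contract/expand structure with a single skip connection; the channel counts and the 64×64 input are far smaller than in the real U-Net and are purely illustrative.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, in_ch=1, n_classes=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)                        # contraction: halve resolution
        self.mid = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)  # expansion: restore resolution
        # after concatenating the skip connection: 16 (upsampled) + 16 (skip) channels
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16, n_classes, 1)            # per-pixel class scores

    def forward(self, x):
        e = self.enc(x)                       # encoder features, kept for the skip
        d = self.up(self.mid(self.pool(e)))
        d = torch.cat([d, e], dim=1)          # skip connection: reuse fine spatial detail
        return self.head(self.dec(d))         # segmentation map: one score map per class

logits = TinyUNet()(torch.rand(4, 1, 64, 64))   # -> (4, 2, 64, 64)
```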
U-net architecture
Source: [Link]
U-net is an autoencoder?
• Encoder-Decoder Concept:
• Like an autoencoder, U-Net “encodes” the input into a
lower-dimensional feature space and then “decodes” it
back to an image-sized output.
• Primary Difference:
• Instead of reconstructing the same input, U-Net is
reconstructing a segmentation mask (i.e., class labels for
each pixel).
Use cases
• Medical Image Segmentation: Segmenting organs or
tumors in MRI or CT scans (the original U-Net paper
focuses on biomedical images).
• Autonomous Driving: Segmenting roads, cars,
pedestrians in real-time from camera feeds.
• Satellite Image Analysis: Detecting buildings, roads,
vegetation from aerial imagery.
Applications of Autoencoders
Data denoising
Dimensionality reduction
Information retrieval
Anomaly detection
Content generation (Generative models) – VAEs