
Deep Learning

Dr. Pratik Narang
Department of CSIS, BITS Pilani, Pilani Campus

Dimensionality reduction

• In machine learning, dimensionality reduction is the process of reducing the number of features that describe some data.
• This reduction is done either by selection (only some existing features are conserved) or by extraction (a reduced number of new features are created based on the old features).
• Useful in many situations that require low-dimensional data (data visualisation, data storage, heavy computation, ...).
• Commonly used approaches: PCA, ICA


Dimensionality reduction

Let's call the encoder the process that produces the "new features" representation from the "old features" representation (by selection or by extraction), and the decoder the reverse process. Dimensionality reduction can then be interpreted as data compression, where the encoder compresses the data (from the initial space to the encoded space, also called the latent space) whereas the decoder decompresses it.

Source: [Link]


Principal components analysis (PCA)

The idea of PCA is to build n_e new independent features that are linear combinations of the n_d old features, such that the projections of the data on the subspace defined by these new features are as close as possible to the initial data (in terms of Euclidean distance).
In other words, PCA looks for the best linear subspace of the initial space (described by an orthogonal basis of new features) such that the error of approximating the data by their projections on this subspace is as small as possible.
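As a concrete illustration of this view of PCA as an encode/decode pair, here is a minimal NumPy sketch using the SVD; the data matrix X and the values n_d = 10, n_e = 3 are hypothetical choices for the example, not anything prescribed above.

```python
# Minimal sketch: PCA as a linear encoder/decoder (illustrative data and sizes).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))          # 500 samples with n_d = 10 old features
n_e = 3                                 # number of new features to keep

mu = X.mean(axis=0)
Xc = X - mu                             # centre the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt[:n_e].T                          # orthogonal basis of the best linear subspace

Z = Xc @ W                              # "encoder": project onto the subspace
X_hat = Z @ W.T + mu                    # "decoder": map back to the initial space

print("mean squared reconstruction error:", np.mean((X - X_hat) ** 2))
```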


Latent variable models


What is a latent variable?

Myth of the Cave


Plato, Republic


Autoencoders
Typical DNN characteristics

So far, the deep learning models we have seen have the following things in common:

• Input layer: a (possibly vectorized) quantitative representation of the data
• Hidden layer(s): apply transformations with nonlinearity
• Output layer: result for classification, regression, translation, segmentation, etc.
• Models used for supervised learning



Example

Source: [Link]


Changing the objective!

Now we will talk about unsupervised learning with Deep Neural Networks

Source: [Link]


Autoencoders: definition

Autoencoders are neural networks that are trained to copy their inputs to their outputs.

• Usually constrained in particular ways to make this task more difficult.
• They compress the input into a lower-dimensional code and then reconstruct the output from this representation. The code is a compact "summary" or "compression" of the input, also called the latent-space representation.
• The structure is almost always organized into an encoder network, f, and a decoder network, g: model = g(f(x)). A minimal sketch follows.
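A minimal PyTorch sketch of the model = g(f(x)) structure; the 784-dimensional input and 32-dimensional code are illustrative assumptions, not sizes fixed by the slides.

```python
# Sketch of an autoencoder as encoder f followed by decoder g (sizes illustrative).
import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(784, 32), nn.ReLU())      # encoder: input -> code
g = nn.Sequential(nn.Linear(32, 784), nn.Sigmoid())   # decoder: code -> reconstruction

x = torch.rand(16, 784)       # a batch of hypothetical flattened inputs in [0, 1]
h = f(x)                      # latent-space representation (the "code")
x_hat = g(h)                  # model = g(f(x))
print(h.shape, x_hat.shape)   # torch.Size([16, 32]) torch.Size([16, 784])
```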



Autoencoders

A special type of feedforward neural network which does the following:
• Encodes its input x_i into a hidden representation h
• Decodes the input again from this hidden representation

The model is trained to minimize a loss function which will ensure that x̂_i is close to x_i.



Autoencoders for representation learning



Autoencoders

Autoencoders are mainly a dimensionality reduction (or compression) algorithm with a couple of important properties:

• Data-specific: They are only able to meaningfully compress data similar to what they have been trained on. Since they learn features specific to the given training data, they are different from a standard data compression algorithm like gzip.
• Lossy: The output of the autoencoder will not be exactly the same as the input; it will be a close but degraded representation.
• Unsupervised: Autoencoders are considered an unsupervised learning technique since they don't need explicit labels to train on.



Reconstruction quality



Autoencoders

• Let us consider the case where dim(h) < dim(x_i)
• If we are still able to reconstruct x̂_i perfectly from h, then what does it say about h?
• h is a loss-free encoding of x_i. It captures all the important characteristics of x_i.
• Do you see an analogy with PCA?
• An autoencoder where dim(h) < dim(x_i) is called an undercomplete autoencoder.


Autoencoders

• Let us consider the case when dim(h) ≥ dim(x_i)
• In such a case the autoencoder could learn a trivial encoding by simply copying x_i into h and then copying h into x̂_i
• Such an identity encoding is useless in practice as it does not really tell us anything about the important characteristics of the data

An autoencoder where dim(h) ≥ dim(x_i) is called an over-complete autoencoder. Where can they be useful?

Autoencoders

Further ahead:

Choice of f(x_i) and g(x_i)

Choice of loss function


Choice of f(x_i) and g(x_i)


Autoencoders

• Suppose all our inputs are binary (each x_ij ∈ {0, 1})
• Which function would be most apt for the decoder?
  • x̂_i = tanh(W*h + c)
  • x̂_i = W*h + c
  • x̂_i = logistic(W*h + c)

The logistic function is the apt choice, since it restricts the outputs to lie between 0 and 1; g is typically chosen as the sigmoid (logistic) function.



Autoencoders

• Suppose all our inputs are real (each x_ij ∈ R)
• Which function would be most apt for the decoder?
  • x̂_i = tanh(W*h + c)
  • x̂_i = W*h + c
  • x̂_i = logistic(W*h + c)

What will the logistic and tanh functions do? They will restrict the reconstructed x̂_i to lie in [0, 1] or [-1, 1], whereas we want x̂_i ∈ R^n. Hence the linear decoder x̂_i = W*h + c is the apt choice for real-valued inputs.


Choice of loss function


Autoencoders

• Consider the case when the inputs are real valued
• The objective of the autoencoder is to reconstruct x̂_i to be as close to x_i as possible
• We can formalize this using the squared-error objective function:

  min_θ (1/m) Σ_i Σ_j (x̂_ij − x_ij)²,  where θ = {W, W*, b, c}

We can train the autoencoder just like a regular feedforward network using backpropagation. All we need is a formula for ∂L(θ)/∂W* and ∂L(θ)/∂W.
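A hedged sketch of this training loop in PyTorch: autograd supplies the gradients ∂L(θ)/∂W* and ∂L(θ)/∂W, and the loss is the squared-error objective above. The layer sizes, optimizer, learning rate and synthetic data are assumptions made only for the example.

```python
# Training an autoencoder on real-valued inputs with the squared-error loss.
import torch
import torch.nn as nn

encoder = nn.Linear(20, 5)    # parameters W, b
decoder = nn.Linear(5, 20)    # parameters W*, c (linear output for real-valued inputs)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

X = torch.randn(256, 20)      # hypothetical real-valued training data
for step in range(100):
    h = torch.sigmoid(encoder(X))          # hidden representation
    x_hat = decoder(h)                     # reconstruction
    loss = ((x_hat - X) ** 2).mean()       # (1/m) sum_i sum_j (x_hat_ij - x_ij)^2
    opt.zero_grad()
    loss.backward()                        # backpropagation computes the gradients
    opt.step()
```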



Autoencoders

• Consider the case when the inputs are binary
• We use a sigmoid decoder which will produce outputs between 0 and 1, and these can be interpreted as probabilities.
• For a single n-dimensional i-th input we can use the following cross-entropy loss function:

  min −Σ_j [ x_ij log x̂_ij + (1 − x_ij) log(1 − x̂_ij) ]

Again, all we need is a formula for ∂L(θ)/∂W* and ∂L(θ)/∂W to use backpropagation.

What value of x̂_ij will minimize this function? The above function will be minimized when x̂_ij = x_ij.
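The sketch below evaluates this cross-entropy loss by hand for one hypothetical binary input and checks it against PyTorch's built-in binary_cross_entropy; the tensors are random placeholders, not real data.

```python
# Cross-entropy reconstruction loss for binary inputs (one illustrative example).
import torch
import torch.nn.functional as F

x = torch.randint(0, 2, (1, 8)).float()              # one binary input x_i
x_hat = torch.rand(1, 8).clamp(1e-6, 1 - 1e-6)       # sigmoid-decoder output in (0, 1)

# -sum_j [ x_ij * log(x_hat_ij) + (1 - x_ij) * log(1 - x_hat_ij) ]
manual = -(x * torch.log(x_hat) + (1 - x) * torch.log(1 - x_hat)).sum()
builtin = F.binary_cross_entropy(x_hat, x, reduction="sum")
print(manual.item(), builtin.item())   # the two values agree
# The loss reaches its minimum when x_hat_ij equals x_ij for every j.
```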



Types of Autoencoders
Undercomplete Autoencoders

Undercomplete autoencoders are defined to have a hidden layer h with a smaller dimension than the input layer.
• The network must model x in a lower-dimensional space and map the latent space accurately back to the input space.
• Encoder network: a function that returns a useful, compressed representation of the input.
• If the network has only linear transformations, the encoder learns PCA. With typical nonlinearities, the network learns a generalized, more powerful version of PCA.

Source: [Link]


Architecture

Source: [Link]


Training

Four hyperparameters need to be set before training an autoencoder (a code sketch follows this list):

• Code size: the number of nodes in the middle layer. A smaller size results in more compression.
• Number of layers: the autoencoder can be as deep as we like.
• Number of nodes per layer: a stacked autoencoder is one where the layers are stacked one after another. Stacked autoencoders usually look like a "sandwich": the number of nodes per layer decreases with each subsequent layer of the encoder, and increases back in the decoder. The decoder is also symmetric to the encoder in terms of layer structure.
• Loss function: we use either mean squared error (MSE) or binary crossentropy. If the input values are in the range [0, 1] then we typically use crossentropy, otherwise we use mean squared error.
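One plausible way to turn these four hyperparameters into a "sandwich" stacked autoencoder in PyTorch; the helper name build_autoencoder and the default sizes (784 → 256 → 64 → 16) are illustrative assumptions, not values from the slides.

```python
# Sketch: build a symmetric ("sandwich") stacked autoencoder from hyperparameters.
import torch.nn as nn

def build_autoencoder(input_dim=784, hidden=(256, 64), code_size=16):
    sizes = [input_dim, *hidden, code_size]                 # e.g. 784 -> 256 -> 64 -> 16
    enc, dec = [], []
    for a, b in zip(sizes[:-1], sizes[1:]):                 # shrinking encoder
        enc += [nn.Linear(a, b), nn.ReLU()]
    rev = sizes[::-1]
    for a, b in zip(rev[:-1], rev[1:]):                     # mirrored, growing decoder
        dec += [nn.Linear(a, b), nn.ReLU()]
    dec[-1] = nn.Sigmoid()      # inputs in [0, 1]: sigmoid output + crossentropy loss
    return nn.Sequential(*enc), nn.Sequential(*dec)

encoder, decoder = build_autoencoder()
```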



Training

• We can make the autoencoder very powerful by increasing the number of layers, the nodes per layer and, most importantly, the code size.
• Increasing these hyperparameters will let the autoencoder learn more complex codings.
• But we should be careful not to make it too powerful. Otherwise the autoencoder will simply learn to copy its inputs to the output without learning any meaningful representation; it will just mimic the identity function.
• This is why we prefer a "sandwich" architecture and deliberately keep the code size small.
• Since the coding layer has a lower dimensionality than the input data, the autoencoder is said to be undercomplete. It won't be able to directly copy its inputs to the output, and will be forced to learn intelligent features.



Denoising autoencoders

Another way to force the autoencoder to learn useful features is by adding random noise to its inputs and making it recover the original noise-free data.

• This way the autoencoder can't simply copy the input to its output because the input also contains random noise.
• We are asking it to subtract the noise and produce the underlying meaningful data.
• This is called a denoising autoencoder.
• Robustness: extends well to real-world tasks such as removing background noise from audio, or cracks/stains in images.

Example

Source: [Link]


Denoising autoencoders

• We introduce a corruption process C(x̃ | x), which represents a conditional distribution over corrupted samples x̃, given a data sample x. The autoencoder then learns a reconstruction distribution p_reconstruct(x | x̃), estimated from training pairs (x, x̃) as follows (a code sketch follows this list):

  • Sample a training example x from the training data.
  • Sample a corrupted version x̃ from C(x̃ | x = x).
  • Use (x, x̃) as a training example for estimating the autoencoder reconstruction distribution p_reconstruct(x | x̃) = p_decoder(x | h), with h the output of the encoder f(x̃) and p_decoder typically defined by a decoder g(h).
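A hedged sketch of this recipe with a simple Gaussian corruption standing in for C(x̃ | x); the model, the noise level 0.3 and the data are illustrative assumptions. Note that the reconstruction target is the clean x, not the corrupted x̃.

```python
# Denoising autoencoder step: corrupt the input, reconstruct the clean target.
import torch
import torch.nn as nn

autoencoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU(),
                            nn.Linear(64, 784), nn.Sigmoid())
opt = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)

x = torch.rand(32, 784)                                   # clean hypothetical inputs in [0, 1]
x_tilde = (x + 0.3 * torch.randn_like(x)).clamp(0, 1)     # sample x~ from C(x~ | x)

x_hat = autoencoder(x_tilde)                              # h = f(x~), reconstruction g(h)
loss = ((x_hat - x) ** 2).mean()                          # compare against the clean x
opt.zero_grad()
loss.backward()
opt.step()
```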



Sparse autoencoders

Force the autoencoder to learn useful features using regularization.

We can regularize the autoencoder by using a sparsity constraint such that only a fraction of the nodes have nonzero values; these are called active nodes.
Add a penalty term to the loss function such that only a fraction of the nodes become active (see the sketch below). This forces the autoencoder to represent each input as a combination of a small number of nodes, and pushes it to discover interesting structure in the data.
This method works even if the code size is large, since only a small subset of the nodes will be active at any time.
Biological Inspiration: Only a small percentage of neurons fire simultaneously in the brain.
Practical Effect: The model focuses on the most salient features in the data, often improving interpretability.
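A minimal sketch of one possible sparsity penalty: an L1 term on the code activations added to the reconstruction loss (a KL penalty on average activations is an equally common choice). The layer sizes and sparsity_weight are illustrative assumptions.

```python
# Sparse autoencoder step: reconstruction loss + L1 penalty on the activations.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 256), nn.Sigmoid())   # deliberately large code
decoder = nn.Sequential(nn.Linear(256, 784), nn.Sigmoid())
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
sparsity_weight = 1e-3                                       # illustrative value

x = torch.rand(32, 784)
h = encoder(x)                                               # code activations
x_hat = decoder(h)
loss = ((x_hat - x) ** 2).mean() + sparsity_weight * h.abs().mean()
opt.zero_grad()
loss.backward()
opt.step()
```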



Convolutional autoencoders

• Standard autoencoders flatten images into vectors.
• Convolutional autoencoders maintain the spatial layout, using convolution/deconvolution (transposed convolution) layers (see the sketch below).
• Better for image data, as the network can capture local patterns (edges, textures).
• Often yields superior reconstruction quality vs. fully connected autoencoders.
• Use Cases:
  • Denoising of images (e.g., removing Gaussian noise, image inpainting).
  • Feature Extraction: the encoder's output can serve as a learned representation for classification tasks.
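A minimal convolutional autoencoder sketch for 1×28×28 images, assuming strided Conv2d layers in the encoder and ConvTranspose2d (transposed convolution) layers in the decoder; the image size and channel counts are illustrative.

```python
# Convolutional autoencoder: the spatial layout is preserved end to end.
import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),    # 28x28 -> 14x14
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),   # 14x14 -> 7x7
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),    # 7x7 -> 14x14
    nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1), nn.Sigmoid(),  # 14x14 -> 28x28
)

x = torch.rand(8, 1, 28, 28)          # a batch of hypothetical grayscale images
x_hat = decoder(encoder(x))
print(x_hat.shape)                    # torch.Size([8, 1, 28, 28])
```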



Case study: Masked Autoencoders (MAEs)



MAEs

• Instead of feeding the whole image into the autoencoder, a large random subset of patches is masked (removed) from the input.
• The autoencoder only "sees" a fraction (e.g., 25%) of the original image patches.
• Encoder: Processes the visible patches only (the unmasked subset), encoding them into latent embeddings.
• Decoder: Takes the latent embeddings (plus positional information for where each patch belongs) and reconstructs the entire image, including the masked patches.



MAEs

We drop most of the input patches before encoding, then train the network to restore the missing content.

This increases the difficulty of the reconstruction task and encourages the model to learn rich, high-level features of the data.



MAEs: architecture overview

• Patch Embedding: Split the image into patches (e.g., 16×16 pixels) and embed them (like a small linear projection or convolution). Example: 196 patches in a 224×224 image with 16×16 patches.
• Random Masking: Randomly select a large portion (e.g., 75%) of these patches to remove. The remaining subset is fed into the encoder (see the sketch after this list).
• Encoder (Transformer or CNN): The encoder only processes the visible (unmasked) tokens.
• Decoder: A smaller or simpler network that receives the latent embeddings of the visible patches plus learnable placeholders for the masked patches. It reconstructs the full set of patches (both originally visible and masked).
• Loss Function: Typically mean squared error (MSE) between the reconstructed and original pixel patches (only for the masked patches, to focus the learning on what was not seen).
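The sketch below covers only the patch-embedding and random-masking steps, using the numbers quoted above (224×224 image, 16×16 patches, 75% masking); the encoder, decoder and positional embeddings are omitted, and the image tensor is a random placeholder.

```python
# MAE pre-processing sketch: split an image into patches and mask 75% of them.
import torch

img = torch.rand(3, 224, 224)                       # one hypothetical RGB image
p = 16
patches = img.unfold(1, p, p).unfold(2, p, p)       # 3 x 14 x 14 x 16 x 16
patches = patches.permute(1, 2, 0, 3, 4).reshape(196, 3 * p * p)   # 196 patches x 768

mask_ratio = 0.75
num_keep = int(196 * (1 - mask_ratio))              # 49 visible patches
perm = torch.randperm(196)
visible_idx, masked_idx = perm[:num_keep], perm[num_keep:]

visible_patches = patches[visible_idx]              # only these reach the encoder
print(visible_patches.shape)                        # torch.Size([49, 768])
```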



Key contributions

• High Masking Ratio


• By masking most of the image, the autoencoder is forced to learn
global structures and semantic context, not just local textures.
• Leads to representations that generalize well to downstream tasks
(e.g., classification, detection) once you fine-tune the encoder.

• Efficiency
• The encoder only processes a fraction of the tokens, making the
training more computationally feasible than if it had to process the
entire image.

• Self-Supervised
• No labels needed. The task is purely “reconstruct missing patches,”
which can be done with any unlabeled image dataset.



Training MAEs

• Forward Pass
• Sample an image, mask out patches.
• Encode visible patches → produce latent vectors.
• Decode latent vectors + masked placeholders → reconstruct the missing
patches.

• Loss Computation
• Compare reconstructed masked patches vs. the original masked patches (see the sketch after this list).
• Typical metric is L2 / MSE loss for pixel-wise difference.

• Backpropagation
• Update encoder + decoder weights to minimize reconstruction error.

• Representation Learning
• After training, we can discard the decoder and use the encoder’s latent
representation for downstream tasks (e.g., image classification). Often, we
just fine-tune the encoder on a small set of labeled images.
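A small sketch of the loss-computation step, assuming the original and reconstructed patches are already available as 196×768 tensors (continuing the patch numbers used earlier); the MSE is averaged only over the masked indices, after which backpropagation would update the encoder and decoder.

```python
# MAE loss sketch: mean squared error over the masked patches only.
import torch

original = torch.rand(196, 768)                # ground-truth pixel patches (placeholder)
reconstructed = torch.rand(196, 768)           # decoder output for all patches (placeholder)
masked_idx = torch.randperm(196)[:147]         # indices of the ~75% patches that were hidden

masked_loss = ((reconstructed[masked_idx] - original[masked_idx]) ** 2).mean()
full_loss = ((reconstructed - original) ** 2).mean()     # for contrast: loss on every patch
print(masked_loss.item(), full_loss.item())
```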
Visual results

Source: [Link]
Case study: U-net for image segmentation



U-net architecture



U-net architecture

• Encoder (Contraction Path)
  • A series of convolution + pooling layers that capture what is in the image (context).
  • Gradually reduces the spatial dimensions while increasing feature depth.

• Decoder (Expansion Path)
  • A series of transposed convolutions (or up-convolutions) that upscale the feature maps.
  • Recovers the spatial resolution needed for a segmentation map (pixel-level predictions).

• Skip Connections (see the sketch after this list)
  • Directly connect encoder feature maps to decoder layers at the same scale.
  • Helps retain detailed spatial information lost in pooling.
  • Greatly improves segmentation accuracy, especially for medical imaging.


U-net architecture

Source: [Link]


Is U-Net an autoencoder?

• Encoder-Decoder Concept: like an autoencoder, U-Net "encodes" the input into a lower-dimensional feature space and then "decodes" it back to an image-sized output.
• Primary Difference: instead of reconstructing the same input, U-Net reconstructs a segmentation mask (i.e., class labels for each pixel).



Use cases

• Medical Image Segmentation: segmenting organs or tumors in MRI or CT scans (the original U-Net paper focuses on biomedical images).
• Autonomous Driving: segmenting roads, cars, pedestrians in real time from camera feeds.
• Satellite Image Analysis: detecting buildings, roads, vegetation from aerial imagery.



Applications of Autoencoders

Data denoising

Dimensionality reduction

Information retrieval

Anomaly detection

Content generation (Generative models) – VAEs
