Module 7.1: Introduction to Autoencoders
An autoencoder is a special type of feed forward neural network which does the following:
it encodes its input xi into a hidden representation h, and
it decodes the input again from this hidden representation.
The model is trained to minimize a certain loss function which will ensure that x̂i is close to xi (we will see some such loss functions soon).

[Figure: encoder-decoder network xi → h → x̂i, with encoder weights W and decoder weights W∗]

h = g(W xi + b)
x̂i = f(W∗ h + c)
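A minimal sketch of this encode/decode computation in NumPy; the dimensions, the random initialisation, and the choice of sigmoid for both g and f are illustrative assumptions, not prescribed by the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# illustrative sizes: 6-dimensional input, 3-dimensional hidden code
n, d = 6, 3
rng = np.random.default_rng(0)
W  = rng.normal(scale=0.1, size=(d, n))   # encoder weights W
b  = np.zeros(d)                          # encoder bias b
Ws = rng.normal(scale=0.1, size=(n, d))   # decoder weights W*
c  = np.zeros(n)                          # decoder bias c

x = rng.random(n)                         # one input x_i
h    = sigmoid(W @ x + b)                 # h = g(W x_i + b)
xhat = sigmoid(Ws @ h + c)                # x̂_i = f(W* h + c)
```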
Let us consider the case where dim(h) < dim(xi).
If we are still able to reconstruct x̂i perfectly from h, then what does it say about h?
h is a loss-free encoding of xi: it captures all the important characteristics of xi.
Do you see an analogy with PCA?

[Figure: same network, now with a narrower hidden layer h]

h = g(W xi + b)
x̂i = f(W∗ h + c)

An autoencoder where dim(h) < dim(xi) is called an undercomplete autoencoder.
Let us consider the case when dim(h) ≥ dim(xi).
In such a case the autoencoder could learn a trivial encoding by simply copying xi into h and then copying h into x̂i.
Such an identity encoding is useless in practice as it does not really tell us anything about the important characteristics of the data.

[Figure: same network, now with a wider hidden layer h]

h = g(W xi + b)
x̂i = f(W∗ h + c)

An autoencoder where dim(h) ≥ dim(xi) is called an overcomplete autoencoder.
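To make the overcomplete case concrete, the sketch below (all dimensions made up, and g, f taken as linear purely for illustration) shows how such an autoencoder can reconstruct xi perfectly by simply copying it into h and back, which is exactly the useless identity encoding described above.

```python
import numpy as np

d_in  = 6   # dim(x_i)
d_over = 8  # dim(h) >= dim(x_i): overcomplete

# identity-like maps: W copies x_i into the first d_in components of h
# (padding the rest with zeros), and W* copies them straight back.
W  = np.vstack([np.eye(d_in), np.zeros((d_over - d_in, d_in))])  # (8, 6)
Ws = np.hstack([np.eye(d_in), np.zeros((d_in, d_over - d_in))])  # (6, 8)

x = np.random.default_rng(1).random(d_in)
h    = W @ x            # h = [x; 0], a trivial copy of the input
xhat = Ws @ h           # x̂_i = x_i exactly
assert np.allclose(xhat, x)   # perfect reconstruction, yet h tells us nothing new
```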
The Road Ahead
Choice of f and g
Choice of loss function
Suppose all our inputs are binary (each xij ∈ {0, 1}).
Which of the following functions would be most apt for the decoder?

x̂i = tanh(W∗ h + c)
x̂i = W∗ h + c
x̂i = logistic(W∗ h + c)

Logistic, as it naturally restricts all outputs to be between 0 and 1.
g is typically chosen as the sigmoid function.

[Figure: network with h = g(W xi + b), x̂i = f(W∗ h + c); input 0 1 1 0 1 (binary inputs)]
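A tiny illustration (the pre-activation values are made up) of why the logistic choice is apt: whatever W∗h + c turns out to be, each output lands strictly between 0 and 1 and can be read as the probability that xij = 1.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

pre_activation = np.array([-4.0, -0.5, 0.0, 2.0, 7.0])   # W* h + c, made-up values
x_hat = logistic(pre_activation)
print(x_hat)   # every entry lies strictly between 0 and 1
```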
Suppose all our inputs are real (each xij ∈ R).
Which of the following functions would be most apt for the decoder?

x̂i = tanh(W∗ h + c)
x̂i = W∗ h + c
x̂i = logistic(W∗ h + c)

What will logistic and tanh do? They will restrict the reconstructed x̂i to lie in [0, 1] or [-1, 1] whereas we want x̂i ∈ Rn, so the linear decoder x̂i = W∗ h + c is the apt choice.
Again, g is typically chosen as the sigmoid function.

[Figure: network with h = g(W xi + b), x̂i = f(W∗ h + c); input 0.25 0.5 1.25 3.5 4.5 (real valued inputs)]
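A brief sketch contrasting the decoders for real-valued inputs (the code vector and weights below are made up); only the linear decoder can reach targets outside [-1, 1] such as 3.5 or 4.5.

```python
import numpy as np

h  = np.array([0.2, 0.9, 0.4])                 # some hidden code (made up)
Ws = np.array([[ 3.0, -1.0, 2.0],
               [ 0.5,  4.0, 1.0]])             # decoder weights W* (made up)
c  = np.array([1.0, -2.0])

x_hat_linear = Ws @ h + c                      # unbounded: can match any real x_i
x_hat_tanh   = np.tanh(Ws @ h + c)             # squashed into [-1, 1]
print(x_hat_linear, x_hat_tanh)
```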
Consider the case when the inputs are real valued.
The objective of the autoencoder is to reconstruct x̂i to be as close to xi as possible.
This can be formalized using the following objective function:

min_{W,W∗,c,b} (1/m) Σ_{i=1}^{m} Σ_{j=1}^{n} (x̂ij − xij)²

i.e., min_{W,W∗,c,b} (1/m) Σ_{i=1}^{m} (x̂i − xi)ᵀ(x̂i − xi)

We can then train the autoencoder just like a regular feedforward network using backpropagation.
All we need is a formula for ∂L(θ)/∂W∗ and ∂L(θ)/∂W, which we will see now.

[Figure: network xi → h → x̂i with h = g(W xi + b), x̂i = f(W∗ h + c)]
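A hedged sketch of this objective evaluated on a small batch; the shapes, the random data, and the use of a sigmoid encoder with a linear decoder are assumptions consistent with the earlier slides rather than anything fixed by the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m, n, d = 4, 6, 3                           # m examples, n-dim inputs, d-dim code (made up)
X  = rng.random((m, n))                     # rows are the inputs x_i
W  = rng.normal(scale=0.1, size=(d, n)); b = np.zeros(d)
Ws = rng.normal(scale=0.1, size=(n, d)); c = np.zeros(n)

H     = sigmoid(X @ W.T + b)                # h_i = g(W x_i + b), one row per example
X_hat = H @ Ws.T + c                        # linear decoder for real-valued inputs

loss = np.mean(np.sum((X_hat - X) ** 2, axis=1))   # (1/m) Σ_i Σ_j (x̂_ij - x_ij)²
```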
L(θ) = (x̂i − xi)ᵀ(x̂i − xi)

[Figure: two-layer network h0 = xi → a1 → h1 → a2 → h2 = x̂i, with encoder weights W and decoder weights W∗]

∂L(θ)/∂W∗ = ∂L(θ)/∂h2 · ∂h2/∂a2 · ∂a2/∂W∗

∂L(θ)/∂W = ∂L(θ)/∂h2 · ∂h2/∂a2 · ∂a2/∂h1 · ∂h1/∂a1 · ∂a1/∂W

We have already seen how to calculate all the other factors in these chains when we learnt backpropagation. The remaining term is:

∂L(θ)/∂h2 = ∂L(θ)/∂x̂i = ∇x̂i {(x̂i − xi)ᵀ(x̂i − xi)} = 2(x̂i − xi)

Note that the loss function is shown for only one training example.
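The same gradients written out by hand for a single example, as a sketch; it assumes a sigmoid encoder g and an identity decoder f (the linear choice for real-valued inputs), so ∂h2/∂a2 is the identity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, d = 6, 3
x  = rng.random(n)
W  = rng.normal(scale=0.1, size=(d, n)); b = np.zeros(d)
Ws = rng.normal(scale=0.1, size=(n, d)); c = np.zeros(n)

# forward pass: g is sigmoid, f is identity (real-valued inputs)
a1 = W @ x + b;   h1 = sigmoid(a1)
a2 = Ws @ h1 + c; x_hat = a2              # h2 = x̂_i

# backward pass for L(θ) = (x̂_i - x_i)ᵀ(x̂_i - x_i)
dL_dxhat = 2.0 * (x_hat - x)              # ∂L/∂h2 = 2(x̂_i - x_i)
dL_da2   = dL_dxhat                       # f is identity, so ∂h2/∂a2 = I
dL_dWs   = np.outer(dL_da2, h1)           # ∂L/∂W*
dL_dh1   = Ws.T @ dL_da2
dL_da1   = dL_dh1 * h1 * (1.0 - h1)       # σ'(a1) = σ(a1)(1 - σ(a1))
dL_dW    = np.outer(dL_da1, x)            # ∂L/∂W
dL_dc, dL_db = dL_da2, dL_da1             # bias gradients
```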
Consider the case when the inputs are binary.
We use a sigmoid decoder which will produce outputs between 0 and 1, and these can be interpreted as probabilities.
For a single n-dimensional ith input we can use the following loss function:

min {− Σ_{j=1}^{n} (xij log x̂ij + (1 − xij) log(1 − x̂ij))}

What value of x̂ij will minimize this function?
If xij = 1?
If xij = 0?
Indeed the above function is minimized when x̂ij = xij!
Again, all we need is a formula for ∂L(θ)/∂W∗ and ∂L(θ)/∂W to use backpropagation.

[Figure: network with h = g(W xi + b), x̂i = f(W∗ h + c); input 0 1 1 0 1 (binary inputs)]
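A brief sketch of this loss evaluated for one example (the particular vectors are made up); pushing x̂i towards xi drives the loss towards 0.

```python
import numpy as np

x     = np.array([0., 1., 1., 0., 1.])            # binary input x_i
x_hat = np.array([0.1, 0.8, 0.7, 0.2, 0.9])       # sigmoid decoder outputs (made up)

# L(θ) = -Σ_j [ x_ij log x̂_ij + (1 - x_ij) log(1 - x̂_ij) ]
loss = -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))
print(loss)   # shrinks as x_hat approaches x
```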
L(θ) = − Σ_{j=1}^{n} (xij log x̂ij + (1 − xij) log(1 − x̂ij))

[Figure: two-layer network h0 = xi → a1 → h1 → a2 → h2 = x̂i, with encoder weights W and decoder weights W∗]

∂L(θ)/∂W∗ = ∂L(θ)/∂h2 · ∂h2/∂a2 · ∂a2/∂W∗

∂L(θ)/∂W = ∂L(θ)/∂h2 · ∂h2/∂a2 · ∂a2/∂h1 · ∂h1/∂a1 · ∂a1/∂W

We have already seen how to calculate the other factors in these chains when we learnt backpropagation. The first two terms on the RHS can be computed as:

∂L(θ)/∂x̂ij = − xij/x̂ij + (1 − xij)/(1 − x̂ij)

so that ∂L(θ)/∂h2 = [∂L(θ)/∂h21, ∂L(θ)/∂h22, …, ∂L(θ)/∂h2n]ᵀ, and

∂h2j/∂a2j = σ(a2j)(1 − σ(a2j))
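Putting these pieces together gives a hand-written backward pass; the sketch below assumes sigmoid activations for both g and f, matching the binary-input setting, with made-up data and initialisation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, d = 5, 3
x  = (rng.random(n) > 0.5).astype(float)            # binary input x_i
W  = rng.normal(scale=0.1, size=(d, n)); b = np.zeros(d)
Ws = rng.normal(scale=0.1, size=(n, d)); c = np.zeros(n)

# forward pass: both g and f are sigmoids
a1 = W @ x + b;   h1 = sigmoid(a1)
a2 = Ws @ h1 + c; x_hat = sigmoid(a2)                # h2 = x̂_i

# backward pass for the cross-entropy loss
dL_dh2 = -x / x_hat + (1 - x) / (1 - x_hat)          # ∂L/∂x̂_ij
dL_da2 = dL_dh2 * x_hat * (1 - x_hat)                # ∂h2j/∂a2j = σ(a2j)(1 - σ(a2j))
dL_dWs = np.outer(dL_da2, h1)                        # ∂L/∂W*
dL_dh1 = Ws.T @ dL_da2
dL_da1 = dL_dh1 * h1 * (1 - h1)
dL_dW  = np.outer(dL_da1, x)                         # ∂L/∂W
```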